G06F16/24545

DATA LAKE WORKLOAD OPTIMIZATION THROUGH EXPLAINING AND OPTIMIZING INDEX RECOMMENDATIONS
20210382897 · 2021-12-09

Methods, systems and computer program products are described herein that enable data workload optimization through “what-if” modeling of indexes and index recommendation. In an example aspect, a system is configured to accept a workload comprising a plurality of queries directed at data having a first physical data layout, generate a set of candidate indexes based on the plurality of queries, enumerate index configurations based on the set of candidate indexes, each index configuration comprising a subset of the set of candidate indexes, generate a hierarchical graph of the index configurations, search the hierarchical graph for a recommended index configuration comprising an index configuration with the lowest estimated cost while pruning index configurations not considered from the graph of index configurations to generate a pruned graph, execute a graph query against the pruned graph to generate a graph query result, and perform an optimization operation based on the graph query result.
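The enumerate, graph, search and prune steps above can be sketched as follows. This is a minimal illustration only, not the claimed method: the toy cost model (`toy_cost`), the greedy descent strategy, and every name in the block are assumptions introduced here. Each candidate index is identified by the single column it covers.

```python
from itertools import combinations

def enumerate_configurations(candidates, max_size):
    """Yield every subset of the candidate indexes with at most max_size members."""
    for size in range(max_size + 1):
        for combo in combinations(sorted(candidates), size):
            yield frozenset(combo)

def build_graph(configs):
    """Hierarchical graph: each configuration points to those adding exactly one index."""
    configs = set(configs)
    return {
        c: [d for d in configs if len(d) == len(c) + 1 and c < d]
        for c in configs
    }

def search(graph, cost):
    """Greedy descent from the empty configuration (an assumed search strategy).

    Returns the cheapest configuration found and the pruned graph, which keeps
    only configurations whose cost was actually estimated during the search.
    """
    current = frozenset()
    visited = {current}
    while True:
        children = graph[current]
        visited.update(children)
        best_child = min(children, key=cost, default=None)
        if best_child is None or cost(best_child) >= cost(current):
            break
        current = best_child
    pruned = {
        c: [d for d in kids if d in visited]
        for c, kids in graph.items()
        if c in visited
    }
    return current, pruned

# Hypothetical workload: the columns each query filters on.
QUERY_COLUMNS = [{"a"}, {"a", "b"}, {"c"}]

def toy_cost(config):
    """Assumed what-if cost: 10 per unindexed query, 2 per indexed query, 3 per index."""
    query_cost = sum(2 if cols & config else 10 for cols in QUERY_COLUMNS)
    return query_cost + 3 * len(config)

graph = build_graph(enumerate_configurations({"a", "b", "c"}, max_size=3))
best, pruned = search(graph, toy_cost)
```

With this toy cost model the search settles on the configuration holding indexes `a` and `c`, and the configuration `{b, c}` is never costed, so it is absent from `pruned`. A graph query in this setting could be as simple as filtering `pruned` for configurations that contain a given index.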

OPTIMIZING LIMIT QUERIES OVER ANALYTICAL FUNCTIONS

A relational database management system (RDBMS) optimizes limit queries over analytical functions, wherein the limit queries include an output clause comprising a LIMIT, TOP or SAMPLE clause with an expression specifying a limit that is a number K or a percentage α%. The optimizations of the limit queries include: (1) static compile-time optimizations, and (2) dynamic run-time optimizations, based on semantic properties of “granularity” and “input-to-output cardinality” for the analytical functions.

Reusing sub-query evaluation results in evaluating query for data item having multiple representations in graph

Sub-queries for a query are determined. The query is for retrieving a data item of a data graph. The data graph stores representations of the data item. Each representation of the data item stores knowledge represented by the data item in a different way or manner. Each sub-query corresponds to a different representation by which the data graph stores the data item. The sub-queries are evaluated to determine an appropriate representation of the data item in fulfillment of the query without duplicatively traversing the data graph, such as by reusing evaluation results of the sub-queries that overlap one another.
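One way to picture the reuse of overlapping sub-query results is a memoized path walk, sketched below. The edge-dictionary graph, the representation of a sub-query as a tuple of edge labels, and all names are assumptions made for illustration; the abstract does not specify them.

```python
# Hypothetical data graph: (node, edge label) -> target node or value.
# The city of person1 is stored in two different representations.
GRAPH = {
    ("person1", "address"): "addr1",
    ("addr1", "city"): "Paris",
    ("addr1", "geo"): "geo1",
    ("geo1", "city_name"): "Paris",
}

def evaluate_subqueries(graph, start, subqueries):
    """Evaluate each sub-query (a tuple of edge labels) starting from `start`.

    Results for shared path prefixes are cached, so an edge that two
    sub-queries have in common is looked up in the graph only once.
    Returns the per-sub-query results and the number of graph lookups made.
    """
    cache = {}
    lookups = 0

    def walk(path):
        nonlocal lookups
        if not path:
            return start
        if path in cache:
            return cache[path]
        parent = walk(path[:-1])
        node = None
        if parent is not None:
            lookups += 1
            node = graph.get((parent, path[-1]))
        cache[path] = node
        return node

    results = {sq: walk(sq) for sq in subqueries}
    return results, lookups

SUBQUERIES = [("address", "city"), ("address", "geo", "city_name")]
results, lookups = evaluate_subqueries(GRAPH, "person1", SUBQUERIES)
```

Evaluated independently, the two sub-queries would need five edge lookups; because both begin with the `address` edge, the cached walk needs four.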

Enhanced Techniques For Bias Analysis

A fairness metric of decisions pertaining to a plurality of candidates indicated in a data set is estimated. Using a Hamiltonian Monte Carlo sampling algorithm, sample sets corresponding to random variables of a null model and an alternate model are obtained. A respective kernel density estimator is fitted on at least some sample sets, and importance sampling is implemented on additional samples generated using the kernel density estimators. The estimated fairness metric is provided via one or more programmatic interfaces.

RUNTIME METRIC ESTIMATIONS FOR FUNCTIONS

In some examples, a system receives function descriptors for different types of functions to be used when processing database queries, each function descriptor of the function descriptors comprising information relating to a respective function of the different types of functions. The system computes, based on a first function descriptor for a first function of the different types of functions, an estimate of a runtime metric associated with execution of the first function for processing a database query.

Systems and methods for rapid data analysis

A method for rapid data analysis includes receiving and interpreting a first query operating on a first dataset partitioned into shards by a first field; collecting a first data sample from a first set of data shards; calculating a first result to the first query based on analysis of the first data sample; and partitioning a second dataset into shards by a second field based on the first result.

SELECTIVITY COMPUTATION FOR FILTERING BASED ON METADATA

In some examples, the database system maintains metadata for a plurality of data objects, the metadata containing ranges of values of an attribute for the plurality of data objects, where the ranges of values of the attribute comprise a respective range of values of the attribute for each corresponding data object of the plurality of data objects. The database system generates a data structure tracking quantities of ranges of values of the attribute that have a specified relationship with respect to corresponding different values of the attribute. The database system receives a database query comprising a predicate specifying a condition on a given value of the attribute, and computes, for the database query, a selectivity of filtering based on the metadata, the selectivity computed based on the data structure.
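A minimal sketch of range-based selectivity follows, assuming the tracking data structure is a pair of sorted arrays (range minimums and range maximums) and that selectivity means the fraction of data objects an equality predicate cannot rule out. Both assumptions, and the class name `RangeSelectivity`, are introduced here and are not taken from the abstract.

```python
from bisect import bisect_left, bisect_right

class RangeSelectivity:
    """Per-object (min, max) metadata for one attribute, indexed for fast counting."""

    def __init__(self, ranges):
        # ranges: list of (min_value, max_value), one pair per data object
        self.count = len(ranges)
        self.mins = sorted(lo for lo, _ in ranges)
        self.maxes = sorted(hi for _, hi in ranges)

    def objects_containing(self, value):
        """Number of objects whose [min, max] range includes `value`."""
        starts_at_or_before = bisect_right(self.mins, value)   # min <= value
        ends_before = bisect_left(self.maxes, value)           # max < value
        return starts_at_or_before - ends_before

    def selectivity(self, value):
        """Fraction of objects that still have to be read for `attr = value`."""
        if self.count == 0:
            return 0.0
        return self.objects_containing(value) / self.count

stats = RangeSelectivity([(1, 10), (5, 20), (15, 30), (40, 50)])
```

For example, `stats.selectivity(7)` is 0.5 because only the first two ranges include 7, while `stats.selectivity(35)` is 0.0 because no range includes 35. The subtraction is valid because every object whose maximum is below the value also has its minimum at or below the value.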

Query processing method, data source registration method, and query engine

A query processing method includes decomposing an SQL into logical plans based on data source feature information, to obtain a logical plan set, where the data source feature information is stored in an internal data source feature library of a query engine, and the internal data source feature library is stored in cache space of the query engine; generating physical plans for the logical plan set based on the data source feature information, to obtain a physical plan set; determining query costs of the physical plan set based on the data source feature information, to obtain a physical plan with a highest priority; and executing the physical plan with the highest priority, to obtain a query result queried by a user. A data source registration method and a query engine are further disclosed.

DATABASE QUERY PROCESSING FOR DATA IN A REMOTE DATA STORE

In some examples, a database system identifies a plurality of query portions in a database query that contain references to a first external table, the first external table being based on data from a remote data store coupled to the database system over a network. The database system creates a common spool portion that includes projections and selections of the plurality of query portions, and rewrites the plurality of query portions into rewritten query portions that refer to a spool containing an output of the common spool portion. For execution of the database query, the database system determines, as part of optimizer planning, whether to use the plurality of query portions or the common spool portion and the rewritten query portions.
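The common-spool rewrite can be illustrated with in-memory rows standing in for the remote store. In this sketch the spool keeps the union of the columns each query portion projects or filters on, and the rows that satisfy at least one portion's selection; the row data, the portion format, and all function names are hypothetical.

```python
def make_remote(rows):
    """Return a fetch function for the 'remote' table plus a call counter."""
    calls = {"count": 0}

    def fetch():
        calls["count"] += 1
        return [dict(r) for r in rows]

    return fetch, calls

def run_directly(fetch, portion):
    """Original plan: every query portion reads the remote table itself."""
    return [
        {c: r[c] for c in portion["project"]}
        for r in fetch()
        if portion["predicate"](r)
    ]

def build_spool(fetch, portions):
    """Common spool: one remote read, union of needed columns, OR of the selections."""
    columns = set()
    for p in portions:
        columns |= p["project"] | p["filter_cols"]
    return [
        {c: r[c] for c in columns}
        for r in fetch()
        if any(p["predicate"](r) for p in portions)
    ]

def run_on_spool(spool, portion):
    """Rewritten portion: same selection and projection, applied to the spool."""
    return [
        {c: r[c] for c in portion["project"]}
        for r in spool
        if portion["predicate"](r)
    ]

REMOTE_ROWS = [
    {"id": 1, "region": "EU", "amount": 50, "year": 2020},
    {"id": 2, "region": "US", "amount": 200, "year": 2021},
    {"id": 3, "region": "EU", "amount": 300, "year": 2021},
    {"id": 4, "region": "APAC", "amount": 10, "year": 2019},
]
PORTIONS = [
    {"project": {"id", "amount"}, "filter_cols": {"region"},
     "predicate": lambda r: r["region"] == "EU"},
    {"project": {"id", "year"}, "filter_cols": {"amount"},
     "predicate": lambda r: r["amount"] > 100},
]

fetch, calls = make_remote(REMOTE_ROWS)
spool = build_spool(fetch, PORTIONS)
rewritten = [run_on_spool(spool, p) for p in PORTIONS]
```

Here the spool path reads the remote table once, whereas running the two portions directly would read it twice. An optimizer would weigh that saving against the cost of materializing the spool, which this sketch does not model.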

INFERRED PREDICATES FOR QUERY OPTIMIZATION
20220179854 · 2022-06-09

A system includes reception of a query comprising a join operation on a first table and a second table and a join condition associated with the join operation, determination of a first table column of the first table and a second table column of the second table associated with the join condition, determination of an inferred predicate of the query, the inferred predicate associated with a first column dictionary of the first table column and a second column dictionary of the second table column, determination of a cost of using the inferred predicate to perform the join operation, determination of a plurality of query execution plans to execute the join operation using the inferred predicate, and determination of a cost of each of the plurality of query execution plans based on the cost of using the inferred predicate to perform the join operation.
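As a rough illustration of how column dictionaries could yield an inferred predicate, the sketch below treats each dictionary as the set of distinct values in a join column and infers that only values present in both dictionaries can satisfy an equality join. This reading, the nested-loop cost formulas, and all names are assumptions made here; the abstract does not state how the predicate or its cost is derived.

```python
def plan_costs(left_values, right_values):
    """Compare a plain join with a join that first applies an inferred predicate.

    left_values / right_values: the join-column value of every row in each table.
    Returns the inferred value set and a toy cost per candidate plan.
    """
    # Column dictionaries: the distinct values of each join column.
    left_dictionary = set(left_values)
    right_dictionary = set(right_values)

    # Inferred predicate: join_column IN (values present in both dictionaries).
    inferred = left_dictionary & right_dictionary

    left_kept = sum(v in inferred for v in left_values)
    right_kept = sum(v in inferred for v in right_values)

    # Assumed cost model: a nested-loop join costs |left| * |right| comparisons,
    # and applying the predicate costs one check per input row.
    predicate_cost = len(left_values) + len(right_values)
    costs = {
        "plain_join": len(left_values) * len(right_values),
        "join_with_inferred_predicate": predicate_cost + left_kept * right_kept,
    }
    return inferred, costs

inferred, costs = plan_costs([1, 2, 3, 4, 5, 6], [5, 6, 7, 8, 9, 10])
cheapest_plan = min(costs, key=costs.get)
```

In this example the inferred predicate keeps only the values 5 and 6, so the filtered plan costs 16 under the toy model against 36 for the plain join. When the dictionaries overlap almost entirely, the predicate cost can outweigh the saving, which is why the cost of using the predicate is included in each plan's estimate.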