G06F16/24545

Automated provisioning for database performance

Embodiments utilize trained query performance machine learning (QP-ML) models to predict an optimal compute node cluster size for a given in-memory workload. The QP-ML models include models that predict query task runtimes at various compute node cardinalities, and models that predict network communication time between nodes of the cluster. Embodiments also utilize an analytical model to predict overlap between predicted task runtimes and predicted network communication times. Based on this data, an optimal cluster size is selected for the workload. Embodiments further utilize trained data capacity machine learning (DC-ML) models to predict a minimum number of compute nodes needed to run a workload. The DC-ML models include models that predict the size of the workload dataset in a target data encoding, models that predict the amount of memory needed to run the queries in the workload, and models that predict the memory needed to accommodate changes to the dataset.
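The selection step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not define "optimal", so the sketch takes it to mean the lowest predicted runtime; the overlap model (a fixed fraction of the shorter phase) and the toy predictor functions stand in for the trained QP-ML and analytical models; and `min_nodes` stands in for the DC-ML capacity prediction.

```python
def predicted_total(task_time, network_time, overlap_fraction):
    """Predicted runtime after subtracting the overlapped portion.

    Hypothetical overlap model: a fixed fraction of the shorter phase
    is hidden behind the longer one.
    """
    overlap = overlap_fraction * min(task_time, network_time)
    return task_time + network_time - overlap


def pick_cluster_size(candidates, task_model, network_model,
                      overlap_fraction, min_nodes):
    """Return the candidate node count with the lowest predicted runtime.

    Candidates below min_nodes (the assumed capacity floor) are skipped.
    """
    feasible = [n for n in candidates if n >= min_nodes]
    return min(
        feasible,
        key=lambda n: predicted_total(task_model(n), network_model(n),
                                      overlap_fraction),
    )


# Toy stand-ins for trained models: compute shrinks with more nodes,
# network communication grows.
def toy_task_model(n):
    return 120.0 / n


def toy_network_model(n):
    return 2.0 * n


best = pick_cluster_size([1, 2, 4, 8, 16], toy_task_model,
                         toy_network_model, overlap_fraction=0.5,
                         min_nodes=1)
```

With these toy models the predicted runtime falls until 8 nodes and rises again at 16, so the trade-off between task time and network time is visible in a handful of evaluations.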

PROJECTIONS DETERMINATION FOR COLUMN-BASED DATABASES

The present subject matter relates to determining a set of projections for optimizing query execution on a column-based database. In an example implementation, a plurality of historical queries executed on the column-based database is obtained, and the set of projections is determined based on the plurality of historical queries. The set of projections is determined such that the total cost of the plurality of historical queries over the set of projections is minimized. The total cost is the sum of the costs of the individual historical queries. The cost of each historical query is computed based on the number of columns in the smallest projection, from the set of projections, that is used for execution of the respective historical query.
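The cost definition above can be sketched as follows. The sketch makes several assumptions: a projection can serve a query only if it contains every column the query touches, the cost equals the column count of the smallest such projection, and the exhaustive search over a hypothetical list of candidate projections is a stand-in for whatever search strategy an actual implementation uses.

```python
from itertools import combinations


def query_cost(query_cols, projections):
    """Column count of the smallest projection that covers the query."""
    covering = [len(p) for p in projections if query_cols <= p]
    if not covering:
        raise ValueError("no projection covers this query")
    return min(covering)


def total_cost(queries, projections):
    """Sum of per-query costs over a set of projections."""
    return sum(query_cost(q, projections) for q in queries)


def best_projections(queries, candidates, k):
    """Try every combination of up to k candidates; keep the cheapest.

    Returns (cost, projections), or None if no combination covers
    every query.
    """
    best = None
    for r in range(1, k + 1):
        for combo in combinations(candidates, r):
            try:
                cost = total_cost(queries, combo)
            except ValueError:
                continue  # some query is not covered by this combination
            if best is None or cost < best[0]:
                best = (cost, combo)
    return best


# Historical queries, each reduced to the set of columns it reads.
queries = [frozenset("ab"), frozenset("ab"), frozenset("cd"), frozenset("abc")]
candidates = [frozenset("abcd"), frozenset("ab"), frozenset("cd"), frozenset("abc")]
result = best_projections(queries, candidates, k=2)
```

In this toy workload a single four-column projection costs 16, while the pair of projections over {c, d} and {a, b, c} brings the total down to 11.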

Finding optimal query plans

Systems and methods for optimizing a query, and more particularly, systems and methods for finding optimal plans for graph queries by casting the task of finding the optimal plan as an integer linear programming (ILP) problem. A method for optimizing a query comprises building a data structure for a query, the data structure including a plurality of components, wherein each of the plurality of components corresponds to at least one graph pattern, determining a plurality of flows of query variables between the plurality of components, and determining a combination of the plurality of flows between the plurality of components that results in a minimum cost to execute the query.

DATABASE STATISTICAL HISTOGRAM FORECASTING

A method and system for forecasting a histogram in a database system is provided. The method includes determining that database table statistics and historical statistical histograms associated with specified subject matter have been previously retrieved. The database table statistics and historical statistical histograms are retrieved and determined to be frequency based histograms. Historical target values associated with the historical statistical histograms are identified and new target values associated with the historical target values are identified. A value identifying a number of occurrences for identified target values comprising the new target values and the historical target values is forecast and database table histograms comprising the identified target values are stored.

Salient sampling for query size estimation
09779137 · 2017-10-03

Salient sampling for query size estimation includes identifying two or more columns in a database table that have corresponding columns in one or more other tables. One or more hash functions are applied to domains of each of the identified columns. A first hash function is applied to a domain of the first column and a second hash function to a domain of the second column. A subset of the rows in the database table is selected. The selecting includes selecting rows in the database table where results of the first hash function meet a first numeric threshold and selecting rows in the database table where results of the second hash function meet a second numeric threshold. A sample database table corresponding to the database table is created. The sample database table includes the selected subset of the rows in the database table.
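The row-selection step can be sketched as below. The sketch rests on assumptions not stated in the abstract: a hash result "meets" a threshold when it falls below it, the two selections are combined as a union, and salted SHA-256 stands in for the unspecified hash functions. The helper and column names are hypothetical.

```python
import hashlib


def hash01(value, salt):
    """Map a value to a deterministic float in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def salient_sample(rows, col_a, col_b, thresh_a, thresh_b):
    """Keep rows whose hashed join-column value falls under a threshold.

    A row is kept if either column qualifies (assumed union semantics).
    """
    return [
        row for row in rows
        if hash01(row[col_a], "a") < thresh_a
        or hash01(row[col_b], "b") < thresh_b
    ]


rows = [{"cust_id": i % 50, "prod_id": i % 7, "qty": i} for i in range(1000)]
sample = salient_sample(rows, "cust_id", "prod_id", 0.1, 0.1)
```

Because selection depends only on the hashed column value, applying the same hash function and threshold to the corresponding column of another table keeps the matching values on both sides, so joins between the sample tables remain meaningful.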

SKEW SENSITIVE ESTIMATING OF RECORD CARDINALITY OF A JOIN PREDICATE FOR RDBMS QUERY OPTIMIZER ACCESS PATH SELECTION
20170249360 · 2017-08-31

A query optimizer receives a relational database management system (RDBMS) query having a join predicate with a join between a first and a second table. The query optimizer determines a high skew value for a first variable joining the first and second tables at columns per the join predicate. A count query on one of the first and second tables is constructed and run only using the high skew value as a substitution for the first variable. A quantity of records for the join of the first and second tables is estimated using results of the count query. Different access paths (e.g., query plans) are used by the query optimizer depending on whether the estimated quantity of records exceeds a previously determined threshold or not.
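A minimal sketch of the count-query step, using Python's built-in `sqlite3` module. The table and column names, the threshold, and the two access-path labels are hypothetical, and the sketch assumes the other side of the join is unique on the join column, so the count on one table is taken directly as the estimate for the skewed value.

```python
import sqlite3


def count_for_skew_value(conn, skew_value):
    """Run a count query with the high-skew value substituted for the join variable."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE customer_id = ?", (skew_value,)
    )
    return cur.fetchone()[0]


def choose_access_path(estimated_rows, threshold):
    """Pick an access path depending on whether the estimate exceeds the threshold."""
    if estimated_rows > threshold:
        return "hash_join"
    return "nested_loop_index_join"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)"
)
# Customer 1 is heavily skewed: 900 orders versus one each for 100 others.
skewed = [(1,)] * 900 + [(c,) for c in range(2, 102)]
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)", skewed)

estimate = count_for_skew_value(conn, 1)
plan = choose_access_path(estimate, threshold=500)
```

An estimate based on average rows per distinct value would come out near 10 for this table (1000 rows over 101 values), while the targeted count shows that the skewed value alone accounts for 900 rows.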

Group-by size result estimation
09747337 · 2017-08-29

A method and system for accurately estimating a result size of a Group-By operation in a relational database. The estimate utilizes the probability of union of the columns involved in the operation, as well as the relative cardinality of each column with respect to the other columns in the operation. In addition, the estimate incorporates the use of table filters when indicated such that table filters are applied prior to determining the size of the tables in the operation, as well as including equivalent columns into the list of columns that are a part of the Group-By operation. Accordingly, the estimate of the result size of the operation includes influencing factors that provide an accurate estimation of system memory requirements.

Method and Apparatus for Determining SQL Execution Plan
20170242884 · 2017-08-24

A method and an apparatus for determining a structured query language (SQL) execution plan are provided to optimize determining of the SQL execution plan and improve execution efficiency of the SQL execution plan. The SQL execution plan corresponds to at least one relation table. During an Nth iteration, the method includes obtaining a first iteration parameter generated after a first plan tree is executed on the at least one relation table during an (N−1)th iteration, where N is a natural number greater than 1, establishing a second plan tree according to the first iteration parameter, and determining the first plan tree or the second plan tree as the SQL execution plan when a difference between the second plan tree and the first plan tree is not greater than a first threshold.

Geo-scale analytics with bandwidth and regulatory constraints

Various technologies described herein pertain to controlling geo-scale analytics with bandwidth and regulatory constraints. An analytical query (e.g., a recurrent analytical query, a non-recurrent analytical query, etc.) to be executed over distributed data in data partitions stored in a plurality of data centers can be received. Moreover, a query execution plan for the analytical query can be generated, where the query execution plan includes tasks. Further, replication strategies for the data partitions can be determined. A replication strategy for a particular data partition can specify the one or more data centers, if any, to which that data partition is to be replicated. The tasks of the query execution plan for the analytical query can further be scheduled to the data centers based on the replication strategies for the data partitions. The analytical query can be part of a workload of analytical queries.

Systems and methods for rapid data analysis

A method for rapid data analysis comprising receiving and interpreting a query, collecting a first data sample from a first set of data shards, calculating an intermediate result to the query based on analysis of the first data sample, identifying a second set of data shards based on the intermediate result, collecting a second data sample from the second set of data shards, and calculating a final result to the query based on analysis of the second data sample.
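The two-stage flow can be sketched for one hypothetical query, "which user has the most events". All specifics are assumptions for illustration: records are bare user ids, a user's events live in shard `user_id % number_of_shards`, the first set of shards is every shard, and samples are deterministic prefixes rather than random draws.

```python
from collections import Counter


def take_sample(shards, shard_ids, per_shard):
    """Collect up to per_shard records from each listed shard."""
    sample = []
    for sid in shard_ids:
        sample.extend(shards[sid][:per_shard])
    return sample


def most_active_user(shards, first_n, second_n, top_k=2):
    """Two-stage estimate of the user with the most events."""
    n = len(shards)
    # Stage 1: small sample from every shard gives rough candidates.
    first = take_sample(shards, range(n), first_n)
    candidates = [u for u, _ in Counter(first).most_common(top_k)]
    # Stage 2: look deeper, but only in shards that own a candidate.
    second_ids = sorted({u % n for u in candidates})
    second = take_sample(shards, second_ids, second_n)
    counts = Counter(u for u in second if u in candidates)
    return counts.most_common(1)[0]  # (user_id, count in second sample)


shards = [
    [3, 3, 6, 3, 6, 3, 3],        # users with id % 3 == 0
    [4, 4, 4, 1, 4, 4, 4, 4, 1],  # users with id % 3 == 1
    [2, 5, 8],                    # users with id % 3 == 2
]
answer = most_active_user(shards, first_n=3, second_n=100)
```

The intermediate result narrows the second pass to two of the three shards, so the larger sample is collected only where it can change the answer.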