Patent classifications
G06F16/24545
GENERATION OF OPTIMIZED LOGIC FROM A SCHEMA
A method includes accessing a schema that specifies relationships among datasets, computations on the datasets, or transformations of the datasets, selecting a dataset from among the datasets, and identifying, from the schema, other datasets that are related to the selected dataset. Attributes of the datasets are identified, and logical data representing the identified attributes and relationships among the attributes is generated. The logical data is provided to a development environment, which provides access to portions of the logical data representing the identified attributes. A specification that specifies at least one of the identified attributes in performing an operation is received from the development environment. Based on the specification and the relationships among the identified attributes represented by the logical data, a computer program is generated to perform the operation by accessing, from storage, at least one dataset having the at least one of the attributes specified in the specification.
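The step of identifying datasets related to a selected dataset can be pictured as a graph traversal over the schema. The following is a minimal sketch, assuming the schema is available as a dictionary that maps each dataset name to the datasets it references; the function name `related_datasets` and the dictionary layout are illustrative assumptions, not part of the abstract.

```python
from collections import deque


def related_datasets(schema, selected):
    """Return datasets reachable from `selected`, in breadth-first order.

    `schema` is a hypothetical mapping: dataset name -> list of related names.
    """
    seen = {selected}
    queue = deque([selected])
    order = []
    while queue:
        current = queue.popleft()
        for neighbor in schema.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


schema = {
    "orders": ["customers", "products"],
    "customers": ["regions"],
    "products": [],
    "regions": [],
}
print(related_datasets(schema, "orders"))
```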
AUTOMATED PROVISIONING FOR DATABASE PERFORMANCE
Embodiments utilize trained query performance machine learning (QP-ML) models to predict an optimal compute node cluster size for a given in-memory workload. The QP-ML models include models that predict query task runtimes at various compute node cardinalities, and models that predict network communication time between nodes of the cluster. Embodiments also utilize an analytical model to predict overlap between predicted task runtimes and predicted network communication times. Based on this data, an optimal cluster size is selected for the workload. Embodiments further utilize trained data capacity machine learning (DC-ML) models to predict a minimum number of compute nodes needed to run a workload. The DC-ML models include models that predict the size of the workload dataset in a target data encoding, models that predict the amount of memory needed to run the queries in the workload, and models that predict the memory needed to accommodate changes to the dataset.
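How the predictions might be combined can be sketched as follows. This is only an illustration under stated assumptions: the predicted values are supplied as plain numbers rather than produced by trained models, the estimated runtime is taken to be task runtime plus network time minus their predicted overlap, and "optimal" is assumed to mean the lowest estimated runtime among cluster sizes that meet the capacity minimum. The names `Prediction` and `choose_cluster_size` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    nodes: int            # candidate compute node count
    task_runtime: float   # predicted seconds of query task work
    network_time: float   # predicted seconds of inter-node communication
    overlap: float        # predicted seconds in which the two overlap


def estimated_runtime(p):
    # Assumed combination rule: overlapping time is counted only once.
    return p.task_runtime + p.network_time - p.overlap


def choose_cluster_size(predictions, min_nodes):
    """Pick the fastest candidate that has at least `min_nodes` nodes."""
    feasible = [p for p in predictions if p.nodes >= min_nodes]
    if not feasible:
        raise ValueError("no candidate satisfies the capacity minimum")
    # Ties on runtime are broken in favor of the smaller cluster.
    best = min(feasible, key=lambda p: (estimated_runtime(p), p.nodes))
    return best.nodes


candidates = [
    Prediction(nodes=2, task_runtime=100.0, network_time=10.0, overlap=5.0),
    Prediction(nodes=4, task_runtime=60.0, network_time=20.0, overlap=10.0),
    Prediction(nodes=8, task_runtime=40.0, network_time=45.0, overlap=10.0),
]
print(choose_cluster_size(candidates, min_nodes=2))
```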
PRUNING CUTOFFS FOR DATABASE SYSTEMS
The subject technology receives, during a query compilation process, a query directed to a set of source tables, each source table from the set of source tables being organized into at least one micro-partition and the query including at least one pruning operation. The subject technology performs, during the query compilation process, a modification of the query for adjusting the at least one pruning operation, the modification being based at least in part on a set of statistics collected for previous pruning operations on at least a portion of the set of source tables and a set of heuristics. The subject technology compiles the query including the modification of the query. The subject technology provides the compiled query to an execution node of a database system for execution.
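One way to picture a compile-time adjustment driven by pruning statistics is the sketch below. It assumes a simple heuristic that the abstract does not specify: pruning on a table is switched off when past pruning operations removed, on average, less than a minimum fraction of micro-partitions. The plan layout, the `adjust_pruning` name, and both thresholds are hypothetical.

```python
def adjust_pruning(plan, stats, min_ratio=0.05, min_samples=3):
    """Return a copy of `plan` with the "prune" flag adjusted per table.

    `plan`  : list of dicts with "table" and "prune" keys (assumed layout).
    `stats` : table name -> list of past pruned-partition ratios in [0, 1].
    """
    adjusted = []
    for op in plan:
        history = stats.get(op["table"], [])
        if len(history) >= min_samples:
            average = sum(history) / len(history)
            enabled = average >= min_ratio
        else:
            # Too little history to judge: leave pruning on (assumption).
            enabled = True
        adjusted.append({**op, "prune": enabled})
    return adjusted


plan = [
    {"table": "orders", "prune": True},
    {"table": "events", "prune": True},
    {"table": "new_table", "prune": True},
]
stats = {"orders": [0.6, 0.7, 0.5], "events": [0.0, 0.01, 0.02]}
print(adjust_pruning(plan, stats))
```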
QUERY PROCESSING USING A PREDICATE-OBJECT NAME CACHE
In some examples, a database system includes a memory to store a predicate-object name cache, where the predicate-object name cache contains predicates mapped to respective object names. The database system further includes at least one processor to receive a query containing a given predicate, identify, based on accessing the predicate-object name cache, one or more object names indicated by the predicate-object name cache as being relevant for the given predicate, retrieve one or more objects identified by the one or more object names from a remote data store, and process the query with respect to data records of the one or more objects retrieved from the remote data store.
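The lookup flow described above can be sketched with an in-memory dictionary standing in for the remote data store. The class and function names are hypothetical, and the behavior on a cache miss (scan every object, then record which objects held matching records) is an assumption added for illustration.

```python
class PredicateObjectNameCache:
    """Maps a predicate (as text) to the object names relevant for it."""

    def __init__(self):
        self._names_by_predicate = {}

    def lookup(self, predicate):
        return self._names_by_predicate.get(predicate)

    def store(self, predicate, object_names):
        self._names_by_predicate[predicate] = list(object_names)


def process_query(predicate, matches, cache, remote_store):
    """Return records satisfying `matches`, reading only relevant objects."""
    names = cache.lookup(predicate)
    cache_hit = names is not None
    if not cache_hit:
        names = list(remote_store)  # miss: consider every object (assumption)
    rows, relevant = [], []
    for name in names:
        hits = [record for record in remote_store[name] if matches(record)]
        if hits:
            relevant.append(name)
            rows.extend(hits)
    if not cache_hit:
        cache.store(predicate, relevant)
    return rows


remote_store = {
    "obj_a": [{"qty": 1}, {"qty": 7}],
    "obj_b": [{"qty": 2}],
    "obj_c": [{"qty": 9}],
}
cache = PredicateObjectNameCache()
first = process_query("qty > 5", lambda r: r["qty"] > 5, cache, remote_store)
second = process_query("qty > 5", lambda r: r["qty"] > 5, cache, remote_store)
print(first, cache.lookup("qty > 5"))
```

On the second call only `obj_a` and `obj_c` are read, since the cache already names them as relevant for the predicate.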
Segment trend analytics query processing using event data
A method, system, and computer program product are provided for conserving resources in segment trend analytics query processing using event data. A set of events of an entity is aggregated, sorted from earliest to latest, and processed sequentially to incrementally build a subset of the events. A predicate function for determining segment membership is applied to the subset, which represents the linear timeline of events up to the time of the event being processed. A data record comprising an identification of the entity, the time, and the respective segment is generated and stored. Data records are aggregated by segment identification and time, and at least one analytic measure over the entities identified in the aggregated records is calculated and stored. An indication of the at least one analytic measure calculated for a queried segment and time is returned, enabling determination of a trend of the segment.
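The incremental processing can be sketched as follows, assuming events are `(time, payload)` pairs and each segment is defined by a predicate over the events seen so far. The analytic measure used here, a count of distinct entities per segment and time, is one possible choice; all names are hypothetical.

```python
from collections import defaultdict


def membership_records(entity_id, events, segments):
    """Yield (entity, time, segment) records from one entity's events.

    `segments` maps a segment name to a predicate over the event prefix.
    """
    records = []
    prefix = []
    for time, payload in sorted(events, key=lambda event: event[0]):
        prefix.append((time, payload))  # subset grows one event at a time
        for name, predicate in segments.items():
            if predicate(prefix):
                records.append((entity_id, time, name))
    return records


def aggregate(records):
    """Count distinct entities per (segment, time) key."""
    members = defaultdict(set)
    for entity_id, time, segment in records:
        members[(segment, time)].add(entity_id)
    return {key: len(ids) for key, ids in members.items()}


segments = {"buyers": lambda prefix: any(p == "purchase" for _, p in prefix)}
records = []
records += membership_records("u1", [(2, "purchase"), (1, "visit")], segments)
records += membership_records("u2", [(1, "visit"), (2, "visit")], segments)
measures = aggregate(records)
print(measures.get(("buyers", 2), 0))
```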
Scaling query processing resources for efficient utilization and performance
Scaling of query processing resources for efficient utilization and performance is implemented for a database service. A query is received via a network endpoint associated with a database managed by a database service. Respective response times predicted for the query using different query processing configurations available to perform the query are determined. Those query processing configurations with response times that exceed a variability threshold determined for the query may be excluded. A remaining query processing configuration may then be selected to perform the query.
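The exclusion-then-selection step can be illustrated with a small sketch. The abstract does not say how the remaining configuration is chosen, so this example assumes the cheapest remaining configuration wins; the `cost` mapping, the configuration names, and the function name are hypothetical.

```python
def select_configuration(predicted, threshold, cost):
    """Pick a configuration whose predicted response time is within threshold.

    `predicted` : configuration name -> predicted response time (seconds).
    `cost`      : configuration name -> relative resource cost (assumed).
    Returns None when every configuration is excluded.
    """
    remaining = {name: t for name, t in predicted.items() if t <= threshold}
    if not remaining:
        return None
    # Assumed policy: cheapest first, faster prediction as tie-breaker.
    return min(remaining, key=lambda name: (cost[name], remaining[name]))


predicted = {"small": 12.0, "medium": 4.0, "large": 2.5}
cost = {"small": 1, "medium": 2, "large": 4}
print(select_configuration(predicted, threshold=5.0, cost=cost))
```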
Dynamic access paths
Embodiments are disclosed for a method for dynamic access paths. The method includes generating real-time statistics (RTS) estimates based on a log of a database. Further, the method includes generating access paths based on a structured query language command and the RTS estimates. The method also includes training a machine learning model to map the RTS estimates to the access paths.
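The idea of learning a mapping from RTS estimates to access paths can be illustrated with a deliberately simple stand-in. The sketch below uses a one-nearest-neighbor lookup rather than any particular model from the disclosure, and the feature layout (row count, fraction of rows changed) and access path labels are assumptions chosen for the example.

```python
import math


def train(examples):
    """Store (rts_features, access_path) pairs; 1-NN needs no fitting."""
    return list(examples)


def predict(model, features):
    """Return the access path of the closest stored RTS estimate."""
    nearest = min(model, key=lambda example: math.dist(example[0], features))
    return nearest[1]


model = train([
    ((100.0, 0.01), "index_scan"),
    ((1_000_000.0, 0.90), "table_scan"),
])
print(predict(model, (200.0, 0.02)))
```

In practice the features would need scaling so that large-valued statistics such as row counts do not dominate the distance.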
COMPRESSING DATA SETS FOR STORAGE IN A DATABASE SYSTEM
A method includes determining a data set for storage that includes a plurality of uncompressed data slabs in accordance with a serialized data slab ordering. A storage data set that includes a plurality of compressed data slabs is created based on the data set in accordance with the serialized data slab ordering. Each compressed data slab of the plurality of compressed data slabs is generated from at least one corresponding uncompressed data slab of the plurality of uncompressed data slabs that includes a plurality of values based on generating compressed data for each compressed data slab based on the at least one corresponding uncompressed data slab, and generating compression information for each compressed data slab. The storage data set is stored via a plurality of computing devices.
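A minimal sketch of slab-wise compression with per-slab compression information follows. It assumes each uncompressed slab is a list of 64-bit integers and uses `zlib` from the Python standard library as the compression scheme; both choices, along with the function names and the fields of the information record, are illustrative assumptions.

```python
import struct
import zlib


def compress_slab(values):
    """Compress one slab and return (compression_info, payload)."""
    raw = struct.pack(f"<{len(values)}q", *values)  # little-endian int64
    payload = zlib.compress(raw)
    info = {"scheme": "zlib", "value_count": len(values), "raw_bytes": len(raw)}
    return info, payload


def decompress_slab(info, payload):
    """Recover the original values using the stored compression info."""
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"<{info['value_count']}q", raw))


def build_storage_data_set(slabs):
    """Compress slabs one by one, preserving the serialized slab ordering."""
    return [compress_slab(slab) for slab in slabs]


slabs = [[1, 1, 1, 1, 1, 1, 1, 1], [10, 20, 30], []]
storage = build_storage_data_set(slabs)
restored = [decompress_slab(info, payload) for info, payload in storage]
print(restored == slabs)
```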
Automated query predicate selectivity prediction using machine learning models
A method, a computer system, and a computer program product for cardinality estimation are provided. Embodiments of the present invention include accessing database relations. A random sample is collected from each of the database relations. Training data is then generated from the random sample. The training data is used to build a cumulative frequency function (CFF) model. The CFF model then provides a cardinality estimate for the output of SQL operators.
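The role of a cumulative frequency function in cardinality estimation can be shown with a small sketch. Instead of a trained model, it uses the empirical cumulative frequency of a random sample as a stand-in, and it estimates the cardinality of a range predicate by scaling the frequency difference by the table size. The function names and the choice of range predicate are assumptions for illustration.

```python
import bisect
import random


def build_cff(column, sample_size, seed=0):
    """Return cff(v): estimated fraction of rows with value <= v."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(column, min(sample_size, len(column))))

    def cff(value):
        return bisect.bisect_right(sample, value) / len(sample)

    return cff


def estimate_range_cardinality(cff, total_rows, low, high):
    """Estimate the number of rows with low < value <= high."""
    return total_rows * (cff(high) - cff(low))


column = list(range(1000))
cff = build_cff(column, sample_size=1000)  # sample covers the whole column
estimate = estimate_range_cardinality(cff, len(column), low=99, high=499)
print(round(estimate))
```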
Metadata-based statistics-oriented processing of queries in an on-demand environment
In accordance with embodiments, there are provided mechanisms and methods for facilitating metadata-based statistics-oriented query processing for large datasets in an on-demand services environment. In one embodiment and by way of example, a method comprises evaluating metadata associated with a query placed on behalf of a tenant in a multi-tenant environment, and computing process statistics for the query based on the metadata, where the process statistics reveal an estimation of resources needed for execution of the query within a predictable amount of time and using fewer than or equal to an allocated number of scans of a database. The method may further include associating, based on the process statistics, a set of rules and the estimated resources to process the query, and executing the query based on the set of rules and using the estimated resources such that the query is processed within the predictable amount of time and using fewer than or equal to the allocated number of scans of the database.