G06F16/24545

Generating a subquery for an external data system using a configuration file
11636105 · 2023-04-25 · ·

Systems and methods are disclosed for receiving, at a data intake and query system, a query that includes an indication to process data managed by a third-party data storage and processing system that supports a different query language than the data intake and query system. The data intake and query system identifies a third-party data storage and processing system that manages the data to be processed and generates a subquery for execution by the third-party data storage and processing system, generates instructions for one or more worker nodes to receive and process results of the subquery from the third-party data storage and processing system, and instructs the worker nodes to provide results of the processing to the data intake and query system.

Determining records generated by a processing task of a query
11599541 · 2023-03-07 · ·

Systems and methods are described for determining a quantity of records generated by a processing task of a query executed in a data intake and query system. The system receives a query and identifies a processing task of the query and a quantity of records to be processed according to the query. The system determines the number of records generated by the processing task based on the number of records to be processed and a record generation estimate. The system can allocate compute resources or determine a query execution time for at least a portion of the query based on the determined quantity of records generated.
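As a rough illustration of the estimate described in this abstract, the sketch below scales an input record count by a per-task record generation estimate. The multiplicative form, the class name `TaskEstimate`, and the function name `estimate_records_generated` are assumptions made for illustration; the abstract does not say how the two quantities are combined.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    """Hypothetical container for one processing task's record counts."""
    task: str
    records_in: int
    records_out: int


def estimate_records_generated(task: str, records_in: int,
                               generation_estimate: float) -> TaskEstimate:
    """Estimate the records a task generates from the records it processes.

    Assumption: the record generation estimate is a multiplicative ratio
    (for example 0.25 for a selective filter, 3.0 for an expanding task).
    """
    if records_in < 0 or generation_estimate < 0:
        raise ValueError("record counts and estimates must be non-negative")
    return TaskEstimate(task, records_in, round(records_in * generation_estimate))
```

A planner could then size compute resources for the next stage from `records_out` rather than from the raw input size.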

End user configuration of cost thresholds in a database system and methods for use therewith

A method for execution by a query processing system includes receiving a query request from a requesting entity. Query cost data is generated based on the query request by utilizing a query pricing scheme. Minimum query cost compliance data is generated for the query request based on determining whether the query cost data complies with a minimum query cost rule. When the minimum query cost compliance data indicates the query cost data complies with the minimum query cost rule, a query result is generated based on facilitating execution of the query by executing at least one query function of the query against a database system, and the query result is transmitted to the requesting entity. When the minimum query cost compliance data indicates the query cost data does not comply with the minimum query cost rule, the query result is not transmitted to the requesting entity.
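The gating logic in this abstract can be sketched as follows. The per-row pricing scheme, the reading of "complies" as "cost is at least the minimum", and the names `price_query` and `run_if_cost_compliant` are all hypothetical; the abstract names neither a pricing formula nor the exact form of the rule.

```python
from typing import Callable, List, Optional


def price_query(rows_scanned: int, price_per_row: float) -> float:
    """Hypothetical pricing scheme: a flat price per row scanned."""
    return rows_scanned * price_per_row


def run_if_cost_compliant(rows_scanned: int, price_per_row: float,
                          minimum_cost: float,
                          execute: Callable[[], List]) -> Optional[List]:
    """Execute the query only if its cost meets the minimum query cost rule.

    Assumption: the rule is satisfied when cost >= minimum_cost.
    Returns the query result, or None when no result is to be transmitted.
    """
    cost = price_query(rows_scanned, price_per_row)
    if cost >= minimum_cost:
        return execute()  # compliant: run the query and return its result
    return None           # non-compliant: nothing is executed or transmitted
```

Passing the execution step as a callable keeps the compliance check separate from whatever actually runs the query against the database.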

Federated query optimization
11636108 · 2023-04-25 · ·

A method builds a regression model for predicting processing times for federated queries using a variety of data sources. The method includes obtaining federated queries (e.g., from benchmarks) and generating a plurality of federated query plans for each federated query. Each federated query plan corresponds to executing a respective federated query using a respective data source as the federation engine. The method includes forming feature vectors for each federated query plan based on cost estimations for executing the respective federated query plan and cost estimations for data transfer. The method further includes training a regression model, using the feature vectors for the plurality of federated query plans, to predict runtimes for executing federated queries using the variety of data sources as a federation engine. Some implementations use the trained regression model to determine a suitable federation engine for a given federated query.

METHODS AND APPARATUS TO ESTIMATE CARDINALITY THROUGH ORDERED STATISTICS
20230120709 · 2023-04-20 ·

Methods, apparatus, systems, and articles of manufacture to estimate cardinality through ordered statistics are disclosed. In an example, an apparatus includes processor circuitry to select a sample dataset from a first reference dataset of media assets and partition the sample dataset into m mutually exclusive subsets of approximately equal size. The processor circuitry then estimates a ratio of a sample weighted average and empirical cumulative distribution of an approximately largest order statistic from at least one of the m subsets and generates an estimate of a total cardinality of the first reference dataset by multiplying the ratio by approximately m.

USING STATISTICAL DISPERSION IN DATA PROCESS GENERATION

Methods and systems are described herein for facilitating data integrity processes using measures of statistical dispersion (e.g., Gini impurities) of dataset features. The described mechanism may also be used for selection and dimensionality reduction. Dimensionality reduction may enable storing the dataset using less storage space or performing other operations on the dataset using fewer resources. In some embodiments, the above-described mechanism may be used for supervised categorical clustering and/or categorical classification.

METHOD AND SYSTEM FOR ESTIMATING THE CARDINALITY OF INFORMATION
20230069313 · 2023-03-02 ·

A computer-implemented method for efficiently estimating the number of unique elements in a collection of elements comprises generating, via hash logic, hash values for each element of the collection of elements. The method further comprises specifying, in a sketch-frequency table, a set of discrete statistical values associated with the hash values and, for each discrete statistical value of the set of discrete statistical values, information indicative of a frequency at which binary representations of the hash values are associated with the discrete statistical value. The cardinality of the collection of elements is estimated based on the sketch-frequency table.

METHOD AND SYSTEM FOR ESTIMATING THE CARDINALITY OF INFORMATION
20230063709 · 2023-03-02 ·

A computer-implemented method for efficiently estimating the number of unique elements in a collection of elements comprises generating, via hash logic, hash values associated with the elements. The hash values specify bit positions within an array of bits. Hash values output from the hash logic conform to a geometric distribution such that bit positions of the array of bits corresponding to lower-order bits are more likely to be generated than bit positions corresponding to higher-order bits. Bits of the array of bits corresponding to the bit positions are set. The number of bits of the array of bits that are set is counted. Estimation logic estimates the number of unique elements of the collection of elements as a function of the number of bits of the array of bits that are set.
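One way to picture this scheme is the sketch below. It is built on several assumptions that are not in the abstract: SHA-256 as the hash logic, the count of trailing zero bits as the source of the geometric distribution, a 64-bit array, and an estimator that numerically inverts the expected number of set bits. The abstract does not specify the estimation function, so this is a stand-in and not the claimed estimation logic.

```python
import hashlib
import math

NUM_BITS = 64  # hypothetical array size


def bit_position(element: str) -> int:
    """Pick position i with probability 2**-(i+1) (last position takes the rest)."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    if value == 0:
        return NUM_BITS - 1
    trailing_zeros = (value & -value).bit_length() - 1
    return min(trailing_zeros, NUM_BITS - 1)


def build_bitmap(elements) -> int:
    """Set one bit per element; duplicates set the same bit again."""
    bitmap = 0
    for element in elements:
        bitmap |= 1 << bit_position(element)
    return bitmap


def count_set_bits(bitmap: int) -> int:
    return bin(bitmap).count("1")


def expected_set_bits(n: float) -> float:
    """Expected number of set bits after inserting n distinct elements."""
    total = 0.0
    for i in range(NUM_BITS):
        p = 2.0 ** -(i + 1) if i < NUM_BITS - 1 else 2.0 ** -(NUM_BITS - 1)
        # 1 - (1 - p)**n, computed stably for very small p
        total += -math.expm1(n * math.log1p(-p))
    return total


def estimate_cardinality(set_bits: int) -> float:
    """Assumed estimator: find n whose expected set-bit count matches, by bisection."""
    if set_bits <= 0:
        return 0.0
    lo, hi = 0.0, 2.0 ** NUM_BITS
    for _ in range(200):
        mid = (lo + hi) / 2
        if expected_set_bits(mid) < set_bits:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Calling `estimate_cardinality(count_set_bits(build_bitmap(items)))` gives a coarse estimate that ignores duplicates. A single bit array has high variance, so this sketch is only meant to show the data flow from hashing to counting to estimation.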

Distinct value estimation for query planning
11663213 · 2023-05-30 · ·

The problem of distinct value estimation has many applications, but is particularly important in the field of database technology where such information is utilized by query planners to generate and optimize query plans. Introduced is a novel technique for estimating the number of distinct values in a given dataset without scanning all of the values in the dataset. In an example embodiment, the introduced technique includes 1) gathering multiple intermediate probabilistic estimates based on varying samples of the dataset, 2) plotting the multiple intermediate probabilistic estimates against indications of sample size, 3) fitting a function to the plotted data points, and 4) determining an overall distinct value estimate by extrapolating the objective function to an estimated or known total number of values in the dataset.
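The four steps can be sketched roughly as below. Two simplifications are assumptions of this sketch and not part of the abstract: each intermediate estimate is an exact distinct count over a random sample (standing in for a probabilistic estimate), and the fitted function is a power law d = a * s^b fitted by least squares in log-log space. All function names are hypothetical.

```python
import math
import random


def sample_distinct_counts(values, sample_sizes, seed=0):
    """Steps 1 and 2: pair each sample size with the distinct count seen in it."""
    rng = random.Random(seed)
    points = []
    for size in sample_sizes:
        sample = rng.sample(values, size)
        points.append((size, len(set(sample))))
    return points


def fit_power_law(points):
    """Step 3: least-squares fit of log(d) = log(a) + b * log(s).

    Requires at least two different sample sizes and positive counts.
    """
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(distinct) for _, distinct in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def estimate_distinct(values, sample_sizes, seed=0):
    """Step 4: extrapolate the fitted curve to the total number of values."""
    points = sample_distinct_counts(values, sample_sizes, seed)
    a, b = fit_power_law(points)
    total = len(values)
    # A distinct count can never exceed the number of values.
    return min(total, a * total ** b)
```

A power law tends to overestimate when the distinct count saturates early, so a production planner would likely fit a different function family; the cap at the total row count only guards against the most obvious overshoot.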

Constraint data statistics for dynamic partition pruning

Disclosed herein are system, method, and computer program product embodiments for performing dynamic partition pruning using data statistic objects as data integrity constraints. An embodiment operates by partitioning a database table into a plurality of partitions based on a partition criterion. The embodiment creates a data statistics object for a partition in the plurality of partitions. The embodiment receives a query for the database table. The embodiment determines the data statistics object is consistent with data in the partition. The embodiment processes the query for the partition based on the data statistics object.
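The flow in this abstract can be illustrated with the minimal sketch below. The choice of min/max values as the statistics, range partitioning by integer division, and a version counter as the consistency check are all assumptions made for illustration; the abstract does not say what the data statistics object contains or how consistency is determined.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MinMaxStats:
    """Hypothetical data statistics object for one partition."""
    minimum: int
    maximum: int
    built_at_version: int


@dataclass
class Partition:
    values: List[int] = field(default_factory=list)
    version: int = 0
    stats: Optional[MinMaxStats] = None

    def insert(self, value: int) -> None:
        self.values.append(value)
        self.version += 1  # any change makes older statistics stale


def partition_table(values: List[int], width: int) -> List[Partition]:
    """Partition criterion (assumed): range partitioning by value // width."""
    buckets: Dict[int, Partition] = {}
    for value in values:
        buckets.setdefault(value // width, Partition()).insert(value)
    return [buckets[key] for key in sorted(buckets)]


def build_stats(partition: Partition) -> None:
    partition.stats = MinMaxStats(min(partition.values), max(partition.values),
                                  partition.version)


def stats_consistent(partition: Partition) -> bool:
    stats = partition.stats
    return stats is not None and stats.built_at_version == partition.version


def query_equal(partitions: List[Partition], target: int) -> Tuple[List[int], int]:
    """Return matching values and the number of partitions actually scanned."""
    matches: List[int] = []
    scanned = 0
    for partition in partitions:
        if stats_consistent(partition):
            stats = partition.stats
            if not (stats.minimum <= target <= stats.maximum):
                continue  # pruned: consistent statistics rule this partition out
        scanned += 1
        matches.extend(v for v in partition.values if v == target)
    return matches, scanned
```

Statistics are trusted for pruning only while they are consistent with the partition's data; once a partition changes, the sketch falls back to scanning it until `build_stats` is run again.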