Patent classifications
G06F16/24532
QUERY PERFORMANCE
An approach is provided for improving query performance. A query is received whose execution includes a first join of tables having sets of records and includes a second join with a next table whose set of records is smaller than a set of transient records resulting from the first join. A threshold for a number of records in the next table is received. A first count of the transient records resulting from the first join is estimated. A second count of a number of records in the next table is determined. It is determined that the second count is less than the threshold. Based on the second count being less than the threshold and without using the first count, a query execution plan is generated to include a broadcast of the records in the next table to data slices without including a broadcast of the transient records.
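The decision rule in this abstract can be illustrated with a minimal Python sketch. The names `JoinPlan` and `plan_second_join`, and the `"cost_based"` fallback used when the table is not below the threshold, are assumptions made for illustration; the abstract does not say what happens in that case.

```python
from dataclasses import dataclass


@dataclass
class JoinPlan:
    broadcast_side: str  # "next_table" or "cost_based" (hypothetical labels)
    reason: str


def plan_second_join(next_table_count: int, threshold: int,
                     transient_estimate: int) -> JoinPlan:
    """Choose the broadcast side for the second join.

    transient_estimate (the "first count") is accepted but deliberately
    ignored when the next table is below the threshold, as in the abstract.
    """
    if next_table_count < threshold:
        return JoinPlan("next_table",
                        "next table is below the threshold; estimate not consulted")
    # Assumption: behaviour above the threshold is not described in the source.
    return JoinPlan("cost_based", "fall back to an ordinary cost comparison")
```

The point of ignoring the estimate is that a misestimated transient count (for example, one that is far too low) cannot push the planner into broadcasting the larger transient record set.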
Per-node custom code engine for distributed query processing
Distributed query processing is often performed by a set of nodes that apply MapReduce to a data set and materialize partial results to storage, which are then aggregated to produce the query result. However, this architecture requires a preconfigured set of database nodes; can only fulfill queries that utilize MapReduce processing; and may be slowed down by materializing partial results to storage. Instead, distributed query processing can be achieved by choosing a node for various portions of the query, and generating customized code for the node that only performs the query portion that is allocated to the node. The node executes the code to perform the query portion and, rather than materializing partial results to storage, streams intermediate query results to a next selected node in the distributed query. Node selection may involve matching the details of the query portion with the characteristics and capabilities of the available nodes.
Parallel query execution
A system includes reception of a first fragment of a first result set of a first one of a plurality of queries, storage of the first fragment of the first result set in a first local buffer associated with the first one of the plurality of queries, reception of a first fragment of a second result set of a second one of the plurality of queries, storage of the first fragment of the second result set in a second local buffer associated with the second one of the plurality of queries, determination to flush the first local buffer, and, in response to the determination, transmission of all fragments currently stored in the first local buffer, with an identifier of the first one of the plurality of queries, to a client from which the plurality of queries was received, before receiving all fragments of the first result set.
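A minimal Python sketch of the per-query buffering described in this abstract follows. The class name `ResultMultiplexer`, the `send` callback, and the size-based flush trigger are assumptions; the abstract only states that a determination to flush is made, not how.

```python
from collections import defaultdict


class ResultMultiplexer:
    """Buffers result fragments per query and flushes each buffer independently."""

    def __init__(self, send, flush_size=2):
        self._send = send              # hypothetical callback: send(query_id, fragments)
        self._flush_size = flush_size  # assumed flush policy: fragment count
        self._buffers = defaultdict(list)

    def on_fragment(self, query_id, fragment):
        buf = self._buffers[query_id]
        buf.append(fragment)
        if len(buf) >= self._flush_size:
            self.flush(query_id)

    def flush(self, query_id):
        # Transmit everything buffered so far, tagged with the query identifier,
        # even though more fragments of this result set may still arrive.
        fragments = self._buffers.pop(query_id, [])
        if fragments:
            self._send(query_id, fragments)

    def pending(self, query_id):
        return list(self._buffers.get(query_id, []))
```

Tagging each flush with the query identifier lets the client reassemble interleaved partial results from several queries that share one connection.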
Unified data processing across streaming and indexed data sets
Systems and methods are described for unified processing of indexed and streaming data. A system enables users to query indexed data or specify processing pipelines to be applied to streaming data. In some instances, a user may specify a query intended to be run against indexed data, but may specify criteria that include not-yet-indexed data (e.g., a future time frame). The system may convert the query into a data processing pipeline applied to not-yet-indexed data, thus increasing the efficiency of the system. Similarly, in some instances, a user may specify a data processing pipeline to be applied to a data stream, but specify criteria including data items outside the data stream. For example, a user may wish to apply the pipeline retroactively, to data items that have already exited the data stream. The system can convert the pipeline into a query against indexed data to satisfy the user's processing requirements.
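One way to picture the conversion between queries and pipelines is a routing function keyed on the requested time range. The sketch below is a simplification under stated assumptions: `route_request` and the `indexed_until` boundary (the point up to which data has been indexed) are hypothetical, and running both modes for a range that spans the boundary is an assumption rather than something the abstract states.

```python
from datetime import datetime


def route_request(start: datetime, end: datetime, indexed_until: datetime):
    """Return the execution mode(s) covering the half-open range [start, end)."""
    if start >= end:
        raise ValueError("start must be before end")
    if end <= indexed_until:
        return ["indexed_query"]      # everything requested is already indexed
    if start >= indexed_until:
        return ["stream_pipeline"]    # everything requested is not yet indexed
    # Assumption: a range spanning the boundary is served by both modes.
    return ["indexed_query", "stream_pipeline"]
```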
Storage level parallel query processing
- Gopi Krishna Attaluri
- Dhruva Ranjan Chakrabarit
- Volodymyr Verovkin
- Kamal Kant Gupta
- Shriram Sridharan
- Aakash Shah
- Aleksandr Valerevich Feinberg
- Yuri Volobuev
- Tengiz Kharatishvili
- Saileshwar Krishnamurthy
- Anurag Windlass Gupta
- Murali Brahmadesam
- Namrata Bapat
- Alexandre Olegovich Verbitski
- Jeffrey Davis
- Debanjan Saha
Storage level query processing may be implemented for processing database queries. Nodes that can access a database may perform parallel processing for at least a portion of a database query. An indication may be received that specifies parallel processing for the database query. The nodes can then be caused to perform the portion of the query as part of providing a result in response to the database query instead of a node, such as a query engine node, that received the database query.
Lightweight database pipeline scheduler
A database scheduler system can be implemented on a distributed database system. The system schedules operations in a lightweight approach that reduces idling and increases parallel processing of database operations for a query on data of the database. The system performs restarts of individual operators or fragments of a query without restarting the entire query.
Low latency ingestion into a data system
Described herein are techniques for improving transfer of metadata from a metadata database to a database stored in a data system, such as a data warehouse. The metadata may be written into the metadata database with a version stamp, which is a monotonically increasing register value, and a partition identifier, which can be generated using attribute values of the metadata. A plurality of readers can scan the metadata database based on version stamp and partition identifier values to export the metadata to a cloud storage location. From the cloud storage location, the exported data can be automatically ingested into the database, which includes a journal and snapshot table.
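The version stamp and partition identifier scheme can be sketched in Python as follows. `MetadataStore`, the in-memory row list, and the SHA-256 hash used to derive the partition identifier are all assumptions for illustration; the abstract only says that the identifier is generated from attribute values.

```python
import hashlib
import itertools


class MetadataStore:
    """In-memory stand-in for the metadata database (illustrative only)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._rows = []
        self._versions = itertools.count(1)  # monotonically increasing register

    def partition_id(self, attrs):
        # Assumed derivation: hash the sorted attribute values into a bucket.
        key = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_partitions

    def write(self, attrs):
        row = {"version": next(self._versions),
               "partition": self.partition_id(attrs),
               "attrs": dict(attrs)}
        self._rows.append(row)
        return row

    def scan(self, partition, after_version=0):
        # A reader owns one partition and resumes from its last exported version.
        return [r for r in self._rows
                if r["partition"] == partition and r["version"] > after_version]
```

Because each reader scans a disjoint partition and remembers the last version it exported, several readers can export in parallel without exporting the same row twice.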
MASTER DATA INCONSISTENCY EVALUATOR (MICE)
Systems, methods, and computer products are described herein for identifying data inconsistencies within database tables associated with an application. A master data inconsistency evaluator receives data including at least one selection parameter within at least one database table. The master data inconsistency evaluator evaluates the at least one selection parameter by comparing the at least one selection parameter with other database tables associated with the application to identify data inconsistencies. The master data inconsistency evaluator repairs the data inconsistencies to further facilitate an error-free transaction.
A DATA EXTRACTION METHOD
Described herein is a method (100) of extracting data from a dataset of files stored in a database (109). The method (100) includes a step (101) of executing a conversion procedure to convert the dataset of files into a plurality (N) of structured binary files. At step (102), the structured binary files are stored in memory. At step (103), a query is received from a user input to extract queried data from the dataset. The query includes a plurality of query arguments. At step (104), the query arguments are input to a data query procedure. The query procedure includes the substeps of: (104a) accessing the structured binary files in memory; (104b) loading a reference data structure into memory, the reference data structure specifying a list of data classes; (104c) executing a data query algorithm to retrieve a subset of the data determined by the query arguments; and (104d) returning the subset of the data as one or more files having a predetermined file type.
DATABASE QUERY SPLITTING
A determination is made whether a received database query is to be processed by either a first database, a second database, or at least in part by both the first and second databases including by determining whether the query meets criteria to split the query for processing across the first and second databases. The first and second databases store shared synchronized records, the first database configured to store the records in a column-oriented format and the second database configured to store the records in a row-oriented format. In response to a determination that the query meets the criteria to split the query, a first and second component query of the database query are generated for the first and second databases, respectively, the second component query based at least in part on a result of the first component query. The execution of the first and second component queries is pipelined.
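The pipelined split can be sketched with Python generators. The layout here is assumed, not taken from the abstract: a column store held as a dict of column lists, a row store keyed by row identifier, and the function names `first_component`, `second_component`, and `split_query`. The criteria for deciding whether to split are not modeled.

```python
def first_component(column_store, column, predicate):
    """Scan one column in the column-oriented store and yield matching row ids."""
    for row_id, value in enumerate(column_store[column]):
        if predicate(value):
            yield row_id


def second_component(row_store, row_ids):
    """Fetch full records from the row-oriented store for ids produced upstream."""
    for row_id in row_ids:
        yield row_store[row_id]


def split_query(column_store, row_store, column, predicate):
    # Generators make the two component queries pipelined: the second starts
    # consuming row ids before the first has finished its scan.
    return second_component(row_store,
                            first_component(column_store, column, predicate))
```

In this sketch the second component query depends on the first only through the stream of row identifiers, which is one simple reading of "based at least in part on a result of the first component query".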