Patent classifications
G06F16/221
Columnar storage and query method and system for time series data
A columnar storage method and a query method and system for time series data. The storage method includes: dividing a column of time series data into a plurality of pages, wherein each page stores a part of the data points of the column and the pages together contain all the data points in the column (S1); and setting two parts, i.e., a page header and a page body, for each page, storing summary index information of all the data points in the page in the page header and storing data value information of all the data points in the page in the page body (S2).
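The two steps above (S1, S2) can be illustrated with a minimal Python sketch. The abstract does not say what the "summary index information" contains; the count and min/max fields below, the `build_pages` and `query_time_range` names, and the fixed `page_size` are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp, value); layout assumed for this sketch


@dataclass
class PageHeader:
    # Hypothetical summary fields; the abstract only says "summary index information".
    count: int
    min_time: int
    max_time: int
    min_value: float
    max_value: float


@dataclass
class Page:
    header: PageHeader
    body: List[Point]


def build_pages(points: List[Point], page_size: int) -> List[Page]:
    """S1: split the column into pages. S2: give each page a header and a body."""
    pages = []
    for start in range(0, len(points), page_size):
        chunk = points[start:start + page_size]
        times = [t for t, _ in chunk]
        values = [v for _, v in chunk]
        header = PageHeader(len(chunk), min(times), max(times), min(values), max(values))
        pages.append(Page(header, chunk))
    return pages


def query_time_range(pages: List[Page], t_start: int, t_end: int) -> List[Point]:
    """Read page bodies only when the header shows an overlap with the range."""
    result = []
    for page in pages:
        if page.header.max_time < t_start or page.header.min_time > t_end:
            continue  # header alone rules this page out
        result.extend((t, v) for t, v in page.body if t_start <= t <= t_end)
    return result
```

With headers of this kind, a range query can skip whole pages without touching their bodies, which is one plausible reason to separate the two parts.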
Method and system for identifying duplicate columns using statistical, semantics and machine learning techniques
With the availability of huge amounts of data, it has become difficult to identify and manage duplicate data, especially when the data is spread across a plurality of columns. A method and system for identifying duplicate columns using statistical, semantics and machine learning techniques are provided. The system provides a design framework to compare huge datasets at column level and identify potential duplicate columns based not on the column title but on all of the column's values. The disclosure can compare values in multiple columns and identify potential duplicate columns, where values are compared not only for exact match but also for semantic match, smart match, fuzzy match, match after unit-of-measure (UOM) conversion, and so on, using statistical, semantics and machine learning techniques.
Database indexing using structure-preserving dimensionality reduction to accelerate database operations
Embodiments of the present disclosure are directed to systems and methods for managing a database. In one or more examples, the system obtains input data comprising one or more data entries, where each data entry comprises one or more data items, and each data item comprises a field name and a field value. The system can generate a key-value set for each data item to obtain a plurality of key-value sets. Each key-value set includes at least a first key element comprising the field name of the respective data item and a second key element comprising the field value of the respective data item. The system can sort and store the plurality of key-value sets in the database. The system can further receive a query indicative of a field name or a field value, and generate, for display, an output based on the key-value sets retrieved for the query.
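As a rough illustration of the key-value scheme in the abstract above, the following Python sketch builds one sorted key per data item and answers a field-name query with a binary search. The entry identifier stored as a third key element, the string conversion of values, and the function names are assumptions of this sketch, not details taken from the disclosure.

```python
import bisect
from typing import Dict, List, Tuple

Key = Tuple[str, str, int]  # (field name, field value, entry id); entry id is assumed


def build_index(entries: List[Dict[str, object]]) -> List[Key]:
    """Generate one key per data item and keep the keys sorted."""
    keys = []
    for entry_id, entry in enumerate(entries):
        for name, value in entry.items():
            keys.append((name, str(value), entry_id))
    keys.sort()
    return keys


def lookup_field(index: List[Key], field_name: str) -> List[Key]:
    """Return all keys whose first element equals field_name."""
    # (field_name,) sorts before every longer tuple that starts with field_name.
    start = bisect.bisect_left(index, (field_name,))
    matches = []
    for i in range(start, len(index)):
        if index[i][0] != field_name:
            break
        matches.append(index[i])
    return matches
```

Because the keys are sorted with the field name first, all items for one field are contiguous, so a query touches only that slice of the store.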
Inferring location attributes from data entries
A system and method are provided for inferring location attributes from data entries. In the method, for data entries in a structured data set format, a computer system selects a sample of rows. The computer system then identifies columns containing geospatial and temporal information based on the column headings. The computer system next identifies location information within the structured data set. The computer system determines implied location information based on the identified location information. The computer system derives location values based on the identified and implied location information using consolidation rules, resulting in a final set of location attributes for the data entries. The computer system then associates the final set of location attributes with the data entries.
Multiplexing data operation
Embodiments of the present invention relate to a method, system, and computer program product for multiplexing data operation. In some embodiments, a method is disclosed. A query for at least one table comprising a plurality of data records is received. The query indicates a plurality of data operations to be performed on the plurality of data records. The plurality of data operations are combined into a target data operation. An intermediate result of the query is generated by performing the target data operation on the plurality of data records. A final result of the query is determined based on the intermediate result. In other embodiments, a system and a computer program product are disclosed.
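One possible reading of the flow above is sketched below in Python: several aggregate operations requested by a query are merged into a single pass over the records, and the final answers are derived from that pass's intermediate result. The choice of operations (sum, count, max, average) and the function names are assumptions; the abstract does not name specific operations.

```python
from typing import Dict, Iterable, Optional


def combined_aggregate(records: Iterable[float]) -> Dict[str, Optional[float]]:
    """Target data operation: one scan that serves several requested operations."""
    total = 0.0
    count = 0
    maximum = None
    for value in records:
        total += value
        count += 1
        if maximum is None or value > maximum:
            maximum = value
    # Intermediate result shared by all requested operations.
    return {"sum": total, "count": count, "max": maximum}


def finalize(intermediate: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Derive the final query result from the intermediate result."""
    count = intermediate["count"]
    avg = intermediate["sum"] / count if count else None
    return {"avg": avg, "max": intermediate["max"], "count": count}
```

The point of the sketch is only that the records are read once rather than once per operation.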
Processing queries using an index generated based on data segments
A source table organized into a set of batch units is accessed. A set of N-grams is generated for a data value in the source table. The set of N-grams includes a first N-gram of a first length and a second N-gram of a second length, where the first N-gram corresponds to a prefix of the second N-gram. A set of fingerprints is generated for the data value based on the set of N-grams. The set of fingerprints includes a first fingerprint generated based on the first N-gram and a second fingerprint generated based on the second N-gram and the first fingerprint. A pruning index that indexes distinct values in each column of the source table is generated based on the set of fingerprints and stored in a database with an association with the source table.
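The chained fingerprints described above can be sketched in Python as follows. This is a simplified illustration, not the patented index: N-gram lengths of 3 and 5, CRC-32 as the fingerprint function, and a plain set of fingerprints per batch unit are all assumptions of this sketch.

```python
import zlib
from typing import List, Set


def fingerprints(text: str, lengths=(3, 5)) -> Set[int]:
    """Fingerprint every N-gram of the given lengths, chaining longer onto shorter."""
    fps = set()
    for i in range(len(text)):
        prev = 0
        for n in lengths:
            if i + n > len(text):
                break
            ngram = text[i:i + n]  # the shorter N-gram is a prefix of the longer one
            # The longer N-gram's fingerprint is seeded with the shorter one's.
            prev = zlib.crc32(ngram.encode("utf-8"), prev)
            fps.add(prev)
    return fps


def build_pruning_index(batches: List[List[str]]) -> List[Set[int]]:
    """One fingerprint set per batch unit."""
    return [set().union(*(fingerprints(v) for v in batch)) for batch in batches]


def candidate_batches(index: List[Set[int]], pattern: str) -> List[int]:
    """Batch units that may contain the pattern as a substring (no false negatives)."""
    needed = fingerprints(pattern)
    return [i for i, fps in enumerate(index) if needed <= fps]
```

A batch unit is skipped only when some fingerprint of the search pattern is absent from it; patterns shorter than the smallest N-gram produce no fingerprints, so no batch unit is pruned for them.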
Query plan generation and execution based on single value columns
Aspects of the current subject matter are directed to executing queries on tables in which one or more columns contain a single value. Upon execution of a query, columns in which a single value is contained are identified, and a pre-compiled code entry containing relevant identifying information is compiled as part of a query execution plan. The query execution plan is used for subsequent query executions, alleviating the need to access the columns during the subsequent query executions that involve the columns. A fingerprint value may be used to track if changes to relevant tables occur.
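A minimal Python sketch of the idea above: when a plan is compiled, columns holding a single value are recorded as constants, and later executions answer predicates on those columns without scanning them. The `Table` and `Plan` classes, the version counter standing in for the fingerprint value, and the `count_where` operation are hypothetical and chosen only to keep the example small; the table is assumed to have at least one column.

```python
from typing import Dict, List


class Table:
    def __init__(self, columns: Dict[str, List[object]]):
        self.columns = columns
        self.version = 0  # stand-in for the fingerprint that tracks changes

    def append(self, row: Dict[str, object]) -> None:
        for name, value in row.items():
            self.columns[name].append(value)
        self.version += 1


class Plan:
    def __init__(self, table: Table):
        self.table = table
        self.compiled_version = None
        self.constants: Dict[str, object] = {}

    def _compile(self) -> None:
        # Record every column that currently holds exactly one distinct value.
        self.constants = {
            name: values[0]
            for name, values in self.table.columns.items()
            if values and len(set(values)) == 1
        }
        self.compiled_version = self.table.version

    def count_where(self, column: str, value: object) -> int:
        if self.compiled_version != self.table.version:
            self._compile()  # table changed since the plan was compiled
        n_rows = len(next(iter(self.table.columns.values())))
        if column in self.constants:
            # Answered from the plan; the column itself is not scanned.
            return n_rows if self.constants[column] == value else 0
        return sum(1 for v in self.table.columns[column] if v == value)
```

The plan is recompiled only when the version changes, which mirrors the role the abstract assigns to the fingerprint value.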
Metadata classification
Systems and methods are disclosed that retrieve data from a data set organized in a plurality of columns. For each column in the plurality of columns, the systems and methods generate one or more candidate semantic categories for the column, where each of the one or more candidate semantic categories has a corresponding probability. The systems and methods create a feature vector for the column from the one or more candidate semantic categories and the corresponding probabilities. The systems and methods determine a semantic category type of the column based on the feature vector. The systems and methods anonymize the data in the column based on the semantic category type, which includes replacing more specific data in the column with less specific data based on a data hierarchy that relates the more specific data to the less specific data.
Data compression techniques
Techniques and solutions are described for compressing data and facilitating access to compressed data. Compression can be applied to proper data subsets of a data set, such as to columns of a table. Using various methods, the proper data subsets can be evaluated to be included in a group of proper data subsets to be compressed using a first compression technique, where unselected proper data subsets are not compressed using the first compression technique. Data in the data set can be reordered based on a reordering sequence for the proper data subsets. Reordering data in the data set can improve compression when at least a portion of the proper data subsets are compressed. A data structure is provided that facilitates accessing specified data stored in a compressed format.
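The effect of reordering on compression can be shown with a small Python sketch. Run-length encoding is used here as an example compression technique and a lexicographic sort as the reordering sequence; both are assumptions of this sketch, since the abstract names neither.

```python
from typing import List, Sequence, Tuple


def run_length_encode(values: Sequence[object]) -> List[Tuple[object, int]]:
    """Collapse consecutive equal values into (value, run length) pairs."""
    runs: List[List[object]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def reorder_rows(rows: List[tuple], column_order: Sequence[int]) -> List[tuple]:
    """Sort rows by the given column indexes, most significant first."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in column_order))


def column(rows: List[tuple], index: int) -> List[object]:
    """Extract one column from a list of row tuples."""
    return [row[index] for row in rows]
```

For example, a column holding b, a, b, a encodes to four runs, while the same column after sorting the rows on it encodes to two.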
System and method for generating a column-oriented data structure repository for columns of single data types
A system and method for generating a column-oriented data structure repository for columns of single data types. The method includes: receiving instructions to generate a new column of a single data type for a first data structure, wherein the first data structure is a column-oriented data structure; and storing, based on the instructions, the new column within the column-oriented data structure repository, wherein the column-oriented data structure repository is accessible to at least a second user account.