Patent classifications
G06F16/2462
DETERMINING DOMAIN AND MATCHING ALGORITHMS FOR DATA SYSTEMS
A computer-implemented method for configuring data deduplication is disclosed. The computer-implemented method includes receiving source data. The computer-implemented method further includes analyzing the source data, wherein analyzing the source data includes generating data profiling statistics from the source data and classifying attributes of the source data. The computer-implemented method further includes determining at least one data domain associated with the source data based, at least in part, on the data profiling statistics, the classified attributes, and ontology data. The computer-implemented method further includes determining, for the at least one data domain associated with the source data, a number of required matching algorithms for a data matching engine to execute data deduplication within the source data.
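The flow in this abstract (profile, classify, pick a domain, count matchers) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not taken from the patent: the regular expressions in `PATTERNS`, the toy `ONTOLOGY`, the `MATCHERS` table, and the 0.5 share cutoff. Columns are assumed to be non-empty lists of strings.

```python
import re

# Hypothetical attribute patterns, ontology, and matcher lists (illustrative only).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\-\s()]{7,}$"),
    "price": re.compile(r"^\d+\.\d{2}$"),
}
ONTOLOGY = {"customer": {"email", "phone"}, "product": {"price"}}
MATCHERS = {
    "customer": ["exact_email", "normalized_phone"],
    "product": ["exact_price"],
}


def profile(values):
    """Profiling statistic: share of values matching each pattern."""
    return {
        name: sum(bool(pat.match(v)) for v in values) / len(values)
        for name, pat in PATTERNS.items()
    }


def classify_attribute(values):
    """Pick the pattern matched by most values, or 'text' if none exceeds 0.5."""
    stats = profile(values)
    best = max(stats, key=stats.get)
    return best if stats[best] > 0.5 else "text"


def determine_domain(source):
    """Return (domain, number of matching algorithms) for a dict of columns."""
    classes = {classify_attribute(col) for col in source.values()}
    scores = {
        domain: len(classes & expected) / len(expected)
        for domain, expected in ONTOLOGY.items()
    }
    domain = max(scores, key=scores.get)
    return domain, len(MATCHERS[domain])
```

A real system would presumably draw on far richer statistics and ontology data; the sketch only shows how the three inputs could combine into a domain decision.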
Empirically providing data privacy with reduced noise
An empirical approach to providing differential privacy includes applying a common statistical query to a set of databases to produce sample values, both with and without any particular entity's data. The probability density is empirically estimated by sorting the sample values to generate an empirical cumulative distribution function. The cumulative distribution function is differenced across approximately the square root of the number of sample points to get an empirical density function. The statistical query is empirically (ε,δ)-private if the empirical densities with and without any particular individual differ by a factor of no more than exp(ε), with the exception of a set for which the densities exceed that bound by a total of no more than δ.
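The estimation steps in this abstract can be sketched as follows. This is one possible reading, not the patented procedure: the function names, the evaluation grid, and the Riemann-sum estimate of the excess mass are assumptions. The density at a point is taken as the rise of the empirical cumulative distribution function across a window of about sqrt(n) sorted samples, divided by the window width.

```python
import bisect
import math


def density_at(sorted_xs, x):
    """Empirical density at x from a window of about sqrt(n) sorted samples."""
    n = len(sorted_xs)
    k = max(1, round(math.sqrt(n)))
    if n <= k or x < sorted_xs[0] or x > sorted_xs[-1]:
        return 0.0
    i = bisect.bisect_left(sorted_xs, x)
    hi = min(n - 1, max(0, i - k // 2) + k)
    lo = hi - k
    width = sorted_xs[hi] - sorted_xs[lo]
    # The empirical CDF rises by k/n across the window.
    return (k / n) / width if width > 0 else 0.0


def empirical_delta(with_entity, without_entity, epsilon, grid_points=200):
    """Estimate the mass by which either density exceeds exp(epsilon) times the other."""
    a, b = sorted(with_entity), sorted(without_entity)
    lo, hi = min(a[0], b[0]), max(a[-1], b[-1])
    step = (hi - lo) / grid_points
    bound = math.exp(epsilon)
    excess_ab = excess_ba = 0.0
    for j in range(grid_points):
        x = lo + (j + 0.5) * step
        p, q = density_at(a, x), density_at(b, x)
        excess_ab += max(0.0, p - bound * q) * step
        excess_ba += max(0.0, q - bound * p) * step
    return max(excess_ab, excess_ba)


def is_empirically_private(with_entity, without_entity, epsilon, delta):
    """Check the empirical (epsilon, delta) condition for one entity."""
    return empirical_delta(with_entity, without_entity, epsilon) <= delta
```

To cover "any particular entity", the check would be repeated for each entity whose data is removed, taking the worst case.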
Efficient storage of columns with inappropriate data types in relational databases
A computer-implemented method, a computer program product, and a computer system are disclosed for detecting an inappropriate data type of a column in a database and correcting an encoding for the column. The computer system detects in a table a candidate column that has a mismatching type definition, using database usage statistics. The computer system determines whether conversion of the candidate column is possible. In response to determining that the conversion of the candidate column is possible, the computer system converts values in the candidate column with a first data type to values in a new column with a second data type. The computer system appends the new column to the table. The computer system registers the new column and the second data type in a metadata catalog. The computer system generates a query plan operator for processing a query for the new column.
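A minimal sketch of the detect, convert, append, and register steps, under loud assumptions: the table is a plain dict of column lists, the usage statistic is a hypothetical count of how often queries cast each column, and the catalog is a dict. Query plan operator generation is left out.

```python
def find_candidates(cast_counts, threshold):
    """Columns that queries cast often enough to suggest a mismatched type.

    cast_counts is a hypothetical usage statistic: column name -> cast count.
    """
    return [col for col, count in cast_counts.items() if count >= threshold]


def try_convert(values, target=int):
    """Return converted values, or None if any value cannot be converted."""
    try:
        return [target(v) for v in values]
    except (TypeError, ValueError):
        return None


def convert_column(table, catalog, column, target=int):
    """Append a typed copy of the column and register it in the catalog."""
    converted = try_convert(table[column], target)
    if converted is None:
        return None  # conversion not possible; leave the table untouched
    new_name = f"{column}_{target.__name__}"
    table[new_name] = converted
    catalog[new_name] = target.__name__
    return new_name
```

For example, a `zip` column stored as strings would gain a `zip_int` sibling, while a free-text column would be skipped.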
Efficient time-range queries on databases in distributed computing systems
The present disclosure relates to querying data cores for data items that correspond to a specified time range. Probabilistic data structures, each associated with a data core, are used to filter a plurality of data cores and identify the subset determined to contain data items corresponding to the specified time range. Only that subset is searched.
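The filtering step can be illustrated with a Bloom filter per data core. The abstract does not name a specific probabilistic structure, so the Bloom filter, the one-hour buckets, and the class names below are assumptions chosen for the sketch.

```python
import hashlib

BUCKET_SECONDS = 3600  # assumed bucket width for timestamps


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class DataCore:
    """Hypothetical data core holding (timestamp, payload) items."""

    def __init__(self, name):
        self.name = name
        self.items = []
        self.filter = BloomFilter()

    def insert(self, timestamp, payload):
        self.items.append((timestamp, payload))
        self.filter.add(timestamp // BUCKET_SECONDS)

    def search(self, start, end):
        return [p for ts, p in self.items if start <= ts <= end]


def query(cores, start, end):
    """Search only the cores whose filter may hold a bucket in [start, end]."""
    buckets = range(start // BUCKET_SECONDS, end // BUCKET_SECONDS + 1)
    candidates = [
        c for c in cores if any(c.filter.might_contain(b) for b in buckets)
    ]
    results = []
    for core in candidates:
        results.extend(core.search(start, end))
    return candidates, results
```

Because a Bloom filter never reports a false negative, skipping a core cannot lose results; a false positive only costs one unnecessary search.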
Smart data warehouse protocols
Systems, methods and apparatus are provided for AI-based generation of data warehouse quality protocols. An attribute classifier may quantify relationships between source data and target data from an enterprise data warehouse. A data quality engine may apply these relationships to identify specific data quality concerns and generate customized data quality metrics. A user interface may enable a user to enter parameters for the classification protocols and corresponding rule-based generation of data quality metrics.
Information processing apparatus for searching database
A memory stores therein, with respect to a database containing records each having a first data item and a second data item, an index that includes, in association with each candidate value that is used as the first data item, record specification information specifying two or more records with the candidate value and a statistical value obtained from values of the second data item registered in the two or more records. A processor receives a query including a search condition specifying a requested value of the first data item and a command requesting statistical processing of values of the second data item registered in records satisfying the search condition, retrieves the statistical value associated with a candidate value satisfying the search condition from the index, and outputs a processing result based on the retrieved statistical value for the query.
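A minimal sketch of such an index, assuming records are `(first, second)` tuples and that the stored statistics are a running sum and count; the function names and the supported commands are illustrative, not taken from the patent.

```python
from collections import defaultdict


def build_index(records):
    """Map each first-item value to its record ids and precomputed statistics."""
    index = defaultdict(lambda: {"record_ids": [], "sum": 0, "count": 0})
    for record_id, (first, second) in enumerate(records):
        entry = index[first]
        entry["record_ids"].append(record_id)
        entry["sum"] += second
        entry["count"] += 1
    return dict(index)


def query(index, requested_value, command):
    """Answer a statistical command from the index without scanning records."""
    entry = index.get(requested_value)
    if entry is None:
        return None
    if command == "sum":
        return entry["sum"]
    if command == "count":
        return entry["count"]
    if command == "avg":
        return entry["sum"] / entry["count"]
    raise ValueError(f"unsupported command: {command}")
```

The record ids remain available for queries that need the rows themselves, while aggregate queries are served from the stored statistic alone.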
Technologies for tuning performance and/or accuracy of similarity search using stochastic associative memories
Technologies for tuning performance and/or accuracy of similarity search using stochastic associative memories (SAM) are described. Under a first approach, subsampling, the columns associated with set bits in a search key comprising a binary bit vector are subsampled. Matching set bits for the subsampled columns are aggregated on a row-wise basis to generate similarity scores, which are then ranked. A similar scheme is applied to all the columns with set bits in the search key, and the results for the top-ranked rows are compared to evaluate the tradeoff between throughput gained and accuracy lost. Under a second approach, called continuous column read, an iterative scheme continuously scores the rows as each new column read completes. The similarity scores for the (N-1)th and Nth iterations are ranked, a rank correlation is calculated, and a determination is made as to whether the rank correlation meets or exceeds a threshold.
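Both approaches can be sketched in software over a list of binary rows. This is an illustration only, not the hardware scheme: the function names are hypothetical, the rank correlation is assumed to be Spearman's, ties are broken by row index so that ranks are distinct, and consecutive iterations are the ones compared.

```python
import random


def scores(rows, columns):
    """Row-wise count of set bits in the given columns."""
    return [sum(row[c] for c in columns) for row in rows]


def ordinal_ranks(score_list):
    """Rank rows by descending score; ties broken by row index."""
    order = sorted(range(len(score_list)), key=lambda i: (-score_list[i], i))
    ranks = [0] * len(score_list)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def subsampled_search(rows, key, fraction, k, seed=0):
    """Compare top-k rows from a column subsample against all set-bit columns."""
    set_cols = [c for c, bit in enumerate(key) if bit]
    rng = random.Random(seed)
    sample = rng.sample(set_cols, max(1, int(len(set_cols) * fraction)))
    approx = sorted(range(len(rows)), key=ordinal_ranks(scores(rows, sample)).__getitem__)[:k]
    exact = sorted(range(len(rows)), key=ordinal_ranks(scores(rows, set_cols)).__getitem__)[:k]
    recall = len(set(approx) & set(exact)) / k
    return approx, exact, recall


def spearman(ranks_a, ranks_b):
    """Spearman rank correlation for distinct ranks (needs at least two rows)."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def continuous_column_read(rows, key, threshold):
    """Read set-bit columns one at a time; stop once the ranking stabilizes."""
    set_cols = [c for c, bit in enumerate(key) if bit]
    running = [0] * len(rows)
    previous = None
    for columns_read, c in enumerate(set_cols, start=1):
        for i, row in enumerate(rows):
            running[i] += row[c]
        ranks = ordinal_ranks(running)
        if previous is not None and spearman(previous, ranks) >= threshold:
            return running, columns_read
        previous = ranks
    return running, len(set_cols)
```

The `recall` value gives a simple accuracy measure to weigh against the fraction of columns read, and `columns_read` shows how early the iterative scheme stops.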
METHOD AND SYSTEM FOR MANAGING DATA CONTRACTS
A method for facilitating automated enforcement of a data publication and usage contract is provided. The method includes capturing a data contract from data that is published by a data service provider, the data contract including a data contract element; converting the captured data contract into a predetermined file format; retrieving metadata that correspond to the data, the metadata including usage information that relates to a consumption of the data by a data consumer; validating the retrieved metadata based on the converted data contract; and automatically initiating an enforcement action based on a result of the validating.
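The capture, convert, validate, and enforce steps can be sketched as below. The contract fields (`allowed_consumers`, `max_reads_per_day`), the choice of JSON as the predetermined file format, and the `revoke_access` action are all assumptions made for illustration; the abstract does not specify them.

```python
import json


def capture_contract(published):
    """Pull the contract element out of published data (hypothetical layout)."""
    return published["contract"]


def to_file_format(contract):
    """Convert the captured contract to JSON, the assumed file format."""
    return json.dumps(contract, sort_keys=True)


def validate(usage, contract_json):
    """Check usage metadata against the converted contract; list violations."""
    contract = json.loads(contract_json)
    violations = []
    if usage["consumer"] not in contract["allowed_consumers"]:
        violations.append("consumer not permitted")
    if usage["reads_per_day"] > contract["max_reads_per_day"]:
        violations.append("read quota exceeded")
    return violations


def enforce(violations):
    """Choose an enforcement action from the validation result."""
    return "revoke_access" if violations else "none"
```

Usage might look like `enforce(validate(usage, to_file_format(capture_contract(published))))`, with `usage` drawn from the metadata that describes a consumer's reads.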
Combinators
A method, according to one embodiment, includes identifying data to be stored in one or more tables within a predetermined portion of a partitioned storage in one of a plurality of nodes, the predetermined portion having at least one replica, and where no two identical replicas reside on a single node; assigning an identifier and a data storage hierarchical level to the data; mapping the data to an index and storing the data in accordance with the index and the data storage hierarchical level, the storing including writing the data to a row in one of the one or more tables on the predetermined portion and recording a write operation into a transaction log of the node; receiving a plurality of write operations; and combining a plurality of write tasks of the predetermined portion for a predetermined time period.
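The final step, combining write tasks for a portion over a time period, can be sketched as grouping writes by partition and time window. The `Node` class, the tuple layout of a write, and the one-log-entry-per-combined-task policy are assumptions for illustration; replication and the hierarchical level are omitted.

```python
from collections import defaultdict


class Node:
    """Hypothetical node with per-partition tables and a transaction log."""

    def __init__(self):
        self.tables = defaultdict(dict)  # partition -> {identifier: row}
        self.transaction_log = []

    def apply(self, partition, rows):
        """Write a combined task's rows and record it in the transaction log."""
        for identifier, row in rows:
            self.tables[partition][identifier] = row
        self.transaction_log.append((partition, len(rows)))


def combine_writes(writes, window_seconds):
    """Group (timestamp, partition, identifier, row) writes per partition and window."""
    tasks = defaultdict(list)
    for timestamp, partition, identifier, row in writes:
        window = int(timestamp // window_seconds)
        tasks[(partition, window)].append((identifier, row))
    return dict(tasks)
```

Each combined task then costs one `apply` call instead of one call per write.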
Information processing apparatus, information processing method, and computer program
Provided is a mechanism capable of easily grasping or analyzing the relation between three or more variables. An information processing apparatus (100) includes a control unit (130) that accepts the designation of a variable of interest among multiple variables including three or more variables with respect to data including values of the multiple variables, and outputs first information indicating the strength of the relation between the variable of interest and the combinations of explanatory variables including two or more explanatory variables among the multiple variables.
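One way to quantify the strength of the relation between a variable of interest and combinations of explanatory variables is mutual information over discrete values. The abstract does not name a measure, so mutual information, the function names, and the dict-of-columns layout are assumptions of this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def combination_strengths(data, target, size=2):
    """Score every combination of `size` explanatory variables against the target."""
    names = [v for v in data if v != target]
    strengths = {}
    for combo in combinations(names, size):
        joint_values = list(zip(*(data[v] for v in combo)))
        strengths[combo] = mutual_information(joint_values, data[target])
    return strengths
```

With `y` equal to the exclusive-or of `a` and `b`, neither variable alone carries information about `y`, yet the pair `(a, b)` determines it fully, which is the kind of relation that single-variable analysis misses.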