ARTIFICIAL INTELLIGENCE (AI) BASED DATA MATCHING AND ALIGNMENT
20220358336 · 2022-11-10
Assignee
Inventors
- Neda ABOLHASSSANI (San Mateo, CA, US)
- Maziyar Baran Pouyan (Emeryville, CA, US)
- Teresa Sheausan Tung (Tustin, CA, US)
- Andrew FANO (Lincolnshire, IL, US)
- Sayantan MITRA (Bangalore, IN)
Cpc classification
G06F18/24147
PHYSICS
G06N5/01
PHYSICS
G06F18/24143
PHYSICS
International classification
Abstract
An Artificial Intelligence (AI)-based data matching and alignment system identifies similar data sources for a target data source from a data corpus and generates a knowledge graph that enables downstream applications seamless access to data in the data corpus. The system extracts column features at different levels for the target data source and a plurality of data sources from the data corpus. Feature matrices are built from the features of the target data source and the plurality of data sources. Candidate data sources similar to the target data source are filtered from the plurality of data sources using the feature matrices. The tree-based similarity is estimated and K Nearest Neighbor (KNN) graphs are built to identify columns from the candidate data sources that are similar to columns of the target data source to build the knowledge graph.
Claims
1. An Artificial Intelligence (AI) based data matching and alignment system, comprising: at least one processor; a non-transitory processor-readable medium storing machine-readable instructions that cause the processor to: receive a request for identifying similar data for a target data source from a plurality of data sources; generate corresponding feature matrices for the target data source and each of the plurality of data sources, wherein each feature matrix of the corresponding feature matrices includes respective features of the target data source and the plurality of data sources; identify at least one candidate data source that is similar to the target data source from the plurality of data sources, wherein the at least one candidate data source is identified based on the corresponding feature matrices of the plurality of data sources and the feature matrix of the target data source; identify columns of the at least one candidate data source that are similar to columns of the target data source; and enable a display of one or more of the columns of the at least one candidate data source that are similar to the columns of the target data source.
2. The data matching and alignment system of claim 1, wherein the processor is to further: generate a knowledge graph that represents the similar columns of the at least one candidate data source and the target data source.
3. The data matching and alignment system of claim 1, wherein the features include character level features, semantic level features, and data dependency features.
4. The data matching and alignment system of claim 1, wherein the character level features include at least column data type features and character distribution features.
5. The data matching and alignment system of claim 1, wherein the semantic level features include at least semantic text features and numeric distribution comparison features.
6. The data matching and alignment system of claim 1, wherein to identify the at least one candidate data source similar to the target data source, the processor is to: generate a K Nearest Neighbor (KNN) graph on implementing a distance metric to the corresponding feature matrices of the plurality of data sources and the feature matrix of the target data source.
7. The data matching and alignment system of claim 6, wherein the distance metric includes Mahalanobis distance.
8. The data matching and alignment system of claim 1, wherein to identify the at least one candidate data source similar to the target data source, the processor is to: select nearest N neighbors from the KNN graph as a plurality of candidate data sources, wherein N is a natural number and the at least one candidate data source includes the plurality of candidate data sources.
9. The data matching and alignment system of claim 1, wherein to identify the columns of the at least one candidate data source that are similar to the columns of the target data source, the processor is to: apply tree-based similarity calculations to the corresponding feature matrix of the at least one candidate data source and the feature matrix of the target data source.
10. The data matching and alignment system of claim 9, wherein to identify the columns of the at least one candidate data source that are similar to the columns of the target data source, the processor is to: generate K Nearest Neighbor (KNN) graphs from the tree-based similarity calculations; and identify the columns of the at least one candidate data source that are similar to the columns of the target data source from the KNN graphs.
11. A method of generating similarity mappings between data sources comprising: receiving a request for identifying matching data for a target data source of a plurality of data sources from a data corpus; extracting column features of the target data source and the plurality of data sources, wherein the column features are stored as corresponding feature matrices; identifying one or more candidate data sources from the plurality of data sources that are similar to the target data source, wherein the candidate data sources are identified based on a distance measure obtained for the feature matrix of the target data source and the corresponding feature matrices of the plurality of data sources; identifying columns of the one or more candidate data sources that are similar to columns of the target data source, wherein the similarities between the columns are determined based at least on feature matrices of the target data source and the features of one of the one or more candidate data sources; generating a knowledge graph representing the similarities of the columns of the one or more candidate data sources and the columns of the target data source; and enabling functioning of a downstream application by enabling the downstream application to access the knowledge graph.
12. The method of claim 11, further comprising: providing a display of the columns of the one or more candidate data sources that are similar to the columns of the target data source.
13. The method of claim 12, further comprising: providing via the display, a percentage of similarity between each of the similar columns of the one or more candidate data sources and the columns of the target data source.
14. The method of claim 12, further comprising: receiving user input selecting one or more of the similar columns for generating the knowledge graph, wherein the user input is received via the display.
15. The method of claim 11, wherein the column features include at least character level features, semantic level features, and dependency level features.
16. The method of claim 15, further comprising: generating the feature matrices for the plurality of data sources including the target data source, by stacking the character level features, semantic level features, and dependency level features of each of the columns adjacent to each other.
17. The method of claim 11, further comprising: outputting reasons for identifying the columns of the one or more candidate data sources as being similar to the columns of the target data source.
18. A non-transitory processor-readable storage medium comprising machine-readable instructions that cause a processor to: receiving a request for identifying matching data for a target data source of a plurality of data sources from a data corpus; extracting column features of the target data source and the plurality of data sources, wherein the column features are stored as corresponding feature matrices; identifying one or more candidate data sources from the plurality of data sources that are similar to the target data source, wherein the candidate data sources are identified based on a distance measure obtained for the feature matrix of the target data source and the corresponding feature matrices of the plurality of data sources; identifying columns of the one or more candidate data sources that are similar to columns of the target data source, wherein the similarities are determined based at least on feature matrices of the target data source and the one or more candidate data sources; generating a knowledge graph representing the similarities of the columns of the one or more candidate data sources and the columns of the target data source; and enabling functioning of a downstream application by enabling the downstream application to access the knowledge graph.
19. The non-transitory processor-readable storage medium of claim 18, wherein the instructions to identify the at least one candidate data source as similar to the target data source, further cause the processor to: apply K Nearest Neighbor (KNN) methodology on implementing the distance measure to the corresponding feature matrices of the plurality of data sources and the feature matrix of the target data source.
20. The non-transitory processor-readable storage medium of claim 18, wherein the instructions to identify the columns of the at least one candidate data source that are similar to the columns of the target data source, cause the processor to: apply tree-based similarity calculations to the corresponding feature matrices of the at least one candidate data source and the feature matrix of the target data source.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0003] Features of the present disclosure are illustrated by way of examples shown in the following figures. In the following figures, like numerals indicate like elements, in which:
DETAILED DESCRIPTION
[0015] For simplicity and illustrative purposes, the present disclosure is described by referring to examples thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure. Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to, the term “including” means including but not limited to. The term “based on” means based at least in part on.
[0016] An AI-based data matching and alignment system that generates similarity mappings for a target data source from a plurality of data sources in a data corpus is disclosed. In an example, the plurality of data sources can be columnar data sources with data arranged in arrays of rows and columns, e.g., spreadsheets, database tables, database views, etc. When a request for identifying similar data sources with a reference to a target data source is received, the plurality of data sources from the data corpus are initially filtered to identify candidate data sources that are similar to the target data source. The candidate data sources are further analyzed to identify columns from the candidate data sources that are similar to the columns of the target data source. A knowledge graph representing similar columns is generated. The knowledge graph provides structured, well-defined data to downstream applications.
[0017] The plurality of data sources including the target data source can be initially preprocessed for converting the data into a uniform format, extracting the data structure, parsing, cleaning, outlier detection, deduplication, etc. Features are extracted at different levels for the columns of the plurality of data sources including the target data source. The extracted features may include but are not limited to, character level features, semantic level features, and dependency level features. Feature matrices are generated from the corresponding features for each of the plurality of data sources including the target data source. The features of each data source may be thus stored as feature matrices wherein the features are arranged column-wise by stacking the character level features, the semantic level features and the dependency level features adjacent to each other in the feature matrix.
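By way of a non-limiting illustration, the character level features mentioned above might be computed for a single column roughly as follows. The function name `char_level_features` and the specific character classes are assumptions made for this sketch. The disclosure does not specify which character statistics are used.

```python
def _is_number(text):
    """Return True if the cell text parses as a floating point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def char_level_features(values):
    """Hypothetical character level features for one column.

    Returns [numeric_cell_fraction, digit_frac, alpha_frac, space_frac, other_frac].
    """
    cells = [str(v) for v in values]
    # Crude data type feature: share of cells that look numeric.
    numeric_frac = sum(_is_number(c) for c in cells) / len(cells) if cells else 0.0

    # Character distribution over all cells of the column.
    text = "".join(cells)
    total = len(text) or 1  # avoid division by zero for empty columns
    digits = sum(ch.isdigit() for ch in text)
    alphas = sum(ch.isalpha() for ch in text)
    spaces = sum(ch.isspace() for ch in text)
    others = len(text) - digits - alphas - spaces
    return [numeric_frac, digits / total, alphas / total, spaces / total, others / total]
```

Semantic level and dependency level features would be produced by separate extractors and appended to the same row for the column.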
[0018] The feature matrices are used to identify the candidate data sources and similar columns. A distance metric may be initially estimated between the feature matrix of each of the plurality of data sources and the feature matrix of the target data source. In an example, Mahalanobis distance can be used as the distance metric. Further, similarity determination techniques such as K Nearest Neighbor (KNN) techniques may be employed on the feature matrices to determine the candidate data sources from the plurality of data sources that are similar to the target data source. The candidate data sources obtained by filtering the plurality of data sources are further analyzed for column similarity determinations.
[0019] Feature matrices of the candidate data sources and the target data source are analyzed using a tree-based similarity calculation. In an example, random forest distance (RFD) may be applied to the feature matrices of the candidate data sources and the target data source. Fused KNN graphs are further built to better identify the similar columns from the candidate data sources for the columns of the target data source. Thus, a relational graph wherein each node/vertex denotes one data object and wherein similar data objects are connected via edges is generated. A ranked list of similar mappings can be generated from the relational graph (e.g., the fused KNN graphs) showing the mappings of similar columns and the extent of similarity between the columns. In an example, the ranked list of mappings can be further represented as a knowledge graph wherein the nodes represent the columns and similar columns are connected by the graph edges. The knowledge graph provides uniformly formatted, structured data to downstream applications.
[0020] The AI-based data matching and alignment systems and methods disclosed herein provide a technical solution to the technical problem of arranging siloed data from disparate data sources into a comprehensive data structure such as a knowledge graph which enables downstream applications to gain insights and provide heretofore unavailable functionality. Industry-specific data, for example, seismic engineering data, can include structured data stored in existing applications. The industry-specific data may also include unstructured data, e.g., Internet of Things (IoT) data, received from various sources such as hardware in the wells, plants, etc. To be usable for analysis by different applications, the IoT data has to be processed/converted to structured data, i.e., it has to be normalized into industry-specific formats (e.g., a tag from sensor data has to be mapped to a process) and then labeled and mapped. Existing tools that enable data mapping and data preparation may allow for syntactic data matching using string comparison functions, etc. However, they do not enable determining relationships between the data in different formats across various data sources in a data corpus. As a result, data may be duplicated or valuable insights may be lost. To a certain extent, these problems can be mitigated by domain experts who may be able to examine data and identify duplicate data or provide insights. However, this can be very laborious and time-consuming and makes sub-optimal use of the expert resources. For example, a temperature measurement along a process can happen at different points, via different tools, in different units, and at different time intervals. If this data can be accurately captured in a standard format, it may help enable an expert to accurately understand the changes occurring along the process.
[0021] Disclosed herein are AI-based data matching and content alignment systems and methods that provide for data integration across unstructured or poorly structured data to be migrated into structured data thereby enabling data integration from multiple sources into a single new data source (e.g., data warehouse). Additionally, the AI-based data matching and alignment systems and methods determine relationships between data that go beyond syntactic data matching and string comparison functions. The AI-based data matching and alignment system can estimate matching data from different data sources based on the relationships determined through AI techniques described herein. Furthermore, the determined matches and relationships can be used to build the knowledge graph for the data from the plurality of data sources, which in turn can drive more efficient and accurate analytics by downstream applications.
[0023] The data matching and alignment system 100 includes a data preprocessor 102, a feature extractor 104, a data source filter 106, and an unsupervised recommender 108 employing unsupervised machine learning functions to determine predictions of matching data. In addition, the data matching and alignment system 100 may put forth I/O user interfaces (UIs) 110 to receive the input 182 with the reference to the target data source 190 or to provide output such as the ranked list 150. On receiving the input 182, the data preprocessor 102 accesses the plurality of data sources 192, 194, . . . 198 to preprocess the data for further analysis. The preprocessing may include but is not limited to, parsing and cleaning the data from the plurality of data sources 192, 194, . . . 198 and extraction of metadata such as date/time, address, etc. The processed data 122 is provided to the feature extractor 104.
[0024] The functioning of the feature extractor 104, the data source filter 106, and the unsupervised recommender 108 is discussed below with reference to
[0025] The feature matrices 142 are provided to the data source filter 106 to initially identify those data sources or data sets that are similar to the target data source 190. In an example, the data source filter 106 may implement methods such as Mahalanobis distance in conjunction with K Nearest Neighbor (KNN) to identify similar data sources or candidate data sources 162. The candidate data sources 162 are provided to the unsupervised recommender 108 for column similarity identification. The identification of the candidate data sources 162 provides for preliminary filtering so that a subset of the plurality of data sources 192, 194, . . . 198 are initially selected for deeper analysis. Therefore, further column identification and knowledge graph building processes are made more efficient since the columns of the target data source 190 need to be matched with only a subset of the plurality of data sources 192, 194, . . . 198 i.e., the candidate data sources 162 to identify similar columns thereby saving time and processing resources.
[0026] The unsupervised recommender 108 accesses the feature matrices corresponding to the candidate data sources 162 for column similarity determination. The feature space may be highly skewed which may cause the feature dependency to affect the distance calculation when implemented for column similarity determination. The unsupervised recommender 108, therefore, outputs a ranked list of similarity mappings 150. The unsupervised recommender 108 can also build a knowledge graph 172 from the similarity column mappings. The knowledge graph 172 includes nodes representing the columns and edges connecting the similar columns as determined by the unsupervised recommender. In an example, the ranked list of similarity mappings 150 may be provided to a reviewer (e.g., a domain expert) for validation, and the knowledge graph 172 may be built from validated column similarity mappings. Therefore, the expert knowledge is encoded into the knowledge graph 172 which can be reused across the downstream applications 180. The knowledge graph 172 when used with the plurality of data sources 192, 194, . . . 198, forms a knowledge graph-enabled data mesh.
[0027] In some examples, the data matching and alignment system 100 includes a reason generator 128 to generate logical explanations for matching specific columns based on the data present in the tables, thereby making it easier for the end users to understand the reasons for the matchings. The explanations can be generated by classifying the columns that are matched into multiple categories, such as but not limited to, date, object (string) and numeric types. The explanations/reasons for the date type columns can be generated based on the date ranges of the corresponding columns. For example, consider two matching date type columns, DATEPRD and DATEPRD_2 of Table 1 and Table 2 respectively. As shown below, DATEPRD_2 and DATEPRD are similar because the date range of the former is a subset of the date range of the latter:
[0028] Date range of DATEPRD is 2008-02-12 00:00:00 to 2016-09-17 00:00:00
[0029] Date range of DATEPRD_2 is 2008-05-2 00:00:00 to 2014-01-22 00:00:00
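By way of a non-limiting illustration, the date range comparison behind such an explanation might be sketched as follows. The function name `range_relation` and its three return labels are assumptions made for this sketch, and the day written as "2" in the example above is read as the second day of the month.

```python
from datetime import date


def range_relation(range_a, range_b):
    """Describe how date range A relates to date range B.

    Each range is a (start, end) pair. Returns "subset" when A lies
    entirely within B, "disjoint" when the ranges do not touch, and
    "overlap" otherwise.
    """
    (a_start, a_end), (b_start, b_end) = range_a, range_b
    if b_start <= a_start and a_end <= b_end:
        return "subset"
    if a_end < b_start or b_end < a_start:
        return "disjoint"
    return "overlap"


# Date ranges quoted in the example above.
dateprd = (date(2008, 2, 12), date(2016, 9, 17))
dateprd_2 = (date(2008, 5, 2), date(2014, 1, 22))
print(range_relation(dateprd_2, dateprd))  # prints "subset"
```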
[0030] For object (string) type matching columns, the semantic matches are initially identified between the data present in the matching columns and the explanations/reasons are generated based on the semantic matches. For example, two columns WELL_KIND and WELL_KIND2 of Table 1 and Table 2 respectively having binary values represented as [‘True’, ‘False’] and [‘Yes’, ‘No’] respectively can be identified as matching columns due to the mappings between the binary values as shown below:
[0031] True->Yes
[0032] False->No.
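By way of a non-limiting illustration, a value mapping of this kind might be derived as follows. The small lookup sets `TRUTHY` and `FALSY` are a hand-written stand-in for whatever semantic matching technique is actually employed, which the disclosure does not detail. All names in the sketch are assumptions.

```python
# Hypothetical equivalence classes standing in for a semantic matcher.
TRUTHY = {"true", "yes", "y", "1"}
FALSY = {"false", "no", "n", "0"}


def canonical(value):
    """Map a cell value to True, False, or None when it is not recognized."""
    token = str(value).strip().lower()
    if token in TRUTHY:
        return True
    if token in FALSY:
        return False
    return None


def value_mapping(values_a, values_b):
    """Pair each value of column A with the value of column B that shares its meaning."""
    by_meaning = {}
    for v in values_b:
        meaning = canonical(v)
        if meaning is not None:
            by_meaning.setdefault(meaning, v)
    mapping = {}
    for v in values_a:
        meaning = canonical(v)
        if meaning in by_meaning:  # unrecognized values (None) are never keys
            mapping[v] = by_meaning[meaning]
    return mapping


print(value_mapping(["True", "False"], ["Yes", "No"]))
```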
[0033] Lastly, if the matching columns are of numeric type, the explanations can be generated by showing that the distance between the distributions of the two columns is minimum as compared to other non-matching, numeric columns. The Kolmogorov-Smirnov test may be used for determining the distance between the columns. For example, NPD_FACILITY_CODE and NPD_FACILITY_CODE_2 may be matched from Tables 1 and 2 respectively and the explanation/reason may be provided by the reason generator 128 as shown below:
[0034] Distribution distance between NPD_FACILITY_CODE and NPD_FACILITY_CODE_2 is 0.0.
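By way of a non-limiting illustration, the two-sample Kolmogorov-Smirnov statistic, i.e., the largest gap between the empirical cumulative distribution functions of two columns, might be computed as follows. The function name `ks_distance` is an assumption, and both samples are assumed to be non-empty.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic for two non-empty numeric samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every distinct value seen in either sample.
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )
```

Two columns with identical value distributions yield a distance of 0.0, as in the explanation above, while columns with non-overlapping value ranges yield 1.0.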
[0036] In an example, the feature extractor 104 may also include a feature matrix builder 112. The extracted features may be used to build feature matrices for the data sources 190, 192, . . . , 198. For example, if the target data source 190 includes N columns and M features, namely, F_1, F_2, . . . , F_M, are extracted for each column, then the feature matrix builder 112 builds an N×M feature matrix 144. In an example, the features of different layers may be arranged in the feature matrix 144 sequentially adjacent to each other as shown in the matrix representation 146. In an example, the features may be arranged in N rows wherein each row corresponds to features of one of the columns of the data source. Accordingly, the first row 124 of the feature matrix 144 may correspond to feature values of Column 1 of the data source, the second row 126 of the feature matrix 144 may correspond to Column 2, . . . etc. Within each row, the feature matrix builder 112 can be further configured to arrange the features so that layer 1 features 202 of a column are initially arranged followed by layer 2 and layer 3 features so that features of different layers are stacked adjacent to each other. In an example, the feature matrix builder 112 may be configured to generate a feature matrix of predetermined dimensions so that if the data source has fewer columns, the corresponding rows may be padded with some default values, e.g., zero.
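By way of a non-limiting illustration, the row-per-column layout with layer-wise stacking and zero padding might be sketched as follows. The function `build_feature_matrix` and the three toy extractors are assumptions made for this sketch and stand in for the actual layer 1, layer 2, and layer 3 feature extraction. Truncating a data source that has more columns than the predetermined row count is also an assumption.

```python
def layer1(values):
    """Toy stand-in for character level features: mean cell length."""
    return [sum(len(str(v)) for v in values) / len(values)]


def layer2(values):
    """Toy stand-in for semantic level features: number of distinct values."""
    return [float(len(set(values)))]


def layer3(values):
    """Toy stand-in for dependency level features: distinct-value ratio."""
    return [len(set(values)) / len(values)]


def build_feature_matrix(columns, extractors, n_rows):
    """Build an n_rows x M matrix with one row per column of the data source.

    columns: dict mapping column name to a non-empty list of cell values.
    extractors: one function per feature layer, each returning a list of floats.
    """
    matrix = []
    for name in list(columns)[:n_rows]:
        row = []
        for extract in extractors:  # layers are stacked adjacent to each other
            row.extend(extract(columns[name]))
        matrix.append(row)
    width = len(matrix[0]) if matrix else 0
    while len(matrix) < n_rows:  # pad missing columns with default zeros
        matrix.append([0.0] * width)
    return matrix


example = build_feature_matrix({"a": ["x", "yy"], "b": [1, 1]}, [layer1, layer2, layer3], n_rows=3)
```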
[0038] An example of the output generated by the data source filter 106 is shown in table 210 wherein candidate data sets, Data 6 and Data 11, are identified as being similar to Data 1, which represents the target data set, with similarities of 83% and 70% respectively. Similarly, data sets may also be identified for a plurality of target data sources in parallel by the data matching and alignment system 100 so that similar data sets are identified for each of Data 2, . . . Data K, etc., and are listed out with the corresponding probabilities. In an example, top N data sources (wherein N is a natural number) may be selected as the candidate data sources for a given data source. The candidate data sources 162 identified by the data source filter 106 are processed for column similarity to generate specific column similarity mappings by the unsupervised recommender 108.
[0040] The RFD measure is used by the graph analyzer 186 that can implement a classification algorithm (such as the fused K Nearest Neighbor (KNN)) to classify column pairs as being similar to each other or dissimilar to each other. The column pairs that were analyzed and similarities that were obtained are provided to the output generator 136. A column similarity output 320 may be generated showing the columns of the target data source 190, the columns of the candidate data sources 162 determined to be similar to the columns of the target data source 190, and the scores. For example, a target column labeled Sensor 1 has another column labeled Sensor 3 identified as being similar with a match score of 91%. Other similar columns are identified for Power, Time, Date target columns with the corresponding match scores.
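By way of a non-limiting illustration, a ranked column similarity output might be derived from pairwise column distances as follows. The sketch assumes the distances have already been computed and scaled to the range 0 to 1 (e.g., by an RFD-like measure, which is not implemented here), uses a plain nearest-neighbor selection rather than fused KNN graphs, and converts distance to a percentage score as 100 × (1 − distance). All names and the distance values other than the 91% Sensor 1/Sensor 3 match are hypothetical.

```python
def rank_column_matches(distances, k=1):
    """Return (target_column, candidate_column, score_percent) tuples.

    distances: dict mapping (target_column, candidate_column) to a
    distance in [0, 1]. For each target column the k nearest candidate
    columns are kept, and the result is sorted by descending score.
    """
    neighbors = {}
    for (target, candidate), dist in distances.items():
        neighbors.setdefault(target, []).append((dist, candidate))

    ranked = []
    for target, pairs in neighbors.items():
        for dist, candidate in sorted(pairs)[:k]:
            ranked.append((target, candidate, round(100 * (1 - dist))))
    ranked.sort(key=lambda item: -item[2])
    return ranked


# Hypothetical distances between target columns and candidate columns.
distances = {
    ("Sensor 1", "Sensor 3"): 0.09,
    ("Sensor 1", "Power 2"): 0.60,
    ("Power", "Power 2"): 0.20,
    ("Power", "Sensor 3"): 0.70,
}
matches = rank_column_matches(distances)
```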
[0042] At 248, the feature matrices 142 including the features of each data source of the plurality of data sources 192, 194, . . . 198, including the target data source 190 are generated. A distance measure is calculated at 252 between the feature matrices of each of the data sources 192, 194, . . . 198, and the target data source 190. In an example, the Mahalanobis distance technique can be used to measure the distance between the feature matrices of the data sources. At 254, the KNN technique can be applied to identify the nearest neighbors or the most similar data sets or the candidate data sources 162 for the target data source 190. Based on the KNN graphs, candidate data sources that are sufficiently similar to the target data source 190 can be shortlisted for further column similarity analysis at 256. The filtering of the data sources saves time and processing power for the data matching and alignment system 100 as it mitigates the need to test each data source from the plurality of data sources 192, 194, . . . 198 for column similarity.
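By way of a non-limiting illustration, the filtering at 252 to 256 might be sketched as follows. The disclosure does not state how a distance between two feature matrices is obtained, so this sketch makes the following assumptions: each data source is summarized by the mean of its feature matrix rows, the covariance is estimated from those summaries with a small ridge term added for invertibility, and the N nearest summaries are returned as the candidate data sources. All function names are hypothetical.

```python
from math import sqrt


def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def covariance(vectors, ridge=1e-6):
    """Sample covariance of the vectors, with a ridge term on the diagonal."""
    n, m = len(vectors), len(vectors[0])
    mu = mean_vector(vectors)
    cov = [[0.0] * m for _ in range(m)]
    for v in vectors:
        d = [v[j] - mu[j] for j in range(m)]
        for a in range(m):
            for b in range(m):
                cov[a][b] += d[a] * d[b] / (n - 1)
    for a in range(m):
        cov[a][a] += ridge
    return cov


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x - y)^T cov^-1 (x - y))."""
    d = [a - b for a, b in zip(x, y)]
    z = solve(cov, d)  # z = cov^-1 d without forming the inverse
    return sqrt(max(0.0, sum(di * zi for di, zi in zip(d, z))))


def nearest_sources(target_matrix, source_matrices, n_neighbors=2):
    """Return the names of the n_neighbors data sources closest to the target."""
    summaries = {name: mean_vector(m) for name, m in source_matrices.items()}
    target = mean_vector(target_matrix)
    cov = covariance(list(summaries.values()) + [target])
    ranked = sorted(summaries, key=lambda name: mahalanobis(target, summaries[name], cov))
    return ranked[:n_neighbors]
```

Only the data sources returned by `nearest_sources` would then be passed on for column similarity analysis.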
[0044] In an example, a ranked list of similarity mappings 150 can be obtained at 270. In an example, similar column mappings may be represented in the knowledge graph 172. The columns are represented as nodes and similar columns are connected by the edges of the knowledge graph 172. In an example, the distance between the nodes or the length of the edges in the knowledge graph 172 may signify the extent of column similarity so that similar columns are represented by closer nodes while columns of lower similarity are represented farther apart by edges with greater length. The knowledge graph 172 can be used by the downstream applications 180 for information extraction. Therefore, instead of accessing different data sources having various formats, different column/field names, at various remote locations, which may require data conversions, etc., the downstream applications 180 may obtain the required data from the knowledge graph 172 wherein it is uniformly represented thereby improving the ease of data access.
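By way of a non-limiting illustration, a knowledge graph of this kind might be represented as a weighted adjacency structure in which the edge length is taken as 1 − similarity, so that more similar columns are connected by shorter edges. The function names, the `max_length` threshold, and the similarity values are assumptions made for this sketch. Only the column names Well USW and WellID come from the example tables.

```python
from collections import defaultdict


def build_knowledge_graph(mappings):
    """Build an undirected graph from (column_a, column_b, similarity) tuples.

    Similarity is expected in [0, 1]. Each edge stores a length of
    1 - similarity, so shorter edges connect more similar columns.
    """
    graph = defaultdict(dict)
    for col_a, col_b, similarity in mappings:
        length = 1.0 - similarity
        graph[col_a][col_b] = length
        graph[col_b][col_a] = length
    return dict(graph)


def similar_columns(graph, column, max_length=0.5):
    """Return the neighbors of a column within max_length, nearest first."""
    edges = graph.get(column, {})
    near = [name for name, length in edges.items() if length <= max_length]
    return sorted(near, key=lambda name: edges[name])


# Hypothetical validated mappings with illustrative similarity values.
graph = build_knowledge_graph([
    ("Log 1.Well USW", "Log 2.WellID", 0.9),
    ("Log 1.Well USW", "Log 2.Depth", 0.3),
])
print(similar_columns(graph, "Log 1.Well USW"))
```

A downstream application could query such a structure for the columns aligned with a given column instead of inspecting each data source separately.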
[0045] The ranked list of similar mappings 150 provided to the user at 270 can include a list of individual columns from different data sources that are similar to the columns of the target data source 190. A list of data sources similar to the target data source 190 may also be included in the ranked list of similar mappings along with the confidence values. In an example, the ranked list of similar mappings 150 can also include reasons why certain columns and data sources are selected by the unsupervised recommender 108 as being similar to the target data source 190. These reasons may be generated using features of explainable Artificial Intelligence (AI).
[0050] Tables 610 and 620 include example data of an oil and gas company having plants in the United States and Europe utilizing the same type of equipment. While the equipment components may be similar, the data each plant collects has differences due to equipment model names, settings, and regional requirements. To make data-driven decisions, an integrated master data file that can be used across the plants in both continents needs to be developed. The disclosed AI-based data matching and alignment system 100 and methods can learn domain-specific metrics to accelerate the data mapping so that like equipment can be clustered from the information encoded in the portion 350 thereby enabling experts to identify similar data sources across the plants.
[0051] Furthermore, the AI-powered schema matching and content alignment techniques can help in creating a digital twin of the energy plants to integrate product data, supply chain data, maintenance, and monitoring data. For example, the Well_0_1/Log 1 file represented by Table 610 and Well_0_1/Log 2 file represented by Table 620 are different with different attribute labels (i.e., column names) but include related data such as the first columns Well USW and WellID which have different column names but similar well data. The relationships between the two tables can be derived by identifying the patterns and distribution of the characters that make the attributes in these files similar, the semantical and statistical features that are in common among them, and the dependencies between the files that can filter pairwise attribute comparisons as executed by the AI-based data matching and alignment system 100.
[0053] The computer system 700 includes processor(s) 702, such as a central processing unit, ASIC or another type of processing circuit, input/output devices 712, such as a display, mouse, keyboard, etc., a network interface 704, such as a Local Area Network (LAN), a wireless 802.11x LAN, a 3G, 4G or 5G mobile WAN or a WiMax WAN, and a processor-readable medium 706. Each of these components may be operatively coupled to a bus 708. The processor-readable or computer-readable medium 706 may be any suitable medium that participates in providing instructions to the processor(s) 702 for execution. For example, the processor-readable medium 706 may be a non-transitory or non-volatile medium, such as a magnetic disk or solid-state non-volatile memory, or a volatile medium such as RAM. The instructions or modules stored on the processor-readable medium 706 may include machine-readable instructions 774 executed by the processor(s) 702 that cause the processor(s) 702 to perform the methods and functions of the AI-based data matching and alignment system 100.
[0054] The AI-based data matching and alignment system 100 may be implemented as software or machine-readable instructions stored on a non-transitory processor-readable medium and executed by one or more processors 702. For example, the processor-readable medium 706 may store an operating system 772, such as MAC OS, MS WINDOWS, UNIX, or LINUX, and code 774 for the AI-based data matching and alignment system 100. The operating system 772 may be multi-user, multiprocessing, multitasking, multithreading, real-time, and the like. For example, during runtime, the operating system 772 is running and the code for the AI-based data matching and alignment system 100 is executed by the processor(s) 702.
[0055] The computer system 700 may include a data storage 710, which may include non-volatile data storage. The data storage 710 stores any data used by the AI-based data matching and alignment system 100. The data storage 710 may be used as the data storage 170 to store the features, the calculated similarities, the KNN graphs, and other data elements which are generated and/or used during the operation of the AI-based data matching and alignment system 100.
[0056] The network interface 704 connects the computer system 700 to internal systems for example, via a LAN. Also, the network interface 704 may connect the computer system 700 to the Internet. For example, the computer system 700 may connect to web browsers and other external applications and systems via the network interface 704.
[0057] What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions, and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims and their equivalents.