PATIENT DATA VISUALIZATION METHOD AND SYSTEM FOR ASSISTING DECISION MAKING IN CHRONIC DISEASES
20220157468 · 2022-05-19
Inventors
- Jingsong LI (Hangzhou City, CN)
- Shiqiang ZHU (Hangzhou City, CN)
- Tianshu ZHOU (Hangzhou City, CN)
- Yu TIAN (Hangzhou City, CN)
Cpc classification
G06N5/01
PHYSICS
G16H50/20
PHYSICS
G16H50/70
PHYSICS
G16H10/60
PHYSICS
International classification
G16H10/60
PHYSICS
G16H50/20
PHYSICS
G16H50/30
PHYSICS
G16H50/70
PHYSICS
Abstract
Provided is a patient data visualization method and system for assisting decision making in chronic diseases. According to the present application, a management data model diagram of a patient on a hyperplane is constructed by constructing a chronic disease knowledge graph, and combining static data and dynamic data of the patient, and then the management data model diagram is projected onto a two-dimensional plane. The difference of the Euclidean distance between features of a patient information model on a two-dimensional plane graph from the distance of standard features is compared, and a management plan is generated and recommended in combination with path node concepts and an attribute relationship between the concepts.
Claims
1. A patient data visualization method for assisting decision making in chronic diseases, comprising the following steps: (1) constructing a chronic disease knowledge graph: with clinical guidelines and knowledge literature related to chronic diseases as knowledge sources of the knowledge graph, performing unique identification on data semantics through SNOMED CT, manually constructing categories, attributes and instances, adding a data relationship and an attribute relationship, and generating a knowledge graph prototype file; (2) establishing a patient information model: collecting patient information; performing RDF conversion on patient data, so as to convert data in a patient database into an RDF triple relationship that meets the OWL language specification; identifying nodes of the patient information model by using SNOMED CT, so as to achieve semantic extension of the patient data to domain knowledge, and fusing the patient information with the chronic disease knowledge graph to construct the patient information model; (3) drawing a hyperplane feature map: converting the patient information model into the hyperplane feature map through a distributed representation, wherein the distributed representation adopts a translation-based model between entity vectors and relationship vectors; (4) mapping a two-dimensional plane: the position information of a two-dimensional plane node corresponding to a two-dimensional position of the hyperplane feature map of the patient information model after dimensionality reduction, distinguishing different information categories in the knowledge graph by using the colors of the nodes, using a feature importance ranking of the Regularized Gradient Boosted Decision Tree algorithm as the ranking of correlation between each node and disease progression, and using a feature weight value as a calculation weight of an Euclidean distance; and (5) decision making support feedback: taking a domain expert marking result as a standard for 
the patient information model with an ideal chronic disease management effect, drawing a two-dimensional plane mapping image of the patient data through distributed representation and dimensionality reduction visualization, and calculating the Euclidean distance between geometric centers of various feature areas in the mapping image in combination with the feature weight value to serve as a standardized management target; calculating the Euclidean distance between the features of a patient who requires decision making support feedback in the two-dimensional plane mapping image, and comparing the Euclidean distance with a standard numerical value in combination with the feature weight value calculated from the Euclidean distance, so as to find a path of similar distance; and obtaining knowledge in the knowledge graph according to the distance information of the features.
2. The patient data visualization method for assisting decision making in chronic diseases according to claim 1, wherein the knowledge content of the knowledge graph covers disease diagnosis, inspection items, physical sign states, related diseases, therapeutic drugs, living habits, measurement units, and detection quantities.
3. The patient data visualization method for assisting decision making in chronic diseases according to claim 1, wherein the patient information collected in step (2) comprises patient health data manually input in a daily mobile terminal or collected by a wearable device, and patient electronic medical record data recorded by a regional chronic disease management center.
4. The patient data visualization method for assisting decision making in chronic diseases according to claim 1, wherein during the RDF conversion process of the patient data in step (2), the data in a relational database are mapped into an RDF format by using the D2R semantic mapping technology; D2R comprises a D2R Server, a D2RQ Engine and a D2RQ Mapping language; the D2RQ Mapping language defines mapping rules for converting the relational data into the RDF format; and the D2RQ Engine uses a customized D2RQ Mapping file to complete data mapping, which specifically refers to mapping tables and fields in the relational database into categories and attributes in an OWL file respectively, and obtaining the relationship between the categories from a table that expresses the relationship.
5. The patient data visualization method for assisting decision making in chronic diseases according to claim 1, wherein drawing the hyperplane feature map in step (3) specifically comprises the following sub-steps: (3.1) encoding triples into spatial distributed vectors by using a TransH model, specifically: the knowledge in the patient information model is stored in a form of a triple (h, r, t), where h represents a head entity vector, r represents a relationship vector, and t represents a tail entity vector; sets of triples form a directed graph, graph nodes represent entities, edges represent different types of relationships, and the edges are directed to indicate that the relationships are asymmetric; entity distributed vectors of reflexive relationship, many-to-one, one-to-many and many-to-many relationships are constructed by means of the TransH model; (3.2) optimizing an objective function, specifically: for each relationship r in the TransH model, it is assumed that there is a corresponding hyperplane, a relationship projection of r on the hyperplane is expressed as $d_r$, a normal vector of the hyperplane is expressed as $\omega_r$, and $\|\omega_r\|_2 = 1$, $h_\perp$ and $t_\perp$ represent projections of h and t on the hyperplane, respectively, then:
$$h_\perp = h - \omega_r^{\top} h\,\omega_r, \quad t_\perp = t - \omega_r^{\top} t\,\omega_r$$
defining a scoring function as:
$$f_r(h,t) = \|h_\perp + d_r - t_\perp\|_2^2$$
to obtain the objective function:
6. The patient data visualization method for assisting decision making in chronic diseases according to claim 1, wherein the step (4) of mapping the two-dimensional plane specifically comprises the following sub-steps: (4.1) using the t-SNE algorithm to perform dimensionality reduction visualization, specifically: step i: assuming that a data set X has a total of N data points, and the dimension of each data point $x_i$ is D, reducing the dimensions to two dimensions, that is, expressing all data on the plane; calculating a conditional probability of similarity between the data points in a high-dimensional space; converting the high-dimensional Euclidean distance between the data points into the conditional probability representative of similarity, wherein the conditional probability $P_{j|i}$ of similarity between high-dimensional data points $x_i$ and $x_j$ is as follows:
7. The patient data visualization method for assisting decision making in chronic diseases according to claim 1, wherein during the feature importance ranking process in step (4), the Regularized Gradient Boosted Decision Tree parameter training is performed by using a grid search method, comprising a general parameter, an improving parameter, and a learning target parameter; and the general parameter controls macro parameters, the improving parameter controls the improvement in each step, and the learning target parameter controls the performance of a training target.
8. The patient data visualization method for assisting decision making in chronic diseases according to claim 1, wherein in step (5), the knowledge in the knowledge graph is obtained according to the distance information of the features by using the SPARQL query language and Jena rule reasoning, so as to generate a personalized management plan for a patient.
9. The patient data visualization method for assisting decision making in chronic diseases according to claim 8, wherein in step (5), a SPARQL query statement comprises conditions to which query information and names should conform, the conditions appear in a form of triples and are arranged in an order of <subject, predicate, object>, and a query result is a matching result of the condition triples and RDF triples in the data file.
10. A patient data visualization system for assisting decision making in chronic diseases, comprising: a chronic disease knowledge graph construction module: with clinical guidelines and knowledge literature related to chronic diseases as knowledge sources of a knowledge graph, performing unique identification on data semantics through SNOMED CT, manually constructing categories, attributes and instances, adding a data relationship and an attribute relationship, and generating a knowledge graph prototype file; a patient information model construction module: collecting patient information, and converting data in a patient database into an RDF triple relationship that meets the OWL language specification; identifying nodes of a patient information model by using SNOMED CT, so as to achieve semantic extension of the patient data to domain knowledge, and fusing the patient information with the chronic disease knowledge graph to construct the patient information model; a hyperplane feature map drawing module: converting the patient information model into a hyperplane feature map through a distributed representation, wherein the distributed representation adopts a translation-based model between entity vectors and relationship vectors; a two-dimensional plane mapping module: the position information of a two-dimensional plane node corresponding to a two-dimensional position of the hyperplane feature map of the patient information model after dimensionality reduction, distinguishing different information categories in the knowledge graph by using the colors of the nodes, using a feature importance ranking of the Regularized Gradient Boosted Decision Tree algorithm as the ranking of correlation between each node and disease progression, and using a feature weight value as a calculation weight of the Euclidean distance; and a decision making support feedback module: taking a domain expert marking result as a standard for the patient information model with an ideal chronic 
disease management effect, drawing a two-dimensional plane mapping image of the patient data through distributed representation and dimensionality reduction visualization, and calculating the Euclidean distance between geometric centers of various feature areas in the mapping image in combination with the feature weight value to serve as a standardized management target; calculating the Euclidean distance between the features of a patient who requires decision making support feedback in the two-dimensional plane mapping image, and comparing the Euclidean distance with a standard numerical value in combination with the feature weight value calculated from the Euclidean distance, so as to find a path of similar distance; and obtaining knowledge in the knowledge graph according to the distance information of the features.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0051]
[0052]
[0053]
DESCRIPTION OF EMBODIMENTS
[0054] In order to make the above-mentioned objectives, features and advantages of the present application more obvious and understandable, the specific embodiments of the present application will be described in detail below with reference to the drawings.
[0055] In the following description, many specific details are set forth to provide a full understanding of the present application. However, the present application can also be implemented in ways other than those described here, and those skilled in the art can make similar extensions without departing from the spirit of the present application. Therefore, the present application is not limited by the specific embodiments disclosed below.
[0056] The present application proposes a patient data visualization method and system for assisting decision making in chronic diseases, which can help a patient to better understand personal health conditions and disease intervention conditions, and help a doctor more efficiently view the conditions of the patient and formulate a health management plan. As shown in
[0057] (1) Constructing a Chronic Disease Knowledge Graph
[0058] Clinical guidelines and knowledge literature related to chronic diseases are used as knowledge sources of the knowledge graph. The knowledge content covers disease diagnosis, inspection items, physical signs, related diseases, treatment drugs, living habits, and the like, and also includes auxiliary medical terms such as measurement units and detection quantities. SNOMED CT (Systematized Nomenclature of Medicine-Clinical Terms) is selected as the standardized encoding system: unique identification is performed on data semantics through SNOMED CT, categories, attributes, instances and other information are manually constructed, data relationships and attribute relationships are added, and a knowledge graph prototype file is generated.
[0059] (2) Establishing a Patient Information Model
[0060] (2.1) Collecting Patient Information
[0061] There are two main sources of patient information: one type is patient health data manually input in a daily mobile terminal or collected by a wearable device; and the other type is patient electronic medical record data recorded by a regional chronic disease management center.
[0062] (2.2) RDF Conversion of Patient Data
[0063] Data in formats such as XML and JSON in a patient database need to be converted into RDF (Resource Description Framework) triple relationships that conform to the OWL (Web Ontology Language) specification. Herein, the data in a relational database are mapped into the RDF format by using the D2R (Database to RDF) semantic mapping technology. D2R mainly includes a D2R Server, a D2RQ Engine and a D2RQ Mapping language. The D2RQ Mapping language defines mapping rules for converting the relational data into the RDF format. The D2RQ Engine uses a customized D2RQ Mapping file to complete data mapping, which specifically refers to mapping tables and fields in the relational database into categories and attributes in an OWL file respectively, and obtaining the relationship between the categories from a table that expresses the relationship. As with the chronic disease knowledge graph, SNOMED CT is used for identifying nodes of the patient data model, thereby realizing the semantic extension of the patient data to domain knowledge, and the patient information is fused with the chronic disease knowledge graph to construct the patient information model.
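As a rough illustration of the table-to-category and field-to-attribute mapping described above, the following Python sketch turns rows of a relational table into (subject, predicate, object) triples. It does not use D2RQ itself; the table name, the column names and the `http://example.org/` namespace are hypothetical and are not taken from the present application.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

BASE = "http://example.org/patient#"  # hypothetical namespace, not from the source
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def rows_to_triples(table: str, key: str, rows: List[Dict[str, object]]) -> List[Triple]:
    """Map each row to one typed subject plus one triple per non-empty, non-key field."""
    triples: List[Triple] = []
    for row in rows:
        subject = f"{BASE}{table}/{row[key]}"
        # The table becomes the category (class) of the subject.
        triples.append((subject, RDF_TYPE, f"{BASE}{table}"))
        for column, value in row.items():
            if column == key or value is None:
                continue
            # Each field becomes an attribute (predicate) of the subject.
            triples.append((subject, f"{BASE}{column}", str(value)))
    return triples


# Hypothetical rows from a blood-pressure table.
rows = [
    {"id": 1, "systolic_mmHg": 142, "diastolic_mmHg": 91},
    {"id": 2, "systolic_mmHg": 128, "diastolic_mmHg": None},
]
triples = rows_to_triples("BloodPressure", "id", rows)
for triple in triples:
    print(triple)
```

In an actual deployment the mapping would be declared in a D2RQ Mapping file rather than hand-coded; the sketch only shows the shape of the resulting triples.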
[0064] (3) Drawing a Hyperplane Feature Map
[0065] The patient information model is converted into the hyperplane feature map through a distributed representation, and the distributed representation adopts a translation-based model between entity vectors and relationship vectors.
[0066] Step i: encoding triples into spatial distributed vectors by using a TransH model, as shown in
[0067] TransH replaces head and tail entities with different probabilities according to the type of the relationship r (one-to-one, one-to-many, many-to-one, or many-to-many). For example, for a one-to-many relationship, replacing the head entity is more likely to yield a valid negative sample than replacing the tail entity, so the head entity is replaced with a greater probability. For the triples corresponding to the relationship r, TransH first counts the average number of tail entities per head entity, tph, and the average number of head entities per tail entity, hpt, and then defines a Bernoulli distribution that replaces the head entity with a probability
$$\frac{tph}{tph + hpt}$$
and replaces the tail entity with a probability
$$\frac{hpt}{tph + hpt}$$
[0068] The knowledge in the patient information model is stored in the form of a triple (h, r, t), wherein h represents a head entity vector, r represents a relationship vector, and t represents a tail entity vector. Sets of triples form a directed graph in which graph nodes represent entities and edges represent different types of relationships; the edges are directed to indicate that the relationships are asymmetric. Entity distributed vectors for reflexive, many-to-one, one-to-many and many-to-many relationships can be constructed by means of the TransH model.
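The Bernoulli negative sampling described for TransH can be sketched in Python as follows. The sketch assumes the head entity is replaced with probability tph / (tph + hpt), as in the published TransH model; the triples and the relation name `hasSymptom` are hypothetical examples.

```python
import random
from collections import defaultdict
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def head_replace_probability(triples: Iterable[Triple], relation: str) -> float:
    """Return tph / (tph + hpt) for one relation."""
    tails_of_head = defaultdict(set)
    heads_of_tail = defaultdict(set)
    for h, r, t in triples:
        if r == relation:
            tails_of_head[h].add(t)
            heads_of_tail[t].add(h)
    if not tails_of_head:
        raise ValueError(f"no triples for relation {relation!r}")
    # tph: average number of tail entities per head entity.
    tph = sum(len(s) for s in tails_of_head.values()) / len(tails_of_head)
    # hpt: average number of head entities per tail entity.
    hpt = sum(len(s) for s in heads_of_tail.values()) / len(heads_of_tail)
    return tph / (tph + hpt)


def corrupt(triple: Triple, entities: List[str], p_head: float, rng: random.Random) -> Triple:
    """Replace the head with probability p_head, otherwise replace the tail.

    The result is not checked against the knowledge base, so it may
    occasionally coincide with another true triple.
    """
    h, r, t = triple
    if rng.random() < p_head:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))


# Hypothetical one-to-many relation: one disease, three symptoms.
kb = [
    ("diabetes", "hasSymptom", "thirst"),
    ("diabetes", "hasSymptom", "fatigue"),
    ("diabetes", "hasSymptom", "polyuria"),
    ("diabetes", "treatedBy", "metformin"),
]
entities = sorted({e for h, _, t in kb for e in (h, t)})
p_head = head_replace_probability(kb, "hasSymptom")
negative = corrupt(kb[0], entities, p_head, random.Random(0))
print(p_head, negative)
```

Here tph is 3 and hpt is 1, so the head is replaced three times out of four on average, matching the intuition that corrupting the "one" side of a one-to-many relationship is less likely to produce a true triple by accident.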
[0069] Step ii: optimizing an objective function. For each relationship r in the TransH model, it is assumed that there is a corresponding hyperplane (the relationship r falls in the hyperplane), a relationship projection of r on the hyperplane is expressed as $d_r$, a normal vector of the hyperplane is expressed as $\omega_r$ with $\|\omega_r\|_2 = 1$, and $h_\perp$ and $t_\perp$ respectively represent projections of h and t on the hyperplane, then:
$$h_\perp = h - \omega_r^{\top} h\,\omega_r, \quad t_\perp = t - \omega_r^{\top} t\,\omega_r$$
[0070] defining a scoring function as:
$$f_r(h,t) = \|h_\perp + d_r - t_\perp\|_2^2$$
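[0071] to obtain the objective function:
$$L = \sum_{(h,r,t) \in S} \; \sum_{(h',r',t') \in S'} \left[ f_r(h,t) + \varepsilon - f_{r'}(h',t') \right]_+$$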
[0072] where S represents the set of triples in the knowledge base, S′ represents the set of negatively sampled triples, ε represents a margin (interval distance) hyperparameter with a value greater than 0, and $[x]_+$ represents the positive-part function, that is, $[x]_+ = x$ when $x > 0$ and $[x]_+ = 0$ when $x \le 0$. If the value of the scoring function for two nodes is relatively low, the distance between them is relatively small, and vice versa. In optimizing the objective function, positive example triples are required to score low and negative example triples to score high, that is, the ranking loss is minimized. After the TransH model has been trained by stochastic gradient descent (SGD), vector representations of the entities and the relationships are obtained.
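The hyperplane projection, the scoring function and the margin-based ranking loss can be illustrated with a short pure-Python sketch. The vectors and the margin values below are made-up numbers chosen so that the arithmetic is easy to follow; a real implementation would learn them by SGD.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(w: Sequence[float]) -> List[float]:
    """Scale w to unit length so that ||w||_2 = 1."""
    norm = math.sqrt(dot(w, w))
    return [x / norm for x in w]


def project(v: Sequence[float], w: Sequence[float]) -> List[float]:
    """Project v onto the hyperplane with unit normal w: v - (w^T v) w."""
    s = dot(w, v)
    return [vi - s * wi for vi, wi in zip(v, w)]


def score(h, t, d_r, w_r) -> float:
    """Squared L2 norm of h_perp + d_r - t_perp; lower means more plausible."""
    h_p, t_p = project(h, w_r), project(t, w_r)
    return sum((a + d - b) ** 2 for a, d, b in zip(h_p, d_r, t_p))


def margin_loss(pos_score: float, neg_score: float, margin: float) -> float:
    """Hinge term [f(pos) + margin - f(neg)]_+ for one positive/negative pair."""
    return max(0.0, pos_score + margin - neg_score)


# Illustrative values only (not from the source).
w_r = normalize([0.0, 0.0, 2.0])   # unit normal of the relation hyperplane
d_r = [1.0, 1.0, 0.0]              # translation vector lying in the hyperplane
h = [1.0, 0.0, 5.0]
t = [2.0, 1.0, -3.0]               # true tail
t_neg = [0.0, 0.0, 7.0]            # corrupted tail

pos = score(h, t, d_r, w_r)
neg = score(h, t_neg, d_r, w_r)
print(pos, neg, margin_loss(pos, neg, margin=1.0))
```

Because the components of h and t along the normal vector are discarded by the projection, entities that differ only in that direction can still satisfy the same relationship, which is what allows TransH to handle one-to-many and many-to-many relationships.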
[0073] (4) Mapping a Two-Dimensional Plane
[0074] The position information of a two-dimensional plane node corresponds to a two-dimensional position of the hyperplane feature map of the patient information model after dimensionality reduction, different information categories in the knowledge graph are distinguished by using the colors of the nodes, a feature importance ranking of the Regularized Gradient Boosted Decision Tree algorithm is used as the ranking of correlation between each node and disease progression, and a feature weight value is used as a calculation weight of the Euclidean distance.
[0075] (4.1) Dimensionality Reduction Visualization
[0076] Dimensionality reduction visualization is performed by using the t-SNE (t-distributed Stochastic Neighbor Embedding) algorithm.
[0077] The t-SNE algorithm is a machine learning method for dimensionality reduction that helps identify associated patterns. The main advantage of t-SNE is its ability to preserve local structure: points that are close to each other in the high-dimensional data space remain close to each other after being projected into the low-dimensional data space. t-SNE also produces visually clear plots.
[0078] The t-SNE algorithm models the distribution of the neighbors of each data point, wherein the neighbors are the set of data points close to that point. In the original high-dimensional space, the neighborhood is modeled as a Gaussian distribution, and in the two-dimensional output space, it is modeled as a t-distribution. The goal of this process is to find a mapping from the high-dimensional space into the two-dimensional space that minimizes the discrepancy between the two distributions over all points. Compared with the Gaussian distribution, the t-distribution has a longer tail, which helps the data points spread more uniformly in the two-dimensional space.
[0079] Step i: assuming that a data set X has a total of N data points, and the dimension of each data point $x_i$ is D, reducing the dimensions to d dimensions, wherein d is taken as 2 here, that is, expressing all data on the plane; calculating a conditional probability of similarity between the data points in the high-dimensional space; converting the high-dimensional Euclidean distance between the data points into the conditional probability representative of similarity, wherein the conditional probability $P_{j|i}$ of similarity between high-dimensional data points $x_i$ and $x_j$ is as follows:
$$P_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
[0080] where $\sigma_i$ represents the variance of the Gaussian centered on the data point $x_i$.
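A minimal Python sketch of step i is given below, assuming the standard Gaussian-kernel form of the conditional probability used in t-SNE. The data points are invented, and the bandwidths in `sigmas` are fixed by hand for simplicity; in t-SNE they are normally chosen per point by a search that matches a user-specified perplexity.

```python
import math
from typing import List, Sequence


def conditional_probabilities(points: Sequence[Sequence[float]],
                              sigmas: Sequence[float]) -> List[List[float]]:
    """Return P where P[i][j] is the similarity of point j given point i."""
    n = len(points)
    P: List[List[float]] = []
    for i in range(n):
        weights = []
        for k in range(n):
            if k == i:
                weights.append(0.0)  # a point is not its own neighbor
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
            weights.append(math.exp(-d2 / (2.0 * sigmas[i] ** 2)))
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


# Three illustrative 3-dimensional points: two close together, one far away.
points = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [5.0, 5.0, 5.0]]
P = conditional_probabilities(points, sigmas=[1.0, 1.0, 1.0])
for row in P:
    print([round(p, 6) for p in row])
```

Each row of P sums to 1, and nearby points receive almost all of the probability mass, which is the local structure that the low-dimensional embedding is then asked to reproduce.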
[0081] Step ii: calculating the conditional probability of similarity between the data points in the low-dimensional space; and for low-dimensional corresponding points $y_i$ and $y_j$ of the high-dimensional data points $x_i$ and $x_j$, the conditional probability $Q_{j|i}$ is calculated as follows:
[0082] Step iii: minimizing the difference between the conditional probabilities, that is, making the conditional probability $Q_{j|i}$ approximate $P_{j|i}$. This is achieved by minimizing the Kullback-Leibler divergence (KL divergence) between the two conditional probability distributions. The low-dimensional points are updated iteratively by gradient descent so as to minimize the following loss function:
$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j P_{j|i} \log \frac{P_{j|i}}{Q_{j|i}}$$
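The loss minimized in step iii can be evaluated with a few lines of Python. The sketch takes the two conditional probability matrices as inputs and makes no assumption about how $Q_{j|i}$ was computed; the matrices shown are invented for illustration.

```python
import math
from typing import Sequence


def kl_cost(P: Sequence[Sequence[float]], Q: Sequence[Sequence[float]]) -> float:
    """Sum over i of KL(P_i || Q_i) for row-wise conditional distributions.

    Terms with P[i][j] == 0 contribute nothing; Q[i][j] must be positive
    wherever P[i][j] is positive.
    """
    cost = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0.0:
                cost += p * math.log(p / q)
    return cost


# Illustrative matrices: rows are conditional distributions given point i.
P_high = [[0.0, 0.9, 0.1],
          [0.5, 0.0, 0.5],
          [0.2, 0.8, 0.0]]
Q_uniform = [[0.0, 0.5, 0.5],
             [0.5, 0.0, 0.5],
             [0.5, 0.5, 0.0]]

print(kl_cost(P_high, P_high))     # identical distributions
print(kl_cost(P_high, Q_uniform))  # mismatched low-dimensional layout
```

The cost is zero only when the low-dimensional similarities reproduce the high-dimensional ones exactly, and it grows as the two sets of distributions diverge, which is what gradient descent on the point positions reduces.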
[0083] The schematic diagram of mapping the two-dimensional plane is shown in
[0084] (4.2) Feature Importance Ranking
[0085] The Regularized Gradient Boosted Decision Tree algorithm (eXtreme Gradient Boosting, XGBoost) is used to rank the importance of each entity in the knowledge graph and to obtain the feature weight values. The data set consists of patient information models with known chronic disease management effects or outcomes, and each sample contains n-dimensional features, where n is the number of entities in the patient information model. The objective function L of the Regularized Gradient Boosted Decision Tree includes a loss function and a complexity term, which are defined as:
$$L = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \|\omega\|^2$$
[0086] where i represents the i-th sample, k represents the k-th tree, $\hat{y}_i$ represents a predicted output, $y_i$ represents a label value, T represents the number of leaf nodes, and ω represents a leaf weight value; γ represents the penalty term on the number of leaves, which has a pruning effect; λ represents the penalty term on the leaf weights, which prevents over-fitting; $l(\hat{y}_i, y_i)$ represents the prediction error of the i-th sample, and the smaller the error value, the better; $\sum_i l(\hat{y}_i, y_i)$ represents the loss function; and $\sum_k \Omega(f_k)$ represents the complexity function of the trees, where the lower the complexity, the stronger the generalization ability of the model.
[0087] During the growth of the tree, the values of the objective function before and after splitting are compared, and the split that gives the minimum value of the objective function after splitting is the optimal splitting point. The Gain here can be regarded as the objective function value before splitting minus the left and right objective function values after splitting; therefore, if Gain<0, the leaf node is not split. δ represents the complexity cost introduced by adding a new leaf node, $G_L$ represents the left sub-tree gradient value, and $H_L$ represents the second derivative of the left sub-tree sample set; $G_R$ represents the right sub-tree gradient value, and $H_R$ represents the second derivative of the right sub-tree sample set; and
can evaluate the structure of a tree.
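For illustration, the split decision can be sketched in Python under the assumption that the Gain takes the form commonly used in XGBoost-style boosting, Gain = ½[G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − δ. This form is an assumption of the sketch, and the numeric values are invented.

```python
def structure_score(G: float, H: float, lam: float) -> float:
    """G^2 / (H + lambda): larger values indicate a better-fitting leaf."""
    return G * G / (H + lam)


def split_gain(G_L: float, H_L: float, G_R: float, H_R: float,
               lam: float, delta: float) -> float:
    """Gain of splitting one leaf into left and right children.

    delta is the complexity cost of adding a new leaf, as in the text.
    """
    children = structure_score(G_L, H_L, lam) + structure_score(G_R, H_R, lam)
    parent = structure_score(G_L + G_R, H_L + H_R, lam)
    return 0.5 * (children - parent) - delta


# Gradients with opposite signs on the two sides: the split separates them well.
good = split_gain(G_L=-4.0, H_L=3.0, G_R=6.0, H_R=3.0, lam=1.0, delta=0.5)
# Identical statistics on both sides: the split adds complexity but no fit.
bad = split_gain(G_L=1.0, H_L=1.0, G_R=1.0, H_R=1.0, lam=1.0, delta=1.0)

for name, gain in (("good", good), ("bad", bad)):
    print(name, round(gain, 4), "split" if gain > 0 else "do not split")
```

The second case yields a negative Gain, so under the rule stated above that leaf node would not be split.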
[0088] The feature importance score is obtained by calculating the total gain brought by a given feature over all node splits on that feature in all trees. The score evaluates the value of the feature in improving the construction of the decision trees, and thus can be used as an indicator for feature importance ranking. Finally, the corresponding feature weight values are obtained by calling the get_score method of the booster object.
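The total-gain aggregation can be sketched without any library as follows. In the XGBoost Python package the raw totals are available from `Booster.get_score(importance_type="total_gain")`; the split records below are invented, and normalizing the totals so that they sum to 1 is an assumption of this sketch rather than something the text specifies.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def total_gain_weights(splits: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the gain of every split per feature, then normalize to sum to 1."""
    totals: Dict[str, float] = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    grand_total = sum(totals.values())
    return {feature: gain / grand_total for feature, gain in totals.items()}


def rank_features(weights: Dict[str, float]) -> List[str]:
    """Features ordered from most to least important."""
    return sorted(weights, key=weights.get, reverse=True)


# Hypothetical (feature, gain) records collected from every split in every tree.
splits = [
    ("hba1c", 6.0),
    ("bmi", 1.5),
    ("hba1c", 1.5),
    ("exercise", 1.0),
]
weights = total_gain_weights(splits)
print(weights)
print(rank_features(weights))
```

The resulting weights play the role of the feature weight values used later as calculation weights of the Euclidean distance.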
[0089] In this step, the Regularized Gradient Boosted Decision Tree parameter training is performed by using a grid search method, including a general parameter, an improving parameter, and a learning target parameter; and the general parameter controls macro parameters, the improving parameter controls the improvement in each step, and the learning target parameter controls the performance of a training target.
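A grid search over such parameters can be sketched as below. The parameter names `max_depth` and `eta` are real XGBoost booster parameters, but the candidate values are arbitrary and `toy_score` is a stand-in: in practice the evaluation function would train a model with the given parameters and return a cross-validated metric.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def grid_search(grid: Dict[str, List],
                evaluate: Callable[[Dict], float]) -> Tuple[Dict, float]:
    """Try every combination in the grid and keep the highest-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = evaluate(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params: Dict) -> float:
    """Placeholder metric that peaks at max_depth=4, eta=0.1 (not real training)."""
    return -abs(params["max_depth"] - 4) - 10.0 * abs(params["eta"] - 0.1)


grid = {
    "max_depth": [2, 4, 6],    # booster ("improving") parameter
    "eta": [0.05, 0.1, 0.3],   # booster ("improving") parameter: learning rate
}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost of an exhaustive grid grows multiplicatively with the number of candidate values per parameter, so the grid is usually kept coarse.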
[0090] (5) Decision Making Support Feedback
[0091] A domain expert marking result is taken as the standard for the patient information model with an ideal chronic disease management effect, a two-dimensional plane mapping image of the patient data is drawn through distributed representation and dimensionality reduction visualization, and the Euclidean distance between geometric centers of various feature areas in the mapping image is calculated in combination with the feature weight value to serve as a standardized management target. The Euclidean distance between the features of a patient who requires decision making support feedback in the two-dimensional plane mapping image is calculated, and the Euclidean distance is compared with a standard numerical value in combination with the feature weight value calculated from the Euclidean distance, so as to find a path of similar distance.
[0092] Knowledge in the knowledge graph is obtained according to the distance information of the features by using the SPARQL (SPARQL Protocol and RDF Query Language) query language and Jena rule reasoning, and a personalized management plan is generated for the patient, including exercise suggestions, dietary suggestions, medication suggestions, inspection suggestions, lifestyle suggestions, and the like. A SPARQL query statement includes the conditions that the queried information and names should satisfy. The conditions appear in the form of triples arranged in the order <subject, predicate, object>; the query conditions thus form a pattern, and the query result is the result of matching the condition triples against the RDF triples in the data file. Jena reasoning is rule-based, and the rules are defined by Rule objects.
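The feedback step can be sketched in Python as follows. The sketch is not SPARQL or Jena: it imitates triple-pattern matching with a small in-memory list, and it assumes that the feature weight simply multiplies the Euclidean distance between the geometric centers of two feature areas, which the text does not spell out. All coordinates, the weight, the tolerance and the `managedBy` predicate are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def centroid(points: List[Point]) -> Point:
    """Geometric center of a feature area on the two-dimensional plane."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def weighted_distance(a: Point, b: Point, weight: float) -> float:
    """Euclidean distance scaled by a feature weight (assumed combination rule)."""
    return weight * math.dist(a, b)


def match(triples: List[Triple], pattern: Pattern) -> List[Triple]:
    """Return triples matching <subject, predicate, object>; None is a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


# Expert-marked standard and a patient's layout (illustrative coordinates).
standard = {"diet": [(-1.0, 0.0), (1.0, 0.0)], "glucose": [(2.0, 4.0), (4.0, 4.0)]}
patient = {"diet": [(-1.0, 0.0), (1.0, 0.0)], "glucose": [(5.0, 8.0), (7.0, 8.0)]}
weight = 0.5      # hypothetical feature weight for the diet/glucose pair
tolerance = 1.0   # hypothetical acceptable deviation from the standard

target = weighted_distance(centroid(standard["diet"]), centroid(standard["glucose"]), weight)
actual = weighted_distance(centroid(patient["diet"]), centroid(patient["glucose"]), weight)
deviation = abs(actual - target)

knowledge = [
    ("glucose", "managedBy", "dietary suggestion"),
    ("glucose", "managedBy", "medication suggestion"),
    ("diet", "managedBy", "dietary suggestion"),
]
suggestions = match(knowledge, ("glucose", "managedBy", None)) if deviation > tolerance else []
print(target, actual, deviation, suggestions)
```

In the described system the lookup would be a SPARQL query against the fused knowledge graph, with Jena rules deriving the final management plan; the pattern `("glucose", "managedBy", None)` corresponds to a query with a fixed subject and predicate and a variable object.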
[0093] The present application further provides a patient data visualization system for assisting decision making in chronic diseases, including the following modules:
[0094] a chronic disease knowledge graph construction module: with clinical guidelines and knowledge literature related to chronic diseases as knowledge sources of a knowledge graph, performing unique identification on data semantics through SNOMED CT, manually constructing categories, attributes and instances, adding a data relationship and an attribute relationship, and generating a knowledge graph prototype file;
[0095] a patient information model construction module: collecting patient information, and converting data in a patient database into an RDF triple relationship that meets the OWL language specification; identifying nodes of a patient information model by using SNOMED CT, so as to achieve semantic extension of the patient data to domain knowledge, and fusing the patient information with the chronic disease knowledge graph to construct the patient information model;
[0096] a hyperplane feature map drawing module: converting the patient information model into a hyperplane feature map through a distributed representation, wherein the distributed representation adopts a translation-based model between entity vectors and relationship vectors;
[0097] a two-dimensional plane mapping module: the position information of a two-dimensional plane node corresponding to a two-dimensional position of the hyperplane feature map of the patient information model after dimensionality reduction, distinguishing different information categories in the knowledge graph by using the colors of the nodes, using a feature importance ranking of the Regularized Gradient Boosted Decision Tree algorithm as the ranking of correlation between each node and disease progression, and using a feature weight value as a calculation weight of the Euclidean distance; and
[0098] a decision making support feedback module: taking a domain expert marking result as the standard for the patient information model with an ideal chronic disease management effect, drawing a two-dimensional plane mapping image of the patient data through distributed representation and dimensionality reduction visualization, and calculating the Euclidean distance between geometric centers of various feature areas in the mapping image in combination with the feature weight value to serve as a standardized management target; calculating the Euclidean distance between the features of a patient who requires decision making support feedback in the two-dimensional plane mapping image, and comparing the Euclidean distance with a standard numerical value in combination with the feature weight value calculated from the Euclidean distance, so as to find a path of similar distance; and obtaining knowledge in the knowledge graph according to the distance information of the features.
[0099] The above descriptions are only preferred embodiments of the present application. Although the present application has been disclosed above by way of the preferred embodiments, the present application is not limited thereto. Anyone familiar with this art can, without departing from the scope of the technical solutions of the present application, use the method and technical content disclosed above to make many possible changes and modifications to the technical solutions of the present application, or modify them into equivalent embodiments. For example, the feature importance ranking can also use the CatBoost (Categorical Boosting) algorithm or the LightGBM algorithm. The distributed representation can also use translation models such as TransG, TransR, and CTransR. The mapping of the two-dimensional plane can also use principal component analysis (PCA), Sammon mapping, SNE and other dimensionality reduction algorithms. Therefore, any simple changes, equivalent changes and modifications made to the above embodiments based on the technical essence of the present application, without departing from the content of the technical solutions of the present application, still fall within the protection scope of the technical solutions of the present application.