Medical Prediction Method and System Based on Semantic Graph Network
20220277858 · 2022-09-01
Inventors
CPC classification
G06F18/254
PHYSICS
G16H50/20
PHYSICS
G16H50/70
PHYSICS
G06N3/0442
PHYSICS
G06V30/28
PHYSICS
International classification
G16H50/70
PHYSICS
Abstract
The present invention discloses a medical prediction method and system based on a semantic graph network, which recognizes entities in an electronic medical record based on domain knowledge and uses a bi-directional gated recurrent unit to learn sequence features of the text. Secondly, in order to extract semantic relations in the electronic medical record in a fine-granularity manner, the present invention defines two types of subgraphs, a graph representation based on defined knowledge and a graph representation based on undefined knowledge, and uses a Graph Convolution Network (GCN) and a Graph Attention Network (GAT) to extract a semantic relation representation, where the graph representation based on undefined knowledge allows the learning of a relation between an entity or a word and another entity or word, as well as a relation between an entity or a word and itself, in order to translate entity or word representations into a uniform graph embedding representation. For an attribute-value pair, the present invention uses a bi-directional gated recurrent unit (Bi-GRU) to extract the entity corresponding to a numerical feature or a categorical feature after extracting that feature from the electronic medical record, so as to construct an attribute-value graph representation. Finally, the semantic relations and attribute-values are fused to train a prediction model of a disease level.
Claims
1. A medical prediction method based on a semantic graph network, specifically comprising the following steps: S1. preprocessing medical text data; S2. performing feature extraction on the preprocessed medical text data; S3. fusing multi-granularity features of the extracted features to obtain a final document feature representation; and S4. predicting a chronic disease based on the final document feature representation.
2. The medical prediction method based on the semantic graph network according to claim 1, wherein Step S1 is specifically as follows: S11. manually annotating the medical text data according to a target category that needs to be predicted, and loading the medical text data into a domain ontology; S12. cutting the medical text data into Chinese character strings according to punctuation marks, numbers and space characters, and removing stop words.
3. The medical prediction method based on the semantic graph network according to claim 1, wherein the feature extraction in Step S2 includes: entity embedding representation, word embedding representation, semantic relation representation extraction, and attribute-value pair extraction.
4. The medical prediction method based on the semantic graph network according to claim 3, wherein the entity embedding representation is specifically as follows: first, mapping the preprocessed medical text data to the domain ontology; dividing the medical text data into semantic sets via a maximum matching method; then finding, from each semantic set, an entity set matching the semantic set and an entity type set corresponding to the entity set to obtain an entity representation and an entity type representation; and finally, combining the entity representation and the entity type representation to obtain an entity embedding representation.
5. The medical prediction method based on the semantic graph network according to claim 3, wherein the word embedding representation and the attribute-value pair extraction are specifically as follows: using a Bi-GRU to find a dependency relation between word sequences in the medical text data, and putting the sequence information between words into a graph attention network to identify a semantic relation and extract an attribute-value pair.
6. The medical prediction method based on the semantic graph network according to claim 3, wherein the semantic relation representation extraction is specifically as follows: using a graph convolution network and the graph attention network to construct a semantic relation graph and defining two types of subgraphs, a graph representation based on defined knowledge and a graph representation based on undefined knowledge, wherein the graph representation based on defined knowledge uses relations between entities marked in the domain ontology and uses the graph convolution network and the graph attention network to extract entity relations in an electronic medical record text; for an entity or word whose corresponding relation cannot be found in the domain ontology, the graph representation based on undefined knowledge directly uses the graph convolution network and the graph attention network to extract a relation between the words or the entities based on a dependency relation between words in context extracted by the Bi-GRU.
7. The medical prediction method based on the semantic graph network according to claim 1, wherein Step S3 is specifically as follows: feature-fusing an extracted entity representation, an extracted word representation, an extracted semantic relation representation, and an attribute-value pair representation to obtain the final document feature representation.
8. The medical prediction method based on the semantic graph network according to claim 1, wherein Step S4 is specifically as follows: inputting the document feature representation into softmax layer for medical prediction, and calculating a loss function based on a cross entropy between a real label and a predicted label to obtain a classification result of a disease type and a prediction result of a disease level.
9. A medical prediction system based on a semantic graph network, comprising a data preprocessing module, a feature extraction module, a multi-granularity feature fusion module, and a disease type classifier module; an output terminal of the data preprocessing module is connected to an input terminal of the feature extraction module; an output terminal of the feature extraction module is connected to an input terminal of the multi-granularity feature fusion module; an output terminal of the multi-granularity feature fusion module is connected to an input terminal of the disease type classifier module; the data preprocessing module is configured to manually annotate medical text data according to a target category to be predicted, and load the medical text data into a domain ontology, and is also configured to segment the medical text data into Chinese character strings according to punctuation marks, numbers, and space characters, and remove stop words; the feature extraction module is configured to extract an entity representation, a word representation, a semantic relation representation, and an attribute-value pair in the medical text data; the multi-granularity feature fusion module is configured to fuse the extracted entity representation, word representation, semantic relation representation, and attribute-value pair representation as inputs of a softmax layer for disease prediction; the disease type classifier module is configured to generate a classification result of a disease type.
10. The medical prediction system based on the semantic graph network according to claim 9, wherein the feature extraction module further includes four sub-modules, namely: an entity embedding representation module, a word embedding representation module, a semantic relation representation extraction module, and an attribute-value pair extraction module; the entity embedding representation module is connected to the word embedding representation module, the word embedding representation module is connected to the attribute-value pair extraction module, and the attribute-value pair extraction module is connected to the semantic relation representation extraction module; the entity embedding representation module is configured to map a processed medical text to the medical ontology, extract a concept's own feature and a concept type feature, and combine the concept's own feature and the concept type feature to extract a concept feature; the word embedding representation module is configured to use a Bi-GRU to learn a word sequence feature in context for a word for which no matching concept can be found in the medical ontology; the semantic relation representation extraction module is configured to find entity pairs having a corresponding relation category in the domain ontology and entity pairs whose corresponding relation category cannot be found in the domain ontology; the attribute-value pair extraction module is configured to extract a disease-time relation and a test-test result relation.
Description
BRIEF DESCRIPTION OF THE FIGURES
[0024] In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the drawings that need to be used in the embodiments. Obviously, the drawings in the following description are only some of the embodiments of the present invention. A person skilled in the art can obtain other drawings based on these drawings without creative work.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0027] The following clearly and completely describes the technical solutions in embodiments of the present invention in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by the person skilled in the art without inventive work shall fall within the protection scope of the present invention.
[0028] In order to make the foregoing objectives, features and advantages of the present invention more obvious and easy to understand, the present invention is further described in detail with reference to the drawings and specific embodiments.
Embodiment 1
[0029] Referring to
[0030] S2. Performing entity embedding representation (21), word embedding representation (22), semantic relation representation extraction (23), and attribute-value pair extraction (24) on the preprocessed medical text data.
[0031] The entity embedding representation (21): an entity embedding representation comprised an entity representation and an entity type representation. First, the preprocessed text was mapped to the domain ontology, and the text data was divided into semantic sets {Y.sub.1, . . . , Y.sub.n}∈D (D was the text data) via a maximum matching method, where D included an entity set {C.sub.1, . . . , C.sub.N}∈Y with a corresponding entity type set {C.sub.1type, . . . , C.sub.Ntype}, and the entity set could be found in the domain ontology. An entity representation was extracted by combining the entity representation and the entity type representation, denoted as e.sub.i=c.sub.i⊕c.sub.itype, e={e.sub.1, . . . , e.sub.n}, e.sub.i∈e, where c.sub.i was the concept's own feature and belonged to the concept set {C.sub.1, . . . , C.sub.N}, c.sub.itype was the concept c.sub.i's type feature and belonged to {C.sub.1type, . . . , C.sub.Ntype}, and ⊕ was a vector splicing operation. In this method, both the entity and a word belonged to word-level features. The word2vec model was used to convert the entity, the entity type and a word in context into a d-dimensional vector form. Graph representation methods of the entity and the word were introduced in the graph representation based on undefined knowledge method in (23).
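The mapping and splicing steps above can be sketched as follows; the toy ontology, entity names, and vector values are illustrative assumptions, not data from the embodiment.

```python
# Minimal sketch of entity embedding (21): forward maximum matching against a
# toy ontology, then splicing (⊕) the entity vector with its type vector.
ONTOLOGY = {"chronic constipation": "Disease", "blood pressure": "Test"}
ENTITY_VEC = {"chronic constipation": [0.1, 0.2], "blood pressure": [0.3, 0.4]}
TYPE_VEC = {"Disease": [1.0, 0.0], "Test": [0.0, 1.0]}

def max_match(tokens, ontology, max_len=3):
    """Greedy forward maximum matching: the longest ontology entry wins."""
    i, spans = 0, []
    while i < len(tokens):
        for L in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + L])
            if cand in ontology:
                spans.append(cand)
                i += L
                break
        else:
            i += 1  # no entity starts here; skip one token
    return spans

def entity_embedding(entity):
    # e_i = c_i ⊕ c_i_type (vector splicing = concatenation)
    return ENTITY_VEC[entity] + TYPE_VEC[ONTOLOGY[entity]]

entities = max_match("patient has chronic constipation".split(), ONTOLOGY)
```

In practice the vectors would come from word2vec as described above; the greedy matcher stands in for the maximum matching segmentation.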
[0032] The word embedding representation (22): Bi-GRU was used to capture a dependency relation between word sequences and extract a word representation. If there was a word sequence w.sub.i∈[w.sub.1, . . . , w.sub.n] with corresponding hidden units h.sub.i∈[h.sub.1, . . . , h.sub.n], context information of the word sequence and a corresponding hidden unit might be obtained by formula (1) and formula (2):

{right arrow over (h.sub.i)}={right arrow over (GRU)}(w.sub.i,θ), i∈[1,n] (1)

{left arrow over (h.sub.i)}={left arrow over (GRU)}(w.sub.i,θ), i∈[n,1] (2)

θ represented the parameters of the GRU model. The forward sequence information {right arrow over (h.sub.i)} and the reverse sequence information {left arrow over (h.sub.i)} were combined to extract a context feature h.sub.i=[{right arrow over (h.sub.i)}, {left arrow over (h.sub.i)}] of the word w.sub.i, where h.sub.i represented a hidden state. Finally, the sequence information between the words was put into a graph attention network to identify a semantic relation and extract an attribute-value pair.
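As a minimal, self-contained sketch of formulas (1) and (2), the following runs a single-unit GRU with scalar weights over a scalar sequence and splices the forward and backward states per position. The scalar simplification and the weight values are assumptions for illustration, not the embodiment's configuration.

```python
import math

def gru_step(h, x, p):
    # Standard GRU update with scalar weights p = (wz, uz, wr, ur, wh, uh)
    wz, uz, wr, ur, wh, uh = p
    z = 1 / (1 + math.exp(-(wz * x + uz * h)))   # update gate
    r = 1 / (1 + math.exp(-(wr * x + ur * h)))   # reset gate
    h_tilde = math.tanh(wh * x + uh * r * h)     # candidate state
    return (1 - z) * h + z * h_tilde

def bi_gru(seq, p):
    # Forward pass (formula (1)) and backward pass (formula (2)),
    # then splice h_i = [→h_i, ←h_i] at each position.
    fwd, h = [], 0.0
    for x in seq:
        h = gru_step(h, x, p)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(seq):
        h = gru_step(h, x, p)
        bwd.append(h)
    bwd.reverse()
    return list(zip(fwd, bwd))
```

For a one-element sequence the forward and backward states coincide, which is a quick sanity check on the splicing.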
[0033] The semantic relation representation extraction (23): in this step, the present invention used a graph convolution network and the graph attention network to construct a semantic relation graph and define two types of subgraphs: (1) graph representation based on defined knowledge: the subgraph used a relation between entities marked in the domain ontology, and used the graph convolution network and the graph attention network to extract a graph representation of an entity relation in an electronic medical record text. (2) Graph representation based on undefined knowledge: for an entity or a word (where the entity or the word could not be found in the domain ontology), according to a dependency relation between words in context extracted by the Bi-GRU, the graph convolution network and the graph attention network were directly used to extract a relation between the words or the entities.
[0034] (1) The graph representation based on defined knowledge: first, based on a medical ontology, the entities contained in an electronic medical record and the relations between the entities were identified as the nodes and edges of a graph, recorded as V.sup.K and E.sup.K, respectively. {h.sub.1, h.sub.2, . . . , h.sub.|n|} was used to represent the features of the nodes {v.sub.1, v.sub.2, . . . , v.sub.|n|}, h.sub.i∈R.sup.F, and e.sub.ij.sup.r=(v.sub.i,v.sub.j), i≠j, indicated that there was a corresponding relation r between the nodes v.sub.i and v.sub.j in the ontology. Then a knowledge graph representation model G.sup.K={V.sup.K,E.sup.K} was built based on V.sup.K and E.sup.K. Due to individual differences in patients, a fine-granularity relation between the entities could provide more detailed disease-related information and was more important for disease prediction. However, the same entity pair might correspond to a variety of different relations in the domain ontology. For example, between a disease entity “chronic constipation” and a treatment entity “Dumic” there might be a relation TrID (a treatment method improved a certain disease), a relation TrWD (a treatment method worsened a certain disease), a relation in which a treatment method was applied to a certain disease, or a relation in which the treatment effect was not stated. Therefore, the present invention used syntactic analysis to extract a trigger word and an adjective of the trigger word and combine them, and then used a cosine distance to calculate semantic similarity with a relation category, thereby determining which fine-granularity relation the entity pair belonged to. If there was no adjective of the trigger word in a sentence, similarity between the trigger word and a relation category was directly calculated, as shown in formulas (3) and (4):
p.sub.1=sim[(c.sub.i⊕f.sub.i),r.sub.i] (3)

p.sub.2=sim[c.sub.j,r.sub.j] (4)
Where, c.sub.i and c.sub.j represented the trigger words, f.sub.i represented the adjective of c.sub.i, r.sub.i and r.sub.j represented relation categories, and sim[a,b] represented the calculation of similarity between a and b. The present invention tested similarity threshold values in the range of 0.85-0.92 in an experiment, and the results showed that the best effect was obtained at 0.89.
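A minimal sketch of the similarity test in formulas (3) and (4) follows. The relation names, vectors, and the element-wise sum used to combine trigger word and adjective are assumptions for illustration; only the cosine measure and the 0.89 threshold come from the text.

```python
import math

SIM_THRESHOLD = 0.89  # value reported best in the embodiment's experiments

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify_relation(trigger_vec, relation_vecs, adjective_vec=None):
    # Formula (3): combine trigger and adjective; formula (4): trigger alone.
    # (Combination by element-wise sum is an assumption.)
    query = ([t + a for t, a in zip(trigger_vec, adjective_vec)]
             if adjective_vec else trigger_vec)
    best, best_sim = None, 0.0
    for name, rvec in relation_vecs.items():
        s = cosine(query, rvec)
        if s > best_sim:
            best, best_sim = name, s
    # Only accept a relation category above the tested threshold
    return best if best_sim >= SIM_THRESHOLD else None
```

Returning `None` below the threshold models the case where no fine-granularity relation is assigned.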
[0035] Next, an adjacency matrix A.sup.K was defined. For each graph, the present invention defined a binary matrix P∈R.sup.nd×nb to represent the relations between the entities in a sentence. If the entity pair v.sub.i and v.sub.j in the sentence had a corresponding entity relation in the domain ontology, then P.sub.ij=1; otherwise, P.sub.ij was equal to 0. The present invention only considered first-order neighbors, and the knowledge-based adjacency matrix was represented by formula (5):
[0036] After obtaining the adjacency matrix, the present invention first used the graph convolution network to learn node representations, as shown in formula (6):

H.sup.K(t)=ReLU(Ã.sup.KH.sup.K(t-1)W.sup.K(t-1)+B.sup.K) (6)
[0037] Where, D.sup.K was a degree matrix of A.sup.K, and the degree matrix was a diagonal matrix D.sub.ii.sup.K=Σ.sub.j=1.sup.n A.sub.ij.sup.K. W.sup.K and B.sup.K represented the weight and bias parameters, W.sup.K∈R.sup.(nd+nb)×l, B.sup.K∈R.sup.(nd+nb)×l. ReLU represented a nonlinear activation function. H.sup.K(t-1) represented the feature of the previous layer of H.sup.K(t).
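One graph-convolution layer in the form of formula (6) can be sketched in pure Python as below. Row normalization of the adjacency by the degree matrix (Ã = D⁻¹A) is an assumption consistent with the degree matrix defined above; the tiny two-node inputs are illustrative only.

```python
def matmul(A, B):
    # Plain list-of-lists matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gcn_layer(A, H, W, bias):
    # H(t) = ReLU(Ã H(t-1) W(t-1) + B), formula (6)
    n = len(A)
    # Normalize adjacency by the degree matrix: Ã = D^-1 A (assumed form)
    deg = [sum(row) or 1.0 for row in A]
    A_norm = [[A[i][j] / deg[i] for j in range(n)] for i in range(n)]
    Z = matmul(A_norm, matmul(H, W))
    # ReLU with a shared scalar bias (a simplification of the bias matrix B)
    return [[max(0.0, z + bias) for z in row] for row in Z]
```

With a fully connected two-node graph and identity features/weights, each node's output is the average of its neighbors' features.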
[0038] After the graph convolution layer, the present invention combined the entity relations in the domain ontology and used a graph attention layer to extract knowledge-based node representations. For a given node, the graph attention network first learned the importance of the neighboring nodes with the same relation and fused the neighboring nodes according to their weight scores. If there were node features h={h.sub.1, h.sub.2, . . . , h.sub.|n|}, h.sub.i∈R.sup.F, a new node representation set h′={h.sub.1′, h.sub.2′, . . . , h.sub.|n|′}, h.sub.i′∈R.sup.F′, was generated as an output via the graph attention layer. F′ represented the dimension of an output feature. In order to transform an input into a higher-level output feature, the graph attention layer used a weight matrix W∈R.sup.F′×F to parameterize a shared linear transformation at each node, and used a shared attention mechanism to calculate an attention coefficient, as shown in formula (7):
e.sub.ij.sup.Φr=a(W.sub.bh.sub.i,W.sub.b(h.sub.j|E.sub.r)) (7)

Where, e.sub.ij.sup.Φr represented the attention coefficient, in the subgraph Φ, of an entity pair v.sub.i and v.sub.j in the sentence having a relation r in the domain ontology. E.sub.r represented the relation vector of r. W.sub.b represented a weight, and a∈R.sup.2F′. Next, the present invention used formula (8) to regularize the weight scores of the adjacent nodes:
[0039] Where, N.sub.i.sup.Φr represented the neighbor nodes of a node v.sub.i having the relation r. Finally, the updated feature of the node v.sub.i was obtained by combining the knowledge graph with formula (9). X.sup.Φ={x.sub.1.sup.Φ, . . . , x.sub.n.sup.Φ}, x.sub.i.sup.Φ∈X.sup.Φ, was used to represent a knowledge graph contained in an electronic medical record. {x.sub.1.sup.Φ, . . . , x.sub.n.sup.Φ} was combined to obtain the knowledge graph G.sup.K of the electronic medical record, as shown in formula (10):
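The attention steps of formulas (7)-(9) can be sketched as follows. The learnable function a(·), the weight W.sub.b, and the relation vector E.sub.r are simplified away here: raw scores are plain dot products, which is an assumption for illustration; only the softmax regularization and weighted fusion follow the text directly.

```python
import math

def gat_update(h, neighbors):
    # Raw attention scores e_ij (formula (7)), sketched as dot products;
    # the patent's learnable a(...), W_b and relation vector E_r are omitted.
    scores = [sum(x * y for x, y in zip(h, nb)) for nb in neighbors]
    # Softmax regularization of neighbor weights (formula (8))
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    # Weighted fusion of neighbor features (formula (9))
    return [sum(a * nb[d] for a, nb in zip(alphas, neighbors))
            for d in range(len(h))]
```

A neighbor more similar to the query node receives a larger weight in the fused output.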
[0040] (2) The graph representation based on undefined knowledge
For an entity or a word whose corresponding relation category could not be found in the ontology, a dependency relation between the word sequences was extracted by the Bi-GRU, and the present invention used a graph convolution model to extract the graph representation based on undefined knowledge G.sup.C={V.sup.C,E.sup.C}. The adjacency matrix A.sup.C was represented by formula (11). If the word or entity node v.sub.p was related to v.sub.q, where p=q or p≠q (when p=q, the feature of the concept or the word itself was learned), then U.sub.pq=1; otherwise, U.sub.pq was equal to 0.
[0041] The node representation learning of the graph convolution network was shown in formula (12):

H.sup.C(t)=ReLU(Ã.sup.CH.sup.C(t-1)W.sup.C(t-1)+B.sup.C) (12)
[0042] Where, D.sup.C was a degree matrix of A.sup.C, and the degree matrix was a diagonal matrix D.sub.ii.sup.C=Σ.sub.j=1.sup.n A.sub.ij.sup.C. W.sup.C and B.sup.C represented the weight and the bias parameters. Then the graph attention network was used to update the representation of the node v.sub.p, as shown in formula (13):
e.sub.pq.sup.Φ=a(W.sub.jh.sub.p,W.sub.jh.sub.q) (13)
Next, formula (14) was used to regularize the weight scores of the adjacent nodes, and finally formula (15) was used to calculate the graph representation of the entity or the word v.sub.p and v.sub.q.
[0043] Where, ∥ represented the vector splicing operation. LeakyReLU represented a non-linear activation function. N.sub.j represented the neighbor nodes of v.sub.p. z.sup.Φ={z.sub.1.sup.Φ, . . . , z.sub.m.sup.Φ}, z.sub.j.sup.Φ∈z.sup.Φ, represented a text graph contained in the electronic medical record. The set of graphs {z.sub.1.sup.Φ, . . . , z.sub.m.sup.Φ} was combined to obtain the text graph representation G.sup.C, as shown in formula (16):

G.sup.C=Σ.sub.j=1.sup.mz.sub.j.sup.Φ (16)
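The undefined-knowledge adjacency of formula (11) combines dependency edges (p≠q) with self-loops (p=q), so each word or entity can also learn a relation with itself. A sketch, where the edge list standing in for the Bi-GRU dependency output is an illustrative assumption:

```python
def build_adjacency(n, edges):
    # Formula (11) sketch: U_pq = 1 when v_p is related to v_q (p != q,
    # a dependency edge) or when p == q (the node's relation with itself).
    A = [[0] * n for _ in range(n)]
    for p in range(n):
        A[p][p] = 1            # p == q: self relation
    for p, q in edges:
        A[p][q] = A[q][p] = 1  # p != q: dependency relation, symmetric
    return A
```

The resulting matrix can be fed to the same graph convolution update as formula (12).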
[0044] The attribute-value pair extraction (24): an attribute-value could be divided into two types: disease-time and test-test result, where the value of a disease-time included only a numeric type, and the value of a test-test result included the numeric type and a categorical type. Each attribute-value included two elements, an attribute and its corresponding value. Unlike an entity relation, where a tail entity was usually relatively stable and would not change from patient to patient, in the attribute-value the value would vary from patient to patient; for example, the blood pressure value of each patient was different. For the numeric type, each value could be expressed in different units, such as “10 years” and “122/70 mmHg”. For this type, the present invention first extracted a real value in the EMR and its corresponding unit symbol, including a ratio symbol, such as “47.6%”, and a character symbol, such as “5 years”. If there were a real value D.sub.i and its corresponding unit symbol U.sub.i, the updated value could be represented by v.sub.i=D.sub.i⊕U.sub.i. A categorical type value was considered to be a word-level representation and did not have a unit symbol. Due to the different expressions of different doctors, negative words contained in the electronic medical record usually changed the polarity of the categorical value; for example, “not abnormal” and “normal” in “a patient's cardiac ultrasound was not abnormal” and “the patient's cardiac ultrasound was normal” had the same meaning. Therefore, it was necessary to combine the negative words to extract the categorical value feature. If there was no negative word prefix before the type value, the word vector representation of the type value was directly extracted. If the type value was prefixed by a negative word, the present invention first combined the negative word with the type value, and then calculated similarity between the combined value and other type values via the cosine distance (here the similarity threshold was also set to 0.9).
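A hypothetical sketch of the two value-handling steps described above: splitting a numeric value from its unit symbol, and folding a negation prefix into a categorical value. The negation markers and the antonym lookup (standing in for the cosine-distance matching) are assumptions for illustration.

```python
import re

NEGATIONS = ("not ", "no ")  # assumed negative-word prefixes

def parse_numeric(text):
    # Split a real value D_i from its unit symbol U_i,
    # e.g. "10 years" -> (10.0, "years"), "47.6%" -> (47.6, "%")
    m = re.match(r"\s*([\d.]+)\s*(.*)", text)
    return (float(m.group(1)), m.group(2)) if m else None

def normalize_categorical(text, antonyms):
    # Combine a negative word with the type value so that e.g.
    # "not abnormal" collapses to the same value as "normal".
    for neg in NEGATIONS:
        if text.startswith(neg):
            rest = text[len(neg):]
            return antonyms.get(rest, text)
    return text
```

Here a table lookup replaces the embedding similarity described in the text; the behavior (mapping “not abnormal” and “normal” to one value) is the same.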
[0045] According to the guidance of a medical expert, a quantitative threshold value was set for the value of each examination result during training for disease inference. The value of the examination result was divided into 4 levels: a low level, a normal level, a high level, and a very high level. If there were an examination entity v.sub.n, its corresponding examination result v.sub.m and a grade index l.sub.i, i∈[1,4], the attribute-value of the test-test result could be expressed as a graph g.sub.n.sup.Φ=[v.sub.n;(v.sub.m+l.sub.i)], where [x.sub.1;x.sub.2] represented that vector splicing of x.sub.1 and x.sub.2 was performed. For the disease-time, if there were a disease entity v.sub.o and its corresponding time v.sub.s, the attribute-value of the disease-time could be expressed as g.sub.o.sup.Φ=[v.sub.o;v.sub.s]. The expression of an attribute-value relation in the test-test result was the same as that of the disease-time. g.sub.k.sup.Φ was used to represent one of the graphs in the attribute-value. The graphs g.sub.k.sup.Φ∈{g.sub.1.sup.Φ, . . . , g.sub.l.sup.Φ} were combined to obtain the graph of the attribute-value in a document, as shown in formula (17):
G.sup.V=Σ.sub.k=1.sup.lg.sub.k.sup.Φ (17)
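The four-level grading of an examination value can be sketched as below. The threshold arguments are illustrative assumptions; the embodiment's actual thresholds were set per examination under expert guidance and are not given in the text.

```python
def grade(value, low, high, very_high):
    # Map an examination value to the 4 levels described in the text:
    # low / normal / high / very high, given expert-set thresholds.
    if value < low:
        return "low"
    if value <= high:
        return "normal"
    if value <= very_high:
        return "high"
    return "very high"
```

The returned level corresponds to the grade index l.sub.i spliced into the test-test result graph g.sub.n.sup.Φ.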
[0046] In the process of extracting an attribute-value pair, the present invention first identified a numerical value and a categorical value contained in a sentence, then learned context information of the value via the Bi-GRU, and extracted the entity closest to the value as its corresponding attribute feature.
S3. obtaining a final document feature representation d.sub.i, i∈[1 . . . n] by combining the graph representation based on defined knowledge, the graph representation based on undefined knowledge and an attribute-value-based graph representation, as shown in formula (18).
d.sub.i=[G.sup.K⊕G.sup.C⊕G.sup.V] (18)
Where, G.sup.K was knowledge graph representation, G.sup.C was text graph representation, G.sup.V was attribute-value graph representation, and ⊕ was the vector splicing operation.
S4. using the document feature representation d as an input of a softmax layer to predict the level of COPD for the document, and calculating a loss function based on the cross entropy between the real label and the predicted label, as shown in formula (19) and formula (20).
Where, W.sub.c and b.sub.c represented the weight matrix and the bias term in the classification layer, θ represented the parameters in the model, including W.sup.k, W.sup.c, W.sub.e, and c represented the number of categorical labels, c>1. The loss represented the cross entropy between the real label y.sub.i and the predicted label ŷ.sub.i.
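Steps S3 and S4 can be sketched together as follows. Formulas (19) and (20) are not reproduced in the text, so the softmax and cross-entropy below are the standard forms the description implies; the classification weight W.sub.c and bias b.sub.c are omitted, and the input vectors are illustrative assumptions.

```python
import math

def fuse(gk, gc, gv):
    # d = G^K ⊕ G^C ⊕ G^V (vector splicing, formula (18))
    return gk + gc + gv

def softmax(logits):
    # Probability over c categorical labels (standard softmax form)
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    return [e / sum(exps) for e in exps]

def cross_entropy(y_true, y_pred):
    # Cross entropy between the real label y and the predicted label ŷ
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred))

d = fuse([0.2], [0.4], [0.1])
probs = softmax([2.0, 0.5, 0.1])
```

During training the loss would be averaged over documents and minimized with respect to the model parameters θ.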
[0047] Referring to
[0048] An output terminal of the data preprocessing module is connected to an input terminal of the feature extraction module. An output terminal of the feature extraction module is connected to an input terminal of the multi-granularity feature fusion module. An output terminal of the multi-granularity feature fusion module is connected to an input terminal of the disease type classifier module.
[0049] The data preprocessing module was configured to manually label medical text data according to a target category to be predicted, then load the medical text data into a domain ontology; divide a text to be processed into Chinese character strings according to punctuation marks, numbers and space characters, and remove stop words.
[0050] The feature extraction module was divided into four submodules, namely: an entity embedding representation module, a word embedding representation module, a semantic relation representation extraction module, and an attribute-value pair extraction module.
(1) The entity embedding representation module was configured to map a processed medical text to a medical ontology, extract a concept's own feature and a concept type feature, and combine the concept's own feature and the concept type feature to extract a concept feature.
(2) The word embedding representation module was configured to use a Bi-GRU to learn the sequence feature of a word in context for a word for which no matching concept could be found in the medical ontology.
(3) The semantic relation representation extraction module: semantic relations included three types: an entity-entity relation, an entity-word relation, and a word-word relation. The entity-entity relation could be divided into two types: graph representation based on defined knowledge (referring to an entity pair for which a corresponding relation category could be found in the domain ontology) and graph representation based on undefined knowledge (referring to an entity pair for which the corresponding relation category could not be found in the domain ontology). A word was not a medical term but might include important semantic information (such as basic patient information). In the graph representation based on undefined knowledge, the method allowed extraction of a relation between an entity or a word and another entity or word, as well as a graph representation of the entity or the word itself.
(4) The attribute-value pair extraction module: an attribute-value pair included two categories: disease-time and test-test result. An attribute referred to an entity representation in Step (21). A value could be divided into two types: a numeric type value and a categorical type value. A value in the disease-time included only the numeric type value, and a value in the test-test result included the numeric type value and the categorical type value. An attribute-value graph representation was constructed according to each attribute and its corresponding value.
[0051] The multi-granularity feature fusion module was configured to fuse the extracted entity representation, word representation, semantic relation representation, and attribute-value pair representation as inputs of a softmax layer for disease prediction. In order to prevent overfitting, the convolution layer of the graph convolution network used a dropout operation, and zero padding was used to maintain the validity of a sentence.
[0052] The disease type classifier module was configured to put the result of model training into a softmax classification layer, and use the softmax classifier to generate the classification result of the final disease type.
[0053] The foregoing embodiments only describe the preferred mode of the present invention, and do not limit the scope of the present invention. Without departing from the design spirit of the present invention, a person skilled in the art can make variations and improvements to the technical solutions of the present invention, which should fall within the protection scope determined by the claims of the present invention.