AUTOMATIC INDUSTRY CLASSIFICATION METHOD AND SYSTEM

Abstract

An automatic industry classification method comprises: determining a scope of target patents; defining a target industry tree; generating marks on the target industry tree; performing a rough classification for the target patents by using the marks; and performing a fine classification for the target patents according to a result of the rough classification. The automatic industry classification method and system provided by the present invention use a transductive learning method, so that the information contained in a small number of annotations is fully exploited. The method and system use IPC information, which enriches the information dimensions and reduces the amount of calculation needed in the classification. The method and system further use hierarchical vectors generated from the abstract, the claims and the description, so that word-order information is preserved and the patent text is deeply mined.

Claims

1. An automatic industry classification method, comprising determining a scope of target patents, wherein the automatic industry classification method further comprises following steps: step 1: defining a target industry tree; step 2: generating marks on the target industry tree; step 3: performing a rough classification for the target patents by using the marks; and step 4: performing a fine classification for the target patents according to a result of the rough classification.

2. The automatic industry classification method according to claim 1, wherein step 1 further comprises: defining an industry tree I={i.sub.1, . . . , i.sub.j, . . . , i.sub.n} as needed, wherein i.sub.j∈I and is a first level industry, j is a serial number of the first level industry, 1≤j≤n, and n is a number of all leaf nodes of I; and setting i.sub.jkl . . . ={i.sub.jkl . . . 1, . . . , i.sub.jkl . . . t} as any non-leaf node of I, wherein degree of other nodes except the leaf nodes is greater than or equal to 2, k is a serial number of a second level industry, l is a serial number of a third level industry, and t is a serial number of a penultimate level industry; wherein the determining of the scope of the target patents is to manually determine the scope of the target patents to be classified as needed.

3. (canceled)

4. (canceled)

5. The automatic industry classification method according to claim 1, wherein step 2 further comprises: according to resource constraints, determining a number p of patents to be marked, wherein p≥N, each leaf node of the target industry tree is marked with at least one patent belonging to the node, and N is a number of a last level industry.

6. The automatic industry classification method according to claim 2, wherein step 3 comprises determining nodes above the leaf nodes; and wherein step 3 further comprises following sub-steps: step 31: generating a node set V of a graph; step 32: arranging the marks; step 33: generating an edge set E of the graph; step 34: generating an adjacency matrix; and step 35: performing node division.

7. (canceled)

8. The automatic industry classification method according to claim 6, wherein step 31 further comprises: defining one or more International Patent Classification (IPC) of each target patent as an IPC combination IPC.sub.v={ipc.sub.1, . . . , ipc.sub.q}, wherein all different IPC combinations of the target patents form the node set V.

9. The automatic industry classification method according to claim 6, wherein step 32 further comprises: taking an industry on a leaf node marked with patents as a classification y.sub.i∈𝒴 of the leaf node, setting a number of marked nodes to be l, adjusting a sequence of the leaf nodes, wherein the marked nodes are moved to the front, so that 1≤i≤l; and verifying whether l<<u, wherein u is a number of unmarked nodes, and if not, adjusting the leaf nodes marked with patents, otherwise V={IPC.sub.1, . . . , IPC.sub.l, IPC.sub.l+1, . . . , IPC.sub.l+u}.

10. (canceled)

11. The automatic industry classification method according to claim 6, wherein the edge set E is a matrix, and a weight e.sub.ij of edges between two vertices is a number of patents in a union IPC.sub.i∪IPC.sub.j of IPCs of the two vertices, wherein, e.sub.ij is value in the matrix E.

12. The automatic industry classification method according to claim 6, wherein step 34 further comprises following sub-steps: step 341: generating a distance matrix S, wherein a calculation formula of the distance matrix S is s.sub.ij=∥e.sub.i−e.sub.j∥.sub.2, wherein, e.sub.i and e.sub.j are respectively an i-th row and a j-th row of the edge set E; step 342: generating the adjacency matrix W by using the distance matrix S.

13. (canceled)

14. The automatic industry classification method according to claim 6, wherein step 35 further comprises following sub-steps: step 351: generating a degree matrix D=diag(d.sub.1, d.sub.2, . . . , d.sub.l+u), having a diagonal element d.sub.i=Σ.sub.j=1.sup.l+uW.sub.ij, wherein, u is a number of unmarked nodes, and W.sub.ij is an element of the adjacency matrix W; step 352: generating a marked matrix, a nonnegative (l+u)×|𝒴| marked matrix F=(F.sub.1.sup.T, F.sub.2.sup.T, . . . , F.sub.l+u.sup.T).sup.T, wherein an element of an i-th row F.sub.i=(F.sub.i1, F.sub.i2, . . . , F.sub.i|𝒴|) is a marked vector of IPC.sub.i in the node set, a classification rule is y.sub.i=argmax.sub.1≤j≤|𝒴|F.sub.ij, wherein, 𝒴 is a set of industries, and T represents a transposition of a matrix; step 353: initializing the nonnegative marked matrix F, for i=1, 2, . . . , m and j=1, 2, . . . , |𝒴|, $F(0) = Y_{ij} = \begin{cases} 1, & \text{if } (1 \leq i \leq l) \land (y_{i} = j) \\ 0, & \text{otherwise} \end{cases};$ step 354: constructing a propagation matrix $B = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$ wherein, $D^{-\frac{1}{2}} = \mathrm{diag}\left(\frac{1}{\sqrt{d_{1}}}, \frac{1}{\sqrt{d_{2}}}, \ldots, \frac{1}{\sqrt{d_{l+u}}}\right),$ d represents diagonal elements of the degree matrix D; step 355: generating an iterative calculation formula F(t+1)=αBF(t)+(1−α)Y, wherein, α∈(0,1) is a parameter, F(t) is a result of a t-th iteration, and Y is an initial matrix; step 356: iterating the calculation formula to convergence to obtain a state $F^{*} = \lim_{t \to \infty} F(t) = (1 - \alpha)(M - \alpha B)^{-1} Y$ under convergence, wherein, M is a unit matrix; and step 357: performing a prediction of the unmarked nodes y.sub.i=argmax.sub.1≤j≤|𝒴|F.sub.ij*, wherein, l+1≤i≤l+u.

15. The automatic industry classification method according to claim 6, wherein step 4 further comprises following sub-steps: step 41: setting objects to be classified; step 42: extracting text information of patents; step 43: generating text sets to be trained; step 44: performing text vectorization; step 45: performing patent classification; and step 46: in any leaf node classed by the step 45, identifying a patent, wherein the patent does not belong to any industry of the leaf nodes on the target industry tree.

16. The automatic industry classification method according to claim 15, wherein step 41 further comprises taking patent nodes of each class divided in step 3 as a group, wherein patents corresponding to a patent node marked as y.sub.i∈𝒴 are in the group, and there are |𝒴| groups; wherein step 42 further comprises extracting an abstract, claims and a description of each patent in each group, performing word segmentation of text information of patent by using an existing tool, and generating a text set G={g.sub.1, . . . , g.sub.n}, wherein g.sub.i=(p.sub.i1, p.sub.i2, p.sub.i3), p.sub.i1, p.sub.i2, and p.sub.i3 are respectively word sequences obtained by word segmentation of an abstract, claims, and a description of an i-th patent; and wherein the text sets to be trained comprise the text set G, a text set G.sub.1={p.sub.11, . . . , p.sub.n1}, a text set G.sub.2={p.sub.12, . . . , p.sub.n2} and a text set G.sub.3={p.sub.13, . . . , p.sub.n3}, the text set G, the text set G.sub.1, the text set G.sub.2, the text set G.sub.3 are respectively composed of word segmentation results of all-texts, abstracts, claims, and descriptions of the patents in the group.

17. (canceled)

18. (canceled)

19. The automatic industry classification method according to claim 16, wherein step 44 further comprises following sub-steps: step 441: vectorizing a text in the text sets to be trained, wherein in each text set to be trained, an element P=(t.sub.1, . . . , t.sub.m) is a segmented word sequence with m elements, t.sub.i∈P is determined by w words t.sub.i, context={t.sub.i−w, . . . , t.sub.i−2, t.sub.i−1, t.sub.i+1, t.sub.i+2, . . . , t.sub.i+w} before and after t.sub.i, and by maximizing $\frac{1}{m} \sum_{i=w}^{m-w} \log p(t_{i} \mid t_{i,\mathrm{context}}, pid)$ wherein, the pid is a paragraph number of t.sub.i in p, $p(t_{i} \mid t_{i,\mathrm{context}}, pid) = \frac{e^{y_{t_{i}}}}{\sum_{j} e^{y_{j}}},$ y.sub.t.sub.i=b+UΦ(t.sub.i, context,pid), U and b are parameters of softmax, and a vector corresponding to P is obtained by training data to be trained using a stochastic gradient descent method; and step 442: generating a matrix of the text, wherein vectorization results of G={g.sub.1, . . . , g.sub.n}, G.sub.1={p.sub.11, . . . , p.sub.n1}, G.sub.2={p.sub.12, . . . , p.sub.n2} and G.sub.3={p.sub.13, . . . , p.sub.n3} are supposed to be respectively H.sub.1={h.sub.11, . . . , h.sub.n1}, H.sub.2={h.sub.12, . . . , h.sub.n2}, H.sub.3={h.sub.13, . . . , h.sub.n3}, and H.sub.4={h.sub.14, . . . , h.sub.n4}, then a generated set of matrix of text of target patents is H={h.sub.1, . . . , h.sub.n}, wherein h.sub.i=(h.sub.i1, h.sub.i2, h.sub.i3, h.sub.i4).

20. (canceled)

21. (canceled)

22. The automatic industry classification method according to claim 15, wherein step 45 further comprises setting marked patents as S=∪.sub.j=1.sup.kS.sub.j⊂H, wherein, S.sub.j≠Ø is the marked patent of a j-th leaf node on the industry tree, and initializing k cluster centers of a k-means algorithm using the marked patents, wherein a cluster membership of the marked patents is not changed in an iterative updating process of clusters.

23. The automatic industry classification method according to claim 15, wherein step 46 further comprises following sub-steps: step 461: calculating a k distance of a patent p, and setting a k-th distance of the patent p as k−distance(o), wherein in patents divided into a leaf node on the industry tree, there is a patent o, and the distance between the patent o and the patent p is d(p, o); step 462: determining a k-th distance domain of the patent p, wherein a set of patents whose distance from the patent p is ≤k−distance(o) is called the k-th distance domain N.sub.k(p) of the patent p; step 463: calculating a reachable distance reachdist(p, o)=max{k−distance(o), ∥p−o∥} of the patent p relative to the patent o; step 464: calculating a local reachable density $lrd_{k}(p) = \frac{|N_{k}(p)|}{\sum_{o \in N_{k}(p)} \mathrm{reachdist}_{k}(p, o)};$ step 465: calculating a local outlier factor $LOF_{k}(p) = \frac{\sum_{o \in N_{k}(p)} \frac{lrd_{k}(o)}{lrd_{k}(p)}}{|N_{k}(p)|};$ and step 466: if LOF(p) is greater than a threshold, determining that p is an outlier, and does not belong to the leaf node; wherein if following two conditions are met, k−distance(o)=d(p,o): a first condition is that: in the leaf node, there are at least k patents q to make d(p,q)≤d(p,o); a second condition is that: in the leaf node, there are at most k−1 patents q to make d(p,q)<d(p,o).

24. (canceled)

25. An automatic industry classification system, comprising a confirmation module for determining a scope of target patents, further comprises: an industry tree generation module for defining a target industry tree; a mark generation module for generating marks on the target industry tree; a rough classification module for performing a rough classification for the target patents by using the marks; a fine classification module for performing a fine classification for the target patents according to a result of the rough classification; wherein, the automatic industry classification system performs automatic industry classification by executing the automatic industry classification method of claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0033] FIG. 1 is a flowchart of a preferred embodiment of an automatic industry classification method according to the present invention.

[0034] FIG. 1A is a flowchart of a rough classification method for target patents in the embodiment, shown in FIG. 1, of the automatic industry classification method according to the present invention.

[0035] FIG. 1B is a flowchart of a generating method of an adjacency matrix in the embodiment, shown in FIG. 1, of the automatic industry classification method according to the present invention.

[0036] FIG. 1C is a flowchart of a node division method in the embodiment, shown in FIG. 1, of the automatic industry classification method according to the present invention.

[0037] FIG. 1D is a flowchart of a fine classification method for target patents in the embodiment, shown in FIG. 1, of the automatic industry classification method according to the present invention.

[0038] FIG. 1E is a flowchart of a text vectorization method in the embodiment, shown in FIG. 1, of the automatic industry classification method according to the present invention.

[0039] FIG. 1F is a flowchart of a patent classification method in the embodiment, shown in FIG. 1, of the automatic industry classification method according to the present invention.

[0040] FIG. 2 is a block diagram of a preferred embodiment of an automatic industry classification system according to the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0041] The present invention is further described with reference to the drawings and specific embodiments.

Embodiment 1

[0042] As shown in FIG. 1 and FIG. 2, step 1000 is executed, and a target industry tree is defined by using an industry tree generation module 200, and a scope of patents to be classified is manually determined as needed.

[0043] Step 1100 is executed, and a scope of target patents is determined by using a confirmation module 210. An industry tree I={i.sub.1, . . . , i.sub.j, . . . , i.sub.n} is defined as needed, wherein i.sub.j∈I is a first level industry, j is a serial number of the first level industry, 1≤j≤n, and n is the number of all leaf nodes of I. Any non-leaf node of I is written as i.sub.jkl . . . ={i.sub.jkl . . . 1, . . . , i.sub.jkl . . . t}, and the degree of every node other than the leaf nodes is greater than or equal to 2, wherein k is a serial number of a second level industry, l is a serial number of a third level industry, and t is a serial number of a penultimate level industry.
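The tree definition above can be illustrated with a minimal sketch. The nested-dict representation, the industry names and the helper names `leaf_paths` and `degree_violations` are all hypothetical and are not part of the specification; the sketch only shows how the leaf count N and the "degree ≥ 2 for non-leaf nodes" condition could be checked.

```python
# Hypothetical nested-dict representation: a key is an industry name and its
# value is a dict of sub-industries; an empty dict marks a leaf node.
industry_tree = {
    "energy": {
        "solar": {},
        "wind": {},
    },
    "transport": {
        "rail": {},
        "road": {"electric vehicles": {}, "combustion vehicles": {}},
    },
}


def leaf_paths(tree, prefix=()):
    """Return the path from the root to every leaf node, in insertion order."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(path)
    return paths


def degree_violations(tree, prefix=()):
    """Return the paths of named non-leaf nodes that have fewer than two children."""
    bad = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            if len(children) < 2:
                bad.append(path)
            bad.extend(degree_violations(children, path))
    return bad


leaves = leaf_paths(industry_tree)
N = len(leaves)  # number of leaf (last level) industries
```

In this toy tree N is 5, so at least five patents would have to be marked, one per leaf node.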

[0044] Step 1200 is executed, and a mark generation module 220 is used to generate marks on the target industry tree. The number p of patents which can be marked is determined according to resource constraints, wherein p≥N and N is the number of last level industries, and each leaf node of the industry tree should be marked with at least one patent belonging to that node.

[0045] Step 1300 is executed, and a rough classification module 230 is used to perform a rough classification for the target patents by using the marks, and nodes above the leaf nodes are determined. As shown in FIG. 1A, step 1310 is executed, and a node set V of a graph is generated. The one or more International Patent Classification (IPC) codes of each target patent are defined as an IPC combination IPC.sub.v={ipc.sub.1, . . . , ipc.sub.q}, and all different IPC combinations of the target patents form the node set V.

[0046] Step 1320 is executed, and the marks are arranged. The industry on the leaf node marked with patents is taken as a classification y.sub.i∈𝒴 of the leaf node, the number of nodes which have been marked is set to be l, and the sequence of the nodes is adjusted so that the marked nodes come first, then 1≤i≤l. Whether l<<u is verified, wherein u is the number of unmarked nodes; if not, the marked patents are adjusted, otherwise V={IPC.sub.1, . . . , IPC.sub.l, IPC.sub.l+1, . . . , IPC.sub.l+u}.

[0047] Step 1330 is executed, and an edge set E of the graph is generated. The edge set E is a matrix, and the weight e.sub.ij of the edge between two vertices is the number of patents whose IPC is in the union IPC.sub.i∪IPC.sub.j of the IPCs of the two vertices, wherein e.sub.ij is an entry of the matrix E.
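The construction of the node set V and the edge matrix E can be sketched as follows. The patent identifiers and IPC assignments are toy data, and the counting rule in `edge_weight` (a patent is counted when at least one of its IPC codes lies in the union) is an assumed reading, since the text does not spell the rule out.

```python
# Hypothetical toy data: patent id -> set of IPC codes assigned to that patent.
patents = {
    "P1": {"G06F16/35"},
    "P2": {"G06F16/35", "G06K9/62"},
    "P3": {"G06K9/62"},
    "P4": {"G06F16/35"},
}

# Node set V: every distinct IPC combination among the target patents,
# sorted only to make the node order deterministic.
V = sorted({frozenset(codes) for codes in patents.values()}, key=sorted)


def edge_weight(combo_i, combo_j, patents):
    """Count patents having at least one IPC code in combo_i | combo_j.

    Assumption: this is one possible reading of "the number of patents in
    the union" of the IPCs of the two vertices.
    """
    union = combo_i | combo_j
    return sum(1 for codes in patents.values() if codes & union)


# Edge set E expressed as a symmetric matrix of weights e_ij.
E = [[edge_weight(vi, vj, patents) for vj in V] for vi in V]
```

With this toy data there are three nodes, and the rows of E are the vectors later compared pairwise to obtain the distance matrix S.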

[0048] Step 1340 is executed, and an adjacency matrix is generated. As shown in FIG. 1B, step 1341 is executed, and a distance matrix S is generated. A calculation formula of the distance matrix S is s.sub.ij=∥e.sub.i−e.sub.j∥.sub.2, wherein, e.sub.i and e.sub.j are respectively the i-th row and the j-th row of the edge set E.

[0049] Step 1342 is executed, and the adjacency matrix W is generated by using the distance matrix S.

[0050] Step 1350 is executed, and node division is performed. As shown in FIG. 1C, step 1351 is executed, and a degree matrix D=diag(d.sub.1, d.sub.2, . . . , d.sub.l+u) is generated, whose diagonal elements are d.sub.i=Σ.sub.j=1.sup.l+uW.sub.ij, wherein u is the number of unmarked nodes, and W.sub.ij is an element of the adjacency matrix W.

[0051] Step 1352 is executed, and a marked matrix is generated: a nonnegative (l+u)×|𝒴| marked matrix F=(F.sub.1.sup.T, F.sub.2.sup.T, . . . , F.sub.l+u.sup.T).sup.T, wherein the i-th row F.sub.i=(F.sub.i1, F.sub.i2, . . . , F.sub.i|𝒴|) is a marked vector of IPC.sub.i in the node set, a classification rule is y.sub.i=argmax.sub.1≤j≤|𝒴|F.sub.ij, 𝒴 is a set of industries, and T represents the transpose of a matrix.

[0052] Step 1353 is executed, and the marked matrix F is initialized, for i=1, 2, . . . , m (where m=l+u is the number of rows of F) and j=1, 2, . . . , |𝒴|,

[00009] $F(0) = Y_{ij} = \begin{cases} 1, & \text{if } (1 \leq i \leq l) \land (y_{i} = j) \\ 0, & \text{otherwise} \end{cases}.$

[0053] Step 1354 is executed, and a propagation matrix

[00010] $B = D^{- \frac{1}{2}} W D^{- \frac{1}{2}}$

is constructed, wherein,

[00011] $D^{-\frac{1}{2}} = \mathrm{diag}\left(\frac{1}{\sqrt{d_{1}}}, \frac{1}{\sqrt{d_{2}}}, \ldots, \frac{1}{\sqrt{d_{l+u}}}\right),$

and d.sub.1, . . . , d.sub.l+u are the diagonal elements of the degree matrix D.

[0054] Step 1355 is executed, and an iterative calculation formula F(t+1)=αBF(t)+(1−α)Y is generated, wherein, α∈(0,1) is a parameter, F(t) is a result of the t-th iteration, and Y is an initial matrix.

[0055] Step 1356 is executed, and the calculation formula is iterated to convergence to obtain a state

[00012] $F^{*} = \lim_{t \to \infty} F(t) = (1 - \alpha)(M - \alpha B)^{-1} Y$

under convergence, wherein, M is a unit matrix.

[0056] Step 1357 is executed, and a prediction of unmarked nodes y.sub.i=argmax.sub.1≤j≤|𝒴|F.sub.ij* is performed, wherein, l+1≤i≤l+u.
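Steps 1351 to 1357 can be sketched in plain Python as follows. The function name `label_propagation`, the toy adjacency matrix, the value α=0.5 and the fixed iteration count are assumptions chosen for illustration; the sketch iterates F(t+1)=αBF(t)+(1−α)Y rather than inverting the matrix, and it assumes every node has a positive degree.

```python
def label_propagation(W, labels, num_classes, alpha=0.5, iterations=200):
    """Iterate F(t+1) = alpha*B*F(t) + (1-alpha)*Y with B = D^(-1/2) W D^(-1/2).

    W is a symmetric adjacency matrix (list of lists); labels[i] is the class
    index of marked node i, or None for an unmarked node.
    """
    n = len(W)
    d = [sum(row) for row in W]  # degrees d_i (assumed positive)
    B = [[W[i][j] / (d[i] ** 0.5 * d[j] ** 0.5) for j in range(n)]
         for i in range(n)]
    # Y_ij = 1 if node i is marked with class j, otherwise 0.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(num_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]  # F(0) = Y
    for _ in range(iterations):
        F = [[alpha * sum(B[i][k] * F[k][c] for k in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(num_classes)]
             for i in range(n)]
    # Classification rule: pick the class with the largest entry in each row.
    return [max(range(num_classes), key=lambda c: F[i][c]) for i in range(n)]


# Toy graph: nodes 0 and 1 are marked (l = 2), nodes 2 and 3 are unmarked (u = 2).
W = [
    [0.0, 0.1, 1.0, 0.1],
    [0.1, 0.0, 0.1, 1.0],
    [1.0, 0.1, 0.0, 0.1],
    [0.1, 1.0, 0.1, 0.0],
]
labels = [0, 1, None, None]
predicted = label_propagation(W, labels, num_classes=2)
```

Node 2 is strongly connected to the node marked with class 0 and node 3 to the node marked with class 1, so the unmarked nodes inherit those classes. Only the predictions for the unmarked rows are used in the method.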

[0057] Step 1400 is executed, and a fine classification module 240 is used to perform a fine classification for the target patents according to a result of the rough classification. As shown in FIG. 1D, step 1410 is executed, and objects to be classified are set. The patent nodes of each class divided in the step 1300 are taken as a group, that is, the patents corresponding to nodes marked as y.sub.i∈𝒴 form a group, so there are |𝒴| groups.

[0058] Step 1420 is executed, and text information of patents is extracted. Abstract, claims and description of each patent in each group are extracted, word segmentation of text information of patent is performed by using an existing tool, and a text set G={g.sub.1, . . . , g.sub.n} is generated, wherein g.sub.i=(p.sub.i1,p.sub.i2,p.sub.i3), p.sub.i1, p.sub.i2, and p.sub.i3 are respectively word sequences obtained by word segmentation of the abstract, the claims, and the description of the i-th patent.

[0059] Step 1430 is executed, and text sets to be trained are generated. The text sets to be trained comprise the text set G, a text set G.sub.1={p.sub.11, . . . , p.sub.n1}, a text set G.sub.2={p.sub.12, . . . , p.sub.n2} and a text set G.sub.3={p.sub.13, . . . , p.sub.n3}, which are respectively composed of word segmentation results of the all-texts, the abstracts, the claims, and the descriptions of the patents in the group.

[0060] Step 1440 is executed, and text vectorization is performed. As shown in FIG. 1E, step 1441 is executed, and the text in the text sets to be trained is vectorized. In each text set to be trained, an element P=(t.sub.1, . . . , t.sub.m) is a segmented word sequence with m elements, t.sub.i∈P is determined by the w words t.sub.i, context={t.sub.i−w, . . . , t.sub.i−2, t.sub.i−1, t.sub.i+1, t.sub.i+2, . . . , t.sub.i+w} before and after it, and by maximizing

[00013] $\frac{1}{m} \sum_{i=w}^{m-w} \log p(t_{i} \mid t_{i,\mathrm{context}}, pid)$

wherein, the pid is a paragraph number of t.sub.i in p,

[00014] $p(t_{i} \mid t_{i,\mathrm{context}}, pid) = \frac{e^{y_{t_{i}}}}{\sum_{j} e^{y_{j}}},$

y.sub.t.sub.i=b+UΦ(t.sub.i, context,pid), U and b are parameters of softmax, Φ is a mapping operation, and a vector corresponding to P is obtained by training on the text sets to be trained using a stochastic gradient descent method.
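The softmax probability above can be illustrated with a minimal forward-pass sketch. The specification only says that Φ is a mapping operation, so taking Φ to be the average of the context word vectors and the paragraph vector is an assumption; the vocabulary, vectors and parameter values are likewise made up. No training step is shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def phi(context_vectors, paragraph_vector):
    """Assumed mapping: average the context word vectors with the paragraph vector."""
    vectors = context_vectors + [paragraph_vector]
    dim = len(paragraph_vector)
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def word_probabilities(context_vectors, paragraph_vector, U, b):
    """Return p(t | context, pid) for every vocabulary word t, with y = b + U*phi."""
    h = phi(context_vectors, paragraph_vector)
    y = [b_t + sum(u * x for u, x in zip(U_t, h)) for U_t, b_t in zip(U, b)]
    return softmax(y)


# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
vocabulary = ["patent", "text", "vector"]
word_vec = {"patent": [1.0, 0.0], "text": [0.0, 1.0], "vector": [1.0, 1.0]}
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one softmax row per vocabulary word
b = [0.0, 0.0, 0.0]
paragraph_vec = [0.5, 0.5]  # vector associated with the paragraph number pid

probs = word_probabilities(
    [word_vec["patent"], word_vec["vector"]], paragraph_vec, U, b
)
# Contribution of the target word "text" to the log-likelihood being maximized.
log_likelihood_term = math.log(probs[vocabulary.index("text")])
```

Stochastic gradient descent would adjust U, b, the word vectors and the paragraph vector to increase the average of such log terms; the trained paragraph vector is what serves as the text vector.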

[0061] Step 1442 is executed, and a matrix of text is generated. The vectorization results of G={g.sub.1, . . . , g.sub.n}, G.sub.1={p.sub.11, . . . , p.sub.n1}, G.sub.2={p.sub.12, . . . , p.sub.n2} and G.sub.3={p.sub.13, . . . , p.sub.n3} are supposed to be respectively H.sub.1={h.sub.11, . . . , h.sub.n1}, H.sub.2={h.sub.12, . . . , h.sub.n2}, H.sub.3={h.sub.13, . . . , h.sub.n3}, and H.sub.4={h.sub.14, . . . , h.sub.n4}, then a generated set of matrix of text of target patents is H={h.sub.1, . . . , h.sub.n}, wherein h.sub.i=(h.sub.i1, h.sub.i2, h.sub.i3, h.sub.i4).

[0062] Step 1450 is executed, and patent classification is performed. Marked patents are set as S=∪.sub.j=1.sup.kS.sub.j⊂H, wherein S.sub.j≠Ø is the set of marked patents of the j-th leaf node on the industry tree, the k cluster centers of a k-means algorithm are initialized using the marked patents, and the cluster membership of the marked patents is not changed in the iterative updating process of clusters.
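The seeded k-means step can be sketched as follows, assuming each patent has already been reduced to a numeric vector. The function name `seeded_kmeans`, the squared Euclidean distance and the toy two-dimensional points are illustrative choices, not details taken from the specification.

```python
def seeded_kmeans(points, seeds, iterations=20):
    """Constrained seeded k-means on a list of numeric tuples.

    seeds maps cluster index j -> list of indices of the marked points S_j.
    Marked points initialise the cluster centres and never change cluster.
    """
    k = len(seeds)
    fixed = {i: j for j, members in seeds.items() for i in members}

    def mean(indices):
        dim = len(points[0])
        return tuple(sum(points[i][d] for i in indices) / len(indices)
                     for d in range(dim))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centres = [mean(seeds[j]) for j in range(k)]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {}
        for i, p in enumerate(points):
            if i in fixed:
                new_assignment[i] = fixed[i]  # membership of marked patents is frozen
            else:
                new_assignment[i] = min(range(k), key=lambda j: sq_dist(p, centres[j]))
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        centres = [mean([i for i, j in assignment.items() if j == c])
                   for c in range(k)]
    return [assignment[i] for i in range(len(points))]


# Toy data: points 0 and 1 are the marked patents of leaf nodes 0 and 1.
points = [(0, 0), (10, 10), (1, 0), (0, 1), (9, 10), (10, 9)]
clusters = seeded_kmeans(points, seeds={0: [0], 1: [1]})
```

Because every S.sub.j is non-empty, no cluster can become empty during the updates, which is why the centre recomputation needs no special case.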

[0063] Step 1460 is executed, and in any leaf node classed by the step 1450, the patent that does not belong to any industry of the leaf nodes on the tree is identified. As shown in FIG. 1F, step 1461 is executed, and a k distance of a patent p is calculated. A k-th distance of the patent p is set as k−distance(o), and in patents divided into a leaf node on the industry tree, there is a patent o, and the distance between the patent o and the patent p is d(p, o).

[0064] Step 1462 is executed, and a k-th distance domain of the patent p is calculated: a patent set whose distance from the patent p is ≤k−distance(o) is called the k-th distance domain N.sub.k(p) of the patent p.

[0065] Step 1463 is executed, and a reachable distance reachdist(p, o)=max{k−distance(o), ∥p−o∥} of the patent p relative to the patent o is calculated. If the following two conditions are met, k−distance(o)=d(p, o):

{circle around (1)} in the leaf node, there are at least k patents q to make d(p, q)≤d(p, o);
{circle around (2)} in the leaf node, there are at most k−1 patents q to make d(p, q)<d(p, o).

[0066] Step 1464 is executed, and a local reachable density

[00015] $lrd_{k}(p) = \frac{|N_{k}(p)|}{\sum_{o \in N_{k}(p)} \mathrm{reachdist}_{k}(p, o)}$

is calculated.

[0067] Step 1465 is executed, and a local outlier factor

[00016] $LOF_{k}(p) = \frac{\sum_{o \in N_{k}(p)} \frac{lrd_{k}(o)}{lrd_{k}(p)}}{|N_{k}(p)|}$

is calculated.

[0068] Step 1466 is executed, and if LOF(p) is greater than a threshold, p is regarded as an outlier that does not belong to the leaf node.
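Steps 1461 to 1466 can be sketched as follows for the patents assigned to one leaf node. The function name `lof_scores`, the Euclidean distance, the toy points and the threshold value are assumptions; the sketch also assumes the points are distinct, so that no reachable-distance sum is zero.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point among the patents of one leaf node."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_distance = []  # distance from each point to its k-th nearest neighbour
    neighbours = []  # k-th distance domain N_k(p), ties included
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[k - 1]
        k_distance.append(kd)
        neighbours.append([j for j in range(n) if j != i and dist[i][j] <= kd])

    def reach_dist(i, o):
        # Reachable distance of point i relative to point o.
        return max(k_distance[o], dist[i][o])

    # Local reachable density: |N_k(p)| divided by the sum of reachable distances.
    lrd = [len(neighbours[i]) / sum(reach_dist(i, o) for o in neighbours[i])
           for i in range(n)]
    # Local outlier factor: average ratio of the neighbours' density to p's density.
    return [sum(lrd[o] / lrd[i] for o in neighbours[i]) / len(neighbours[i])
            for i in range(n)]


# Toy leaf node: four tightly grouped patent vectors and one distant vector.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
threshold = 1.5  # hypothetical threshold
outliers = [i for i, s in enumerate(scores) if s > threshold]
```

Scores close to 1 indicate a patent whose local density matches that of its neighbours, while the distant point receives a much larger score and is flagged as not belonging to the leaf node.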

Embodiment 2

[0069] An automatic industry classification method comprises the following steps.

[0070] 1. Defining a target industry tree. An industry tree I={i.sub.1, . . . , i.sub.j, . . . , i.sub.n} is defined as needed, wherein, i.sub.j∈I and is a first level industry, and i.sub.j may be divided into second level industries, i.sub.j={i.sub.j1, . . . , i.sub.jm}, and so on, any non-leaf node of I is i.sub.jkl . . . ={i.sub.jkl . . . 1, . . . , i.sub.jkl . . . t}. According to a general practice of industry division, degree of other nodes except the leaf nodes is greater than or equal to 2. The number of leaf nodes under I is set as N.

[0071] 2. Determining a scope of target patents. The scope of patents to be classified is manually determined as needed, such as applications in a certain country or applications in certain years.

[0072] 3. Generating marks. The number p of patents which can be marked is determined according to resource constraints, p≥N, each leaf node of the industry tree should be marked with at least one patent belonging to the node.

[0073] 4. Performing a rough classification for the target patents, that is determining nodes above the leaf node.

[0074] (1) Generating a node set V of a graph: IPC(s) of each target patent is defined as an IPC combination IPC.sub.v={ipc.sub.1, . . . , ipc.sub.q}, and all different IPC combinations of the target patents form the node set V.

[0075] (2) Arranging marks: the industry on the leaf node marked with patents is taken as a classification y.sub.i∈𝒴 of the leaf node, the number of nodes which have been marked is set to be l, and the sequence of the nodes is adjusted so that the marked nodes come first, then 1≤i≤l; whether l<<u is verified, wherein u is the number of unmarked nodes, and if not, the marked patents are adjusted until the condition holds; then V={IPC.sub.1, . . . , IPC.sub.l, IPC.sub.l+1, . . . , IPC.sub.l+u}.

[0076] (3) Generating an edge set E of the graph: the E may be expressed as a matrix, a union of IPCs of two vertices is IPC.sub.i∪IPC.sub.j, then weight of edges between the two vertices e.sub.ij is equal to the number of patents with IPC in IPC.sub.i∪IPC.sub.j.

[0077] (4) Generating an adjacency matrix:

(4.1) a distance matrix S is generated using, for example, the Euclidean distance, s.sub.ij=∥e.sub.i−e.sub.j∥.sub.2;
(4.2) the adjacency matrix W is generated from the distance matrix S, for example by a fully connected method with a Gaussian kernel function.
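Sub-steps (4.1) and (4.2) can be sketched as follows. The kernel form exp(−s²/(2σ²)), the bandwidth σ, the zero diagonal and the example matrix E are assumptions; the text only names the Euclidean distance and a Gaussian kernel as examples.

```python
import math


def distance_matrix(E):
    """s_ij = Euclidean distance between row i and row j of the edge matrix E."""
    return [[math.dist(ei, ej) for ej in E] for ei in E]


def gaussian_adjacency(S, sigma=1.0):
    """Fully connected graph: w_ij = exp(-s_ij^2 / (2*sigma^2)).

    Assumption: the diagonal is set to zero so that a node is not its own
    neighbour.
    """
    n = len(S)
    return [[0.0 if i == j else math.exp(-S[i][j] ** 2 / (2 * sigma ** 2))
             for j in range(n)]
            for i in range(n)]


E = [[3, 4, 4], [4, 4, 4], [4, 4, 2]]  # hypothetical edge-weight matrix
S = distance_matrix(E)
W = gaussian_adjacency(S, sigma=2.0)
```

A larger σ makes distant nodes more similar and the graph more uniform, while a smaller σ concentrates the weight on near neighbours.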

[0078] (5) Performing node division:

(5.1) a degree matrix D=diag(d.sub.1, d.sub.2, . . . , d.sub.l+u) is generated, whose diagonal elements are d.sub.i=Σ.sub.j=1.sup.l+uW.sub.ij;
(5.2) a marked matrix is generated: a nonnegative (l+u)×|𝒴| marked matrix F=(F.sub.1.sup.T, F.sub.2.sup.T, . . . , F.sub.l+u.sup.T).sup.T, wherein the i-th row F.sub.i=(F.sub.i1, F.sub.i2, . . . , F.sub.i|𝒴|) is a marked vector of IPC.sub.i in the node set, and a classification rule is y.sub.i=argmax.sub.1≤j≤|𝒴|F.sub.ij;
(5.3) the marked matrix F is initialized, for i=1, 2, . . . , m (where m=l+u is the number of rows of F) and j=1, 2, . . . , |𝒴|,

[00017] $F(0) = Y_{ij} = \begin{cases} 1, & \text{if } (1 \leq i \leq l) \land (y_{i} = j) \\ 0, & \text{otherwise} \end{cases};$

(5.4) a propagation matrix

[00018] $B = D^{- \frac{1}{2}} W D^{- \frac{1}{2}}$

is constructed, wherein,

[00019] $D^{-\frac{1}{2}} = \mathrm{diag}\left(\frac{1}{\sqrt{d_{1}}}, \frac{1}{\sqrt{d_{2}}}, \ldots, \frac{1}{\sqrt{d_{l+u}}}\right);$

(5.5) an iterative calculation formula F(t+1)=αBF(t)+(1−α)Y is generated, wherein, α∈(0,1) is a parameter;
(5.6) the calculation formula is iterated to convergence to obtain a state

[00020] $F^{*} = \lim_{t \to \infty} F(t) = (1 - \alpha)(M - \alpha B)^{-1} Y;$

(5.7) a classification prediction of unmarked nodes y.sub.i=argmax.sub.1≤j≤|𝒴|F.sub.ij* is performed, wherein, l+1≤i≤l+u.
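Why the iteration in (5.5) reaches the closed form in (5.6) can be sketched as follows, assuming F(0)=Y, that W is nonnegative and that every degree d.sub.i is positive, so that the eigenvalues of B lie in [−1, 1]. Unrolling the recursion gives

$F(t) = (\alpha B)^{t} Y + (1 - \alpha) \sum_{i=0}^{t-1} (\alpha B)^{i} Y.$

Since 0<α<1, the first term vanishes as t grows and the geometric series converges,

$\lim_{t \to \infty} (\alpha B)^{t} = 0, \qquad \lim_{t \to \infty} \sum_{i=0}^{t-1} (\alpha B)^{i} = (M - \alpha B)^{-1},$

which yields $F^{*} = (1 - \alpha)(M - \alpha B)^{-1} Y$ with M the unit matrix.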

[0079] 5. Performing a fine classification for target patents, that is determining the leaf node.

[0080] (1) Setting objects to be classified: the patents corresponding to the nodes of each class divided in the step 4 are taken as a group, that is, the patents corresponding to nodes marked as y.sub.i∈𝒴 form a group, so there are |𝒴| groups.

[0081] (2) Extracting text information of patents: the abstract, claims and description (hereinafter referred to as “all-text”) of each patent in each group are extracted, word segmentation of the text information of each patent is performed by using an existing tool, and a text set G={g.sub.1, . . . , g.sub.n} is generated, wherein g.sub.i=(p.sub.i1,p.sub.i2,p.sub.i3), p.sub.i1, p.sub.i2, and p.sub.i3 are respectively word sequences obtained by word segmentation of the abstract, the claims, and the description of the i-th patent.

[0082] (3) Generating 4 text sets to be trained: G, G.sub.1={p.sub.11, . . . , p.sub.n1}, G.sub.2={p.sub.12, . . . , p.sub.n2} and G.sub.3={p.sub.13, . . . , p.sub.n3}, which are respectively composed of word segmentation results of the all-texts, the abstracts, the claims, and the descriptions of the patents in the group.

[0083] (4) Text vectorization is performed.

[0084] (4.1) The text in the four text sets to be trained is vectorized. In each text set to be trained, an element P=(t.sub.1, . . . , t.sub.m) is a segmented word sequence with m elements, t.sub.i∈P is determined by the w words t.sub.i, context={t.sub.i−w, . . . , t.sub.i−2, t.sub.i−1, t.sub.i+1, t.sub.i+2, . . . , t.sub.i+w} before and after it, and by maximizing

[00021] $\frac{1}{m} \sum_{i=w}^{m-w} \log p(t_{i} \mid t_{i,\mathrm{context}}, pid)$

wherein, the pid is a paragraph number of t.sub.i in p,

[00022] $p(t_{i} \mid t_{i,\mathrm{context}}, pid) = \frac{e^{y_{t_{i}}}}{\sum_{j} e^{y_{j}}},$

y.sub.t.sub.i=b+UΦ(t.sub.i, context,pid), U and b are parameters of softmax, and a vector corresponding to P is obtained by training on the text sets to be trained using a stochastic gradient descent method.

[0085] (4.2) A matrix of text is generated. The vectorization results of G={g.sub.1, . . . , g.sub.n}, G.sub.1={p.sub.11, . . . , p.sub.n1}, G.sub.2={p.sub.12, . . . , p.sub.n2} and G.sub.3={p.sub.13, . . . , p.sub.n3} are supposed to be respectively H.sub.1={h.sub.11, . . . , h.sub.n1}, H.sub.2={h.sub.12, . . . , h.sub.n2}, H.sub.3={h.sub.13, . . . , h.sub.n3}, and H.sub.4={h.sub.14, . . . , h.sub.n4}, then a generated set of matrix of text of target patents is H={h.sub.1, . . . , h.sub.n}, wherein h.sub.i=(h.sub.i1, h.sub.i2, h.sub.i3, h.sub.i4).

[0086] (5) Patent classification is performed. Marked patents are set as S=∪.sub.j=1.sup.kS.sub.j⊂H, wherein S.sub.j≠Ø is the set of marked patents of the j-th leaf node on the industry tree, the k cluster centers of a k-means algorithm are initialized using the marked patents, and the cluster membership of the marked patents is not changed in an iterative updating process of clusters.

[0087] (6) In any leaf node classed in (5), the patent that does not belong to any industry of the leaf nodes on the tree is identified:

(6.1) a k distance of a patent p:
for a positive integer k, the k-th distance of the patent p is set as k−distance(o), and in patents divided into a leaf node on the industry tree, there is a patent o, and the distance between the patent o and the patent p is d(p, o). If the following two conditions are met, k−distance(o)=d(p, o):
{circle around (1)} in the leaf node, there are at least k patents q to make d(p, q)≤d(p, o);
{circle around (2)} in the leaf node, there are at most k−1 patents q to make d(p, q)<d(p, o);
(6.2) a k-th distance domain of the patent p:
a patent set whose distance from the patent p is ≤k−distance(o) is called the k-th distance domain of the patent p, and is recorded as N.sub.k(p);
(6.3) a reachable distance of the patent p relative to the patent o: reachdist(p, o)=max{k−distance(o), ∥p−o∥};
(6.4) a local reachable density:

[00023] $lrd_{k}(p) = \frac{|N_{k}(p)|}{\sum_{o \in N_{k}(p)} \mathrm{reachdist}_{k}(p, o)};$

(6.5) a local outlier factor:

[00024] $LOF_{k}(p) = \frac{\sum_{o \in N_{k}(p)} \frac{lrd_{k}(o)}{lrd_{k}(p)}}{|N_{k}(p)|};$

(6.6) if LOF(p) is greater than a threshold, p is regarded as an outlier that does not belong to the leaf node.

[0088] In order to better understand the present invention, the detailed description is made above in conjunction with the specific embodiments of the present invention, but it is not a limitation of the present invention. Any simple modification to the above embodiments based on the technical essence of the present invention still belongs to the scope of the technical solution of the present invention. Each embodiment in this specification focuses on differences from other embodiments, and the same or similar parts between the various embodiments can be referred to each other. As for the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and the relevant part can refer to the part of the description of the method embodiment.

CPC classification (section G, PHYSICS)

G06F40/289; G06F18/24323; G06F18/23213; G06F16/353; G06V10/7625; G06Q10/0637; G06F2216/11; G06Q50/184; G06F40/30

International classification (section G, PHYSICS)

G06F16/35; G06K9/62
