Analytical data analysis method and analytical data analyzer
11423331 · 2022-08-23
Cpc classification
H01J49/0036
ELECTRICITY
G06N7/01
PHYSICS
A61B5/0033
HUMAN NECESSITIES
G06V20/69
PHYSICS
G06F2218/10
PHYSICS
International classification
A61B5/00
HUMAN NECESSITIES
Abstract
This analytical data analysis method uses machine learning of analysis result data (31) measured by an analyzer (1), and includes generating simulated data (32) in which a data variation has been added to the analysis result data (31) within a range that does not affect identification, performing the machine learning using the generated simulated data (32), and performing discrimination using a discrimination criterion (23b) obtained through the machine learning.
Claims
1. An analytical data analysis method using machine learning of a plurality of analysis result data measured by an analyzer, the analytical data analysis method comprising: generating a plurality of simulated data by adding a data variation to the plurality of analysis result data within a range in which a result of a discrimination of the plurality of analysis result data is not reversed when the data variation is added; performing the machine learning using the plurality of analysis result data and the plurality of generated simulated data; and performing discrimination using a discrimination criterion, that is a parameter used for the discrimination obtained through the machine learning, to analyze the plurality of analysis result data.
2. The analytical data analysis method according to claim 1, wherein the range in which the result of the discrimination is not reversed when the data variation is added is a range corresponding to a specific variation factor associated with measurement by the analyzer.
3. The analytical data analysis method according to claim 2, wherein each of the plurality of analysis result data is a spectrum obtained by the analyzer; and the specific variation factor is a variation factor caused by the analyzer or a sample and generated when the spectrum is obtained by the analyzer.
4. The analytical data analysis method according to claim 3, wherein the generating of the plurality of simulated data includes generating the plurality of simulated data by varying a value of an intensity of the spectrum according to a ratio of change of the intensity of the spectrum caused by the sample.
5. The analytical data analysis method according to claim 3, wherein the ratio of change of the intensity of the spectrum caused by the sample increases or decreases at a substantially constant rate as a mass of the sample or a wavelength absorbed by the sample increases, and the plurality of simulated data are generated by multiplying the value of the intensity of the spectrum by the ratio of change of the intensity of the spectrum caused by the sample.
6. The analytical data analysis method according to claim 3, wherein the generating of the plurality of simulated data includes generating the plurality of simulated data by giving, to a baseline of the spectrum, a variation corresponding to a variation in the baseline generated at a time of measuring the plurality of analysis result data.
7. The analytical data analysis method according to claim 3, wherein the generating of the plurality of simulated data includes generating the plurality of simulated data by adding a difference in individual difference data of each of a plurality of analyzers.
8. The analytical data analysis method according to claim 3, wherein the generating of the plurality of simulated data includes generating the plurality of simulated data by adding a random number to the plurality of analysis result data within the range that does not affect identification.
9. The analytical data analysis method according to claim 3, wherein the generating of the plurality of simulated data includes generating the plurality of simulated data by adding a peak of an impurity to the spectrum according to the impurity detected at a time of the measurement by the analyzer.
10. The analytical data analysis method according to claim 3, wherein the machine learning is performed, using the plurality of simulated data, on the plurality of analysis result data measured by a mass spectrometer that generates a mass spectrum as the analyzer.
11. The analytical data analysis method according to claim 10, wherein the plurality of analysis result data include the mass spectrum of a biological sample collected from a subject, and the performing of the discrimination includes performing cancer discrimination on the plurality of analysis result data of the sample using the discrimination criterion.
12. The analytical data analysis method according to claim 2, wherein the plurality of simulated data are generated by adding the data variation within a range of variation in the plurality of analysis result data caused by the specific variation factor.
13. The analytical data analysis method according to claim 12, comprising: acquiring the variation in the plurality of analysis result data caused by the specific variation factor; and generating the plurality of simulated data by adding the acquired variation in the plurality of analysis result data caused by the specific variation factor.
14. An analytical data analyzer comprising: a data input that acquires analysis result data obtained by another analyzer; a storage that stores a discrimination criterion, that is a parameter used for the discrimination generated through machine learning using simulated data generated by adding a data variation to the analysis result data within a range in which a result of a discrimination of the analysis result data is not reversed when the data variation is added and the analysis result data, and a discrimination algorithm for the machine learning; and a processor that discriminates the analysis result data acquired by the data input according to the discrimination algorithm using the discrimination criterion, to analyze the analysis result data.
Description
MODES FOR CARRYING OUT THE INVENTION
(12) Embodiments embodying the present invention are hereinafter described on the basis of the drawings.
(13) [First Embodiment]
(14) The structure of an analytical data analyzer 100 according to a first embodiment is now described with reference to
(15) As shown in
(16) The analytical data analyzer 100 performs machine learning using a generated mass spectrum 32 (see
(17) The analyzer 1 is a device that performs scientific analysis of the measurement sample 3. The analyzer 1 generates, for example, a spectrum as analysis result data. Any analyzer that generates a spectrum may be used; the analyzer 1 is, for example, a mass spectrometer that generates a mass spectrum. In the first embodiment, machine learning is performed, using a plurality of simulated data, on a plurality of analysis result data measured by the analyzer 1, which is a mass spectrometer.
(18) The analyzer 1 may be of any type, but is a matrix-assisted laser desorption ionization-quadrupole ion-trap time-of-flight mass spectrometer (MALDI-QIT-TOFMS), for example.
(19) The analyzer 1 includes an ionizer 10, an ion trap 11, and a time-of-flight mass analyzer 12.
(20) The analyzer 1 ionizes the sample 3 in the ionizer 10 by a MALDI method, temporarily captures generated ions by the ion trap 11, and selects ions according to the mass-to-charge ratio (m/z). The ions emitted from the ion trap 11 are folded back by an electric field generated by reflectron electrodes 13 provided in the time-of-flight mass analyzer 12 and are detected by an ion detector 14.
(21) The data processor 2 includes an analysis controller 21, a spectrum generator 22, a storage 23, and an arithmetic unit 24. The storage 23 stores a discrimination algorithm 23a used for discrimination and the discrimination criterion 23b generated by machine learning. The discrimination criterion 23b is a parameter, generated by machine learning, that is used for discrimination. An SVM (support vector machine), for example, is used as the machine learning method. The discrimination is, for example, discrimination of cancer.
(22) The analysis controller 21 controls the ionizer 10, the ion trap 11, and the time-of-flight mass analyzer 12. In addition, the spectrum generator 22 generates a mass spectrum based on a value detected by the ion detector 14 and transmits data of the generated mass spectrum to the arithmetic unit 24. The arithmetic unit 24 discriminates the input mass spectrum using the discrimination algorithm 23a and the discrimination criterion 23b stored in the storage 23.
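The discrimination performed by the arithmetic unit 24 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the discrimination criterion 23b takes the form of a weight vector and a bias for a linear decision function, as a linear SVM would produce; the text does not specify the form of the parameters, and the names `DiscriminationCriterion` and `discriminate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscriminationCriterion:
    """Hypothetical container for learned parameters (criterion 23b).

    Assumption: a linear decision function; the source only says the
    criterion is "a parameter used for discrimination".
    """
    weights: List[float]
    bias: float


def discriminate(spectrum: List[float], criterion: DiscriminationCriterion) -> bool:
    """Return True when the spectrum falls on the positive side of the boundary."""
    if len(spectrum) != len(criterion.weights):
        raise ValueError("spectrum and weights must have the same length")
    # Linear decision score: dot product of weights and intensities plus bias.
    score = sum(w * x for w, x in zip(criterion.weights, spectrum)) + criterion.bias
    return score >= 0.0
```

Keeping the criterion separate from the function mirrors the separation between the stored discrimination criterion 23b and the discrimination algorithm 23a.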
(23) An input 5 is, for example, a keyboard, a mouse, or a touch panel connected to the data processor 2, and an operation such as starting spectrum analysis is performed via the input 5. A display 4 is, for example, a monitor such as a liquid crystal display connected to the data processor 2, and displays the discrimination results and the like.
(24) In the first embodiment, the measurement sample 3 is a biological sample. For example, the measurement sample 3 is urine or blood collected from a subject. Furthermore, in the first embodiment, the analysis result data includes the mass spectrum of the biological sample 3 collected from the subject, and in a discrimination step, discrimination of cancer on the plurality of analysis result data of the sample 3 is performed using the discrimination criterion 23b.
(25) A flow at the time of learning and a flow at the time of discrimination according to the first embodiment are now described with reference to
(26) First, the flow of learning is described with reference to
(27) Next, the flow at the time of discrimination is described with reference to
(28) Steps of generating the simulated data according to the first embodiment of the present invention are now described with reference to
(31) In the first embodiment, a step of acquiring the variation in the plurality of analysis result data caused by the specific variation factor and a step of generating the plurality of simulated data by adding the variation in the plurality of analysis result data caused by the acquired specific variation factor are included. Specifically, the mass spectrum 32, which is the simulated data, is generated by acquiring the ratio of change of the intensity of the mass spectrum 31, which is the analysis result data of the sample 3, and multiplying the mass spectrum 31 by the acquired ratio of intensity change. As shown in the graph of
(32) In the first embodiment, the plurality of simulated data in which the variation has been added within the range of the ratio on the straight line 30a shown in the graph 30 of
(34) (Effects of First Embodiment)
(35) According to the first embodiment, the following effects are achieved.
(36) According to the first embodiment, as described above, in the analytical data analyzer 100, the spectrum generator 22 of the data processor 2 generates the mass spectrum 31 based on the ion intensity of the sample 3 detected by the ion detector 14 of the analyzer 1. The mass spectrum 31 generated by the spectrum generator 22 is transmitted to the arithmetic unit 24. The arithmetic unit 24 discriminates the input mass spectrum 31 using the discrimination algorithm 23a and the discrimination criterion 23b stored in the storage 23. Furthermore, according to the first embodiment, a step of generating the mass spectrum 32 by multiplying the mass spectrum 31 by the ratio of intensity change for each mass-to-charge ratio of the sample 3 is included. Accordingly, the simulated data (mass spectrum 32) in which the variation has been added within the range that does not affect identification can be generated. Consequently, the amount of data used for machine learning can be increased, and thus the accuracy of machine learning can be improved.
(37) According to the first embodiment, as described above, the range that does not affect identification is the range corresponding to the specific variation factor associated with the measurement by the analyzer 1. Accordingly, data corresponding to the variation factor associated with the measurement by the analyzer 1 can be learned, and thus a decrease in the accuracy of machine learning caused by the variation factor associated with the measurement by the analyzer 1 can be significantly reduced or prevented.
(38) According to the first embodiment, as described above, the analysis result data is the mass spectrum 31 obtained by the analyzer 1, and the specific variation factor is the variation factor caused by the sample 3 and generated when the mass spectrum 31 is obtained by the analyzer 1.
(39) Accordingly, the mass spectrum 32 corresponding to the variation factor caused by the sample 3 at the time of obtaining the mass spectrum 31 can be learned, and thus a decrease in the accuracy of machine learning caused by the variation factor caused by the sample 3 can be significantly reduced or prevented.
(40) According to the first embodiment, as described above, the mass spectrum 32 is generated by adding the data variation within the range of variation in the mass spectrum 31 caused by the ratio of intensity change of the sample 3. Accordingly, learning can be performed using the mass spectrum 32 generated by adding the variation associated with the measurement by the analyzer 1. Consequently, a decrease in the accuracy of machine learning caused by a plurality of variation factors associated with the measurement by the analyzer 1 can be significantly reduced or prevented.
(41) According to the first embodiment, as described above, the step of acquiring the variation in the mass spectrum 31 caused by the ratio of intensity change of the sample 3 and the step of generating the mass spectrum 32 by adding the acquired variation in the mass spectrum 31 caused by the ratio of intensity change of the sample 3 are included. Accordingly, learning can be performed using the mass spectrum 32 corresponding to the ratio of intensity change of the sample 3 associated with the measurement, and learning using a data variation not associated with the measurement can be significantly reduced or prevented. Consequently, over-fitting can be significantly reduced or prevented, and thus a decrease in the accuracy of machine learning can be significantly reduced or prevented.
(42) According to the first embodiment, as described above, the mass spectrum 32 is generated by varying the value of the intensity of the mass spectrum 31 according to the ratio of change of the intensity of the mass spectrum 31 caused by the sample 3 in the step of generating the simulated data. Accordingly, learning can be performed using the mass spectrum 32 corresponding to the ratio of change in the intensity of the mass spectrum 31 that differs for each sample 3. Consequently, a decrease in the accuracy of machine learning caused by the ratio of change in the intensity of the mass spectrum 31 that differs for each sample 3 can be significantly reduced or prevented.
(43) According to the first embodiment, as described above, the ratio of change of the intensity of the mass spectrum 31 caused by the sample 3 increases at the substantially constant rate as the mass of the sample 3 increases, and the mass spectrum 32 is generated by multiplying the value of the intensity of the mass spectrum 31 by the ratio of intensity change. Accordingly, learning can be performed using the mass spectrum 32 in which the ratio of change of the intensity of the mass spectrum 31 according to the mass of the sample 3 is reflected. Consequently, a decrease in the accuracy of machine learning caused by the ratio of change of the intensity of the mass spectrum 31 according to the value of the mass of the sample 3 can be significantly reduced or prevented.
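The multiplication described for the first embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the ratio of intensity change has the form 1 + slope * (m/z - reference m/z), and that simulated data are drawn with slopes between zero and an upper bound. The names `scale_by_linear_ratio`, `generate_simulated_spectra`, `slope`, and `max_slope` are hypothetical.

```python
import random


def scale_by_linear_ratio(mz, intensity, slope, ref_mz=None):
    """Multiply each intensity by a ratio that grows linearly with m/z.

    Assumption: ratio = 1 + slope * (m/z - ref_mz); the source only states
    that the ratio changes at a substantially constant rate.
    """
    if ref_mz is None:
        ref_mz = mz[0]
    return [y * (1.0 + slope * (x - ref_mz)) for x, y in zip(mz, intensity)]


def generate_simulated_spectra(mz, intensity, max_slope, n, seed=0):
    """Generate n simulated spectra with slopes drawn from [0, max_slope]."""
    rng = random.Random(seed)
    return [
        scale_by_linear_ratio(mz, intensity, rng.uniform(0.0, max_slope))
        for _ in range(n)
    ]
```

Here `max_slope` stands in for the upper limit of the range that does not affect identification; in practice that limit would come from the acquired variation in measured data.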
(44) According to the first embodiment, as described above, machine learning is performed, using the mass spectrum 32, on the mass spectrum 31 measured by the analyzer 1 that generates the mass spectrum as an analyzer. Accordingly, the mass spectrum 32 in which the variation associated with the measurement by the analyzer 1 has been added to the obtained mass spectrum 31 can be generated and used for machine learning. Consequently, a decrease in the accuracy of machine learning due to the specific factor associated with the measurement by the analyzer 1 can be significantly reduced or prevented.
(45) According to the first embodiment, as described above, the analysis result data includes the mass spectrum 31 of the biological sample 3 collected from the subject, and in the discrimination step, cancer discrimination is performed on the mass spectrum 31 of the sample 3 using the discrimination criterion 23b. Accordingly, cancer discrimination can be performed by discriminating the data of the mass spectrum 31 through machine learning. The biological sample 3 is, for example, blood or urine collected from the subject.
(46) [Second Embodiment]
(47) The structure of an analytical data analyzer 200 according to a second embodiment is now described with reference to
(49) In the analytical data analyzer 200 according to the second embodiment, in a step of generating the simulated data, the simulated data (mass spectrum 41) corresponding to the variation in the baseline of the mass spectrum 40 of the sample 3 is generated, unlike the first embodiment, in which the mass spectrum 32 corresponding to the ratio of change of the intensity of the mass spectrum 31 of the sample 3 is generated. In the second embodiment, the same structures as those of the aforementioned first embodiment are denoted by the same reference numerals, and description thereof is omitted.
(50) As shown in
(51) The remaining structures of the analytical data analyzer 200 according to the second embodiment are similar to those of the analytical data analyzer 100 according to the first embodiment.
(52) (Effects of Second Embodiment)
(53) According to the second embodiment, the following effects are achieved.
(54) According to the second embodiment, as described above, the mass spectrum 41 is generated by giving, to the baseline of the mass spectrum 40 of the sample 3, the variation corresponding to the variation in the baseline generated at the time of measuring the mass spectrum 40. Accordingly, learning can be performed using the mass spectrum 41 corresponding to a difference in measurement environment. Consequently, a decrease in the accuracy of machine learning due to the difference in measurement environment can be significantly reduced or prevented.
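A baseline variation of this kind can be sketched as below. The model is an assumption made for illustration: a constant offset plus an optional linear drift across the data points, with the offset bounded by a hypothetical `max_offset` standing in for the baseline variation observed at the time of measurement. The function names are likewise hypothetical.

```python
import random


def add_baseline_variation(intensity, offset, drift_per_point=0.0):
    """Shift the baseline by a constant offset plus an optional linear drift.

    Assumption: offset-plus-drift model; the source does not specify the
    shape of the baseline variation.
    """
    return [y + offset + drift_per_point * i for i, y in enumerate(intensity)]


def generate_baseline_variants(intensity, max_offset, n, seed=0):
    """Generate n spectra with offsets drawn from [-max_offset, max_offset]."""
    rng = random.Random(seed)
    return [
        add_baseline_variation(intensity, rng.uniform(-max_offset, max_offset))
        for _ in range(n)
    ]
```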
(55) The remaining effects of the analytical data analyzer 200 according to the second embodiment are similar to those of the analytical data analyzer 100 according to the first embodiment.
(56) [Third Embodiment]
(57) The structure of an analytical data analyzer 300 according to a third embodiment is now described with reference to
(59) As shown in the graph of
(60) In the analytical data analyzer 300 according to the third embodiment, in a step of generating the simulated data, the mass spectrum 52 is generated by adding the difference in the individual difference data of the analyzer 1 unlike the first embodiment in which the mass spectrum 32 corresponding to the ratio of change of the intensity of the mass spectrum 31 of the sample 3 is generated. In the third embodiment, a plurality of simulated data are generated by adding a variation to the analysis result data (mass spectrum 31) within a range equal to or less than the detected intensity ratio in
(61) In the analytical data analyzer 300 according to the third embodiment, in a step of generating the simulated data, the mass spectrum 52 is generated by adding the difference (graph 50) in the individual difference data between the analyzers 1 to the mass spectrum 51 of the sample 3. Then, learning is performed using the generated mass spectrum 52, a discrimination criterion 23b is generated, and discrimination is performed using the generated discrimination criterion 23b.
(62) The remaining structures of the analytical data analyzer 300 according to the third embodiment are similar to those of the analytical data analyzer 100 according to the first embodiment.
(63) (Effects of Third Embodiment)
(64) According to the third embodiment, the following effects are achieved.
(65) According to the third embodiment, as described above, the mass spectrum 52 is generated by adding the difference in the individual difference data of the analyzer 1 to the mass spectrum 51 of the sample 3, learning is performed using the generated mass spectrum 52, and discrimination is performed using the obtained discrimination criterion 23b. Accordingly, learning can be performed using the mass spectrum 52 corresponding to an error of the detection sensitivity of the spectrum between the analyzers 1. Consequently, a decrease in the accuracy of machine learning due to the error of the detection sensitivity between the analyzers 1 can be significantly reduced or prevented.
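One way to picture this step is sketched below. It assumes, for illustration only, that the individual difference data are spectra of the same reference sample measured on two analyzers, that their point-by-point difference is added to the sample spectrum, and that a fraction between 0 and 1 keeps the added variation at or below the observed difference. The names `instrument_difference`, `add_instrument_difference`, and `fraction` are hypothetical.

```python
def instrument_difference(reference_a, reference_b):
    """Point-by-point difference between two analyzers' reference spectra.

    Assumption: both spectra share the same m/z axis.
    """
    return [b - a for a, b in zip(reference_a, reference_b)]


def add_instrument_difference(intensity, difference, fraction=1.0):
    """Add a fraction of the inter-analyzer difference to a sample spectrum."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Clamp at zero so that simulated intensities stay non-negative.
    return [max(0.0, y + fraction * d) for y, d in zip(intensity, difference)]
```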
(66) The remaining effects of the analytical data analyzer 300 according to the third embodiment are similar to those of the analytical data analyzer 100 according to the first embodiment.
(67) [Fourth Embodiment]
(68) The structure of an analytical data analyzer 400 according to a fourth embodiment is now described with reference to
(70) In the analytical data analyzer 400 according to the fourth embodiment, in a step of generating the simulated data, the mass spectrum 61 is generated by adding the random number to the mass spectrum 60 of the sample 3 within the range that does not affect identification, unlike the first embodiment, in which the mass spectrum 32 corresponding to the ratio of change of the intensity of the mass spectrum 31 of the sample 3 is generated. In the fourth embodiment, the same structures as those of the aforementioned first embodiment are denoted by the same reference numerals, and description thereof is omitted.
(71) The analytical data analyzer 400 according to the fourth embodiment generates the mass spectrum 61 by adding the random number to the mass spectrum 60 within the range that does not affect identification in the step of generating the simulated data. Then, discrimination is performed using a discrimination criterion 23b generated as a result of using the generated mass spectrum 61 for learning.
(72) The remaining structures of the analytical data analyzer 400 according to the fourth embodiment are similar to those of the analytical data analyzer 100 according to the first embodiment.
(73) (Effects of Fourth Embodiment)
(74) According to the fourth embodiment, the following effects are achieved.
(75) According to the fourth embodiment, as described above, in the step of generating the simulated data, the mass spectrum 61 is generated by adding the random number to the mass spectrum 60 within the range that does not affect identification. Accordingly, learning can be performed using the mass spectrum 61 corresponding to the random noise. Consequently, when noise is mixed at the time of measurement, a decrease in the accuracy of machine learning can be significantly reduced or prevented.
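The addition of a random number can be sketched as follows. The uniform distribution and the bound `max_amplitude`, which stands in for the range that does not affect identification, are assumptions for illustration; the text does not specify the distribution, and the function name is hypothetical.

```python
import random


def add_random_noise(intensity, max_amplitude, seed=None):
    """Add a uniform random number in [-max_amplitude, max_amplitude] to each point.

    Assumption: uniform noise, clamped at zero to keep intensities non-negative.
    """
    rng = random.Random(seed)
    return [
        max(0.0, y + rng.uniform(-max_amplitude, max_amplitude))
        for y in intensity
    ]
```

Passing a fixed `seed` makes the simulated data reproducible, which helps when comparing learning runs.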
(76) The remaining effects of the analytical data analyzer 400 according to the fourth embodiment are similar to those of the analytical data analyzer 100 according to the first embodiment.
(77) [Fifth Embodiment]
(78) The structure of an analytical data analyzer 500 according to a fifth embodiment is now described with reference to
(80) In the analytical data analyzer 500 according to the fifth embodiment, in a step of generating the simulated data, the simulated data is generated by adding the peak of the impurity to the mass spectrum 70 according to the impurity detected at the time of measurement by the analyzer 1, unlike the first embodiment, in which the mass spectrum 32 corresponding to the ratio of change of the intensity of the mass spectrum 31 of the sample 3 is generated. The peak of the impurity, which is not found in a region 70a of the mass spectrum 70, can be confirmed in a region 71a of the mass spectrum 71. One conceivable impurity is, for example, keratin that has adhered to the finger of an operator. The impurities that may be mixed in differ depending on the sample 3, and thus it is only required to acquire data of the impurities that may be mixed in. In the fifth embodiment, the same structures as those of the aforementioned first embodiment are denoted by the same reference numerals, and description thereof is omitted.
(81) In the analytical data analyzer 500 according to the fifth embodiment, in the step of generating the simulated data, the mass spectrum 71 is generated by adding the peak of the impurity to the mass spectrum 70 according to the impurity detected at the time of measurement by the analyzer 1. Discrimination is performed using a discrimination criterion 23b generated as a result of using the generated mass spectrum 71 for learning. In the fifth embodiment, a plurality of simulated data are generated by changing the height of the peak of the impurity, and are used for machine learning.
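The generation of a plurality of simulated data by changing the height of the impurity peak can be sketched as below. The Gaussian peak shape, the `width` parameter, and the function names are assumptions for illustration; the text specifies only that a peak of the impurity is added and that its height is varied.

```python
import math


def add_impurity_peak(mz, intensity, peak_mz, height, width=0.5):
    """Add a peak centered at peak_mz to the spectrum.

    Assumption: Gaussian peak shape with standard deviation `width`.
    """
    return [
        y + height * math.exp(-0.5 * ((x - peak_mz) / width) ** 2)
        for x, y in zip(mz, intensity)
    ]


def generate_impurity_variants(mz, intensity, peak_mz, heights, width=0.5):
    """Generate one simulated spectrum per impurity peak height."""
    return [add_impurity_peak(mz, intensity, peak_mz, h, width) for h in heights]
```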
(82) The remaining structures of the analytical data analyzer 500 according to the fifth embodiment are similar to those of the analytical data analyzer 100 according to the first embodiment.
(83) (Effects of Fifth Embodiment)
(84) According to the fifth embodiment, the following effects are achieved.
(85) According to the fifth embodiment, as described above, the mass spectrum 71 is generated by adding the peak of the impurity to the mass spectrum 70 according to the impurity detected at the time of measurement by the analyzer 1. Accordingly, learning can be performed using the mass spectrum 71 corresponding to the mixing of impurity. Consequently, a decrease in the accuracy of machine learning can be significantly reduced or prevented when the impurity is mixed.
(86) [Sixth Embodiment]
(87) The structure of an analytical data analyzer 600 according to a sixth embodiment is now described with reference to
(88) As shown in
(89) The analytical data analyzer 600 according to the sixth embodiment analyzes the analysis result data obtained via an external storage medium, such as a hard disk or a USB memory, or via the Internet.
(90) (Effects of Sixth Embodiment)
(91) According to the sixth embodiment, the following effects are achieved.
(92) According to the sixth embodiment, as described above, the analytical data analyzer 600 includes the data input 7 that acquires the analysis result data 6; the storage 23 that stores the discrimination algorithm 23a for machine learning and the discrimination criterion 23b generated through machine learning using the simulated data, which is generated by adding the data variation to the analysis result data 6 within the range that does not affect identification; and the arithmetic unit 24 that discriminates the acquired analysis result data 6 using the discrimination criterion 23b. Accordingly, a plurality of simulated data in which the variation has been added within the range that does not affect identification can be generated. Consequently, the amount of data used for machine learning can be increased, and thus the accuracy of machine learning can be improved.
(93) [Modified Examples]
(94) The embodiments disclosed herein must be considered as illustrative in all points and not restrictive. The scope of the present invention is shown not by the above description of the embodiments but by the scope of claims for patent, and all modifications (modified examples) within the meaning and scope equivalent to the scope of claims for patent are further included.
(95) For example, while the example in which the mass spectrum is obtained as the analysis result data has been shown in each of the aforementioned first to fifth embodiments, the present invention is not restricted to this. Non-spectral data may be used as the analysis result data.
(96) While the example in which the MALDI method is used as the ionization method of the ionizer 10 has been shown in each of the aforementioned first to fifth embodiments, the present invention is not restricted to this. For example, ESI (electrospray method) may be used as the ionization method.
(97) While the example in which the mass spectrometer is provided as the analyzer has been shown in each of the aforementioned first to fifth embodiments, the present invention is not restricted to this. According to the present invention, any analyzer may be used as long as a spectrum can be obtained as the analysis result data and a variation associated with the detection is added to the obtained spectrum. For example, an FT-IR (Fourier Transform Infrared Spectrophotometer) may be used, or a chromatograph may be used.
(98) While the example in which simulated data, to which a variation corresponding to a variation factor associated with the measurement has been added, is generated and used for learning has been shown in each of the aforementioned first to fifth embodiments, the present invention is not restricted to this. According to the present invention, machine learning may be performed by combining the simulated data generated in the first to fifth embodiments, or by using all the simulated data. According to this structure, the amount of data (the number of data patterns) used for machine learning can be increased, and thus the accuracy of machine learning can be further improved.
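Combining simulated data from several variation factors into one learning set can be sketched as follows. This is a minimal illustration under stated assumptions: each generator is a function that maps a measured spectrum to one simulated spectrum, and each simulated spectrum keeps the label of the measured spectrum it came from. The names `build_training_set`, `scale_up`, and `shift_baseline`, and the labels used, are hypothetical.

```python
def build_training_set(spectra, labels, generators):
    """Return measured spectra plus one simulated copy per generator.

    Assumption: the label of a simulated spectrum equals the label of its
    source spectrum, since the variation is meant not to reverse the result.
    """
    if len(spectra) != len(labels):
        raise ValueError("spectra and labels must have the same length")
    out_x, out_y = [], []
    for spectrum, label in zip(spectra, labels):
        out_x.append(list(spectrum))  # keep the measured spectrum itself
        out_y.append(label)
        for generate in generators:
            out_x.append(generate(spectrum))
            out_y.append(label)
    return out_x, out_y


def scale_up(spectrum):
    """Hypothetical stand-in for an intensity-ratio variation."""
    return [y * 2.0 for y in spectrum]


def shift_baseline(spectrum):
    """Hypothetical stand-in for a baseline variation."""
    return [y + 0.5 for y in spectrum]
```

With two measured spectra and two generators, the resulting set holds six spectra, three per label.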
(99) While the example in which an SVM (support vector machine) is used as the machine learning method to generate the discrimination criterion 23b has been shown in each of the aforementioned first to sixth embodiments, the present invention is not restricted to this. For example, a neural network may be used, or AdaBoost may be used. Machine learning methods other than these may also be used.
(100) While the example in which the analytical data analyzer 100 is used to discriminate cancer has been shown in the aforementioned first embodiment, the present invention is not restricted to this. For example, the analytical data analyzer may be used to discriminate a disease other than cancer.
DESCRIPTION OF REFERENCE NUMERALS
(101) 1: analyzer
(102) 6, 31, 40, 51, 60, 70: analysis result data
(103) 7: data input
(104) 23: storage
(105) 23a: discrimination algorithm
(106) 23b: discrimination criterion
(107) 24: arithmetic unit
(108) 30: intensity ratio by mass of sample (specific variation factor associated with measurement by analyzer)
(109) 32, 41, 52, 61, 71: simulated data
(110) 100, 200, 300, 400, 500, 600: analytical data analyzer