METHOD OF CREATING CHARACTERISTIC PROFILES OF MASS SPECTRA AND IDENTIFICATION MODEL FOR ANALYZING AND IDENTIFYING FEATURES OF MICROORGANISMS

20210080384 · 2021-03-18

Abstract

A method of creating characteristic profiles of mass spectra and an identification model for analyzing and identifying microorganisms includes obtaining MALDI-TOF MS data of microorganisms having the same features; using a kernel density estimation to generate characteristic profiles of the m/z values of the data; creating a characteristic MS profile based on the m/z values; repeating the above three steps until characteristic MS profiles of a plurality of features of the microorganisms are obtained; comparing the m/z values of MALDI-TOF MS spectra of known microorganisms with the characteristic MS profiles to obtain first matched vectors; using a machine learning method to establish a feature classification model; using MALDI-TOF MS to analyze microorganisms having unknown features; comparing the m/z values of the MALDI-TOF MS spectra of the microorganisms having unknown features with the characteristic MS profiles to obtain second matched vectors; using the feature classification model to analyze the second matched vectors; and identifying the microorganisms having the unknown features.

Claims

1. A method of creating characteristic profiles of mass spectra and identification model for analyzing and identifying microorganisms, comprising the steps of: (1) obtaining data of MALDI-TOF MS of microorganisms having same features; (2) using a kernel density estimation to generate characteristic profiles of an m/z of the data; (3) creating a characteristic MS profile based on the m/z; (4) repeating steps (1) to (3) until characteristic MS profiles of a plurality of features of the microorganisms are obtained; (5) comparing m/z of MALDI-TOF MS spectrum of microorganisms having known features with the characteristic MS profiles to obtain a plurality of first matched vectors; (6) using a machine learning method to establish a feature classification model; (7) using MALDI-TOF MS to analyze microorganisms having unknown features; (8) comparing the m/z of MALDI-TOF MS spectrum of the microorganisms having unknown features with the characteristic MS profiles to obtain a plurality of second matched vectors; (9) using the feature classification model to analyze the second matched vectors; and (10) identifying the microorganisms having the unknown features.

2. The method of claim 1, wherein the machine learning method is Support Vector Machine (SVM), Artificial Neural Network (ANN), k Nearest Neighbor (kNN), Logistic Regression (LR), Fuzzy Logic, Bayesian Algorithms, Decision Tree Induction Algorithm (DT), Random Forest (RF), Deep Learning, or any combination thereof.

3. The method of claim 1, wherein the microorganisms are bacteria, molds, or viruses.

4. The method of claim 1, wherein the features of the microorganisms are species, sub-species, resistance to antibiotics, or toxicity.

5. The method of claim 1, wherein the kernel density estimation uses a uniform kernel, triangular kernel, biweight kernel, triweight kernel, Epanechnikov kernel, or Gaussian kernel.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a flow chart of a method of creating characteristic profiles of mass spectra and identification model for analyzing and identifying features of microorganisms by analyzing the MS of their biomolecules according to the invention;

FIG. 2 includes three plots of density versus m/z in the range of 4000 to 7000, for ST3, ST42 and other ST types respectively, in each of which black blocks represent the original m/z distributions and dashed lines represent the Gaussian function estimations according to the invention;

[0016] FIG. 3 is a table showing peak values and ranges of ST3, ST42 and other ST types;

[0017] FIG. 4 is a table of matched vectors versus ST3, ST42 and other ST types;

FIG. 5A includes three plots of sensitivity versus 1-specificity for ST3, one each for Random Forest (RF), Support Vector Machine (SVM) and Logistic Regression (LR), in each of which a solid line represents a Gaussian function and a dashed line represents density-based clustering according to the invention;

[0018] FIG. 5B includes three plots of sensitivity versus 1-specificity for ST42, one each for RF, SVM and LR, in each of which a solid line represents a Gaussian function and a dashed line represents density-based clustering according to the invention;

[0019] FIG. 5C includes three plots of sensitivity versus 1-specificity for other ST types, one each for RF, SVM and LR, in each of which a solid line represents a Gaussian function and a dashed line represents density-based clustering according to the invention;

[0020] FIG. 6 is a table showing the sensitivity, specificity, accuracy and area under curve (AUC) for each of ST3, ST42 and other ST types, for each of LR, RF and SVM, with the Gaussian function and with density-based clustering according to the invention; and

[0021] FIG. 7 is a table showing the accuracy and the AUC of each machine learning method (LR, RF and SVM) with the Gaussian function and with density-based clustering according to the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0022] Referring to FIG. 1, a flow chart of a method of creating characteristic profiles of mass spectra and identification model for analyzing and identifying features of microorganisms by analyzing the MS of their biomolecules according to the invention is illustrated and comprises the steps of:

[0023] T1: obtaining data of MALDI-TOF MS of microorganisms having same features;

[0024] T2: using a kernel density estimation to generate characteristic profiles of an m/z of the data, wherein the kernel used for the kernel density estimation is a uniform kernel, triangular kernel, biweight kernel, triweight kernel, Epanechnikov kernel, or Gaussian kernel;

[0025] T3: creating a characteristic MS profile based on the m/z;

[0026] T4: repeating the steps T1 to T3 until characteristic MS profiles of a plurality of features of the microorganisms are obtained;

[0027] T5: comparing m/z of MALDI-TOF MS spectrum of microorganisms having known features with the characteristic MS profiles generated by Gaussian function to obtain a plurality of first matched vectors;

[0028] T6: using a machine learning method to establish a feature classification model;

[0029] T7: using MALDI-TOF MS to analyze microorganisms having unknown features;

[0030] T8: comparing the m/z of MALDI-TOF MS spectrum of the microorganisms having unknown features with the characteristic MS profiles to obtain a plurality of second matched vectors;

[0031] T9: using the feature classification model to analyze the second matched vectors; and

[0032] T10: identifying the microorganisms having the unknown features.

Identification of sub-species of Staphylococcus haemolyticus is taken as an example in conjunction with FIG. 1 according to the invention. MALDI-TOF MS data are collected for 254 Staphylococcus haemolyticus specimens. Next, Multi-Locus Sequence Typing (MLST) is used to identify the sub-species of the Staphylococcus haemolyticus. The data include 15 sub-species, of which ST3 and ST42 are of interest and the data of the other sub-species are few. Therefore, the data are classified as ST3, ST42 and other ST types.

[0033] FIG. 2 includes three plots of density versus m/z in the range of 4000 to 7000, for ST3, ST42 and other ST types respectively, in each of which black blocks represent the original m/z distributions and dashed lines represent the Gaussian function estimations according to the invention.

[0034] FIG. 3 is a table showing the characteristic peaks and ranges of ST3, ST42 and other ST types.

[0035] FIGS. 2 and 3 show the signal distributions of different sub-species of the microorganism. A Gaussian function is used to estimate the m/z data of ST3, ST42 and other ST types respectively.

[0036] Further, the maxima and minima of the estimated curves are calculated and taken as the aligned central points and the drifting ranges, respectively. Finally, all peak values and their ranges are combined to obtain a model having aligned m/z values.
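The estimation and peak-finding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: it reads the maxima as local maxima of a Gaussian kernel density estimate (peak centers) and the minima as the neighboring local minima (bounds of the drifting range). The function names, the bandwidth, the grid step, and the m/z values are all hypothetical.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centered on each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def peaks_and_ranges(samples, bandwidth=2.0, step=0.5):
    """Return one (center, low, high) tuple per local maximum of the density.

    Assumption: the range of a peak runs between the nearest local minima
    on either side of it (or the ends of the evaluation grid).
    """
    density = gaussian_kde(samples, bandwidth)
    start = min(samples) - 3 * bandwidth
    stop = max(samples) + 3 * bandwidth
    n = int(round((stop - start) / step)) + 1
    grid = [start + i * step for i in range(n)]
    dens = [density(x) for x in grid]
    maxima = [i for i in range(1, n - 1) if dens[i - 1] < dens[i] >= dens[i + 1]]
    # The grid ends count as bounds so that the outermost peaks are closed off.
    minima = [0] + [i for i in range(1, n - 1) if dens[i - 1] > dens[i] <= dens[i + 1]] + [n - 1]
    profile = []
    for m in maxima:
        left = max(i for i in minima if i < m)
        right = min(i for i in minima if i > m)
        profile.append((grid[m], grid[left], grid[right]))
    return profile

# Made-up m/z values pooled from several spectra of one sub-species.
mz_values = [4498.0, 4499.0, 4500.0, 4501.0, 4502.0, 4558.0, 4560.0, 4562.0]
profile = peaks_and_ranges(mz_values)
for center, low, high in profile:
    print(f"peak {center:.1f}, range {low:.1f} to {high:.1f}")
```

With these made-up values the sketch finds two peaks, one centered at 4500 and one at 4560, which share the density minimum between them as a range bound.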

[0037] As shown in FIG. 2, the MS signal distribution of each sub-species of microorganism may drift. For example, molecules having an m/z of 4500 may generate signals spread around 4500 rather than exactly at 4500. However, a Gaussian function may be used to process the data without subjecting them to discretization, so that the correct position of a characteristic peak is obtained.
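The effect of drift on discretized data can be illustrated with a small sketch. The readings, the bin width of 5, and the bandwidth of 1.0 are made-up values chosen only for illustration: fixed bins split readings that straddle a bin edge at 4500, whereas a Gaussian estimate over the raw values still yields a single peak near 4500.

```python
import math
from collections import Counter

# Made-up readings of one signal that drifts around m/z 4500.
readings = [4498.6, 4499.2, 4499.8, 4500.3, 4500.9, 4501.4]

# Discretization: fixed bins of width 5 put the edge at 4500 and split the signal.
bin_width = 5
bin_counts = Counter(math.floor(x / bin_width) * bin_width for x in readings)

def gaussian_density(x, samples, bandwidth):
    """Gaussian kernel density estimate evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

# No discretization: evaluate the density on a fine grid and take its highest point.
grid = [4495.0 + 0.1 * i for i in range(101)]
mode = max(grid, key=lambda x: gaussian_density(x, readings, 1.0))

print(dict(bin_counts))   # the signal is split evenly across two bins
print(round(mode, 1))     # a single peak position close to 4500
```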

[0038] FIG. 3 shows portions of the characteristic MS profiles of ST3, ST42 and other ST types and the ranges covered by the characteristic peaks. The first row shows the calculated m/z values of the characteristic peaks, and the corresponding ranges are shown below them. For example, ST3 has a characteristic peak at 2036.38 with a covered range of 2025.34 to 2050.42. These m/z values represent the characteristic peaks of the ST3 characteristic MS profile. The location and possible drifting range of the m/z of each characteristic peak can be correctly defined based on the above information. A characteristic MS profile of a specific sub-species is formed by collecting the m/z values of its characteristic peaks.

[0039] The steps T1 to T3 are repeated until characteristic MS profiles of a plurality of specific sub-species are obtained. After the characteristic MS profiles of the specific sub-species have been obtained, the MS data of a plurality of microorganisms of known sub-species are compared, in terms of signals, with the characteristic MS profile of each sub-species to create a plurality of matched vectors that serve as a training dataset. A plurality of different conventional machine learning methods are then trained on this dataset to establish a sub-species classification and identification model.

[0040] Referring to FIGS. 4 and 5, when an unknown specimen is processed, MALDI-TOF MS is used to obtain MS data of the unknown microorganisms, and the m/z data are compared, in terms of signals, with the characteristic MS profiles to create a plurality of matched vectors, which indicate whether the MS signals of the unknown microorganisms are similar to those of each sub-species. As shown in FIG. 4, the unknown microorganisms are compared with the characteristic MS profile of each of ST3, ST42 and other ST types to obtain three different vectors, which are labeled the first, second and third vectors respectively in the order in which the matched vectors are created. Taking the comparison with the ST3 characteristic MS profile as an example, the first entry is 1, the second entry is 0, and the third entry is 1, in which 1 represents the presence of a signal peak within a specific m/z center and its covered range after the MS signals of the unknown microorganisms have been compared with the ST3 characteristic MS profile, and 0 represents the absence of a signal peak at that m/z. After the MS signals of the unknown microorganisms have been compared with the three sub-species, the first, second and third vectors are concatenated to create the matched vectors of the unknown microorganisms. The matched vectors thus represent a characteristic of the unknown microorganisms and contain information on each sub-species. The dimension of the vector is fixed, as required for classification and identification, so that a machine learning method can be used for analysis and determination.
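The comparison and concatenation described above can be sketched as follows. Only the ST3 peak at 2036.38 with its range of 2025.34 to 2050.42 is taken from the description; every other peak, range, and observed signal is a made-up placeholder, and the function names are hypothetical.

```python
# Each profile is a list of (center, low, high) characteristic peaks.
PROFILES = {
    "ST3": [(2036.38, 2025.34, 2050.42),   # from the description
            (4500.0, 4490.0, 4510.0),      # hypothetical
            (6000.0, 5990.0, 6010.0)],     # hypothetical
    "ST42": [(3000.0, 2990.0, 3010.0),     # hypothetical
             (4700.0, 4690.0, 4710.0)],    # hypothetical
    "other": [(5200.0, 5190.0, 5210.0)],   # hypothetical
}
ORDER = ["ST3", "ST42", "other"]  # fixed order gives a fixed vector dimension

def match_profile(observed_peaks, profile):
    """Return 1 per characteristic peak whose range contains an observed signal, else 0."""
    return [1 if any(low <= p <= high for p in observed_peaks) else 0
            for (_center, low, high) in profile]

def matched_vector(observed_peaks):
    """Concatenate the per-profile vectors in a fixed order."""
    vector = []
    for name in ORDER:
        vector.extend(match_profile(observed_peaks, PROFILES[name]))
    return vector

unknown_peaks = [2040.1, 3100.0, 5995.0]  # made-up signals of an unknown specimen
print(match_profile(unknown_peaks, PROFILES["ST3"]))  # [1, 0, 1]
print(matched_vector(unknown_peaks))                  # [1, 0, 1, 0, 0, 0]
```

Because the profiles and their order are fixed, every specimen yields a vector of the same length (here six entries), which is what allows a conventional classifier to consume it.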

[0041] Referring to FIGS. 5, 6 and 7, three different machine learning methods are used in the embodiment, namely Logistic Regression (LR), Random Forest (RF) and Support Vector Machine (SVM), and the Gaussian function and density-based clustering are used respectively to create a dichotomy (two-class) model for each sub-species, as shown in FIG. 5. The performance of these models is reported in FIGS. 5A to 5C and FIG. 6.
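A dichotomy model of the kind described above can be sketched with a small logistic regression trained by stochastic gradient descent. This is an illustrative sketch only: the matched vectors and labels are made up, the learning rate and epoch count are arbitrary, and the embodiment does not state how its LR, RF or SVM models were implemented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability that matched vector x belongs to the target sub-species."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic(X, y, learning_rate=0.5, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = predict(weights, bias, x) - target
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

# Made-up matched vectors of specimens with known sub-species.
X = [[1, 0, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 0], [0, 1, 0, 1, 0, 0],
     [0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 1]]
labels = ["ST3", "ST3", "ST3", "ST42", "ST42", "other", "other"]

# Dichotomy: ST3 versus everything else.
y = [1 if label == "ST3" else 0 for label in labels]
weights, bias = train_logistic(X, y)
predicted = [1 if predict(weights, bias, x) >= 0.5 else 0 for x in X]
print(predicted)
```

The same training call with `y` rebuilt for "ST42" or "other" would give the other two dichotomy models.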

[0042] In the dichotomy models of ST3, ST42 and other ST types shown in FIGS. 5A, 5B and 5C, a Gaussian function is used to generate the characteristic MS profiles. Irrespective of the machine learning method being used, the area under curve (AUC) of the receiver operating characteristic (ROC) curve is greater than 0.85, and the curves for density-based clustering are shown for comparison. Further, the AUC of the ROC curve is greater than 0.90 for the RF model in cooperation with the Gaussian function.
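For reference, the AUC of a ROC curve can be computed directly from the scores of a dichotomy model: it equals the fraction of (positive, negative) pairs in which the positive specimen receives the higher score, with ties counted as one half. The sketch below uses made-up labels and scores, not results from the embodiment.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up example: 1 marks the target sub-species, 0 marks all others.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 11 of the 12 pairs are ordered correctly
```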

[0043] As shown in FIG. 6, using the Gaussian function offers advantages in each model. As shown in FIG. 7, a plurality of comprehensive sub-species classification and identification models are also established in the embodiment. Unlike the dichotomy models, a comprehensive model can classify and identify a plurality of sub-species at one time. In the embodiment, ST3, ST42 and other ST types are identified at one time.
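A comprehensive model differs from a dichotomy model in that a single prediction returns one of all the classes. The sketch below illustrates this with a k Nearest Neighbor classifier over matched vectors. kNN is used here only because it is short to write out; the embodiment reports LR, RF and SVM, and the training vectors below are made up.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two binary matched vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, x, k=3):
    """Return the majority label among the k training vectors closest to x."""
    ranked = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], x))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up matched vectors of specimens with known sub-species.
train_X = [[1, 0, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0],
           [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 0, 0], [0, 1, 0, 1, 1, 0],
           [0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 0, 1]]
train_y = ["ST3"] * 3 + ["ST42"] * 3 + ["other"] * 3

# One call distinguishes all three classes at once.
print(knn_predict(train_X, train_y, [1, 0, 1, 0, 0, 0]))  # ST3
print(knn_predict(train_X, train_y, [0, 0, 0, 1, 1, 0]))  # ST42
print(knn_predict(train_X, train_y, [0, 0, 0, 0, 0, 1]))  # other
```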

[0044] In conclusion, the Gaussian function in cooperation with different machine learning methods achieves an excellent identification effect, e.g., an accuracy of about 0.90, which is better than that of density-based clustering. Further, the standard deviation of the accuracy is very small, which means that the accuracy of the machine learning methods is highly consistent.
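The accuracy and its standard deviation can be summarized over repeated evaluations as sketched below. The description does not state how the evaluations were repeated, so the three folds here are an assumption, and the true and predicted labels are made up for illustration.

```python
import statistics

def accuracy(true_labels, predicted_labels):
    """Fraction of specimens whose predicted sub-species equals the true one."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

# Made-up (true, predicted) label lists for three evaluation folds.
folds = [
    (["ST3", "ST42", "other", "ST3", "other"], ["ST3", "ST42", "other", "ST3", "ST3"]),
    (["ST42", "ST3", "other", "ST42", "ST3"], ["ST42", "ST3", "other", "other", "ST3"]),
    (["ST3", "other", "ST42", "ST3", "ST42"], ["ST3", "other", "ST42", "ST3", "ST42"]),
]

fold_accuracies = [accuracy(t, p) for t, p in folds]
mean_accuracy = statistics.mean(fold_accuracies)
sd_accuracy = statistics.stdev(fold_accuracies)  # sample standard deviation
print(fold_accuracies, round(mean_accuracy, 3), round(sd_accuracy, 3))
```

A small standard deviation across folds indicates that the accuracy varies little from one evaluation to the next.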

[0045] It is clear from the above embodiment that the novel and nonobvious method of the invention can obtain more accurate characteristic MS profiles of species. Further, the machine learning methods being used can more precisely identify microorganism sub-species. It is understood that sub-species is one feature of microorganisms. In other words, the method of the invention can be easily extended to the identification of species, sub-species, resistance to antibiotics, or toxicity.

[0046] While the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modifications within the spirit and scope of the appended claims.