METHOD AND SYSTEM FOR THE IDENTIFICATION OF COMPOUNDS IN COMPLEX BIOLOGICAL OR ENVIRONMENTAL SAMPLES
20230047202 · 2023-02-16
Inventors
Cpc classification
H01J49/0036
ELECTRICITY
G01N30/8682
PHYSICS
International classification
G16C20/20
PHYSICS
Abstract
Method and system for the identification of compounds in complex biological or environmental samples by receiving (102) a mass spectrum (1) from a mass spectrometry analysis coupled with a separation technique; for each data point (2) of the mass spectrum (1), annotating (106) in an annotation database (12) combinations of formulas and adducts whose theoretical mass-to-charge ratio (m/z).sup.T corresponds to the measured mass-to-charge ratio (m/z) of the data point (2); for each formula and adduct annotated, detecting (108) regions of interest in a retention time range (RT.sub.0-RT.sub.1) according to characterisation criteria; generating (110) an inclusion list (14) with the retention time ranges (RT.sub.0-RT.sub.1) and the theoretical mass-to-charge ratios (m/z).sup.T of the formulas and adducts associated with the regions of interest; and sending (112) the inclusion list to a mass spectrometer for the identification of compounds in the sample by tandem mass spectrometry.
Claims
1. A method for the identification of compounds in complex biological or environmental samples, comprising: receiving (102) a mass spectrum (1) from a mass spectrometry analysis coupled with a separation technique applied to a sample, wherein the mass spectrum (1) comprises a plurality of data points (2) with information on retention time (RT), mass-to-charge ratio measured (m/z) and intensity of the signal measured; consulting (104) a molecular formula database (10) which includes the theoretical mass-to-charge ratio (m/z).sup.T of the molecular ion of a plurality of molecular formulas and ionisation adducts; for each data point (2) of the mass spectrum (1), annotating (106) in an annotation database (12) the combinations of molecular formulas and ionisation adducts the theoretical mass-to-charge ratio (m/z).sup.T of which corresponds to the mass-to-charge ratio (m/z) measured of said data point (2) considering a given mass error, wherein each annotation includes the retention time (RT) and the intensity of the measured signal of the data point (2); for each molecular formula and ionisation adduct annotated in the annotation database (12), detecting (108) regions of interest defined in a retention time range (RT.sub.0-RT.sub.1) wherein the annotated data points meet characterisation criteria; generating (110) an inclusion list (14) which includes the retention time ranges (RT.sub.0-RT.sub.1) of the detected regions of interest and the theoretical mass-to-charge ratios (m/z).sup.T of the molecular formulas and ionisation adducts associated with each of the regions of interest; and sending (112) the inclusion list to a mass spectrometer for the identification of compounds in the sample by means of tandem mass spectrometry.
2. The method of claim 1, which comprises detecting in the mass spectrum (1) isotopologues associated with the molecular formulas and/or ionisation adducts annotated, wherein the detection of isotopologues comprises: searching (162), in the retention time range (RT.sub.0-RT.sub.1) of each region of interest (28), for data points (2) of the mass spectrum (1) the mass-to-charge ratio measured (m/z) of which corresponds, considering a mass error, to a theoretical mass-to-charge ratio (m/z).sup.T of an isotopologue of the molecular formula and/or ionisation adduct associated with the region of interest (28); obtaining (164) the intensity of the measured signal of the data points found; calculating (166) a theoretical intensity of the data points found starting from the intensity of the measured signal of the data points of the region of interest (28) corresponding to the molecular formula and/or ionisation adduct; comparing (168) the measured intensities with the calculated theoretical intensities; determining (170) the detection of the isotopologue based on said comparison.
3. The method of claim 1, wherein the detection (108) of the regions of interest comprises: determining (122) candidate regions (20) defined in a retention time range (RT.sub.C0-RT.sub.C1) with a minimum number of data points and/or a minimum density of data points annotated; characterising (124) the candidate regions (20), obtaining characterisation parameters (22); and selecting (128) those candidate regions (20) the characterisation parameters (22) of which meet certain characterisation criteria as regions of interest.
4. The method of claim 3, wherein the characterisation (124) of the candidate regions (20) comprises calculating (132) a slope (m) of a linear regression (24) from the data points (2) annotated in the candidate regions (20); and wherein the characterisation criteria comprise verifying (142) that the absolute value of the slope (m) calculated is greater than a threshold slope (m.sub.min).
5. The method of claim 3, wherein the characterisation (124) of the candidate regions (20) comprises calculating (134, 136) an average intensity (I.sub.avg) and/or a maximum intensity (I.sub.max) of the measured signal from the data points (2) annotated in the candidate regions (20); and wherein the characterisation criteria comprise verifying (144, 146) that the average intensity (I.sub.avg) and/or the maximum intensity (I.sub.max) calculated is greater than a threshold average intensity (I.sub.avg.sup.TH) and/or a threshold maximum intensity (I.sub.max.sup.TH).
6. The method of claim 3, wherein the characterisation (124) of the candidate regions (20) comprises calculating (138) an intensity range of the signal measured from the data points (2) annotated in the candidate regions (20), the intensity range being defined by a ratio between the maximum intensity and the minimum intensity in the candidate region (20); and wherein the characterisation criteria comprise verifying (148) that the calculated intensity range is greater than a threshold intensity range.
7. The method of claim 3, wherein the characterisation (124) of the candidate regions (20) comprises calculating (140) a signal-to-noise ratio (SNR) between an intensity level associated with the data points (2) annotated in the candidate region (20) and an intensity level associated with the data points (2) of the mass spectrum (1) located in an area surrounding (26) the candidate region (20); and wherein the characterisation criteria comprise verifying (150) that the signal-to-noise ratio (SNR) calculated is greater than a threshold signal-to-noise ratio (SNR.sup.TH).
8. The method of claim 7, wherein the area surrounding (26) the candidate region (20) is defined by a space delimited by a mass-to-charge ratio range (m/z.sub.P0-m/z.sub.P1) which includes a mass-to-charge ratio range (m/z.sub.C0-m/z.sub.C1) corresponding to the candidate region (20), and by a retention time range (RT.sub.P0-RT.sub.P1) which includes the retention time range (RT.sub.C0-RT.sub.C1) corresponding to the candidate region (20).
9. The method of claim 1, further comprising: defining a set of molecular formulas depending on the sample to be analysed; defining ionisation adducts associated with the molecular formulas; and generating the molecular formula database (10) including, for each molecular formula and associated ionisation adduct, the theoretical mass-to-charge ratio (m/z).sup.T.
10. The method of claim 1, further comprising performing a mass spectrometry analysis coupled with a separation technique applied to the sample to obtain the mass spectrum (1).
11. The method of claim 1, further comprising performing a tandem mass spectrometry analysis using the information comprised in the inclusion list in order to identify compounds in the sample.
12. A system for the identification of compounds in complex biological or environmental samples comprising a control unit with data processing means configured to execute the steps of the method according to claim 1.
13. The system of claim 12, comprising a mass spectrometer responsible for performing a mass spectrometry analysis coupled with a separation technique on the sample in order to obtain the mass spectrum (1).
14. The system of claim 12, comprising a mass spectrometer responsible for performing a tandem mass spectrometry analysis using the information included in the inclusion list in order to identify compounds in the sample.
15. A programme product for the identification of compounds in complex biological or environmental samples, comprising programme instructions for carrying out the method defined in claim 1 when the programme is executed in a processor.
16. The programme product according to claim 15, comprising at least one non-transitory computer-readable storage medium which stores the programme instructions.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0034] What follows is a very brief description of a series of drawings which aid in better understanding the invention and which relate expressly to an embodiment of said invention, presented by way of a non-limiting example thereof.
DETAILED DESCRIPTION OF THE INVENTION
[0048] The data points 2 of a mass spectrum 1 acquired in MS1 mode in a mass spectrometer coupled with a separation technique (e.g., liquid chromatography coupled with mass spectrometry, LC-MS, or capillary electrophoresis coupled with mass spectrometry, CE-MS) contain, as represented in the graph of
[0049] Today, the annotation of the mass spectrum 1 in MS1 mode (MS1 annotation) follows this scheme:
[0050] 1) An algorithm (e.g., CentWave) is used for the detection of regions of interest 3 (ROI) in the raw data; it applies a continuous wavelet transform and Gaussian fitting in the chromatographic separation domain, or in that of any other separation technique coupled with HRMS (on the horizontal axis the retention time RT and on the vertical axis the intensity of the signal measured, as represented in
[0052] After completing the MS1 annotation, an MS2 annotation is performed for the characterisation or identification of metabolites by using tandem mass spectrometry or MS.sup.n (n≥2). There are currently three methods for the identification of metabolites by means of LC-MS/CE-MS (or LC-HRMS/CE-HRMS) and MS.sup.n in untargeted metabolomics:
[0053] Inclusion list (targeted MS/MS): Samples are analysed in MS.sup.1 mode and the data are processed by means of one or more software programmes which detect and align peaks (as explained in
[0056] The present invention consists of a new method for processing raw data from an LC-HRMS or CE-HRMS analysis in MS1 mode and selecting mass-to-charge ratios (m/z) and retention time (RT) ranges for the identification of metabolites in a subsequent analysis performed by means of tandem mass spectrometry or MS.sup.n (n≥2).
[0057] The method 100 of the present invention comprises the steps shown in the flow chart of
[0058] First, the method 100 comprises receiving 102 a mass spectrum 1 from an LC-MS or CE-MS analysis (acquired in MS1 mode) applied to a biological or environmental sample. The mass spectrum 1 comprises a plurality of data points 2 with information including the retention time (RT), the measured mass-to-charge ratio (m/z) and the intensity of the signal measured.
[0059] Next, a molecular formula database 10 is accessed or consulted 104 including the theoretical mass-to-charge ratio (m/z).sup.T of the molecular ion of a plurality of molecular formulas and associated ionisation adducts. In one embodiment, the molecular formula database 10 comprises a list of formulas and a list of adducts, with the theoretical mass-to-charge ratio of each formula and each adduct, such that the theoretical mass-to-charge ratio of the combinations of molecular formulas and ionisation adducts can subsequently be calculated starting from the monoisotopic mass of the molecular formula plus the difference in mass that the ionisation adduct contributes when charged at the source (e.g., H, Na, K). In another embodiment, the molecular formula database 10 directly stores the theoretical mass-to-charge ratio of the different combinations of formulas and adducts, such that no subsequent calculation is necessary.
[0060] The contents of the molecular formula database 10, or the information accessed in the consultation 104, are preferably oriented towards the particular sample to be analysed, based on a large universe or space of molecular formulas related to the matrix to be analysed (serum, urine, cells, environmental samples, etc.). In the case of biological matrices of biomedical interest, the molecular formulas included in the Human Metabolome Database (HMDB) can be used. For example, a set of molecular formulas can be defined depending on the sample to be analysed, together with the ionisation adducts the association of which with the molecular formulas is known. Databases can be considered which include only the molecular formulas oriented towards the particular sample (for example with the formulas which are expected to be found in blood plasma), or larger databases, like the HMDB database which includes information on more than 10,000 metabolites found in the human body.
[0061] Once the molecular formulas and the ionisation adducts thereof are defined, the contents of the molecular formula database 10 can be generated including, for each molecular ion of the molecular formula and for each associated ionisation adduct, the theoretical mass-to-charge ratio (m/z).sup.T, which can be obtained directly from the molecular formula considering the corresponding atomic weights. The method may comprise the step of generating the molecular formula database 10. Alternatively, the molecular formula database 10 may have already been created prior to the implementation of method 100, such that the method 100 only requires accessing a memory (e.g., on a local device or in the cloud) wherein the previously generated molecular formula database 10 is stored.
[0062] The construction of the molecular formula database 10 may comprise the generation of a table containing all the theoretical mass-to-charge ratios (m/z).sup.T after considering the main isotopologues (e.g., M1, M2, M3) and known adducts in both positive and negative ionisation (in-source fragments can be treated as adducts in the adduct list) for each unique molecular formula considered. The information contained in the molecular formula database 10 can, for example, be structured in the form of a table, wherein a different formula/adduct/isotopologue is included in each row. The table can be ordered by theoretical mass-to-charge ratio (m/z).sup.T, the first column, as represented in the following example:
TABLE-US-00001
m/z.sup.T    Formula      Adduct    Isotopologue
376.2312     C21H27NO4    +NH4      M1
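By way of a non-limiting illustration, the calculation of a theoretical mass-to-charge ratio from the monoisotopic mass of the formula plus the mass contributed by the adduct can be sketched as follows. The function names, the dictionary layout and the restriction to singly charged positive adducts are assumptions made for this sketch and do not form part of the described database 10; the M1 value is approximated here by a single 13C substitution.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
                "O": 15.9949146, "P": 30.9737620, "S": 31.9720707}
ELECTRON = 0.00054858
C13_SHIFT = 1.0033548  # mass difference 13C - 12C, used for a 13C-based M1

# Hypothetical adduct table: mass added to the neutral molecule for a
# singly charged positive ion (neutral species minus one electron).
ADDUCT_SHIFT = {
    "+H": 1.00782503 - ELECTRON,
    "+NH4": 14.0030740 + 4 * 1.00782503 - ELECTRON,
    "+Na": 22.9897693 - ELECTRON,
    "+K": 38.9637064 - ELECTRON,
}

def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of a simple formula such as 'C21H27NO4'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass

def theoretical_mz(formula, adduct):
    """Theoretical m/z of the singly charged ion formula + adduct (M0)."""
    return monoisotopic_mass(formula) + ADDUCT_SHIFT[adduct]

mz_m0 = theoretical_mz("C21H27NO4", "+NH4")
mz_m1 = mz_m0 + C13_SHIFT
```

For the example row of the table, this sketch gives approximately 375.2278 for M0 and approximately 376.2312 for M1.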
[0063] For every theoretical mass-to-charge ratio (m/z).sup.T value, the method searches the data points 2 of the LC-MS or CE-MS mass spectrum 1 for a match within a predefined error (typically 1 to 5 ppm). Alternatively, the data points 2 of the mass spectrum 1 are scanned and, for each data point 2, it is verified whether the measured mass-to-charge ratio (m/z) thereof corresponds to a theoretical mass-to-charge ratio (m/z).sup.T from the molecular formula database 10. In order to facilitate the search, the molecular formula database 10 can include the data ordered from lowest to highest theoretical mass-to-charge ratio (m/z).sup.T.
[0064] For each data point 2 of the mass spectrum 1, the molecular formulas and ionisation adducts the theoretical mass-to-charge ratio (m/z).sup.T of which corresponds to the measured mass-to-charge ratio (m/z) of said data point are annotated 106 in an annotation database 12, considering a certain margin or mass error (arising from the measurement accuracy or the calibration of the mass spectrometer). The annotation database 12 includes, for each molecular formula and ionisation adduct annotated, the retention time (RT) and the intensity of the signal measured of the data point associated with the formula/adduct. The information contained in the annotation database 12 can be structured for example in the form of a table, wherein a different annotation is included in each row. Each row will therefore be a new annotation which will include the formula and/or adduct annotated, the corresponding retention time (RT) thereof, the intensity of the signal measured of the data point 2 of the associated mass spectrum 1 and, optionally, the measured mass-to-charge ratio (m/z).
TABLE-US-00002
Formula      Adduct    RT    Intensity    m/z measured
C21H27NO4    +NH4                         375.2281
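The annotation step 106 can be illustrated with the following minimal sketch, which assumes that the molecular formula database is available as a list ordered by theoretical mass-to-charge ratio and that the mass error is expressed in ppm relative to the theoretical value. The function name `annotate`, the tuple layouts and the dictionary keys are hypothetical and chosen only for this example.

```python
from bisect import bisect_left, bisect_right

def annotate(data_points, formula_db, ppm=5.0):
    """Annotate data points with every formula/adduct within the mass error.

    data_points: iterable of (rt, mz_measured, intensity)
    formula_db:  list of (mz_theoretical, formula, adduct), sorted by
                 mz_theoretical (assumed layout)
    """
    theo = [row[0] for row in formula_db]
    tol = ppm * 1e-6
    annotations = []
    for rt, mz, intensity in data_points:
        # |mz - mz_t| / mz_t <= tol  <=>  mz/(1+tol) <= mz_t <= mz/(1-tol)
        lo = bisect_left(theo, mz / (1 + tol))
        hi = bisect_right(theo, mz / (1 - tol))
        for _mz_t, formula, adduct in formula_db[lo:hi]:
            annotations.append({"formula": formula, "adduct": adduct,
                                "rt": rt, "intensity": intensity,
                                "mz_measured": mz})
    return annotations
```

Because the database is ordered, each data point requires only a binary search rather than a full scan; a single data point may yield several annotations when more than one formula/adduct falls within the mass error.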
[0065] The different annotations of retention times (RT) and intensity which are made in the annotation database 12 for one same formula and adduct in different rows of the table can be grouped (and even represented in a graph, as shown in
[0066] According to the mass error defined, there will be more or less overlap of possible formulas and/or adducts annotated for one same data point 2. In the graph of
[0067] Once the annotation 106 has been performed, each molecular formula and ionisation adduct of the annotation database 12 is analysed, grouping all the annotations that occurred for one same formula/adduct (see example of
[0068] The method 100 implements an algorithm in order to find regions of interest based on verifying one or more characterisation criteria, first considering a criterion of minimum density and/or minimum number of data points in the region of interest (which will determine candidate regions), and considering additional criteria afterwards, such as a minimum slope of the data points in the region of interest or a certain minimum signal-to-noise ratio. Optionally, although this is recommended, the detected regions of interest can also be compared with a sample blank in order to rule out false positives or data points exogenous to the sample.
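A possible way of determining candidate regions from the annotations of one formula/adduct is sketched below. The text does not specify how density is measured; here it is assumed, purely for illustration, that consecutive annotated retention times must not be separated by more than `max_gap`, and that a run needs at least `min_points` data points. Both parameter names and their default values are hypothetical.

```python
def candidate_regions(rts, max_gap=2.0, min_points=5):
    """Group annotated retention times into candidate regions.

    rts: retention times annotated for one formula/adduct.
    Returns a list of (rt_start, rt_end) tuples.
    """
    regions, run = [], []
    for rt in sorted(rts):
        # A gap larger than max_gap closes the current run of data points.
        if run and rt - run[-1] > max_gap:
            if len(run) >= min_points:
                regions.append((run[0], run[-1]))
            run = []
        run.append(rt)
    if len(run) >= min_points:
        regions.append((run[0], run[-1]))
    return regions
```

For instance, `candidate_regions([10, 11, 12, 13, 14, 30, 50, 51, 52, 53, 54, 55])` yields two candidate regions and discards the isolated data point at 30.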
[0069] Therefore, and unlike the state of the art, determining the regions of interest does not consist of finding peaks in the mass spectrum 1 by fitting a model (e.g., Gaussian) to the data. The approach of the new method is independent from the shape and determination of the spectral peaks 4, it not being necessary to make any type of correlation between the spectral peaks (as shown in
[0070]
[0071] Next, the candidate regions 20 are characterised 124, obtaining characterisation parameters 22 of the candidate regions 20. Lastly, the characterisation parameters 22 obtained are compared 126 with characterisation criteria, and those candidate regions 20 the characterisation parameters 22 of which meet certain characterisation criteria are selected 128 as regions of interest.
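The characterisation 124 and selection 128 of candidate regions can be sketched as follows, using the slope, average intensity, maximum intensity and intensity range named in claims 4 to 6. Requiring all criteria simultaneously is an assumption of this sketch (the criteria may equally be applied individually), and the function names and threshold parameters are hypothetical.

```python
def characterise(points):
    """Compute characterisation parameters for one candidate region.

    points: list of (rt, intensity) annotated in the candidate region.
    """
    n = len(points)
    mean_rt = sum(rt for rt, _ in points) / n
    mean_i = sum(i for _, i in points) / n
    # Ordinary least-squares slope of intensity against retention time.
    sxx = sum((rt - mean_rt) ** 2 for rt, _ in points)
    sxy = sum((rt - mean_rt) * (i - mean_i) for rt, i in points)
    intensities = [i for _, i in points]
    i_min, i_max = min(intensities), max(intensities)
    return {
        "slope": sxy / sxx if sxx else 0.0,
        "i_avg": mean_i,
        "i_max": i_max,
        # Intensity range defined as the ratio maximum / minimum.
        "i_range": i_max / i_min if i_min > 0 else float("inf"),
    }

def is_region_of_interest(params, m_min, i_avg_th, i_max_th, range_th):
    """Return True when every parameter exceeds its threshold."""
    return (abs(params["slope"]) > m_min
            and params["i_avg"] > i_avg_th
            and params["i_max"] > i_max_th
            and params["i_range"] > range_th)
```

A region with flat intensity produces a slope of zero and is rejected by the slope criterion regardless of its intensity level.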
[0072]
[0078] However, it is possible to use other different characterisation parameters or criteria. Furthermore, the characterisation criteria can be coupled with machine learning techniques (artificial neural networks, random forests, etc.) in order to filter candidate regions 20 and generate a more specific inclusion list, at the cost of introducing a bias associated with the learning method itself.
[0079] In the example of
[0080] The method 100 continues with the generation 110 of an annotated and highly accurate inclusion list 14, with variable time ranges according to the elution profile of each m/z, for MS/MS (or MS.sup.n) experiments which facilitates the identification of metabolites. The inclusion list 14 includes the retention time ranges (RT.sub.0-RT.sub.1) of the detected regions of interest and the theoretical mass-to-charge ratios (m/z).sup.T of the molecular formulas and/or ionisation adducts associated with each of the detected regions of interest. Optionally, the inclusion list may also include the molecular formulas and/or ionisation adducts associated with each of the detected regions of interest.
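The generation 110 of the inclusion list can be illustrated with the sketch below, which serialises the detected regions of interest as comma-separated text. The column names and the ordering by retention time are assumptions of this example; the format actually accepted by a given mass spectrometer is vendor-specific and is not described here.

```python
import csv
import io

def build_inclusion_list(regions_of_interest):
    """Serialise regions of interest as CSV text.

    regions_of_interest: list of dicts with the (hypothetical) keys
    mz_theoretical, rt_start, rt_end, formula and adduct.
    """
    rows = sorted(regions_of_interest,
                  key=lambda r: (r["rt_start"], r["mz_theoretical"]))
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["mz_theoretical", "rt_start", "rt_end",
                     "formula", "adduct"])
    for r in rows:
        writer.writerow([f'{r["mz_theoretical"]:.4f}', r["rt_start"],
                         r["rt_end"], r["formula"], r["adduct"]])
    return buf.getvalue()
```

Each row carries its own retention time range, reflecting the variable time ranges per m/z described above; the formula and adduct columns correspond to the optional annotation of the list.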
[0081] Lastly, the inclusion list 14 is sent 112 to a mass spectrometer in order to, by means of a tandem mass spectrometry analysis, perform the identification of metabolites in the sample by using the data from the inclusion list 14. Optionally, the method may comprise performing the tandem mass spectrometry analysis by using the information comprised in the inclusion list in order to identify metabolites in the sample. The MS/MS analyses are subsequent to the mass scans in MS1 mode performed in the LC-MS analysis, requiring a second injection of the same sample since currently there is no technology for accumulating or storing ions after being detected in the MS1.
[0082] The new method analyses the data points of the mass spectrum of a representative biological sample, acquired in MS1 mode, in order to select those mass-to-charge ratios m/z (and the time ranges thereof) which will be fragmented in subsequent MS.sup.n experiments. A novel aspect of the present invention is the manner of selecting the mass-to-charge ratios (m/z) and the retention time ranges in order to perform the MS.sup.n analysis: since it is not based on the detection of peaks, the method is independent from the chromatographic elution profile of the compound and is able to detect metabolites with non-Gaussian elution shapes or similar (such as those of
[0083] Furthermore, the present invention presents a novel manner of detecting isotopologues of molecular formulas and/or ionisation adducts in the mass spectrum 1. The detection of isotopologues can be verified once the regions of interest 28 of the molecular formulas and ionisation adducts have been detected 108. The detection of isotopologues 120 comprises, as represented in the flow chart of
Int(M0)*ratio*(1+k)>Int(iso)>Int(M0)*ratio*(1−k)
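The intensity window above can be expressed as a short check. In this sketch `ratio` is taken to be the theoretical abundance ratio of the isotopologue with respect to the M0 and `k` the relative tolerance; the function name and the default value of `k` are assumptions made for illustration.

```python
def isotopologue_intensity_ok(int_m0, int_iso, ratio, k=0.2):
    """Check Int(M0)*ratio*(1-k) < Int(iso) < Int(M0)*ratio*(1+k)."""
    expected = int_m0 * ratio
    return expected * (1 - k) < int_iso < expected * (1 + k)
```

With an M0 intensity of 1,000,000 and an illustrative ratio of 0.227, a measured isotopologue intensity of 230,000 lies within the window, whereas 500,000 does not.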
[0085] Then an additional verification based on cosine similarity comparison can optionally be performed, which is defined as:
[0086] Said verification can be performed in the following manner:
[0087] The annotation database 12 is searched for the entries corresponding to the conditions to be compared (e.g., M0 compared to an isotopologue M1) in the RT interval of the region of interest being analysed for the M0.
[0088] Those entries in each set which share the retention time RT are selected (in other words, entries of the two conditions M0 and M1 which have been found in one same scan, i.e., at the same instant in time RT).
[0089] If there are enough entries (e.g., more than 5, in order to avoid false positives when N is small), cosine similarity is calculated (with I=<i1,i2,i3 . . . iN> and J=<j1,j2,j3 . . . jN>, the vectors of the intensities of the two conditions to be compared):
Cos=(i1j1+i2j2+ . . . +iNjN)/(module(I)*module(J))
wherein module(I) and module(J) denote the Euclidean norms of the vectors I and J.
[0090] If Cos>k (e.g., k=0.99), then it is determined that an isotopologue has been found, and which one it was is recorded.
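The cosine similarity verification can be sketched as follows, assuming that the annotations of each condition are available as a mapping from scan retention time to intensity, so that shared scans are found by key intersection. The function name, the data layout and the parameter `min_shared` (set to 6 to reflect "more than 5") are hypothetical.

```python
import math

def cosine_on_shared_scans(m0, iso, min_shared=6):
    """Cosine similarity of M0 and isotopologue intensities on shared scans.

    m0, iso: dicts mapping scan retention time -> intensity.
    Returns None when there are too few shared scans to decide.
    """
    shared = sorted(set(m0) & set(iso))
    if len(shared) < min_shared:
        return None
    i = [m0[rt] for rt in shared]
    j = [iso[rt] for rt in shared]
    dot = sum(a * b for a, b in zip(i, j))
    norm = math.sqrt(sum(a * a for a in i)) * math.sqrt(sum(b * b for b in j))
    return dot / norm if norm else None
```

An isotopologue would then be recorded as detected when the returned value is not None and exceeds the chosen threshold k. Two intensity traces that differ only by a constant factor give a cosine of 1 irrespective of that factor, which is why this verification complements, rather than replaces, the intensity window check.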
[0091] The search 162 for data points corresponding to an isotopologue with a certain formula and adduct can be performed by consulting the annotation database 12, which can include annotations of the isotopologues (M1, M2, . . . ), in addition to the annotations of the formulas/adducts (M0). To this end, when performing an annotation 106 of a formula/adduct (M0), the existence of a data point with a mass-to-charge ratio corresponding to an isotopologue (M1, M2, . . . ) and an intensity close to the theoretical intensity is verified, and in that case the annotation of the isotopologue is performed. Alternatively, the search 162 for isotopologues can be performed directly in the mass spectrum 1 (since the instant in time RT and the mass-to-charge ratio to be searched for are known).
[0092] The isotopologues the presence or absence of which must be determined for each formula and/or adduct annotated can be specified in the molecular formula database 10, which may include, for example, the isotopologues to be considered for each formula and/or adduct (for example, the main isotopologues M1 and M2 of each formula/adduct M0) and the corresponding theoretical mass-to-charge ratio (m/z).sup.T thereof. The molecular formula database 10 may also include the theoretical abundance ratio of the isotopologue. In one embodiment, the isotopologues which can theoretically be detected are determined based on the mass resolution of the spectrum in the mass-to-charge ratio m/z range analysed; this makes it possible to adjust, for each M0, the space of isotopologues that the mass spectrometer can detect depending on the resolution of the equipment. The information related to the isotopologues can be included for example in an isotopologue database, wherein the composition of the isotopologues (M1, M2, . . . ) detectable with the mass spectrometer, the mass-to-charge ratio m/z with respect to the M0 and the abundance ratio are stored.
[0093] Therefore, the method makes it possible to calculate the isotopic pattern of each formula and to differentiate which isotopologues are detectable by the apparatus given the intensity ratio with respect to the M0 and the resolution of the mass spectrometer. The method to determine if the peaks of the calculated isotopologues are separable depends on the mass analyser used (as explained for example in the document “Orbitrap Mass Spectrometry”, Zubarev et al., Analytical Chemistry 2013, 85 (11), pages 5288-5296). In the case of Orbitrap analysers, the resolution is inversely proportional to the square root of the m/z, and therefore it can be calculated mathematically. In the case of FTICR analysers, the resolution is inversely proportional to the m/z, for which reason it can also be calculated mathematically. In contrast, the resolution in TOF analysers is independent from the m/z, for which reason the resolution at each m/z is obtained by means of a calibration curve.
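The scaling relations above can be illustrated with the following sketch. It assumes that the resolution is specified at a reference mass-to-charge ratio (taken here as m/z 200, an assumption of this example) and treats two isotopologue peaks as separable when their spacing exceeds the peak width m/z divided by the resolution, which is a simplification. The function names are hypothetical, and TOF analysers are deliberately left out because they require a calibration curve.

```python
import math

def resolution_at(mz, analyser, r_ref, mz_ref=200.0):
    """Scale a reference resolution r_ref (given at mz_ref) to another m/z."""
    if analyser == "orbitrap":
        # Resolution inversely proportional to the square root of m/z.
        return r_ref * math.sqrt(mz_ref / mz)
    if analyser == "fticr":
        # Resolution inversely proportional to m/z.
        return r_ref * mz_ref / mz
    raise ValueError("a calibration curve is needed for this analyser")

def separable(mz_a, mz_b, resolution):
    """Simplified criterion: peak spacing larger than the peak width m/z / R."""
    mz_mid = (mz_a + mz_b) / 2
    return abs(mz_a - mz_b) > mz_mid / resolution
```

For example, two M1 isotopologues spaced by about 0.0063 near m/z 376 would be separable under this criterion with an Orbitrap resolution of 140,000 at the reference m/z, but not with 35,000.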
[0094]
[0098] In the example of
[0099]
[0100] Where several formulas/adducts overlap for a given mass-to-charge ratio (m/z), the number of isotopologues associated with one same formula that have been detected by the method can be used to prioritise one candidate formula over another, providing relevant information about which compound may be present even before performing the tandem mass spectrometry.
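The prioritisation described above can be sketched as a simple ordering of the overlapping candidates by the number of isotopologues detected for each. The tuple layout and the placeholder formula names are assumptions of this example.

```python
def prioritise(candidates):
    """Order overlapping candidates by number of detected isotopologues.

    candidates: list of (formula, adduct, detected_isotopologues), where
    detected_isotopologues is a list such as ["M1", "M2"].
    """
    return sorted(candidates, key=lambda c: len(c[2]), reverse=True)

ranked = prioritise([
    ("formula_A", "+H", ["M1"]),
    ("formula_B", "+Na", ["M1", "M2"]),
])
```

In this illustration the candidate with two detected isotopologues is ranked first; candidates with equal counts keep their original relative order.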