Alignment of breath sample data for database comparisons

Abstract

A method for synchronizing data for gas samples with volatile organic compounds. The data includes chromatographic data indicative of molecule retention times. The method includes identifying or selecting marker molecules and clustering the plurality of gas samples into a plurality of clusters according to a clustering criterion. Next, a first correction of retention time deviations is performed on the data for the gas samples between clusters by using the marker molecules as anchor points to provide a coarse reduction of retention time deviations between the data. Finally, a second correction of retention time deviations is performed on the data, so as to further reduce retention time deviations between the data. The method reduces significant retention time deviations to allow, e.g., breath sample fingerprints obtained by different equipment at different times to be compared in one database for use on a digital platform.

Claims

1. A computer implemented method for synchronizing data for a plurality of gas samples with volatile organic compounds, the method comprising receiving, for each of the plurality of gas samples, chromatographic data indicative of molecule retention times, identifying at least one marker molecule in the chromatographic data for each of the plurality of gas samples, clustering the plurality of gas samples into a plurality of clusters according to a clustering criterion, performing a first correction of retention time deviations on the data for the plurality of gas samples between clusters by using the at least one marker molecule as anchor points, so as to reduce retention time deviations between the data for the plurality of gas samples, and performing, after said first correction, a second correction of retention time deviations on the data for the plurality of gas samples, so as to further reduce retention time deviations between the data for the plurality of gas samples.

2. The method according to claim 1, wherein the step of identifying at least one marker molecule comprises detecting intensity peaks in the chromatographic data indicative of molecule retention times.

3. The method according to claim 1, wherein the step of identifying at least one marker molecule comprises identifying 5-20 marker molecules.

4. The method according to claim 1, wherein the step of identifying at least one marker molecule comprises selecting at least two marker molecules which have retention times differing more than 200 seconds.

5. The method according to claim 1, wherein the at least one marker molecule comprises at least one molecule selected from: Acetone, Isoprene, Ethylacetate, Benzene, Pentanal, Methylcyclohexane, Toluene, Octane, Styrene, α-pinene, Propylbenzene, Phenol, α-methylstyrene, and d-limonene.

6. The method according to claim 5, wherein the at least one marker molecule comprises at least Benzene and Toluene selected as marker molecules.

7. The method according to claim 1, wherein the step of identifying at least one marker molecule comprises identifying at least one marker molecule which is present only in a subset of the plurality of gas samples.

8. The method according to claim 1, wherein the step of clustering is performed according to a clustering criterion involving retention times for the at least one marker molecule in the plurality of gas samples.

9. The method according to claim 1, wherein the step of clustering is performed according to a clustering criterion involving information about the plurality of gas samples.

10. The method according to claim 1, wherein the step of performing the first correction comprises calculating a polynomial fitting function, on retention times of the at least one marker molecule.

11. The method according to claim 1, wherein the step of performing the first correction comprises iteratively identifying the at least one marker molecule and subsequently performing retention time corrections, until a predetermined stop criterion is met.

12. The method according to claim 1, further comprising receiving, for each of the plurality of gas samples, mass spectrometric data, and analyzing said mass spectrometric data to identify molecules in the gas samples.

13. A computer program product comprising computer executable program code which, when executed on a processor, causes the processor to synchronize data for a plurality of gas samples with volatile organic compounds, comprising receiving, for each of the plurality of gas samples, chromatographic data indicative of molecule retention times, identifying at least one marker molecule in the chromatographic data for each of the plurality of gas samples, clustering the plurality of gas samples into a plurality of clusters according to a clustering criterion, performing a first correction of retention time deviations on the data for the plurality of gas samples between clusters by using the at least one marker molecule as anchor points, so as to reduce retention time deviations between the data for the plurality of gas samples, and performing, after said first correction, a second correction of retention time deviations on the data for the plurality of gas samples, so as to further reduce retention time deviations between the data for the plurality of gas samples.

14. A breath analysis system comprising: a device arranged to receive, for each of a plurality of gas samples obtained as breath exhaled from a subject, chromatographic data indicative of molecule retention times, and a processor programmed to: (a) synchronize data for a plurality of gas samples with volatile organic compounds, comprising receiving, for each of the plurality of gas samples, chromatographic data indicative of molecule retention times, identifying at least one marker molecule in the chromatographic data for each of the plurality of gas samples, clustering the plurality of gas samples into a plurality of clusters according to a clustering criterion, performing a first correction of retention time deviations on the data for the plurality of gas samples between clusters by using the at least one marker molecule as anchor points, so as to reduce retention time deviations between the data for the plurality of gas samples, and performing, after said first correction, a second correction of retention time deviations on the data for the plurality of gas samples, so as to further reduce retention time deviations between the data for the plurality of gas samples, and (b) subsequently analyze the chromatographic data for the plurality of gas samples in accordance with an analysis algorithm, and to provide an output accordingly.

15. The system according to claim 14, further comprising a chromatographic analyzer arranged to receive the plurality of gas samples obtained as breath exhaled from the subject, and to provide chromatographic data indicative of molecule retention times, for each of the plurality of gas samples accordingly.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which

(2) FIG. 1 illustrates a block diagram of a breath analysis system embodiment,

(3) FIG. 2 illustrates steps of a retention time alignment method embodiment,

(4) FIG. 3 illustrates an example of a mass spectrum for Toluene,

(5) FIG. 4 illustrates a graph with example retention time shifts as a function of retention time for selected molecules detected in gas samples from different batches obtained at different periods,

(6) FIG. 5 illustrates a graph with retention time shifts as a function of retention time for the same molecules and batches as in FIG. 4, but corrected according to the first time correction of the invention,

(7) FIG. 6 illustrates a graph with retention time shifts as a function of retention time for the same molecules and batches as in FIG. 5, but now corrected also according to the second time correction of the invention, and

(8) FIGS. 7a-7c illustrate graphs showing signature fragment of Toluene as a function of retention time for different batches of gas samples. FIG. 7a shows before time alignment, FIG. 7b shows after the first time correction, and FIG. 7c shows the final result after the second time correction.

DESCRIPTION OF EMBODIMENTS

(9) FIG. 1 illustrates a block diagram of a breath analysis system embodiment with a digital platform DP comprising a computer or server, e.g. the cloud based health suite digital platform HSDP, incorporating breath sample B_S analysis data for clinical information CL_I e.g. to assist in diagnosing diseases based on volatile organic compounds in the gas sample B_S. The DP involves processing software implementing the synchronization method RT_A for retention time alignment according to the first aspect of the invention. The DP system may combine and compare breath fingerprints collected over time and measured on different machines. The approach may cover other metabolomics methods using MS or selective detection such as LC-MS data.

(10) Based on a gas sample B_S obtained as a sample of breath collected from a subject, the gas sample B_S is analyzed in an analyzer preferably comprising a GC column device GCC. The GC column device GCC may be a GC-MS analyzer, as known in the art, and the output data GCD preferably comprises mass spectrography data in addition to the chromatography data. Alternatively, the analyzer may be a LC-MS analyzer, as also known in the art.

(11) The output from the analyzer GCC is chromatographic data GCD, which is applied to the DP. The DP is arranged to receive, for each of a plurality of gas samples B_S, chromatographic data GCD indicative of molecule elution times. A processor in the DP is programmed to perform the retention time synchronization or alignment method RT_A according to the invention, to subsequently analyze the chromatographic data for the plurality of gas samples B_S in accordance with a further analysis algorithm F_A, and to provide a clinical information output CL_I accordingly. Such further analysis F_A is known in the art and will not be described further, since it is outside the scope of the present invention. However, the retention time correction algorithm RT_A according to the invention provides higher-quality data for such further analysis algorithms F_A and thereby allows higher-quality clinical information CL_I for detection of diseases and, e.g., other information of medical interest.

(12) FIG. 2 illustrates steps of a retention time synchronization method embodiment, i.e. an embodiment of the method to be implemented as the retention time alignment algorithm RT_A in software in the DP in the system shown in FIG. 1. The method comprises receiving R_GCD, for each of the plurality of gas samples, chromatographic data indicative of molecule elution times. Preferably, the input data for each gas sample also comprises mass spectrography data. As mentioned, the data may e.g. be in the form output by existing GC-MS analyzing equipment.

(13) As a first step in the processing algorithm, the method comprises identifying I_MM a plurality of marker molecules in the chromatographic data for each of the plurality of gas samples. Preferably, the marker molecules are identified as the so-called easily identifiable molecules (EIMs). This step is applied after peaks are detected in the GC-MS data, using for example matched filtration and peak identification, e.g. algorithms which can be found in the known XCMS software package. Then, molecules which are commonly present in most samples and which have a clearly identifiable mass spectrum are selected. These selected molecules then serve as marker molecules or anchor points. Preferably, marker molecules are selected which exhibit distinct mass spectra with distinct peaks. Molecules containing carbon rings (aromatics) generally have such spectra, while linear hydrocarbons do not. Examples are benzene (low abundance) and toluene (rather abundantly present and clearly identifiable due to the benzene ring by fragments m/z=91, 92). Additionally, each part of the retention time window needs to be represented by EIMs, such that time shifts in every part of the full time window can be corrected for. For a good result, 5-20 such EIMs, e.g. about 10, need to be selected based on the data in the available gas samples, and further based on input from an operator according to the operator's experience. The marker molecules are preferably identified as follows. Each EIM is expected to elute in a certain time window, characteristic for that molecule. Toluene, for example, typically elutes around 10 minutes. For each marker molecule the expected mass spectrum is known from the known databases. Alternatively, user libraries containing mass spectra from known compounds or standards can be used, or other large databases. FIG. 3 shows, as an example, the mass spectrum for toluene. Within the time window associated with the marker molecule, all mass spectra are compared to the known mass spectrum of the marker molecule. To calculate the similarity between mass spectra, the spectra are represented as vectors. The cosine of the angle between the vectors may be calculated using the dot-product function, and is used as a similarity measure. Such an algorithm provides a suitable similarity estimate between mass spectra.
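
The comparison described above can be sketched as follows. This is a minimal illustration only, not the implementation used in the embodiment: the function names, the dictionary representation of a spectrum, the similarity threshold of 0.9, and the example intensities are all assumptions introduced here.

```python
import math


def cosine_similarity(spectrum_a, spectrum_b):
    """Cosine of the angle between two spectra given as {m/z: intensity} dicts."""
    dot = sum(v * spectrum_b.get(mz, 0.0) for mz, v in spectrum_a.items())
    norm_a = math.sqrt(sum(v * v for v in spectrum_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spectrum_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_marker(scans, reference, window, min_similarity=0.9):
    """Return (retention_time, similarity) of the best scan in the window, or None.

    scans: list of (retention_time_seconds, spectrum) pairs for one sample.
    window: (start, end) of the expected elution window in seconds.
    min_similarity: hypothetical acceptance threshold, not taken from the text.
    """
    best = None
    for rt, spectrum in scans:
        if window[0] <= rt <= window[1]:
            score = cosine_similarity(spectrum, reference)
            if score >= min_similarity and (best is None or score > best[1]):
                best = (rt, score)
    return best


# Illustrative data only: a toluene-like reference dominated by m/z 91 and 92.
toluene_ref = {91: 100.0, 92: 60.0, 65: 12.0}
scans = [
    (590.0, {43: 80.0, 58: 100.0}),            # unrelated spectrum
    (604.0, {91: 50.0, 92: 31.0, 65: 5.0}),    # toluene-like, inside the window
    (900.0, {91: 100.0, 92: 60.0, 65: 12.0}),  # identical, but outside the window
]
match = find_marker(scans, toluene_ref, window=(540.0, 660.0))
```

Restricting the search to the expected elution window keeps a chemically similar compound that elutes elsewhere from being picked up as the marker.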

(14) Preferred candidates for EIM to be used in the alignment procedure, especially in case of gas samples being breath exhaled by a human, are given in the below table. It may be preferred to use at least Benzene and Toluene, but it may be preferred to include also one or more from the table with higher retention times.

(15) TABLE-US-00001

Molecule            Formula   Molecular mass   Signature mass fragment
Acetone             C3H6O     58               58
Isoprene            C5H8      68               67
Ethylacetate        C4H8O2    88               88
Benzene             C6H6      78               78
Pentanal            C5H10O    86               58
Methylcyclohexane   C7H14     98               70
Toluene             C7H8      92               92
Octane              C8H18     114              114
Styrene             C8H8      104              104
α-pinene            C10H16    136              136
Propylbenzene       C9H12     120              120
Phenol              C6H6O     94               94
α-methylstyrene     C9H10     118              118
d-limonene          C10H16    136              121

(16) Next, the method comprises clustering CL the plurality of gas samples into a plurality of clusters according to a clustering criterion. Preferably, the clustering is performed in accordance with the retention times of the EIMs. Additionally or alternatively, other information on the samples can be used for clustering, such as whether the samples are measured closely in time on the same analyzer machine. E.g. the gas samples may be measured in batches, resulting in small retention time deviations between the samples within each batch, and larger deviations between the samples in different batches.
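
The text leaves the clustering criterion open. One possible criterion based on EIM retention times is sketched below: samples are ordered by their mean marker shift relative to a reference, and a new cluster starts wherever two neighbouring shifts differ by more than a gap. The function name, the `max_gap` parameter, and the retention times are hypothetical and serve only as illustration.

```python
def cluster_by_marker_shift(marker_rts, reference_rts, max_gap=5.0):
    """Group samples whose mean marker retention-time shifts lie close together.

    marker_rts: {sample_id: {molecule: retention_time_seconds}}
    reference_rts: {molecule: retention_time_seconds} for a reference sample or batch.
    max_gap: a new cluster starts when consecutive sorted shifts differ by more
             than this many seconds (assumed value, not from the text).
    """
    shifts = []
    for sample_id, rts in marker_rts.items():
        common = [m for m in rts if m in reference_rts]
        if not common:
            continue  # no shared marker, so this sample cannot be placed
        mean_shift = sum(rts[m] - reference_rts[m] for m in common) / len(common)
        shifts.append((mean_shift, sample_id))
    shifts.sort()

    clusters = []
    previous = None
    for shift, sample_id in shifts:
        if previous is None or shift - previous > max_gap:
            clusters.append([])
        clusters[-1].append(sample_id)
        previous = shift
    return clusters


# Illustrative retention times in seconds for two markers.
reference = {"Benzene": 420.0, "Toluene": 600.0}
samples = {
    "s1": {"Benzene": 421.0, "Toluene": 602.0},
    "s2": {"Benzene": 422.0, "Toluene": 603.0},
    "s3": {"Benzene": 445.0, "Toluene": 640.0},
    "s4": {"Benzene": 446.0, "Toluene": 642.0},
}
clusters = cluster_by_marker_shift(samples, reference)
```

With batch information available, the same grouping could instead be taken directly from the batch labels.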

(17) FIG. 4 shows a graph with example data indicating an initial spread in retention time shift d as a function of retention time Rt for molecules from the above table. The solid lines indicate quadratic fits through the data points. The example is based on gas samples obtained in different batches, at different periods in time. One batch (the one labelled 20120314) is taken as a reference since it contains many samples and is measured about halfway in the total time covered by the batches. The retention time shifts d are seen to be rather large, ranging from 20 seconds at low retention times Rt to 60 seconds at large retention times Rt. Note also the large difference in retention time shift d between the four highest curves and all other curves. The upper curves are measured until a specific date, where the analyzing GC column was replaced by a new one, causing a completely different retention time Rt pattern in the rest of the measurements.

(18) The next step is performing P_C1 a first correction of retention time deviations on the data for the plurality of gas samples between clusters by using the marker molecules as anchor points, so as to reduce retention time deviations d between the data for the plurality of gas samples. This may be performed by fitting the retention times of the marker molecules (EIMs) with a linear or higher order polynomial function. Based on the fit, the first, coarse retention time correction is performed over the full retention time range. The identification of the marker molecules and the subsequent retention time correction can be performed iteratively until no improvement, or only improvement below a set threshold, is obtained.
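
A minimal sketch of such a polynomial correction is given below, assuming that the shift d of each marker is defined as its observed retention time minus its retention time in the reference cluster. The function names and the synthetic anchor points are assumptions for illustration. The fit is a plain least-squares solve written out with the standard library, and the iteration described above is omitted.

```python
def fit_shift_model(marker_rts, marker_shifts, degree=2):
    """Least-squares polynomial fit of shift versus retention time.

    Needs at least degree + 1 markers at distinct retention times.
    Returns a callable giving the estimated shift at any retention time.
    """
    scale = max(abs(rt) for rt in marker_rts) or 1.0  # rescale for numerical stability
    xs = [rt / scale for rt in marker_rts]
    n = degree + 1
    # Augmented normal equations [A^T A | A^T y] for the basis 1, x, x^2, ...
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, marker_shifts))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting, then back substitution.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(rows[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (rows[i][n] - tail) / rows[i][i]
    return lambda rt: sum(c * (rt / scale) ** k for k, c in enumerate(coeffs))


def apply_first_correction(retention_times, shift_model):
    """Subtract the fitted shift from every retention time in a cluster."""
    return [rt - shift_model(rt) for rt in retention_times]


# Illustrative anchor points: observed marker times in one cluster and their
# shifts (observed minus reference), generated here from a known quadratic.
observed = [120.0, 420.0, 600.0, 900.0, 1300.0]
shifts = [5.0 + 0.02 * rt + 1e-5 * rt * rt for rt in observed]
model = fit_shift_model(observed, shifts, degree=2)
corrected = apply_first_correction(observed, model)
```

Because the fitted function is evaluated at every retention time, peaks between the anchor points are corrected as well, which is why markers should cover the full retention time window.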

(19) FIG. 5 illustrates, for the same example data as in FIG. 4, the result of the first retention time correction. The retention time shifts d are now much lower, only about 10 seconds for the full retention time range Rt. The solid lines indicate linear fits through the data points.

(20) As the last step, after the first coarse correction of retention time deviations P_C1, the method comprises performing a second time correction P_C2 of retention time deviations on the data for the plurality of gas samples, so as to further reduce retention time deviations d between the data for the plurality of gas samples. A linear fit, or other fit function, can be made through the data points per batch, and used as the basis for performing the second retention time correction P_C2. The second retention time correction P_C2 may be performed by the time alignment algorithm known from the XCMS toolbox, another standard software package, or a similar algorithm. Specifically, it may be preferred that the second retention time correction P_C2 comprises first matching peaks across samples and grouping them together, after which ‘well behaved’ groups are identified. These peak groups contain very few samples which have no peaks assigned and very few samples which have more than one peak assigned. Because of these conditions, well behaved groups have a high probability of containing properly matched peaks. The alignment is performed by calculating the median retention time in each of those peak groups, and correcting all retention times accordingly. Since the well behaved peak groups are typically evenly distributed over a significant part of the retention time range, a detailed correction can be calculated for this range. The method is preferably applied iteratively. At each iteration cycle the peak grouping parameters are narrowed down until a satisfactory alignment is obtained.
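
The selection of well behaved groups and the median-based deviations can be sketched as below. This is an illustrative simplification and not the XCMS implementation: the function names, the thresholds `max_missing` and `max_extra`, and the example groups are assumptions. Turning the resulting anchors into a smooth correction over the whole retention time range, and the iteration with narrowing grouping parameters, are left out.

```python
from statistics import median


def well_behaved_groups(groups, sample_ids, max_missing=1, max_extra=1):
    """Keep groups where few samples lack a peak and few have more than one peak.

    groups: list of {sample_id: [retention_time_seconds, ...]} peak groups.
    """
    kept = []
    for group in groups:
        missing = sum(1 for s in sample_ids if len(group.get(s, [])) == 0)
        extra = sum(1 for s in sample_ids if len(group.get(s, [])) > 1)
        if missing <= max_missing and extra <= max_extra:
            kept.append(group)
    return kept


def deviations_from_median(groups, sample_ids):
    """For each sample, list (group_median_rt, sample_rt - group_median_rt) anchors."""
    anchors = {s: [] for s in sample_ids}
    for group in groups:
        # Only samples with exactly one peak in the group contribute.
        single = {s: peaks[0] for s, peaks in group.items() if len(peaks) == 1}
        if not single:
            continue
        centre = median(single.values())
        for s, rt in single.items():
            if s in anchors:
                anchors[s].append((centre, rt - centre))
    return anchors


# Illustrative peak groups for three samples.
sample_ids = ["a", "b", "c"]
groups = [
    {"a": [300.0], "b": [302.0], "c": [304.0]},   # well behaved
    {"a": [610.0, 615.0], "b": [], "c": []},      # two samples without a peak
    {"a": [900.0], "b": [903.0], "c": [906.0]},   # well behaved
]
kept = well_behaved_groups(groups, sample_ids)
anchors = deviations_from_median(kept, sample_ids)
```

Each sample's retention times would then be corrected by subtracting a deviation interpolated between its anchors.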

(21) FIG. 6 shows a graph of the example data after the second retention time correction P_C2 has been applied. As seen, the retention time deviations d are now further reduced.

(22) FIGS. 7a-7c illustrate that the final quality of the alignment can be inspected by looking at the behaviour of the signature mass fragments of the marker molecules. The intensity of signature fragment 92 of toluene is plotted against the retention time for three steps in the alignment procedure: FIG. 7a: raw, no alignment performed yet; FIG. 7b: after the first retention time correction; and FIG. 7c: after the second retention time correction. Different curves represent different gas samples. It can be seen that the peaks of the fragment align during the alignment procedure. In FIG. 7c the peaks are all aligned, showing that the procedure has worked successfully.
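
The visual inspection can be complemented by a simple number, for instance the spread of the peak apex of a signature fragment across samples. The sketch below is one assumed way of doing this; the function names and trace values are illustrative and do not come from the figures.

```python
def apex_retention_time(trace):
    """Retention time at maximum intensity in a trace [(rt_seconds, intensity), ...]."""
    return max(trace, key=lambda point: point[1])[0]


def apex_spread(traces):
    """Difference between the latest and earliest apex over all samples."""
    apexes = [apex_retention_time(trace) for trace in traces]
    return max(apexes) - min(apexes)


# Illustrative m/z 92 traces for two samples, before and after alignment.
raw_traces = [
    [(598.0, 10.0), (600.0, 90.0), (602.0, 20.0)],
    [(618.0, 5.0), (620.0, 70.0), (622.0, 15.0)],
]
aligned_traces = [
    [(598.0, 10.0), (600.0, 90.0), (602.0, 20.0)],
    [(599.0, 5.0), (601.0, 70.0), (603.0, 15.0)],
]
spread_before = apex_spread(raw_traces)
spread_after = apex_spread(aligned_traces)
```

A spread that shrinks from one alignment step to the next indicates that the fragment peaks are converging.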

(23) All the steps above preferably result in an ion-fragment peak table. Each row in such table corresponds with a sample. The first few columns contain sample and patient data, such as sample data, age, gender and illnesses. The remaining columns may contain the abundances of the peaks, or ion-fragments. Typically, there are a few thousand of those. This table serves as input for further statistical analysis.
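
The layout of such a table can be sketched as follows. The column keys, metadata fields, and values are hypothetical placeholders. Filling a missing peak with 0.0 is an assumption made here, not something the text specifies.

```python
def build_peak_table(samples, peak_columns):
    """One row per sample: metadata columns first, then one abundance per peak column."""
    header = ["sample_id", "age", "gender"] + list(peak_columns)
    rows = []
    for sample in samples:
        row = [sample["sample_id"], sample["age"], sample["gender"]]
        # Assumption: a peak not found in a sample is recorded as 0.0.
        row += [sample["abundances"].get(col, 0.0) for col in peak_columns]
        rows.append(row)
    return header, rows


# Made-up samples; each peak column is named by fragment and aligned retention time.
samples = [
    {"sample_id": "s1", "age": 54, "gender": "F",
     "abundances": {"m92@600": 1200.0, "m78@420": 85.0}},
    {"sample_id": "s2", "age": 61, "gender": "M",
     "abundances": {"m92@600": 950.0}},
]
header, rows = build_peak_table(samples, ["m78@420", "m92@600"])
```

Only after retention time alignment do the peak columns refer to the same compound in every row, which is what makes the rows comparable.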

(24) It is understood that the method may comprise or be used in connection with a further analysis of the data, e.g. diagnosing a disease based on a result of analysing exhaled breath from a subject according to the gas sample synchronizing method. The method may further comprise initiating a specific therapy, e.g. a medical treatment of Tuberculosis. Further, breath VOC analysis may be used for monitoring/analysis of lung cancer, breast cancer, other types of cancer, or respiratory infections. Also, breath analysis may be applicable for monitoring diseases such as asthma and Chronic Obstructive Pulmonary Disease (COPD), e.g. response to treatment or exacerbation monitoring. Furthermore, breath analysis may be applied for monitoring glucose level in diabetes. Still further, an application example may be monitoring for sepsis and necrotizing enterocolitis (NEC) based on VOC analysis of gas from feces in neonates.

(25) To sum up, the invention provides a method RT_A for synchronizing data for a plurality of gas samples, e.g. breath samples, with volatile organic compounds. The data comprises chromatographic data indicative of molecule elution times, and preferably also mass spectrography data. The method comprises identifying or selecting I_MM marker molecules, e.g. 5-20 molecules, preferably easily identifiable molecules for each of the plurality of gas samples, and clustering CL the plurality of gas samples into a plurality of clusters according to a clustering criterion, e.g. including additional information such as time of obtaining the data and/or analyzing equipment used. Next, a first correction of retention time deviations P_C1 is performed on the data for the gas samples between clusters by using the marker molecules as anchor points, so as to provide a coarse reduction of retention time deviations d between the data for the gas samples. Finally, a second correction of retention time deviations P_C2 is performed on the data for the gas samples, so as to further reduce retention time deviations d between the data for the gas samples, e.g. by using standard software packages. The method can reduce significant retention time deviations so as to allow e.g. breath sample fingerprints obtained by different equipment at different periods of time to be compared in one database for use on a digital platform DP such as the HSDP.

(26) While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive; the invention is not limited to the disclosed embodiments. Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.

CPC classification

A61B5/7285 (HUMAN NECESSITIES)
G01N30/8686 (PHYSICS)
G01N30/8631 (PHYSICS)
A61B2010/0087 (HUMAN NECESSITIES)
G01N2030/025 (PHYSICS)
A61B5/082 (HUMAN NECESSITIES)
A61B5/0004 (HUMAN NECESSITIES)
G01N30/7206 (PHYSICS)
G01N2033/4975 (PHYSICS)
G01N33/497 (PHYSICS)

International classification

G01N33/497 (PHYSICS)
A61B5/08 (HUMAN NECESSITIES)
G01N30/86 (PHYSICS)
G01N30/72 (PHYSICS)