Method of deconvolution of mixed molecular information in a complex sample to identify organism(s)
11177017 · 2021-11-16
Assignee
Inventors
- Olivier Pible (Bedarrides, FR)
- Jean Armengaud (Villeneuve-lez-Avignon, FR)
- François Allain (Agneaux, FR)
Cpc classification
G16B10/00
PHYSICS
International classification
Abstract
The present invention relates to methods to determine the identity of one or more organisms present in a sample (if these are already reported in a taxonomic database) or the identity of the closest related organism reported in a taxonomic database. The present invention does this by comparing a data set acquired by analyzing at least one component of the biological sample to a database, so as to match each component of the analyzed content of the sample to one or more taxon(s) and then collating the phylogenetic distance between each taxa and the taxon with the highest number of matches in the data set. A deconvolution function is then generated for the taxon with the highest number of matches, based on a correlation curve between the number of matches per taxon (Y axis) and the phylogenetic distance (X axis), the outcome of this function providing the identity of the organism or the closest known organism to it.
Claims
1. A method of identifying an organism in a sample, the method comprising: a) generating a data set concerning one component of the sample, said component being selected from the group consisting of peptide sequences and nucleic acid sequences; b) comparing, automatically by a computing device, the data set with a database of known data concerning the component and matching each member of the data set to one or more taxons; c) defining by a computing device (i) X values along an X axis as a phylogenetic distance, (ii) Y values along a Y axis as a number of matches per taxon, (iii) taxon k as the taxon with the highest number of matches in the data set, and (iv) signature function as a function modeling a contribution of taxon k to the number of matches per taxon observed for any taxon at a phylogenetic distance between the taxon and taxon k; d) calculating or assigning, by the computing device, the phylogenetic distance between each taxon and the taxon k; e) generating, by the computing device, a signature function for taxon k represented by a correlation curve; f) defining, by the computing device, an objective function selected from the group consisting of sum of squared errors, maximum of errors, and sum of absolute errors; said errors being calculated by subtracting Y values of the signature function relating to taxon k from the Y values assigned at step c) to each taxon; and g) minimizing, by the computing device, the objective function by fitting parameters of the signature function for taxon k, and comparing the objective function with a threshold to determine whether the sample comprises the organism represented by taxon k, the sample comprises an unknown organism, or the sample comprises at least one other detectable organism, wherein: (i) if the objective function is below the threshold and the signature function represented by the correlation curve intersects the Y axis with a negative slope, the sample comprises the organism represented by taxon k; (ii) if the objective function is below the threshold and the signature function represented by the correlation curve intersects the Y axis with a zero slope, the sample comprises the unknown organism which is distant from taxon k by the abscissa of the point of the signature function where the slope becomes negative; and (iii) if the objective function is above the threshold, the sample comprises the at least one other detectable organism.
2. The method of claim 1, wherein the component is a peptide sequence, the matches are peptide spectrum matches and the generation of the data set in step a) is performed by tandem mass spectrometry.
3. The method of claim 1 or 2, wherein the signature function in step c) is a monotonic decreasing function.
4. The method of claim 1, wherein the signature function in step c) has the formula:
0<X.sub.k<d.sub.k:Y.sub.k=N.sub.k×(A.sub.k×exp(−d.sub.k/d1.sub.k))+(1−A.sub.k)×exp(−d.sub.k//d2.sub.k))
d.sub.k<X.sub.k:Y.sub.k=N.sub.k×(A.sub.k×exp(−X.sub.kd1.sub.k)+(1−A.sub.k)×exp(−X.sub.kd2.sub.k) Formulae 1 wherein k is the taxon with the highest number of matches in the data set, N.sub.k is the number of matches attributed to taxon k, variable X.sub.k is the phylogenetic distance between each taxon from step b) and taxon k, Y.sub.k is the modelled number of matches due to taxon k attributed to the taxon at distance X.sub.k: from taxon k, exp( ) is the exponential function, A.sub.k is the percentage of the exponential tenn in the form exp(−X.sub.k/dl.sub.k), with the complement to 1 attributed to the second exponential term in the form exp(−X.sub.k/d2.sub.k); terms dl.sub.k and d2.sub.k are homogenous to distances representing components shared between taxa due to sequence conservation; d.sub.k represents the phylogenetic distance between the taxon in the sample and taxon k which is the closest taxon in the database.
5. The method of claim 4, wherein in step g), the objective function is minimized by fitting parameters, for each identified taxon k, selected from the group consisting of: N.sub.k, d.sub.k, A.sub.k, dl.sub.k and d2.sub.k.
6. A method of identifying several organisms in a sample, comprising performing the method of claim 1 and then repeating steps d.sub.p) to g.sub.p), d.sub.p) calculating or assigning a phylogenetic distance between each of said taxon(s) and a taxon k.sub.p with the highest positive error or the taxon with a number of specific matches; e.sub.p) generating a signature function for said taxon k.sub.p, said function being defined as a function modelling a contribution of said taxon k.sub.p to a number of matches per taxon observed for any taxon at said phylogenetic distance between said taxon and taxon k.sub.p, said number of matches per taxon defining Y values along a Y axis and said phylogenetic distance defining X values along a X axis; f.sub.p) defining an objective function selected from the group consisting of sum of squared errors, maximum of errors and sum of absolute errors, said errors being calculated by subtracting a sum of the signature functions relating to the taxa k to k.sub.p from the Y values assigned at step b) to each taxon; and g.sub.p) minimizing the objective function by fitting parameters of the signature function for taxa k to k.sub.p, and comparing the objective function value with a first threshold and/or comparing the objective function change upon repetition with a second threshold, until the objective function value is below a first threshold or the objective function changed upon repetition is below a second threshold.
7. The method of claim 1, wherein taxa are clades at a given taxonomical level.
8. A method of quantifying an organism in a sample, comprising: aa) identifying an organism in a sample by the method of claim 1, wherein an organism present in the sample is assigned to a taxon, so that there is a match between an organism and a taxon; bb) substituting the match within said taxon by quantifying the component associated with the match in the sample; cc) ordering the quantified matches from step bb) by quantity, from highest to lowest and selecting a subset; and dd) quantifying a taxon by a calculation based on the selected subset, using the sum, mean, or median of the subset.
9. A method of quantifying several organisms in a sample comprising: aa) identifying organisms in a sample according to the method of claim 6, wherein each group of organisms present in the sample is assigned to a taxon, so that there is a match between a group of organisms and a taxon; bb) substituting each match within said taxon by quantifying the component associated with the match in the sample; cc) ordering the quantified matches from step bb) by quantity, from highest to lowest and selecting a subset; and dd) quantifying a taxon by a calculation based on the selected subset, using the sum, mean, or median of the subset.
10. The method of claim 8 or claim 9 wherein in step bb) the component is peptide sequences, and the quantifying is performed by a method selected from the group consisting of a method using extracted Ion Chromatograms, a quantification method based on mass spectrometry data or liquid-chromatography data, a tandem mass spectrometry (MS/MS) total ion current and methods based on peptide fragments isolation and quantification selected from the group consisting of selected reaction monitoring (SRM), parallel reaction monitoring (PRM) and multiple reaction monitoring (MRivt).
11. The method of claim 1, wherein step b) is performed iteratively on an increasingly smaller number of taxa, wherein only the taxa identified after repetition of steps c) to g) are retained and then from within these retained taxon(s) further steps b) to g) are repeated at least one time.
Description
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
x<d:y=N*(A*exp(−d/a)+(1−A)*exp(−d/b))
x>=d:y=N*(A*exp(−x/a)+(1−A)*exp(−x/b)) .square-solid.Taxon Escherichia coli K-12 (83333) A=0.60 a=0.0008 b=0.0800 N=1146 d=0.0011 .square-solid.Taxon Yersinia pestis KIM10+(187410) A=0.60 a=0.0008 b=0.0800 N=1256 d=0.0013 Fit result: R{circumflex over ( )}2=0.9953
(13)
(14)
(15)
(16)
x<d:y=N*(A*exp(−d/a)+(1−A)*exp(−d/b))
x>=d:y=N*(A*exp(−x/a)+(1−A)*exp(−x/b)) .square-solid.Taxon Sphingomonas wittichii RW1 (392499) A=0.60 a=0.0008 b=0.0800 N=2834 d=0.0000 Fit result: R{circumflex over ( )}2=0.2010
(17)
x<d:y=N*(A*exp(−d/a)+(1−A)*exp(−d/b))
x>=d:y=N*(A*exp(−x/a)+(1−A)*exp(−d/b)) .square-solid.Taxon Sphingomonas wittichii RW1 (392499) A=0.60 a=0.0008 b=0.0800 N=2610 d=0.0012 .square-solid.Taxon Escherichia coli BL21 (DE3) (469008) A=0.60 a=0.0008 b=0.0800 N=1305 d=0.0011 .square-solid.Taxon Ruegeria pomeroyi DSS-3 (246200) A=0.60 a=0.0008 b=0.0800 N=780 d=0.0064 Fit result: R{circumflex over ( )}2=0.9957
(18)
x<d:y=N*(A*exp(−d/a)+(1−A)*exp(−d/b))
x>=d:y=N*(A*exp(−x/a)+(1−A)*exp(−x/b)) .square-solid.Taxon Sphingomonas wittichii (160791) A=0.60 a=0.0010 b=0.0800 N=2808 d=0.0017 .square-solid.Taxon Escherichia coli (562) A=0.60 a=0.0010 b=0.0800 N=1277 d=0.0013 .square-solid.Taxon Ruegeria pomeroyi (89184) A=0.60 a=0.0010 b=0.0800 N=844 d=0.0130 Fit result: R{circumflex over ( )}2=0.9757
(19)
x<d:y=N*(A*exp(−d/a)+(1−A)*exp(−d/b))
x>=d:y=N*(A*exp(−x/a)+(1−A)*exp(−x/b)) .square-solid.Taxon Sphingomonas (13687) A=0.60 a=0.0008 b=0.0800 N=3906 d=0.0082 .square-solid.Taxon Escherichia (561) A=0.60 a=0.0008 b=0.0800 N=1435 d=0.0012 .square-solid.Taxon Ruegeria (97050) A=0.60 a=0.0008 b=0.0800 N=675 d=0.0022 Fit result: R{circumflex over ( )}2=0.9636
1. MATERIALS AND METHODS
(20) 1.1 Preparation of Samples
(21) The biological sample is prepared for analysis of its whole protein content by tandem mass spectrometry, for example following the protocol described in (FR1354692) or using standard approach such as described in Mass Spectrometry: A Textbook (Springer, 2011).
(22) In a preferred protocol, the sample undergoes protein precipitation by trichloroacetic acid (10% final), centrifugation in an EPPENDORF centrifuge, removal of the supernatant, pellet dissolved into Laemmli buffer, SDS-PAGE electrophoresis but with a short migration time, excision of a polyacrylamide band containing the whole proteome, reduction and alkylation of cysteines by iodoacetamide, and enzymatic proteolysis by trypsin. The resulting peptides are then washed, concentrated and loaded onto a reverse-phase chromatography column coupled to a mass spectrometer for tandem mass spectrometry analysis.
(23) The sample processed in the experiment reported in
(24) The sample processed in the experiment reported in
(25) The sample processed in the experiment reported in
(26) The sample processed in the experiment reported in
(27) 1.2. NanoLC-MS/MS.
(28) Settings and conditions for analyzing the peptides by tandem mass spectrometry are described for the LTQ ORBITRAP XL (ThermoFisher) mass spectrometer, coupled to an UltiMate 3000 LC system (Dionex) equipped with a reverse-phase ACCLAIM PEPMAP100 C18μ-precolumn (5 μm, 100 Å, 300 μm i.d.×5 mm, Dionex-ThermoFisher) followed by a nanoscale ACCLAIM PEPMAP100 C18 capillary column (3 μm, 100 Å, 75 μm i.d.×15 cm, Dionex). Step 1 Load 1 to 10 μL (maximum volume allowed by the system) of the acidified peptide mixture and resolve over a 90 min linear gradient from 5 to 50% solvent B (0.1% formic acid, 80% acetonitrile in water) using a flow rate of 0.3 μL/min. The loading volume is adjusted as a function of the total current measured by the mass spectrometer to avoid saturating the detector. Step 2 Collect full-scan mass spectra over the 300 to 1,800 m/z range and MS/MS on the three most abundant precursor ions (minimum signal required set at 10,000, possible charge states: 2+), with dynamic exclusion of previously-selected ions (exclusion duration of 10 sec, one replicate).
1.3. Ms/Ms Assignments.
(29) The resulting RAW file recorded by the mass spectrometer contains MS spectra and MS/MS spectra for MS isotopic patterns corresponding to certain requirements (intensity above 10000, +2 charge). These requirements are associated with peptides that have a high probability of being identified from the corresponding MS/MS fragmentation spectrum. The RAW file is converted to MGF (Mascot Generic File) format using the extract msn.exe program (ThermoScientific), with options set as follows: 400 (minimum mass), 5,000 (maximum mass), 0 (grouping tolerance), 0 (intermediate scans), 10 (minimum peaks), 2 (extract MSn), and 1,000 (threshold). These options can be set in the Mascot Daemon software (version 2.3.02, Matrix Science) in the options of the Data import filter (ThermoFinnigan LCQ/DECA RAW file). The MGF file is then processed by the Mascot Server (version 2.3.02, Matrix Science, running on a 64-bits computer with an INTEL XEON CPU W3520 @2.67 GHz, RAM 24 GB), using search parameters set in the Mascot Daemon client. The database used is based on the most-updated NCBInr database to allow protein accession-to-organism Taxonomy ID (taxid) mapping. It can comprise the complete database or a non-redundant subset. Other parameters for MS/MS to peptide assignment are: maximum number of missed cleavages set at 2, 5 ppm for mass tolerance on the parent ion, 0.5 Da for mass tolerance on the product ions, carbamidomethylated cysteine residues as fixed modification, and oxidized methionine residues as variable modification. Decoy database search can be selected if a false discovery rate (FDR) needs to be calculated. The Mascot inference process results in a DAT file that can be used as an input for the organism identification program that the inventors have developed and named ptorg.ID.
(30) 1.4. Determination of the Number of PSMs Per Taxon with μOrg.ID.
(31) Databases: The NCBI nr database is downloaded weekly on the Mascot server from ftp://ftp.ncbi.nih.gov/BLAST/db/ in fasta format. This database is used both by Mascot for identifications and to create a BLAST formatted database for BLAST searches. The NCBI taxonomy database is also loaded weekly on the server from ftp://ftp.ncbi.nih.gov/pub/taxonomy/. Files used are gi_taxid_prot.dmp for gi to taxid mapping; nodes.dmp for taxonomy level and hierarchy; and names.dmp for taxa names.
(32) Python packages: The Mascot DAT files can be read using the msparser tool (Matrix Science). The Python version of msparser is used (v2.4.02) and interfaced with a complete package written in Python (v2.6.6).
(33) Additional packages to the Python installation include biopython (v1.55), 1×ml (v3.0.1), numpy (v1.6.2), scipy (v0.9.0), poster (v0.8.1), pysqlite2, tablib (v0.9.1), and ujson (v1.23). A Python library from ThermoScientific, msfilereader, can also be used to access RAW files (in which case python package comtypes (v0.6.2) is needed).
(34) 1.5. Determination of the Distances Between Taxa with μorg.ID:
(35) Software tools: clustalw.exe (v2.1), muscle.exe (v3.8.31), and BLAST (v2.2.27+) are installed on the Mascot server. The NCBInr fasta files are processed to create a BLAST database using the makeBLASTdb.exe utility with the options parse_seqids and hash_index set and the following parameters: prot (dbtype), gi_taxid_prot.dmp (taxid_map), NCBInr (title), and NCBInr (out).
(36) Python packages: In addition to the previous packages, dendropy (v3.12.0) is installed for this task.
(37) Supervector for phylogenetic distance estimation. The method chosen for distance estimation is based on Ciccarelli et al., which has been corrected by correction of taxa in accordance with current NCBI taxonomy to allow the calculation of phylogenetic distances based on protein information between any taxa from the NCBI database.
(38) 1.6. Signal Deconvolution With μorg.ID:
(39) Software tools: Curve fitting and signal deconvolution are performed in Excel (Microsoft Office 2010), using VBA macros and the solver for curve fitting (GRG non linear, default options). Alternatively, they can be performed with any tools efficient for mixture model analysis, including the evaluation of the number of components, for example using scipy functions, such as curvefit and the Levenberg-Marquardt algorithm, using the Jacobian matrix of the correlation function for curve fitting.
(40) 1.7. EXtracted Ion Chromatograms
(41) An in-house software written in Python was used to gather MS intensity information associated to each MS/MS spectrum. Python packages include msparser (Matrix Science) used to access Mascot DAT files and msfilereader (ThermoScientific) used to access RAW files. The ranges to collect intensity data associated to a PSM were 1.2 ppm for the m/z window compared to the peptide m/z, and 300 s compared to the retention time (RT) of the MS/MS considered. All MS scans, acquired at 1 Hertz, where processed with these m/z and RT ranges to extract the full XIC, and collect the maximum intensity value associated with each PSM.
(42) 1.8. DNA Data Analysis
(43) Unassembled whole genome sequencing (WGS) data were downloaded for a typical Escherichia coli sequencing on an IIlumina HISEQ 2000 sequencing system for ERR163875. A first random SRA subset (14,448 reads) was transformed to fasta format using the SRA toolkit (NCBI) to convert SRA to fastq, then the BioPython SeqIO.convert function was used to obtain a fasta file. This file was used as a query for a BLASTn search on NCBI nt with default parameters except for a E-value at 1e-20. A second subset of 14,428 reads was then selected on the double criteria: (i) 200 bp sequence with no undetermined nucleotide (N) and, (ii) at least one BLAST hit with a E-value below 1e-20. A Python script was written to associate organism information to each read using the “Hit_def” field from the XML BLAST output after minor “Hit_def” curation to homogenize species naming. The final output in Table VI also includes a numbering of reads only associated with the species taxa listed.
2. RESULTS
(44) 2.1 Definition of “Peptide Spectrum Matches” and Database Comparison.
(45) The first step of the procedure consists of the assignation of peptide sequences to the MS/MS spectra recorded by tandem mass spectrometry. For this, the inventors used the MASCOT software with standard parameters. The efficiency of MS/MS spectra assignment using the current peptide extraction protocol for samples corresponding to either Gram+ or Gram− bacteria is indicated in Table I by the assignment ratio. This ratio is the number of spectra assigned by Mascot to a peptide sequence from the database at a confidence value p below 0.05, also called Peptide Spectrum Matches (PSMs). On a small database extracted from NCBInr corresponding to the annotated proteome of a known organism, the ratio of assigned MS/MS spectra per recorded spectra in our experimental set-up varies from 63 to 71% at a confidence value of 95% (p<0.05) for data recorded for this specific organism. On the complete NCBInr database (release of Apr. 19, 2013), which contains over 5,000 times more amino acids, the ratio drops to 27 to 35% even at a confidence value of 90% (p<0.10) because of a higher number of peptides matching the m/z value of the parent solely by chance. For example, an average of 4,405 PSMs were obtained for Escherichia coli BL21(DE3) sample against the Escherichia coli BL21(DE3) database while only 1981 PSMs were obtained on the NCBInr database at the same p value threshold 0.05 (Table I). However, these values are indicative of the number of PSMs expected in the current process. Unless otherwise specified, data shown in this document have been collected on the complete NCBInr database and at a p-value below 0.10. (Note: alternatively, or as a post-treatment, a faster Mascot search will be obtained by selecting a subset of the NCBInr based on a unique taxon per genus, selected for being representative of the number of proteins in the genus and for good sequencing quality (complete status on NCBI genomes—www.ncbi.nlm.nih.gov/genome or GOLD—www.genomesonline.org). This reduced database allows a higher number of PSMs than with the whole NCBInr database in a first faster coarse-grained Mascot search, while retaining the capability to identify the different genera represented in the sample. In this scheme, once a genus is found in the sample in the first search, all leaf taxa from the corresponding family are gathered to build a second database including all possible strains or species in the sample. The analysis of this second Mascot search can be used fora fine-grain identification of the organisms in the sample).
(46) TABLE-US-00001 TABLE I Efficiency of MS/MS spectra assignments for pure organisms using a dedicated protein database extracted from the NCBInr or the complete protein NCBInr database. Dedicated DB NCBInr DB # PSMs % Assignments # PSMs % Assignments # PSMs % Assignments # MS/MS (p < 0.05) (p < 0.05) (p < 0.05) (p < 0.05) (p < 0.1) (p < 0.1) Escherichia coli BL21(DE3) 6722 4542 68% 2090 31% 2308 34% GRAM− 6437 4381 68% 1946 30% 2146 33% 6054 4292 71% 1906 31% 2101 35% Average 6404 4405 69% 1981 31% 2185 34% Standard deviation ±335 ±127 ±2% ±97 ±1% ±109 ±1% Bacillus cereus ATCC 14579 5297 3575 67% 1374 26% 1562 29% GRAM+ 5325 3373 63% 1309 25% 1454 27% Average 5311 3474 65% 1342 25% 1508 28% Standard deviation ±20 ±143 ±3% ±46 ±1% ±76 ±2%
2.2. Determination of Number of PSMs Per Taxon With μorg.ID.
(47) The next stage of the process is to attribute spectra to taxa, using PSMs from the Mascot search. Classical proteomics tries to maximize the protein inference confidence by means such as parsimony (a rule which attributes each spectrum only to the most probable protein, i.e., with the highest number of spectra or peptides attributed) or a minimum of 2 different peptides to validate a protein (see the journal “Molecular and Cellular Proteomics” current guidelines available at www.mcponline.org/site/misc/PhialdelphiaGuidelinesFINALDRAFT.pdf).
(48) The μorg.ID procedure does not involve the interpretation of data in terms of proteins, instead all the information available is used to estimate a quantitative representative of the contribution of an organism in a complex sample, possibly in a mixture. Conservation of tryptic peptide sequences is such that many sequences can be found in several organisms from different clades because they are in conserved regions of widespread conserved proteins. The co-occurrence of such conserved peptides is higher for closely-related organisms than far-related organisms. Table II shows this level of conservation for a specific dataset (E. coli BL21(DE3)) at the phylum level. A total of 14% of peptides found associated with the proteobacteria phylum are also found associated to the next phylum, streptophyta (266 amongst 1915), or more than 10% in another phylum, firmicutes. To be able to identify, but also and most importantly quantify, different organisms in a mixture, it is thus mandatory to exclude parsimony methods that will favor the most prominent organism detrimentally to the others, and to attribute each MS/MS spectrum to all possible corresponding taxa taking care of the contributions of conserved patterns in a post-treatment.
(49) TABLE-US-00002 TABLE II Peptide conservation between phyla for an E. coli BL21(DE3) pure sample, i.e., pure proteobacteria phylum (human contamination (due to keratins) very low, as shown by a low number of specific PSMs in the chordata phylum (48), validated by 3 specific PSMs). # # Tax Super- # Specific Pep- Name Taxid Rank kingdom PSMs PSMs tides Proteobacteria 1224 phylum Bacteria 2329 1370 1909 Streptophyta 35493 phylum Eukaryota 439 0 266 Firmicutes 1239 phylum Bacteria 301 0 201 Arthropoda 6656 phylum Eukaryota 271 0 146 Actinobacteria 201174 phylum Bacteria 118 1 75 Bacteroidetes 976 phylum Bacteria 96 0 61 Chordata 7711 phylum Eukaryota 48 3 26
(50) To enable fast association of PSMs to taxonomical information, a sqlite database is built using Python with 3 tables. The first table is called ‘gi2firstgi’ and associates each gi from the NCBInr with the first gi listed in the fasta file if they correspond to the same polypeptide sequence. Note that a gi is a unique identifier of a given polypeptide in the NCBI database. All gis associated with the same “first” gi thus share exactly the same polypeptide sequence, summarizing the non-redundancy information in the NCBInr database. This table is created from a parsing of all headers from the NCBInr database. The second table called ‘gi2taxid’ is created directly from the gi_taxid_prot.dmp NCBI taxonomy file and associates gis (master key) to taxids. The third table called ‘taxid2nbgis_nbseqs’ associates taxids (i) with the number of gis per taxid by querying each taxid in the second table (gi2taxid) and (ii) with the number of different sequences, using the first table (gi2firstgi) to identify redundant gis of identical sequence (one gi for the NCBI RefSeq and one for the Genbank accession, for example).
(51) To associate PSMs to taxa (see
(52) Proteins corresponding to the Protein DataBank are excluded because many structures are obtained with mutated sequences, resulting in abnormal taxon specific spectra.
(53) File nodes.dmp (taxonomic hierarchy and level information) is processed, and the spectrum-taxa information is reversed to obtain a list of spectra per taxon, including sub-taxon aggregation for Glade taxa. Spectra lists for each taxon are finally uniquified before counting spectra per taxon. In the case of bacteria, where the “no rank” level can correspond to strains, the aggregation of several sub-taxa of the same “no rank” level was performed, but sub-taxa were also listed in the same evaluation. This result corresponds to a table including both “no rank” and leaf taxa, which also allows the evaluation at the finest taxonomical level of bacterial leaf taxa only referenced at the species level.
(54) In addition to the number of PSMs per taxon, a counting of unique (or specific) PSMs is performed for each taxon, indicating the number of PSMs that are only associated to each given taxon. Table II lists a subset of such information at the phylum level and Table III at the “no rank+leaf” level, i.e., the most resolute level, for the same pure E. coli BL21(DE3) strain sample. Species Escherichia sp. 1_1_43 appears in Table III because it has no “no rank” sub-taxon. Taxon 37762 for E. coli B collects 2288 PSMs for only 97 sequences in the database because of its sub-taxon (taxid 413997).
(55) The post-process of a DAT file is very fast, and tabulated text files giving PSM counts per taxon at the “superkingdom”, “phylum”, “class”, “order”, “family”, “genus”, and “no rank” leaf taxonomic levels are generated in a few minutes. For example, in the search used for Tables II and III, 2,346 MS/MS spectra are assigned to at least one peptide, corresponding to a total of 103,872 different protein sequences in the NCBInr database (v2013/04/23), 1,386,243 different gi accessions associated with these sequences, and finally 19,904 different taxids. The complete process took about 7 minutes.
(56) TABLE-US-00003 TABLE III Numbering of PSMs per taxon at the most resolute level, i.e. the “no rank + leaf” level, and at the species level for selected taxa, for a an E. coli BL21(DE3) pure sample. Tax Super- # # # specific # Name Taxid Rank kingdom sequences MS/MS MS/MS peptides Escherichia coli 469008 no rank Bacteria 4380 2315 0 1321 BL21(DE3) Escherichia coli ‘BL21- 866768 no rank Bacteria 4156 2315 0 1321 Gold(DE3)pLysS AG’ Escherichia coli H489 656404 no rank Bacteria 4525 2301 0 1311 Escherichia coli B 37762 no rank Bacteria 97 2286 0 1306 Escherichia coli B str. 413997 no rank Bacteria 4144 2286 0 1306 REL606 Escherichia sp. 1_1_43 457400 species Bacteria 4478 2284 0 1298 Escherichia coli KTE197 1181743 no rank Bacteria 4336 2279 0 1295 Escherichia coli KTE51 1182658 no rank Bacteria 4908 2279 0 1295 Escherichia coli str. K-12 316407 no rank Bacteria 4355 2277 0 1297 substr. W3110 Escherichia coli 562 species Bacteria 695176 2325 15 1356 Escherichia sp. 1_1_43 457400 species Bacteria 4636 2284 0 1298 Bos taurus 9913 species Eukaryota 44032 5 2 2 Photorhabdus 291112 species Bacteria 4385 487 1 274 asymbiotica
2.3 Distances Between Taxa Evaluated With μorg.ID.
(57) The original method reported by Ciccarelli et al. to reconstruct a highly resolved tree of life relied on the selection of 31 Clusters of Orthologous Groups (COGs), which were conserved in all superkingdoms, and thus allowed the calculation of distance matrices between all known fully sequenced organisms at that time (year 2006). It was based on 191 taxa (23 archaea, 18 eukaryota, 150 bacteria), and resulted in a multiple sequence alignment (MSA) of supervectors obtained through a concatenation of the 31 COGs for a complete MSA of 8090 amino acid positions.
(58) This method was chosen for the phylopeptidomic approach as a starting point for the automation of the calculation of a phylogenetic distance between taxa for several reasons. First, the data used are similar to the data collected in tandem mass spectrometry experiments, i.e., partial protein sequences. With current instruments and complex samples, concentration ratios for visible peptides are limited by the dynamic range of the instruments, currently at about 4 orders of magnitudes. With protein concentration ratios ranging from 7 for bacteria to 10 orders of magnitudes for eukaryotes, only the MS/MS detectable peptides from the most abundant proteins are consistently analyzed, corresponding in general to more conserved proteins of the key cellular functions (such as proteins involved in translation) rather than proteins of higher specificity. A second reason is the availability of fast tools such as BLAST to identify COGs homologues in any given proteome. A third reason is the stringency of the method, where pre-existing alignment of COGs can serve as a frame, following some curations. A first synchronization of COGs sequences used in the alignment with current sequences and sequences-to-taxon associations has been performed, to match data reported in the NCBI taxonomy history files. In addition to this curation, 11 taxa have been removed from the reference alignment to suppress sequence redundancy.
(59) The methodology to add a new taxon to the root MSA is to identify the closest taxon (“reference taxon”) in the root MSA using NCBI taxonomy hierarchy information. COGs sequences for the reference taxon (COGs_fasta) are then queried using BLAST against the NCBInr filtered with the list of gis corresponding to the taxid to be added. This list of gis (gi_list) is easily extracted from the gi2taxid sqlite table for the new taxid. The BLASTp.exe utility is run with optimized parameters: COGs_fasta (query), gi_list (gilist), 1 (max_target_seqs), 9 (gapopen), 1 (gapextend), BLOSUM90 (matrix). Once identified, the gi representative of each COG in the new taxon is aligned against the reference COG using the MUSCLE software program. This alignment is characterized by a % coverage and a % identity index, and replaced by gaps if the result is obviously a bad match. For these conserved proteins, an identity percentage below 60% is used as a threshold to identify an adequately sequenced COG representative. Finally, residues and gaps for the reference COG are used as a template to define the portions of the new sequence to be added to the MSA. The supervector of the new taxon can thus be defined and added to the MSA.
(60) When all taxa are added, CLUSTALW is used to generate a phylip tree (.ph), using the neighbour-joining algorithm to construct the tree. Finally, a patristic distance matrix is extracted from the phylip tree, using for example the R program or the dendropy package in Python. Although clustalw is not a phylogeny tool, the distance error compared to PHYML or Phylip's PROTDIST are minor compared to the time taken for bootstrapping in more evolved methods.
(61) 2.4 Deconvolution of the PSMs Signals With μorg.ID.
(62) Correlation Function
(63) A plot of the number of PSMs assigned to each taxon against the phylogenetic distance between the taxon of highest number of PSMs and all others, using the methodology detailed above, demonstrates an excellent correlation between both variables, whatever the taxonomical rank considered.
Y=N×(A×e.sup.(-X/d1)+(1−A)×e.sup.(-X/d2)). Formula 0
(64) In this formula, N is the number of PSMs attributed to the taxon chosen as the reference for distances calculation (X-axis), A is the percentage of the exponential term in the form e.sup.−X/d1, with the complement to 1 attributed to the second exponential term in the form e.sup.−X/d2. The parameters of this function are fitted to minimize a convergence criterion, representative of the differences between data points and function.
(65) An improved model includes a function by parts, adapted to non-sequenced organisms. An additional parameter d is used, indicative of the phylogenetic distance between the best taxon found in databases, and the actual organism in the sample. The revised function is in the form:
0≤X<d:Y=N×(A×e.sup.(-d/d1)+(1−A)×e.sup.(-d/d2)).
d≤X:Y=N×(A×e.sup.(-X/d1)+(1−A)×e.sup.(-X/d2)). Formulae 1
(66) In
(67) Two important observations are drawn from the evaluation of the remaining signal after subtraction of a fitted function to the data points. In
(68) 2.5 Mixture of Organisms
(69) Data shown in
(70) For n taxa of indice k, the fit formula is:
(71)
(72) The fit in
Yc+Yr=Nc(Ac*e.sup.(-Xc/d1c)+(1−Ac)*e.sup.(-Xc/d2c))+Nr(Ar*e.sup.(-Xr/d1r)+(1−Ar)*e.sup.(-Xr/d2r)) where: Nc=number of deconvoluted PSMs corresponding to E. coli Xc=phylogenetic distance of each taxon relatively to taxid: 469008 Ac=fraction of the first exponential term for E. coli d1c=parameter for the first exponential term for E. coli d2c=parameter for the second exponential term for E. coli Nr=number of deconvoluted PSMs corresponding to R. pomeroyi Xr=phylogenetic distance of each taxon relatively to taxid: 246200 Ar=fraction of the first exponential term for R. pomeroyi d1r=parameter for the first exponential term for R. pomeroyi d2r=parameter for the second exponential term for R. pomeroyi
(73) Fitted values were, using Excel solver to minimize the objective function f=Y−Yc−Yr:
(74) Nr=1822, Nc=568, Ar=0.3091, d1r=0.0122, d2r=0.0885, Ac=0.527, d1c=0.0122, d2c=0.075.
(75) Deconvoluted number of PSMs, Nr and Nc, represent the number of spectra attributed to each organism, excluding the fraction of PSM resulting from the presence of other organisms in the mixture.
(76) The deconvolution process to identify taxa in a mixture sample can either be iterative or concurrent. At a given taxonomical level, the current data indicates that a generic value for parameter d1 can be found, and d2 and A might need to be slightly adjusted per Glade (for a given PSM confidence level). They have yet to be adapted to different values for different taxonomical levels.
(77) Function 2, by parts, can be used to deconvolute the signal of a non-sequenced organism.
(78) Data processing for
(79) The global process can be itemized in 8 stages following the prior sample preparation and processing on a LC-MS/MS instrument:
(80) 1. Association of spectrum to peptides using an external inference tool (such as the Mascot software program), and a standard or dedicated database (NCBInr or subset in the embodiment).
(81) 2. Association of peptides to taxa at the leaf level using a taxonomy database (NCBInr or subset in the embodiment).
(82) 3. Selection of a taxonomy level (genus in first loop depending on the complexity of the sample and the taxonomy level searched) and counting of spectra per taxon.
(83) 4. Selection of the most probable organism (taxon), and use of the (Total number of spectra per taxon)/(phylogenetic distance) correlation function to deconvolute the corresponding signal. The fitted parameters include the number of spectra per taxon.
(84) 5. Analysis of residual signal to identify remaining organisms in the sample, and loop to stage 4 until the residual signal is at the noise level.
(85) 6. Global fit with a taxon selection latitude.
(86) 7. Loop to stage 1 once, with a specific subset of the database (e.g. families containing identified genus) 8. Loop to stage 3 until lowest taxonomical level is reached depending on the objectives of the search (species for eucaryota, strain for bacteria for example).
(87) The deconvoluted amount of PSM is very similar to the number of spectral counts (SC) used in comparative proteomics to quantify ratios of proteins in different conditions.
(88) Mixtures of Escherichia coli BL21(DE3) and Ruegeria pomeryoi DSS-3 at three different ratios were used to exemplify relative quantification of mixtures. Table IV shows that ratios based on the deconvoluted number of spectra per taxon are correlated with ratios estimated by optical density measured at 600 nm prior mixing the organisms (an estimation of the number of cells or the protein-equivalent content for each strain could also be performed by other means, for example by Malassez cell counts, cell mass (wet or dry), or bradford quantitation of protein whole cellular extract). A more accurate relative quantification between the organisms can be performed by replacing each PSM by the intensity of the parent ion as detailed in Material and Methods. The XIC based quantification of a taxon can then be computed for example using the sum of a subset comprising the 100 most intense XICs associated to it, as shown in
(89) Any other mass spectrometry-based quantitative methods could also be applied to associate an intensity value to each PSM signal, such as the Total Ion current (TIC) of the MS/MS spectrum.
(90) TABLE-US-00004 TABLE IV Relative quantification of dilutions of E. coli BL21 to a reference organism: R. pomeroyi DSS-3, and comparison of OD ratios to the MS quantification methods using deconvoluted #PSMs and XIC values. % E. coli BL21/R. pomeroyi R. pomeroyi E.coli DSS-3, normalized with Mix R. pomeroyi E. coli DSS-3 BL21 the 1:1 measurement based DSS-3 BL21 Top100 Top100 based based based on OD # PSMs # PSMs XICs XICs on OD on #PSMs on XIC 1:1 781 1836 1.84E+09 2.99E+09 100% 100% 100% 1:0.5 1248 1237 2.21E+09 1.76E+09 50% 42% 49% 1:0.2 1822 568 2.62E+09 8.67E+08 20% 13% 20%
(91) Absolute quantification of pure samples is also within reach, but requires a calibration curve to correlate the taxon signal to a signal representative of a number of cells in the sample.
(92) This calibration depends on (i) the correspondence between number of cells and peptides amounts for different cell types, (ii) the protocol leading to a digested sample from a cell extract, (iii) the characteristics of the separation process before introduction in the mass spectrometer, (iv) the mass spectrometer capacity to process exhaustively all peptides ions at any elution stage (v) the mass spectrometer reproducibility for varying complex samples. A test was performed with a selected organism (E. coli BL21) for which two cell amounts were processed similarly. The two samples, M3 and M12, differed by a ratio of 10 in terms of Optical Density as measured at 600 nm. Results are shown in Table V. While spectral counts give a M12/M3 ratio of 1.4, the sum of the Top10 XICs gives a ratio of 10.3. Because the reference sample M3 corresponded to 2.5 E8 cells, then the M12 sample was estimated to contain 2.575 E9 cells based on the Top10 XICs method.
(93) TABLE-US-00005 TABLE V Quantification of 2 samples of E. coli BL21 of cells contents in a ratio of 1/10, as estimated by OD. Sum of Top10 XIC gives an accurate ratio estimate, thus absolute quantitation if a reference is known. Sum of Number of Sample n.sup.o. Eq. OD # PSMs Top10 XICs cells M3 (reference) 1 2255 5.90E+08 2.5 E8 cells M12 10 3063 6.08E+09 2.575 E9 Ratio M12/M3 10 1.4 10.3 10.3
2.6 Non-Sequenced Organisms
(94) Function 2, which is detailed above, can be used to analyze a sample containing organisms with unsequenced genomes or not present in the search database. Such condition was simulated using data from
(95) Interestingly, parameter d (0.0288) is close to the distance used to exclude taxa (0.0250), and is thus an indicator of the distance between the organism in the sample and the closest representative in the database used for fitting. The sensitivity of this indicator of phylogenetic distance has been assessed on the same sample by removing all Escherichia coli strains up to a distance of 0.0009 from strain E. coli BL21(DE3), corresponding to the first organism of a different species, namely Shigella flexneri 2a str. 2457T (taxid: 198215). This organism is from the same family as Escherichia coli, namely the Enterobacteriaceae family (Enterobacteriales order), and closely-related to Escherichia genus representatives. Setting this organism as the reference organism for distances selection from the distance matrix, the best fit is obtained with a parameter d of 0.0005, whereas d is found at 0 when the reference organism is the correct one for the sample. The method is thus sensitive enough to identify sequenced organisms at the strain level, but also to characterize if the identification of an unsequenced organism is correct even at the species level.
(96) 2.7 Confidence Level
(97) The most direct identification data is the number of spectra specific of a given taxon, at each taxonomical level. The dramatic increase of sequenced organisms leads to a lowering of this information to the noise level, in particular at the strain or even the species level. This is already the case for Escherichia coli strains, which identification at the strain level can no longer be performed relying only on this information as shown in Table III. The number of deconvoluted spectra per taxon is much more informative, and even essential in case of a mixture of organisms.
(98) Both information can however be confronted to estimate an identification confidence factor.
(99) Data available to evaluate the confidence level of an identification are: number of MS/MS in an experiment and total ion chromatogram (TIC) profile. This number is useful to characterize the sample in terms of complexity. number of PSMs at different expectation values for the spectrum to PSM inference (p-value of a match by chance using Mascot, see Table I). The ratio of assigned MS/MS spectra is modulated by sample quality, sample quantity, sample complexity, database suitability, mass spectrometer and reverse phase chromatography parameters and calibration, proteolysis efficiency and sample handling. number of PSMs per taxon, and number of specific PSMs per taxon for each taxonomical level. At a low resolution taxonomical level, the number of specific PSMs and thus the confidence level is high (Table II). For low specific PSMs numbers, a confidence level can be estimated from the search p-value and the number of specific PSMs. a search on a reduced database including all probable taxa can be performed to estimate the number of MS/MS spectra that are not associated with database sequences. This occurs either because taxon(s) are missing, or other reasons related to sample complexity such as: MS/MS spectra including a mixture of peptides, accuracy of MS masses degraded because of missing internal calibration. fit parameters: d in function 2 is directly indicative of database suitability deconvoluted numbers of PSMs and sum of deconvoluted numbers of PSMs at different taxonomical levels can be confronted fit estimator: the fit quality can be used to assign a probability of correct assignment. For example, in a mixture model, the number of components (number of taxa in a mixture at a given taxonomical model) can be evaluated.
(100) An example is given hereafter with data corresponding to
(101) 2.8 DNA or RNA Nucleotide Sequence Data
(102) To apply the proposed method to DNA sequencing data or RNA sequencing data, the inventors chose to process Sequence Read Archive (SRA) data from a randomly selected Escherichia coli Whole Genome Shotgun (WGS) project. Data shown was obtained for a subset of ERR163875.sra file, processed as detailed in Material and Methods. The length of reads was 200 base pairs, which is currently the higher range of reads for Next-generation Illumina HISEQ or Ion Torrent sequencing technologies (Shokralla, 2012), and thus the most informative.
(103) After matching components (here DNA reads) to taxa using blast searches, the inventors examined if the component-taxon relation was 1 to n, as indicated in
(104) A correlation function between DNA read counts per taxon and taxa distances can thus be established, and used to deconvolute species in a mixture of organisms, satisfying all requirements set forward in
(105) TABLE-US-00006 TABLE VI Number of SRA reads associated with top 10 species taxa, for a subset of 14428 reads from a WGS sequencing of a pure Escherichia coli sample. Species name Taxid # Reads # specific Reads Escherichia coli 562 14421 473 Shigella sonnei 624 11604 0 Shigella boydii 621 11276 0 Shigella flexneri 623 11260 0 Shigella dysenteriae 622 10422 0 Escherichia fergusonii 564 6561 0 Salmonella enterica 28901 2369 0 Enterobacter cloacae 550 2180 0 Citrobacter koseri 545 1973 0 Citrobacter rodentium 67825 1774 0
(106) The recent development of RNA-seq allows performing such identification and quantification of organisms present in mixtures taking RNA as starting material for next-generation sequencing of the nucleotide sequences.
(107) 2.9 Sample with a Mixture of 2 Closely-Related Organisms: Escherichia coli and Yersinia pestis, Both from the Enterobacteriaceae Family
(108) The proteins from the mixture of the two organisms were extracted, proteolyzed with trypsin, and the resulting peptides analyzed by tandem mass spectrometry.
(109) The NCBI nr database used for data assignation in all this document is dated Feb. 7, 2014 and contains 35,149,712 different protein sequences. The corresponding NCBI taxonomy files are dated Feb. 7, 2014 and correspond to a total of 1,176,883 taxa at all levels. Among these, 12,235 taxa have more than 500 associated protein sequences.
(110) When ordering the taxa by the number of PSMs (decreasing order) as shown in
x<d:y=N*(A*exp(−d/a)+(1−A)*exp(−d/b))
x>=d:y=N*(A*exp(−x/a)+(1−A)*exp(−x/b)),
the fitted parameters where d and N for each signature in the fit, and the fit stopping condition was: (R.sup.2i+1−R.sup.2i)/R.sup.2i<0.0005.
2.10. Sample with a Mixture of 3 Different Organisms: Sphingomonas wittichii (Alpha-proteobacteria class, Sphingomonadales Order), Escherichia coli (Gamma-Proteobacteria Class), and Ruegeria pomeroyi (Alpha-Proteobacteria Class, Rhodobacterales Order)
(111) The proteins from the mixture of the three organisms were extracted, proteolyzed with trypsin, and the resulting peptides analyzed by tandem mass spectrometry. When ordering the taxa by the number of PSMs (decreasing order) as shown in
BIBLIOGRAPHY
(112) Ciccarelli, F. D., Doerks, T., von Mering, C., Creevey, C. J., Snel, B. and Bork, P. (2006) Toward automatic reconstruction of a highly resolved tree of life. Science, 311(5765):1283-1287. Dworzanski, J. P., Deshpande, S. V., Chen, R., Jabbour, R. E., Snyder, A. P., Wick, C. H. and Li, L. (2006) Mass spectrometry-based proteomics combined with bioinformatic tools for bacterial classification. Journal of Proteome Research, 5(1):76-87. Dworzanski, J. P., Dickinson, D. N., Deshpande, S. V., Snyder, A. P. and Eckenrode, B. A. (2010) Discrimination and Phylogenomic Classification of Bacillus anthracis-cereus-thuringiensis Strains Based on LC-MS/MS Analysis of Whole Cell Protein Digests. Analytical Chemistry, 82(1):145-155. Dworzanski, J. P., Snyder, A. P., Chen, R., Zhang, H. Y., Wishart, D. and Li, L. (2004) Identification of bacteria using tandem mass spectrometry combined with a proteome database and statistical scoring. Analytical Chemistry, 76(8):2355-2366. Jabbour, R. E., Deshpande, S. V., Wade, M. M., Stanford, M. F., Wick, C. H., Zulich, A. W., Skowronski, E. W. and Snyder, A. P. (2010) Double-Blind Characterization of Non-Genome-Sequenced Bacteria by Mass Spectrometry-Based Proteomics. Applied and Environmental Microbiology, 76(11):3637-3644. Shokralla, S., Spall, J. L., Gibson, J. F. and Hajibabaei, M. (2012) Next-generation sequencing technologies for environmental DNA research. Molecular Ecology, 21(8):1794-1805.