Methods Associated With A Database That Stores A Plurality Of Reference Genomes
20180330044 · 2018-11-15
Assignee
Inventors
Cpc classification
G16B50/00
PHYSICS
G16B10/00
PHYSICS
International classification
Abstract
Methods are provided of using a database that stores a plurality of reference genomes and phylogenetic information which relates the stored reference genomes to each other in a phylogenetic structure. These methods are useful in analysing the bacteria and/or bacterial lineages present in a sample and to identify a bacterium for use in therapy.
Claims
1. A method of using a database that stores a plurality of reference genomes and phylogenetic information which relates the stored reference genomes to each other in a phylogenetic structure, the method including: using a plurality of sequence reads obtained from a sample to count the number of sequence reads deemed to uniquely map to each of a plurality of lineages and/or reference genomes within the phylogenetic structure; for each of the plurality of lineages and/or reference genomes to which at least one sequence read has been deemed uniquely mapped, normalizing the number of sequence reads counted as being deemed uniquely mapped to the lineage or reference genome using a measure that reflects the uniqueness of the lineage or reference genome so as to obtain an indication of the relative abundance of the lineage or reference genome within the sample.
2. A method according to claim 1, wherein the method includes, for each of the plurality of lineages and/or reference genomes, determining a measure that reflects the uniqueness of the lineage or reference genome or a precursor of such a measure by: for the/each reference genome: identifying one or more genetic sequences deemed to uniquely identify the reference genome; determining a measure that reflects the uniqueness of the reference genome or a precursor of such a measure based on the one or more genetic sequences deemed to uniquely identify the reference genome; for the/each lineage: identifying one or more genetic sequences deemed to uniquely identify the lineage; determining a measure that reflects the uniqueness of the lineage or a precursor of such a measure based on the one or more genetic sequences deemed to uniquely identify the lineage.
3. A method according to claim 2, wherein identifying one or more genetic sequences deemed to uniquely identify each of the plurality of lineages and/or reference genomes includes, for each reference genome in the database: defining a plurality of segments, each segment containing a different genetic sequence from the reference genome; for the genetic sequence contained in each segment: comparing the genetic sequence contained in the segment with the majority of the genetic content of all other reference genomes in the database to establish whether the segment maps to any of the other reference genomes; if the genetic sequence contained in the segment maps to no other reference genome in the database, identifying the genetic sequence contained in the segment as being deemed to uniquely identify the reference genome; if the genetic sequence contained in the segment maps to one or more other reference genomes in the database, and if it is determined using the phylogenetic information that the genetic sequence contained in the segment maps to at least a majority of the reference genomes in a lineage, identifying the genetic sequence contained in the segment as being deemed to uniquely identify the lineage.
4. A method according to claim 2, wherein identifying one or more genetic sequences deemed to uniquely identify each of the plurality of lineages and/or reference genomes includes, for each reference genome in the database: defining a plurality of segments, each segment containing a different genetic sequence from the reference genome; for the genetic sequence contained in each segment: comparing the genetic sequence contained in the segment with the entirety of the genetic content of all other reference genomes in the database to establish whether the segment maps to any of the other reference genomes; if the genetic sequence contained in the segment maps to no other reference genome in the database, identifying the genetic sequence contained in the segment as being deemed to uniquely identify the reference genome; if the genetic sequence contained in the segment maps to one or more other reference genomes in the database, and if it is determined using the phylogenetic information that the genetic sequence contained in the segment maps to all of the reference genomes in a lineage and to no other reference genomes in the database, identifying the genetic sequence contained in the segment as being deemed to uniquely identify the lineage.
5. A method according to any one of claims 2 to 4, wherein identifying one or more genetic sequences deemed to uniquely identify each of the plurality of lineages and/or reference genomes is performed before the sequence reads are obtained from the sample.
6. A method according to any one of claims 3 to 5, wherein the plurality of segments defined for each reference genome have a predetermined length, and include each possible segment of that length that could be defined for the reference genome.
7. A method according to any previous claim, wherein using a plurality of sequence reads obtained from a sample to count the number of sequence reads deemed to uniquely map to each of a plurality of lineages and/or reference genomes within the phylogenetic structure includes: for the/each reference genome: comparing the plurality of sequence reads with the reference genome to establish whether any sequence reads map to the reference genome and to no other reference genome stored in the database; if a sequence read maps to the reference genome and to no other reference genome stored in the database, counting the sequence read as being deemed to uniquely map to the reference genome; for the/each lineage: comparing the plurality of sequence reads with one or more genetic sequences deemed to uniquely identify the lineage to establish whether any sequence reads map to any of the identified one or more genetic sequences; if a sequence read maps to any of the one or more genetic sequences deemed to uniquely identify the lineage, counting the sequence read as being deemed to uniquely map to the lineage.
8. A method according to any previous claim, wherein the database includes an entry for each reference genome and each lineage within the phylogenetic structure.
9. A method according to claim 8, wherein, the entry for each lineage/reference genome includes a parent field for storing a pointer to a parent of the lineage/reference genome within the phylogenetic structure.
10. A method according to claim 8 or 9, wherein the entry for each lineage/reference genome includes a uniqueness field for storing a measure that reflects the uniqueness of the lineage or reference genome or a precursor of such a measure.
11. A method according to claim 10, wherein the uniqueness field is recalculated each time a new reference genome is stored in the database.
12. A method according to any previous claim, wherein the method includes obtaining the plurality of sequence reads from the sample.
13. A method according to claim 12, wherein the sequence reads are obtained by a shotgun sequencing process in which the DNA contained in the sample is broken up randomly into small segments which are then sequenced to obtain the plurality of sequence reads, wherein the plurality of sequence reads from the sample are obtained from across the complete DNA of organisms within the sample.
14. A method according to any previous claim, wherein the sequence reads each have a length of at least 80 base pairs.
15. An apparatus including a computer configured to perform a method according to any previous claim.
16. A computer-readable medium having computer-executable instructions configured to cause a computer to perform a method according to any of claims 1 to 14.
17. A method of analysing the bacteria and/or bacterial lineages present in a sample, wherein the method includes: obtaining a plurality of sequence reads from (i) a first portion of a sample and using a method according to any one of claims 1 to 14 to obtain indications of the relative abundances of one or more lineages and/or reference genomes within the first portion of the sample; obtaining a plurality of sequence reads from (ii) bacteria cultured from a second portion of the sample using a bacterial culturing method and using a method according to any one of claims 1 to 14, to obtain indications of the relative abundances of one or more lineages and/or reference genomes within the cultured portion of the sample; and comparing the indications of the relative abundances of the lineages and/or reference genomes within the first portion of the sample with the indications of the relative abundance of the lineages and/or reference genomes within the cultured sample.
18. A method according to claim 17, wherein the method is a method of determining the bacteria and/or bacterial lineages present in a sample which have or have not been cultured using a bacterial culturing method, wherein the method includes: determining the bacteria and/or bacterial lineages present in the first portion of the sample which were and/or were not cultured using the bacterial culturing method by comparing the indications of the relative abundances of the lineages and/or reference genomes within the first portion of the sample with the indications of the relative abundance of the lineages and/or reference genomes within the cultured sample.
19. A method according to claim 17 or 18, further comprising: obtaining a plurality of sequence reads from (iii) bacteria cultured from a second sample, obtained from the same source as the sample in (i) and (ii), using an alternate bacterial culturing method and using a method according to any one of claims 1 to 14, to obtain indications of the relative abundances of one or more lineages and/or reference genomes within the alternately cultured sample; and comparing the indications of the relative abundances of the lineages and/or reference genomes within the alternately cultured sample with the indications of the relative abundance of the lineages and/or reference genomes within the first portion of the sample and, optionally, the indications of the relative abundance of the lineages and/or reference genomes within the cultured sample.
20. A method according to claim 19, wherein the method is a method of determining the bacteria and/or bacterial lineages present in a sample which have been cultured using an alternate bacterial culturing method, wherein the method includes: (a) determining the bacteria and/or bacterial lineages present in the sample which were and/or were not cultured using the alternate bacterial culturing method by comparing the indications of the relative abundances of the lineages and/or reference genomes within the alternately cultured sample with the indications of the relative abundance of the lineages and/or reference genomes within the first portion of the sample; or (b) determining the bacteria and/or bacterial lineages present in the sample which were cultured using the alternate bacterial culturing method and which were not cultured with the bacterial culturing method by comparing the indications of the relative abundances of the lineages and/or reference genomes within the alternately cultured sample with the indications of the relative abundance of the lineages and/or reference genomes within the first portion of the sample and the relative abundance of the lineages and/or reference genomes within the cultured sample.
21. A method according to claim 17, wherein the method is a method of preparing a culture collection of bacteria of interest present in a sample, the method including identifying a bacterial culturing method for culturing the bacteria of interest, wherein the method includes: determining the bacteria of interest present in the sample which were cultured using the bacterial culturing method by comparing the indications of the relative abundances of the lineages and/or reference genomes within the first portion of the sample with the indications of the relative abundance of the lineages and/or reference genomes within the cultured sample; and employing the bacterial culturing method to prepare a collection of cultures of said bacteria of interest from the sample.
22. A method according to claim 17, wherein the method is a method of obtaining the genomic sequence of one or more bacteria of interest present in a sample, wherein the method includes: determining the bacteria of interest present in the sample which were cultured using the bacterial culturing method by comparing the indications of the relative abundances of the lineages and/or reference genomes within the first portion of the sample with the indications of the relative abundance of the lineages and/or reference genomes within the cultured sample; employing the bacterial culturing method to prepare cultures of one or more of the bacteria of interest from the sample, and determining the genomic sequence(s) of said bacteria.
23. A method according to claim 22, further comprising adding the genomic sequence(s) of said one or more of the bacteria of interest to a database that stores reference genomes.
24. A method of identifying a bacterium for bacteriotherapy for a dysbiosis, the method comprising: obtaining a plurality of sequence reads from a sample obtained from a patient with the dysbiosis; using a method according to any one of claims 1 to 14 to obtain indications of the relative abundances of lineages and/or reference genomes within the sample; comparing the indications of the relative abundances of the lineages and/or reference genomes within the sample with the relative abundance of the lineages and/or reference genomes in a control; selecting a bacterium with a genome, or belonging to a lineage, which is absent from the sample obtained from the patient but present in the control, or which is present at a lower relative abundance in the sample obtained from the patient compared with the control, for bacteriotherapy for the dysbiosis.
25. A method of identifying a bacterium for bacteriotherapy for a dysbiosis, the method comprising: obtaining a plurality of sequence reads from (i) a first sample obtained from a patient with the dysbiosis; obtaining a plurality of sequence reads from (ii) a second sample obtained from the same patient after the patient has received a faecal transplant; using a method according to any one of claims 1 to 14 to obtain indications of the relative abundances of lineages and/or reference genomes within the sample; comparing the indications of the relative abundances of the lineages and/or reference genomes within the first sample with the relative abundance of the lineages and/or reference genomes in the second sample; selecting a bacterium with a genome, or belonging to a lineage, which is absent from the first sample but present in the second sample, or which is present at a lower relative abundance in the first sample compared with the second sample, for bacteriotherapy for the dysbiosis.
26. A method of identifying a bacterium for therapy of a disease characterised by the presence of a pathogenic bacterium, the method comprising: obtaining a plurality of sequence reads from (i) a first sample obtained from a first asymptomatic carrier of the pathogenic bacterium and obtaining a plurality of sequence reads from (ii) a second sample obtained from a second asymptomatic carrier of the pathogenic bacterium; using a method according to any one of claims 1 to 14 to obtain indications of the relative abundances of lineages and/or reference genomes within the first sample and the second sample; comparing the indications of the relative abundances of the lineages and/or reference genomes within the first sample with the relative abundance of the lineages and/or reference genomes in the second sample; selecting a bacterium with a genome, or belonging to a lineage, which is common to the first and second sample for bacteriotherapy for the disease.
27. A method of identifying a bacterium for therapy of a disease characterised by the presence of a pathogenic bacterium, the method comprising: obtaining a plurality of sequence reads from a first sample obtained from an asymptomatic carrier of the pathogenic bacterium and obtaining a plurality of sequence reads from a second sample obtained from a healthy individual; using a method according to any one of claims 1 to 14 to obtain indications of the relative abundances of lineages and/or reference genomes within the first sample and the second sample; comparing the indications of the relative abundances of the lineages and/or reference genomes within the first sample with the relative abundance of the lineages and/or reference genomes in the second sample; selecting a bacterium with a genome, or belonging to a lineage, which is present in the first sample but absent in the second sample for bacteriotherapy for the disease.
28. A bacterium identified using a method according to any one of claims 24 to 27, wherein the bacterium is for use in a method of treating a dysbiosis or disease in an individual.
29. A bacterium for use in a method of treating a dysbiosis or disease in a patient, the method comprising identifying the bacterium using a method according to any one of claims 24 to 27, and administering the identified bacterium to the patient.
30. A method of treating a dysbiosis or disease in an individual, the method comprising identifying the bacterium using a method according to any one of claims 24 to 27, and administering a therapeutically effective amount of the identified bacterium to the patient.
31. A method of diagnosing a disease in a patient, the method comprising: obtaining a plurality of sequence reads from a sample obtained from the patient; using a method according to any one of claims 1 to 14 to obtain indications of the relative abundances of lineages and/or reference genomes within the sample; comparing the indications of the relative abundances of the lineages and/or reference genomes within the sample with the relative abundance of the lineages and/or reference genomes in a control; identifying a bacterium with a genome, or belonging to a lineage, which is present in the sample obtained from the patient but absent from the control, or which is present at a higher abundance in the sample obtained from the patient compared with the control; wherein the presence, or higher abundance, of the bacterium in the sample is indicative of the disease.
32. A method according to claim 31, wherein the method further comprises: selecting the individual for treatment for the disease; or subjecting the individual to treatment for the disease.
33. A diagnostic system for use in a method according to claim 31 or 32, the system comprising a tool or tools for obtaining a plurality of sequence reads from a sample obtained from a patient; and a computer programmed to compute indications of the relative abundances of lineages and/or reference genomes within the sample using a method according to any one of claims 1 to 14 from the sequencing data.
34. A method of treating a disease in a patient, the method comprising: (i) requesting a test providing the results of an analysis, the test including: obtaining a plurality of sequence reads from a sample obtained from the patient; using a method according to any one of claims 1 to 14 to obtain indications of the relative abundances of lineages and/or reference genomes within the sample; comparing the indications of the relative abundances of the lineages and/or reference genomes within the sample with the relative abundance of the lineages and/or reference genomes in a control; identifying a bacterium with a genome, or belonging to a lineage, which is present in the sample obtained from the patient but absent from the control, or which is present at a higher abundance in the sample obtained from the patient compared with the control; wherein the presence, or higher abundance, of the bacterium in the sample is indicative of the disease; (ii) treating the individual for the disease.
35. A method of identifying the bacterial causative agent of a disease in a patient, the method comprising: obtaining a plurality of sequence reads from a sample obtained from a patient; using a method according to any one of claims 1 to 14 to obtain indications of the relative abundances of lineages and/or reference genomes within the sample; comparing the indications of the relative abundances of the lineages and/or reference genomes within the sample with the relative abundance of the lineages and/or reference genomes in a control; identifying a bacterium with a genome, or belonging to a lineage, which is present in the sample obtained from the patient but absent from the control, or which is present at a higher abundance in the sample obtained from the patient compared with the control; wherein said bacterium is the causative agent of the disease.
36. A method of analysing the bacteria and/or bacterial lineages present in a sample, wherein the method includes: performing whole genome shotgun sequencing of (i) DNA extracted from a first portion of the sample and (ii) DNA extracted from bacteria cultured from a second portion of the sample using a bacterial culturing method; identifying one or more reference genomes and/or lineages in a database to which at least one of the plurality of sequence reads obtained in (i) is deemed to uniquely map, wherein the database stores a plurality of reference genomes and phylogenetic information which relates the stored reference genomes to each other in a phylogenetic structure; identifying one or more reference genomes and/or lineages in the database to which at least one of the plurality of sequence reads obtained in (ii) is deemed to uniquely map; comparing the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (i) were deemed to uniquely map with the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (ii) were deemed to uniquely map.
37. A method according to claim 36 comprising: identifying all reference genomes and/or lineages in a database to which at least one of the plurality of sequence reads obtained in (i) is deemed to uniquely map; and identifying all reference genomes and/or lineages in the database to which at least one of the plurality of sequence reads obtained in (ii) is deemed to uniquely map.
38. A method according to claim 36 or 37, wherein the method is a method of determining the bacteria and/or bacterial lineages present in a sample which have or have not been cultured using a bacterial culturing method, wherein the method includes: determining the bacteria and/or bacterial lineages present in the first portion of the sample which were and/or were not cultured using the bacterial culturing method by comparing the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (i) were deemed to uniquely map with the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (ii) were deemed to uniquely map.
39. A method according to claim 36 or 37, further comprising: performing whole genome shotgun sequencing of (iii) DNA extracted from bacteria cultured from a second sample, obtained from the same source as the sample in (i) and (ii), using an alternate bacterial culturing method; identifying one or more reference genomes and/or lineages in the database to which at least one of the plurality of sequence reads obtained in (iii) is deemed to uniquely map; comparing the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (iii) were deemed to uniquely map with the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (i) were deemed to uniquely map and, optionally, the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (ii) were deemed to uniquely map.
40. A method according to claim 39, wherein the method is a method of determining the bacteria and/or bacterial lineages present in a sample which have been cultured using an alternate bacterial culturing method, wherein the method includes: (a) determining the bacteria and/or bacterial lineages present in the sample which were and/or were not cultured using the alternate bacterial culturing method by comparing the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (iii) were deemed to uniquely map with the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (i) were deemed to uniquely map; or (b) determining the bacteria and/or bacterial lineages present in the sample which were cultured using the alternate bacterial culturing method and which were not cultured with the bacterial culturing method by comparing the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (iii) were deemed to uniquely map with the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (ii) were deemed to uniquely map and the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (i) were deemed to uniquely map.
41. A method according to claim 36 or 37, wherein the method is a method of preparing a culture collection of bacteria of interest present in a sample, the method including identifying a bacterial culturing method for culturing the bacteria of interest, wherein the method includes: determining the bacteria of interest present in the sample which were cultured using the bacterial culturing method by comparing the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (i) were deemed to uniquely map with the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (ii) were deemed to uniquely map; and employing the bacterial culturing method to prepare a collection of cultures of said bacteria of interest from the sample.
42. A method according to claim 36 or 37, wherein the method is a method of obtaining the genomic sequence of one or more bacteria of interest present in a sample, wherein the method includes: determining the bacteria of interest present in the sample which were cultured using the bacterial culturing method by comparing the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (i) were deemed to uniquely map with the one or more reference genomes and/or lineages to which at least one of the plurality of sequence reads obtained in (ii) were deemed to uniquely map; employing the bacterial culturing method to prepare cultures of one or more of the bacteria of interest from the sample, and determining the genomic sequence(s) of said bacteria.
43. A method according to claim 42, further comprising adding the genomic sequence(s) of said one or more of the bacteria of interest to a database that stores reference genomes.
44. A method substantially as described herein with reference to any embodiment shown in the accompanying drawings.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0283] Examples of these proposals are discussed below, with reference to the accompanying drawings in which:
[0284]
[0285]
[0286]
[0287]
[0288]
[0289]
[0290]
[0291]
[0292]
[0293]
DETAILED DESCRIPTION
[0294] In general, the following discussion describes examples of our proposals that provide a method suitable for quantifying relative species and strain abundance from high-throughput metagenomic sequencing samples. This is achieved through specific normalization methods in the context of high quality reference genomes.
[0295] The example method shown in
[0296] The database 100 stores a plurality of reference genomes and phylogenetic information which relates the stored reference genomes to each other in a phylogenetic structure.
[0297] The interrogation engine 110 uses a plurality of sequence reads 120 obtained from a sample to count the number of sequence reads deemed to uniquely map to each of a plurality of lineages and/or reference genomes within the phylogenetic structure.
[0298] As described in more detail below, the interrogation engine 110, for each of the plurality of lineages and/or reference genomes to which at least one sequence read has been deemed uniquely mapped, normalizes the number of sequence reads counted as being deemed uniquely mapped to the lineage or reference genome using a measure that reflects the uniqueness of the lineage or reference genome so as to obtain a value indicative of the relative abundance of the lineage or reference genome within the sample, thereby obtaining indications of relative abundances 130.
[0299] In the practical example discussed below, the database 100 is the HPMC database described in more detail in Annex A.
[0300] However, to give the reader a better understanding of the methods described herein, and to illustrate a method of using the database 100 in accordance with the invention, a simplified example of a database 200 is illustrated in
[0301] As shown in
[0302] As shown in
[0303] As shown in
[0304] As shown in
[0305] The content of the parent fields can be viewed as phylogenetic information which relates the stored reference genomes to each other in a phylogenetic structure, since an entire phylogenetic tree can be constructed from the information contained in the parent fields. Of course, phylogenetic information could be stored in numerous other ways, as would be appreciated by a skilled person.
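The way a tree can be rebuilt from parent fields alone can be sketched as follows. This is a minimal illustration, not the structure of the database 200 or the HPMC database: the entry names (`root`, `lineage_A`, `genome_1`, and so on) and the helper functions `build_children` and `genomes_under` are hypothetical.

```python
from collections import defaultdict

# Hypothetical entries: entry name -> content of its parent field.
# None marks the root. These names are illustrative only.
parent_of = {
    "root": None,
    "lineage_A": "root",
    "lineage_B": "root",
    "genome_1": "lineage_A",
    "genome_2": "lineage_A",
    "genome_3": "lineage_B",
}

def build_children(parent_of):
    """Invert the parent fields into a parent -> sorted children map."""
    children = defaultdict(list)
    for node, parent in parent_of.items():
        if parent is not None:
            children[parent].append(node)
    return {parent: sorted(kids) for parent, kids in children.items()}

def genomes_under(node, children):
    """Return the leaf entries (reference genomes) at or below a node."""
    kids = children.get(node, [])
    if not kids:
        return [node]  # an entry with no children is treated as a genome
    leaves = []
    for kid in kids:
        leaves.extend(genomes_under(kid, children))
    return leaves

children = build_children(parent_of)
print(genomes_under("lineage_A", children))
```

Storing only a parent pointer per entry keeps insertion of a new reference genome cheap, at the cost of rebuilding the child map when the tree has to be traversed from the root downwards.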
[0306] Computational techniques for inferring such a phylogenetic structure from stored reference genomes are well known. For the HPMC database described below, the present inventors used the 16S/18S sequence to define the broad tree with closely related species resolved through whole genome alignment (preferably an on-going exercise within the database) e.g. using Mauve (PMID: 15231754), Muscle (PMID: 15034147), or MAFFT (PMID: 23329690).
[0307] As shown in
[0308] The internal uniqueness value for each entry may be calculated by identifying one or more genetic sequences deemed to uniquely identify the corresponding lineage (if the entry is a lineage) or by identifying one or more genetic sequences deemed to uniquely identify the corresponding reference genome (if the entry is a reference genome), and then dividing the combined length of these sequences by the average length of the reference genomes in the corresponding lineage (if the entry is a lineage) or by the length of the corresponding reference genome (if the entry is a reference genome). Techniques for identifying one or more genetic sequences deemed to uniquely identify a lineage/reference genome have already been described in detail above.
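The calculation described in this paragraph can be sketched as below, assuming the unique sequences have already been identified and their lengths are known. The function name `internal_uniqueness` and the example lengths are hypothetical and are not taken from the database.

```python
def internal_uniqueness(unique_lengths, genome_lengths):
    """Combined unique length divided by the (average) genome length.

    For a reference genome entry, pass a single genome length.
    For a lineage entry, pass the lengths of every reference genome
    in the lineage, so that their average is used as the divisor.
    """
    if not genome_lengths:
        raise ValueError("at least one genome length is required")
    mean_length = sum(genome_lengths) / len(genome_lengths)
    return sum(unique_lengths) / mean_length

# Reference genome entry: 2,000 unique bases in a 4 Mb genome.
genome_value = internal_uniqueness([1000, 500, 500], [4_000_000])

# Lineage entry: 4,000 lineage-unique bases, member genomes of 4 Mb and 6 Mb.
lineage_value = internal_uniqueness([3000, 1000], [4_000_000, 6_000_000])

print(genome_value, lineage_value)
```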
[0309] Preferably, identifying one or more genetic sequences deemed to uniquely identify each lineage and reference genome in the database includes, for each reference genome in the database:
[0310] defining a plurality of segments, each segment containing a different genetic sequence from the reference genome;
[0311] for the genetic sequence contained in each segment:
[0312] comparing the genetic sequence contained in the segment with the entirety of the genetic content of all other reference genomes in the database to establish whether the segment maps to any of the other reference genomes;
[0313] if the genetic sequence contained in the segment maps to no other reference genome in the database, identifying the genetic sequence contained in the segment as being deemed to uniquely identify the reference genome;
[0314] if the genetic sequence contained in the segment maps to one or more other reference genomes in the database, and if it is determined using the phylogenetic information that the genetic sequence contained in the segment maps to all of the reference genomes in a lineage and to no other reference genomes in the database, identifying the genetic sequence contained in the segment as being deemed to uniquely identify the lineage.
[0315] In the present examples, the plurality of segments were obtained using a sliding window technique of length 100 base pairs and comparing the genetic sequence contained in a segment with all other reference genomes was performed with bowtie2 read aligner (see e.g. PMID: 22388286).
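A sliding window segmentation of the kind mentioned above could be sketched as below. The window length of 100 base pairs is taken from the text; the step size is not stated in the source, so the default of 1 used here is an assumption, as is the function name.

```python
def sliding_windows(genome, window=100, step=1):
    """Yield (start, segment) for every full-length window over the genome.

    window=100 follows the text; step=1 is an assumption of this sketch.
    A genome shorter than the window yields nothing.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(genome) - window + 1, step):
        yield start, genome[start:start + window]
```

Each segment produced in this way would then be compared against the other reference genomes, for example with a read aligner.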
[0316] Referring back now to
[0317] Next, for each of the plurality of lineages and/or reference genomes to which at least one sequence read has been deemed uniquely mapped, the interrogation engine 110 normalizes (by dividing) the number of sequence reads counted as being deemed uniquely mapped to the lineage or reference genome using a measure that reflects the uniqueness of the lineage or reference genome so as to obtain a value indicative of the relative abundance of the lineage or reference genome within the sample, thereby obtaining indications of relative abundances 130.
[0318] As discussed in more detail below, where all of the reference genomes stored in the database are similar or equal in length (as is assumed to be the case for the database 200 of
[0319] However, as discussed in more detail below, where the reference genomes stored in the database are unequal in length, the internal uniqueness value is preferably adjusted (e.g. on the fly by the interrogation engine 110) based on the average length of the reference genomes in the corresponding lineage (if the entry is a lineage) or based on the length of the corresponding reference genome (if the entry is a reference genome) in order to provide a measure that reflects the uniqueness of the corresponding lineage or reference genome. In this case, the internal uniqueness value stored in the database can be viewed as a precursor to a measure that reflects the uniqueness of the corresponding lineage or reference genome.
[0320] Storing an internal uniqueness value in the uniqueness field 240 of the database can be useful in analyses which are not the focus of this disclosure, since this value allows a direct comparison of the percentage uniqueness between reference genomes and lineages of different lengths. Nonetheless, in other embodiments (not exemplified herein), the uniqueness field 240 of the entry for each lineage/reference genome could instead store a global uniqueness value that is proportional to the combined length of one or more genetic sequences deemed to uniquely identify the corresponding lineage or reference genome. In this case, the global uniqueness value could be used as the measure that reflects the uniqueness of the corresponding lineage or reference genome regardless of whether the reference genomes stored in the database are equal or unequal in length, thereby avoiding any need to adjust the internal uniqueness value on the fly where reference genomes stored in the database are unequal in length.
[0321] Methods described herein may be viewed as extending the lowest common ancestor approach described in the Background section, above. Given the thorough genome coverage provided by the HPMC database described in Annex A (coverage which, prior to the HPMC database, most people would not have had), a problem with applying existing approaches to uniquely classify the content of a given sample to lineages or reference genomes using the HPMC database is that very few reads can be uniquely mapped to closely related reference genomes. Adding more reference genomes to the HPMC database helps to identify, reduce or avoid inaccurate classification, but further reduces the number of reads that can be uniquely classified to reference genomes, especially if reference genomes share a large proportion of their genetic content (consider the extreme case of a single nucleotide polymorphism, SNP, between two reference genomes: only sequence reads from a sample that cover that SNP could be used to distinguish the two reference genomes in the sample).
[0322] To correct for this problem, methods described herein preferably use a measure that reflects the uniqueness of each lineage and/or reference genome, thereby taking into account the uniqueness of each lineage and/or reference genome, so as to obtain indications of the relative abundances of lineages and/or reference genomes within a sample.
[0323] Indications of relative abundances determined according to a method as described herein may be utilised in a number of different downstream applications. An example workflow in which the indications of relative abundances determined according to a method as described herein may be used is shown in Annex B.
Theoretical Example Using the Database of FIG. 2(a)
[0324] The following theoretical example, which is intended to give the reader a better understanding of the methods described herein, uses the simplified database 200 of
[0325]
[0326] In the example of
[0330] From this, and the phylogenetic information shown in
[0331] For simplicity, GENOME A, GENOME B and GENOME C are assumed to have the same length.
[0332] Consider first Sample A, which has equal representation of each genome. Ideally, random sequence reads from Sample A would result in an equal number of sequence reads being uniquely mapped to each genome.
[0333] However, due to the differing uniquenesses of the three genomes, sequence reads from Sample A will not be uniquely mapped to the three genomes in equal numbers.
[0334] For example, when classifying 1000 sequence reads from Sample A, one would expect:
[0335] ~500 to map to each of GENOME A, GENOME B and GENOME C and therefore be uniquely mapped to LINEAGE ABC (since these sequence reads can't distinguish between the 50% of genome shared between GENOME A, B and C)
[0336] ~166 to map uniquely to GENOME A
[0337] ~268 to map to each of GENOME B and GENOME C and therefore be uniquely mapped to LINEAGE BC (since these sequence reads can't distinguish between the 90% of genome shared between GENOME B and C)
[0338] ~33 to map uniquely to GENOME B
[0339] ~33 to map uniquely to GENOME C.
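The expected counts listed above can be reproduced approximately with a short calculation. The sketch below is illustrative only: the fractions (0.5, 0.1 and 0.4) are the internal uniqueness values used in this theoretical example, the genomes are assumed equal in length as stated above, and the names `CONTENT` and `expected_counts` are hypothetical.

```python
from collections import defaultdict

# Fraction of each genome's content that is unique to the genome itself
# or to a lineage, as used in the theoretical example (equal genome lengths).
CONTENT = {
    "GENOME A": {"GENOME A": 0.5, "LINEAGE ABC": 0.5},
    "GENOME B": {"GENOME B": 0.1, "LINEAGE BC": 0.4, "LINEAGE ABC": 0.5},
    "GENOME C": {"GENOME C": 0.1, "LINEAGE BC": 0.4, "LINEAGE ABC": 0.5},
}


def expected_counts(total_reads, composition):
    """Expected number of reads uniquely mapped to each genome and lineage.

    composition maps genome name -> relative representation in the sample,
    e.g. {"GENOME A": 1, "GENOME B": 1, "GENOME C": 1} for Sample A.
    """
    total_weight = sum(composition.values())
    counts = defaultdict(float)
    for genome, share in composition.items():
        reads_from_genome = total_reads * share / total_weight
        # Spread this genome's reads over the entries they uniquely map to.
        for target, fraction in CONTENT[genome].items():
            counts[target] += reads_from_genome * fraction
    return dict(counts)
```

For 1000 reads from an equal mixture, this gives about 500 reads for LINEAGE ABC, about 167 for GENOME A, about 267 for LINEAGE BC and about 33 each for GENOME B and GENOME C; small differences from the figures quoted above are at the level of rounding.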
[0340] By counting the number of sequence reads deemed to uniquely map to each genome, the total sequence reads for each genome would be reported as:
[0341] GENOME A: ~166 (71.5% uniqueness at genome level, 33.2% of total assigned), apparent relative abundance ~5
[0342] GENOME B: ~33 (14.2% uniqueness at genome level, 6.6% of total assigned), apparent relative abundance ~1
[0343] GENOME C: ~33 (14.2% uniqueness at genome level, 6.6% of total assigned), apparent relative abundance ~1
[0344] However, normalizing the counted number using internal uniqueness values provides a more accurate indication of the real relative abundances present in Sample A and allows direct comparison between all genomes and lineages in the database 200:
[0345] GENOME A: (166)/(0.5), relative abundance ~1
[0346] GENOME B: (33)/(0.1), relative abundance ~1
[0347] GENOME C: (33)/(0.1), relative abundance ~1
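The normalization just described amounts to a division followed by a rescaling, which might be sketched as follows. The function name is hypothetical, and expressing the result relative to the smallest normalized value is a presentational choice of this sketch rather than something prescribed by the text.

```python
def relative_abundances(counts, uniqueness):
    """Normalize uniquely mapped read counts by a uniqueness measure.

    counts:     entry name -> number of reads deemed uniquely mapped
    uniqueness: entry name -> measure reflecting the entry's uniqueness
    Only entries with at least one uniquely mapped read are considered.
    The result is scaled so that the least abundant entry equals 1
    (a choice made for this sketch).
    """
    normalized = {
        name: count / uniqueness[name]
        for name, count in counts.items()
        if count > 0
    }
    smallest = min(normalized.values())
    return {name: value / smallest for name, value in normalized.items()}


# Counts and internal uniqueness values from the Sample A example.
sample_a = relative_abundances(
    {"GENOME A": 166, "GENOME B": 33, "GENOME C": 33},
    {"GENOME A": 0.5, "GENOME B": 0.1, "GENOME C": 0.1},
)
```

Here the raw counts suggest a ratio of roughly 5:1:1, whereas the normalized values (332, 330 and 330 before rescaling) are close to 1:1:1.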
[0348]
[0349] In the above calculations, internal uniqueness (which represents the proportion of the individual lineage or reference genome that is unique, relative to the genetic content of the individual lineage or reference genome) is used as a measure that reflects the uniqueness of the lineage or reference genome.
[0350] However, if the genomes were not equal in length, then the internal uniqueness value is preferably adjusted (e.g. on the fly) based on the length of the corresponding reference genome to provide a measure that reflects the uniqueness of the lineage or reference genome, which would adjust the above calculation as follows:
[0351] GENOME A: (no. reads)/(0.5 × I_A), relative abundance ~1
[0352] GENOME B: (no. reads)/(0.1 × I_B), relative abundance ~1
[0353] GENOME C: (no. reads)/(0.1 × I_C), relative abundance ~1
[0354] where I_A is the length of GENOME A, I_B is the length of GENOME B, and I_C is the length of GENOME C.
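A minimal sketch of the length adjustment is given below. The genome lengths and read counts in the usage lines are hypothetical values chosen for illustration and are not taken from the disclosure. Note that the product of internal uniqueness and genome length is simply the combined length of the uniquely identifying sequences, so this adjusted measure plays the same role as a global uniqueness value.

```python
def length_adjusted_abundance(read_count, internal_uniqueness, genome_length):
    """Read count divided by (internal uniqueness x genome length).

    The denominator equals the combined length of the sequences deemed to
    uniquely identify the genome, i.e. a global uniqueness measure.
    """
    unique_length = internal_uniqueness * genome_length
    if unique_length <= 0:
        raise ValueError("uniqueness and length must both be positive")
    return read_count / unique_length


# Hypothetical case: two genomes present in equal copy number, one twice the
# length of the other, so the longer genome contributes twice as many reads.
# 600 reads: 200 from the shorter genome (u = 0.5), 400 from the longer (u = 0.1).
short_genome = length_adjusted_abundance(100, 0.5, 2_000_000)  # 200 * 0.5 unique reads
long_genome = length_adjusted_abundance(40, 0.1, 4_000_000)    # 400 * 0.1 unique reads
```

In this hypothetical case the two adjusted values coincide, reflecting the equal copy number, whereas dividing by internal uniqueness alone would suggest that the longer genome is twice as abundant.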
[0355] Obviously, if the database 200 were to incorporate many more reference genomes and many more sequence reads, the internal uniqueness of the reference genomes might drop. However, a fundamental benefit of the normalization approach is that it allows one to adjust the read counts so that indications of relative abundances can still be obtained.
[0356] Importantly, the method is not limited to obtaining relative abundances of reference genomes in a sample.
[0357] For example, if for Sample A one wished to compare the relative abundance of GENOME A to LINEAGE BC, one could perform the following calculations:
[0358] GENOME A: (166)/(0.5), relative abundance ~1
[0359] LINEAGE BC: (268)/(0.4), relative abundance ~2
[0360] Thus, the above-exemplified method provides the ability to compare relative abundances of any genome or lineage combination through normalizing the counted numbers of sequence reads uniquely mapped to those genomes and/or lineages.
[0361] The method also works regardless of the starting composition of the sample.
[0362] For example, considering Sample B, which has unequal representation of each genome in a ratio of 1:2:2 (GENOME A:GENOME B:GENOME C), the total sequence reads for each genome would be reported as:
[0363] GENOME A: ~100 (55.6% uniqueness at genome level, 20.0% of total assigned), apparent relative abundance ~5
[0364] GENOME B: ~40 (22.2% uniqueness at genome level, 8.0% of total assigned), apparent relative abundance ~2
[0365] GENOME C: ~40 (22.2% uniqueness at genome level, 8.0% of total assigned), apparent relative abundance ~2
[0366] Again, normalizing the counted number using internal uniqueness values provides a more accurate indication of the real relative abundances present in Sample B:
[0367] GENOME A: (100)/(0.5), relative abundance ~1
[0368] GENOME B: (40)/(0.1), relative abundance ~2
[0369] GENOME C: (40)/(0.1), relative abundance ~2
[0370] Similarly comparing GENOME A to LINEAGE BC:
[0371] GENOME A: (100)/(0.5), relative abundance ~1
[0372] LINEAGE BC: (320)/(0.4), relative abundance ~4
[0373] The accuracy of the genome/lineage identification and quantification is fundamentally dependent on the quality of available reference genomes in the database. As described with reference to a practical example below, the HPMC database described in Annex A, which was populated with reference genomes using techniques described in this application, can be used to provide useful results in the case of gut flora. Without access to a database storing a comprehensive collection of reference genomes relevant to a sample under study, results may be less useful.
[0374] Assuming the database provides a comprehensive collection of reference genomes, the resolution of classification may be limited by sequencing depth. Accordingly, the number of sequence reads to be obtained may be chosen to be at least m/u, where m is a minimum number of reads deemed appropriate for confident detection of a lineage or reference genome of interest, and where u is the internal uniqueness, which represents the proportion of the individual lineage or reference genome that is unique (relative to the genetic content of the individual lineage or reference genome). For a typical experiment, m is preferably 100 or more, more preferably 1000 or more.
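The m/u bound stated above can be evaluated directly. The sketch below merely restates that bound; the function name is hypothetical, and rounding up to a whole number of reads is an assumption of the sketch.

```python
import math


def minimum_reads(m, u):
    """Smallest whole number of reads that is at least m / u.

    m: minimum number of reads deemed appropriate for confident detection
    u: internal uniqueness of the lineage or reference genome (0 < u <= 1)
    """
    if m <= 0:
        raise ValueError("m must be positive")
    if not 0 < u <= 1:
        raise ValueError("u must lie in the interval (0, 1]")
    return math.ceil(m / u)
```

For example, with m = 100 and an internal uniqueness of 0.5 the bound is 200 reads, and with m = 1000 and an internal uniqueness of 0.25 it is 4000 reads; the bound grows as the uniqueness of the entry of interest falls.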
[0375] A skilled person would appreciate that whilst internal uniqueness has been used to normalize counts in the above example, the specific form of the measure used to normalize counted numbers of sequence reads is not important, so long as it reflects the uniqueness of the lineages and/or reference genomes being studied.
Practical Example Using the HPMC Database Described in Annex A: Controlled Data
[0376] To demonstrate the practical effectiveness of the methods described herein, it is possible to consider 5 species: Aspergillus fumigatus, Bifidobacterium breve 689b, Bifidobacterium breve S27, Clostridium difficile 630 and Staphylococcus phage K. This selection demonstrates that the method is effective both on eukaryotic components of the microbiota with large genomes (Aspergillus, 29.3 Mb) and on bacteriophage (Staphylococcus phage K, 0.01 Mb genome). It also demonstrates the ability to differentiate the two strains of B. breve (genome size ~2.3 Mb) and a distantly related bacterium, C. difficile.
[0377] To demonstrate the effectiveness of this method it is necessary to use real sequencing reads, so as to capture the variability observed in real sequencing data. However, it is not possible to know the true content of a real metagenomic sample. To overcome this limitation, sequencing reads obtained from direct genome sequencing are sampled at a prescribed percentage to generate pseudo-metagenomic sequencing reads at known proportions.
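One way of assembling such a pseudo-metagenomic sample is sketched below. It is an illustration under stated assumptions rather than the procedure actually used: reads are assumed to be held in memory as lists keyed by genome name, sampling is performed without replacement using a fixed seed for reproducibility, and the function name is hypothetical.

```python
import random


def build_pseudo_metagenome(reads_by_genome, proportions, total_reads, seed=0):
    """Draw reads from per-genome read sets at prescribed proportions.

    reads_by_genome: genome name -> list of reads from direct sequencing
    proportions:     genome name -> relative share of the pseudo-sample
    Returns a shuffled list of (genome name, read) pairs, so that the true
    origin of every read in the pseudo-sample remains known.
    """
    rng = random.Random(seed)
    total_weight = sum(proportions.values())
    sample = []
    for genome, share in proportions.items():
        wanted = round(total_reads * share / total_weight)
        pool = reads_by_genome[genome]
        if wanted > len(pool):
            raise ValueError(f"not enough reads available for {genome}")
        # Sampling without replacement is an assumption of this sketch.
        sample.extend((genome, read) for read in rng.sample(pool, wanted))
    rng.shuffle(sample)
    return sample
```

Because the origin of each read is retained, the abundances estimated by the method can be compared directly with the proportions used to build the sample.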
[0378] The measure used to normalize counts is essential to the method, but the specific form of the measure and the detail with which it is calculated is not important for the method's success, so long as it reflects the uniqueness of the lineages and/or reference genomes being studied.
[0379] For this example, the uniqueness measure used to normalize counts was calculated by using a 100 bp sliding window approach. The genome and lineage uniqueness used to normalize counts was reported as the percentage of 100 bp regions that would uniquely identify the genome or lineage against all other genomes within the database. The comparison was performed using the bowtie2 algorithm with standard parameters. Read abundance levels were then weighted by this measure as described above to determine the relative species abundance from the relative read abundance.
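The genome-level uniqueness percentage described above might be approximated as in the following sketch. Exact substring matching is used here in place of the bowtie2 comparison, purely to keep the example self-contained; the step size of 1 and the function name are likewise assumptions. Lineage-level uniqueness would additionally require the phylogenetic information and is not covered by this sketch.

```python
def percent_unique_windows(genome, other_genomes, window=100, step=1):
    """Percentage of windows in a genome that occur in no other genome.

    Assumption: a window "maps" to another genome only if it occurs there
    as an exact substring (a stand-in for read alignment).
    """
    starts = range(0, len(genome) - window + 1, step)
    total = len(starts)
    if total == 0:
        raise ValueError("genome is shorter than the window")
    unique = sum(
        1
        for start in starts
        if not any(genome[start:start + window] in other for other in other_genomes)
    )
    return 100.0 * unique / total
```

With a toy window of 4 bases, the genome `AAAACCCC` compared against `CCCCGGGG` has five windows, of which only `CCCC` is shared, giving a uniqueness of 80%.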
[0380] Sample containing equal proportions (
[0382] Sample containing mixed proportions (
[0384] Sequence reads were randomly selected from the complete genome sequences of each species and assembled into a pseudo-metagenomic sample with known read proportions. Read abundance levels are then weighted by this uniqueness factor as described above to determine the relative species abundance from the relative read abundance.
[0385] Applying the uniqueness normalization to the sample containing equal proportions:
[0386] Aspergillus: ~1
[0387] B. breve 689b: ~1
[0388] B. breve S27: ~1
[0389] C. difficile 630: ~1
[0390] Staphylococcus phage K: ~1
[0391] Applying the uniqueness normalization to the sample containing mixed proportions:
[0392] Aspergillus: ~2
[0393] B. breve 689b: ~3
[0394] B. breve S27: ~2
[0395] C. difficile 630: ~3
[0396] Staphylococcus phage K: ~10
[0397] It is also possible to calculate the relative abundances of any particular lineage using this method. In this example there are two strains of B. breve represented. Considering these two strains as a single B. breve lineage, uniqueness normalization for the sample containing equal proportions provides results as follows:
[0398] Aspergillus: ~1
[0399] B. breve Lineage: ~2
[0400] C. difficile 630: ~1
[0401] Staphylococcus phage K: ~1
[0402] Using uniqueness normalization for the sample containing mixed proportions provides results as follows:
[0403] Aspergillus: ~2
[0404] B. breve Lineage: ~5
[0405] C. difficile 630: ~3
[0406] Staphylococcus phage K: ~10
[0407] Note that calculating relative abundance for a lineage involves counting the number of sequence reads deemed to uniquely map to the lineage and normalizing that count using a measure that reflects the uniqueness of the lineage, rather than just adding the relative abundances determined for individual members of the lineage (though the result should come out as similar, as above, assuming that there is good coverage of the lineage in the database).
Practical Example Using the HPMC Database Described in Annex A: Real Data
[0408] To demonstrate the practical benefits of this approach, it is possible to consider many examples where identification of specific species or strains can provide important insights into biology or bacteriotherapy candidate design, as it provides exact species or strains as opposed to genus- or family-level approximations.
Example 1: Bacteriotherapy Candidate Identification
[0409] One specific example is the identification of C. difficile bacteriotherapy candidates. When applying this analysis approach to 435 public metagenomic samples where C. difficile is detected in individuals that report normal health, it is possible to also identify commonly co-occurring species that are likely to play a role in maintaining health and preventing uncontrolled C. difficile expansion (and thus disease). This analysis identifies 30 species that commonly associate with asymptomatic C. difficile carriers (p<0.01). When compared to the publicly available RePOOPULATE study (PMC3869191), 24 of the 25 species identified were represented in this list (Eubacterium desmolans was absent).
[0410]
Example 2: Biomarker Identification
[0411] Accurate species- and strain-level identification of pathogens and commensals will provide an important tool for metagenomic-based diagnostics and biomarker identification. The proposed method could be used to identify the specific strains of a particular pathogen, such as identification of an epidemic strain (027 ribotype) in a C. difficile-infected individual. This approach has many applications in clinical settings where rapid, accurate pathogen identification is of critical importance. Such an approach could also be critical in the identification of biomarkers suitable for identification or stratification of those at risk of microbiota-mediated disease.
[0412] Described below are examples of methods of the invention for identifying bacteria and/or bacterial lineages present in a sample, such as bacteria and/or bacterial lineages present in a sample which have/have not been cultured using a set of bacterial culture conditions, adjusting culture conditions as necessary to culture bacteria not cultured using an original set of bacterial culture conditions, obtaining whole genomic sequences and/or cultures of bacteria cultured using a suitable set of bacterial culture conditions.
[0413] A schematic diagram of a work-flow encompassing the above methods is shown in
Example 3: Analysis of the Proportion of Bacteria Present in a Sample which Has been Cultured
[0414] Faecal samples from 6 healthy humans were collected and the resident bacterial communities defined using a combined metagenomic sequencing and bacterial culturing approach using the complex, broad range culture medium, YCFA (Duncan et al., 2002). Applying shotgun metagenomic sequencing we profiled and compared the bacterial species present in the original faecal samples to those that grew on YCFA agar plates (by scraping the colonies off the plate for DNA isolation and sequencing). Importantly, we observed a strong correlation between the two (R² = 0.85) (
[0415] The human intestinal microbiota is dominated by strict anaerobic bacteria that are extremely sensitive to ambient oxygen, so it is not known how these bacteria survive environmental exposure to transmit between individuals. Certain pathogenic Firmicutes, such as the diarrheal pathogen Clostridium difficile, produce metabolically dormant and highly resistant spores during colonization that facilitate both persistence within the host and environmental survival once shed (Francis et al., 2013; Janoir et al., 2013; Lawley et al., 2009). C. difficile spores have evolved mechanisms to resume metabolism and vegetative growth after intestinal colonisation by germinating in response to digestive bile acids (Francis et al., 2013). Relatively few intestinal spore-forming bacteria have been cultured to date and their genomes, phylogenies and phenotypes remain poorly characterised (Rajilic-Stojanovic et al., 2014). Recently, metagenomic studies have suggested that other unexpected members of the intestinal microbiota possess potential sporulation genes, even though these bacteria have never been grown in a laboratory and are not known to produce spores (Galperin et al., 2012; Abecasis et al., 2013; Meehan et al., 2014).
[0416] We hypothesized that sporulation is an unappreciated basic phenotype of the human intestinal microbiota that may have a profound impact on microbiota persistence and spread between humans. Spores from C. difficile are resistant to ethanol and this phenotype can be used to select for spores from a mixed population of spores and sensitive vegetative cells (Riley et al., 1987). Faecal samples were treated with ethanol and analysed using our combined culture and metagenomic approach. Principal component analysis demonstrated that ethanol treatment profoundly altered the culturable bacterial composition compared to the original profile and efficiently enriched for ethanol resistant bacteria, facilitating their isolation (
[0417] In total, we isolated bacteria from 97% of the most abundant genera and 90% of the most abundant species detected in our cohort by metagenomic sequencing. Even bacterial genera that were present at low relative abundance (<0.1%) in the faecal samples were cultured. Overall, we cultured and archived 137 distinct bacterial species which included 45 novel species, and isolates representing 20 novel genera and 2 novel families. Our collection contains 90 species from the Human Microbiome Project's most wanted list of previously uncultured and unsequenced microbes (Fodor et al., 2012). Thus, our broad-range culture approach led to massive bacterial discovery and challenges the notion that the majority of the intestinal microbiota is unculturable.
[0418] By obtaining and then storing the genomic sequences of the bacterial isolates in a database, a database having more thorough genome coverage of intestinal microbiota, such as the HPMC database described in Annex A, can be established.
FINAL REMARKS
[0419] When used in this specification and claims, the terms "comprises" and "comprising", "including" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the possibility of other features, steps or integers being present.
[0420] The features disclosed in the foregoing description, or in the following claims, or in the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for obtaining the disclosed results, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.
[0421] While the invention has been described in conjunction with the exemplary embodiments described above, many equivalent modifications and variations will be apparent to those skilled in the art when given this disclosure. Accordingly, the exemplary embodiments of the invention set forth above are considered to be illustrative and not limiting. Various changes to the described embodiments may be made without departing from the spirit and scope of the invention.
[0422] For the avoidance of any doubt, any theoretical explanations provided herein are provided for the purposes of improving the understanding of a reader. The inventors do not wish to be bound by any of these theoretical explanations.
REFERENCES
[0423] All references referred to herein are hereby incorporated by reference.
[0424] Abecasis, A. B. et al. A genomic signature and the identification of new sporulation genes. Journal of bacteriology 195, 2101-2115, doi:10.1128/JB.02110-12 (2013).
[0425] Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215, 403-410, doi:10.1016/S0022-2836(05)80360-2 (1990).
[0426] Bosshard, P. P., Abels, S., Zbinden, R., Bottger, E. C. & Altwegg, M. Ribosomal DNA sequencing for identification of aerobic gram-positive rods in the clinical laboratory (an 18-month evaluation). J Clin Microbiol 41, 4134-4140 (2003).
[0427] Clarridge, J. E., 3rd. Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases. Clinical microbiology reviews 17, 840-862, table of contents, doi:10.1128/CMR.17.4.840-862.2004 (2004).
[0428] Duncan, S. H., Hold, G. L., Harmsen, H. J., Stewart, C. S. & Flint, H. J. Growth requirements and fermentation products of Fusobacterium prausnitzii, and a proposal to reclassify it as Faecalibacterium prausnitzii gen. nov., comb. nov. Int J Syst Evol Microbiol 52, 2141-2146 (2002).
[0429] Fodor, A. A. et al. The most wanted taxa from the human microbiome for whole genome sequencing. PloS one 7, e41294, doi:10.1371/journal.pone.0041294 (2012).
[0430] Francis, M. B., Allen, C. A., Shrestha, R. & Sorg, J. A. Bile acid recognition by the Clostridium difficile germinant receptor, CspC, is important for establishing infection. PLoS pathogens 9, e1003356, doi:10.1371/journal.ppat.1003356 (2013).
[0431] Galperin, M. Y. et al. Genomic determinants of sporulation in Bacilli and Clostridia: towards the minimal set of sporulation-specific genes. Environmental microbiology 14, 2870-2890, doi:10.1111/j.1462-2920.2012.02841.x (2012).
[0432] Goodman et al., PNAS, vol. 108, 6252-6257 (2011)
[0433] Janoir, C. et al. Adaptive strategies and pathogenesis of Clostridium difficile from in vivo transcriptomics. Infect Immun 81, 3757-3769, doi:10.1128/IAI.00515-13 (2013).
[0434] Lawley, T. D. and A. W. Walker (2013). Intestinal colonization resistance. Immunology 138(1): 1-11.
[0435] Lawley, T. D. et al. Antibiotic treatment of Clostridium difficile carrier mice triggers a supershedder state, spore-mediated transmission, and severe disease in immunocompromised hosts. Infect Immun 77, 3661-3669, doi:10.1128/IAI.00558-09 (2009).
[0436] Meehan, C. J. & Beiko, R. G. A phylogenomic view of ecological specialization in the Lachnospiraceae, a family of digestive tract-associated bacteria. Genome biology and evolution 6, 703-713, doi:10.1093/gbe/evu050 (2014).
[0437] Petrof, E. O., G. B. Gloor, S. J. Vanner, S. J. Weese, D. Carter, M. C. Daigneault, E. M. Brown, K. Schroeter and E. Allen-Vercoe (2013). Stool substitute transplant therapy for the eradication of Clostridium difficile infection: RePOOPulating the gut. Microbiome 1(1): 3.
[0438] Rajilic-Stojanovic, M. & de Vos, W. M. The first 1000 cultured species of the human gastrointestinal microbiota. FEMS microbiology reviews 38, 996-1047, doi:10.1111/1574-6976.12075 (2014).
[0439] Riley, T. V., Brazier, J. S., Hassan, H., Williams, K. & Phillips, K. D. Comparison of alcohol shock enrichment and selective enrichment for the isolation of Clostridium difficile. Epidemiology and infection 99, 355-359 (1987).
[0440] Seekatz, A. M., J. Aas, C. E. Gessert, T. A. Rubin, D. M. Saman, J. S. Bakken and V. B. Young (2014). Recovery of the gut microbiome following fecal microbiota transplantation. MBio 5(3): e00893-00814.
[0441] Stewart, E. J. (2012). Growing unculturable bacteria. J Bacteriol 194(16): 4151-4160.
[0442] van Nood, E., A. Vrieze, M. Nieuwdorp, S. Fuentes, E. G. Zoetendal, W. M. de Vos, C. E. Visser, E. J. Kuijper, J. F. Bartelsman, J. G. Tijssen, P. Speelman, M. G. Dijkgraaf and J. J. Keller (2013). Duodenal infusion of donor feces for recurrent Clostridium difficile. N Engl J Med 368(5): 407-415.
[0443] Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and environmental microbiology 73, 5261-5267, doi:10.1128/AEM.00062-07 (2007).