Method and Apparatus For Analysing a Sample

Abstract

We describe a method and apparatus for analysing a sample. The method may comprise extracting a plurality of sequence reads from within the sample. Genomic analysis is then performed on the plurality of sequence reads by comparing the plurality of sequence reads to reference genomes stored in a reference database, wherein each stored reference genome comprises a set of reference sequences. Before performing the genomic analysis, the method further comprises comparing screening sequences with at least one of the set of reference sequences and the plurality of sequence reads from the sample. When it is determined that a screening sequence matches at least one sequence within either the set of reference sequences or the plurality of sequence reads, the at least one matching sequence is masked.

Claims

1. A method for analysing a sample, the method comprising: extracting a plurality of sequence reads from within the sample; and performing genomic analysis on the plurality of sequence reads by comparing the plurality of sequence reads to reference genomes stored in a reference database, wherein each stored reference genome comprises a set of reference sequences; wherein before performing the genomic analysis, the method further comprises: comparing screening sequences with at least one of the set of reference sequences and the plurality of sequence reads from the sample; and when it is determined that a screening sequence matches at least one sequence within at least one of the set of reference sequences and the plurality of sequence reads, masking the at least one matching sequence, wherein the screening sequences include at least one of mobile genomic elements and horizontally transferred elements.

2. The method of claim 1, wherein masking the at least one matching sequence comprises converting the at least one matching sequence to a sequence of predetermined characters.

3. The method of claim 1, wherein masking the at least one matching sequence comprises deleting the at least one matching sequence.

4. The method of claim 1, comprising comparing the set of reference sequences with the screening sequences, masking any reference sequences within the reference genomes which match the screening sequences and storing the masked reference genomes.

5. The method of claim 4, wherein the storing step is completed before extracting the plurality of sequence reads.

6. The method of claim 1, comprising comparing the plurality of sequence reads from the sample with the screening sequences and masking any sequence reads which match the screening sequences.

7. The method of claim 1, wherein it is determined that a screening sequence matches at least one sequence when the at least one sequence is identical to the screening sequence.

8. The method of claim 1, wherein the genomic analysis comprises performing phylogenetic analysis for the plurality of sequence reads.

9. The method of claim 1, wherein the genomic analysis comprises performing metagenomic analysis.

10. The method of claim 9, wherein performing metagenomic analysis comprises performing lowest common ancestor reference-based metagenomic assembly analysis.

11. The method of claim 1, wherein performing genomic analysis comprises performing analysis of meta-transcriptomics data.

12. The method of claim 1, wherein extracting a plurality of sequence reads from within the sample comprises using shotgun sequencing.

13. A non-transitory computer readable medium comprising computer executable instructions which when executed by a computing device cause the computing device to carry out the method of claim 1.

14. An apparatus for analysing a sample from a patient, the apparatus comprising: a processor which is configured to receive a plurality of sequence reads from within the sample; and perform genomic analysis on the plurality of sequence reads by comparing the plurality of sequence reads to reference genomes stored in a reference database, wherein each stored reference genome comprises a set of reference sequences; wherein before the genomic analysis is performed, the processor is further configured to: compare screening sequences with at least one of the set of reference sequences and the plurality of sequence reads from the sample; and when it is determined that a screening sequence matches at least one sequence within at least one of the set of reference sequences and the plurality of sequence reads, mask the at least one matching sequence, wherein the screening sequences include at least one of mobile genomic elements and horizontally transferred elements.

Description

BRIEF DESCRIPTION OF THE FIGURES

[0033] For a better understanding of the invention, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example only, to the accompanying diagrammatic drawings in which:

[0034] FIG. 1a is a flowchart illustrating the steps in a standard process for analysing a received sample;

[0035] FIGS. 1b to 1g illustrate how the method of FIG. 1a can lead to incorrect analysis of a plurality of genetic sequences;

[0036] FIGS. 2a and 2b are flowcharts illustrating the steps in the two new methods for analysing a received sample;

[0037] FIG. 2c is an example of the analysis using the method of FIG. 2a;

[0038] FIG. 3a is a phylogenetic tree showing the diversity of the genomes within the custom database which is used in the method of FIG. 2a or 2b;

[0039] FIG. 3b is a graph showing the classification results for the method of FIG. 2a or 2b and illustrates the percentage of classified reads for 13415 metagenomic sequenced samples split into the three most highly represented continents;

[0040] FIG. 4 is a schematic block diagram of the components used in the method of FIG. 2a or 2b;

[0041] FIG. 5a is a graph showing the relative abundance of the dominant bacterial species, ordered by prevalence, within the 13,415 human gastrointestinal metagenomic samples;

[0042] FIG. 5b is a graph showing the number of samples in which each novel species was identified and is ordered by prevalence of the species; and

[0043] FIG. 5c is a DAPC analysis of functional categories showing the distinct functions associated with each dominant phylum.

DETAILED DESCRIPTION

[0044] As detailed in the background section, FIGS. 1a to 1g illustrate a standard process for analysing a received sample and the possible incorrect analyses that may occur as a result. FIG. 2a shows the steps of a proposed method which provides more accurate and cost-effective shotgun metagenomics analysis.

[0045] The first step of the method (step S200) is to receive a sample. The sample may be a faecal sample, e.g. from a number (20) of healthy adults (North America n=12, United Kingdom n=8) who had not taken antibiotics within the last 6 months. Samples may be frozen and stored at −80 degrees Celsius. They may then be purified, for example using bacterial culturing as described in “Culturing of ‘unculturable’ human microbiota reveals novel taxa and extensive sporulation” by Browne et al published in Nature 533, 543-546, doi:10.1038/nature17645 (2016). Such culturing may use a supplemented YCFA medium with or without ethanol pre-treatment. The sample processing and culturing may take place under anaerobic conditions, e.g. in Whitley DG250 and A95 workstations at 37° C. using phosphate buffered saline and culture media incubated under anaerobic conditions for 24 hours prior to use. As an example, the faecal samples may be homogenized in reduced PBS (0.1 g stool per ml PBS), serially diluted and plated directly onto YCFA agar supplemented with 0.002 g ml⁻¹ each of glucose, maltose and cellobiose in Petri dishes (13.5 cm diameter). Colonies may then be picked, re-streaked to purity and identified using 16S rRNA gene sequencing. Species were defined on the basis of a 16S rRNA gene sequence identity threshold of >97.8%, for example as described in “Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences” by Yarza et al published in Nat Rev Microbiol 12, 635-645, doi:10.1038/nrmicro3330 (2014).

[0046] In the next step of the method (step S202), the genomic sequences within the sample may be determined. This may be done by extracting the DNA from the sample, e.g. from pelleted cells using a phenol-chloroform method as described for example in “Molecular cloning: a laboratory manual” 4th edn, (Cold Spring Harbor Laboratory Press, 2012). DNA may then be prepared and sequenced using standard techniques such as the Illumina Hi-Seq platform with library fragment sizes of 200-300 bp. The sequences may have a read length of 100 or 150 bp.

[0047] The genomic sequences to be used in subsequent steps may preferably be of at least high quality draft standard as defined for example in “Genome Project Standards in a New Era of Sequencing” by Chain et al published in Science 2009 Oct. 9; 326(5950): doi:10.1126/science.1180614. In other words, overall coverage represents at least 90% of the genome or target region with an effort made to exclude contaminating sequences. Sequence errors and mis-assemblies may still be possible.

[0048] Determining whether the sequences meet the standard may be achieved for example by producing annotated assemblies as described in “Robust high-throughput prokaryote de novo assembly and improvement pipeline for Illumina data” by Page et al published in Microb Genom 2, e000083, doi:10.1099/mgen.0.000083 (2016). For example, a first step in producing the annotated assemblies may be to create multiple assemblies using Velvet v1.2 and VelvetOptimiser as described in “Velvet: algorithms for de novo short read assembly using de Bruijn graphs” by Zerbino et al and published in Genome Res 18, 821-829, doi:10.1101/gr.074492.107 (2008). An assembly improvement step may then be applied to the assembly with the best N50, in which contigs were scaffolded using SSPACE described in “Scaffolding pre-assembled contigs using SSPACE” by Boetzer et al and published in Genome Biol 13, R56, doi:10.1186/gb-2012-13-6-r56 (2012) and sequence gaps were filled, e.g. using GapFiller described in “Toward almost closed genomes with GapFiller” by Boetzer et al published in Bioinformatics 27, 578-579, doi:10.1093/bioinformatics/btq683 (2011). Automated annotation may then be performed, e.g. using PROKKA v1.11 described in “Prokka: rapid prokaryotic genome annotation” by Seemann published in Bioinformatics 30, 2068-2069, doi:10.1093/bioinformatics/btu153 (2014). In this example, genomes with less than 400 contigs, a genome size less than 8 Mb and presence of 16S rRNA sequences with greater than 97.5% homology were considered pure and included in further analysis.
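The purity criteria at the end of the preceding paragraph can be sketched as a simple filter. This is a minimal illustration only: the `AssemblyStats` record, its field names and the reading of the 16S criterion as a single best-hit identity percentage are assumptions made for the example and are not taken from the pipeline described above.

```python
from dataclasses import dataclass

# Thresholds quoted in the description.
MAX_CONTIGS = 400                  # fewer than 400 contigs
MAX_GENOME_SIZE_BP = 8_000_000     # genome size below 8 Mb
MIN_16S_IDENTITY_PCT = 97.5        # 16S rRNA homology above 97.5%


@dataclass
class AssemblyStats:
    """Hypothetical per-genome summary; field names are illustrative."""
    contig_count: int
    genome_size_bp: int
    best_16s_identity_pct: float


def is_pure(stats: AssemblyStats) -> bool:
    """Return True when an assembly satisfies all three purity criteria."""
    return (
        stats.contig_count < MAX_CONTIGS
        and stats.genome_size_bp < MAX_GENOME_SIZE_BP
        and stats.best_16s_identity_pct > MIN_16S_IDENTITY_PCT
    )


# Toy usage: keep only assemblies that pass the filter.
candidates = {
    "isolate_1": AssemblyStats(120, 3_500_000, 99.1),
    "isolate_2": AssemblyStats(650, 4_100_000, 99.4),
}
retained = [name for name, stats in candidates.items() if is_pure(stats)]
```

All three comparisons are strict, mirroring the wording "less than" and "greater than" in the text.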

[0049] Some analysis may be performed on the genomic sequences before the subsequent steps. For example, an initial phylogenetic analysis may be performed as an optional step (S204). This may be done by extracting amino acid sequences of 40 universal core marker genes from the bacterial collection, e.g. using specI which is described in “Accurate and universal delineation of prokaryotic species” by Mende et al published in Nat Methods 10, 881-884, doi:10.1038/nmeth.2575 (2013). The protein sequences may be concatenated and aligned using standard tools such as MAFFT which is described in “MAFFT multiple sequence alignment software version 7: improvements in performance and usability” by Katoh et al published in Mol Biol Evol 30, 772-780, doi:10.1093/molbev/mst010 (2013). Maximum likelihood trees may be constructed using known techniques, such as RAxML with the standard LG model and 100 rapid bootstrap replicates for example as described in “RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies” by Stamatakis published in Bioinformatics 30, 1312-1313 doi:10.1093/bioinformatics/btu033 (2014). The trees may then be visualised e.g. using iTOL described in “Interactive Tree of Life v2: online annotation and display of phylogenetic trees made easy” by Letunic et al published in Nucleic Acids Res 39, W475-478, doi:10.1093/nar/gkr201 (2011).

[0050] Screening against a screening database (step S206) may then be carried out. This screening database may comprise mobile genetic elements, for example plasmids, transposons and insertion sequences or may comprise other elements which it is desirable to remove, e.g. horizontally transferred elements (e.g. DNA encoding a specific operon). An example of a screening database is included as Supplementary Table 4 in the paper entitled “A human gut bacterial genome and culture collection for improved metagenomic analysis” by Forster et al. published in February 2019 in Nature Biotechnology and is incorporated by reference. Merely as examples, some plasmids, bacteriophages, transposons and insertion sequences are listed below using their accession numbers from the National Center for Biotechnology Information (NCBI).

TABLE-US-00001 Plasmids Bacteriophage Transposon Insertion sequences AY171301 MG711460.1 HG475346.1 AY168958 AJ001708 GQ141189 U00004.1 AF238307 AY112722 MH550421.1 KX231277.1 X92945 NC_004587.1 HM243621 AM180356 MH382198.1 AB426620

[0051] A mobile genetic element may be a type of genetic material which can move around within a genome or that can be transferred from one species or replicon to another as described above. A horizontally transferred element is also described above. An example database comprising mobile genetic elements is known as ACLAME and comprises all known phage genomes, plasmids and transposons. The mobile genetic elements which are available in ACLAME include plasmids, viruses and prophages. Another example database, prepared by the Millard Lab, comprises over 9000 complete bacteriophage genomes which have been extracted from GenBank. The TREP transposable elements platform is another example database which comprises transposable elements. As defined on this platform, transposable elements are mobile genomic DNA sequences found in nearly all organisms. Transposable elements have the ability to replicate in a host genome using various transposition mechanisms and they are divided into two classes based on their replication mechanism: retrotransposons (class I) and DNA transposons (class II).

[0052] Another example database is ImmeDB (Intestinal microbiome mobile element database), which is dedicated to the collection, classification and annotation of mobile genetic elements (MGEs) from the gut microbiome. The classes of mobile genetic elements which are found in the gut microbiome may include prophages, integrative conjugative elements, integrative mobilizable elements, group II introns, transposons and genomic islands or genomic islets. Illustrative examples of some mobile elements which are provided in this database are set out below and are referenced by their ID used in the database:

TABLE-US-00002
Genomic islands                   Genomic islets
NZ_ADMO01000031.1: 84616-95833    NC_004663.1: 4599796-4601480
NC_006347.1: 3408664-3439675      NZ_JH724114.1: 370089-371752
NZ_LRGD01000028.1: 9771-24727     NZ_JH815524.1: 1786629-1788226
NZ_GL945019.1: 415119-429441      NZ_NQMG01000019.1: 79992-81706
NC_021030.1: 369579-410404        NZ_LARL01000004.1: 17735-19268

TABLE-US-00003
Integrative conjugative elements    Integrative mobilizable element
NZ_CBXG010000004.1: 4988-55362      NC_021012.1: 2873017-2884015
NZ_AENZ01000040.1: 66378-115551     NZ_CP011531.1: 4119779-4145888
NZ_BAIB02000011.1: 61343-89533      NZ_AXVN01000013.1: 27402-38179
NZ_CP023819.1: 1567305-1600880      NZ_JH724079.1: 2703611-2729544
NZ_GG774984.1: 239828-297170        NC_021040.1: 960056-964669

TABLE-US-00004
Group II introns
NZ_ADKP01000084.1: 51339-53257
NZ_CP022754.1: 3475637-3478065
NZ_JGES01000028.1: 109378-117028
NC_021015.1: 2325689-2327578
NZ_JH594448.1: 1968650-1971086

[0053] A match may be determined when one of the sequences in the screening database and a determined genomic sequence satisfy a sequence identity condition, e.g. they are at least 95% alike, and may be preferably identical.

[0054] Once a sequence is identified as a match to a screening sequence, i.e. identified as a mobile element or a horizontally transferred element, it is then masked (step S208), e.g. by changing the sequence to a sequence of predetermined characters such as Ns, and is thereby effectively filtered from the plurality of sequence reads which were identified in step S202. The genomics analysis is then performed at step S210 and may be metagenomics analysis or other appropriate analysis. The metagenomics analysis which is performed may be lowest common ancestor reference-based metagenomic assembly (RBMA) analysis, which allows a determination of overall taxonomic classification efficiency across a dataset. The analysis may be performed against a custom generated Kraken database, e.g. as explained in more detail below.
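The matching and masking of steps S206 and S208 can be illustrated with a short sketch. It assumes an ungapped sliding-window comparison as a simple stand-in for whatever alignment tool is actually used, which the description does not specify; the function names and the toy sequences are hypothetical.

```python
from typing import Iterable


def window_identity(window: str, screening: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    matches = sum(1 for a, b in zip(window, screening) if a == b)
    return 100.0 * matches / len(screening)


def mask_matches(sequence: str, screening_seqs: Iterable[str],
                 min_identity: float = 95.0, mask_char: str = "N") -> str:
    """Replace every window that matches a screening sequence with mask_char.

    A window matches when its ungapped identity to the screening sequence is
    at least min_identity percent (95% in the example given in the text).
    """
    masked = list(sequence)
    for screening in screening_seqs:
        width = len(screening)
        if width == 0:
            continue  # nothing to compare against
        for start in range(len(sequence) - width + 1):
            window = sequence[start:start + width]
            if window_identity(window, screening) >= min_identity:
                masked[start:start + width] = mask_char * width
    return "".join(masked)


# Toy usage: a read carrying a short element that is in the screening set.
read = "ACGTTTGACCAGGCA"
masked_read = mask_matches(read, ["TTGACCA"])
```

Comparisons are made against the original sequence rather than the partially masked copy, so overlapping matches to different screening sequences are all masked. Setting `min_identity=100.0` corresponds to the stricter case in which only identical sequences count as a match.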

[0055] As an alternative, the screening against known elements to be masked may be used when building the custom database. Thus, as shown in FIG. 2b, in a first step reference genomes for purified bacteria are generated (step S300). These reference genomes are preferably high quality draft. Each of these generated genomes is then compared to the screening database of known mobile element sequences (step S302) as described above. One method of performing the comparison is to divide each generated genome into a plurality of sequences and compare each of the plurality of sequences to the sequences within the screening database. However, it will be appreciated that other methods of identifying sequences within the reference genome which match a mobile element sequence are equally valid.

[0056] Once a sequence within the reference genome is identified as a match to a screening sequence, it is then masked (step S304), e.g. by changing the sequence to a sequence of predetermined characters, e.g. Ns. The masked reference genome is then stored together with any other reference genomes (masked or unmasked) to create the custom database (step S306). Once the database has been created, it can be used for metagenomic analysis of a sample. For example, the sample may be received and optionally purified at step S308 as in step S200 above. The genomic sequences may then be obtained as described in step S202 above. The final step of performing the genomic analysis, e.g. metagenomic analysis is then done against the database which includes any masked reference genomes.
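The database-building route of FIG. 2b (steps S302 to S306) can be sketched in the same spirit. The example below assumes exact matching, i.e. the case in which a match requires identical sequences, and uses an in-memory dictionary as a stand-in for the custom database; the genome names and sequences are invented for illustration.

```python
from typing import Dict, Iterable


def mask_exact_matches(genome: str, screening_seqs: Iterable[str],
                       mask_char: str = "N") -> str:
    """Mask every exact occurrence of any screening sequence in a genome."""
    masked = list(genome)
    for screening in screening_seqs:
        if not screening:
            continue
        start = genome.find(screening)
        while start != -1:
            masked[start:start + len(screening)] = mask_char * len(screening)
            # Search on from the next position so overlapping hits are found.
            start = genome.find(screening, start + 1)
    return "".join(masked)


def build_custom_database(reference_genomes: Dict[str, str],
                          screening_seqs: Iterable[str]) -> Dict[str, str]:
    """Return the reference genomes with screening matches masked.

    Genomes without any match are stored unchanged, so the result holds
    both masked and unmasked genomes.
    """
    screening = list(screening_seqs)
    return {name: mask_exact_matches(seq, screening)
            for name, seq in reference_genomes.items()}


# Toy usage: only genome_3 carries the mobile element.
references = {"genome_1": "AAACCCGGG", "genome_3": "AAATTGACCAGGG"}
database = build_custom_database(references, ["TTGACCA"])
```

Because the element is replaced by Ns in the stored genome, a read consisting of that element no longer finds a match in `database["genome_3"]`, which is the behaviour relied upon to avoid false positive assignment.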

[0057] Screening and subsequently masking reference genomes as described in FIG. 2b is likely to be more efficient, and thus less time-consuming, than screening and subsequently masking the plurality of sequences in the sample as described in FIG. 2a. Nevertheless, it will be appreciated that the method of FIG. 2b, like that of FIG. 2a, addresses the problems of incorrect analysis due to mobile elements or other similar elements which result in mis-classification of sequence reads against a phylogenetically organised reference database. Returning to the problem identified in the background, the read sequence 21 which is incorrectly assigned in FIG. 1f is identified as a mobile genetic element in the reference genome using the method of FIG. 2b. As schematically shown in FIG. 2c, the mobile genetic element 48 is masked in the reference genome. Accordingly, any read sequences which match this region of Genome 3 are not assigned and do not result in false positive identification.

[0058] For a faecal sample, this custom generated database may be termed the Human Gastrointestinal Microbiota Genome Collection (HGG). As an example, one version of the database comprises 1354 genomes representing 530 species from 57 families within the phyla Actinobacteria (129 genomes; 55 species), Bacteroidetes (231 genomes; 69 species), Firmicutes (772 genomes; 339 species), Fusobacteria (26 genomes; 9 species), Proteobacteria (194 genomes; 56 species) and Synergistetes (2 genomes; 2 species). To understand the phylogenetic relationship between these taxa, we extracted 40 universal core genes from each genome and performed phylogenetic analysis which is shown in FIG. 3a. The branch colour distinguishes bacterial phyla belonging to Actinobacteria (gold), Bacteroidetes (green), Firmicutes (blue), Fusobacteria (brown), Proteobacteria (red) and Synergistetes (black). Overall, the maximum phylogenetic diversity was observed in Firmicutes, particularly the classes Clostridia, Erysipelotrichia and Negativicutes. However, as shown in FIG. 3a, a broad range of species and phylogenetic groups are represented across all phyla.

[0059] This custom generated database may be created by combining data from other databases, e.g. the Human Gastrointestinal Bacteria Culture Collection (HBC) which contains isolates from the human gastrointestinal tract and other sources. Currently the HBC comprises 737 purified and archived isolates which represent 273 species (105 novel species and 173 novel genome sequences of isolates from healthy individuals) from 31 families in the phyla Actinobacteria (53 genomes; 16 species), Bacteroidetes (143 genomes; 40 species), Firmicutes (496 genomes; 203 species) and Proteobacteria (45 genomes; 14 species). A genome sequence is available for each isolate within the HBC. Another source which is used is the National Center for Biotechnology Information (NCBI) genome database which has 617 publicly available, high-quality human gastrointestinal associated bacterial genomes. It is noted that 53% of species in the custom HGG database have been archived in the HBC database. Many of the species currently absent from the HBC database but present in the custom HGG database include members of the Fusobacteria, Proteobacteria and Synergistetes which are typically absent from healthy individuals from the developed world.

[0060] As explained above, these genomes and the sequences against which they are compared may contain masked sequences corresponding to screening sequences contained within a reference screening database (ENI and ICEberg databases). The metagenomics samples may also be filtered by quality, e.g. using Trimmomatic which is described in “Trimmomatic: a flexible trimmer for Illumina sequence data” by Bolger et al published in Bioinformatics 30, 2114-2120, doi:10.1093/bioinformatics/btu170 (2014). Human contaminating reads may also be filtered by mapping to the human reference genome (hg19), for example using Bowtie 2 described in “Fast gapped-read alignment with Bowtie 2” by Langmead et al published in Nat Methods 9, 357-359, doi:10.1038/nmeth.1923 (2012). Samples with fewer than one million reads after filtering may be discarded. Filtered sequences were then classified at the genus and species levels using lowest common ancestor analysis as described in “HPMCD: the database of human microbial communities from metagenomic datasets and microbial reference genomes” by Forster et al published in Nucleic Acids Res 44, D604-609, doi:10.1093/nar/gkv1216 (2016).
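The lowest common ancestor principle referred to in the preceding paragraph can be sketched as follows. This is a generic illustration over a toy taxonomy held as a child-to-parent mapping; it is not the implementation used by Kraken or by the cited analysis, and all taxon names are hypothetical.

```python
from typing import Dict, List, Optional


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa: List[str],
                           parent: Dict[str, str]) -> Optional[str]:
    """Deepest node shared by the lineages of all the given taxa."""
    if not taxa:
        return None
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    return common[0] if common else None


# Toy taxonomy: two genera within one family.
taxonomy = {
    "species_A": "genus_X",
    "species_B": "genus_X",
    "species_C": "genus_Y",
    "genus_X": "family_F",
    "genus_Y": "family_F",
}

# A read that matches reference genomes of species_A and species_B is
# assigned to their genus; one that also matches species_C is pushed up
# to the family level.
same_genus = lowest_common_ancestor(["species_A", "species_B"], taxonomy)
across_genera = lowest_common_ancestor(["species_A", "species_C"], taxonomy)
```

The second case shows why an unmasked element shared between distantly related genomes reduces resolution: every read from that element is assigned at a higher rank than the species it actually came from.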

[0061] A set of results is shown in FIG. 3b which illustrates the percentage of classified reads for 13415 metagenomic sequenced samples. Applying lowest common ancestor RBMA with mobile element filtering achieved an average taxonomic assignment of 83% at the genus level and 79% at species level. Taken together, these analyses reveal high resolution classification of the majority of human gastrointestinal microbiota derived metagenomics reads in each of the three most highly represented continents as shown in FIG. 3b.

[0062] FIG. 4 is a schematic block diagram showing a system which may be used to implement the method of FIG. 2a or FIG. 2b. The sample 50 is collected and optionally purified as described above. The sequences which are read from the sample are then input into a computing device 52 which may be a server, a stand-alone PC or any standard device which is capable of processing data. The computing device 52 may comprise standard components such as a processing module 58, a display 60 and a user interface 62. The display 60 may be used to display the phylogenetic tree to the user and the user interface 62 may be used to allow a user to input data about the sample.

[0063] The processing module 58 performs the steps of performing phylogenetic analysis, searching for incorrect families, comparing to known agents, masking identified sequences and performing the final metagenomics analysis. This module is shown as a single module but it will be appreciated that the processing steps may be split across several components, e.g. to increase processing performance. When comparing the sequence in the sample or the reference genome to known screening sequences or when comparing the sequence in the sample to reference genomes (in the RBMA analysis), the processor may access one or more reference databases 54 to which it is communicatively coupled. In this example, the reference database 54 is shown as a separate, single component but it will be appreciated that it may also be stored in memory on the computing device 52 or may be split across multiple components. In one example, the reference database 54 may be a mobile element reference database which may include 2826 plasmids and 466 transposons and insertion sequences which may be local to the process. The reference database 54 may also include a reference genome database into which the reference genomes (masked and unmasked as per FIG. 2b) are stored.

[0064] The metagenomic analysis which is performed in the final step of the method may comprise generating a collection of genome sequences from the sample, e.g. bacterial isolates from a faecal sample. This collection may be stored in a results database 56 which is communicatively coupled to the processing module 58. Like the reference database 54, the results database 56 may be local or remote from the computing device 52.

[0065] As an example, the method may also be used to understand which species are most prevalent within the human gastrointestinal microbiota. Considering only species that are present at a level greater than 0.01% within any one of the 13,415 samples described above, 165 species were identified as present in more than two unrelated samples. This group of dominant species included members of the phyla Bacteroidetes (n=41), Firmicutes (n=82), Proteobacteria (n=27) and Actinobacteria (n=15). Given the background prevalence of each phylum, this represents a significant over-representation of species from Bacteroidetes (p<0.05) and a significant under-representation of species from Firmicutes (p<0.01).
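The selection of dominant species described above can be sketched as a two-stage filter: an abundance threshold applied per sample, followed by a count of the samples in which that threshold is exceeded. The data layout (a mapping from sample to per-species percentage abundance) is an assumption for the example, and the requirement that samples be unrelated is not modelled.

```python
from collections import Counter
from typing import Dict, Set


def dominant_species(abundance_by_sample: Dict[str, Dict[str, float]],
                     min_abundance_pct: float = 0.01,
                     more_than_samples: int = 2) -> Set[str]:
    """Species exceeding the abundance threshold in more than N samples."""
    sample_counts: Counter = Counter()
    for profile in abundance_by_sample.values():
        for species, pct in profile.items():
            if pct > min_abundance_pct:
                sample_counts[species] += 1
    return {species for species, count in sample_counts.items()
            if count > more_than_samples}


# Toy usage with hypothetical abundance profiles (values in percent).
profiles = {
    "sample_1": {"species_a": 5.0, "species_b": 0.005},
    "sample_2": {"species_a": 2.0, "species_b": 0.5},
    "sample_3": {"species_a": 0.02, "species_c": 1.0},
    "sample_4": {"species_b": 0.3},
}
dominant = dominant_species(profiles)
```

In the toy data only `species_a` exceeds 0.01% in three samples; `species_b` does so in two and is therefore excluded.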

[0066] FIG. 5a illustrates the relative abundance of the top 20 prevalent species within each phylum. The colour denotes Bacteroidetes (green), Firmicutes (blue), Proteobacteria (red) and Actinobacteria (yellow). Considering all species that were detected above background levels, the majority of dominant species are members of the Bacteroides genus. When corrected for the number of species within each phylogenetic group, the Bacteroidetes generally and the Bacteroides and Parabacteroides genera specifically are significantly over-represented (p<0.001). Despite there being over 346 species within the Firmicutes, only 6 distantly related Firmicutes species were highly represented across many individuals. Overall, all detected genera within the Firmicutes phylum were statistically under-represented in their occurrence. The majority of Proteobacteria were not detected within the samples. Interestingly, no members of the Fusobacteria or Synergistetes were found to be prevalent at the level of detection considered, suggesting that they are only found during certain conditions or stages of life that were not included in this analysis.

[0067] This data suggests a potential key role for specific members of the Bacteroides within the human gastrointestinal tract. In contrast, the significantly greater diversity observed for species from the Firmicutes suggests a dynamic, functionally redundant group.

[0068] FIG. 5b shows the relative abundance within a wider human cohort of the novel species which are contained within the HGG database. Importantly, the availability of these genomes allows us to perform species level assessment of the prevalence of these species in metagenome samples for the first time. As shown in FIG. 5b, in total 106 of the 173 novel species (60.9%) are found at greater than 0.001% abundance in at least one sample within the 13,415 public metagenome samples available. Notably almost half (87; 46%) were found in more than 100 samples but less than one quarter (389; 21.8%) were found in more than 1000 samples. Interestingly, three novel species, all within the Clostridiales, were found in almost half of the samples analysed. Two novel Lachnospiraceae species were found in 7797 (55.9%) and 7074 (50.7%) samples respectively and a novel Ruminococcaceae species was found in 6777 (48.6%) samples. This suggests that many of the novel species and genomes identified through this work occur frequently within the human population and potentially represent integral parts of the human gastrointestinal microbiota.

[0069] As shown in FIG. 5c, the custom database HGG enables high resolution functional analysis in addition to detailed taxonomic analysis. This provides the capacity to enhance our understanding of the functions provided by bacteria that inhabit the gastrointestinal tract. COG analysis, for example as described in “The COG database: a tool for genome-scale analysis of protein functions and evolution” by Tatusov et al published in Nucleic Acids Res 28, 33-36 (2000), is applied to identify features which are prevalent in the HGG bacteria. This analysis identified 4,696 protein domains represented in at least one isolate. As expected, bacterial housekeeping functions including ribosomal protein function, amino acid synthesis and other translation-associated functions dominate the 30 functions found in all bacteria within the collection. The 4,696 protein domains which were identified were then compared using discriminant analysis of principal components (DAPC) as described in “Discriminant analysis of principal components: a new method for the analysis of genetically structured populations” by Jombart et al published in BMC Genet 11, 94, doi:10.1186/1471-2156-11-94 (2010). This comparison demonstrates clear functional differences between key phyla of the human gastrointestinal microbiota as shown in FIG. 5c in which Bacteroidetes are shown in green, Firmicutes in blue, Proteobacteria in red and Actinobacteria in gold.

[0070] An enrichment analysis was then undertaken to identify those functions which were over-represented in each phylum relative to all functions present within the HGG database. This analysis identified 8, 122, 152 and 389 statistically enriched functions (q<0.001) in Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria respectively. This analysis was calculated using, for example, a one-sided Fisher's exact test with p-values adjusted by the Hochberg method. Enriched functions within the Actinobacteria were limited and primarily associated with lipid (q<1.99×10⁻⁸³) and carbohydrate metabolism (q<7.57×10⁻⁷⁷). Equivalent analysis of the Bacteroidetes specific functions identified many key functions, including iron (q<1.18×10⁻¹¹⁴) and sulfur transporter functions (q<6.82×10⁻⁹⁷) and specific sodium transporting NADH ubiquinone oxidoreductases (q<3.47×10⁻¹²⁴). Firmicutes were dominated by uncharacterised functions. However, spore formation (q<3.48×10⁻¹²³), thiamine (q<2.76×10⁻¹⁰¹) and riboflavin transport (q<7.04×10⁻¹⁰¹) were all highly enriched. Proteobacteria were dominated by fructose bisphosphatase (q<4.50×10⁻¹⁴⁰), glucokinases (q<4.55×10⁻¹²⁵) and regulators of iron cluster formation (q<9.20×10⁻⁹⁸). These results demonstrate the distinct differences in the unique functions provided by the key phyla of the human gastrointestinal microbiota.
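The statistical procedure named above, a one-sided Fisher's exact test followed by Hochberg adjustment, can be sketched using only the standard library. The layout of the 2×2 table (genomes inside and outside a phylum, with and without a given function) is an assumption for illustration, since the description does not state how the contingency tables were built, and the counts used are invented.

```python
from math import comb
from typing import List


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    Assumed table layout:
                      with function   without function
        in phylum           a                b
        other phyla         c                d

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row_total = a + b          # genomes in the phylum
    col_total = a + c          # genomes carrying the function
    n = a + b + c + d
    upper = min(row_total, col_total)
    numerator = sum(comb(row_total, k) * comb(c + d, col_total - k)
                    for k in range(a, upper + 1))
    return numerator / comb(n, col_total)


def hochberg_adjust(p_values: List[float]) -> List[float]:
    """Hochberg step-up adjustment, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    # The largest p-value is multiplied by 1, the next by 2, and so on.
    for multiplier, index in enumerate(order, start=1):
        running_min = min(running_min, multiplier * p_values[index])
        adjusted[index] = running_min
    return adjusted


# Toy usage: three hypothetical functions tested in one phylum.
tables = [(30, 10, 20, 140), (12, 28, 40, 120), (5, 35, 25, 135)]
raw_p = [fisher_exact_greater(*table) for table in tables]
adjusted_p = hochberg_adjust(raw_p)
```

Integer binomial coefficients are summed exactly and divided once, which avoids accumulating floating point error across the tail. Extremely small values such as those reported above would in practice call for a library routine working in log space.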

[0071] The HGG collection contains novel genomes of 173 species which include HBC genomes of 105 novel species and HBC genomes of 68 species not previously isolated and genome sequenced from the human gastrointestinal tract. Analysis of these genomes to identify novel functions in the human gastrointestinal microbiota identified 45 newly described functions, of which 41 were found in the Firmicutes. While these functions were dominated by uncharacterised proteins, novel functions included those associated with tetrahydromethanopterin S-methyltransferase, present in five species, preprotein translocase, also present in five species, and formaldehyde-activating enzyme necessary for methanogenesis, found in four of the uncharacterised Firmicutes. This analysis also identified Type III, IV and VI secretion system components in Bacteroidetes that were previously associated with gastrointestinal Proteobacteria and Firmicutes. Equally, Proteobacteria-associated ABC transporters were identified in gastrointestinal Firmicutes for the first time. The prevalence of these numerous uncharacterised functions further demonstrates the need for better genome annotation and functional genomics to understand these bacteria.

[0072] A spore signature was also applied, for example as described in “Culturing of ‘unculturable’ human microbiota reveals novel taxa and extensive sporulation” by Browne et al published in Nature 533, 543-546, doi:10.1038/nature17645 (2016). This revealed that 83.2% of these newly genome sequenced isolates and 85.8% of the novel species are predicted to form spores, suggesting an immense amount of diversity within these groups.

[0073] It is noted that mobile genetic elements and horizontally transferred elements differ from repetitive or low complexity sequences because the latter simply follow a pattern without a function. The fact that mobile genetic elements have a function, as described above, means that they are unlikely to be ignored or “masked” by individuals who are interested in generating genome assemblies to understand the biology of the species harbouring these mobile elements. Their biological relevance means that they should not be masked in this context.

[0074] Repetitive DNA sequence regions are not usually reliably assembled with reads from the short-read sequencing technologies that are commonly used to generate genome sequences. Therefore, in practice, these repeat regions can be ignored for the purposes of genome assembly without significantly compromising the interpretation of an organism's biology. By contrast, routinely ignoring or masking genetic loci that encode several genes, all of which are required to confer a particular biological function, would be less benign.

[0075] Moreover, DNA or genes that are horizontally transferred become apparent and are more relevant when considered in the context of sequence reads from microbial communities. In this scenario, the true biological source of the reads encoding these elements cannot be reliably assigned. This obviously complicates any de novo assembly approaches that would attempt to reconstruct whole genomes from the reads representing a mixed community.

[0076] Reference-based read assignment may be a solution to this problem. However, given that mobile elements can be reliably assembled from reads representing a single species and that horizontally transferred genes may not be identified as such without investigation beyond genome assembly, it may not be obvious that misassignment of reads would occur using this method. The method and apparatus described above address the issues with the known techniques. Furthermore, the method and apparatus may also be relevant if long-read sequence technologies replace short-read sequence technologies for metagenomics in the future.
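The misassignment problem and the effect of masking may be illustrated with a toy sketch. It assumes exact substring matching and the use of “N” as the predetermined masking character. A practical implementation would more likely rely on sequence alignment, and the sequences, species names and function names below are hypothetical.

```python
def mask_sequence(sequence, screening_sequences, mask_char="N"):
    """Replace every exact occurrence of a screening sequence with mask_char."""
    masked = list(sequence)
    for screen in screening_sequences:
        if not screen:
            continue  # an empty screening sequence matches nothing useful
        start = sequence.find(screen)
        while start != -1:
            for pos in range(start, start + len(screen)):
                masked[pos] = mask_char
            # Step by one so overlapping occurrences are also masked.
            start = sequence.find(screen, start + 1)
    return "".join(masked)


def genomes_matching(read, references):
    """Names of the reference genomes that contain the read exactly."""
    return sorted(name for name, seq in references.items() if read in seq)


# Two hypothetical reference genomes sharing one mobile element.
mobile_element = "GGATTACAG"
references = {
    "species_A": "ACGTTTGACC" + mobile_element + "GC",
    "species_B": "TTGCA" + mobile_element + "CATCA",
}
masked_references = {
    name: mask_sequence(seq, [mobile_element])
    for name, seq in references.items()
}

if __name__ == "__main__":
    shared_read = "GATTACA"    # lies inside the mobile element
    specific_read = "ACGTTTG"  # lies in the species_A-specific region
    print(genomes_matching(shared_read, references))
    print(genomes_matching(shared_read, masked_references))
    print(genomes_matching(specific_read, masked_references))
```

In this sketch a read drawn from the shared element matches both unmasked references, so its source cannot be assigned. After masking it matches neither, while a read from a species-specific region is still assigned to a single genome.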

[0077] At least some of the example embodiments described herein may be constructed, partially or wholly, using dedicated special-purpose hardware. Terms such as ‘component’, ‘module’ or ‘unit’ used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality. In some embodiments, the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more processors. These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Although the example embodiments have been described with reference to the components, modules and units discussed herein, such functional elements may be combined into fewer elements or separated into additional elements. Various combinations of optional features have been described herein, and it will be appreciated that described features may be combined in any suitable combination. In particular, the features of any one example embodiment may be combined with features of any other embodiment, as appropriate, except where such combinations are mutually exclusive. Throughout this specification, the term “comprising” or “comprises” means including the component(s) specified but not to the exclusion of the presence of others.

[0078] Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.

[0079] Although a few preferred embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be made without departing from the scope of the invention, as defined in the appended claims. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features. The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.

CPC classification: G06F30/20; G16B10/00; G16B30/10

International classification: G16B10/00; G06F30/20; G16B30/10
