SYSTEMS, METHODS, AND APPARATUSES FOR SEQUENCE ALIGNMENT
20180004892 · 2018-01-04
Assignee
Inventors
Cpc classification
G16B50/00
PHYSICS
G16B10/00
PHYSICS
International classification
Abstract
Systems, methods, and apparatuses are disclosed for reducing the computational time of assigning a species to an infection isolate. A method for dividing a search index into one or more sub-indices based on a phylogenetic tree of reference sequences is disclosed. A method for dividing reads into test sets and aligning to sub-indices for assigning a species to an infection isolate is disclosed. A system for aligning sequence reads to a database of reference sequences using sub-indices is disclosed.
Claims
1. (canceled)
2. (canceled)
3. The method of claim 8, wherein the index comprises a hash table.
4. The method of claim 8, wherein the index comprises a sorted dictionary.
5. The method of claim 8, wherein the phylogenetic tree is generated by a neighbor joining method.
6. (canceled)
7. (canceled)
8. A method of assigning a species to a plurality of sequence reads from an organism, the method comprising: receiving the plurality of sequence reads; selecting a test set of the plurality of sequence reads, wherein the test set includes selected ones of the plurality of sequence reads; selecting a plurality of sub-indices of an index, wherein the index points to all sequences of a set of sequences corresponding to a plurality of species, wherein the sub-indices point to selected sequences of the set of sequences, wherein each of the sub-indices corresponds to sectors of a phylogenetic tree of the set of sequences, wherein the phylogentic tree is based on evolutionary distances between each of the sequences of the set of sequences; aligning the test set to the plurality of sub-indices, wherein the aligning is performed in parallel by one or more processing units; identifying, with the one or more processing units, a certain sub-index of the plurality of sub-indices, wherein the certain sub-index includes sequences that have a highest alignment to the test set, based on said aligning the test set to the plurality of sub-indices; aligning, with the one or more processing units, the plurality of sequence reads to the certain sub-index; and assigning the species to the plurality of sequence reads based on said aligning the plurality of sequence reads to the certain sub-index, wherein the species assigned is based on a sequence of the certain sub-index that had a highest alignment to the plurality of sequence reads.
9. The method of claim 8, wherein selecting the test set of the plurality of sequence reads comprises selecting random ones of the plurality of sequence reads.
10. The method of claim 8, wherein selecting the plurality of sub-indices includes selecting all of the sub-indices of the index.
11. The method of claim 8, wherein the test set includes a plurality of test sets, wherein each of the plurality of test sets is aligned to a different one of the plurality of sub-indices.
12. The method of claim 8, wherein the aligning, with the one or more processing units, the plurality of sequence reads to the certain sub-index includes grouping the plurality of sequences reads into a plurality of sub-sets, wherein each sub-set is aligned to the certain sub-index.
13. The method of claim 8, wherein assigning the species is based, at least in part, on probabilistic methods.
14. (canceled)
15. (canceled)
16. (canceled)
17. A system for determining a species of an infection isolate, the system comprising: a processing unit; a memory accessible to the processing unit; a database accessible to the processing unit; and a display coupled to the processing unit; wherein the processing unit is configured to: align a plurality of sequence reads of the infection isolate stored in the memory to at least one sub-index of an index stored in the database to determine a species of the infection isolate, wherein the index points to sequences of a set of reference sequences corresponding to a plurality of species, wherein the sub-index points to selected sequences of the set of sequences, wherein the sub-index corresponds to a sector of a phylogenetic tree of the set of reference sequences, wherein the phylogenetic tree is based on evolutionary distances between each of the sequences of the set of sequences; wherein the species is determined based on a sequence of the set of reference sequences having a highest alignment to the plurality of sequence reads; and provide to the display a determination of the species of the infection isolate.
18. The system of claim 17, further comprising a sequencing unit configured to provide the plurality of sequence reads to the memory.
19. The system of claim 17, further comprising a computer system accessible to the processing unit, wherein the processing unit is configured to provide the determination of the species of the infection isolate to the computer system.
20. The system of claim 17, wherein the processing unit comprises a plurality of processing units configured to process in parallel.
21. The system of claim 17, wherein the database includes a plurality of databases.
22. The system of claim 17, wherein the processing unit is further configured to provide a probability that the determination of the species of the infection isolate is correct.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0007]
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
DETAILED DESCRIPTION
[0015] The following description of certain exemplary embodiments is merely exemplary in nature and is in no way intended to limit the invention or its applications or uses. In the following detailed description of embodiments of the present systems and methods, reference is made to the accompanying drawings which form a part hereof, and in which are shown by way of illustration specific embodiments in which the described systems and methods may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the presently disclosed systems and methods, and it is to be understood that other embodiments may be utilized and that structural and logical changes may be made without departing from the spirit and scope of the present system.
[0016] The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present system is defined only by the appended claims. The leading digit(s) of the reference numbers in the figures herein typically correspond to the figure number, with the exception that identical components which appear in multiple figures are identified by the same reference numbers. Moreover, for the purpose of clarity, detailed descriptions of certain features will not be discussed when they would be apparent to those with skill in the art so as not to obscure the description of the present system.
[0017] Although identification of pathogens is described, this application is provided for exemplary purposes only. The methods, systems, and apparatuses described herein may be used for a wide variety of applications not limited to pathogen identification. Other applications may include, but are not limited to, genealogy, forensics, and botany.
[0018] An infection may be caused by a pathogen such as bacteria, a virus, a fungus, a parasite, or other organism. Some infections may be caused by multiple types of organisms present at the same time.
[0019] When an infection is detected, samples may be collected by medical staff from the patient. Samples may include tissue, blood, and/or bodily fluid. The samples may then be processed to isolate the pathogen causing the infection from other materials in the sample. The infection isolate may then be analyzed by a variety of methods. The analysis may determine the pathogen type, species, drug resistance, and/or other properties.
[0020] The genetic material of the infection isolate may be sequenced. Examples of sequencing methods include single-molecule real-time sequencing, pyrosequencing, and polony sequencing. Other sequencing methods may also be used. Using high-throughput sequencing (also known as next generation sequencing), the genetic material is sequenced in parallel, which may generate thousands to millions of sequence fragments. The sequence fragments generated by the sequencing method are generally referred to as “sequence reads,” or simply “reads.” The reads may be anywhere from a few tens to tens of thousands of base pairs long. In some sequencing methods, the reads may be the entire length of the infection isolate sequence. The reads of the infection isolate may then be analyzed to find a match to a reference sequence. The process of matching one or more reads to a known sequence is generally referred to as alignment.
[0021] Several algorithms for aligning sequences and/or reads have been developed. Certain algorithms, such as Smith-Waterman and Needleman-Wunsch algorithms, may be used to find the optimal alignment between a read and a reference sequence. Even once an optimal alignment, there may be mismatches and/or gaps between the read and reference sequence. A gap may be due to a string of mismatched bases in a row and/or a difference in length between the reference sequence and the read. A score or other measure (e.g., total number of matched nucleic acids, length of longest gap, etc.) of how well the read and reference sequence are aligned at the optimal alignment may be provided. Optimal alignment algorithms may provide the most accurate result, but the computational intensity of most of these algorithms make them difficult to implement when a large number of reads and/or reference sequences are to be aligned.
[0022] Probabilistic algorithms have been developed that may increase the speed of sequence alignment, but at the expense of not guaranteeing the optimal alignment. These algorithms may provide a measure of probability of having found the best alignment and/or the probability of having found the closest match between a read and a reference sequence from a set of reference sequences in a database.
[0023] A family of probabilistic algorithms breaks the reads and sequences into k-mers, or “words” consisting of a number (k) of base pairs. The algorithm then searches for matches in the read k-mers and the reference sequence k-mers. An example of such an algorithm is the Basic Local Alignment Search Tool (BLAST). Another family of probabilistic algorithms apply a transform to the reads and sequences such as a Borrows-Wheeler Transform (BWT). The transform may reduce the number of identical copies of a portion of a sequence, reducing alignment time. An example of this type of alignment algorithm is the Bowtie algorithm. Other probabilistic algorithm families may be used. (Li H, Homer N, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, 2(5), 473-483, 2010.)
[0024] Many probabilistic algorithms generate secondary data structures called search indices that point to elements in the primary data structure. For example, in a book, a search index provides a list of major topics and points to the pages on which a major topic is discussed. In this application, the elements in the primary data structure are sequences. These indices may be generated for the references sequences and/or sequence reads from a sample. The search indices may provide a data structure that is optimized for searching by the chosen algorithm to find matching sequences and/or sequence segments. A search index may allow the algorithm to align the reads and sequences more rapidly and/or accurately. The trade-off for this improvement in performance may be additional databases and storage space in a memory to store the search index. Examples of search index structures include, but are not limited to, hash tables, suffix/prefix trees, binning, and linear indices. An alignment algorithm may utilize one or more search indices.
[0025] An example of a system 100 used for aligning reads to one or more reference sequences according to an embodiment of the disclosure is shown as a block diagram in
[0026]
[0027] The reference sequences, reads, and/or indices may be stored in different databases. The databases may be stored in different memories accessible by one or more processing units. In some cases, one or more of the databases may be divided across a plurality of memories. For example, a portion of the database of reference sequences may be stored in one memory while a second portion of the database of reference sequences may be stored in another memory. Each memory may contain a unique portion of the database or each memory may contain a portion of the database that is also stored in another memory. This may provide back-up protection in the case of a hardware failure and/or faster access for commonly used data.
[0028] Separating the data into multiple databases and/or multiple memories may allow for one or more processing units to perform alignment of reads in parallel. This may decrease computation time. For example, a search index pointing to 1,000 reference sequences may be divided into ten sub-indices each pointing to 100 reference sequences. A processing unit or processing units may access 5,000 reads in a memory and align the reads to each sub-index in parallel. This may result in an alignment in less time than if the 5,000 reads were aligned to the full search index.
[0029]
[0030] Phylogenetics is the study of evolutionary relationships between organisms. Such relationships are often represented as weighted graphs such as trees. An example of a phylogenetic tree 400 is shown in
[0031] Multiple phylogenetic methods exist, including methods based on evolutionary distances, parsimonious, and maximum likelihoods. Distances based methods are where an evolutionary distance is calculated between each organism. The evolutionary distance is calculated based on the degree of similarity between genetic sequences of organisms. One such method for determining evolutionary distances is called the Jukes-Cantor (Evolution of protein molecules In Mammalian protein metabolism, Vol. III (1969), pp. 21-132 by T. H. Jukes, C. R. Cantor edited by M. N. Munro) method where the transition from any particular nucleotide in the genome to another, i.e. transitions or transversions, can occur with the same probability:
[0032] In Equation 1, above, the instantaneous rate matrix Q represents the rates of change between a pair of nucleotides per instant of time. P—the probability transition matrix is given as
p(t)=e.sup.Qt Equation 2
[0033] As a result, the evolutionary distance between any two organisms under this model is simply:
[0034] Where p is the number of sites along the single nucleotide polymorphisms (SNPs)/DNA that differ between the sequences. The distance goes to infinity as p approaches the equilibrium value (75% of sites differ). This simple model, however, does not take into account the biological consideration that transitions (purine to purine (a-g) or pyrimidine to pyrimidine (t-c)) and transversions (purine to pyrimidine or vice-versa) occur at different rates. Another distance model, the Kimura 2-parameter model (Kimura, Motoo. “A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.” Journal of molecular evolution 16.2 (1980): 111-120), attempts to correct for this. In this case:
[0035] For p (proportion of transitions) and q (proportion of transversions).
[0036] Once reference sequences have been compared to determine their evolutionary distances, rates of evolution may be determined. The evolutionary distances and relationships between reference sequences may then be plotted in graphical form, such as a tree plot. Neighbor Joining (Saitou N, Nei M. “The neighbor-joining method: a new method for reconstructing phylogenetic trees.” Molecular Biology and Evolution, volume 4, issue 4, pp. 406-425, July 1987) is one method of building unrooted trees. The method corrects for unequal evolutionary rates between sequences by first finding a pair of neighboring leaves i and j which have the same parent node k. That is, leaves i and j may be pathogens that evolved from a common pathogen k. Leaves i and j may then be removed from the list of leaf nodes and k is added to the current list of nodes, and node distances are recalculated. This algorithm is an example of a greedy “minimum evolution” algorithm.
[0037] Another method of building phylogenetic trees is the unweighted pair group method with arithmetic mean (UPGMA) (Sokal R., Michener C. “A statistical method for evaluating systematic relationships.” University of Kansas Science Bulletin 38: 1409-1438, 1958). The UPGMA algorithm is agglomerative and generates a rooted tree. Initially, each sequence defines a single cluster. With each iteration, clusters are combined to form larger clusters. This continues until all sequences are included in a single cluster. With each iteration, two clusters of sequences that are found to have the shortest evolutionary distance are combined into a higher-level cluster. The evolutionary distance between clusters is the average of all evolutionary distances between corresponding pairs of sequences in each of the clusters. The algorithm reiterates until all reference sequences are placed in the tree.
[0038] Single-linkage clustering is a method of building rooted trees similar to UPGMA. However, rather than using the average evolutionary distance between all corresponding pairs of sequences between clusters, the evolutionary distance between clusters is defined by the minimum distance between a sequence in a first cluster and a sequence in a second cluster. That is, the distance of a single pair of sequences defines the distance between clusters.
[0039] Complete-linkage clustering is also a method of building rooted trees similar to UPGMA and single-linkage clustering. As with single-linkage clustering, the evolutionary distance between a single pair of sequences, each included in a different cluster, defines the evolutionary distance between two clusters. However, in complete-linkage clustering, the pair of sequences that has the greatest evolutionary distance defines the evolutionary distance between the two clusters.
[0040] Unlike neighbor joining, the UPGMA algorithm and related clustering algorithms assume a constant rate of evolution. The above methods of generating phylogenetic trees are provided for example purposes only. Other methods of generating phylogenetic trees may be used without departing from the scope of the disclosure.
[0041]
[0042] As mentioned previously, next generation sequencing methods may generate thousands to millions of reads for a single infection isolate. Even if an index has been divided into sub-indexes, aligning all of the reads to each sub-index in parallel may take a long period of time.
[0043] Other permutations of subdividing the set of reads are possible. For example, the set of reads may be divided into any number of test sets, and each test set may be aligned to one or more sub-indices. The test sets may be stored in one or more memories accessible by one or more processing units. The test sets may be stored in the same memory or a different memory than the set of reads. The test sets may be stored in the same memory or different memory than the search index and/or sub-indices. The test sets, if combined, may comprise the entire set of reads. However, the test sets, if combined, may comprise only a subset of the entire set of reads. This may reduce the computation time required for aligning the test sets. The sub-index found to have the best alignment with one or more test sets may then be aligned to the complete set of reads. This step may be omitted if a less accurate result is adequate.
[0044] When the sub-indices are sectors of a phylogentic tree generated from the reference sequences pointed to by the index, the alignment may return phylogenetic information as at least a portion of the result. For example, after a set of reads generated from sequencing an infection isolate sample have been aligned to one or more sub-indices, a result may include the most likely species of the infection isolate. Other phylogenetic information may also be provided.
[0045]
[0046] A database 710 of reference sequences may also be stored in the memory or in a separate memory. The reference sequences may be the entire set of known reference sequences for all organisms or may be a subset of the entire set of known reference sequences, for example, only reference sequences from bacteria. A search index may be generated for the reference sequences in the database 710. The search index may also be stored in database 710. In some embodiments, database 710 may include multiple databases. A phylogenetic tree may be generated for the reference sequences. Based on the phylogenetic tree, one or more sub-indices 710a-e may be generated. Each sub-index 710a-e may point to reference sequences in a corresponding sector of the phylogenetic tree. For example, each sub-index 710a-e may represent a clade of the phylogentic tree of the reference sequences. Although five sub-indices are shown, any number of sub-indices may be generated from the search index. The generation of the search index and sub-indices may be performed prior to receiving a set of reads. The generation of the search index and sub-indices may only need to be performed once, and the resulting search index and sub-indices may be utilized multiple times for any number of sets of reads.
[0047] One or more processing units may access the test sets 706a-e and align each test set 706a-e to a corresponding sub-index 710a-e. The alignment of each test set with each sub-index may be performed in parallel. In some embodiments, the test sets 706a-e may only be aligned to certain ones of the sub-indices 710a-e. For example, some sub-indices may correspond to sectors of the phylogenetic tree that are known to contain no pathogenic species. These sub-indices may then be excluded from alignment when searching for an infection isolate species. The processing unit may analyze the result 715 of the alignments and identify the sub-index with the optimal alignment or the highest probability of containing the optimal alignment. In the example shown in
[0048] The set of reads 705 may then be divided into one or more sub-sets 707a-e. Although five sub-sets are shown, any number of sub-sets may be generated from the set of reads 705. The sub-sets 707a-e, when combined, may include the entire set of reads 705. The sub-sets 707a-e may be identical or different from the test sets 706a-e. For example, if the combined test sets 706a-e only included a portion of the reads of the set of reads 705, the sub-sets 707a-e may be different. In another example, more or fewer sub-sets may be generated than test sets.
[0049] One or more processing units may access the sub-sets 707a-e and align each sub-set 707a-e to the optimal index 710c in parallel. In some embodiments, multiple copies of the optimal index 710c may be generated to facilitate parallel processing of the alignment. Other methods of facilitating parallel processing may also be used. The processing unit may analyze the results of the alignments of the subsets 707a-e to the optimal sub-index 710c. The processing unit may then return a result 717 of the most likely species of the infection isolate. Probabilistic methods, as described previously, may be used for the assignment of the most likely species. Other information may also be provided with the result 717. For example, a probability that the correct species has been identified, a degree of similarity between the reference sequence of the most likely species and the sequence reads of the infection isolate, and/or other likely species may be included. The result 717 may be provided to a user on an electronic display, transmitted to an external computer system, and/or stored in a memory.
[0050] The systems, methods, and apparatuses described above may improve patient outcomes by reducing the computational time of assigning a species to sequence reads from an infection isolate. When a patient is determined to have an infection, a sample may be collected from the patient. The sample may be processed to obtain an infection isolate. The infection isolate may then be sequenced by a sequencing device that generates a plurality of sequence reads. The sequence reads may be converted into electronic form and provided to a system according to an embodiment of the disclosure to compare the sequence reads to reference sequences to determine the species of the infection isolate. The system may use one or more methods of sub-dividing of the reads and/or search index described above, which may reduce the computation time required to assign a species to the infection isolate. The species assignment of the infection isolate may allow clinicians to implement the most effective treatments against the particular pathogen infecting the patient. This may reduce the time between infection and initiation of the most effective treatment. This may also reduce treating patients with ineffective or less effective treatments which may have undesirable side effects. For example, broad spectrum antibiotic treatment may be avoided if it is determined that the infection is caused by bacteria resistant to broad spectrum antibiotics.
[0051] The systems, methods, and apparatuses described above may allow for lower cost memories, databases, and/or processing units to be used for implementation. This may increase access to sequencing and alignment capabilities.
[0052] It is to be appreciated that any one of the above embodiments or processes may be combined with one or more other embodiments and/or processes or be separated and/or performed amongst separate devices or device portions in accordance with the present systems, devices and methods.
[0053] Finally, the above-discussion is intended to be merely illustrative of the present system and should not be construed as limiting the appended claims to any particular embodiment or group of embodiments. Thus, while the present system has been described in particular detail with reference to exemplary embodiments, it should also be appreciated that numerous modifications and alternative embodiments may be devised by those having ordinary skill in the art without departing from the broader and intended spirit and scope of the present system as set forth in the claims that follow. Accordingly, the specification and drawings are to be regarded in an illustrative manner and are not intended to limit the scope of the appended claims.