SYSTEMS AND METHODS FOR ENCODING GENOMIC GRAPH INFORMATION
20180373839 ยท 2018-12-27
Assignee
Inventors
Cpc classification
G16B20/20
PHYSICS
G16B50/00
PHYSICS
G16B45/00
PHYSICS
H03M7/40
ELECTRICITY
International classification
Abstract
The invention provides systems and methods for encoding specific portions of a genomic graph in a way that is useful for copying or moving those portions from one location to another. The genomic graph can comprise nodes connected by edges, wherein the nodes and edges are stored as edge objects in non-transitory memory of a computer system and wherein an edge within the graph represents a genomic sequence. In on embodiment, the a method of encoding a genomic graph comprises selecting at least one edge from the genomic graph, the at least one edge representing a single genetic sequence and having a position on the genomic graph. The method can further comprise encoding the at least one edge as a string of bits, in which at least a first component of the string of bits stores at least a portion of the position of the edge on the genomic graph.
Claims
1. A method of encoding a genomic graph representing variations in genetic sequence, the method comprising: providing a genomic graph comprising nodes connected by edges, wherein the nodes and edges are stored as objects in non-transitory memory of a computer system and wherein an edge within the genomic graph represents a genomic sequence; selecting at least one edge from the genomic graph, the at least one edge representing a single genetic sequence and having a position on the genomic graph; encoding the at least one edge as a string of bits, in which at least a first component of the string of bits stores at least a portion of the position of the branch on the genomic graph.
2. The method of claim 1, wherein selecting at least one edge comprises selecting edges having an associated identifier.
3. The method of claim 1, wherein selecting at least one edge from the genomic graph further comprises traversing the genomic graph and identifying paths beginning with nodes having multiple outgoing edges, ending at nodes having multiple incoming edges, and having at least one directed edge specifying a linear path between.
4. The method of claim 3, wherein selecting at least one edge from the genomic graph comprises selecting the at least one directed edge specifying a linear path between a node having multiple outgoing edges and a node having multiple incoming edges.
5. The method of claim 1, wherein selecting at least one edge from the genomic graph further comprises determining an ordering of the edges in the genomic graph.
6. The method of claim 5, wherein determining an ordering comprises identifying a longest path in the graph from a source vertex to a sink vertex.
7. The method of claim 1, wherein the position of the at least one edge on the genomic graph further comprises a starting position and an ending position; and further wherein encoding the branch as a string of bits comprises encoding at least a portion of the starting position and a least a portion of the ending position of the at least one edge on the genomic graph.
8. The method of claim 1, wherein the starting position and the ending position share a common prefix, and encoding the branch as a string of bits further comprises including third component that includes the common prefix.
9. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the length of the associated genomic sequence.
10. The method of claim 1, wherein encoding the at least one edge as a string of bits further comprises encoding the associated genomic sequence.
11. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the type of variation represented by the genomic sequence.
12. The method of claim 11, wherein the type of variation is encoded as a vector.
13. The method of claim 11, wherein the type of variation is a deletion of a single nucleotide, a single nucleotide polymorphism, a single nucleotide insertion, a multiple nucleotide insertion, or a replacement of one nucleotide by multiple nucleotides.
14. The method of claim 13, wherein encoding the at least one edge as a string of bits comprises encoding only a start position of the at least one edge on the genomic graph.
15. The method of claim 1, wherein encoding the branch as a string of bits further comprises encoding the associated genomic sequence.
16. The method of claim 1, further comprising removing the at least one edge from the graph.
17. The method of claim 16, further comprising selecting and encoding at least one second edge from the genomic graph.
18. The method of claim 1, wherein the first component includes length bits indicating a length of the first component and representation bits that include the portion of the starting coordinate.
19. The method of claim 18, wherein each component is stored as a sub-string of the string of bits wherein a first portion of the sub-string represents a length of the sub-string and a second portion of the sub-string includes stored information.
20. The method of claim 19, further comprising performing the selection and encoding steps to convert the genomic graph into a linear string of bits comprising a representation of each of the plurality of edges.
21. The method of claim 1, further comprising storing a plurality of edges from the genomic graph, each as its own string of bits.
22. The method of claim 1, further comprising storing all of the edges from the genomic graph so that the entire genomic graph is stored as a sequence of strings of bits, each string of bits comprising a first portion representing a length of the string and a second portion representing the position.
23. A method of decoding a genomic graph from a serialized specification, the method comprising: receiving a representation of a genomic sequence; associating the genetic sequence with a first edge of a genomic graph; receiving a representation of a variation of the genetic sequence, the representation further comprising a position on the genomic graph of the variation; associating the representation of the variation with a second edge; and inserting the second edge into the genomic graph at the position.
24. The method of claim 23, further comprising segmenting the first edge at the position of the variation.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0033] Features and advantages of the claimed subject matter will be apparent from the following detailed description of embodiments consistent therewith, which description should be considered with reference to the accompanying drawings.
[0034]
[0035]
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
[0045]
[0046]
[0047]
[0048]
[0049]
[0050]
[0051]
[0052]
[0053]
[0054]
[0055]
[0056]
[0057]
[0058]
[0059] For a thorough understanding of the present disclosure, reference should be made to the following detailed description, including the appended claims, in connection with the above-described drawings. Although the present disclosure is described in connection with exemplary embodiments, the disclosure is not intended to be limited to the specific forms set forth herein. It is understood that various omissions and substitutions of equivalents are contemplated as circumstances may suggest or render expedient.
DETAILED DESCRIPTION
[0060] Described herein are systems and methods for describing a genetic sequence as an insertion of an edge to a genomic graph and compressing the description for compact storage and ease of transfer. Various embodiments of systems and methods disclosed herein may be referred to as Compressed Insert Specification (CIS) Format. Further, various methods and systems described throughout this disclosure provide an approach to encoding information (e.g., genetic information) that minimizes the number of required bytes, which is particularly advantageous when the information includes genomic or graph related data. For example, genetic information can be structured as a reference graph that can store genetic information from multiple individuals, or even multiple organisms. Using a minimal number of bytes, the invention encodes both coordinates and paths through a genomic reference graph using a variable number of bytes, while also storing information in the encoded information (e.g., in a header octet) about the total size of the total number of bytes. This minimizes wasted bytes as the total number of bytes is tuned to the specific information or query and simplifies the decoding process as a decoder is automatically provided with this key information.
[0061] One way to represent a genome is as a directed graph having a plurality of vertices and a plurality of directed edges in such a reference graph representation of a genome (e.g., a sequence graph or a genomic graph), genetic variations within species can be incorporated as divergent paths along the graph. Genetic similarities within a species can be represented as a single path. In other words, the data associated with genetic similarities must only be stored once in the sequence graph. Genomic data, such as nucleotide or protein sequences, may be associated with either the vertices or edges of the graph, and thus the genomic graph can be either vertex-based or edge-based. The direction of the edges indicates the allowable sequences that may be represented by following paths through the graph. Different paths through the graph may represent variation in sequences. An edge-based genomic graph can be built, for example, by first associating a single edge with a reference sequence. While the present disclosure illustrates edge-based graphs, however it should be noted that methods and systems described herein are equally applicable to vertex-based graphs.
[0062] In practice, a graph may be serialized and deserialized in order to store and load the graph in computer memory.
[0063]
[0064] Variation of the genetic sequence can be described by adding subsequent edges to the is graph. An edge can be added to the genomic graph 201 by referring information specifying the edge, such as a sequence and a position on the genomic graph 201. As shown in this embodiment, the genetic sequence is associated with a single edge 209, and each letter within the sequence is associated with an increasing offset value starting from zero. Offset values may be used to identify particular positions within the graph for selecting desired insertion points of new edges, for example.
[0065]
[0066]
[0067] In certain embodiments, the offset values are retained as edges are segmented by the addition of new inserts. For example, as shown in
[0068] Further, as each edge is sequentially inserted into the graph, the new edge may also be associated with a unique identifier (e.g., a number that increments by one, relative to the identifier associated with the existing edge upon which the insertion is occurring). As previously inserted edges are segmented into multiple edges (i.e., by the addition of a new nodes in order to accommodate the inserted edge), these segmented edges may retain their initial identifier. Identifiers may be utilized later when determining an order for deconstructing the graph, for example. Typically a graph is deconstructed in the order in which it was constructed, i.e., the inserted edges are removed in the reverse order in which they were inserted.
[0069] The above-described method may also be reversed to serialize a genomic graph,
[0070]
[0071] Identifying 509 an edge or set of edges to encode and remove can be performed in a variety of ways. In certain embodiments, the edge or set of edges may have an associated identifier (e.g., an identifier that was associated with the edge when it was first added to the graph). In these embodiments, the associated identifier indicates the order in which the edges were added to the graph. The most recently added edge or set of edges may then be selected for removal. Edges segmented during graph construction may retain the same identifier, and are thus selected together. Identifiers may be associated with edges as metadata, for example.
[0072] An ordering for edge removal and graph deconstruction can also be determined if no previously existing information is available. In certain embodiments, the ordering may be determined by identifying edges that were most recently added, i.e., in the reverse order of construction. For example, one may traverse the graph 201 to identify an edge or set of edges for removal. This can be performed by traversing the graph (e.g., by a depth-first search) and noting those paths through the graph which begin with nodes having multiple outgoing directed edges, end at nodes having multiple incoming edges, and have at least one directed edge specifying a linear path between them, without any intervening nodes. If no paths satisfy this rule, then the only remaining set of edges is linear (i.e., has no divergent paths) and can thus be selected as the final set of edges for encoding and removal. For example, as shown in the embodiment of
[0073] In other embodiments, an ordering for removal may be determined by identifying an order in which the edges may have initially been inserted into the graph and then annotating the graph with this information. In one embodiment, a first path is selected from the graph that traverses from a source vertex (i.e., the first vertex) to a sink vertex (i.e., the last vertex). This path is considered to be the backbone branch, and the edges and/or vertices in this path may be annotated with an initial identifier (e.g., 1 or 0), indicating that it should be inserted as a first edge insertion when creating the graph (or, if deconstructing the graph, the last to be encoded and removed). (The relevance of a backbone path, source vertex, and sink vertex are discussed in more detail below with respect to
[0074] Preferably, path selection identifies the longest available path available through the graph. For example, when selecting the backbone path, the graph may be traversed (e.g., via a depth-first search) to identify the longest path in terms of the number of symbols, edges, vertices, or other measure. Selecting the longest available path is beneficial because it reduces the number of subsequent edge insertions that need to be encoded and decoded, resulting in a more efficient and compact representation of the graph in serialized form. Accordingly, in certain embodiments, a graph may be deconstructed in a different manner from how it was initially constructed.
[0075] When annotating paths with ordering information, those edges connected to vertices having multiple edges may be annotated with an additional identifier indicating its order in terms of the number of available outgoing paths from that vertex. For example, after the backbone branch is first identified, the first directed edge not followed (which is then followed when identifying the second path) may be annotated with a 2, indicating that it is the second available directed edge and specifying the second path that can be followed from that vertex. Similarly, the directed edge associated with the backbone branch may be annotated with a 1.
[0076] Once an edge or set of edges have been identified 509 (i.e., the edge 219 representing A in the graph 201 of
[0077] Once the edge has been serialized and encoded, it may then be removed from the graph 201, e.g. as shown in
[0078] Linearizing graphs is useful because graphs can take up a large amount of computer memory, and thus it can be more efficient to load the graph or subsets of the graph only when needed. However, at the scales associated with genomic data, procedures associated with linearizing and de-linearizing graphs may also present a performance bottleneck. These issues can result in complexities during analysis, causing a non-negligible impact on genomic research. Systems and methods of the invention provide improvements in storing and representing graph-related data at genomic scales. While shown in
[0079] The present disclosure recognizes that a graph can be described as a plurality of edge insertions. Every inserted edge can be described by two parameters: a position of the edge within a graph, and the variation in genetic sequence that the edge represents. The position within the graph may comprise a start position and an end position. The variation in genetic sequence may comprise the length of the sequence or number of symbols associated with that edge, the genetic sequence itself, and/or an encoding of the type of variation. This information can be used to store graphs in a serialized format for ease of transport and instantiation. Further, this information can be efficiently represented to improve access and loading times, e.g., by using embodiments of encoding schemes as described below.
Graph Genome Coordinate Encoding
[0080]
[0081] The LR8 and LFR8 encoding schemes described herein represent coordinate components as a variable number of bytes, wherein the representation itself includes information about its size. In particular, an encoding scheme for storing coordinate components is proposed in which a header octet indicates the number of additional bytes (Length) following the header octet that is used to accommodate the stored coordinate component (Representation). In this way, the encoding scheme is able to minimize the number of bytes required to store a coordinate component, thus achieving efficiencies in storage and speed while also not suffering from problems associated with pure variable length encoding schemes. Further, the header may also store other information related to the integer or position (e.g., whether the integer represents the last component in the coordinate) as additional binary information, such as a set of flags (Flags). The LFR8/LR8 encoding schemes can be used to efficiently encode any kind of data, including graph-related data, such as coordinates and paths.
Assigning a Coordinate
[0082] Methods of the disclosure provide a coordinate system to identify locations or positions in a graph. Coordinates may comprise a series of integers (which may be referred to as coordinate components) describing a path through a graph that may be taken to arrive at a particular position. The positions at which serialized edges should be inserted into the graph can be defined by coordinates as described herein, for example.
[0083] The disclosed method encodes graphs and accompanying variation data, and is compatible with a variety of graph coordinate systems. For purposes of the present disclosure, a coordinate refers to an expression that defines a particular or relative position within the graph. Once a coordinate system has been established, pairs of coordinates can be used to define a trajectory specifying a path from a first position to a second position within the graph.
[0084]
[0085] Often, reference graphs designate a backbone branch representing a particular linear reference sequence (e.g., the nucleotide sequence of chromosome 1 of the human genome). Additional branches diverging from and reconnecting to the backbone branch thus represent known variation from the linear reference sequence, and represent events such as single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and larger structural variants. As shown in
[0086] The reference graph 100 includes three additional branches specifying variations from the backbone branch 102. For example, a first branch 104 represents a nucleotide sequence A, a second branch 106 represents a nucleotide sequence GACTG, and a third branch 108 represents a nucleotide sequence TACTG. As shown, the first branch 104 and the second branch 106 are each connected to the backbone branch 102, and the third branch 108 is connected to the second branch 106. Similar to the backbone branch 102, a nucleotide sequence for each of the branches 104, 106, 108 is associated with an increasing offset value starting from zero.
[0087] Each branch describes a path or a particular set of edges and vertices within the reference graph 100 that may be followed through a series of increasing offset values. While the branches 102, 106, and 108 (and offsets) are shown at specific locations relative to the backbone branch 102, these or other branches could be specified in a variety of ways. For example, the backbone branch 102 could instead begin with AAT instead of ACT. In that case, the edge representing the nucleotide A in 104 would instead be reassigned to the backbone branch 102, with a corresponding reassignment of the offset values as well the edge within branch 102 that specifies C would instead be assigned an offset of 0 as a new alternate branch, whereas the edge specifying A would have offset 1 in the backbone branch). Accordingly, for a particular graph, designated branches and offset values may vary.
[0088] In practice, designating offsets and branches may be specified as a graph is created. The first reference sequence incorporated can be specified as the backbone branch, and its associated nucleotide sequence can be assigned increasing offset values. Subsequent sequences representing variations from the reference can be inserted as new branches connected to the backbone branch. These new branches will have new relative offset values starting from zero.
[0089] One exemplary coordinate system to identify particular positions on the graph specifies a sequence of (offset, branch) pairs. Pairs may be specified as a list of integers, each separated by a period .. Each integer in the coordinate may also be referred to as a coordinate component. The sequence of (offset, branch) pairs indicates a unique path that may be taken to reach a particular position within the graph. For example, starting from the source vertex 130 (i.e., the first vertex on the backbone branch 102), each (offset, branch) pair represents (i) the desired offset to be reached on the current branch, followed by (ii) an indication of whether an edge to an alternate branch (and having a new relative offset) is followed before that offset. Subsequent pairs in the coordinate are similarly followed. In this exemplary system, the first integer is zero-based (i.e., the first offset is 0, the second offset is 1, etc.), whereas the second integer begins from 1 (i.e., the first outgoing edge is 1, the second outgoing edge is 2, etc.).
[0090] For example, an A 110 in the first branch 104 may be identified by proceeding to the first offset (1), and moving to an alternate branch, i.e., by following a second outgoing edge 120 (2) from the vertex immediately before that offset. Therefore, the location of the A 104 in the reference graph 100 can be represented by the coordinate 1.2. While not necessary, the coordinate may also include that no further offset movements are required. Therefore, the A 110 can also be represented by the coordinate 1.2.0.
[0091] Similarly, if no divergent paths from the current branch are required to reach the desired position, the second integer in the coordinate can be omitted. In another example, a T nucleotide 112 on the backbone branch 102 may be specified by the coordinates 2, which simply represents a traversal to the nucleotide having offset 2 in the backbone branch 102 from the source vertex 130.
[0092] More complicated coordinates can include multiple (offset, branch) pairs, indicating the traversal across multiple branches to reach a desired position. For example, a C nucleotide 114 in the third branch 108 can be reached by proceeding, from the source vertex 130 on the backbone branch 102, to the nucleotide having offset 3 (3), following a second outgoing edge 122 (2) before that offset to the second branch 106, proceeding to offset 3 (3) on the second branch 106, taking a second outgoing edge 124 (2) just before that offset to the third branch 108, and then traversing to offset two (2) on the third branch 108. As a result, the location of the C nucleotide 114 in the reference graph 100 can be represented by the coordinate 3.2.3.2.2.
[0093] In use, an initial offset may be much larger than the prior examples due to the large size of a reference genome. For example, a SNP on chromosome 1 at position 203,422,877 could be identified by the coordinates: 203422877.2, representing a traversal at that position along a second outgoing edge (2) to an alternate branch representing the SNP.
[0094] In some examples, an offset may be larger than the number of nucleotides on a current branch. In these cases, the offset is considered to fall off the branch (e.g., along an outgoing edge 120 from the first branch 104, an outgoing edge 126 from the second branch 106, or an outgoing edge 128 from the third branch 108) and continue along the connected edges until the desired relative offset is reached. For example, the second C nucleotide 116 in the backbone branch 102 (at offset 5) can also be reached by coordinates 3.2.6, representing a traversal to the vertex having offset 3 on the backbone branch 102, following the second outgoing edge 122 to the second branch 106 before that vertex, and then traversing an additional six offsets. However, after offset 4, the only available outgoing edge is edge 126, which falls off, or returns to the backbone branch 102. An additional two offsets are then traversed to arrive at the C nucleotide (having offset 5 on the backbone branch 102).
[0095] The coordinates essentially describe a path that may be taken to identify a particular position in the reference graph 100 from the source vertex 130. As graphs are multi-dimensional structures, a single position may be represented (or reached) by multiple different coordinates. Each unique coordinate describes a unique path to the selected position within the reference graph 100. For example, the T nucleotide 112 in the backbone branch 102 may also be represented by, the coordinate 1.2.1. The coordinate 1.2.1 represents a path that: traverses to the nucleotide having offset 1 (i.e., the first C nucleotide in the backbone branch 102), follows the second outgoing edge 120 before that offset to the first branch 104, and traverses an additional two offsets to the T nucleotide 112. Similar to the previous example, the offset specified by the coordinate component is relative because the first branch 104 ends and returns to the backbone branch 102. Similarly, the C nucleotide 114 in the third branch 108 may be represented by the coordinates 3.2.3.2.2 and also 1.2.2.2.3.2.2.
[0096] When a single coordinate is used to represent a single point or offset within the graph, they essentially describe a path that may be taken, starting with the leftmost or source vertex 130, to identify a particular position in the reference graph 100. However, pairs of coordinates may be specified indicating a start position and an end position. The trajectory between the start position and end position defines a path that may be taken through the reference graph 100. By specifying a pair of coordinates, a path may be specified that starts from a coordinate other than the source vertex 130.
[0097] Briefly, next-generation sequencing reads may be aligned against a graph genome (such as the reference graph 100) using a graph-aware alignment algorithm or another algorithm that has been modified to align to a graph. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference. The start position of each sequence read may be located at any position within the graph genome. Further, alignments that describe variation (such as SNPs, short insertions or deletions, or larger structural variations) may have paths that span multiple branches. Paths and alignments can be represented in the reference graph 100 by defining a trajectory or path between a first coordinate and a second coordinate. The first coordinate can represent the position at which the 5 end of the sequence read aligns with the reference sequence, and the second coordinate can represent the position at which the 3 end aligns. Further, the path defined by the second coordinate can be relative to the first coordinate, and describe the path taken from the first coordinate over which the sequence read aligns.
[0098] For example, to describe the path 200 representing the position of the nucleotide sequence AAT (as generally indicated by edges 202, 120, 204), the path 200 begins at the source vertex 130 (e.g., offset 0), proceeds to the C nucleotide at offset 1, follows the second outgoing edge 120 before that offset, and then traverses two offsets to the T nucleotide 112. This path can be represented by the starting coordinate 0 and the ending coordinate 1.2.1. The coordinates may be combined by a delimiter (e.g., ..) to define a path 0..1.21.
[0099] In another example, to describe the path 220 representing the position of the nucleotide sequence TGACTGA (as generally indicated by edges 204, 122, 222, 224, 226, 228), the path can be specified by the pair of coordinates 2..3.2.6. In this case, the path begins at offset two on the backbone branch 102 at the T nucleotide 112 (2), and then proceeds (..) to offset three (3), follows the second outgoing edge 122 to the third branch 106 before the offset (2), and then proceeds seven offsets to the C nucleotide 116 (6). Note that proceeding seven offsets returns the path to the backbone branch 102.
[0100] When defining paths, the second coordinate may be specified relative to the branch reached by first coordinate. For example, the path 240 representing the position of the nucleotide sequence GACTAC begins at coordinates 3.2, which begins the path at offset 0 in the second branch 106 (G). The second coordinate is 3.2.2, meaning proceed to the third offset (3) on the current branch (i.e., the second branch 106), follow an outgoing directed edge (i.e., the edge 124) to an alternate branch (i.e., the third branch 108) before the third offset, and then proceed to offset two (2) on the alternate branch.
[0101] Specifying the second coordinate relative to the first coordinate has an advantage in that fewer bytes are required for storing the coordinates. However, in certain embodiments, both the first and second coordinates may be specified starting from a source vertex.
[0102] As noted above, the second coordinate defines a particular path to be taken from the first coordinate, which can be used to describe the path associated with a sequence read aligned to the reference graph 100. Therefore, the second coordinate may only be described by a single unique sequence of integers, and not by multiple different choices, as these would define different paths. However, the first coordinate may be specified in any manner. For example, the path 240 GACTAC may be specified by either 3.2..3.2.2 or 1.2.2.2..3.2.2. In some cases, the coordinate requiring fewer or smaller integers is selected so as to maximize efficiency.
[0103] Reverse coordinates and paths (i.e., coordinates and paths that follow incoming as opposed to outgoing edges) may also be specified. As the second coordinate is relative to the branch of the first coordinate, when on the base branch, a reverse path may simply be specified by switching the first and second coordinates (e.g., 7..5). In cases where the reverse path takes a divergent path along a vertex, negative numbers may be used. With coordinates specifying forward paths (as detailed above with reference to
[0104]
[0105] A second reverse path 280 is specified by the pair of coordinates 2..1.2.1. This describes a path having a start position at coordinate 2 (i.e., the T nucleotide 112 on the backbone branch 102) and proceeds .. to offset 1 on the same branch. The second path 280 identifies the vertex connected in the direction of the edge at offset 1, and then follows the second incoming edge to an alternate branch (2) (A). The second path 280 then traverses one additional offset against the directionality of the edge to arrive at offset 0 in the backbone branch 102 (A). Taken together, the pair of coordinates 2..1.1.1 thus represents the position of the nucleotide sequence TAA.
[0106] A third reverse path 300 is specified by the pair of coordinates 3.2.3.2.2..2. This describes a path having a start position at coordinate 3.2.3.2.2 (i.e., the C nucleotide in the third branch 108). The third reverse path 300 then proceeds (..) to offset 2, which is equivalent to offset 1 on the second branch 106. This represents the position of the nucleotide sequence CATCA.
[0107] Reverse paths may be used, for example, to indicate alignments that are the reverse complement of a forward path. In these cases, the 3 end of a sequence read may have a mapped position before the mapped position of the 5 end of the read.
Encoding the Coordinate
[0108] Conventional variable length encoding maps source symbols to a variable number of bits. For example, a conventional variable length code would store coordinate components using only those bits required for each component. However, this can be troublesome because a decoder will then require prior knowledge of the number of bits required for each component, which introduces additional overhead costs. Moreover, conventional variable length coding will result in a non-byte aligned format. However, most operating systems expect binary data to be represented in bytes, i.e., as 8-bit octets. Therefore, deviating from this norm can lead to problems and confusion during the development of software that uses a graph coordinate system.
[0109]
[0110] In an embodiment, the data is divided into two parts: header values and a data representation. The header provides information about the length of the data representation and allows the data representation to be interpreted accurately. A computer system may be programmed with instructions that when executed by a processor cause the system to interpret the header and use that information to decode the data. In certain embodiments, the header is encoded according to a scheme that depicts a length of what is encoded, certain useful flags, and the encoded representation, all using 8-bit octets (where the encoded representation here will be a length of the data section, such as the number of octets following the header octet). Thus, methods and systems of the invention implement a length-flag-representation 8-bit scheme.
[0111]
[0112]
[0113]
[0114] As shown in
[0115] Once encoded, a decoding agent or algorithm would receive the first octet of a coordinate component and then determine, based on information within the first octet, the total size of the representation.
Modifying the Encoding Schemes
[0116] In the encoding schemes discussed above (e.g., LFR8 and LR8 formats), the total number of length bits and flag bits does not exceed eight so that the header information remains within a first octet. If some bits are unused, these bits can be allocated to any position within the encoded representation. In the examples described above, the value representation is shifted to the rightmost position of the octets such that the last bit in the last octet is the first bit of the encoded integer. However, these value representations can be shifted in a different position or permuted in a different manner.
[0117] Any unused bits (i.e., to the left of the value and after the length and flag bits) may be padded as zeros. However, the LFR8 and LR8 formats could be modified to use additional header octets to store additional flags as needed. Similarly, flags could also be stored in the additional representation octets (e.g., LFR8<3,6> would require at least one flag bit in an additional octet).
[0118] As previously described, an LFR8 format is compatible if the particular application requires storage of additional flags. In these cases, encoding an integer may require an additional octet because there will be additional flag bits in the header. For example, encoding the integer 4,954 requires only two or three octets, depending on whether LR8 or LFR8 format is used. For example, a position 201,348,366 on human chromosome 1, represented in binary form, becomes: 1100000000000101010100001110. The binary form of this integer requires at least 28 bits, and thus requires at least four octets in order to be byte-aligned (32 bits), of which half of one octet is unused. When stored using LFR8 format, the header octet that includes the value representation begins with 011, indicating that three additional octets follow the header octet.
[0119] The encoded representation, including the length (L), padded bits (P), and binary value (V) thus becomes:
TABLE-US-00001 LLLPVVVV VVVVVVVV VVVVVVVV VVVVVVVV 01101100 00000000 01010101 00001110
[0120] If flags are desired, the encoding scheme can also accommodate flags (F), for example, following the length bits. In this case, if we would like to include four flags in the header octet, we would require five octets, as the binary value of the integer is then shifted to the right, as shown in the example below:
TABLE-US-00002 LLLFFFFP PPPPVVVV VVVVVVVV VVVVVVVV VVVVVVVV 10010110 00001100 00000000 01010101 00001110
[0121] In the exemplary coordinate system described above, the first coordinate component (here, an integer) in a coordinate will typically be much larger than subsequent coordinate components. This is because the subsequent components typically describe relative offsets or branch numbers. (The use of relative offsets or branch numbers helps to reduce the size of the encoded representation.) Thus, the first component in a coordinate may be quite large and require a large number of bits, but subsequent components will be small and require a minimal number of bits. For example, if a coordinate or path describes traversing to offset 201,348,366 on the main branch, taking the second branch from that offset, and then proceeding to offset 154, the coordinate component 154 could be encoded as follows:
TABLE-US-00003 LLLPPPPP VVVVVVVV 00100000 10011010
[0122] The number 2 (for the second branch) is small enough to fit entirely within the leftover value representation bits of a header octet, and above it has been shown that the number 201,348,366 can be encoded using four octets. Thus, using an LR8 encoding scheme, the entire coordinate (201,348,366.154) is represented using only 7 octets.
[0123] Padding (e.g., 0 bits) can be used when there are unused bits in the representation. Of course, these bits can be used to store additional binary information (e.g., up to 5 binary symbols or flags if three representation bits are used). For example, if we need to store 3 flags, we can convert the LR8<3> schema to LFR8<3,3>, e.g.:
TABLE-US-00004 LLLFFFPP VVVVVVVV 00101100 10011010
[0124] This particular encoding scheme (i.e., one in which three length bits are present in the header) allows for up to seven additional octets to be used. However, if the header octet is entirely occupied by length bits and flag bits (that is, there are no bits left over for value representation), then, in certain embodiments, the encoding scheme imposes a special rule: add at least one extra octet, and the numbers encoded in the length bits hold the number of additional value representation octets to append above and beyond this. So, for example, in an LFR8<3,5> encoding scheme, the representation would include one header octet, and the number of additional value representation octets would range from one to eight, for a total code size (including the header) in the range of two to nine bytes. Accordingly, eight additional representation octets are available in addition to the header octet, allowing for the representation of a value having 2.sup.64 positionsa value far in excess of the number of nucleotides in any presently known genome.
[0125] One reason to incorporate this special rule is so that when we encode the number 0, it does not end up being represented as a null value (i.e., just a header octet, with no assigned value representation bits). The reason this is useful is because if the value is changed later on, e.g., from 0 to 1, then the representation does not need to be resized in order to get the number 1 to fit. For example, imagine several million LFR8/LR8 representations, all assigned to a single contiguous block of memory. If one representation is resized by adding an extra byte, then every representation to the right would presumably have to be moved over by one byte in order to accommodate the change, which can be extremely inefficient.
[0126] When encoding, an encoder first determines, based on a coordinate component or value to be encoded, the number of octets required for representation in a particular LFR8 or LR8 format. Exemplary pseudocode is provided below which gives one example.
TABLE-US-00005 # Invalid input: negative values should be represented # by setting a flag bit. if value < 0 code_size_in_bytes = 1 else n_repbits_in_header = 8 (n_lenbits + n_flagbits) # Hypothetically, if we used the value representation # bits in the header to store the least significant # binary digits of the input value, this would be the # number encoded by the remaining binary digits that we # would have to fit inside the variable length value # representation bytes. value_rightshifted = value / (2 {circumflex over ()} n_repbits_in_header) if value_rightshifted == 0 # Edge case: the only way both if clauses are true # is if value == 0, but you should newer represent # the number zero as a null, always reserve some # value representation bits in case we want to # increment later, then we don't need to resize # immediately. if n_repbits_in_header == 0 n_repbytes = 1 # No representation bytes needed, non-zero value # fits entirely in header else n_repbytes = 0 else n_repbytes = ceil(log256(value_rightshifted)) # Code size includes 1 byte for the header, plus any value # representation bytes, if required. code_size_in_bytes = 1 + n_repbytes
[0127] As shown in the above pseudocode, first the encoder receives a value (such as an integer or coordinate component) to encode. (In this example, if the value is negative, the encoder outputs an error.) The encoder then determines the number of available representation bits in the header by subtracting the number of length bits and the number of flag bits from the size of the header (i.e., 8 bits). For example, in an LFR8<3,2> representation, the number of available representation bits would be 832=3 bits. The encoder then calculates a right shifted value, which represents the number that would need to be encoded by the remaining binary digits once the least significant binary digits are stored in the (e.g., 3 bits of) the header. If the right shifted value is not zero, the encoder determines the number of representation octets required by calculating the logarithm of the right shifted value with base 256, and then identifying the ceiling the smallest integer value greater than or equal to this value) of the result to yield the number of additional octets required for the representation. The resulting code size is thus 1 plus this number (i.e., the header octet plus the additional octets required for the representation). However, if the right shifted value is zero, the encoder considers whether there are any available representation bits in the header. If not (meaning that the value itself is zero), then the encoder provides for one additional representation octet. (This is to ensure that if the value itself later changes, additional octets do not need to be added; see the description of the special rule above.) However, if it is the case that there are available representation bits in the header, the entire representation is fit into a single octet.
[0128]
[0129] Note in the table of
Encoding Graph Coordinates and Paths
[0130] Graph coordinates and paths as described above may be efficiently encoded and stored using combinations of LFR8 and LR8 encoding schemes. As previously noted, graph paths and coordinates according to the disclosure comprise a set of integers describing a path (whether relative or absolute) to arrive at a particular position within the graph. For example, a path having coordinate components X.sub.1, X.sub.2, . . . X.sub.k could be represented as a sequence of LFR8 encoded integers as shown below in Equation (1).
PathRep(X.sub.1,X.sub.2, . . . X.sub.k)=LFR83,2
(|X.sub.1|,[.sub.1,.sub.1]),LFR8
3,2
(|X.sub.2|,[.sub.2,.sub.2]), . . . LFR8
3,2
(|X.sub.k|,[.sub.k,.sub.k])(1)
In Equation (1), is a single flag bit representing whether the encoded integer is the last integer in the coordinate, and, is a single flag bit representing the sign (whether positive or negative) of the stored integer. If only forward paths (i.e., no negative integers) need to be considered, then the path representation can be simplified as shown in Equation 2.
ForwardPathRep(X.sub.1,X.sub.2, . . . X.sub.k)=LFR83,1
(X.sub.1,.sub.1),LFR8
3,1
(X.sub.2,.sub.2), . . . LFR8
3,1
(X.sub.k,.sub.k)(2)
In contrast to Equation (1), Equation (2) requires only a single flag bit E, which indicates whether the stored integer is the last component in the coordinate, and represents the end of the path followed to reach that coordinate. (Delineating the end of the path in this way allows many paths to be stored together in a single contiguous block of memory without ambiguity as to where one path ends and the next one begins.) Accordingly, a decoder reading a sequence of LFR8 encoded integers would understand that a particular integer is the last integer in a coordinate by reading the value for the flag bit E.
[0131] As an example, the coordinates 1056.2.5 (representing a traversal to offset 1056 along the base branch, moving to a new branch having identifier 2, and traversing to offset 5) could be encoded as follows using the forward path representation in Equation (2):
ForwardPathRep(1056.2.5)=LFR83,1
(1056,0),LFR8
3,1
(2,0),LFR8
3,1
(5,1).
This would encode the coordinates 1056.2.5 as:
TABLE-US-00006 LLLFVVVV VVVVVVVV LLLFPPVV LLLFPVVV 00101100 00100000 00000010 00010101
[0132] A decoder configured to decode format, when encountering the first octet above, would read the first three length bits of the first octet and understand there to be one additional representation octet. The entire representation may then be decoded to yield the component 1056. The decoder would then read the next header octet, which indicates no additional representation octets, and is decoded to yield the component 2. The next header octet indicates that the encoded integer is the last in the series of integers for the coordinate, and can be decoded to 5. Thus, the decoded representation yields the coordinates 1056.2.5. In this way, the decoder does not require any prior knowledge of the size of the representation of each component, but determines the size by analyzing the header octet. Further, additional octets are only used if required, leading to efficient encoding of the component.
Efficient Encoding of Edge Insertions
[0133] Systems and methods herein may be used to provide a specification that describes an edge as an insertion to a genomic graph. The specification makes use of coordinates that define the insertion and that specification is compactly encoded using the described LFR8/LR8 variable length encoding, allowing for fast storage, querying, and instantiation times. Since an edge is described as an insertion to a graph using a specification that is compactly encoded, the format of the string of bits describing an edge may be referred to as a compressed insert specification, or CIS format.
[0134] Before the variable length encoding, an edge is described as an insertion to a graph using a specification includes at least a position on the graph and information that defines the variation in genetic sequence represented by the edge. In preferred embodiments, an edge can be represented by a tuple that includes the 5 starting position, 3 ending position, and the length of the associated genetic sequence:
X.sub.1,X.sub.2, . . . X.sub.m,Y.sub.1,Y.sub.2, . . . Y.sub.n,L),
[0135] Wherein each X.sub.i and Y.sub.i are the coordinate components of the 5 and 3 coordinates of the inserted edge, respectively. In the case of a SNP, L can be 1, for storing the single nucleotide associated with the SNP. As will be described in detail below, each edge can be efficiently and compactly encoded using LFR8/LR8 encoding, allowing for fast storage, querying, and instantiation times.
Single Coordinate Insertions
[0136] In certain cases, the end coordinate of an inserted edge can be inferred from the start coordinate and the type of variation represented by the edge. In these cases, the end coordinate can be omitted from the representation, leading to a more efficient representation of the edge. In particular, edges that affect only a single position in the reference genome (one of the most commonly observed types of genomic variation between of individuals within a population) allow for the 3 end to be omitted from the edge representation. For these kinds of variations, the edge can be represented in CIS format in the following manner:
InsertionSpecificationRep(X.sub.1,X.sub.2, . . . L)=LFR83,5
(X.sub.1,[.sub.1,
3
(L)(3)
wherein X represents each coordinate component in the stored coordinate, L represents the length of the associated nucleotide sequence, .sub.1 is a flag that indicates whether the position is the last integer in the coordinate (i.e., it represents the end of the path followed to identify the position in the graph), and
[0137]
[0138] As shown in
[0139] In this way, only a single coordinate needs to be associated with the edge, and information regarding whether the edge reconnects with the vertex after or prior to the specified edge can be encoded in the vector
Two Coordinate Insertions
[0140] Other variations may be more complex, and have edges with ending coordinates that differ from their starting coordinates. In these cases, both the starting and ending coordinates must be specified in the representation. However, often edges will have starting (5) and ending (3) coordinates that share several coordinate components. For example, a structural variant representing a CT.fwdarw.GAC mutation (e.g., as shown in the table in
X.sub.1,X.sub.2, . . . X.sub.k,X.sub.k+1, . . . X.sub.m,Y.sub.1,Y.sub.2, . . . Y.sub.k,Y.sub.k+1, . . . Y.sub.n,L),
where X.sub.1=Y.sub.1, . . . X.sub.k=Y.sub.k is a common path prefix (10.2), X.sub.k+1, . . . X.sub.m is a unique path suffix for the 5 end (5), Y.sub.k+1, . . . Y.sub.n is a unique path suffix for the 3 end (7), and L represents the number of symbols, such as nucleotides, that are associated with the edge. In this case, L is 1, for storing the nucleotide G. In CIS format, this representation can be specified as follows:
TABLE-US-00007 (4) InsertionSpecificationRep(X.sub.1. ... X.sub.k .X.sub.k+1 . ... X.sub.m, Y.sub.1. ... Y.sub.k . T.sub.k+1 . ... Y.sub.n, L) = CommonPathPrefixRep(X.sub.1. ... X.sub.k), End5PathSuffixRep(X.sub.k+1 . ... X.sub.m), End3RelativePathSuffixRep(X.sub.k+1 . Y.sub.k+1. ... Y.sub.n), LR8(3)(L).
[0141] As shown above, the first component of the insertion specification representation is the common path prefix representation, followed by the representation of the unique suffixes for the start (5) and end (3) coordinates, and finally the representation of the number of symbols or nucleotides in the insertion. The common path between the start and end coordinates can be represented by:
CommonPathPrefixRep((X.sub.1, . . . ,X.sub.k)=LFR83,5
(X.sub.1,[.sub.1,.sub.1,
3,2
(X.sub.2,[.sub.2,.sub.2]), . . . LF8
3,2
(X.sub.k,[.sub.k,.sub.k])(5)
wherein is a flag that represents whether the position is the last integer in the coordinate (end of path), and the remaining four flag bits [.sub.1,
[0142]
[0143] The second component of the insertion specification representation is the unique suffix representation of the starting coordinate. Occasionally this component may not be necessary. For example, an insertion starting at position 10.2 and ending at position 10.2.5 will have no unique suffix for the start coordinate. If the start coordinate does have a unique suffix, this can simply be represented by the forward path representation in Equation (2) above:
End5PathSuffixRep(.sub.k+1, . . . X.sub.m)=ForwardPathRep(X.sub.k+1, . . . X.sub.m)(6)
[0144] The third component is the suffix of the ending (i.e., 3) coordinate. To further minimize the storage of integers, the 3 coordinate may be encoded differently, i.e., by having a suffix that is relative to the 5 coordinate. In other words, the coordinate component of the 3 coordinate that is aligned with the 5 coordinate is subtracted from the coordinate component of the 5 coordinate. This representation can be:
End3RelativePathSuffixRep(X.sub.k+1,Y.sub.k+1, . . . Y.sub.n)=LFR83,1
(|Y.sub.k+1X.sub.k+1|,.sub.k+1),LFR8
3,1
(Y.sub.k+2,.sub.k+2), . . . LFR8
3,1
(Y.sub.n,.sub.n)(7)
[0145] Finally, the number of nucleotides in the new branch can be encoded by LR83
(L). It should be noted that in the case of variations representing deletions or single bases, the LR8
3
(L) parameter may be omitted because it can be inferred from the encoded variation type. For example, a decoder reading the representation could be configured to understand that the length parameter is omitted due to the associated vector; e.g., for the first 3 cases shown in the table of
3
(L) parameter can be omitted because for those three variant types, by definition, L=1.)
[0146] It should also be noted that while
Random Access of CIS Graph Genome Elements
[0147] Once a graph has been linearized, it is often advantageous to load or deserialize only, portions or subsets of the graph into memory for use, e.g., by loading only those CIS representations that are needed. In practice this presents some issues. A CIS-encoded graph is typically presented as a serialized stream. CIS efficiently encodes each insertion into a minimal number of bits, but this results in a variable length format, because each insertion will be represented by a variable number of bits. Thus, a decoder is not able to predict the location of a particular offset in an encoded stream simply based on its magnitude. To identify particular insertion representations for deserialization, a decoder must start at the first insertion in the encoded stream and continue until the desired insert is identified. Thus, if a graph has N insertions, it requires on average O(N/2) access time to locate the desired encoded insert for instantiation. When applied to genomic data scales, this access time may be too slow. In these cases, randomly accessing serialized portions of the graph for deserialization can be aided by use of a random-access entry point map.
[0148]
[0149] Typically, the value chosen for K will be much smaller than N, as otherwise the random access entry point map itself may begin to take up a significant amount of memory. K can be adjusted to a variety of settings. In practice, suitable values for K are those in which O(N/K) is between 100 and 1,000. This presents a good compromise in random-access entry point map size and CIS lookup speed. For example, if an encoded CIS stream has 10.sup.6 insertions, a random-access entry point map having 1000 entries will result in a distance of 1000 insertions between each entry. A decoder thus would only need to traverse, on average, about 500 entries before locating a desired insertion.
[0150] While this disclosure refers to a random-access entry point map, various other data structures could be used to more efficiently locate a particular (IS insert in an encoded stream. For example, a tree-based structure having multiple levels of random access could be used. The tree could be divided into several levels in which each branch is traversed until the desired level indicating the location in memory of a particular CIS insert is reached.
[0151] One application of the random-access entry point map is to generate specialized graphs from a larger graph specification. For example, one may wish to build a graph that includes only variations from certain databases, such as dbSNP. Alternately, one may wish to build a graph appropriate for particular genetic populations. In these cases, one could leverage the random-access entry point map to quickly identify and deserialize only those desired insertions into the graph. In such examples, the backbone branch would be loaded first, followed by the selected insertions. Note also that the edges should be inserted in an order in which they do not depend on other edges, which may not have been inserted. Care should be taken during graph construction in this regard; for example, preferably dbSNP insertions should not depend on (i.e., branch from) insertions from other variation databases, and similarly insertions representative of one genetic population should not depend on those from another. However, various embodiments are considered to be within the scope of the disclosure.
[0152] It should also be noted that the random-access entry point map discloses the location in memory of CIS insertions, which may not correspond to actual genome position. For example, the map may be able to inform a decoder as to the location of the 100,000.sup.th insertion, but is not necessarily useful in finding insertions located near a particular position (e.g., at the middle of chromosome 17). In these cases, a random-access entry point map may also contain additional information specifying the relative genome position of certain insertions. Alternately, a second table or map storing this information may be used.
Integration with Existing Representations of Variant Data
[0153] As noted above, a graph genome can be constructed from a reference sequence by, continuously adding new edges representing variants. Each edge can be defined in Compressed Insertion Specification (CIS) format. Converting existing variant information to CIS format can be relatively straightforward. For example, Variant Call Format (VCF) is an existing file format that describes variations from a reference sequence as an alternate base at a particular position on a reference genome. A VCF->CIS conversion program could iterate through each entry of a VCF file and create a new CIS entry for that variant describing an insertion on a graph. In this way, existing VCF data and data represented in other formats can be converted directly to CIS format.
Advantages of CIS
[0154] The proposed specification efficiently stores graphs as a plurality of efficiently encoded edges. In particular, the proposed specification minimizes the number of edges that need to be stored by constructing a graph by adding edges individually, thus segmenting previously inserted edges into multiple edges. Further, when combined with the LFR8/LR8 encoding scheme, the specification requires only a minimal amount of storage space, leading to greatly compressed structures. For example, a single SNP in the human genome (the most common type of variant) requires only 40 bits to store. Further, use of a random-access entry point map or other indexing structure allows for fast random data access by allowing for particular inserts or components of graphs to be quickly identified and loaded into memory when needed.
Representative System
[0155]
[0156] Processor refers to any device or system of devices that performs processing operations. A processor will generally include a chip, such as a single core or multi-core chip, to provide a central processing unit (CPU). A processor may be provided by a chip from Intel or AMD, A processor may be any suitable processor such as the microprocessor sold under the trademark XEON E7 by Intel (Santa Clara, Calif.) or the microprocessor sold under the trademark OPTERON 6200 by AMD (Sunnyvale, Calif.).
[0157] The memory subsystem 475 contains one or any combination of memory devices. A memory device is a mechanical device that stores data or instructions in a machine-readable format. Memory may include one or more sets of instructions (e.g., software) which, when executed by one or more of the processors of the disclosed computers can accomplish some or all of the methods or functions described herein. Preferably, each computer includes a non-transitory memory device such as a solid state drive, flash drive, disk drive, hard drive, subscriber identity module (SIM) card, secure digital card (SD card), micro SD card, or solid-state drive (SSD), optical and magnetic media, others, or a combination thereof.
[0158] Using the described components, the system 401 is operable to, cause the system to assign a coordinate that defines a path from a first position to a second position within the genomic graph, and encode the coordinate using a variable length encoding scheme where the encoded coordinate includes length information. The system may receive or obtain a genomic graph representing genomic sequences from one or more organism through the use of one or more input/output device, labelled I/O in
[0159] In certain embodiments, the reference graph 100 is stored in the memory subsystem 475 using adjacency lists or index-free adjacency, which may include pointers to identify a physical location in the memory subsystem 475 where each vertex is stored. In a preferred embodiment, the graph is stored in the memory subsystem 475 using adjacency lists. In some embodiments, there is an adjacency list for each vertex. For discussion of implementations see Chapter 4, Graphs at pages 515-693 of Sedgewick and Wayne, 2011, Algorithms, 4th Ed., Pearson Education, Inc., Upper Saddle River N.J., 955 pages, the contents of which are incorporated by reference and within which pages 524-527 illustrate adjacency lists.
Sequencing Technologies
[0160] In use, sequence reads are aligned to edges or paths through a reference graph, and then genotyping or variant calling can be performed by counting the number of reads across each path or by identifying mismatches across each path. Because the sequence reads are aligned to the reference graph which includes known variation, the subsequent step of identifying mutations by, comparing to a table of known mutations can be eliminated. Further, alignment to reference graphs results in greatly improved results for larger structural variants, as well as indels and SNPs located near structural variants. Additionally, in some cases, there may be no need to perform an additional variant calling step because an ideal reference graph will not have any mismatches from the sequence read (of course, variant calling can still be performed to detect unknown variations).
[0161] Sequence reads may be obtained in a variety of ways.
[0162] In certain embodiments, sequence reads are obtained by performing sequencing on the sample 2202 from a subject (however in some embodiments, sequence reads are obtained when a read file is transferred into a system of the invention). Sequencing may be by any method and sequencing instrument known in the art. See, generally, Quail, et al., 2012, a tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences, and Illumina MiSeq sequencers, BMC Genomics 13:341. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, and SOLiD sequencing.
[0163] A sequencing technique that can be used includes, for example, use of sequencing-by-synthesis systems and sequencing instruments sold under the trademarks GS JUNIOR, GS FLX+ and 454 SEQUENCING by 454 Life Sciences, a Roche company (Branford, Conn.), and described by Margulies, M. et al., Genome sequencing in micro-fabricated high-density picotiter reactors, Nature, 437:376-380 (2005); U.S. Pat. No. 5,583,024; U.S. Pat. No. 5,674,713; and U.S. Pat. No. 5,700,673, each incorporated by reference. 454 sequencing involves two steps. First, DNA is sheared into blunt-end fragments attached to capture beads and then amplified in droplets. Second, pyrosequencing is performed on each DNA fragment in parallel. Addition of nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument.
[0164] Another sequencing technique and instrument that can be used is SOLID technology by Applied Biosystems from Life Technologies Corporation (Carlsbad, Calif.). In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to generate a fragment library. Clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and enriched and the sequence is determined by a process that includes sequential hybridization and ligation of fluorescently labeled oligonucleotides.
[0165] Another example of a DNA sequencing technique and instrument that can be used is ion semiconductor sequencing using, for example, a system sold under the trademark ION TORRENT by Ion Torrent by Life Technologies (South San Francisco, Calif.). Ion semiconductor sequencing is described, for example, in Rothberg, et al., An integrated semiconductor device enabling non-optical genome sequencing, Nature 475:348-352 (2011); U.S. Pubs. 2009/0026082, 2009/0127589, 2010/0035252, 2010/0137143, 2010/0188073, 2010/0197507, 2010/0282617, 2010/0300559, 2010/0300895, 2010/0301398, and 2010/0304982, each incorporated by reference. DNA is fragmented and given amplification and sequencing adapter oligos. The fragments can be attached to a surface. The addition of one or more nucleotides releases a proton (H+), which signal is detected and recorded in a sequencing instrument.
[0166] Another example of a sequencing technology that can be used is Illumina sequencing. Illumina sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented and attached to the surface of flow cell channels. Four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. Sequencing according to this technology is described in U.S. Pub. 2011/0009278, U.S. Pub. 2007/0114362, U.S. Pub. 2006/0024681, U.S. Pub, 2006/0292611, U.S. Pat. No. 7,960,120, U.S. Pat. No. 7,835,871, U.S. Pat. No. 7,232,656, U.S. Pat. No. 7,598,035, U.S. Pat. No. 6,306,597, U.S. Pat. No. 6,210,891, U.S. Pat. No. 6,828,100, U.S. Pat. No. 6,833,246, and U.S. Pat. No. 6,911,345, each incorporated by reference.
[0167] Other examples of a sequencing technology that can be used include the single molecule, real-time (SMRT) technology of Pacific Biosciences (Menlo Park, Calif.) and nanopore sequencing as described in Soni and Meller, 2007 Clin Chem 53:1996-2001.
[0168] As shown in
[0169] In some embodiments, the sequence reads 904 are assembled to provide a contig or consensus sequence, which contig or consensus sequence may be used in finding alignments to a reference graph. Sequence assembly may include any suitable methods known in the art including de novo assembly, reference-guided assembly, others, or combinations thereof. In a preferred embodiment, sequence reads are assembled using graph-based alignment methods. See, e.g., U.S. Pub. 2015/0057946 and U.S. Pub. 2015/0056613, both incorporated by reference. Embodiments of a graph and its use are discussed in greater detail below. The result of assembly is a sequence representing the corresponding portion of the original nucleic acid. The contig or consensus sequence or one or more of individual sequence reads 2204 may be mapped to a reference graph to find an alignment with an optimal score.
[0170] In certain embodiments, each sequence read 2204 is mapped to a reference graph and an alignment is found. As previously noted, alignments of sequence reads to a reference graph may be used as evidence to identify the likelihood of the possible diploid genotypes from an interval in a reference graph.
Exemplary Encoded and Decoded Coordinates
[0171]
[0172]
Exemplary Serialized Genomic Graphs
[0173] Genomic graphs, such as the reference graphs that may find use in sequence assembly, may be serialized, e.g. by the method 101 of
[0174]
INCORPORATION BY REFERENCE
[0175] References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.
EQUIVALENTS
[0176] Various modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification, and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.