SYSTEMS AND METHODS FOR PROCESSING SEQUENCE DATA FOR VARIANT DETECTION AND ANALYSIS
20170372005 · 2017-12-28
CPC classification
G16B50/00 (PHYSICS)
G16B20/20 (PHYSICS)
G16B20/00 (PHYSICS)
Abstract
Systems and methods for processing sequence data are disclosed herein. In an embodiment, the system comprises a computing device configured to receive, store, and process sequence data using object-oriented functions. The disclosed approach allows sequencing and analysis processing to be customized for next-generation sequence processing and analysis. The system may be characterized as a bioinformatics system that uses object-oriented functions to process and store sequencing data efficiently and without the need for extensive programming knowledge. Object instances configured as part of the system may be manipulated, transformed, probed, and shared in memory, yet still saved to disk. Because of the way sequences are represented within the system, the disk space required is much less than that needed by existing bioinformatics programs. In another embodiment, MATLAB is utilized as part of the configuration of the system. Because of its object-oriented approach, the system may be adapted to more complex development functions and processing, which provides much-needed flexibility and ease of use.
Claims
1. A system for processing sequence data for variant detection and analysis comprising: a computing device configured to receive and/or store sequence data; said computing device further configured to utilize a system object for processing and analyzing said sequence data.
2. The system of claim 1, wherein said computing device is configured to detect variants.
3. The system of claim 1, wherein said computing device is configured to characterize variants.
4. The system of claim 1, wherein said computing device is configured to detect and characterize variants.
5. The system of claim 1, wherein said system object is comprised of general properties and reference-based properties.
6. The system of claim 5, wherein said general properties are comprised of a system version, sample header, creation date, nucleotide dictionary, and read filter metrics.
7. The system of claim 5, wherein said reference-based properties are comprised of a sequence dictionary, sequence profile, quality profile, indel profile, depth, and consensus.
8. The system of claim 7, wherein said reference-based properties are further comprised of a reference header and reference sequence.
9. The system of claim 7, wherein said reference-based properties are further comprised of an annotation sequence and annotation feature.
10. The system of claim 1, wherein said system is configured with object-oriented functions for receiving, storing, and processing sequence data.
11. The system of claim 10, wherein said object-oriented functions are instructions written in non-compiled code.
12. The system of claim 10, wherein said system is configured with Matlab and using at least one Matlab class.
13. The system of claim 10, wherein said Matlab classes can be manipulated, transformed, probed, and shared in memory, yet still saved to disk.
14. The system of claim 10, wherein said computing device is configured to detect variants.
15. The system of claim 10, wherein said computing device is configured to characterize variants.
16. The system of claim 10, wherein said computing device is configured to detect and characterize variants.
17. The system of claim 10, wherein said system object is comprised of general properties and reference-based properties.
18. The system of claim 17, wherein said general properties are comprised of a system version, sample header, creation date, nucleotide dictionary, and read filter metrics.
19. The system of claim 17, wherein said reference-based properties are comprised of a sequence dictionary, sequence profile, quality profile, indel profile, depth, and consensus.
20. The system of claim 17, wherein said reference-based properties are further comprised of a reference header, reference sequence, annotation sequence, and annotation features.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0014] The accompanying drawings, which are incorporated in and form a part of the specification, illustrate a preferred embodiment of the present invention, and together with the description, serve to explain the principles of the invention. It is to be expressly understood that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. In the drawings:
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
DETAILED DESCRIPTION OF THE INVENTION
[0024] Disclosed herein are systems and methods directed to processing sequence data for variant detection and analysis. The numerous innovative teachings of the present invention will be described with particular reference to several embodiments (by way of example, and not of limitation).
[0025] In embodiments, the invention is an object class configured to be used in sequence processing systems. In other embodiments, the system is comprised of a computing device that is configured for receiving, storing, and processing sequence data. In embodiments, the system is further configured with object-oriented functions for processing and analyzing sequence data. The computing device in an embodiment is comprised of a processor, memory, and disk space or storage. The disk space, or storage medium, is used for long-term storage of programs, data, an operating system, and other persistent information. In some embodiments, the disk space may have higher latency than memory but characteristically has higher capacity. In other embodiments, a single hardware device may serve as both memory and disk space. In embodiments, the computing device may also be comprised of hardware and software interfaces to other components of the system, such as additional computing devices configured as interfaces or as sources of files and/or data to be processed by the system.
[0026] In an embodiment, the object-oriented functions are classes written in non-compiled code such as interpreted instructions. In other embodiments, the interpreted, non-compiled code is implemented in a Matlab environment. Embodiments of the system utilize system classes implemented as a self-contained Matlab class. Like a class in any other object-oriented programming language, it contains a set of properties and methods specific to the class, which will be discussed in more detail under
[0027] Reference is first made to
[0028] At its core, the system class(es) is/are designed to improve upon and replace the way in which reference-mapped NGS sequence data is contained. Currently, the sequence/binary alignment map (SAM/BAM) file format is used to hold this NGS data as a list of sequence reads, associated quality scores, CIGAR alignments, and the location where each read aligns to its reference. Storing and processing the sum of this information often requires a fast computer processor, ample memory, and large amounts of disk space because of the sheer number of sequence reads that can be generated by NGS. Though the BAM format is the compressed version of the SAM format, these files may still require tens of megabytes to tens of gigabytes of storage space, with many above one gigabyte. The SAM/BAM format is a serialized representation of the full-scale alignment of sequence reads to a reference sequence, but this set of information can be further compressed by transforming it into a sequence profile. A sequence profile is a two-dimensional numeric matrix that records the number of times each molecular monomer (nucleotide or amino acid) occurs at each position along a multiple sequence alignment, such as that represented in a SAM/BAM file. The caveat in converting an alignment to a sequence profile is that quality score information, and insertions that do not exist in the reference sequence, cannot be maintained by the two-dimensional sequence profile.
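The sequence profile described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the Matlab implementation of the disclosed system. The function name `sequence_profile`, the `ALPHABET` string, and the input layout (reads already projected onto reference coordinates, with insertions removed and deletions written as `-`) are assumptions made for this example.

```python
# Hypothetical nucleotide dictionary; '-' marks a deleted reference position.
ALPHABET = "ACGT-"


def sequence_profile(reads, ref_length, alphabet=ALPHABET):
    """Count how often each symbol occurs at each reference position.

    reads: iterable of (start, aligned_seq) pairs with a 0-based start.
    Returns one row per symbol in `alphabet`, each row of length ref_length.
    """
    index = {symbol: row for row, symbol in enumerate(alphabet)}
    profile = [[0] * ref_length for _ in alphabet]
    for start, seq in reads:
        for offset, symbol in enumerate(seq):
            pos = start + offset
            # Ignore symbols outside the dictionary or past the reference end.
            if 0 <= pos < ref_length and symbol in index:
                profile[index[symbol]][pos] += 1
    return profile


# Four toy reads aligned to a six-position reference (assumed data).
reads = [(0, "ACGT"), (1, "CGTA"), (2, "GAAC"), (0, "AC-T")]
profile = sequence_profile(reads, ref_length=6)

for symbol, row in zip(ALPHABET, profile):
    print(symbol, row)
```

The matrix has a fixed size (alphabet size by reference length) regardless of how many reads are counted, which is the source of the space saving relative to storing each read individually.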
[0029] By taking an object-oriented approach to this problem, the class object(s) of the disclosed systems and methods can contain all of this information at a fraction of the size of a BAM file. Only two parts of the information in the read alignment are lost: (1) the sequence permutation of each read and (2) the coupling of individual quality scores to individual nucleotides. However, for many types of downstream analysis, this information is unnecessary. Additionally, the manner in which read information is stored in a SAM/BAM file requires that it be reconstructed into an alignment by some means before it becomes tractable to interpretation. With the system's object(s), the alignment information can be easily accessed without reconstruction or further interpretation. At the same time, because the system is configured with a high-level interpreted programming language (rather than a compiled language), an advantage is achieved for novel method development. Combined with the ease with which the sequence data can be accessed, creating new methods is much less complicated than doing the same using other systems and software tools written in compiled languages.
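One way to picture such a class object is the following sketch, again in Python rather than Matlab. The class name `SequenceObject` and the default values are hypothetical; the property names are adapted from the general and reference-based properties listed in the claims. How the disclosed system actually computes depth and consensus is not stated in this text, so the column-sum and most-frequent-symbol rules below are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SequenceObject:
    # General properties (names adapted from the claims).
    sample_header: str
    nucleotide_dictionary: str = "ACGT-"
    system_version: str = "0.0-sketch"  # placeholder value
    creation_date: date = field(default_factory=date.today)
    # Reference-based properties (names adapted from the claims).
    reference_header: str = ""
    sequence_profile: list = field(default_factory=list)

    @property
    def depth(self):
        """Reads counted at each position: column sums of the profile (assumed rule)."""
        return [sum(column) for column in zip(*self.sequence_profile)]

    @property
    def consensus(self):
        """Most frequent symbol per position; 'N' where nothing was counted (assumed rule)."""
        calls = []
        for column in zip(*self.sequence_profile):
            best = max(range(len(column)), key=column.__getitem__)
            calls.append(self.nucleotide_dictionary[best] if column[best] else "N")
        return "".join(calls)


# Toy profile: rows follow the dictionary order A, C, G, T, '-'.
obj = SequenceObject(
    sample_header="sample_01",
    reference_header="toy_reference",
    sequence_profile=[
        [2, 0, 0, 1, 2, 0],
        [0, 3, 0, 0, 0, 1],
        [0, 0, 3, 0, 0, 0],
        [0, 0, 0, 3, 0, 0],
        [0, 0, 1, 0, 0, 0],
    ],
)
print(obj.depth)
print(obj.consensus)
```

Because depth and consensus are read directly from the stored matrix, no alignment has to be reconstructed from individual reads before the data can be queried.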
[0030] Most NGS data systems and software tools are procedural and sequential in nature; that is, they are completed step by step both within and between each tool. Those skilled in the art of bioinformatics develop and use individual tools to manipulate, convert, transform, or interpret data, with unique file formats as intermediate information containers; this process is oftentimes referred to as a workflow or pipeline and is the means by which raw data is turned into human-interpretable output. While this arrangement is beneficial at points where different programs can be used to process information from the same file format, the same stepwise analysis can be achieved by the disclosed systems by containing the sequencing data in a class object variable designed to hold it. In using this system, rather than developing and implementing entirely novel methods, users can tailor the system without having to develop and compile complex programs or perform complex system configurations. In addition, the disclosed systems and methods allow for manipulation of objects in memory rather than having to save information to a file, although multidimensional object instances can also be saved as serialized and compressed .mat files.
[0031] Most current bioinformatics software tools, typically freeware, require extensive computer skills and are intractable to customization without extensive software development skills and experience. More user-friendly tools, typically paid software, are necessary for inexperienced users, but are then limited in their functionality and also intractable to novel method development. The system's class relies on the principle of least astonishment (POLA) in both use and development to simplify NGS data analysis. At present, there is a widening gap between the ability to collect NGS data and the ability to analyze it, as only experienced individuals have the capability to process it.
[0032] By using an object-oriented approach of POLA applied to NGS data analysis, the researcher can focus on the analysis and method development, rather than learning how to use multiple software tools to their advantage. In addition, by reducing the size of NGS data, it becomes more transportable and manipulatable than with current methods of data containment. Those who would be most interested in using the disclosed systems' class would fit into one of two categories of biological researchers: (1) those who are inexperienced and are willing to pay for software that is easy to use and (2) those who are semi- or fully-experienced bioinformaticians and/or genomicists who desire a method development environment where access to data is easy, simple, and compliant. Because the system's class is more of a framework for method development, usefulness to the end-user cannot be predicted beyond the variant detection and characterization method included in the system's configuration instructions. Compared to current practice for this procedure alone, however, the disclosed system offers considerable improvements over the typical workflow, as a testament to the ease with which novel methods can be developed and implemented.
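The pattern of working on an object in memory and saving it to disk only when needed can be sketched as follows. This is a Python analogy, not the Matlab mechanism: `pickle` and `gzip` from the standard library stand in for a serialized, compressed .mat file, and the helper names `save_object` and `load_object`, the file name, and the toy data are all assumptions.

```python
import gzip
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj with pickle and write it gzip-compressed to path."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Read a gzip-compressed pickle back into memory."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# An in-memory container holding a toy sequence profile (assumed data).
sample = {
    "sample_header": "sample_01",
    "sequence_profile": [
        [2, 0, 0, 1, 2, 0],
        [0, 3, 0, 0, 0, 1],
        [0, 0, 3, 0, 0, 0],
        [0, 0, 0, 3, 0, 0],
        [0, 0, 1, 0, 0, 0],
    ],
}

# Manipulate in memory: keep only the first four reference positions.
trimmed = dict(sample, sequence_profile=[row[:4] for row in sample["sequence_profile"]])

# Persist only at the end, then reload to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_01.pkl.gz")
    save_object(trimmed, path)
    restored = load_object(path)

print(restored == trimmed)
```

No intermediate file is written between the trimming step and the save, in contrast to a pipeline in which each tool hands a file to the next.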
[0033] Reference is next made to
[0034] Reference is now made to
[0035] Reference is next made to
[0036] Reference is now made to
[0037] Reference is next made to
[0038] Lastly, reference is made to
[0039] Appendix A reflects an embodiment of a configuration implemented.
[0040] The disclosed systems and methods are generally described, with examples incorporated as particular embodiments of the invention and to demonstrate the practice and advantages thereof. It is understood that the examples are given by way of illustration and are not intended to limit the specification or the claims in any manner.
[0041] To facilitate the understanding of this invention, a number of terms may be defined below. Terms defined herein have meanings as commonly understood by a person of ordinary skill in the areas relevant to the present invention. Terms such as “a”, “an”, and “the” are not intended to refer to only a singular entity, but include the general class of which a specific example may be used for illustration. The terminology herein is used to describe specific embodiments of the invention, but its usage does not delimit the disclosed device or method, except as may be outlined in the claims.
[0042] Alternative applications for this invention include using the disclosed systems and methods for performing other sequence processing analysis and variant detection which can be achieved utilizing the invention disclosed herein. Consequently, any embodiments comprising a one-piece or multi-piece system having the structures as herein disclosed with similar function shall fall into the coverage of claims of the present invention and shall lack the novelty and inventive step criteria.
[0043] It will be understood that particular embodiments described herein are shown by way of illustration and not as limitations of the invention. The principal features of this invention can be employed in various embodiments without departing from the scope of the invention. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, numerous equivalents to the specific systems and methods described herein. Such equivalents are considered to be within the scope of this invention and are covered by the claims.
[0044] All publications and patent applications mentioned in the specification are indicative of the level of those skilled in the art to which this invention pertains. All publications and patent applications are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.
[0045] In the claims, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of,” respectively, shall be closed or semi-closed transitional phrases.
[0046] The systems and/or methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the device and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those skilled in the art that variations may be applied to the systems and/or methods and in the steps or in the sequence of steps of the methods described herein without departing from the concept, spirit, and scope of the invention.
[0047] More specifically, it will be apparent that certain components, which are both shape and material related, may be substituted for the components described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope, and concept of the invention as defined by the appended claims.