Patent classifications
G16B50/50
METHOD AND SYSTEMS FOR GENOME SEQUENCE COMPRESSION
Systems and methods for genome sequence compression and decompression are provided. The method for compression encoding of a genome sequence includes partitioning a genome sequence into a plurality of Groups of Bases (GoBs) and processing each of the plurality of GoBs independently to encode the genome sequence into a bit stream. Processing each of the plurality of GoBs includes dividing each of the plurality of GoBs into a first part and a second part, the first part including an initial context part and the second part including a learning-based inference part. Processing each of the plurality of GoBs further includes encoding the first part in accordance with a Markov model, encoding the second part in accordance with a learning-based model, and encoding the encoded first part and the encoded second part into the bit stream with an arithmetic encoder. The learning-based model may include Long Short-Term Memory (LSTM)-based neural networks.
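The partitioning and Markov-model steps of this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the GoB size, the context length, the model order, and the Laplace-smoothed counts are all assumptions made for illustration. Instead of a full arithmetic encoder, the sketch reports the ideal code length in bits that an arithmetic coder driven by the same probabilities would approach. The learning-based (LSTM) model for the second part is not shown.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def partition_into_gobs(sequence, gob_size):
    """Split a sequence into consecutive GoBs that can be processed independently."""
    return [sequence[i:i + gob_size] for i in range(0, len(sequence), gob_size)]


def split_gob(gob, context_len):
    """Divide one GoB into an initial context part and an inference part."""
    return gob[:context_len], gob[context_len:]


def markov_code_length(bases, order=2):
    """Ideal code length (bits) of `bases` under an adaptive order-k Markov model.

    Counts start at 1 for every base (Laplace smoothing, an assumption here),
    and are updated after each base, so a decoder could mirror the same model.
    """
    counts = defaultdict(lambda: {b: 1 for b in ALPHABET})
    bits = 0.0
    for i, base in enumerate(bases):
        context = bases[max(0, i - order):i]  # up to `order` preceding bases
        table = counts[context]
        probability = table[base] / sum(table.values())
        bits += -math.log2(probability)
        table[base] += 1
    return bits


if __name__ == "__main__":
    genome = "ACGTACGTACGTAAAAAAAAACGT"
    for gob in partition_into_gobs(genome, gob_size=8):
        first, second = split_gob(gob, context_len=3)
        print(gob, first, second, round(markov_code_length(first), 2))
```

Because each GoB starts from a fresh model, GoBs can be encoded and decoded in parallel, at the cost of relearning statistics at every GoB boundary.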
MINING ALL ATOM SIMULATIONS FOR DIAGNOSING AND TREATING DISEASE
The present disclosure describes methods for determining the functional consequences of mutations. The methods include the use of machine learning to identify and quantify features of all atom molecular dynamics simulations to obtain the disruptive severity of genetic variants on molecular function.
METHODS AND SYSTEMS FOR ANONYMIZING GENOME SEGMENTS AND SEQUENCES AND ASSOCIATED INFORMATION
Various methods and systems for processing at least some of genome sequences and at least some of associated information, for an individual, may include one or more of: segmenting genome sequences for at least a purpose of anonymizing genome information; using anchor segments for a purpose of minimizing electronic storage space in storing of genetic sequence information; generating at least one linkage record; generating at least one anonymized linkage record; processing a request for genetic study results; processing genetic study results received; and/or generating personalized information of interest pertaining to the individual. A purpose of such processing may be to prevent, minimize, and/or mitigate against (1) identification of the individual from such genome sequence information and/or from associated information; and/or (2) using such genome sequence information and/or associated information as a basis for discriminating against the individual.
Compressively-accelerated read mapping framework for next-generation sequencing
A method of compressive read mapping. A high-resolution homology table is created for the reference genomic sequence, preferably by mapping the reference to itself. Once the homology table is created, the reads are compressed to eliminate full or partial redundancies across reads in the dataset. Preferably, compression is achieved through self-mapping of the read dataset. Next, a coarse mapping from the compressed read data to the reference is performed. Each read link generated represents a cluster of substrings from one or more reads in the dataset and stores their differences from a locus in the reference. Preferably, read links are further expanded to obtain final mapping results through traversal of the homology table, and final mapping results are reported. As compared to prior techniques, substantial speed-up gains are achieved through the compressive read mapping technique due to efficient utilization of redundancy within read sequences as well as the reference.
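A much-simplified Python sketch of this flow follows. It handles only exact matches and full redundancy (identical reads), and it assumes every read has the same length k. The function names and the dictionary-based homology table are hypothetical and chosen for illustration. The partial-redundancy clustering and difference storage described in the abstract are not modeled.

```python
from collections import defaultdict


def build_homology_table(reference, k):
    """Map each k-mer of the reference to every position where it occurs.

    This is a self-mapping of the reference restricted to exact matches.
    """
    table = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        table[reference[pos:pos + k]].append(pos)
    return table


def compress_reads(reads):
    """Collapse identical reads into one representative plus member indices."""
    clusters = defaultdict(list)
    for index, read in enumerate(reads):
        clusters[read].append(index)
    return clusters


def map_reads(reference, reads, k):
    """Map each cluster once, then expand the hit to all homologous loci."""
    homology = build_homology_table(reference, k)
    results = {}
    for read, members in compress_reads(reads).items():
        locus = reference.find(read)  # coarse mapping: one locus per cluster
        if locus == -1:
            loci = []
        else:
            # expansion: look up every locus homologous to the coarse hit
            loci = list(homology[reference[locus:locus + k]])
        for index in members:
            results[index] = loci
    return results


if __name__ == "__main__":
    ref = "ACGTACGTTT"
    print(map_reads(ref, ["ACGT", "ACGT", "GTTT", "AAAA"], k=4))
```

In this sketch the speed-up comes from searching the reference once per distinct read rather than once per read, and from reusing the precomputed table to recover repeated loci.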
Data storage based on encoded DNA sequences
Devices, methods, and systems for encoding data as DNA are provided. An encoder device can include circuitry to encode a data file having a bit sequence encoding data and to generate a virtual DNA (VDNA) sequence of virtual nucleotide bases (Vnb) that reversibly encodes the bit sequence of the data file, divide the VDNA sequence into a plurality of VDNA fragments, associate each VDNA fragment with an archive library sequence (Arc_SEQ), and generate a read instruction (READ) sequence of differences between each VDNA fragment and each associated Arc_SEQ including sufficient instruction to facilitate regeneration of each VDNA fragment from each associated Arc_SEQ. A codeword sequence (Code_SEQ) is additionally generated for each VDNA fragment that includes a codename identifying the associated Arc_SEQ, the READ sequence associated with the VDNA fragment, and an index sequence (Idx_SEQ) including an index mapping of the VDNA fragment in the VDNA sequence.
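The encoding steps in this abstract can be sketched in Python under several assumptions that do not come from the source: a fixed two-bits-per-base mapping (00 to A, 01 to C, 10 to G, 11 to T), fixed-size fragments, and READ instructions limited to substitutions between a fragment and an archive sequence of equal length. All function names are hypothetical. The codename and Idx_SEQ parts of the codeword are not modeled.

```python
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}  # assumed mapping
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_vdna(data: bytes) -> str:
    """Reversibly encode a byte string as virtual bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def vdna_to_bytes(vdna: str) -> bytes:
    """Invert bytes_to_vdna."""
    bits = "".join(BASE_TO_BITS[base] for base in vdna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def fragment(vdna: str, size: int) -> list:
    """Divide a VDNA sequence into consecutive fragments of at most `size` bases."""
    return [vdna[i:i + size] for i in range(0, len(vdna), size)]


def read_instructions(frag: str, arc_seq: str) -> list:
    """List the (position, base) substitutions that turn arc_seq into frag."""
    if len(frag) != len(arc_seq):
        raise ValueError("this sketch only supports equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(arc_seq, frag)) if a != b]


def regenerate(arc_seq: str, instructions: list) -> str:
    """Rebuild a fragment from its archive sequence and READ instructions."""
    bases = list(arc_seq)
    for position, base in instructions:
        bases[position] = base
    return "".join(bases)


if __name__ == "__main__":
    vdna = bytes_to_vdna(b"hello")
    frags = fragment(vdna, 4)
    arc = "CAAA"  # hypothetical archive library sequence
    read = read_instructions(frags[0], arc)
    print(vdna, frags, read, regenerate(arc, read) == frags[0])
```

The closer each fragment is to its associated archive sequence, the shorter its list of differences, which is what makes storing differences rather than whole fragments worthwhile.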
Methods for compression of molecular tagged nucleic acid sequence data
A method for compressing molecular tagged sequence data includes: grouping sequence reads associated with a molecular tag sequence to form a family of sequence reads, corresponding vectors of flow space signal measurements and corresponding sequence alignments, calculating an arithmetic mean of the corresponding vectors of flow space signal measurements to form a vector of consensus flow space signal measurements, calculating a standard deviation of the corresponding vectors of flow space signal measurements to form a vector of standard deviations, determining a consensus base sequence based on the vector of consensus flow space signal measurements, determining a consensus sequence alignment and generating a compressed data structure comprising consensus compressed data, the consensus compressed data including for each family, the consensus base sequence, the consensus sequence alignment, the vector of consensus flow space signal measurements, the vector of standard deviations and the number of members.
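The per-family statistics in this abstract can be sketched in Python as follows. Several details are assumptions made for illustration and do not come from the source: reads arrive as (tag, signal vector) pairs, the standard deviation is the population form, the flow order is "TACG", and the consensus base sequence is obtained by rounding each consensus signal to the nearest whole homopolymer length. Sequence alignments are omitted.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def group_by_tag(reads):
    """Group flow-space signal vectors into families keyed by molecular tag."""
    families = defaultdict(list)
    for tag, signal in reads:
        families[tag].append(signal)
    return families


def compress_family(signals, flow_order="TACG"):
    """Reduce one family to consensus signal, spread, bases, and member count."""
    columns = list(zip(*signals))  # one column per flow
    consensus = [fmean(column) for column in columns]
    spread = [pstdev(column) for column in columns]  # population std (assumed)
    bases = "".join(
        # nearest non-negative integer = assumed homopolymer length for this flow
        flow_order[i % len(flow_order)] * max(0, int(mean + 0.5))
        for i, mean in enumerate(consensus)
    )
    return {
        "consensus_signal": consensus,
        "std": spread,
        "consensus_bases": bases,
        "members": len(signals),
    }


if __name__ == "__main__":
    reads = [
        ("TAG1", [1.0, 0.0, 2.0]),
        ("TAG1", [1.2, 0.2, 1.8]),
        ("TAG2", [0.1, 1.1, 0.9]),
    ]
    for tag, signals in group_by_tag(reads).items():
        print(tag, compress_family(signals))
```

Storing one consensus vector, one vector of standard deviations, and a member count per family replaces the per-read signal vectors, so the saving grows with family size.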