Patent classifications
G16B50/50
Data structures and operations for searching, computing, and indexing in DNA-based data storage
The present disclosure is directed to enabling search and extraction of data stored in DNA with optimized data structures and functions. Accordingly, systems and methods are provided herein for performing certain functions on data stored in nucleic acid molecules. The present disclosure covers at least the following areas of interest: (1) data structures to provide efficient access and search of information stored in nucleic acid molecules, (2) accurate and quick reading of information stored in nucleic acid molecules, (3) targeted approaches to accessing subsets of information stored in nucleic acid molecules, (4) a rank function that determines a count of a particular bit or symbol value in a set of information stored in nucleic acid molecules, (5) functions including counting, locating, and extracting occurrences of a specific pattern in a message of information stored in nucleic acid molecules, and (6) an if-then-else operation to sort data stored in nucleic acid molecules.
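The rank function in item (4) can be illustrated in ordinary software terms. The sketch below is not the disclosed DNA-based method, which the abstract does not detail; it is a minimal in-memory example that assumes the data is a list of 0/1 integers. The class name `RankIndex` and the `block` parameter are hypothetical, chosen only for illustration.

```python
class RankIndex:
    """Sampled prefix counts of 1-bits (illustrative, in-memory only).

    Assumption: `bits` is a list of 0/1 integers. A count is stored
    every `block` positions, so a query scans at most `block` bits.
    """

    def __init__(self, bits, block=4):
        self.bits = bits
        self.block = block
        self.samples = [0]  # samples[q] = number of 1-bits in bits[0 : q * block]
        count = 0
        for i, bit in enumerate(bits, start=1):
            count += bit
            if i % block == 0:
                self.samples.append(count)

    def rank1(self, i):
        """Return the number of 1-bits in bits[0:i]."""
        q = i // self.block
        return self.samples[q] + sum(self.bits[q * self.block:i])

    def rank0(self, i):
        """Return the number of 0-bits in bits[0:i]."""
        return i - self.rank1(i)


index = RankIndex([1, 0, 1, 1, 0, 0, 1, 0, 1])
print(index.rank1(7), index.rank0(9))
```

Storing a sample per block trades a small amount of extra space for bounded query time, which is the usual design choice for rank structures over bit vectors.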
COMPACT GENOME DATA STORAGE WITH RANDOM ACCESS
The subject technology provides compact, searchable, random-access storage of genome data, particularly for large datasets, such as an entire human genome. The genome data may be stored in binary format, and compressed, in part, by leveraging characteristics of genome data itself, and in a way that maintains searchability of the stored compressed data.
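The abstract does not say which binary format is used. As a rough illustration of compact storage that remains randomly accessible, the sketch below packs each base into 2 bits, a widely used encoding for nucleotide data. It assumes the sequence contains only A, C, G, and T; the function names `pack` and `base_at` are hypothetical and are not taken from the listed disclosure.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def base_at(packed, i):
    """Read the base at position i without unpacking anything else."""
    return BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11]


packed = pack("GATTACA")
print(len(packed), base_at(packed, 4))
```

Because every base occupies a fixed number of bits, the byte and bit offset of any position can be computed directly, which is what keeps lookups constant-time.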
METHODS FOR COMPRESSION OF MOLECULAR TAGGED NUCLEIC ACID SEQUENCE DATA
A method for compressing molecular tagged sequence data includes: grouping sequence reads associated with a molecular tag sequence to form a family of sequence reads, corresponding vectors of flow space signal measurements, and corresponding sequence alignments; calculating an arithmetic mean of the corresponding vectors of flow space signal measurements to form a vector of consensus flow space signal measurements; calculating a standard deviation of the corresponding vectors of flow space signal measurements to form a vector of standard deviations; determining a consensus base sequence based on the vector of consensus flow space signal measurements; determining a consensus sequence alignment; and generating a compressed data structure comprising consensus compressed data, the consensus compressed data including, for each family, the consensus base sequence, the consensus sequence alignment, the vector of consensus flow space signal measurements, the vector of standard deviations, and the number of members.
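The grouping and averaging steps can be sketched as follows. This is a simplified illustration under several assumptions that the abstract does not state: each read is a `(tag, flow_vector)` pair, all vectors in a family have equal length, the flow order is the hypothetical cycle `"TACG"`, the population standard deviation is used, and the consensus base sequence is obtained by rounding each mean flow value to a homopolymer length. The consensus sequence alignment is omitted.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def compress_families(reads, flow_order="TACG"):
    """Group reads by molecular tag and summarize each family.

    reads: iterable of (tag, flow_vector) pairs (assumed input shape).
    """
    families = defaultdict(list)
    for tag, flows in reads:
        families[tag].append(flows)

    compressed = {}
    for tag, vectors in families.items():
        columns = list(zip(*vectors))            # one column per flow position
        mean = [fmean(col) for col in columns]   # consensus flow measurements
        std = [pstdev(col) for col in columns]   # per-flow spread in the family
        # Assumption: a mean flow value near n implies n copies of that flow's base.
        bases = "".join(
            flow_order[i % len(flow_order)] * round(m) for i, m in enumerate(mean)
        )
        compressed[tag] = {
            "consensus_bases": bases,
            "consensus_flows": mean,
            "flow_std": std,
            "members": len(vectors),
        }
    return compressed


reads = [
    ("AAT", [1.0, 0.0, 2.0, 1.0]),
    ("AAT", [1.2, 0.2, 1.8, 0.8]),
    ("CCG", [0.0, 1.0, 0.0, 1.0]),
]
print(compress_families(reads)["AAT"]["consensus_bases"])
```

Each family collapses to one record regardless of how many reads it contains, which is where the compression comes from.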
Alignment methods, devices and systems
The disclosure provides an alignment method, device, and system. The alignment method includes: converting each read into a set of short fragments corresponding to the read to obtain a plurality of sets of short fragments; determining a corresponding position of each short fragment in a reference library to obtain a first positioning result, wherein the reference library is a hash table constructed based on a reference sequence, the reference library includes a plurality of entries, one entry of the reference library corresponds to one seed sequence, the seed sequence is capable of matching at least one sequence on the reference sequence, and a distance on the reference sequence between two seed sequences corresponding to two adjacent entries of the reference library is less than a length of the short fragment; removing a short fragment positioned on any one of the adjacent entries of the reference library in the first positioning result to obtain a second positioning result; and extending based on short fragments from the same read in the second positioning result to obtain an alignment result of the read. The alignment method can efficiently and accurately process and position sequencing data.
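The hash-table lookup at the core of this method can be sketched in a few lines. The example below covers only the first positioning step, plus a simple vote on the implied read start in place of the extension step; the filtering step that yields the second positioning result is not modeled because the abstract does not specify it precisely. It assumes exact matching and that short fragments and seeds share the same length; the names `build_seed_table` and `locate_read` are hypothetical.

```python
from collections import Counter, defaultdict


def build_seed_table(reference, seed_len, stride):
    """Hash table mapping each sampled seed to its reference positions.

    stride should be smaller than seed_len so that adjacent seeds are
    closer together than one fragment length.
    """
    table = defaultdict(list)
    for pos in range(0, len(reference) - seed_len + 1, stride):
        table[reference[pos:pos + seed_len]].append(pos)
    return table


def locate_read(read, table, seed_len):
    """Split the read into overlapping fragments and vote on its start."""
    votes = Counter()
    for offset in range(len(read) - seed_len + 1):
        fragment = read[offset:offset + seed_len]
        for pos in table.get(fragment, ()):
            votes[pos - offset] += 1  # implied start of the read on the reference
    return votes.most_common(1)[0][0] if votes else None


reference = "ACGTTGCAAGCTTAGGCTAACGT"
table = build_seed_table(reference, seed_len=5, stride=3)
print(locate_read(reference[7:17], table, seed_len=5))
```

Sampling seeds with a stride instead of indexing every position keeps the table small, while taking every overlapping fragment of the read ensures that some fragment still lands on a sampled seed.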
Artificial intelligence analysis of RNA transcriptome for drug discovery
A system and method may be provided to receive sample RNA reads from patients and generate lists of genes and their associated RNA expression levels in each patient. Some of the RNA reads may be matched to an RNA transcript, gene, or gene family based on their match likelihood, and other RNA reads may be matched to an RNA transcript, gene, or gene family through the use of one or more machine learning classifiers. A machine learning classifier may be trained based on the plurality of the lists and a plurality of corresponding patients' clinical status data to identify gene patterns that recur with a high degree of frequency in the plurality of the lists. Those gene patterns may be capable of modifying a disease or treatment response and can be targeted for drug/treatment development.
SYSTEMS AND METHODS FOR FACILITATING RAPID GENOME SEQUENCE ANALYSIS
A method for facilitating rapid genome sequence analysis includes accessing an output stream of an alignment process that includes aligned reads of a biological sequence that are aligned to a reference genome. The method also includes distributing the aligned reads to a plurality of computing nodes based on genomic position. Each of the plurality of computing nodes is assigned to a separate data bin of a plurality of data bins associated with genomic position. The method also includes, for at least one aligned read determined to overlap separate data bins of the plurality of data bins, duplicating the at least one aligned read and distributing the at least one aligned read to separate computing nodes of the plurality of computing nodes that are assigned to the separate data bins.
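The distribution step, including the duplication of reads that straddle a bin boundary, can be sketched as below. This is a minimal illustration, not the claimed system: it assumes a single chromosome, fixed-width bins, reads given as `(name, start, end)` with half-open coordinates, and it uses plain lists in place of computing nodes. The function name `distribute_reads` is hypothetical.

```python
from collections import defaultdict


def distribute_reads(reads, bin_size):
    """Assign each aligned read to every positional bin it overlaps.

    reads: iterable of (name, start, end), with end exclusive (assumed).
    Returns {bin_index: [read names]}; each bin stands in for one node.
    """
    bins = defaultdict(list)
    for name, start, end in reads:
        first = start // bin_size
        last = (end - 1) // bin_size
        for b in range(first, last + 1):  # more than one bin means the read is duplicated
            bins[b].append(name)
    return dict(bins)


reads = [("r1", 10, 60), ("r2", 90, 130), ("r3", 100, 150)]
print(distribute_reads(reads, bin_size=100))
```

Duplicating a boundary-spanning read lets each node process its bin independently, at the cost of handling that read more than once.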