Patent classifications
G16B50/50
Method for the Compression of Genome Sequence Data
The invention relates to a reference-based method for the compression of genome sequence data produced by a sequencing machine. Sequences of nucleotides (bases) that have previously been aligned to a reference sequence are determined to be perfectly mapped, imperfectly mapped, or unmapped with respect to the reference sequence, and are then coded according to that determination. The determining step comprises comparing, for each imperfectly mapped sequence, the number of mismatches between that sequence and the reference sequence with a reference threshold value, and encoding the imperfectly mapped sequences according to distinct encoding processes depending on the result of the comparison.
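The determination step described above can be sketched roughly as follows. This is a minimal illustration, not the patented method: the function name `classify_read`, the category labels, the inclusive threshold comparison, and the restriction to substitution-only alignments (no insertions or deletions) are all assumptions made for the example.

```python
def classify_read(read, reference, position, threshold):
    """Classify one aligned read and list its mismatches.

    position is None for a read the aligner could not place.
    Assumes a gap-free alignment, so the read and the reference
    window have the same length (an assumption of this sketch).
    """
    if position is None:
        return "unmapped", None

    window = reference[position:position + len(read)]
    # Record (offset, read_base) wherever the read differs from the reference.
    mismatches = [
        (offset, base)
        for offset, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]

    if not mismatches:
        return "perfect", []
    # Hypothetical split: the abstract only says that the mismatch count is
    # compared with a threshold and that the two outcomes are encoded differently.
    if len(mismatches) <= threshold:
        return "imperfect_few", mismatches
    return "imperfect_many", mismatches
```

With the reference `"ACGTACGTAC"` and a threshold of 2, the read `"ACCT"` at position 0 is classified as `"imperfect_few"` with the single mismatch `(2, "C")`. An encoder could then store only the position and mismatch list for that category and fall back to a different scheme for reads above the threshold.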
Nick-based data storage in native nucleic acids
Methods, devices, and systems for nick-based data storage in a deoxyribonucleic acid (DNA) sequence are disclosed. Digital information is encoded in a register of at least one copy of a double-stranded DNA sequence having a plurality of nickable positions. The data is translated into a sequence of values from a nick alphabet, which is then mapped to the plurality of nickable positions, and the DNA sequence is nicked according to the mapped values. Because the digital information is encoded as a series of nicked and non-nicked positions of a double-stranded DNA sequence, the DNA itself can be non-synthetic, or “native”, DNA.
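The mapping from data to nicked positions might look like the sketch below. It assumes the simplest possible nick alphabet, a binary one in which 1 means "nick this position" and 0 means "leave it intact"; the abstract does not specify the alphabet, and the function names and position list are hypothetical.

```python
def encode_nicks(data: bytes, nickable_positions: list) -> set:
    """Map data bits onto nickable positions; return the positions to nick."""
    # Expand each byte into 8 bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    if len(bits) > len(nickable_positions):
        raise ValueError("register has too few nickable positions")
    return {pos for bit, pos in zip(bits, nickable_positions) if bit}


def decode_nicks(nicked: set, nickable_positions: list, n_bytes: int) -> bytes:
    """Read back n_bytes from the observed set of nicked positions."""
    bits = [1 if pos in nicked else 0 for pos in nickable_positions[:n_bytes * 8]]
    out = bytearray()
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)
```

For a register with nickable positions `list(range(10, 410, 25))`, encoding `b"H"` (binary 01001000) nicks positions 35 and 110, and decoding those nicks returns the original byte. The nucleotide sequence is never altered, which is what allows native DNA to serve as the medium.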
METHOD AND DEVICE FOR CREATING GENE MUTATION DICTIONARY, AND METHOD AND DEVICE FOR COMPRESSING GENOMIC DATA USING THE DICTIONARY
Provided are a method and device for creating a gene mutation dictionary, and a method and device for compressing genomic data using the gene mutation dictionary. The method for creating a gene mutation dictionary includes: obtaining genome sequence data of a plurality of individuals of a species and reference genome data of the species; aligning genome sequence data of each individual to the reference genome data to obtain a mutation result of the genome sequence data of each individual relative to the reference genome data; partitioning a genome of the species into a plurality of unit regions of biological significance; and generating a plurality of mutant patterns of the individuals in each unit region by statistically analyzing mutant status for each unit region based on the mutation result, and numbering the mutant patterns, to obtain the gene mutation dictionary.
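The dictionary-building steps above can be sketched as follows. The sketch assumes that alignment has already produced a list of `(position, ref, alt)` mutations per individual, that unit regions are given as half-open coordinate intervals, and that patterns are numbered by descending frequency; the abstract only says the patterns are numbered, so the ordering and all names here are assumptions.

```python
from collections import Counter, defaultdict


def pattern_in_region(mutations, start, end):
    """The mutant pattern of one individual in one region, as a hashable tuple."""
    return tuple(sorted(m for m in mutations if start <= m[0] < end))


def build_mutation_dictionary(mutations_by_individual, regions):
    """Return {region: {pattern: id}} from per-individual mutation lists."""
    counts = defaultdict(Counter)
    for mutations in mutations_by_individual.values():
        for region, (start, end) in regions.items():
            counts[region][pattern_in_region(mutations, start, end)] += 1

    dictionary = {}
    for region, counter in counts.items():
        # Assumed numbering: most frequent pattern gets id 0; ties are broken
        # by the pattern itself so the result is deterministic.
        ordered = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        dictionary[region] = {pattern: idx for idx, (pattern, _) in enumerate(ordered)}
    return dictionary


def encode_individual(mutations, regions, dictionary):
    """Compress one individual to a single pattern id per region."""
    return {
        region: dictionary[region][pattern_in_region(mutations, start, end)]
        for region, (start, end) in regions.items()
    }
```

An individual whose genome matches known patterns is then stored as one small integer per unit region instead of a full mutation list. A pattern missing from the dictionary raises `KeyError` in this sketch; a real encoder would need an escape mechanism for novel patterns.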
K-mer based genomic reference data compression
A computer-implemented method includes receiving genomic data associated with a plurality of genomes and identifying k-mer sets within the genomic data. The method includes constructing a k-mer subset tree according to the following process: performing iterative pairwise comparisons on the k-mer sets, wherein the iterative pairwise comparisons identify fragments with the most shared k-mers, merging the identified fragments into non-leaf nodes of the k-mer subset tree, and placing each remaining k-mer into a leaf node of the k-mer subset tree. The method includes storing the k-mer subset tree. A computer program product for data compression includes a computer readable storage medium having program instructions embodied therewith. The program instructions are executable by a computer to cause the computer to perform the foregoing method. A system includes a processor and logic. The logic is configured to perform the foregoing method.
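One way to read the tree construction above is as a greedy pairwise merge, sketched below. This is an interpretation under stated assumptions rather than the claimed procedure: the dictionary-based node layout, the rule of moving shared k-mers up into a new parent node, and the names `kmers`, `build_kmer_subset_tree`, and `reconstruct` are all hypothetical.

```python
from itertools import combinations


def kmers(sequence, k):
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_kmer_subset_tree(kmer_sets):
    """Greedily merge the two nodes sharing the most k-mers until one root remains."""
    nodes = [{"name": name, "kmers": set(ks), "children": []}
             for name, ks in kmer_sets.items()]
    while len(nodes) > 1:
        a, b = max(combinations(nodes, 2),
                   key=lambda pair: len(pair[0]["kmers"] & pair[1]["kmers"]))
        shared = a["kmers"] & b["kmers"]
        # Shared k-mers move up into the new non-leaf node; each child keeps
        # only what is not already stored above it.
        a["kmers"] -= shared
        b["kmers"] -= shared
        nodes = [n for n in nodes if n is not a and n is not b]
        nodes.append({"name": None, "kmers": shared, "children": [a, b]})
    return nodes[0]


def reconstruct(tree, name, inherited=frozenset()):
    """Recover a genome's full k-mer set as the union along its root-to-leaf path."""
    here = inherited | tree["kmers"]
    if tree["name"] == name:
        return set(here)
    for child in tree["children"]:
        found = reconstruct(child, name, here)
        if found is not None:
            return found
    return None
```

Each shared k-mer is stored once in an ancestor node instead of once per genome, which is where the space saving comes from. The exhaustive pairwise search in each round is quadratic in the number of nodes, so this sketch only suits small inputs.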
Method and apparatus for a pipelined DNA memory hierarchy
One embodiment of a memory stores information, including address bits, on DNA strands and provides access using a pipeline of tubes, where each tube selectively transfers half of the strands to the next tube based on probing of associated address bits. Transfers are controlled by logic relating to the state of the tubes. The pipeline may be initialized to start at a high-order target address, providing random access without enzymes, synthesizing probe molecules, or PCR at access time. Thereafter, a processing unit gets fast access to sequentially addressed strands each cycle, for applications like executing machine language instructions or reading blocks of data from a file. Another embodiment with a compare unit allows low-order random access. Provided that addresses are encoded using single-stranded regions of DNA where probe molecules may hybridize, other information may use any DNA encoding. Electronic/electrochemical (electrowetting, nanopore, etc.) embodiments as well as biochemical embodiments are possible.
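The halving behaviour of the tube pipeline can be illustrated with a small software model. It is only an analogy under stated assumptions: strands are modelled as `(address, payload)` tuples, each pipeline stage probes one address bit from high order to low order, and the function name `pipeline_select` is hypothetical. The control logic for sequential access described in the abstract is not modelled.

```python
def pipeline_select(strands, target_address, address_bits):
    """Pass a pool of (address, payload) strands through one stage per address bit.

    Returns the strands that reach the end of the pipeline and the pool
    size after each stage.
    """
    pool = list(strands)
    sizes = []
    for bit_index in range(address_bits - 1, -1, -1):
        wanted = (target_address >> bit_index) & 1
        # Stand-in for probing: keep only strands whose address bit matches.
        pool = [s for s in pool if ((s[0] >> bit_index) & 1) == wanted]
        sizes.append(len(pool))
    return pool, sizes
```

With eight strands addressed 0 through 7 and target address 5, the pool shrinks to 4, then 2, then 1 strand, leaving only the strand at address 5. When every address is present exactly once, each stage passes on half of what it receives, matching the "half of the strands" description.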
Improved Quality Value Compression Framework in Aligned Sequencing Data Based on Novel Contexts
A method for compressing information includes accessing a read of genomic sequencing data, aligning the read to a reference, generating alignment data based on alignment of the read, obtaining a set of contexts based on the alignment data, and compressing quality values corresponding to the alignment data based on the set of contexts. The alignment data may provide an indication of errors in the genomic sequencing data, and each of the quality values may provide an indication of a probability of error at one or more bases in the genomic sequencing data.
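The benefit of alignment-derived contexts can be illustrated with a small estimate of coded size. The sketch assumes a single, very simple context (whether each base matches the reference) and measures the ideal cost of an entropy coder using empirical per-context statistics; it is not an actual compressor, and the function names and the choice of context are assumptions rather than the contexts claimed in the patent.

```python
import math
from collections import Counter, defaultdict


def alignment_contexts(read, reference, position):
    """Label each base of a gap-free aligned read as 'match' or 'mismatch'."""
    return ["match" if base == reference[position + i] else "mismatch"
            for i, base in enumerate(read)]


def entropy_bits(counts):
    """Empirical entropy, in bits per symbol, of a Counter of symbols."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def coded_size_bits(qualities, contexts):
    """Ideal coded size when each context has its own symbol statistics."""
    by_context = defaultdict(Counter)
    for quality, context in zip(qualities, contexts):
        by_context[context][quality] += 1
    return sum(sum(c.values()) * entropy_bits(c) for c in by_context.values())
```

As a usage example, take the read `"ACGTAC"` aligned at position 0 of the reference `"ACTTCC"` with qualities `[40, 40, 10, 40, 10, 40]`. The two low-quality bases coincide with the two mismatches, so the per-context cost is 0 bits, whereas coding all six values under one shared context costs about 5.5 bits. Real quality data is far noisier, but the same effect is what makes alignment-based contexts useful.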