G06N3/123

Generating machine learning models using genetic data

Systems, methods, and apparatuses for generating and using machine learning models using genetic data. A set of input features for training the machine learning model can be identified and used to train the model based on training samples, e.g., samples for which one or more labels are known. As examples, the input features can include aligned variables (e.g., derived from sequences aligned to population-level or individual references) and/or non-aligned variables (e.g., sequence content). The features can be classified into different groups based on the underlying genetic data or on intermediate values resulting from processing of the underlying genetic data. Features can be selected from a feature space to create a feature vector for training a model. The selection and creation of feature vectors can be performed iteratively to train many models as part of a search for optimal features and an optimal model.
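
As a rough illustration of the iterative search this abstract describes, the sketch below repeatedly draws a feature subset from a feature space, trains a toy model on it, and keeps the best-scoring subset. The nearest-centroid classifier, the random-subset strategy, the feature names, and the toy samples are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random

def train_centroids(samples, labels, features):
    """Toy 'model' (an assumption): the mean feature vector per label."""
    sums, counts = {}, {}
    for sample, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, name in enumerate(features):
            acc[i] += sample[name]
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(model, sample, features):
    """Return the label whose centroid is closest to the sample."""
    vec = [sample[name] for name in features]
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(model, key=lambda lab: sq_dist(model[lab]))

def accuracy(model, samples, labels, features):
    hits = sum(predict(model, s, features) == y for s, y in zip(samples, labels))
    return hits / len(samples)

def search_features(train, train_y, val, val_y, feature_space, k=2, n_iter=20, seed=0):
    """Iteratively sample k-feature subsets, train a model on each, keep the best."""
    rng = random.Random(seed)
    best_subset, best_score = None, -1.0
    for _ in range(n_iter):
        subset = tuple(sorted(rng.sample(feature_space, k)))
        model = train_centroids(train, train_y, subset)
        score = accuracy(model, val, val_y, subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical features: one aligned variable, one non-aligned variable, one noise column.
FEATURE_SPACE = ["aligned_variant_count", "kmer_gc_fraction", "noise"]
TRAIN = [
    {"aligned_variant_count": 1, "kmer_gc_fraction": 0.40, "noise": 0.7},
    {"aligned_variant_count": 2, "kmer_gc_fraction": 0.42, "noise": 0.2},
    {"aligned_variant_count": 9, "kmer_gc_fraction": 0.60, "noise": 0.6},
    {"aligned_variant_count": 8, "kmer_gc_fraction": 0.58, "noise": 0.3},
]
TRAIN_Y = ["control", "control", "case", "case"]
VAL = [
    {"aligned_variant_count": 2, "kmer_gc_fraction": 0.41, "noise": 0.9},
    {"aligned_variant_count": 9, "kmer_gc_fraction": 0.59, "noise": 0.1},
]
VAL_Y = ["control", "case"]

best_subset, best_score = search_features(TRAIN, TRAIN_Y, VAL, VAL_Y, FEATURE_SPACE)
```

A real search would use held-out data at scale and a stronger model family; the point here is only the loop of selecting features, building feature vectors, training, and scoring.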

Method and apparatus for a pipelined DNA memory hierarchy
11515012 · 2022-11-29 ·

One embodiment of a memory stores information, including address bits, on DNA strands and provides access using a pipeline of tubes, where each tube selectively transfers half of the strands to the next tube based on probing of associated address bits. Transfers are controlled by logic relating to the state of the tubes. The pipeline may be initialized to start at a high-order target address, providing random access without enzymes, probe-molecule synthesis, or PCR at access time. Thereafter, a processing unit gets fast access to sequentially addressed strands each cycle, for applications like executing machine language instructions or reading blocks of data from a file. Another embodiment with a compare unit allows low-order random access. Provided that addresses are encoded using single-stranded regions of DNA where probe molecules may hybridize, other information may use any DNA encoding. Electronic/electrochemical (electrowetting, nanopore, etc.) embodiments as well as biochemical embodiments are possible.
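
The halving behaviour of the tube pipeline can be pictured with a small software model. The sketch below is only an analogy under stated assumptions: strands are modeled as hypothetical (address, payload) tuples, "probing" is a bit test, and the per-cycle pipelining and sequential-access behaviour of the described memory are ignored.

```python
def probe_bit(address, bit_index):
    """Stand-in for probing one address bit on a strand (an assumption)."""
    return (address >> bit_index) & 1

def pipeline_select(strands, target, n_bits):
    """Pass strands through n_bits tubes, probing the high-order bit first.

    Each tube forwards only the strands whose probed bit matches the target,
    which is half of them when every address is present exactly once.
    """
    tube = list(strands)
    sizes = [len(tube)]
    for bit_index in reversed(range(n_bits)):
        wanted = probe_bit(target, bit_index)
        tube = [(addr, data) for addr, data in tube
                if probe_bit(addr, bit_index) == wanted]
        sizes.append(len(tube))
    return tube, sizes

# Eight strands with 3-bit addresses 0..7 and placeholder payloads.
strands = [(addr, f"payload{addr}") for addr in range(8)]
selected, sizes = pipeline_select(strands, target=5, n_bits=3)
```

In this toy run the population shrinks from 8 to 4 to 2 to 1 strands, leaving only the strand at address 5.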

REVERSING BIAS IN POLYMER SYNTHESIS ELECTRODE ARRAY
20220362734 · 2022-11-17 ·

Polymers synthesized by solid-phase synthesis are selectively released from a solid support by reversing the bias of spatially addressable electrodes. A change in the current and voltage direction at one or more of the spatially addressable electrodes changes the ionic environment, which triggers cleavage of linkers and releases the attached polymers. The spatially addressable electrodes may be implemented as CMOS inverters embedded in an integrated circuit (IC). The IC may contain an array of many thousands of spatially addressable electrodes. Control circuitry may independently reverse the bias on any of the individual electrodes in the array. This provides fine-grained control of which polymers are released from the solid support. Examples of polymers that may be synthesized on this type of array include oligonucleotides and peptides.

Generation of protein sequences using machine learning techniques

Amino acid sequences of antibodies can be generated using a generative adversarial network that includes a first generating component that generates amino acid sequences of antibody light chains and a second generating component that generates amino acid sequences of antibody heavy chains. Amino acid sequences of antibodies can be produced by combining the respective amino acid sequences produced by the first generating component and the second generating component. The training of the first generating component and the second generating component can proceed at different rates. Additionally, the antibody amino acid sequences produced by combining amino acid sequences from the first generating component and the second generating component may be evaluated according to complementarity-determining regions of the antibody amino acid sequences. Training datasets may be produced using amino acid sequences that correspond to antibodies that have particular binding affinities with respect to molecules, such as binding affinity with major histocompatibility complex (MHC) molecules.

DIGITAL TRANSACTION LEDGER WITH DNA-RELATED LEDGER PARAMETER

A digital transaction ledger with a DNA-related parameter is provided by obtaining DNA-based data unique to a particular entity and establishing a DNA-related ledger parameter using the DNA-based data. Further, the method includes associating the DNA-related ledger parameter with a digital transaction ledger, making the digital transaction ledger related, at least in part, to the obtained DNA-based data.
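
One way to picture the association is sketched below. It assumes, purely for illustration, that the ledger parameter is a SHA-256 digest of the DNA-based data and that the ledger is a simple hash chain seeded with that digest; the abstract does not say how the parameter is derived or how the ledger is structured, and all function names here are hypothetical.

```python
import hashlib
import json

def dna_ledger_parameter(dna_data: str) -> str:
    """Derive a ledger parameter from DNA-based data (assumed: SHA-256 digest)."""
    normalized = dna_data.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()

def new_ledger(parameter: str) -> dict:
    """Create an empty ledger associated with the DNA-related parameter."""
    return {"dna_parameter": parameter, "entries": []}

def append_entry(ledger: dict, payload: dict) -> dict:
    """Append an entry chained to the previous entry, or to the parameter if first."""
    if ledger["entries"]:
        prev = ledger["entries"][-1]["hash"]
    else:
        prev = ledger["dna_parameter"]
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    entry = {
        "prev": prev,
        "payload": payload,
        "hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
    ledger["entries"].append(entry)
    return entry

# Placeholder sequence data, not taken from any real entity.
param = dna_ledger_parameter("acgtacgtttagc")
ledger = new_ledger(param)
first = append_entry(ledger, {"from": "alice", "to": "bob", "amount": 3})
second = append_entry(ledger, {"from": "bob", "to": "carol", "amount": 1})
```

Because the first entry chains back to the parameter, every later entry depends, at least in part, on the DNA-based data.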

PROGRAMS AND FUNCTIONS IN DNA-BASED DATA STORAGE

Systems and methods are provided herein for encoding and storing information in nucleic acids. Encoded information is partitioned and stored in nucleic acids having native key-value pairs that allow for storage of metadata or other data objects. Computation on the encoded information is performed by chemical implementation of if-then-else operations. Numerical data is stored in nucleic acids by producing samples in which the copy counts of nucleic acid sequences correspond to the numerical data. Data objects of a dataset are encoded by partitioning bytes into parts and encoding the parts along distinct libraries of nucleic acids. These libraries can be used as inputs for computation on the dataset.
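
The byte-partitioning idea can be sketched in software. The example below assumes, for illustration only, that each byte is split into two 4-bit parts and that each part goes to its own "library", modeled here as a list of hypothetical (position, value) records; the actual partitioning scheme and molecular encoding are not given in the abstract.

```python
def encode_bytes(data: bytes):
    """Split each byte into a high and a low 4-bit part, one library per part."""
    high_library, low_library = [], []
    for position, byte in enumerate(data):
        high_library.append((position, byte >> 4))    # upper four bits
        low_library.append((position, byte & 0x0F))   # lower four bits
    return high_library, low_library

def decode_bytes(high_library, low_library) -> bytes:
    """Recombine the two libraries into the original byte string."""
    high = dict(high_library)
    low = dict(low_library)
    return bytes((high[i] << 4) | low[i] for i in range(len(high)))

high_lib, low_lib = encode_bytes(b"DNA")
restored = decode_bytes(high_lib, low_lib)
```

Keeping the parts in separate libraries means each library can be read or operated on independently before the parts are recombined.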

BIOCOMPATIBLE NUCLEIC ACIDS FOR DIGITAL DATA STORAGE

A device for the storage and/or the editing of digital data including at least one double stranded, replicative, composite nucleic acid molecule. The composite nucleic acid molecule includes both digital data-encoding and non-digital data-encoding nucleic acids. The non-digital data-encoding nucleic acids may allow indexing and/or the provision of metadata for the flanking digital data-encoding nucleic acid. The composite nucleic acid molecules may be pooled to constitute an array and arrays may constitute a DNA drive, which represents the physical support on which the digital data are stored.

MACHINE LEARNING (ML) MODELING BY DNA COMPUTING
20230089824 · 2023-03-23 ·

Methods, computer program products, and systems are presented. The methods include, for instance: building, by one or more DNA processor, a DNA strand corresponding to a conditional expectation. The methods include, for instance: obtaining, by one or more DNA processor, a conditional expectation having a regularization metric.

AUTOMATICALLY IDENTIFYING FAILURE SOURCES IN NUCLEOTIDE SEQUENCING FROM BASE-CALL-ERROR PATTERNS
20230093253 · 2023-03-23 ·

Methods, systems, and non-transitory computer readable media are disclosed for accurately and efficiently identifying base-call-error scars or patterns from sequencing data to determine failure sources that contribute to the base-call-error scars or patterns. For example, the disclosed system can utilize a reference genome to determine nucleotide-specific errors within a run of a sequencing pipeline. Based on the co-occurrence of different nucleotide-specific errors, the disclosed system can determine a base-call-error scar. The disclosed system can further determine one or more sample error scars from sample sequencing runs that correlate to the base-call-error scar. Based on the correlation and by utilizing a statistical model, the disclosed system can identify failure sources contributing to the nucleotide-specific errors within the base-call-error scar.
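
A minimal sketch of the first steps described here: compare reads to a reference to count nucleotide-specific errors, then correlate the resulting error profiles between runs. It assumes reads are already aligned without indels and represented as hypothetical (start, sequence) tuples, and it uses a plain Pearson correlation as a stand-in for whatever correlation and statistical model the disclosed system uses.

```python
from collections import Counter
from math import sqrt

def error_profile(reference, reads):
    """Count substitutions as (reference_base, called_base) pairs."""
    counts = Counter()
    for start, seq in reads:
        for offset, called in enumerate(seq):
            ref_base = reference[start + offset]
            if called != ref_base:
                counts[(ref_base, called)] += 1
    return counts

def pearson(profile_a, profile_b):
    """Pearson correlation between two error profiles over the union of keys."""
    keys = sorted(set(profile_a) | set(profile_b))
    xs = [profile_a.get(k, 0) for k in keys]
    ys = [profile_b.get(k, 0) for k in keys]
    n = len(keys)
    if n == 0:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile
    return cov / sqrt(var_x * var_y)

# Toy reference and reads; the values are placeholders.
reference = "ACGTACGT"
profile = error_profile(reference, [(0, "ACGA"), (4, "TCGT")])

# Two hypothetical runs whose error counts differ only by scale.
run_a = Counter({("G", "T"): 5, ("A", "C"): 1})
run_b = Counter({("G", "T"): 10, ("A", "C"): 2})
similarity = pearson(run_a, run_b)
```

Runs whose profiles correlate strongly with a known scar would then be candidates for sharing its failure source.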