JOINT MULTI-NANOPORE SEQUENCING FOR RELIABLE DATA RETRIEVAL IN NUCLEIC ACID STORAGE

Abstract

A nucleic acid storage system (100) that uses nanopore sequencing to read data values chemically embedded in oligonucleotides includes a membrane (102), a voltage source (108), and a nucleic acid strand (110). The membrane (102) has a plurality of nanopores (104) that are stacked upon one another in a multi-nanopore arrangement. The voltage source (108) is configured to direct voltage across the plurality of nanopores (104). The nucleic acid strand (110) including the oligonucleotides is threaded through each of the plurality of nanopores (104) within the membrane (102). A separate base signal (118) is generated from the nucleic acid strand (110) being threaded through each of the plurality of nanopores (104), and Recursive Neural Networks can be used to estimate a signal shape for each oligonucleotide. Recurrent Convolutional Neural Networks and noise predictive data detection algorithms can be used based on the estimated signal shapes to sequence the oligonucleotides.

Claims

1. A nucleic acid digital data storage system that uses nanopore sequencing to read data values chemically embedded in oligonucleotides, the nucleic acid storage system comprising: a membrane having a plurality of nanopores that are stacked upon one another in a multi-nanopore arrangement; a voltage source that is configured to direct voltage across the plurality of nanopores; and a nucleic acid strand including the oligonucleotides that is threaded through each of the plurality of nanopores within the membrane.

2. The nucleic acid digital data storage system of claim 1 wherein the nanopores are surrounded by an electrolyte solution within the membrane.

3. The nucleic acid digital data storage system of claim 1 wherein the nucleic acid strand is a DNA strand; and wherein the oligonucleotides include one or more of adenine, guanine, cytosine, and thymine.

4. The nucleic acid digital data storage system of claim 1 wherein the nucleic acid strand is an RNA strand.

5. The nucleic acid digital data storage system of claim 1 wherein the voltage from the voltage source is applied across each of the plurality of nanopores independently of one another to create an electrical field across pore ends of each of the plurality of nanopores; and wherein the electrical field creates an ionic current to pass through each of the plurality of nanopores.

6. The nucleic acid digital data storage system of claim 1 wherein the membrane is usable to capture multiple waveforms for a base sequence when the oligonucleotides are threaded through the plurality of nanopores; and wherein the oligonucleotides being threaded through each of the plurality of nanopores generates a corresponding ionic current.

7. The nucleic acid digital data storage system of claim 6 wherein a separate base signal is generated from the nucleic acid strand being threaded through each of the plurality of nanopores.

8. The nucleic acid digital data storage system of claim 7 wherein Recursive Neural Networks are used to estimate a signal shape for each oligonucleotide.

9. The nucleic acid digital data storage system of claim 8 wherein Recurrent Convolutional Neural Networks and noise predictive maximum likelihood data detection algorithms are used based on the estimated signal shapes to sequence the oligonucleotides.

10. The nucleic acid digital data storage system of claim 7 wherein each of the base signals is modified by each of a post-processing system, a joint symbol detection system, and an Error Correction Coding (ECC) decoding system.

11. The nucleic acid digital data storage system of claim 1 wherein the plurality of nanopores includes a first nanopore, a second nanopore and a third nanopore that are stacked one on top of another from top to bottom in the multi-nanopore arrangement; and wherein the membrane further includes a first cavity that is defined between the first nanopore and the second nanopore, and a second cavity that is defined between the second nanopore and the third nanopore.

12. The nucleic acid digital data storage system of claim 11 wherein each of the plurality of nanopores is different from each of the other nanopores in one or more of size and translocation speed.

13. The nucleic acid digital data storage system of claim 12 wherein the first cavity has a first size, and the second cavity has a second size that is different than the first size.

14. The nucleic acid digital data storage system of claim 1 wherein the membrane is one of a biological membrane, a solid-state membrane, and a hybrid of a biological membrane and a solid-state membrane.

15. A method for using nanopore sequencing to read data values chemically embedded in oligonucleotides, the method comprising the steps of: stacking a plurality of nanopores upon one another in a multi-nanopore arrangement within a membrane; directing voltage across the plurality of nanopores with a voltage source; and threading a nucleic acid strand including the oligonucleotides through each of the plurality of nanopores within the membrane.

16. The method of claim 15 further comprising the step of providing an electrolyte solution within the membrane so that the nanopores are surrounded by the electrolyte solution.

17. The method of claim 15 wherein the step of directing includes applying the voltage from the voltage source across each of the plurality of nanopores independently of one another to create an electrical field across pore ends of each of the plurality of nanopores; and creating an ionic current with the electrical field to pass through each of the plurality of nanopores.

18. The method of claim 15 further comprising the steps of capturing multiple waveforms for a base sequence with the membrane when the oligonucleotides are threaded through the plurality of nanopores; and generating a corresponding ionic current from the oligonucleotides being threaded through each of the plurality of nanopores.

19. The method of claim 18 further comprising the steps of generating a separate base signal from the nucleic acid strand being threaded through each of the plurality of nanopores; estimating a signal shape for each oligonucleotide using Recursive Neural Networks; and sequencing the oligonucleotides using Recurrent Convolutional Neural Networks and noise predictive maximum likelihood data detection algorithms based on the estimated signal shapes.

20. The method of claim 19 further comprising the step of modifying each of the base signals by each of a post-processing system, a joint symbol detection system, and an Error Correction Coding (ECC) decoding system.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0038] The novel features of this invention, as well as the invention itself, both as to its structure and its operation, will be best understood from the accompanying drawings, taken in conjunction with the accompanying description, in which similar reference characters refer to similar parts, and in which:

[0039] FIG. 1 is a simplified schematic illustration of an embodiment of a nucleic acid digital data storage system having features of the present invention;

[0040] FIG. 2 is a simplified schematic illustration of a portion of the nucleic acid digital data storage system illustrated in FIG. 1, including an embodiment of a membrane, a voltage source and a DNA strand;

[0041] FIG. 3 is a simplified schematic illustration of an embodiment of a post-processing system that can be incorporated into the nucleic acid digital data storage system illustrated in FIG. 1;

[0042] FIG. 4 is a simplified schematic illustration of an embodiment of a joint symbol detection system that can be incorporated into the nucleic acid digital data storage system illustrated in FIG. 1;

[0043] FIG. 5 is a simplified schematic illustration of an embodiment of an Error Correction Coding (ECC) decoding system that can be incorporated into the nucleic acid digital data storage system illustrated in FIG. 1;

[0044] FIG. 6 is a representative graphical illustration of a base signal estimation for nanopore sequencers that may be seen using the nucleic acid digital data storage system illustrated in FIG. 1; and

[0045] FIG. 7 is a simplified schematic cross-sectional view illustration of nanopores usable within the nucleic acid digital data storage system illustrated in FIG. 1 shown on a two-dimensional planar surface.

[0046] While embodiments of the present invention are susceptible to various modifications and alternative forms, specifics thereof have been shown by way of example and drawings, and are described in detail herein. It is understood, however, that the scope herein is not limited to the particular embodiments described. On the contrary, the intention is to cover modifications, equivalents, and alternatives falling within the spirit and scope herein.

DESCRIPTION

[0047] Embodiments of the present invention are described in the context of a nucleic acid digital data storage system (also sometimes referred to as a “data storage system” or simply a “storage system”) that utilizes joint multi-nanopore sequencing for reliable data retrieval. More particularly, in various embodiments, the data storage system is configured to use multiple-pore manufacturing in the same membrane to capture multiple waveforms for the same base sequence. In other words, the same oligonucleotides pass through multiple physically collocated pores (stacked on top of each other) with potentially different translocation speeds, and each generates a corresponding ionic current. As referred to herein, it is appreciated that a nanopore is a pore of nanometer size. Thus, the terms “nanopore” and “pore” are sometimes used interchangeably herein.

[0048] Those of ordinary skill in the art will realize that the following detailed description of the present invention is illustrative only and is not intended to be in any way limiting. Other embodiments of the present invention will readily suggest themselves to such skilled persons having the benefit of this disclosure. Reference will now be made in detail to implementations of the present invention as illustrated in the accompanying drawings. The same or similar reference indicators will be used throughout the drawings and the following detailed description to refer to the same or like parts.

[0049] In the interest of clarity, not all of the routine features of the implementations described herein are shown and described. It will, of course, be appreciated that in the development of any such actual implementations, numerous implementation-specific decisions must be made in order to achieve the developer's specific goals, such as compliance with application-related and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of this disclosure.

[0050] In various implementations of the present invention, the data storage system is configured to use multiple nanopores (with each individual nanopore being either biological (protein-based), solid-state, or a hybrid thereof) with different aperture sizes and potentially different chemical content (protein, graphene, silicon nitride, etc.), usable in nanopore sequencing for reliable data retrieval. An example structure of the multi-pore cross-section, as well as the subsequent system components, is shown in FIG. 1. More specifically, FIG. 1 is a simplified schematic illustration of an embodiment of a nucleic acid digital data storage system 100 (also referred to as a “data storage system” or simply as a “storage system”) including a membrane 102 (either a biological membrane, a solid-state membrane, or a hybrid thereof) having a plurality of nanopores 104 (or “pores”) that are stacked upon one another in a multi-nanopore arrangement, which are surrounded by an electrolyte solution 106; a voltage source 108; and a nucleic acid strand, such as a DNA strand 110 in this non-exclusive embodiment, that is threaded through the membrane 102, such as through the nanopores 104 positioned within the membrane 102; and further including a post-processing system 112, a joint symbol detection system 114 (also referred to herein as a “detection system”), and an Error Correction Coding (ECC) decoding system 116 (also referred to herein as a “decoding system”). With such a design, as described in greater detail herein, the same membrane 102 can be used to capture multiple waveforms for the same base sequence, with the same oligonucleotides passing through multiple physically collocated nanopores 104 with potentially different translocation speeds, and each generating a corresponding ionic current. Additionally, or in the alternative, the data storage system 100 can include more components or fewer components than what is illustrated in FIG. 1.

[0051] DNA-based data storage systems encode digital information (typically in a series of 0's and 1's) using combinations of the four nucleotides (adenine (A), guanine (G), cytosine (C) and thymine (T), more commonly known as “bases”) of which DNA is composed. There is considerable flexibility in that encoding. For example, each base may represent two bits, or individual (or short sequences of) bits may be represented by short, predetermined sequences of bases. It is recognized that the systems and methods described in detail herein are applicable in all of these cases.
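The two-bits-per-base option mentioned above can be sketched as follows. This is only an illustration: the particular assignment of bit pairs to bases in `BITS_TO_BASE`, and the function names, are assumptions made for the example and are not specified by the disclosure.

```python
# Hypothetical bit-pair-to-base assignment; the disclosure does not fix one.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bits(bits: str) -> str:
    """Map an even-length string of 0's and 1's to bases, two bits per base."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_bases(bases: str) -> str:
    """Invert encode_bits: map each base back to its two bits."""
    return "".join(BASE_TO_BITS[base] for base in bases)
```

For example, `encode_bits("00011011")` yields the four-base sequence `ACGT`, and `decode_bases` recovers the original bits.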

[0052] Although the invention is generally described in detail in relation to DNA digital data storage, it is appreciated that substantially the same systems and methods would be equally applicable utilizing RNA in lieu of DNA. Therefore, it is not intended that the scope of the present disclosure be limited in such manner.

[0053] It is appreciated that the membrane 102 can include any suitable number of nanopores 104 that are stacked one upon another. For example, in the embodiment illustrated in FIG. 1, the membrane 102 includes three nanopores 104, such as a first (upper) nanopore 104A, a second (middle) nanopore 104B, and a third (lower) nanopore 104C, which are stacked upon one another in a multi-nanopore arrangement. Alternatively, the membrane 102 can include greater than three nanopores 104 or only two nanopores 104 in accordance with the teachings of the present invention.

[0054] In different implementations, the nanopores 104 may, for example, be created by a pore-forming protein or as a hole in synthetic materials such as silicon or graphene. More particularly, as noted, the nanopores 104 can be biological, solid-state, or a hybrid thereof. In one such implementation, the nanopores 104 are created as holes in silicon nitride (SiN) structures and/or materials.

[0055] FIG. 1 also shows base signals 118 that are generated from the DNA strand 110 being threaded through the nanopores 104. The base signals 118 are then processed, detected, decoded and/or otherwise modified by the post-processing system 112, the detection system 114, and the decoding system 116. In summary, a multi-nanopore storage system 100 as described produces a sequence of read-out base signals 118, and the three modules (the post-processing system 112, the detection system 114, and the decoding system 116 in this particular embodiment) process these raw base signals 118 to determine the base sequence of the DNA molecule.

[0056] The post-processing undertaken within the post-processing system 112 can take different forms depending on the signal quality, signal synchronization, signal amplitude, and signal phase, among other properties of the captured signal. There may also be coupling between the nanopore currents due to their physical proximity, which is compensated for in the joint symbol detection system 114 after post-processing is done. Finally, the data is decoded within the decoding system 116 using the redundancy generated by the ECC.

[0057] Each of the major components of the embodiment of the storage system 100 of FIG. 1, including the membrane 102 and the various components included therein, the post-processing system 112, the detection system 114 and the decoding system 116, is shown in greater detail in FIGS. 2-5 herein below. Initially, details of an embodiment of the membrane 102 and the various components utilized therein are illustrated in FIG. 2. Subsequently, details of embodiments of the post-processing system 112, the joint symbol detection system 114, and the ECC decoding system 116 of the data storage system 100 are illustrated in FIGS. 3, 4 and 5, respectively.

[0058] FIG. 2 is a simplified schematic illustration of a portion of the nucleic acid digital data storage system 100 illustrated in FIG. 1, including an embodiment of the membrane 102, the voltage source 108 and the DNA strand 110.

[0059] As noted above, the membrane 102 can be provided in the form of either a biological membrane, a solid-state membrane, or a hybrid thereof. In one non-exclusive embodiment, the membrane 102 can include silicon nitride structures 220 that form the plurality of nanopores 104.

[0060] In various embodiments, the membrane 102 includes the plurality of nanopores 104 (or “pores”) that are stacked upon one another in a multi-nanopore arrangement. The nanopores 104 are further surrounded by the electrolyte solution 106. For simplicity of illustration, in the embodiment specifically illustrated in FIG. 2, the membrane 102 includes three nanopores 104 that are stacked one upon another in the multi-nanopore arrangement. However, it is appreciated that the membrane 102 can include any suitable number of nanopores 104, which may be greater than three nanopores 104 or only two nanopores 104. As further shown in FIG. 2, the size and shape of each of the plurality of nanopores 104 can be varied. More specifically, in this non-exclusive embodiment, the first (upper) nanopore 104A, the second (middle) nanopore 104B, and the third (lower) nanopore 104C are shown as each having a slightly different size and shape.

[0061] The areas within the membrane 102 between the nanopores 104 can also be referred to as cavities. For example, as shown in FIG. 2, a first (top) cavity 222 is defined between the first nanopore 104A and the second nanopore 104B, and between the uppermost and middle silicon nitride structures 220; and a second (bottom) cavity 224 is defined between the second nanopore 104B and the third nanopore 104C, and between the middle and lowermost silicon nitride structures 220. As shown, the cavities 222, 224 may be different sizes from one another. With such a design, the present invention provides the ability to control the translocation time of DNA molecules through the use of multiple nanopores 104 which may be interleaved with different sized cavities 222, 224.

[0062] It is appreciated that the nanopores 104 are again illustrated in FIG. 2 as being surrounded by the electrolyte solution 106.

[0063] When one or more nanopores 104 are present in an electrically insulating membrane 102, a detection principle is based on monitoring the ionic current passing through the nanopores 104 as a voltage is applied across the membrane 102. When the nanopores 104 are of molecular dimensions, passage of molecules (such as DNA) causes interruptions of the “open” current level, leading to a “translocation event” signal.

[0064] As illustrated, in a nanopore sequencing technique, which is used to read data values chemically embedded in oligonucleotides, the DNA strand 110 passes through the plurality of nanopores 104, and voltage from the voltage source 108 is applied across the nanopores 104, which creates an electrical field 226 across pore ends 204E (one such electrical field 226 is identified in FIG. 2). The electrical field 226 produced by this voltage causes an ionic current to pass through the nanopores 104 (movement of charges due to the electrical field 226). The effect of applying a bias voltage across the membrane 102, thereby inducing the electrical field 226 that drives charged particles, in this case the ions, into motion, is known as electrophoresis. For high enough concentrations, the electrolyte solution 106 is well distributed and all the voltage drop concentrates near and inside the nanopores 104. This means charged particles in the electrolyte solution 106 only feel a force from the electrical field 226 when they are near the pore region. This region is often referred to as the capture region.

[0065] Inside the capture region, ions have a directed motion that can be recorded as a steady ionic current by placing electrodes near the membrane 102. More particularly, as noted above, depending on the type of the molecule passing through the nanopores 104, different current blockade levels and translocation speeds can be measured and recorded through placing electrodes near the membrane 102. This molecule also has a net charge that feels a force from the electrical field 226 when it is found in the capture region. The molecule approaches this capture region aided by Brownian motion and any attraction it might have to the surface of the membrane 102. Once inside the nanopore 104, the molecule translocates through via a combination of electro-phoretic, electro-osmotic and sometimes thermo-phoretic forces. Inside the nanopore 104, the molecule occupies a volume that partially restricts the flow of ions, observed as an ionic current drop. Different molecules can then be sensed and potentially identified based on this modulation in ionic current. For example, based on various factors such as nanopore 104 geometry, size and chemical composition, the change in the magnitude of the ionic current blockade and the duration of the translocation (so called dwell time) may vary over time.
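The two quantities named above, the magnitude of the ionic current blockade and the dwell time, can be extracted from a recorded current trace with a simple threshold rule. The following is a minimal sketch under stated assumptions: the function name, the 80% threshold, and the sample values are hypothetical and are not taken from the disclosure.

```python
from typing import List, Sequence, Tuple


def find_blockade_events(
    current: Sequence[float],
    open_level: float,
    threshold_fraction: float = 0.8,  # assumed value, for illustration only
    sample_period: float = 1.0,
) -> List[Tuple[int, float, float]]:
    """Return (start_index, dwell_time, mean_blockade_depth) per event.

    A sample is treated as part of a translocation event when it falls
    below threshold_fraction * open_level.
    """
    threshold = threshold_fraction * open_level
    events: List[Tuple[int, float, float]] = []
    start = None
    # A trailing sentinel at the open level closes any event still in progress.
    for i, value in enumerate(list(current) + [open_level]):
        if value < threshold and start is None:
            start = i
        elif value >= threshold and start is not None:
            segment = current[start:i]
            depth = open_level - sum(segment) / len(segment)
            events.append((start, len(segment) * sample_period, depth))
            start = None
    return events
```

With an open level of 100 and the trace `[100, 100, 60, 60, 60, 100, 100, 40, 40, 100]`, this reports one event of dwell time 3 with depth 40 and one of dwell time 2 with depth 60.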

[0066] The voltage source 108 can be any suitable type of voltage source that is configured to provide the desired voltage across the nanopores 104 which ends up creating the electrical field 226 across the pore ends 204E, and which creates the ionic current to pass through the nanopores 104.

[0067] As illustrated in FIG. 2, in various embodiments, the DNA strand 110 can be a double-helix DNA strand that is fed into the nanopores 104. An enzymatic reaction separates the two strands, and one of them passes through the three different nanopores 104A-104C, which can have different sizes and chemical content, and through the cavities 222, 224, which can have distinct volumes. The translocation speed also varies due to natural manufacturing differences between the nanopores 104, the sizes of the cavities 222, 224, and the type of motor mechanism (such as a protein), or some other mechanism, used to move the DNA strand 110. The first nanopore 104A has the fastest translocation speed, and the average translocation speed of the nanopores 104 decreases as one moves down the membrane 102. A voltage from the voltage source 108 is applied across each nanopore 104 independently. This voltage induces ionic current blockades through the nanopores 104, which are measured and recorded.

[0068] In real-time streaming, these base signals 118 (illustrated in FIG. 1) are post-processed within the post-processing system 112 (illustrated in FIG. 1) after the ionic current is measured and recorded.

[0069] FIG. 3 is a simplified schematic illustration of an embodiment of the post-processing system 312 that can be incorporated into the nucleic acid digital data storage system 100 illustrated in FIG. 1. The post-processing undertaken within the post-processing system 312 can take different shapes depending on the signal quality, signal synchronization, signal amplitude, signal phase among other properties of the base signals 118 (illustrated in FIG. 1) that have been captured in the manner as described above.

[0070] As illustrated, in certain embodiments, the raw base signals 118 first go through a bank of adaptive filters 328 (such as Adaptive Finite-Impulse Response filters (AFIRs) or other suitable types of filters) in parallel, whose coefficients are subject to optimization/learning, to generate a plurality of filtered signals 330. Next, due to the physical separation between the nanopores 104 (illustrated in FIG. 1) and the varying translocation speeds, a shifting operation is applied within one or more shifters 332 to each one of the filtered signals 330, depending on its location in the stacked architecture, to generate a plurality of shifted signals 334. The shifter 332 shifts each signal either to the right or to the left to generate the shifted signals 334. The closer the filtered signal 330 is to the center, the smaller the shift.
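One common way to adapt the coefficients of a finite-impulse response filter is the least-mean-squares (LMS) update. The sketch below uses LMS purely as an illustration of coefficients that are "subject to optimization/learning"; the disclosure does not state which adaptation rule is used, and the function name, tap count, and step size are assumptions.

```python
from typing import List, Sequence, Tuple


def lms_filter(
    received: Sequence[float],
    desired: Sequence[float],
    num_taps: int = 4,       # assumed filter length
    step_size: float = 0.1,  # assumed LMS step size
) -> Tuple[List[float], List[float]]:
    """Adapt FIR taps with the LMS rule; return (final_taps, filter_outputs)."""
    taps = [0.0] * num_taps
    outputs: List[float] = []
    for n in range(len(received)):
        # Most recent num_taps inputs, newest first; zeros before the start.
        window = [received[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(w * x for w, x in zip(taps, window))
        error = desired[n] - y
        # LMS update: move each tap along the instantaneous error gradient.
        taps = [w + step_size * error * x for w, x in zip(taps, window)]
        outputs.append(y)
    return taps, outputs
```

In a bank arrangement, one such filter would run per nanopore read-out, each with its own taps. The training signal `desired` is a stand-in for whatever reference the optimization uses in practice.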

[0071] Following this stage, because of the shifting operation, data is padded as necessary onto the shifted signals 334 with a data padding system 336. In some embodiments, data padding places zeros for frame completion. Subsequently, the waveform is sampled within an aperiodic sampling system 338 at a period that can change over time (adjusted based on the translocation and physical distances or geometries). In other words, the sampling system 338 creates samples from the signals using non-uniform sampling periods. Finally, a whitening filter 340 is used to change the statistical properties of the colored noise. This whitening filter 340 is typically also designed as a finite-impulse response filter, but can alternatively be another suitable type of filter such as an infinite impulse response (IIR) filter. The whitening filter 340 operates on the discrete samples and helps the subsequent detection process to be minimally affected by the colored nature of the noise. Such a sequence of post-processing tools prepares the signal samples for the subsequent detection process.
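The shifting, zero padding, aperiodic sampling, and whitening steps described above can be sketched on discrete samples as follows. All function names are hypothetical. The linear interpolation used for non-uniform sampling and the first-order whitening filter (which assumes noise with a single correlation coefficient `rho` between neighboring samples) are simplifying assumptions for illustration, not the design of the disclosure.

```python
from typing import List, Sequence


def shift_and_pad(signal: Sequence[float], shift: int, frame_length: int) -> List[float]:
    """Shift right (shift > 0) or left (shift < 0), filling the frame with zeros."""
    frame = [0.0] * frame_length
    for i, value in enumerate(signal):
        j = i + shift
        if 0 <= j < frame_length:
            frame[j] = value
    return frame


def sample_at(signal: Sequence[float], times: Sequence[float]) -> List[float]:
    """Sample at non-uniform times (in index units) by linear interpolation."""
    last = len(signal) - 1
    samples: List[float] = []
    for t in times:
        if t <= 0:
            samples.append(signal[0])
        elif t >= last:
            samples.append(signal[last])
        else:
            k = int(t)
            frac = t - k
            samples.append((1 - frac) * signal[k] + frac * signal[k + 1])
    return samples


def whiten(samples: Sequence[float], rho: float) -> List[float]:
    """First-order FIR whitening: w[n] = x[n] - rho * x[n-1], with x[-1] = 0."""
    return [x - rho * (samples[n - 1] if n > 0 else 0.0) for n, x in enumerate(samples)]
```

For example, `shift_and_pad([1.0, 2.0, 3.0], 2, 6)` places the three samples at positions 2 through 4 of a six-sample frame and fills the remaining positions with zeros.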

[0072] FIG. 4 is a simplified schematic illustration of an embodiment of a joint symbol detection system 414 that can be incorporated into the nucleic acid digital data storage system 100 illustrated in FIG. 1. The detection process uses branch metric calculations for each signal. Therefore, there is a branch metric calculator 442 before the data passes through a trellis 444 that is configured for use in data-dependent list detection. To embed data dependency, at the expense of complexity, different branch metrics can be calculated for multiple potential data sequences. The trellis 444 is constructed and the branch metrics are used to calculate a proximity metric. The trellises 444 can alternatively be constructed jointly, so that jumping from one trellis 444 to another might be possible, as shown in FIG. 4. Based on the accumulated branch metrics on the trellis 444, a most likely path is found through standard backtracking. If more memory is used to keep track of multiple most likely paths in each step of the trellis 444, then a group of the S most likely sequences can be generated for each nanopore 104 (illustrated in FIG. 1) by following the valid paths on the joint trellis 444. This list approach can help improve the detection accuracy. Data dependency can be inserted into the branch metric calculator 442 module for each possible data sequence, and a different branch metric can be calculated and used for different branches at different times.
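The branch metric, trellis, and backtracking steps can be illustrated with a plain single-path Viterbi detector for one nanopore. This sketch is deliberately simplified: it keeps only one survivor per state (no list of S sequences, no joint trellis, no data-dependent noise prediction). It also assumes a hypothetical channel in which each observation equals the current base level plus `isi` times the previous base level. The level values and the function name are illustrative, not from the disclosure.

```python
from typing import Dict, List, Sequence


def viterbi_detect(
    observations: Sequence[float],
    levels: Dict[str, float],  # hypothetical current level per base
    isi: float = 0.5,          # assumed interference from the previous base
) -> str:
    """Most likely base sequence under a squared-error branch metric."""
    if not observations:
        return ""
    symbols = list(levels)
    # The first sample is assumed to have no preceding base.
    path_metric = {s: (observations[0] - levels[s]) ** 2 for s in symbols}
    backpointers: List[Dict[str, str]] = []
    for y in observations[1:]:
        new_metric: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in symbols:
            best_prev, best = symbols[0], float("inf")
            for prev in symbols:
                expected = levels[cur] + isi * levels[prev]
                metric = path_metric[prev] + (y - expected) ** 2  # branch metric
                if metric < best:
                    best_prev, best = prev, metric
            new_metric[cur] = best
            pointers[cur] = best_prev
        path_metric = new_metric
        backpointers.append(pointers)
    # Backtrack from the state with the smallest accumulated metric.
    state = min(path_metric, key=path_metric.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

A list variant would retain several survivors per state instead of one, at the cost of the additional memory noted above.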

[0073] FIG. 5 is a simplified schematic illustration of an embodiment of an ECC decoding system 516 that can be incorporated into the nucleic acid digital data storage system 100 illustrated in FIG. 1. Despite the fact that the storage system 100 is configured to optimize the alignment between different nanopore read-outs, the nanopores 104 (illustrated in FIG. 1) themselves may miss or insert new nucleotides due to varying translocation speed or imperfections inside the cavities 222, 224 (illustrated in FIG. 2) or nanopores 104 (illustrated in FIG. 1). Thus, symbols may be inserted or deleted (indel for short), or substituted. Accordingly, individual indel decoding is applied to each detection output. Due to the correlation between distinct detection outputs, these indel decoders 546 work collaboratively and pass information among themselves to increase the accuracy of the symbol/data correction. The remaining substitution errors are resolved by a concatenated error and/or erasure decoding algorithm. This final decoder 548 merges the outputs of the indel decoders 546 and minimizes the number of errors before running the secondary error correction decoder algorithm. The main purpose of the final decoder 548 is to bring the error rate down to 10^-19 or below in the worst case. The code rates for each coding stage in a concatenated setting are determined based on the nominal uncoded error rate of the storage system 100. This would be a function of the nanopores used, the detection algorithm parameters, the preprocessing tools employed and environmental conditions, among other effects.
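To make the three error types concrete, the following sketch computes the edit (Levenshtein) distance between a reference base sequence and a read, counting insertions, deletions, and substitutions each as one error. This is not the indel decoder of the disclosure. It is only a standard way to measure how far a detected sequence is from the stored one, and the function name is hypothetical.

```python
def edit_distance(reference: str, read: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `reference` into `read`."""
    # previous[j] holds the distance between the processed prefix of
    # `reference` and the first j bases of `read`.
    previous = list(range(len(read) + 1))
    for i, ref_base in enumerate(reference, start=1):
        current = [i]
        for j, read_base in enumerate(read, start=1):
            substitution = previous[j - 1] + (ref_base != read_base)
            deletion = previous[j] + 1       # base of reference missing in read
            insertion = current[j - 1] + 1   # extra base present in read
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

For example, the reads `AGT` (one deletion), `ACGGT` (one insertion), and `ACCT` (one substitution) are each at distance 1 from the reference `ACGT`.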

[0074] It is appreciated that the joint symbol detection system 414 and the ECC decoding system 516 that can be incorporated as part of the nucleic acid digital data storage system 100 can include features, components and details somewhat similar to what was illustrated in the bit error detection and correction system of U.S. patent application Ser. No. 13/719,777 filed on Dec. 19, 2012 that utilizes a combination of a List-Viterbi (or “List-NPMLD”) detection algorithm, and error detection code decoders for reducing the number of error events at the output of the Viterbi (or “NPMLD”). As far as permitted, the contents of U.S. patent application Ser. No. 13/719,777 are incorporated in their entirety herein by reference.

[0075] In summary, after the base signals 118 are collected in the manner illustrated and described, post-processing is applied to the collected current waveforms. Following the post-processing, a joint detector architecture generates the final base-calling output before implementation of the Error Correction Coding (ECC) decoding stage. To operate correctly, it is necessary to have a suitable signal model and a post-processing and detector combination that is implemented carefully based on the operating conditions and the resulting data. Various post-processing and detection methods are provided as a list of claims in the following. Each of these claims can be implemented either alone or jointly to address the problems previously mentioned herein.

[0076] In a first claim, in order to enhance understanding of the channel, reduce complexity, and decouple different stages of the data detection process, it is proposed to use Artificial Neural Networks/Recursive Neural Networks (ANN/RNN) to estimate isolated impulse responses of the nanopore to four different bases, namely A, G, C and T. In this characterization, each ionic current level is a result of multiple signals shifted right/left and superimposed on each other. An example scenario is illustrated and described in greater detail herein above. With this treatment, simple threshold-detector approaches can be designed based on the signal shapes as well as severity of the inter-symbol-interference. Alternative detection methods can also be proposed, of which some are detailed in other claims.
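A simple threshold detector of the kind mentioned above can be sketched as follows, assuming the inter-symbol interference is mild enough that each sample can be decided on its own. The per-base current levels are hypothetical placeholders; in practice they would be derived from the estimated signal shapes.

```python
from typing import Dict, Sequence


def threshold_detect(samples: Sequence[float], levels: Dict[str, float]) -> str:
    """Decide each sample by comparing it against thresholds placed midway
    between adjacent (sorted) per-base levels."""
    ordered = sorted(levels.items(), key=lambda item: item[1])
    thresholds = [
        (ordered[i][1] + ordered[i + 1][1]) / 2 for i in range(len(ordered) - 1)
    ]
    decisions = []
    for x in samples:
        # The number of thresholds exceeded selects the decision region.
        region = sum(x > t for t in thresholds)
        decisions.append(ordered[region][0])
    return "".join(decisions)
```

With hypothetical levels of 1, 2, 3 and 4 for A, C, G and T, the samples `[0.9, 2.4, 2.6, 4.2]` are decided as `ACGT`. When interference between neighboring bases is severe, a sequence detector is the more appropriate choice.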

[0077] In a second claim, in an embodiment of the present invention, it is assumed that the response of a given nanopore to a nucleotide is a combination of two channel responses $h_1(t)$ and $h_2(t)$. To model the varying translocation, time shifts of these two signals are assumed to form the current blockade signal,

$$I(t) = \sum_i \left[ a_i\, h_1(t - iT) + b_i\, h_2(t - iS) \right] + \eta(t) \qquad \text{(Equation 1)}$$

where $a_i \in \{+1, -1\}$ and $b_i \in \{+1, -1\}$. Also, $T$ and $S$ are the periods for these responses and $\eta(t)$ is the noise component of the observed current signal $I(t)$. There are four combinations of the pair $(a_i, b_i)$, which are used to encode nucleotides A, G, C and T. In this formulation, $h_1(t)$, $h_2(t)$, $T$ and $S$ are estimated based on the given recorded signals so that, given the DNA sequence, $I(t)$ most closely mimics the training data. There may be multiple AI-based approaches to the estimation process. In one embodiment, neural networks can be used, whereas in another, linear or non-linear regression techniques can alternatively be used.
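A discrete-time, noise-free version of Equation 1 can be synthesized as shown below, which may help when generating training data or checking an estimate of the two channel responses. The mapping of sign pairs to bases in `BASE_TO_SIGNS`, the function name, and the use of integer sample periods for T and S are assumptions made for this sketch. The disclosure states only that the four sign combinations encode A, G, C and T.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical assignment of the four (a_i, b_i) combinations to bases.
BASE_TO_SIGNS: Dict[str, Tuple[int, int]] = {
    "A": (+1, +1),
    "C": (+1, -1),
    "G": (-1, +1),
    "T": (-1, -1),
}


def blockade_signal(
    bases: str,
    h1: Sequence[float],
    h2: Sequence[float],
    period_t: int,
    period_s: int,
    length: int,
) -> List[float]:
    """Noise-free samples of I[n] = sum_i a_i*h1[n - i*T] + b_i*h2[n - i*S]."""
    signal = [0.0] * length
    for i, base in enumerate(bases):
        a, b = BASE_TO_SIGNS[base]
        for k, value in enumerate(h1):  # superimpose the shifted first response
            n = i * period_t + k
            if n < length:
                signal[n] += a * value
        for k, value in enumerate(h2):  # superimpose the shifted second response
            n = i * period_s + k
            if n < length:
                signal[n] += b * value
    return signal
```

With `h1 = [1.0, 0.5]`, `h2 = [0.25]`, both periods equal to 2 samples, and the sequence `AT`, the first four samples are `[1.25, 0.5, -1.25, -0.5]`. A noise term would be added to each sample to model the observed current.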

[0078] FIG. 6 is a representative graphical illustration of a base signal estimation for nanopore sequencers that may be seen using the nucleic acid digital data storage system illustrated in FIG. 1. As shown in FIG. 6, each of the nucleotides, or bases, A, G, C and T, has a unique estimated base signal shape that is found through use of the process of nanopore sequencing. More particularly, as shown, the adenine (A) nucleotide has a first estimated base signal shape 618A, the thymine (T) nucleotide has a second estimated base signal shape 618T that is different than the first estimated base signal shape 618A, the guanine (G) nucleotide has a third estimated base signal shape 618G that is different than the first estimated base signal shape 618A and the second estimated base signal shape 618T, and the cytosine (C) nucleotide has a fourth estimated base signal shape 618C that is different than the first estimated base signal shape 618A, the second estimated base signal shape 618T and the third estimated base signal shape 618G.

[0079] With the base signals 118 (one example of which is shown in FIG. 6) generated through threading the DNA strand 110 (illustrated in FIG. 1) through the nanopores 104 (illustrated in FIG. 1) within the membrane 102 (illustrated in FIG. 1), a base sequence is generated that relates to the current level which includes a concatenation of four individual signal shapes. Examples are illustrated in FIG. 6 for sequence “AAAC” and sequence “TTAC”.

[0080] In certain embodiments, Recursive Neural Networks (RNNs) are used to estimate the signal shapes for each base nucleotide rather than using a base detection process directly. Based on the estimated signal shapes, the data storage system is configured to use Recurrent Convolutional Neural Networks (R-CNNs) and conventional detection algorithms based on estimated signal shapes such as noise predictive maximum likelihood detection (NPMLD) to sequence the nucleotides in a spatially coordinated way. In this manner, improved detection accuracy performance is ensured, while giving a brand-new methodology to the detection process within the context of explainable AI and low-complexity information decoding.

[0081] Assuming a linear system under sufficiently responsive and adaptive conditions, the individual estimation of signal shapes based on RNNs or R-CNNs would lead to accurate weighted superposition and the estimate of the observed induced current/voltage signal. Hence, knowing the individual impulse responses, and their adaptive estimation, a sequence detector can be employed to estimate the base sequences.

[0082] In a third claim, in an alternative post-processing method, it is appreciated that as the nucleotides pass through the nanopores, multiple dependent signals will be measured. A conventional RNN would not work in this case because it expects a one-dimensional time series. Therefore, multiple independent RNNs can be employed, each run without using the inherent dependency or the coupling between the measured signals. The RNN outputs are finally combined through simple majority voting to reach the final decision on the sequence of nucleotides.

[0083] In a fourth claim, in an alternative methodology, assuming three nanopores as shown in FIG. 1, the raw base signals can be post-processed in the following way. First, the top signal $I_T(t)$ is shifted by $\Delta_1$ to the right, and the bottom signal $I_B(t)$ is shifted by $\Delta_2$ to the left, while the middle signal $I_M(t)$ is left unshifted. These signals are then padded to the same length, or padded as needed in streaming mode. Next, the signals are sampled with appropriate periods to obtain the signal samples. Finally, a recurrent CNN (R-CNN) [1], denoted $f_{\text{R-CNN}}(\cdot,\cdot,\cdot)$, is implemented to use these signal samples all at the same time and to exploit the dependencies/correlations and/or eliminate the coupling inherent to their generation. In other words, the R-CNN output consists of samples of the function

$$f_{\text{R-CNN}}\left(I_T(t - \Delta_1),\; I_M(t),\; I_B(t + \Delta_2)\right) \quad \text{(Equation 2)}$$

[0084] This technique still uses an end-to-end neural network and could be quite complex to implement, particularly in the context of a 100 million stacked nanopore architecture.
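The shift, pad, and sample steps that precede the R-CNN of Equation 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the shift amounts are whole numbers of samples, padding uses zeros, and one common sampling step is applied to all three signals. The function names are hypothetical, and the network itself is not modelled.

```python
def shift_right(samples, n, fill=0.0):
    """Delay a sampled signal by n samples, i.e. I(t - delta)."""
    return [fill] * n + list(samples)


def shift_left(samples, n):
    """Advance a sampled signal by n samples, i.e. I(t + delta)."""
    return list(samples[n:])


def pad_to(samples, length, fill=0.0):
    """Append fill values until the signal has the requested length."""
    return list(samples) + [fill] * (length - len(samples))


def prepare_inputs(top, middle, bottom, delta1, delta2, step=1):
    """Return one (top, middle, bottom) triple per retained time step."""
    shifted = [
        shift_right(top, delta1),     # I_T(t - delta1)
        list(middle),                 # I_M(t)
        shift_left(bottom, delta2),   # I_B(t + delta2)
    ]
    length = max(len(s) for s in shifted)
    padded = [pad_to(s, length) for s in shifted]
    decimated = [s[::step] for s in padded]  # common sampling step (assumed)
    return list(zip(*decimated))


# Toy signals: the same event appears first in the top pore (index 2),
# then in the middle pore (index 4), then in the bottom pore (index 7).
top = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
middle = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
bottom = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
rows = prepare_inputs(top, middle, bottom, delta1=2, delta2=3)
```

After alignment, each row holds the three pore observations of the same event, which is the form a joint network would consume.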

[0085] In a fifth claim, in another embodiment, neural networks are used to estimate signal shapes for each nanopore rather than performing joint base calling. The estimated signal shapes might be different for each physical nanopore. However, with coupling between such nanopores, techniques like R-CNN could be used to estimate the signal shapes jointly. For instance, in an embodiment of a three-nanopore structure, there can be 12 different signal shape estimates, one for each combination of nanopore and base. Next, using such signal estimates, a maximum likelihood detector (MLD) can be employed based on a trellis structure (for each nanopore individually) whose branch metric computations are based on the jointly generated signal estimates. The base-calling output would be the least costly path in the trellis given the nanopore signal output. Finally, a majority vote merges these sequences to decide on a single base sequence. In this case, multiple MLDs, one per nanopore, would be needed. To give an example, consider the following sequence as shown in Table 1:

TABLE 1: Initial Sequencing Detected

t        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Pore 1   A  C  T  G  A  C  G  G  C  T  G  A  C  C  A
Pore 2   o  A  C  T  G  A  C  G  C  C  T  G  A  C  C
Pore 3   o  o  A  C  T  G  A  C  G  C  C  T  G  A  C

[0086] Now, assume that even when joint cost estimation or similar techniques are used, a base is deleted during the detection process due to faster-than-usual translocation. The picture shown in Table 2 can then be obtained after a deletion in one of the pores.

[0087] Deletion in Pore 3

TABLE 2: Sequencing Detected After A Deletion in Pore 3

t        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Pore 1   A  C  T  G  A  C  G  C  C  T  G  A  C  C  A
Pore 2   o  A  C  T  G  A  C  G  C  C  T  G  A  C  C
Pore 3   o  o  A  C  T  G  A  C  G  C  T  G  A  C  C

[0088] As shown in Table 2, a deletion in pore 3 happens right after t=8, where a nucleotide C is deleted by the pore due to translocation or detection problems. By considering the output of all three pores, this deletion error can easily be detected and corrected through some majority logic voting system.
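The majority-logic idea can be illustrated with the reads of Table 2. The sketch below is one possible voting scheme, not necessarily the one intended by the disclosure: it assumes the pore-to-pore delay is known and represented by the leading `o` symbols, votes position by position, and then checks whether re-inserting a single consensus base repairs a read. Function names are hypothetical.

```python
from collections import Counter

GAP = "o"  # placeholder used in Tables 1 and 2 before a base reaches a pore


def align(read):
    """Drop the gap symbols that model the known pore-to-pore delay."""
    return [b for b in read if b != GAP]


def majority_vote(reads):
    """Per-position majority over the reads that cover that position.

    Ties go to the base seen first in pore order.
    """
    length = max(len(r) for r in reads)
    consensus = []
    for i in range(length):
        column = [r[i] for r in reads if i < len(r)]
        consensus.append(Counter(column).most_common(1)[0][0])
    return consensus


def find_single_deletion(read, consensus):
    """Index at which one consensus base is missing from read, else None."""
    for i in range(len(read)):
        if read[i] != consensus[i]:
            repaired = read[:i] + [consensus[i]] + read[i:]
            if repaired == consensus[:len(repaired)]:
                return i
            return None
    return None


# Rows of Table 2.
pore1 = "A C T G A C G C C T G A C C A".split()
pore2 = "o A C T G A C G C C T G A C C".split()
pore3 = "o o A C T G A C G C T G A C C".split()

reads = [align(p) for p in (pore1, pore2, pore3)]
consensus = majority_vote(reads)
deletions = [find_single_deletion(r, consensus) for r in reads]
```

With these inputs the consensus equals the Pore 1 read, and only the Pore 3 read is flagged, at aligned index 8, where re-inserting a C restores agreement with the other two pores.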

[0089] In a sixth claim, as an alternative to the fifth claim, the MLD detectors (for each nanopore) can exchange information during the sequence estimation process to decide on the single base sequence while sequencing their own bases. In other words, while calculating the distance metrics, corresponding distance metrics from other trellises can be used to determine the most likely sequence. Thus, in this formulation, bases are jointly determined and MLDs work collaboratively. That is to say, MLDs converge to the same sequence decision while moving over their corresponding signal sequences. The joint collaboration results in the same consensus over the most likely base sequence by identifying errors, deletions as well as insertions to the base sequence. A short-time memory would need to be used for back-tracing in the MLD implementation. However, due to time dependence between sequences, memory used for each MLD can help other memories in the back-tracing process.

[0090] In a seventh claim, in another embodiment of the system for the example number of nanopores of FIG. 1 and the apparatus described therein, the contributions of the distance metrics of the corresponding MLDs can be weighted unequally. The main reason lies in the preprocessing of the fourth claim noted above, where the top and bottom signals are shifted to the right and left by different amounts in a 3-nanopore joint base calling, and the natural translocation speeds of the nanopores differ by design. However, the shift estimates are subject to errors and/or failures which can be detrimental to the overall system detection performance. In particular, if these parameters are not adapted, the varying translocation speeds and environmental changes (such as pH for biological nanopores) mean that the shift amounts may not remain accurate throughout the sequencing process. In the case of adaptive calculation, a highly non-stationary signal can make these parameter estimates hard to use in practice. In an embodiment of the idea, the middle nanopore may be manufactured to give the best performance, while the neighboring nanopores can be structured as helpers and can be chosen to be cost-efficient and of lesser quality to reduce overall cost. For instance, the middle nanopore can be larger in size, can use the best and more costly chemical processes, can use extra mechanisms to stabilize the translocation, etc. Thus, the MLD for the middle nanopore current output forms the main detection engine, while the other two MLDs act as auxiliary detection engines whose metric information is weighted less than that of the main engine. In this manner, errors in the shift amount estimation propagate less into the main sequence estimation process, which supports better detection performance. In fact, the shift amounts $\Delta_1$ and $\Delta_2$ and the weights are interconnected and need to be optimized jointly.
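A weighted combination of branch metrics across pores can be sketched with a small trellis search. The sketch makes several assumptions not found in the description: each pore has a two-tap response per base (a main tap and a trailing tap onto the next sample), the sample streams are already aligned, the trellis state is the previous base, and the weights and tap values are illustrative. It is a toy instance of joint maximum likelihood detection, not the disclosed detector.

```python
BASES = "AGCT"


def expected_level(taps, prev, cur):
    """Noise-free sample when cur follows prev under a two-tap response."""
    tail = taps[prev][1] if prev is not None else 0.0
    return taps[cur][0] + tail


def synthesize(seq, taps):
    """Noise-free sample stream for one pore (used here to build test data)."""
    out, prev = [], None
    for cur in seq:
        out.append(expected_level(taps, prev, cur))
        prev = cur
    return out


def joint_viterbi(streams, taps_per_pore, weights):
    """Least-cost base sequence using weighted metrics from all pores."""

    def branch(k, prev, cur):
        # Weighted sum of squared errors over the pores at time step k.
        return sum(
            w * (s[k] - expected_level(taps, prev, cur)) ** 2
            for s, taps, w in zip(streams, taps_per_pore, weights)
        )

    cost = {b: branch(0, None, b) for b in BASES}
    back = []  # back[k - 1][b] = best predecessor of base b at step k
    for k in range(1, len(streams[0])):
        new_cost, pointers = {}, {}
        for cur in BASES:
            prev = min(BASES, key=lambda p: cost[p] + branch(k, p, cur))
            new_cost[cur] = cost[prev] + branch(k, prev, cur)
            pointers[cur] = prev
        cost = new_cost
        back.append(pointers)

    path = [min(BASES, key=lambda b: cost[b])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Illustrative taps; the auxiliary pores are scaled copies of the middle pore.
base_taps = {"A": (1.0, 0.3), "G": (2.0, 0.1), "C": (3.0, 0.4), "T": (4.0, 0.2)}
scales = [0.9, 1.0, 0.8]  # top, middle, bottom
taps_per_pore = [
    {b: (m * s, t * s) for b, (m, t) in base_taps.items()} for s in scales
]
truth = "ACTGACGCCT"
streams = [synthesize(truth, taps) for taps in taps_per_pore]
streams[0][3] += 1.0          # disturb one sample of the auxiliary top pore
weights = [0.25, 1.0, 0.25]   # the middle pore acts as the main engine
decoded = joint_viterbi(streams, taps_per_pore, weights)
```

In this example the disturbance on the lightly weighted top pore costs less than any path that contradicts the middle pore, so the decoded sequence matches `truth`.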

[0091] In an eighth claim, in still another embodiment of the storage system, data could be encoded using an indel-correction code, followed by a product code able to correct both substitution errors and erasures. This concatenation of codes could be necessary to reduce the nucleotide detection error rate below $10^{-20}$. Through joint detection, some of the indels would be detected due to the diversity of multiple captured copies of the same data. The affected nucleotide positions are filled/labelled as erasures to be used by the subsequent product decoding. Product codes are well suited to handling a mixture of substitution errors and erasures, whereas the front-end indel-correcting code takes care of the remaining single deletions or insertions. The remaining indels are expected to be few, such as at most a single indel per codeword.

[0092] In a ninth claim, in yet another embodiment of the proposed storage system, the concept of a “Master channel” can be used to periodically learn the signal shapes, filter coefficients, whitener coefficients, branch metrics, shift amounts, pad amounts, and sampling periods, among other parameters of the storage system. Master nanopores have a special chemical header attached to the nanopore entrance. This chemical composition identifies specially designed DNA reference molecules. These nanopores do not allow any molecule to pass other than these special molecules. Therefore, since these reference molecules are known, the corresponding system parameters are optimized based on the resulting nanopore signals. These parameters are then communicated to the non-master nanopores for update during real-time sequencing operation. FIG. 7 is a simplified schematic cross-sectional view illustration of nanopores 704 usable within the nucleic acid digital data storage system 100 illustrated in FIG. 1 shown on a two-dimensional planar surface 750. As shown, each stacked nanopore 704 is associated with multiple wells 752. In this example, four wells 752 are shown for each nanopore 704, as in an Oxford Nanopore MinION device.

[0093] As can be seen in FIG. 7, the well sizes are different, and a nanopore 704 can switch to one and only one of these wells 752 (forming the DNA channel) during sequencing. The well sizes are different because DNA molecules pass more frequently through the bigger wells 752. Hence, by switching between the wells 752 for master nanopores 704M, the update frequency of the system parameters can be adjusted. The switch between different wells 752 in the other, non-master channels is done based on the probabilities of DNA molecules passing through each well 752. For example, the biggest well 752 can be switched on for 50% of the time, whereas the rest of the wells 752 share the other 50% of the time equally. The number and the allocation of the master nanopores 704M among the full set of nanopores 704 are adjusted such that enough update information can be collected and the allocation is balanced across the two-dimensional surface 750, such that the separation between the master nanopores 704M is maximized for a given fixed number.
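The example time allocation between wells can be sketched as a weighted random schedule. The relative well sizes below are hypothetical, and drawing the active well at random in each time slot is only one possible way to realize the stated 50% share; the description does not specify the switching mechanism.

```python
import random


def well_time_shares(well_sizes, biggest_share=0.5):
    """Fraction of time each well is switched on.

    The biggest well receives biggest_share; the remaining wells split
    the rest equally, as in the example of paragraph [0093].
    """
    if len(well_sizes) < 2:
        return [1.0] * len(well_sizes)
    biggest = max(range(len(well_sizes)), key=lambda i: well_sizes[i])
    rest = (1.0 - biggest_share) / (len(well_sizes) - 1)
    return [biggest_share if i == biggest else rest
            for i in range(len(well_sizes))]


def pick_well(shares, rng):
    """Draw the index of the well to switch to for the next time slot."""
    return rng.choices(range(len(shares)), weights=shares, k=1)[0]


sizes = [1.0, 2.0, 4.0, 3.0]  # hypothetical relative sizes of four wells
shares = well_time_shares(sizes)
rng = random.Random(0)
schedule = [pick_well(shares, rng) for _ in range(1000)]
```

Changing `biggest_share` for the master nanopores is one way to adjust how often reference molecules are captured, and hence how often the system parameters are refreshed.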

[0094] It is further noted that thanks to their solid-state nature, the nanopores 704 are expected to survive in their initial state for a long time and hence provide a stationary signal shape throughout the data lifetime. In case a major change is detected in the storage system 100, retraining on collected data is executed to correct the signal shapes and sampling times. Otherwise, a drift in the storage system 100 may dramatically reduce the detection accuracy of the subsequent detection algorithms.

[0095] It is further appreciated that other machine learning schemes can also be used within the context of this disclosure where appropriate, as long as multi-class classification is performed. For instance, regression or reinforcement learning can be used to estimate $h_1(t)$ and $h_2(t)$. Depending on the nanopore model, signal levels can be mapped to these functions provided the sampling periods are known. Another such example is the Error-Correcting Output Codes (ECOC) framework, in which multiple component binary classifiers are used with an appropriate merging algorithm to achieve successful multi-class classification. All multi-class (4-class) classification algorithms can be used to classify bytes in each iteration into one of the four classes A, G, C, T. The accuracy of such algorithms is of crucial importance for the iterations to work properly and in order not to introduce new types of errors into the decoding operation. Depending on the technique, the training may take different amounts of time and memory space.

[0096] With the present invention, contrary to the state-of-the-art, Recursive Neural Networks (RNNs) are used to estimate the signal shapes for each base nucleotide rather than using a base detection process directly. Based on the estimated signal shapes, the data storage system is configured to use Recurrent Convolutional Neural Networks (R-CNNs) and conventional detection algorithms based on estimated signal shapes such as noise predictive maximum likelihood detection (NPMLD) to sequence the nucleotides in a spatially coordinated way. In this manner, improved detection accuracy performance is ensured, while giving a brand-new methodology to the detection process within the context of explainable AI and low-complexity information decoding.

[0097] More specifically, first, the data storage system is configured to use multiple pores placed on top of each other, where their sizes, the architecture of their internal structure, and the materials they are made of may be different. In fact, hybrid pores (both protein and solid-state at the same time) could be combined to make up the multi-pore architecture. Protein nanopores are robust, easily reproducible at low cost, and easy to modify. On the other hand, solid-state nanopores, due to their chemical nature, would improve the cost and scale of nanopore analyses. So, within this architecture, the present invention can use the best of both worlds to improve the detection process. It is appreciated that for compatibility with solid-state circuit development, allowing solid-state-only nanopores may be preferable from a manufacturing cost point of view.

[0098] Another objective of such a design is to create almost-balanced translocation speeds so as to ensure a stationary system and stationary signal shapes over a long period of time. Thus, another novelty of the present invention is the ability to control the translocation time of DNA molecules through the use of multiple pores which may be interleaved with different sized cavities. Through the use of multiple pores and multiple chemical mechanisms to generate a driving force inside the cavities, an almost constant translocation time is targeted. In fact, the pores would help each other to rearrange the speed if it becomes too fast or too slow. The system can be further configured to detect signal anomalies and to trigger re-estimation of the signals (offline) to maintain detection performance (for the later detection processes). The fastest translocation is expected at the top of the pores, whereas the slowest translocation speeds are associated with the bottom of the stacked pore structure.

[0099] In summary, the present disclosure describes a methodology based on multi-pore sequencing to improve the base-calling performance through redundancy in space, thereby adding a spatial resolution into the detection process. The classic approach to improve spatial resolution is to decrease k (ideally to 1, thus using all single-base detection studies through miniaturizing the pore sizes). However, with the present invention, the k value is artificially increased through stacking multiple nanopores inside a membrane, with each housing one or more nucleotides at a given time. Moreover, the present invention is configured to use noise predictive data detection algorithms and error/erasure/deletion and insertion correction codes to introduce redundancy in time and reduce the complexity. By introducing these two redundancies at the same time, and by decoupling the system components, the data storage system aims to improve the detection speed and accuracy performances of the nanopore sequencing process.

[0100] Thus, with use of the data storage system configured having features and aspects of the present invention, certain disadvantages can be overcome. For example, the present invention can be utilized to overcome at least these three important problems with respect to the state of the art: (1) Neural network-based detection approaches require complex and/or specially designed hardware. Moreover, hundreds of such hardware units would be needed for parallel processing; (2) It is impossible to reason about the overall base-detection process, and hence it is hard to improve the system accuracy by introducing novel system modules/algorithms. In fact, in all conventional systems, all time-dependent signal disturbances such as noise, inter-symbol interference, phase shift, signal smearing, etc., are handled by RNNs in a complicated way; and (3) Nanopore sequencing is based on ionic current blockade levels and single-dimensional temporal data. In other words, there is no spatial data component to enhance detection performance, and this results in high error rates.

[0101] It is understood that although a number of different embodiments of the data storage system have been illustrated and described herein, one or more features of any one embodiment can be combined with one or more features of one or more of the other embodiments, provided that such combination satisfies the intent of the present invention.

[0102] While a number of exemplary aspects and embodiments of the data storage system have been discussed above, those of skill in the art will recognize certain modifications, permutations, additions, and sub-combinations thereof. It is therefore intended that the following appended claims and claims hereafter introduced are interpreted to include all such modifications, permutations, additions, and sub-combinations as are within their true spirit and scope.

CPC classification: G16B40/10 (Physics); G01N33/48721 (Physics); C12Q1/6869 (Chemistry; Metallurgy)

International classification: G16B40/10 (Physics); G01N33/487 (Physics); C12Q1/6869 (Chemistry; Metallurgy)
