JOINT MULTI-NANOPORE SEQUENCING FOR RELIABLE DATA RETRIEVAL IN NUCLEIC ACID STORAGE
20230215516 · 2023-07-06
Cpc classification
G16B40/10
PHYSICS
G01N33/48721
PHYSICS
Abstract
A nucleic acid storage system (100) that uses nanopore sequencing to read data values chemically embedded in oligonucleotides includes a membrane (102), a voltage source (108), and a nucleic acid strand (110). The membrane (102) has a plurality of nanopores (104) that are stacked upon one another in a multi-nanopore arrangement. The voltage source (108) is configured to direct voltage across the plurality of nanopores (104). The nucleic acid strand (110) including the oligonucleotides is threaded through each of the plurality of nanopores (104) within the membrane (102). A separate base signal (118) is generated from the nucleic acid strand (110) being threaded through each of the plurality of nanopores (104), and Recursive Neural Networks can be used to estimate a signal shape for each oligonucleotide. Recurrent Convolutional Neural Networks and noise predictive data detection algorithms can be used based on the estimated signal shapes to sequence the oligonucleotides.
Claims
1. A nucleic acid digital data storage system that uses nanopore sequencing to read data values chemically embedded in oligonucleotides, the nucleic acid storage system comprising: a membrane having a plurality of nanopores that are stacked upon one another in a multi-nanopore arrangement; a voltage source that is configured to direct voltage across the plurality of nanopores; and a nucleic acid strand including the oligonucleotides that is threaded through each of the plurality of nanopores within the membrane.
2. The nucleic acid digital data storage system of claim 1 wherein the nanopores are surrounded by an electrolyte solution within the membrane.
3. The nucleic acid digital data storage system of claim 1 wherein the nucleic acid strand is a DNA strand; and wherein the oligonucleotides include one or more of adenine, guanine, cytosine, and thymine.
4. The nucleic acid digital data storage system of claim 1 wherein the nucleic acid strand is an RNA strand.
5. The nucleic acid digital data storage system of claim 1 wherein the voltage from the voltage source is applied across each of the plurality of nanopores independently of one another to create an electrical field across pore ends of each of the plurality of nanopores; and wherein the electrical field creates an ionic current to pass through each of the plurality of nanopores.
6. The nucleic acid digital data storage system of claim 1 wherein the membrane is usable to capture multiple waveforms for a base sequence when the oligonucleotides are threaded through the plurality of nanopores; and wherein the oligonucleotides being threaded through each of the plurality of nanopores generates a corresponding ionic current.
7. The nucleic acid digital data storage system of claim 6 wherein a separate base signal is generated from the nucleic acid strand being threaded through each of the plurality of nanopores.
8. The nucleic acid digital data storage system of claim 7 wherein Recursive Neural Networks are used to estimate a signal shape for each oligonucleotide.
9. The nucleic acid digital data storage system of claim 8 wherein Recurrent Convolutional Neural Networks and noise predictive maximum likelihood data detection algorithms are used based on the estimated signal shapes to sequence the oligonucleotides.
10. The nucleic acid digital data storage system of claim 7 wherein each of the base signals is modified by each of a post-processing system, a joint symbol detection system, and an Error Correction Coding (ECC) decoding system.
11. The nucleic acid digital data storage system of claim 1 wherein the plurality of nanopores includes a first nanopore, a second nanopore and a third nanopore that are stacked one on top of another from top to bottom in the multi-nanopore arrangement; and wherein the membrane further includes a first cavity that is defined between the first nanopore and the second nanopore, and a second cavity that is defined between the second nanopore and the third nanopore.
12. The nucleic acid digital data storage system of claim 11 wherein each of the plurality of nanopores is different from each of the other nanopores in one or more of size and translocation speed.
13. The nucleic acid digital data storage system of claim 12 wherein the first cavity has a first size, and the second cavity has a second size that is different than the first size.
14. The nucleic acid digital data storage system of claim 1 wherein the membrane is one of a biological membrane, a solid-state membrane, and a hybrid of a biological membrane and a solid-state membrane.
15. A method for using nanopore sequencing to read data values chemically embedded in oligonucleotides, the method comprising the steps of: stacking a plurality of nanopores upon one another in a multi-nanopore arrangement within a membrane; directing voltage across the plurality of nanopores with a voltage source; and threading a nucleic acid strand including the oligonucleotides through each of the plurality of nanopores within the membrane.
16. The method of claim 15 further comprising the step of providing an electrolyte solution within the membrane so that the nanopores are surrounded by the electrolyte solution.
17. The method of claim 15 wherein the step of directing includes applying the voltage from the voltage source across each of the plurality of nanopores independently of one another to create an electrical field across pore ends of each of the plurality of nanopores; and creating an ionic current with the electrical field to pass through each of the plurality of nanopores.
18. The method of claim 15 further comprising the steps of capturing multiple waveforms for a base sequence with the membrane when the oligonucleotides are threaded through the plurality of nanopores; and generating a corresponding ionic current from the oligonucleotides being threaded through each of the plurality of nanopores.
19. The method of claim 18 further comprising the steps of generating a separate base signal from the nucleic acid strand being threaded through each of the plurality of nanopores; estimating a signal shape for each oligonucleotide using Recursive Neural Networks; and sequencing the oligonucleotides using Recurrent Convolutional Neural Networks and noise predictive maximum likelihood data detection algorithms based on the estimated signal shapes.
20. The method of claim 19 further comprising the step of modifying each of the base signals by each of a post-processing system, a joint symbol detection system, and an Error Correction Coding (ECC) decoding system.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0038] The novel features of this invention, as well as the invention itself, both as to its structure and its operation, will be best understood from the accompanying drawings, taken in conjunction with the accompanying description, in which similar reference characters refer to similar parts, and in which:
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
[0045]
[0046] While embodiments of the present invention are susceptible to various modifications and alternative forms, specifics thereof have been shown by way of example and drawings, and are described in detail herein. It is understood, however, that the scope herein is not limited to the particular embodiments described. On the contrary, the intention is to cover modifications, equivalents, and alternatives falling within the spirit and scope herein.
DESCRIPTION
[0047] Embodiments of the present invention are described in the context of a nucleic acid digital data storage system (also sometimes referred to as a “data storage system” or simply a “storage system”) that utilizes joint multi-nanopore sequencing for reliable data retrieval. More particularly, in various embodiments, the data storage system is configured to use multiple-pore manufacturing in the same membrane to capture multiple waveforms for the same base sequence. In other words, the same oligonucleotides pass through multiple physically collocated pores (stacked on top of each other) with potentially different translocation speeds, and each generates a corresponding ionic current. As referred to herein, it is appreciated that a nanopore is a pore of nanometer size. Thus, the terms “nanopore” and “pore” are sometimes used interchangeably herein.
[0048] Those of ordinary skill in the art will realize that the following detailed description of the present invention is illustrative only and is not intended to be in any way limiting. Other embodiments of the present invention will readily suggest themselves to such skilled persons having the benefit of this disclosure. Reference will now be made in detail to implementations of the present invention as illustrated in the accompanying drawings. The same or similar reference indicators will be used throughout the drawings and the following detailed description to refer to the same or like parts.
[0049] In the interest of clarity, not all of the routine features of the implementations described herein are shown and described. It will, of course, be appreciated that in the development of any such actual implementations, numerous implementation-specific decisions must be made in order to achieve the developer's specific goals, such as compliance with application-related and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the art having the benefit of this disclosure.
[0050] In various implementations of the present invention, the data storage system is configured to use multiple nanopores (with each individual nanopore being either biological (protein-based), solid-state, or a hybrid thereof) with different aperture sizes and potentially different chemical content (protein, graphene, silicon nitride, etc.), usable in nanopore sequencing for reliable data retrieval. An example structure of the multi-pore cross-section, as well as the subsequent system components, is shown in
[0051] DNA-based data storage systems encode digital information (typically in a series of 0's and 1's) using combinations of the four nucleotides (adenine (A), guanine (G), cytosine (C) and thymine (T), more commonly known as “bases”) of which DNA is composed. There is considerable flexibility in that encoding. For example, each base may represent two bits, or individual (or short sequences of) bits may be represented by short, predetermined sequences of bases. It is recognized that the systems and methods described in detail herein are applicable in all of these cases.
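By way of a non-limiting illustration of the two-bits-per-base option noted above, the following Python sketch maps each pair of bits to one base and back. The particular assignment (00 to A, 01 to C, 10 to G, 11 to T) and the function names `encode_bytes` and `decode_bases` are assumptions made for this example only; the disclosure does not prescribe a specific mapping.

```python
# Hypothetical 2-bit-per-base mapping; the disclosure does not fix one.
BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bytes(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def decode_bases(strand: str) -> bytes:
    """Invert encode_bytes; the strand length must be a multiple of four."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)
```

Under this assumed mapping, the byte 0x1B (binary 00011011) is written as the strand ACGT, and decoding that strand returns the original byte.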
[0052] Although the invention is generally described in detail in relation to DNA digital data storage, it is appreciated that substantially the same systems and methods would be equally applicable utilizing RNA in lieu of DNA. Therefore, it is not intended that the scope of the present disclosure be limited in such manner.
[0053] It is appreciated that the membrane 102 can include any suitable number of nanopores 104 that are stacked one upon another. For example, in the embodiment illustrated in
[0054] In different implementations, the nanopores 104 may, for example, be created by a pore-forming protein or as a hole in synthetic materials such as silicon or graphene. More particularly, as noted, the nanopores 104 can be biological, solid-state, or a hybrid thereof. In one such implementation, the nanopores 104 are created as holes in silicon nitride (SiN) structures and/or materials.
[0055] As further illustrated in
[0056] The post-processing undertaken within the post-processing system 112 can take different forms depending on the signal quality, signal synchronization, signal amplitude, and signal phase, among other properties of the captured signal. There may also be coupling between the nanopore currents due to their physical proximity, which is compensated in the joint symbol detection system 114 after post-processing is done. Finally, the data is decoded within the decoding system 116 using the redundancy generated by the Error Correction Coding (ECC).
[0057] Each of the major components of the embodiment of the storage system 100 of
[0058]
[0059] As noted above, the membrane 102 can be provided in the form of either a biological membrane, a solid-state membrane, or a hybrid thereof. In one non-exclusive embodiment, the membrane 102 can include silicon nitride structures 220 that form the plurality of nanopores 104.
[0060] In various embodiments, the membrane 102 includes the plurality of nanopores 104 (or “pores”) that are stacked upon one another in a multi-nanopore arrangement. The nanopores 104 are further surrounded by the electrolyte solution 106. For simplicity of illustration, in the embodiment specifically illustrated in
[0061] The areas within the membrane 102 between the nanopores 104 can also be referred to as cavities. For example, as shown in
[0062] It is appreciated that the nanopores 104 are again illustrated in
[0063] When one or more nanopores 104 are present in an electrically insulating membrane 102, a detection principle is based on monitoring the ionic current passing through the nanopores 104 as a voltage is applied across the membrane 102. When the nanopores 104 are of molecular dimensions, passage of molecules (such as DNA) causes interruptions of the “open” current level, leading to a “translocation event” signal.
[0064] As illustrated, in a nanopore sequencing technique, which is used to read data values chemically embedded in oligonucleotides, the DNA strand 110 passes through the plurality of nanopores 104 and voltage from the voltage source 108 is applied across the nanopores 104 which ends up creating an electrical field 226 across pore ends 204E (one such electrical field 226 is identified in
[0065] Inside the capture region, ions have a directed motion that can be recorded as a steady ionic current by placing electrodes near the membrane 102. More particularly, as noted above, depending on the type of the molecule passing through the nanopores 104, different current blockade levels and translocation speeds can be measured and recorded through placing electrodes near the membrane 102. This molecule also has a net charge that feels a force from the electrical field 226 when it is found in the capture region. The molecule approaches this capture region aided by Brownian motion and any attraction it might have to the surface of the membrane 102. Once inside the nanopore 104, the molecule translocates through via a combination of electro-phoretic, electro-osmotic and sometimes thermo-phoretic forces. Inside the nanopore 104, the molecule occupies a volume that partially restricts the flow of ions, observed as an ionic current drop. Different molecules can then be sensed and potentially identified based on this modulation in ionic current. For example, based on various factors such as nanopore 104 geometry, size and chemical composition, the change in the magnitude of the ionic current blockade and the duration of the translocation (so called dwell time) may vary over time.
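As a minimal sketch of how the blockade magnitude and dwell time described above might be extracted from a recorded current trace, the following Python function applies a simple threshold relative to the open-pore level. The threshold rule, the parameter `threshold_fraction`, and the function name are assumptions introduced for illustration; the disclosure does not specify an event-detection rule.

```python
from typing import List, Tuple


def find_translocation_events(
    current: List[float], open_level: float, threshold_fraction: float = 0.8
) -> List[Tuple[int, int, float]]:
    """Return (start_index, dwell_samples, mean_blockade_depth) per event.

    A sample counts as blocked when it falls below
    threshold_fraction * open_level (an assumed, illustrative rule).
    """
    threshold = threshold_fraction * open_level
    events = []
    start = None
    for i, value in enumerate(current):
        blocked = value < threshold
        if blocked and start is None:
            start = i  # event begins
        elif not blocked and start is not None:
            segment = current[start:i]
            depth = open_level - sum(segment) / len(segment)
            events.append((start, i - start, depth))
            start = None  # event ends
    if start is not None:  # trace ended while still blocked
        segment = current[start:]
        depth = open_level - sum(segment) / len(segment)
        events.append((start, len(segment), depth))
    return events
```

Each returned tuple pairs a dwell time (in samples) with a mean blockade depth, which are the two quantities the paragraph above identifies as varying with nanopore geometry, size and chemical composition.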
[0066] The voltage source 108 can be any suitable type of voltage source that is configured to provide the desired voltage across the nanopores 104 which ends up creating the electrical field 226 across the pore ends 204E, and which creates the ionic current to pass through the nanopores 104.
[0067] As illustrated in
[0068] In the real-time streaming, these base signals 118 (illustrated in
[0069]
[0070] As illustrated, in certain embodiments, the raw base signals 118 first go through a bank of adaptive filters 328 (such as Adaptive Finite-Impulse Response filters (AFIRs) or other suitable types of filters) in parallel, whose coefficients are subject to optimization/learning, to generate a plurality of filtered signals 330. Next, due to physical separation between the nanopores 104 (illustrated in
[0071] Following this stage, data is padded as necessary onto the shifted signals 334 with a data padding system 336 due to the shifting operation. In some embodiments, data padding places zeros for frame completion. Subsequently, the waveform is sampled within an aperiodic sampling system 338 at a period that can change over time (adjusted based on the translocation and physical distances or geometries). In other words, the sampling system 338 creates samples from the signals using non-uniform sampling periods. Finally, a whitening filter 340 is used to change the statistical properties of the colored noise. This whitening filter 340 is typically also designed as a finite impulse response (FIR) filter, but can alternatively be another suitable type of filter such as an infinite impulse response (IIR) filter. The whitening filter 340 operates on the discrete samples and helps keep the subsequent detection process minimally affected by the colored nature of the noise. Such a sequence of post-processing tools prepares the signal samples for the subsequent detection process.
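The shift, pad, aperiodic sampling and whitening stages described above can be sketched in Python as follows. This is an illustrative sketch only: the integer sample delay, the use of linear interpolation for non-uniform sampling, the filter taps, and all function names are assumptions, and the adaptive filter bank 328 is omitted.

```python
from typing import List, Sequence


def shift_and_pad(signal: Sequence[float], delay: int, frame_len: int) -> List[float]:
    """Advance the signal by `delay` samples, then zero-pad to frame_len."""
    shifted = list(signal[delay:delay + frame_len])
    return shifted + [0.0] * (frame_len - len(shifted))


def sample_aperiodic(signal: Sequence[float], times: Sequence[float]) -> List[float]:
    """Linearly interpolate the signal at non-uniform sample times."""
    samples = []
    last = len(signal) - 1
    for t in times:
        t = min(max(t, 0.0), float(last))  # clamp to the valid range
        lo = int(t)
        hi = min(lo + 1, last)
        frac = t - lo
        samples.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return samples


def whiten(samples: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Apply an FIR whitening filter y[n] = sum_k taps[k] * x[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out
```

As one example of a tap choice, if the colored noise were modeled as a first-order autoregressive process with coefficient ρ, the two-tap filter [1, −ρ] would whiten it; in practice the taps would be learned from the recorded signals.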
[0072]
[0073]
[0074] It is appreciated that the joint symbol detection system 414 and the ECC decoding system 516 that can be incorporated as part of the nucleic acid digital data storage system 100 can include features, components and details somewhat similar to what was illustrated in the bit error detection and correction system of U.S. patent application Ser. No. 13/719,777 filed on Dec. 19, 2012 that utilizes a combination of a List-Viterbi (or “List-NPMLD”) detection algorithm, and error detection code decoders for reducing the number of error events at the output of the Viterbi (or “NPMLD”). As far as permitted, the contents of U.S. patent application Ser. No. 13/719,777 are incorporated in their entirety herein by reference.
[0075] In summary, after the base signals 118 are collected in the manner illustrated and described, post-processing is applied to the collected current waveforms. Following the post-processing, a joint detector architecture generates the final base-calling output before implementation of the Error Correction Coding (ECC) decoding stage. To operate correctly, the system needs an adequate signal model and a combination of post-processing (PP) and detector that is implemented carefully based on the operating conditions and the resulting data. Various post-processing and detection methods are provided as a list of claims in the following. Each of these claims can be implemented either alone or jointly to address the problems previously mentioned herein.
[0076] In a first claim, in order to enhance understanding of the channel, reduce complexity, and decouple different stages of the data detection process, it is proposed to use Artificial Neural Networks/Recursive Neural Networks (ANN/RNN) to estimate isolated impulse responses of the nanopore to four different bases, namely A, G, C and T. In this characterization, each ionic current level is a result of multiple signals shifted right/left and superimposed on each other. An example scenario is illustrated and described in greater detail herein above. With this treatment, simple threshold-detector approaches can be designed based on the signal shapes as well as severity of the inter-symbol-interference. Alternative detection methods can also be proposed, of which some are detailed in other claims.
[0077] In a second claim, in an embodiment of the present invention, it is assumed that the response of a given nanopore to a nucleotide is a combination of two channel responses $h_1(t)$ and $h_2(t)$. To model the varying translocation, time shifts of these two signals are assumed to form the current blockade signal,
$$I(t)=\sum_{i}\left[a_i\,h_1(t-iT)+b_i\,h_2(t-iS)\right]+\eta(t)$$ (Equation 1)
where $a_i \in \{+1, -1\}$ and $b_i \in \{+1, -1\}$. Also, $T$ and $S$ are the periods for these responses and $\eta(t)$ is the noise component of the observed current signal $I(t)$. There are four combinations of $(a_i, b_i)$, which are used to encode nucleotides A, G, C and T. In this formulation, $h_1(t)$, $h_2(t)$, $T$ and $S$ are estimated from the recorded signals so that, given the DNA sequence, the modeled $I(t)$ most closely mimics the training data. There may be multiple AI-based approaches to the estimation process. In one embodiment, neural networks can be used, whereas in another, linear or non-linear regression techniques can alternatively be used.
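To make the signal model of Equation 1 concrete, the following Python sketch evaluates its noise-free part for a short base sequence. The assignment of the four sign pairs to the bases A, G, C and T, the rectangular pulse shapes, and the names `BASE_TO_SIGNS`, `blockade_current` and `rect` are all assumptions chosen for illustration; in the described system the responses and periods would be estimated from recorded data.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical assignment of the four (a_i, b_i) sign pairs to bases.
BASE_TO_SIGNS: Dict[str, Tuple[int, int]] = {
    "A": (+1, +1), "G": (+1, -1), "C": (-1, +1), "T": (-1, -1),
}


def blockade_current(
    bases: str,
    h1: Callable[[float], float],
    h2: Callable[[float], float],
    T: float,
    S: float,
    times: Sequence[float],
) -> List[float]:
    """Evaluate the noise-free part of Equation 1 at the given times."""
    signal = []
    for t in times:
        total = 0.0
        for i, base in enumerate(bases):
            a_i, b_i = BASE_TO_SIGNS[base]
            total += a_i * h1(t - i * T) + b_i * h2(t - i * S)
        signal.append(total)
    return signal


def rect(width: float, height: float) -> Callable[[float], float]:
    """Toy pulse: `height` on [0, width), 0 elsewhere (assumed shape)."""
    return lambda t: height if 0.0 <= t < width else 0.0
```

With the assumed pulses `rect(1.0, 1.0)` and `rect(1.0, 0.5)` and $T = S = 1$, the four bases produce four distinct levels (1.5, 0.5, −0.5 and −1.5), which is the property that allows a detector to separate them.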
[0078]
[0079] With the base signals 118 (one example of which is shown in
[0080] In certain embodiments, Recursive Neural Networks (RNNs) are used to estimate the signal shapes for each base nucleotide rather than using a base detection process directly. Based on the estimated signal shapes, the data storage system is configured to use Recurrent Convolutional Neural Networks (R-CNNs) and conventional detection algorithms based on estimated signal shapes such as noise predictive maximum likelihood detection (NPMLD) to sequence the nucleotides in a spatially coordinated way. In this manner, improved detection accuracy performance is ensured, while giving a brand-new methodology to the detection process within the context of explainable AI and low-complexity information decoding.
[0081] Assuming a linear system under sufficiently responsive and adaptive conditions, the individual estimation of signal shapes based on RNNs or R-CNNs would lead to accurate weighted superposition and the estimate of the observed induced current/voltage signal. Hence, knowing the individual impulse responses, and their adaptive estimation, a sequence detector can be employed to estimate the base sequences.
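As a hedged illustration of a sequence detector that operates on estimated signal shapes, the following Python sketch runs a Viterbi search over a simplified channel in which each sample is the level of the current base plus a scaled contribution from the previous base. The one-tap interference model, the squared-error branch metric (which corresponds to a Gaussian noise assumption), and the names `viterbi_detect`, `levels` and `isi` are assumptions for this example and are not taken from the disclosure.

```python
from typing import Dict, List, Sequence

BASES = "AGCT"


def viterbi_detect(
    samples: Sequence[float], levels: Dict[str, float], isi: float
) -> str:
    """Most likely base sequence for y[n] = L[x_n] + isi * L[x_{n-1}] + noise.

    State = previous base; branch metric = squared error. The first
    sample is assumed to have no predecessor.
    """
    if not samples:
        return ""
    # Initialise with the first sample (no inter-symbol interference yet).
    cost = {b: (samples[0] - levels[b]) ** 2 for b in BASES}
    paths: Dict[str, List[str]] = {b: [b] for b in BASES}
    for y in samples[1:]:
        new_cost: Dict[str, float] = {}
        new_paths: Dict[str, List[str]] = {}
        for cur in BASES:
            best_prev, best = BASES[0], float("inf")
            for prev in BASES:
                expected = levels[cur] + isi * levels[prev]
                metric = cost[prev] + (y - expected) ** 2
                if metric < best:
                    best_prev, best = prev, metric
            new_cost[cur] = best
            new_paths[cur] = paths[best_prev] + [cur]
        cost, paths = new_cost, new_paths
    # The survivor with the lowest accumulated cost is the decision.
    return "".join(paths[min(cost, key=cost.get)])
```

Storing full survivor paths keeps the sketch short; a practical detector would typically use a bounded traceback memory instead.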
[0082] In a third claim, in an alternative post-processing method, it is appreciated that as the nucleotides pass through the nanopores, multiple, mutually dependent signals will be measured. A conventional RNN would not work in this case as it expects a one-dimensional time series. Therefore, multiple independent RNNs can be employed, each run without exploiting the inherent dependency between the measured signals or the coupling between them. The RNN outputs are finally combined through simple majority voting to reach the final decision on the sequence of nucleotides.
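The majority-voting step can be sketched in Python as follows, assuming that the per-nanopore base calls have already been aligned and have equal length. The tie-breaking rule and the function name `majority_vote` are assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def majority_vote(calls: Sequence[str]) -> str:
    """Column-wise majority over equal-length base calls, one per nanopore.

    Ties are broken in favour of the earliest listed call (an arbitrary,
    assumed rule).
    """
    if not calls:
        return ""
    length = len(calls[0])
    if any(len(c) != length for c in calls):
        raise ValueError("all calls must have the same length")
    consensus = []
    for column in zip(*calls):
        counts = Counter(column)
        top = max(counts.values())
        # First base in pore order that reaches the top count.
        consensus.append(next(b for b in column if counts[b] == top))
    return "".join(consensus)
```

With three nanopores, a substitution error in a single call is outvoted by the other two calls at that position.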
[0083] In a fourth claim, in alternative methodology, assuming three nanopores as shown in
$$f_{\text{R-CNN}}\left(I_T(t-\Delta_1),\, I_M(t),\, I_B(t+\Delta_2)\right)$$ (Equation 2)
[0084] This technique still uses an end-to-end neural network and could be quite complex to implement, particularly in the context of a 100 million stacked nanopore architecture.
[0085] In a fifth claim, in another embodiment, neural networks are used to estimate signal shapes for each nanopore rather than doing a joint base calling. The estimation of signal shapes might be different for each physical nanopore. However, with coupling between such nanopores, techniques like R-CNN could be used to estimate signal shapes jointly. For instance in an embodiment of a three nanopore structure, there can be 12 different signal shape estimates, one for each nanopore and base. Next, using such signal estimates, a maximum likelihood detector (MLD) can be employed based on a trellis structure (for each nanopore individually) whose branch metric computations will be done based on the signal estimates that are jointly generated. The basecalling output would be the least costly path in the trellis given the nanopore signal output. Finally, a majority vote at the end merges these sequences to make a decision on a single base sequence. In this case, multiple MLDs per nanopore would be needed. To give an example, consider the following sequence as shown in Table 1:
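TABLE 1: Initial Sequencing Detected

| Time t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pore 1 | A | C | T | G | A | C | G | G | C | T | G | A | C | C | A |
| Pore 2 | o | A | C | T | G | A | C | G | C | C | T | G | A | C | C |
| Pore 3 | o | o | A | C | T | G | A | C | G | C | C | T | G | A | C |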
[0086] Now, assume that even if joint cost estimation, etc. is used, there is a base deleted during the detection process due to faster translocation than usual. So, the following picture can be obtained after a deletion in one of the pores, as shown in Table 2.
[0087] Deletion in Pore 3
TABLE 2: Sequencing Detected After A Deletion in Pore 3

| Time t | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pore 1 | A | C | T | G | A | C | G | C | C | T | G | A | C | C | A |
| Pore 2 | o | A | C | T | G | A | C | G | C | C | T | G | A | C | C |
| Pore 3 | o | o | A | C | T | G | A | C | G | C | T | G | A | C | C |
[0088] As shown in Table 2, a deletion in pore 3 happens right after t=8, where a nucleotide C is deleted by the pore due to translocation or detection problems. By considering the output of all three pores, this deletion error can easily be detected and corrected through some majority logic voting system.
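One possible form of such a majority logic voting system is sketched below in Python. It assumes that the start-up delays between pores have already been removed, that every read covers the same window of the strand, and that a faulty read has lost exactly one base; the gap-placement heuristic and the function names are assumptions introduced for illustration.

```python
from collections import Counter
from typing import List, Sequence


def insert_gap(short: str, reference: str) -> str:
    """Place one '-' in `short` where it best matches `reference`.

    Assumes `short` is exactly one base shorter (a single deletion).
    """
    best, best_score = short + "-", -1
    for pos in range(len(short) + 1):
        candidate = short[:pos] + "-" + short[pos:]
        score = sum(c == r for c, r in zip(candidate, reference))
        if score > best_score:
            best, best_score = candidate, score
    return best


def vote_with_deletion(reads: Sequence[str]) -> str:
    """Majority vote after re-aligning reads that lost a single base."""
    full_len = max(len(r) for r in reads)
    reference = next(r for r in reads if len(r) == full_len)
    aligned: List[str] = []
    for read in reads:
        if len(read) == full_len:
            aligned.append(read)
        elif len(read) == full_len - 1:
            aligned.append(insert_gap(read, reference))
        else:
            raise ValueError("sketch handles at most one deletion per read")
    consensus = []
    for column in zip(*aligned):
        # Gaps do not vote; full-length reads guarantee at least one vote.
        votes = Counter(b for b in column if b != "-")
        consensus.append(votes.most_common(1)[0][0])
    return "".join(consensus)
```

Applied to the first fourteen bases of the strand in Table 2, the two complete reads from pores 1 and 2 outvote the gap in the pore 3 read, so the deleted C is restored in the consensus.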
[0089] In a sixth claim, as an alternative to the fifth claim, the MLD detectors (for each nanopore) can exchange information during the sequence estimation process to decide on the single base sequence while sequencing their own bases. In other words, while calculating the distance metrics, corresponding distance metrics from other trellises can be used to determine the most likely sequence. Thus, in this formulation, bases are jointly determined and MLDs work collaboratively. That is to say, MLDs converge to the same sequence decision while moving over their corresponding signal sequences. The joint collaboration results in the same consensus over the most likely base sequence by identifying errors, deletions as well as insertions to the base sequence. A short-time memory would need to be used for back-tracing in the MLD implementation. However, due to time dependence between sequences, memory used for each MLD can help other memories in the back-tracing process.
[0090] In a seventh claim, in another embodiment of the system for the example number of nanopores of
[0091] In an eighth claim, in still another embodiment of the storage system, data could be encoded using an indel-correction code, followed by a product code able to correct both substitution errors and erasures. This concatenation of codes could be necessary to reduce nucleotide detection error rates below $10^{-20}$. Through joint detection, some of the indels would be detected due to the diversity of multiple captured copies of the same data. These detected nucleotides are filled/labelled as erasures to be used by the subsequent product decoding. Product codes are well suited to a mixture of substitution errors and erasures, whereas the front-end indel-correcting code will take care of the remaining single deletions or insertions. The remaining indels are expected to be few, such as at most a single indel per codeword.
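To illustrate how positions labelled as erasures can be recovered by a product code, the following Python sketch uses the simplest such code, in which every row and every column of a symbol grid carries a single parity symbol. This toy code only fills erasures and does not correct substitution errors; the choice of code, the representation of each base as a 2-bit integer, and the function name `fill_erasures` are assumptions for illustration and are not the specific codes of the disclosure.

```python
from typing import List, Optional

Grid = List[List[Optional[int]]]  # None marks an erasure


def fill_erasures(grid: Grid) -> Grid:
    """Iteratively recover erasures in a single-parity-check product code.

    Every row and every column is assumed to XOR to zero (symbols are
    2-bit integers, one per base). A line with exactly one erasure is
    solvable; solving it may unlock further lines.
    """
    grid = [row[:] for row in grid]  # work on a copy
    rows, cols = len(grid), len(grid[0])
    lines = [[(r, c) for c in range(cols)] for r in range(rows)]
    lines += [[(r, c) for r in range(rows)] for c in range(cols)]
    progress = True
    while progress:
        progress = False
        for line in lines:
            missing = [(r, c) for r, c in line if grid[r][c] is None]
            if len(missing) == 1:
                value = 0
                for r, c in line:
                    if grid[r][c] is not None:
                        value ^= grid[r][c]
                r, c = missing[0]
                grid[r][c] = value
                progress = True
    return grid
```

Because rows and columns are decoded alternately, a row with two erasures can still be completed once a column decoder has filled one of them, which is the benefit of the product structure.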
[0092] In a ninth claim, in yet another embodiment of the proposed storage system, the concept of “Master channel” can be used to periodically learn the signal shapes, filter coefficients, whitener coefficients, branch metrics, shift amounts, pad amounts, and sampling periods among other parameters of the storage system. Master nanopores have a special chemical header attached to the nanopore entrance. This chemical composition identifies specially designed DNA reference molecules. These nanopores do not allow any other molecule to pass but these special molecules. Therefore, since these reference molecules are known, corresponding system parameters are optimized based on the resulting nanopores. These parameters are then communicated with non-master nanopores for update during real-time sequencing operation.
[0093] As can be seen in
[0094] It is further noted that thanks to their solid-state nature, the nanopores 704 are expected to survive in their initial state for a long time and hence ensure a stationary signal shape throughout the data lifetime. In case a major change is detected in the storage system 100, retraining of collected data is executed to correct the signal shapes and sampling times. Otherwise, a drift in the storage system 100 may dramatically reduce the detection accuracy performance of the subsequent detection algorithms.
[0095] It is further appreciated that other machine learning schemes can also be used within the context of this disclosure where appropriate as long as multi-class classification is performed. For instance, regression or reinforcement learning can be used to estimate $h_1(t)$ and $h_2(t)$. Depending on the nanopore model, signal levels can be mapped to these functions provided the sampling periods are known. Another such example is Error Correction Output Coding (ECOC) frameworks, in which multiple component binary classifiers are used with an appropriate merging algorithm to achieve successful multi-class classification. All multi-class (4-class) classification algorithms can be used to classify bytes in each iteration into one of the four classes A, G, C, T. The accuracy of such algorithms is of crucial importance for the iterations to work properly and in order not to introduce new types of errors into the decoding operation. Depending on the technique, the training may take different amounts of time and memory space.
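The merging step of an ECOC framework can be sketched in Python as a nearest-codeword decision over the outputs of the component binary classifiers. The 5-bit codebook below (minimum pairwise Hamming distance of 3, so a single erroneous binary classifier is tolerated) and the names `CODEBOOK` and `ecoc_decode` are assumptions chosen for illustration; the binary classifiers themselves are not shown.

```python
from typing import Dict, Sequence

# Hypothetical codebook: one 5-bit codeword per base, with pairwise
# Hamming distance of at least 3.
CODEBOOK: Dict[str, str] = {
    "A": "00000",
    "G": "00111",
    "C": "11011",
    "T": "11100",
}


def ecoc_decode(bits: Sequence[int]) -> str:
    """Return the base whose codeword is nearest in Hamming distance.

    `bits` holds the 0/1 outputs of the five component binary classifiers.
    """
    if len(bits) != 5:
        raise ValueError("expected one output per binary classifier (5)")

    def distance(codeword: str) -> int:
        return sum(int(c) != b for c, b in zip(codeword, bits))

    return min(CODEBOOK, key=lambda base: distance(CODEBOOK[base]))
```

For example, the classifier outputs 0, 1, 1, 1, 1 differ from the assumed codeword for G in one position and from every other codeword in at least two, so the merged decision is G.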
[0096] With the present invention, contrary to the state-of-the-art, Recursive Neural Networks (RNNs) are used to estimate the signal shapes for each base nucleotide rather than using a base detection process directly. Based on the estimated signal shapes, the data storage system is configured to use Recurrent Convolutional Neural Networks (R-CNNs) and conventional detection algorithms based on estimated signal shapes such as noise predictive maximum likelihood detection (NPMLD) to sequence the nucleotides in a spatially coordinated way. In this manner, improved detection accuracy performance is ensured, while giving a brand-new methodology to the detection process within the context of explainable AI and low-complexity information decoding.
[0097] More specifically, first, the data storage system is configured to use multiple pores put on top of each other where their sizes, architecture of their internal structure, and what they are made of, may be different. In fact, hybrid pores (both protein and solid-state at the same time) could be combined to make up the multi-pore architecture. Protein nanopores are robust, easily reproducible at low cost, and easy to modify. On the other hand, solid-state nanopores, due to their chemical nature, would improve the cost and scale of nanopore analyses. So, within this architecture, the present invention can use the best of both worlds to improve the detection process. It is appreciated that for compatibility to solid-state circuit development, allowing solid-state-only nanopores may be preferable from a manufacturing cost point of view.
[0098] Another objective of such a design is to create almost-balanced translocation speeds so as to ensure a stationary system and stationary signal shapes over a long period of time. Thus, another novelty of the present invention is the ability to control the translocation time of DNA molecules through the use of multiple pores which may be interleaved with different sized cavities. Through the use of multiple pores and multiple chemical mechanisms to generate a driving force inside the cavities, an almost constant translocation time is targeted. In fact, the pores would help each other to readjust the speed if it becomes too fast or too slow. The system can be further configured to detect signal anomalies and to trigger re-estimation of signals (offline) to maintain detection performance (for the later detection processes). The fastest translocation is expected at the top of the pores, whereas the slowest translocation speeds are associated with the bottom of the stacked pore structure.
[0099] In summary, the present disclosure describes a methodology based on multi-pore sequencing to improve the base-calling performance through redundancy in space, thereby adding a spatial resolution into the detection process. The classic approach to improve spatial resolution is to decrease k (ideally to 1, thus using all single-base detection studies through miniaturizing the pore sizes). However, with the present invention, the k value is artificially increased through stacking multiple nanopores inside a membrane, with each housing one or more nucleotides at a given time. Moreover, the present invention is configured to use noise predictive data detection algorithms and error/erasure/deletion and insertion correction codes to introduce redundancy in time and reduce the complexity. By introducing these two redundancies at the same time, and by decoupling the system components, the data storage system aims to improve the detection speed and accuracy performances of the nanopore sequencing process.
[0100] Thus, with use of the data storage system configured having features and aspects of the present invention, certain disadvantages can be overcome. For example, the present invention can be utilized to overcome at least these three important problems with respect to the state of the art: (1) The neural network-based detection approach requires complex and/or specially designed hardware. Moreover, hundreds of such hardware units would be needed to do parallel processing; (2) It is impossible to reason about the overall base-detection process and hence hard to improve the system accuracy performance through introducing novel system modules/algorithms. In fact, in all conventional systems, all time-dependent signal disturbances such as noise, inter-symbol interference, phase shift, signal smearing, etc., are handled by RNNs in a complicated way; and (3) Nanopore sequencing is based on ionic current blockade levels and single-dimensional temporal data. In other words, there is no spatial data component to enhance detection performance and hence this results in high error rates.
[0101] It is understood that although a number of different embodiments of the data storage system have been illustrated and described herein, one or more features of any one embodiment can be combined with one or more features of one or more of the other embodiments, provided that such combination satisfies the intent of the present invention.
[0102] While a number of exemplary aspects and embodiments of the data storage system have been discussed above, those of skill in the art will recognize certain modifications, permutations, additions, and sub-combinations thereof. It is therefore intended that the following appended claims and claims hereafter introduced are interpreted to include all such modifications, permutations, additions, and sub-combinations as are within their true spirit and scope.