High-Capacity Storage of Digital Information in DNA
20230214319 · 2023-07-06
Assignee
Inventors
Cpc classification
B82Y10/00
PERFORMING OPERATIONS; TRANSPORTING
G11C13/02
PHYSICS
G06F2212/1032
PHYSICS
International classification
B82Y10/00
PERFORMING OPERATIONS; TRANSPORTING
G11C13/02
PHYSICS
G16B30/00
PHYSICS
Abstract
A method for storage of an item of information (210) is disclosed. The method comprises encoding bytes (720) in the item of information (210), and representing using a schema the encoded bytes by a DNA nucleotide to produce a DNA sequence (230). The DNA sequence (230) is broken into a plurality of overlapping DNA segments (240) and indexing information (250) added to the plurality of DNA segments. Finally, the plurality of DNA segments (240) is synthesized (790) and stored (795).
Claims
1. A method of creating a plurality of DNA segments data to be provided to a DNA synthesis platform for controlling the synthesis of a plurality of DNA segments for storing an item of information, the method comprising: encoding bytes in an item of information, stored in a first computer file as a DNA sequence data, using a representation schema to represent the encoded bytes as at least one DNA nucleotide datum in the DNA sequence data; splitting the DNA sequence data into a plurality of overlapping DNA segments data; adding indexing information to the plurality of DNA segments data, the indexing information indicating a position in the DNA sequence data of any one nucleotide datum of any one of the plurality of DNA segments data; and storing the plurality of DNA segments data in a machine-readable second computer file.
2. The method of claim 1, further including the addition of adapters data to the DNA segments data.
3. The method of claim 1 using a base-3 scheme for encoding the bytes.
4. The method of claim 1, wherein the representation schema used is designed such that adjacent ones of the DNA nucleotide data are different.
5. The method of claim 1, further comprising adding a parity-check to the indexing information.
6. The method of claim 1, wherein alternate ones of the DNA segments data are reverse complemented.
7. The method of claim 1, wherein the representation schema used is designed to avoid long, self-reverse complementary DNA segments data.
8. The method of claim 1, further comprising providing to a DNA synthesis platform the plurality of DNA segments data for controlling the synthesis of a plurality of DNA segments from the DNA segments data.
9. The method of claim 1, further comprising the step of synthesizing from the DNA segments data a plurality of DNA segments for storing an item of information.
10. A non-volatile, non-transitory storage medium storing a plurality of DNA segments data to be provided to a DNA synthesis platform for controlling the synthesis of a plurality of DNA segments for storing an item of information, wherein the plurality of DNA segments data are created by a method comprising: encoding bytes in an item of information, stored in a first computer file, as a DNA sequence data using a representation schema to represent the encoded bytes as at least one DNA nucleotide datum in the DNA sequence data; splitting the DNA sequence data into a plurality of overlapping DNA segments data; and adding indexing information to the plurality of DNA segments data, the indexing information indicating a position in the DNA sequence data of any one of the plurality of DNA segments data.
11. A computer program product comprising logic for executing the method according to claim 1.
Description
DESCRIPTION OF THE FIGURES
[0020] For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description and the accompanying drawings, in which:
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
DETAILED DESCRIPTION
[0029] One of the main challenges for a practical implementation of DNA storage to date has been the difficulty of creating long sequences of DNA to a specified design. The long sequences of DNA are required to store large data files, such as long text items and videos. It is also preferable to use an encoding with a plurality of copies of each designed DNA. Such redundancy guards against both encoding and decoding errors, as will be explained below. It is not cost-efficient to use a system based on individual long DNA chains to encode each (potentially large) message. The inventors have developed a method that uses ‘indexing’ information associated with each one of the DNA segments to indicate the position of the DNA segment in a hypothetical longer DNA molecule that encodes the entire message.
[0030] The inventors used methods from code theory to enhance the recoverability of the encoded messages from the DNA segment, including forbidding DNA homopolymers (i.e. runs of more than one identical base) that are known to be associated with higher error rates in existing high throughput technologies. The inventors further incorporated a simple error-detecting component, analogous to a parity-check bit.sup.9 into the indexing information in the code. More complex schemes, including but not limited to error-correcting codes and, indeed, substantially any form of digital data security (e.g. RAID-based schemes) currently employed in informatics, could be implemented in future developments of the DNA storage scheme. See, Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583-585 (1995) and Chen, P. M., Lee, E. K., Gibson, G. A., Katz, R. H. & Patterson, D. A. RAID: high-performance, reliable secondary storage. ACM Computing Surveys 26, 145-185 (1994).
[0031] The inventors selected five computer files to be encoded as a proof-of-concept for the DNA storage of this disclosure. Rather than restricting the files to human-readable information, files using a range of common formats were chosen. This demonstrated the ability of the teachings of the disclosure to store arbitrary types of digital information. The files contained all 154 of Shakespeare's sonnets (in TXT format), the complete text and figure of ref 10 (in PDF format), a medium-resolution color photograph of the EMBL-European Bioinformatics Institute (JPEG 2000 format), a 26 second extract from Martin Luther King's “I Have A Dream” speech (MP3 format) and a file defining the Huffman code used in this study to convert bytes to base-3 digits (as a human-readable text file).
[0032] The five files selected for DNA-storage were as follows:
[0033] wssnt10.txt, 107738 bytes, ASCII text format: all 154 Shakespeare sonnets (from Project Gutenberg, http://www.gutenberg.org/ebooks/1041)
[0034] watsoncrick.pdf, 280864 bytes, PDF format document: Watson and Crick's (1953) publication.sup.10 describing the structure of DNA (from the Nature website, http://www.nature.com/nature/dna50/archive.html, modified to achieve higher compression and thus smaller file size).
[0035] EBI.jp2, 184264 bytes, JPEG 2000 format image file: color photograph (16.7M colors, 640×480 pixel resolution) of the EMBL-European Bioinformatics Institute (own picture).
[0036] MLK_excerpt_VBR_45-85.mp3, 168539 bytes, MP3 format sound file: 26 second-long extract from Martin Luther King's “I Have A Dream” speech (from http://www.americanrhetoric.com/speeches/mlkihaveadream.htm, modified to achieve higher compression: variable bit rate, typically 48-56 kbps; sampling frequency 44.1 kHz)
[0037] View huff3.cd.new, 15646 bytes, ASCII file: human-readable file defining the Huffman code used in this study to convert bytes to base-3 digits (trits)
[0038] The five computer files comprise a total of 757051 bytes, approximately equivalent to a Shannon information of 5.2×10.sup.6 bits or 800 times as much encoded and recovered human-designed information as the previous maximum amount known to have been stored (see
[0039] The DNA encoding of each one of the computer files was computed using software and the method is illustrated in
[0040] The resulting in silico DNA sequences 230 are too long to be readily produced by standard oligonucleotide synthesis. Each of the DNA sequences 230 was therefore split in step 730 into overlapping segments 240 of length 100 bases with an overlap of 75 bases. To reduce the risk of systematic synthesis errors introduced to any particular run of bases, alternate ones of the segments were then converted in step 740 to their reverse complement, meaning that each base is “written” four times, twice in each direction. Each segment was then augmented in step 750 with an indexing information 250 that permitted determination of the computer file from which the segment 240 originated and its location within that computer file 210, plus simple error-detection information. This indexing information 250 was also encoded in step 760 as non-repeating DNA nucleotides, and appended in step 770 to the 100 information storage bases of the DNA segments 240. It will be appreciated that the division of the DNA segments 240 into lengths of 100 bases with an overlap of 75 bases is purely arbitrary. It would be possible for other lengths and overlaps to be used and this is not limiting of the invention.
[0041] In total, all of the five computer files were represented by 153335 strings of DNA. Each one of the strings of DNA comprised 117 nucleotides (encoding original digital information plus indexing information). The encoding scheme used had various features of the synthesized DNA (e.g. uniform segment lengths, absence of homopolymers) that made it obvious that the synthesized DNA did not have a natural (biological) origin. It is therefore obvious that the synthesized DNA has a deliberate design and encoded information. See, Cox, J. P. L. Long-term data storage in DNA. TRENDS Biotech. 19, 247-250 (2001).
[0042] As noted above, other encoding schemes for the DNA segments 240 could be used, for example to provide enhanced error-correcting properties. It would also be straightforward to increase the amount of indexing information in order to allow more or larger files to be encoded. It has been suggested that the Nested Primer Molecular Memory (NPMM) scheme reaches its practical maximum capacity at 16.8M unique addresses, and there appears to be no reason why the method of the disclosure could not be extended beyond this to enable the encoding of almost arbitrarily large amounts of information. See, Yamamoto, M., Kashiwamura, S., Ohuchi, A. & Furukawa, M. Large-scale DNA memory based on the nested PCR. Natural Computing 7, 335-346 (2008) and Kari, L. & Mahalingam, K. DNA computing: a research snapshot. In Atallah, M. J. & Blanton, M. (eds.) Algorithms and Theory of Computation Handbook, vol. 2. 2nd ed. pp. 31-1-31-24 (Chapman & Hall, 2009)
[0043] One extension to the coding scheme, in order to avoid systematic patterns in the DNA segments 240, would be to change the information. Two ways of doing this were tried. A first way involved the “shuffling” of information in the DNA segments 240; the information can be retrieved if one knows the pattern of shuffling. In one aspect of the disclosure, different patterns of shuffles were used for different ones of the DNA segments 240.
[0044] A further way is to add a degree of randomness into the information in each one of the DNA segments 240. A series of random digits can be used for this, using modular addition of the series of random digits and the digits comprising the information encoded in the DNA segments 240. The information can easily be retrieved by modular subtraction during decoding if one knows the series of random digits used. In one aspect of the disclosure, different series of random digits were used for different ones of the DNA segments 240.
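By way of non-limiting illustration, the modular addition and subtraction described above can be sketched in Python as follows. The sketch assumes that the randomisation is applied to the base-3 digits (trits) before they are converted to nucleotides, and that the series of random digits is regenerated from a shared seed. The function names and the seed value are hypothetical and are not taken from the disclosure.

```python
import random


def keystream(seed, length):
    """Reproducible series of random trits (assumption: derived from a shared seed)."""
    rng = random.Random(seed)
    return [rng.randrange(3) for _ in range(length)]


def mask_trits(trits, key):
    """Add the random series to the information trits, digit by digit, modulo 3."""
    return [(t + k) % 3 for t, k in zip(trits, key)]


def unmask_trits(masked, key):
    """Recover the information trits by modular subtraction of the same series."""
    return [(m - k) % 3 for m, k in zip(masked, key)]


data = [0, 1, 2, 2, 1, 0, 0, 2]
key = keystream(seed=240, length=len(data))  # hypothetical per-segment seed
masked = mask_trits(data, key)
recovered = unmask_trits(masked, key)
```

Because the masking acts on trits, the subsequent conversion to nucleotides would still avoid repeated bases; a different seed per segment corresponds to using different series for different DNA segments.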
[0045] The digital information encoding in step 720 was carried out as follows. The five computer files 210 of digital information (represented in
[0046] The indexing information 250 comprised two trits for file identification (permitting 3.sup.2=9 files to be distinguished, in this implementation), 12 trits for intra-file location information (permitting 3.sup.12=531441 locations per file) and one ‘parity-check’ trit. The indexing information 250 was encoded in step 760 as non-repeating DNA nucleotides and was appended in step 770 to the 100 information storage bases. Each indexed DNA segment 240 had one further base added in step 780 at each end, consistent with the ‘no homopolymers’ rule, that would indicate whether the entire DNA segment 240 were reverse complemented during the ‘reading’ stage of the experiment.
[0047] In total, the five computer files 210 were represented by 153335 strings of DNA, each comprising 117 (1+100+2+12+1+1) nucleotides (encoding original digital information and indexing information).
[0048] The data-encoding component of each string in the aspect of the invention described herein uses 5.07 DNA bases per byte on average (about 1.58 bits of Shannon information per DNA base), which is close to the theoretical optimum of 5.05 bases per byte for base-4 channels with run length limited to one. The indexing implementation 250 permits 3.sup.14=4782969 unique data locations. Increasing the number of indexing trits (and therefore bases) used to specify file and intra-file location by just two, to 16, gives 3.sup.16=43046721 unique locations, in excess of the 16.8M that is the practical maximum for the NPMM scheme. See, Yamamoto, M., Kashiwamura, S., Ohuchi, A. & Furukawa, M. Large-scale DNA memory based on the nested PCR. Natural Computing 7, 335-346 (2008) and Kari, L. & Mahalingam, K. DNA computing: a research snapshot. In Atallah, M. J. & Blanton, M. (eds.) Algorithms and Theory of Computation Handbook, vol. 2. 2nd ed. pp. 31-1-31-24 (Chapman & Hall, 2009).
[0049] The DNA synthesis process of step 790 was also used to incorporate 33 bp adapters to each end of each one of the oligonucleotides (oligo) to facilitate sequencing on Illumina sequencing platforms:
TABLE-US-00001
5′ adapter: (SEQ ID NO: 1) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
3′ adapter: (SEQ ID NO: 2) AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
[0050] The 153335 DNA segment designs 240 were synthesized in step 790 in three distinct runs (with the DNA segments 240 randomly assigned to runs) using an updated version of Agilent Technologies' OLS (Oligo Library Synthesis) process described previously.sup.22, 23 to create approx. 1.2×10.sup.7 copies of each DNA segment design. Errors were seen to occur at a rate of only about one per 500 bases, and independently in different copies of the DNA segments 240. Agilent Technologies adapted the phosphoramidite chemistry developed previously.sup.24 and employed inkjet printing and flow cell reactor technologies in Agilent's SurePrint in situ microarray synthesis platform. The inkjet printing within an anhydrous chamber allows the delivery of very small volumes of phosphoramidites to a confined coupling area on a 2D planar surface, resulting in the addition of hundreds of thousands of bases in parallel. Subsequent oxidation and detritylation are carried out in a flow cell reactor. Once the DNA synthesis has been completed, the oligonucleotides are cleaved from the surface and deprotected. See, Cleary, M. A. et al. Production of complex nucleic acid libraries using highly parallel in situ oligonucleotide synthesis. Nature Methods 1, 241-248 (2004).
[0051] The adapters were added to the DNA segments to enable a plurality of copies of the DNA segments to be easily made. A DNA segment with no adapter would require additional chemical processes to “kick start” the chemistry for the synthesis of the multiple copies by adding additional groups onto the ends of the DNA segments.
[0052] Up to ˜99.8% coupling efficiency is achieved by using thousands-fold excess of phosphoramidite and activator solution. Similarly, millions-fold excess of detritylation agent drives the removal of the 5′-hydroxyl protecting group to near completion. A controlled process in the flowcell reactor significantly reduced depurination, which is the most prevalent side reaction. See, Le Proust, E. M. et al. Synthesis of high-quality libraries of long (150 mer) oligonucleotides by a novel depurination controlled process. Nucl. Acids Res. 38, 2522-2540 (2010). Up to 244000 unique sequences can be synthesized in parallel and delivered as ˜1-10 picomole pools of oligos.
[0053] The three samples of lyophilized oligos were incubated in Tris buffer overnight at 4° C., periodically mixed by pipette and vortexing, and finally incubated at 50° C. for 1 hour, to a concentration of 5 ng/ml. As insolubilized material remained, the samples were left for a further 5 days at 4° C. with mixing two to four times each day. The samples were then incubated at 50° C. for 1 hour and 68° C. for 10 minutes, purified from residual synthesis by-products on Ampure XP paramagnetic beads (Beckman Coulter), and could then be stored in step 795. Sequencing and decoding is shown in
[0054] The combined oligo sample was amplified in step 810 (22 PCR cycles using thermocycler conditions designed to give even A/T vs. G/C processing.sup.26) using paired-end Illumina PCR primers and high-fidelity AccuPrime reagents (Invitrogen), a combination of Taq and Pyrococcus polymerases with a thermostable accessory protein. The amplified products were bead purified and quantified on an Agilent 2100 Bioanalyzer, and sequenced using AYB software in paired-end mode on an Illumina HiSeq 2000 to produce reads of 104 bases.
[0055] The digital information decoding was carried out as follows. The central 91 bases of each oligo were sequenced in step 820 from both ends and so rapid computation of full-length (117 base) oligos and removal of sequence reads inconsistent with the designs was straightforward. The sequence reads were decoded in step 830 using computer software that exactly reverses the encoding process. The sequence reads for which the parity-check trit indicated an error or that at any stage could not be unambiguously decoded or assigned to a reconstructed computer file were discarded in step 840 from further consideration.
[0056] The vast majority of locations within every decoded file were detected in multiple different sequenced DNA oligos, and simple majority voting in step 850 was used to resolve any discrepancies caused by the DNA synthesis or the sequencing errors. On completion of this procedure 860, four of the five original computer files 210 were reconstructed perfectly. The fifth computer file required manual intervention to correct two regions each of 25 bases that were not recovered from any sequenced read.
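A minimal Python sketch of the majority voting of step 850 follows. It assumes that each retained read has already been placed at its offset in the reconstructed DNA sequence by means of its indexing information. The function name, the use of 'N' for positions covered by no read, and the tie-breaking behaviour are assumptions made for illustration only.

```python
from collections import Counter


def consensus(placed_reads, total_length):
    """Majority vote per position over reads given as (offset, sequence) pairs."""
    votes = [Counter() for _ in range(total_length)]
    for offset, seq in placed_reads:
        for j, base in enumerate(seq):
            votes[offset + j][base] += 1
    # Positions with no coverage are reported as 'N'; ties are resolved in
    # favour of whichever base was seen first (an arbitrary choice here).
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)


reads = [(0, "ACGT"), (0, "ACGA"), (0, "ACGT"), (2, "GTAC")]
result = consensus(reads, total_length=8)
```

In this toy input the single discordant base at position 3 is outvoted, and the last two positions, which no read covers, remain unresolved, much like the two 25-base regions that required manual intervention.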
[0057] During decoding in step 850, it was noticed that one file (ultimately determined to be watsoncrick.pdf) reconstructed in silico at the level of DNA (prior to decoding, via base-3, to bytes) contained two regions of 25 bases that were not recovered from any one of the sequenced oligos. Given the overlapping segment structure of the encoding, each region indicated the failure of four consecutive segments to be synthesized or sequenced, as any one of four consecutive overlapping segments would have contained bases corresponding to this location. Inspection of the two regions indicated that the non-detected bases fell within long repeats of the following 20-base motif:
TABLE-US-00002 (SEQ ID NO: 3) 5′ GAGCATCTGCAGATGCTCAT 3′
[0058] It was noticed that repeats of this motif have a self-reverse complementary pattern. These are shown in
[0059] It is possible that long, self-reverse complementary DNA segments might not be readily sequenced using the Illumina paired-end process, owing to the possibility that the DNA segments might form internal nonlinear stem-loop structures that would inhibit the sequencing-by-synthesis reaction of the protocol used in the method described in this document. Consequently, the in silico DNA sequence was modified to repair the repeating motif pattern and then subjected to subsequent decoding steps. No further problems were encountered, and the final decoded file matched perfectly the file watsoncrick.pdf. A code that ensured that no long self-complementary regions existed in any of the designed DNA segments could be used in future.
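The self-reverse complementary property of repeats of the 20-base motif (SEQ ID NO: 3) can be checked with a short Python sketch. The test used here, namely that the reverse complement of the motif is a cyclic rotation of the motif itself, is one simple way to express the property and is an assumption of this illustration, as are the function names.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base and reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def repeats_are_self_reverse_complementary(motif):
    """True if tandem repeats of `motif` read the same, up to a shift, on both strands."""
    return reverse_complement(motif) in motif + motif


MOTIF = "GAGCATCTGCAGATGCTCAT"
motif_rc = reverse_complement(MOTIF)
is_self_rc = repeats_are_self_reverse_complementary(MOTIF)
```

For this motif the reverse complement is the motif rotated by two bases, so a long run of repeats pairs with itself almost end to end. A similar check could in principle be applied to candidate segments at design time.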
Example of Huffman Coding Scheme
[0060] Table 1 shows the exemplary Huffman coding scheme used to convert byte values (0-255) to base-3. For highly compressed information, each byte value should appear equally frequently and the mean number of trits per byte will be (239*5+17*6)/256=5.07. The theoretical minimum number of trits per byte is log(256)/log(3)=5.05.
TABLE-US-00003 TABLE 1 Base 3 Coding
No   Character  Byte Value  Code Word (5 or 6 trits)
0               0           22201
1    U          85          22200
2    ™          170         22122
3               127         22121
4    ”          253         22120
5    4          52          22112
6    ä          138         22111
7    )          41          22110
8    V          86          22102
9    *          42          22101
10   d          100         22100
11   ,          44          22022
12   ˙          250         22020
13   Ñ          132         22021
14   °          161         22012
15   b          98          22010
16              8           22002
17   ″          34          22011
18   [NL]       10          22001
19   ï          149         22000
20   W          87          21222
21              21          21221
22   J          74          21220
23   $          36          21212
24   E          69          21210
25   ±          177         21202
26              20          21211
27   '          213         21200
28   £          163         21201
29   Â          229         21121
30   ˇ          255         21122
31   ≈          197         21120
32   Ö          133         21112
33   ,          252         21110
34              26          21111
35   ≠          173         21101
36   ó          151         21102
37   R          82          21100
38   K          75          21022
39   %          37          21021
40   ¶          166         21011
41   ø          191         21020
42   X          88          21012
43   ?          63          21010
44   D          68          21001
45   ñ          150         21002
46   L          76          21000
47              4           20222
48   ö          154         20221
49   Í          234         20212
50              22          20220
51   ¢          162         20211
52   i          105         20210
53   f          102         20202
54   ´          171         20201
55   h          104         20200
56   ©          169         20122
57   f          196         20121
58   —          208         20120
59   T          84          20112
60   ç          130         20111
61   í          146         20102
62   H          72          20110
63              16          20101
64   B          66          20100
65              24          20022
66   j          106         20012
67   fl         223         20020
68   :          58          20021
69   â          137         20011
70   I          73          20010
71   e          101         20001
72   ®          168         20002
73   μ          181         12221
74   Ø          175         12222
75   °          251         20000
76   (          40          12220
77   å          140         12212
78              17          12211
79   S          83          12210
80              254         12202
81              240         12201
82   ÷          214         12200
83   5          53          12122
84              202         12112
85              25          12121
86              18          12120
87   ˜          247         12111
88   Æ          174         12110
89   p          112         12102
90   Y          89          12101
91   “          210         12100
92   Ÿ          217         12012
93
                206         01110
198  Π          184         01100
199  ”          227         01101
200  È          233         01022
201  Ì          237         01021
202  °          188         01020
203  q          113         01012
204  1          49          01011
205  ...        201         01010
206  õ          155         01002
207  fi         222         01000
208  Á          231         01001
209             5           00222
210             27          00221
211  É          131         00212
212  §          164         00220
213             3           00211
214  .          46          00210
215  w          119         00201
216             28          00202
217  ∞          176         00200
218             23          00122
219  @          64          00121
220  ù          157         00120
221  ª          187         00112
222  Ù          244         00110
223  Ó          238         00111
224  `          96          00102
225  Î          235         00101
226  <          60          00022
227             1           00100
228  n          110         00021
229  »          200         00011
230             221         00020
231  c          99          00012
232             31          00010
233  Δ          198         00002
234             193         00001
235  }          125         00000
236  |          124         22222
237  ò          152         22222
238  z          122         22222
239  G          71          222212
240  ^          94          222211
241             220         222210
242             29          222202
243  «          199         222201
244  =          61          222200
245             11          222122
246  ‰          228         222121
247  >          62          222120
248  7          55          222112
249  y          121         222111
250             7           222110
251  -          30          222102
252  Ë          232         222101
253  Ω          189         222100
254  ;          59          222021
255  Ï          236         222022
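The code-length arithmetic quoted for Table 1 can be reproduced with a few lines of Python. This is a sketch that assumes, as the description does, that every byte value is equally frequent and that one nucleotide is written per trit. The variable names are illustrative only.

```python
import math

FIVE_TRIT_CODES = 239  # byte values given a 5-trit code word
SIX_TRIT_CODES = 17    # byte values given a 6-trit code word

# Mean code length when all 256 byte values are equally likely.
mean_trits_per_byte = (FIVE_TRIT_CODES * 5 + SIX_TRIT_CODES * 6) / 256

# Fewest trits that can distinguish 256 equally likely values.
lower_bound = math.log(256) / math.log(3)

# One nucleotide is written per trit, so this is also bits per DNA base.
bits_per_base = 8 / mean_trits_per_byte
```

Rounded to two decimal places, these evaluate to 5.07 trits per byte, 5.05 trits per byte and roughly 1.58 bits per base respectively.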
Encoding of the File
[0061] The arbitrary computer file 210 is represented as a string S.sub.0 of bytes (each often interpreted as a number between 0 and 2.sup.8−1, i.e. a value in the set {0 . . . 255}). The string S.sub.0 is encoded using the Huffman code, converting it to base-3. This generates a string S.sub.1 of characters from the set of trits {0, 1, 2}.
[0062] Let us now write len( ) for the function that computes the length (in characters) of a string, and define n=len(S.sub.1). Represent n in base-3 and prepend 0s to generate a string S.sub.2 of trits such that len(S.sub.2)=20. Form the string concatenation S.sub.4=S.sub.1. S.sub.3. S.sub.2, where S.sub.3 is a string of at most 24 zeros chosen so that len(S.sub.4) is an integer multiple of 25.
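The construction of S.sub.4 from S.sub.1 can be sketched in Python as follows. Trit strings are represented as ordinary Python strings of the characters '0', '1' and '2'; the function names are hypothetical and the sketch is illustrative rather than a definitive implementation.

```python
def to_base3(n, width):
    """Base-3 representation of n, left-padded with zeros to `width` trits."""
    digits = ""
    while n:
        n, r = divmod(n, 3)
        digits = str(r) + digits
    if len(digits) > width:
        raise ValueError("value does not fit in the requested number of trits")
    return digits.rjust(width, "0")


def build_s4(s1):
    """Return S4 = S1 . S3 . S2 with len(S4) an integer multiple of 25."""
    s2 = to_base3(len(s1), 20)            # length of S1 as 20 trits
    pad = -(len(s1) + len(s2)) % 25       # between 0 and 24 zeros
    s3 = "0" * pad
    return s1 + s3 + s2


s4_short = build_s4("12012")   # 5 + 20 trits already fills 25, so S3 is empty
s4_long = build_s4("1" * 7)    # 7 + 20 = 27 trits, so S3 holds 23 zeros
```

Storing the length of S.sub.1 in the final 20 trits lets a decoder discard the padding S.sub.3 without ambiguity.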
[0063] S.sub.4 is converted to the DNA string S.sub.5 of characters in {A, C, G, T} with no repeated nucleotides (nt) using the scheme illustrated in the table below. The first trit of S.sub.4 is coded using the ‘A’ row of the table. For each subsequent trit, characters are taken from the row defined by the previous character conversion.
TABLE-US-00004
previous      next trit to encode
nt written    0    1    2
A             C    G    T
C             G    T    A
G             T    A    C
T             A    C    G
[0064] Table: Base-3 to DNA encoding ensuring no repeated nucleotides.
[0065] For each trit t to be encoded, select the row labeled with the previous nucleotide used and the column labeled t and encode using the nt in the corresponding table cell.
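The table lookup described above can be sketched in Python as follows. The dictionary reproduces the base-3 to DNA table, with each row keyed by the previous nucleotide written; the function name and the default starting row argument are illustrative choices.

```python
# Row = previous nucleotide written; position in the string = trit to encode.
NEXT_BASE = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}


def trits_to_dna(trits, previous="A"):
    """Encode a trit string as DNA; the first trit uses the 'A' row by default."""
    out = []
    for t in trits:
        previous = NEXT_BASE[previous][int(t)]
        out.append(previous)
    return "".join(out)


dna = trits_to_dna("0120")
```

Because no row of the table contains its own label, two adjacent nucleotides can never be identical, whatever the input trits are.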
[0066] Define N=len(S.sub.5), and let ID be a 2-trit string identifying the original file and unique within a given experiment (permitting mixing of DNA from different files S.sub.0 in one experiment). Split S.sub.5 into the overlapping DNA segments 240 of length 100 nt, each of the DNA segments 240 being offset from the previous one of the DNA segments 240 by 25 nt. This means there will be ((N/25)−3) DNA segments 240, conveniently indexed i=0 . . . (N/25)−4. The DNA segment i is denoted F.sub.i and contains (DNA) characters 25i . . . 25i+99 of S.sub.5.
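A minimal Python sketch of the splitting into overlapping segments follows. It assumes N is a multiple of 25 and at least 100, as guaranteed by the padding of S.sub.4; the function name and default arguments are illustrative only.

```python
def split_segments(s5, length=100, offset=25):
    """Return the overlapping segments F_0 .. F_(N/25 - 4) of the DNA string S5."""
    if len(s5) < length or len(s5) % offset:
        raise ValueError("S5 must be a multiple of the offset and at least one segment long")
    return [s5[start:start + length]
            for start in range(0, len(s5) - length + 1, offset)]


s5 = "ACGT" * 50                  # placeholder 200-nt string, for illustration only
segments = split_segments(s5)     # yields (200 / 25) - 3 = 5 segments
```

With these parameters every base away from the two ends of S.sub.5 falls in four consecutive segments, which is the fourfold redundancy relied upon during decoding.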
[0067] Each DNA segment F.sub.i is further processed as follows:
[0068] If i is odd, reverse complement the DNA segment F.sub.i.
[0069] Let i3 be the base-3 representation of i, prepending enough leading zeros so that len(i3)=12. Compute P as the sum (mod 3) of the odd-positioned trits in ID and i3, i.e. ID.sub.1+i3.sub.1+i3.sub.3+i3.sub.5+i3.sub.7+i3.sub.9+i3.sub.11. (P acts as a ‘parity trit’, analogous to a parity bit, to check for errors in the encoded information about ID and i.)
[0070] Form the indexing information 250 string IX=ID. i3. P (comprising 2+12+1=15 trits). Append the DNA-encoded (step 760) version of IX to F.sub.i using the same strategy as shown in the above table, starting with the code table row defined by the last character of F.sub.i, to give indexed segment F′.sub.i.
[0071] Form F″.sub.i by prepending A or T and appending C or G to F′.sub.i, choosing between A and T, and between C and G, randomly if possible but always such that there are no repeated nucleotides. This ensures that one can distinguish a DNA segment 240 that has been reverse complemented (step 240) during DNA sequencing from one that has not. The former will start with G|C and end with T|A; the latter will start with A|T and end with C|G.
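The per-segment processing set out in the preceding steps can be sketched in Python as follows. The sketch assumes that trit positions are counted from 1 when selecting the odd-positioned trits for the parity trit P, and that random choices are drawn from Python's standard random module. All function names are hypothetical.

```python
import random

# Base-3 to DNA table: row = previous nucleotide, position = trit to encode.
NEXT_BASE = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def to_base3(n, width):
    """Base-3 representation of n, left-padded with zeros to `width` trits."""
    digits = ""
    while n:
        n, r = divmod(n, 3)
        digits = str(r) + digits
    if len(digits) > width:
        raise ValueError("value does not fit in the requested number of trits")
    return digits.rjust(width, "0")


def trits_to_dna(trits, previous):
    out = []
    for t in trits:
        previous = NEXT_BASE[previous][int(t)]
        out.append(previous)
    return "".join(out)


def parity_trit(file_id, i3):
    # Assumption: positions are counted from 1, so the odd-positioned trits are
    # ID_1 and i3_1, i3_3, ..., i3_11 (Python indices 0, 2, ..., 10).
    return str((int(file_id[0]) + sum(int(t) for t in i3[0::2])) % 3)


def index_segment(f_i, i, file_id, rng=random):
    """Turn a 100-nt segment F_i into the 117-nt indexed, oriented segment F''_i."""
    if i % 2 == 1:                                   # alternate segments are reversed
        f_i = reverse_complement(f_i)
    i3 = to_base3(i, 12)
    ix = file_id + i3 + parity_trit(file_id, i3)     # 2 + 12 + 1 = 15 trits
    f_prime = f_i + trits_to_dna(ix, previous=f_i[-1])
    # Orientation bases: random where possible, never repeating a neighbour.
    first = rng.choice([b for b in "AT" if b != f_prime[0]])
    last = rng.choice([b for b in "CG" if b != f_prime[-1]])
    return first + f_prime + last
```

For example, `index_segment("ACG" * 33 + "A", 0, "01")` returns a 117-nt string that begins with T (because the segment itself begins with A), ends with C or G, and contains no repeated adjacent nucleotides.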
[0072] The segments F″.sub.i are synthesized in step 790 as actual DNA oligonucleotides, stored in step 795, and may be supplied for sequencing in step 820.
Decoding
[0073] Decoding is simply the reverse of the encoding in step 720, starting with the sequenced DNA segments 240 F″.sub.i of length 117 nucleotides. Reverse complementation during the DNA sequencing procedure (e.g. during PCR reactions) can be identified for subsequent reversal by observing whether fragments start with A|T and end with C|G, or start with G|C and end with T|A. With these two ‘orientation’ nucleotides removed, the remaining 115 nucleotides of each DNA segment 240 can be split into the first 100 ‘message’ nucleotides and the remaining fifteen ‘indexing information 250’ nucleotides. The indexing information 250 nucleotides can be decoded to determine the file identifier ID and the position index i3 and hence i, and errors may be detected by testing the parity trit P. Position indexing information 250 permits the reconstruction of the DNA-encoded file 230, which can then be converted to base-3 using the reverse of the encoding table above and then to the original bytes using the given Huffman code.
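A minimal Python sketch of decoding a single 117-nt read follows. It assumes the same 1-based reading of the odd-positioned trits for the parity check as in the encoding, and it returns None for any read that fails a check so that the read can be discarded. The function names are hypothetical.

```python
NEXT_BASE = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}
# Inverse lookup: TRIT_OF[previous][base] gives the trit that produced `base`.
TRIT_OF = {prev: {b: t for t, b in enumerate(row)} for prev, row in NEXT_BASE.items()}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def dna_to_trits(dna, previous):
    out = []
    for base in dna:
        out.append(str(TRIT_OF[previous][base]))  # KeyError on a repeated base
        previous = base
    return "".join(out)


def parity_trit(file_id, i3):
    # Assumption: odd positions counted from 1 (ID_1, i3_1, i3_3, ..., i3_11).
    return str((int(file_id[0]) + sum(int(t) for t in i3[0::2])) % 3)


def decode_read(read):
    """Return (file_id, i, 100-nt message in file orientation) or None."""
    if len(read) != 117 or set(read) - set("ACGT"):
        return None
    if read[0] in "GC" and read[-1] in "TA":       # reversed during sequencing
        read = reverse_complement(read)
    elif not (read[0] in "AT" and read[-1] in "CG"):
        return None
    body = read[1:-1]                              # drop the orientation bases
    message, index_dna = body[:100], body[100:]
    try:
        ix = dna_to_trits(index_dna, previous=message[-1])
    except KeyError:
        return None
    file_id, i3, p = ix[:2], ix[2:14], ix[14]
    if parity_trit(file_id, i3) != p:              # parity check failed
        return None
    i = int(i3, 3)
    if i % 2 == 1:                                 # undo the alternate reversal
        message = reverse_complement(message)
    return file_id, i, message
```

The returned index i gives the offset 25i of the message within the reconstructed DNA sequence, after which overlapping messages can be combined and converted back to trits and bytes.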
Discussion on Data Storage
[0074] The DNA storage has different properties from the traditional tape-based storage or disk-based storage. The ˜750 kB of information in this example was synthesized in 10 pmol of DNA, giving an information storage density of approximately one Terabyte/gram. The DNA storage requires no power and remains (potentially) viable for thousands of years even by conservative estimates.
[0075] DNA Archives can also be copied in a massively parallel manner by the application of PCR to the primer pairs, followed by aliquoting (splitting) the resulting DNA solution. In the practical demonstration of this technology in the sequencing process this procedure was done multiple times, but this could also be used explicitly for copying at large scale the information and then physically sending this information to two or more locations. The storage of the information in multiple locations would provide further robustness to any archiving scheme, and might be useful in itself for very large scale data copying operations between facilities.
[0076] The decoding bandwidth in this example was at 3.4 bits/second, compared to disk (approximately one Terabit/second) or tape (140 Megabit/second), and latency is also high (˜20 days in this example). It is expected that future sequencing technologies are likely to improve both these factors.
[0077] Modeling the full cost of archiving using either the DNA-storage of this disclosure or the tape storage shows that the key parameters are the frequency and fixed costs of transitioning between tape storage technologies and media.
[0078] One issue for long-term digital archiving is how DNA-based storage scales to larger applications. The number of bases of the synthesized DNA needed to encode the information grows linearly with the amount of information to be stored. One must also consider the indexing information required to reconstruct full-length files from the short DNA segments 240. The indexing information 250 grows only as the logarithm of the number of DNA segments 240 to be indexed. The total amount of synthesized DNA required grows sub-linearly. Increasingly large parts of each one of the DNA segments 240 are needed for indexing, however, and, although it is reasonable to expect synthesis of longer strings to be possible in future, the behavior of the scheme was modeled under the conservative constraint of a constant 114 nucleotides available for both the data and the indexing information 250.
[0079] As the total amount of information increases, the encoding efficiency decreases only slowly (
[0080]
[0081]
[0082]
[0083] In addition to data storage, the teachings of this disclosure can also be used for steganography.