Method and apparatus for encoding information units in code word sequences avoiding reverse complementarity

Abstract

A method (100) and an apparatus (200) for encoding information in codeword sequences are described which help avoid synthesizing reverse complementary nucleotide sequences, making them suitable for synthesizing nucleic acid strands. Multiple codes are provided (102), consisting of a same amount of corresponding code words. No word belongs to more than one code. Each code could completely encode all information units which are encoded using code word sequences generated from the codes. Generating (105) a sequence comprises: selecting (106), from code words of a code, a next code word to be appended to the sequence; appending (108) the next code word if a concatenation of the sequence and the next code word does not contain a reverse complementary of any code symbol sequence that at least partly contains the next code word; and otherwise (109) selecting a corresponding next code word from a different code and repeating the appending.

Claims

1. A computer-implemented method for encoding a plurality of information units in at least one sequence of code words consisting of quaternary code symbols, to be used for storing information in synthesized nucleic acid molecules comprising sequences of nucleotides corresponding to the code words, comprising providing a plurality of codes, consisting of a same amount of corresponding code words, wherein the code words consist of code symbols representing nucleotides and none of the code words belongs to more than one of said codes and each of said codes is capable of completely encoding said plurality of information units, wherein the information units are mapped to corresponding code words; encoding the plurality of information units using at least one code word sequence generated from said plurality of codes during a generating process comprising: selecting, from the plurality of code words of a code of said plurality of codes, a next code word to be appended to the code word sequence; appending the next code word if a concatenation of the code word sequence and the next code word to be appended does not contain at least one section comprising a reverse complementary code symbol sequence of any code symbol sequence of a defined length that at least partly contains the next code word to be appended, wherein the reverse complementary code symbol sequence corresponds to the reverse complementary of a sequence of nucleotides that is represented by the code symbol sequence; and otherwise selecting a corresponding next code word from a different code of said plurality of codes and repeating the appending.

2. The method according to claim 1, wherein the code symbol sequence consists of the next code word to be appended and a most recently appended code word, and the at least one section comprises a reverse complementary code symbol sequence of the next code word to be appended being directly adjacent to a reverse complementary code symbol sequence of the most recently appended code word.

3. The method according to claim 1, wherein the selecting and appending is repeated to generate the code word sequence, until the code word sequence has at least a predefined length or the information units are completely encoded.

4. The method (100) according to claim 1, comprising synthesizing at least one nucleic acid molecule containing a segment wherein a sequence of nucleotides is arranged to correspond to the at least one code word sequence.

5. The method according to claim 4, wherein the synthesizing is performed during the generation of the at least one code word sequence.

6. The method according to claim 1, wherein said code symbol sequence consists of the next code word to be appended and the at least one section comprises the reverse complementary code word of the next code word to be appended.

7. The method according to claim 1, wherein said code symbol sequence consists of the next code word to be appended and a predefined amount of code words previously appended to the code word sequence and said at least one section comprises reverse complementary code words of the next code word to be appended and of the predefined amount of previously appended code words.

8. The method according to claim 1, wherein, if none corresponding code word of the plurality of codes is appendable as the next code word, the most recently appended code word is removed and a corresponding code word of a different code of the plurality of codes is selected as the next code word to be appended.

9. The method according to claim 1, wherein an additional code is provided, consisting of less code words than said plurality of codes, wherein none of the code words of the additional code belongs to any of the plurality of codes, the code words of the additional code have corresponding code words in said plurality of codes and the additional code is capable of incompletely encoding said plurality of information units.

10. The method according to claim 1, wherein the next code word is appended if the concatenation of the code word sequence and the next code word to be appended contains said at least one section and a distance between a location of the most recently appended code word and any location of said at least one section within said code word sequence is greater than a predefined distance.

11. The method according to claim 1, wherein the plurality of codes is generated from an initial plurality of code words having all code words containing a runlength of identical code symbols of more than a maximum runlength, either within a single code word or concatenated with another code word of the initial plurality of code words, removed.

12. An apparatus for encoding a plurality of information units in at least one sequence of code words consisting of quaternary code symbols, to be used for storing information in synthesized nucleic acid molecules comprising sequences of nucleotides corresponding to the code words, comprising a code generator unit configured to provide a plurality of codes, consisting of a same amount of corresponding code words, wherein the code words consist of code symbols representing nucleotides and none of the code words belongs to more than one of said codes and each of said codes is capable of completely encoding said plurality of information units, wherein the information units are mapped to corresponding code words; an information encoder unit configured to encode the plurality of information units using at least one code word sequence generated from said plurality of codes; and a code word sequence generator unit configured to generate said at least one code word sequence at least by: selecting, from the plurality of code words of a code of said plurality of codes, a next code word to be appended to the code word sequence; appending the next code word if a concatenation of the code word sequence and the next code word to be appended does not contain at least one section comprising a reverse complementary code symbol sequence of any code symbol sequence of a defined length that at least partly contains the next code word to be appended, wherein the reverse complementary code symbol sequence corresponds to the reverse complementary of a sequence of nucleotides that is represented by the code symbol sequence; and otherwise selecting a corresponding next code word from a different code of said plurality of codes and repeating the appending.

13. The apparatus according to claim 12, comprising a synthesizer unit configured to synthesize at least one nucleic acid molecule containing a segment wherein a sequence of nucleotides is arranged to correspond to the at least one code word sequence.

14. A non-transitory computer readable storage medium having stored therein instructions enabling encoding of a plurality of information units in at least one sequence of code words consisting of quaternary code symbols, to be used for storing information in synthesized nucleic acid molecules comprising sequences of nucleotides corresponding to the code words, which, when executed by a computer, cause the computer to: provide a plurality of codes, consisting of a same amount of corresponding code words, wherein the code words consist of code symbols representing nucleotides and none of the code words belongs to more than one of said codes and each of said codes is capable of completely encoding said plurality of information units, wherein the information units are mapped to corresponding code words; encode the plurality of information units using at least one code word sequence generated from said plurality of codes; and generate said at least one code word sequence at least by: selecting, from the plurality of code words of a code of said plurality of codes, a next code word to be appended to the code word sequence; appending the next code word if a concatenation of the code word sequence and the next code word to be appended does not contain at least one section comprising a reverse complementary code symbol sequence of any code symbol sequence of a defined length that at least partly contains the next code word to be appended, wherein the reverse complementary code symbol sequence corresponds to the reverse complementary of a sequence of nucleotides that is represented by the code symbol sequence; and otherwise selecting a corresponding next code word from a different code of said plurality of codes and repeating the appending.

15. An apparatus for decoding at least one sequence of code words obtained by the method of claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 schematically illustrates an embodiment of a method for encoding a plurality of information units in at least one sequence of code words consisting of quaternary code symbols, to be used for storing information in synthesized nucleic acid molecules comprising sequences of nucleotides corresponding to the code words;

(2) FIG. 2 schematically illustrates an embodiment of an apparatus for encoding a plurality of information units in at least one sequence of code words consisting of quaternary code symbols, to be used for storing information in synthesized nucleic acid molecules comprising sequences of nucleotides corresponding to the code words;

(3) FIG. 3 schematically illustrates an example of a division of an initial plurality of code words into a plurality of codes; and

(4) FIG. 4 schematically illustrates an example of a decision tree which code to use for which code word in a code word sequence.

DETAILED DESCRIPTION OF EMBODIMENTS

(5) For a better understanding, the present principles will now be explained in more detail in the following description with reference to the drawings. It is understood that the present principles are not limited to these exemplary embodiments and that specified features can also expediently be combined and/or modified without departing from the scope of the present principles as defined in the appended claims.

(6) Referring to FIG. 1, an embodiment of a method 100 for encoding a plurality of information units in at least one sequence of code words consisting of quaternary code symbols, to be used for storing information in synthesized nucleic acid molecules comprising sequences of nucleotides corresponding to the code words, is schematically shown. The term information unit refers a quantum of information such as a bit or a plurality of bits, for example a byte, and information units may refer to any digital representation of user data to be stored, e.g. binary encoded bitstreams.

(7) A plurality of information units is provided 101. Further, a plurality of codes is provided 102, each consisting of a same amount of corresponding code words. The term corresponding code words refers to code words of different codes of the plurality of codes capable of encoding the same information units. None of the code words belongs to more than one of the codes and each of the codes is capable of completely encoding the plurality of information units, wherein the information units are mapped or assigned to the code words. Each of the codes is complete, i.e. each of said information units can be mapped or assigned to a code word of the same code.

(8) The plurality of information units is encoded 103 using at least one code word sequence generated from the plurality of codes. The amount of required code word sequences depends on the amount of information units to be encoded and a length suitable for being used when generating an oligo carrying the encoded information. At least one code word sequence is generated. However, many more may be required for encoding user data, for example an audio or video file.

(9) For each new code word sequence a first code word is selected 104 from the plurality of code words of a code of the plurality of codes.

(10) The generation 105 of each code word sequence further comprises a set of steps which may be executed recursively:

(11) In one step, a next code word to be appended to the code word sequence is selected 106 from the plurality of code words of a code of said plurality of codes. The code may or may not be the same code as the code the first code word was selected from.

(12) In a next step, it is analyzed 107, whether a concatenated sequence, i.e. a concatenation of the code word sequence and the next code word to be appended, contains at least one section that comprises a reverse complementary code symbol sequence of any code symbol sequence of a predefined length that at least partly contains the next code word to be appended, wherein the reverse complementary code symbol sequence corresponds to the reverse complementary of a sequence of nucleotides that is represented by the code symbol sequence.

(13) In any example, the next code word is appended 108 if the code word sequence does not contain such a section.

(14) A section is a sequence of code symbols. It can but does not have to be code word aligned. If the section does not fulfill the described criterion, for example contains the reverse complementary of the most recently appended code word neighboring the reverse complementary of the next code word which has not yet been appended, the next code word is not appended to avoid generation of a long self-reverse complementary portion within the code word sequence being generated.

(15) The two reverse complementary code words being adjacent to each other comprises that one reverse complementary code word follows the other, or vice versa. That is, adjacent refers to before or after. It could be the sequence of two reverse complementary code words, or the reverse complementary of a sequence of the two code words. Further, in the analysis 107 more than one such section could be detected.

(16) Otherwise 109 a corresponding next code word is selected from a different code of said plurality of codes and the appending, i.e. the analysis 107 of the code word sequence and, depending on the result of the analysis, potentially the actual appending 108 of the next code word, is repeated.

(17) If none corresponding code word of the plurality of codes is appendable (i.e. can be appended) as the next code word to the most recently appended code word, the most recent code word, i.e. the most recently appended code word, is removed 110 and a corresponding code word of a different code of the plurality of codes is selected as the next code word.

(18) The selecting and appending is repeated to generate the current code word sequence, until the code word sequence has at least a predefined length or the information units are completely encoded 111.

(19) If the code word sequence has the predefined length but remaining information units still have to be encoded 112, another code word sequence is generated.

(20) In the shown embodiment a nucleic acid molecule is synthesized 113 using the generated code word sequence. The nucleic acid molecule contains a segment wherein a sequence of nucleotides is arranged to correspond to the code word sequence. Many more than one nucleic acid molecules may be generated. The amount of nucleic acid molecules or oligos generated or synthesized by a synthesizer corresponds to the amount of generated code word sequences. At least one nucleic acid molecule is synthesized for each code word sequence. However, multiple nucleic acid molecules may be generated for each or a selected, for example high-priority, subset of the code word sequences. The synthesizing step may be carried out after generation of all code word sequences or after generation of each of the sequences.

(21) The synthesized nucleic acid molecules carry the information encoded by the succession of the nucleotides forming the nucleic acid molecules. These molecules can be stored and the information can be retrieved by reading the sequence of nucleotides using a sequencer and decoding the extracted code words.

(22) Referring to FIG. 2, an embodiment of an apparatus 200 for encoding (or translating or mapping) a plurality of information units in at least one sequence of code words consisting of quaternary code symbols, to be used for storing information in synthesized nucleic acid molecules comprising sequences of nucleotides corresponding to the code words is schematically illustrated. The shown apparatus allows implementing the advantages and characteristics of the described method as part of an apparatus for encoding a plurality of information units in at least one sequence of code words consisting of quaternary code symbols, to be used for storing information in synthesized nucleic acid molecules comprising sequences of nucleotides corresponding to the code words.

(23) The apparatus 200 has an input 201 for receiving information units to be encoded and providing them to an information encoder unit 203. The apparatus 200 further comprises a code generator unit 202 configured to provide a plurality of codes, consisting of a same amount of corresponding code words, wherein none of the code words belongs to more than one of the codes and the codes are capable of completely encoding the plurality of information units. The code generator unit 202 may, for example, be programmable logic circuitry or a processing device arranged to generate the multiple codes, connected to a memory device for storing the codes. The code words may be generated on request by the information encoder unit 202 or may be generated in advance and provided via, for example, a code book table or memory containing the generated code words. Therefore, the code generator unit 202 may also refer to a set of code books provided in one or more memory devices.

(24) The information encoder unit 203 is configured to encode the plurality of information units using at least one code word sequence generated from the plurality of codes.

(25) Further, the apparatus 200 comprises a code word sequence generator unit 204 connected to the information encoder unit 203 and configured to generate each of the at least one code word sequence at least by: selecting, from the plurality of code words of a code of the plurality of codes, a next code word to be appended to the code word sequence;

(26) appending the next code word if a concatenation of the code word sequence and the next code word does not contain a section comprising a reverse complementary code symbol sequence of any code symbol sequence of a defined length that at least partly contains the next code word to be appended; and otherwise selecting a corresponding next code word from a different code of the plurality of codes and repeating the appending.

(27) The code generator unit 202, the information encoder unit 203 and the code word sequence generator unit 204 may, for example, be provided as separate devices, jointly as at least one device or logic circuitry, or functionality carried out by a microprocessor, microcontroller or other processing device, computer or other programmable apparatus.

(28) In the embodiment shown in FIG. 2, the illustrated apparatus 200 further comprises a synthesizer unit 205 connected to receive the generated code word sequences. It is configured to synthesize nucleic acid molecules, for example DNA or RNA strands, each containing a segment wherein a sequence of nucleotides is arranged to correspond to a particular code word sequence. In another embodiment, the apparatus does not comprise the synthesizer unit but is connected or connectable to it by means of an interface.

(29) For storing digital data in a nucleic acid strand, such as a DNA strand, a code is generated by the code word generator unit 202. The code assigns to each digital source value or information unit a specific sequence of code symbols. Each code symbol may later be translated into a corresponding nucleotide of the nucleic acid strand. For example, when using bytes as source values, i.e. information units, the code is chosen to define the assignment of 256 numbers to different sequences of code symbols, i.e. different code words. Each code word will be translated or transformed into a corresponding nucleotide sequence. Other input values may be used.

(30) In order to explain in more detail the code generation performed by the code generator unit 202, an example of a division of an initial plurality of code words into a plurality of codes is schematically illustrated in FIG. 3.

(31) A succession of n nucleotides can hold n.sup.4 different sequences of nucleotides (e.g. n=4: 256, n=5: 1024, n=6: 4096). As shown, a corresponding set S.sub.1 of code words, i.e. different sequences of code symbols, has all items or code words removed, which might otherwise lead to runlength problems when concatenating the code words to form a code word sequence.

(32) Therefore, all code words containing a runlength greater than three are removed. Further, all code words which might lead to runlength problems when being concatenated are removed. A possible rule applied by the code word generator unit 202 for avoiding concatenation problems is to allow only a maximum of two identical code symbols, respectively nucleotides, at the beginning of an item or code word and no two identical nucleotides at the end of an item or code word. Other rules may be applied instead. With the rule described above, the set S.sub.1 (for example, 1024 different code symbol sequences or code words for n=5) is reduced to a subset S.sub.2, for example containing 720 code words if S.sub.1 has a size of 1024). Code words of the subset S.sub.2 can be chosen to form a code. If, for example, a code with 256 code values is required, the code generator unit 202 generates two codes (C.sub.1 and C.sub.2) each forming a subset of S.sub.2. The generated codes are complete, as their number of code words is identical to the number of source values or information units. Further, as their intersection is empty, for each source value any of the available codes C.sub.1 and C.sub.2 can be used during the encoding process.

(33) Additionally referring to FIG. 4, an example of a decision tree which code to use for which code word in a code word sequence is schematically shown. A way of selecting the right code for each source value or information unit such that self-reverse complementariness is avoided is described. A code word sequence for an oligo is generated such that it consists of a sequence of m code words, each consisting of n code symbols which correspond to n nucleotides or nucleobases when synthesizing a corresponding oligo.

(34) The decision tree shown in FIG. 4 visualizes which of two available codes C.sub.1 and C.sub.2 can be chosen when generating the code word sequence. In case of the two available codes C.sub.1 and C.sub.2, the possible number of code combinations, which corresponds to the number of paths from left to right, for the m code words is 2.sup.m.

(35) The solution according to the present principles allows determining, if a path represents a code word sequence, respectively an oligo, with no self-reverse complementary sections. The number of paths describing oligos without self-reverse complementary sections is usually much larger than the number of paths with self-reverse complementary sections.

(36) For example, when the ith code word (cw.sub.i) shall be appended to the already created code word sequence containing the code words cw.sub.0 to all locations (L.sub.i) are determined where the self-reverse complementary version of the code word can be found in a concatenated sequence consisting of the already created code word sequence and the ith code word to be appended. These locations are compared with the locations L.sub.i+1 within the same concatenated sequence but found for the code word before (cw.sub.i+1). If the difference between any location in and L.sub.i+1 is identical to n, appending the code word cw.sub.i would lead to two self-reverse complementary regions or sections potentially having a length of up to 3n2 code symbols (as there may probably be self-reverse complementary code word fragments of length n1 neighboring a self-reverse complementary code word of length n code symbols) and is avoided.

(37) A path in the tree where the selected codes fulfill the condition that no two self-reverse complimentary sections exist having the length 2n or above can be found in the following way: Each time a code word (cw.sub.i) is to be appended to the already created part of a code word sequence, respectively oligo part, it is checked for the first code C.sub.1 in the way described above, if appending of the code word is allowed. If it is not allowed, all other available codes are tried, i.e. C.sub.2 in the shown example, until a code word cw.sub.i is found, where the appending is allowed. Then the related code word, respectively code, is selected and the processing is continued with code word cw.sub.i+1.

(38) If for none of the codes C.sub.1 and C.sub.2 a code word fulfilling the related condition is found, the process is continued at code word cw.sub.i+1 again by trying the next available code there. The process is finished, when all m code words were chosen fulfilling the condition described above.

(39) In the example, the codes C.sub.1 and C.sub.2 were taken from the set S.sub.2 shown in FIG. 3. Referring again to FIG. 3, if the number of code words in the set S.sub.2 cannot be distributed completely to different codes (in the described example: 7202256=208) code words remain unassigned. The unassigned code words can be used to form an incomplete code.

(40) The described solution allows generating code word sequences which can be synthesized as nucleic acid strands which do not contain long self-reverse complementary sections. Self-reverse nucleic acid strands can be avoided and problems with sequencing synthesized nucleic acid strands can be reduced, thereby improving their suitability to serve as a bio-chemical storage for any kind of user data.

(41) As will be appreciated by one skilled in the art, aspects of the present principles can be embodied as an apparatus, a system, method or computer readable medium. Accordingly, aspects of the present principles can take the form of a hardware embodiment, a software embodiment or an embodiment combining software and hardware aspects. Furthermore, aspects of the present principles can take the form of a computer readable storage medium. Any combination of one or more computer readable storage medium(s) may be utilized.

(42) Aspects of the present principles may, for example, at least partly be implemented in a computer program comprising code portions for performing steps of the method according to an embodiment of the present principles when run on a programmable apparatus or enabling a programmable apparatus to perform functions of an apparatus or system according to an embodiment of the present principles.

(43) Further, any shown connection may be a direct or an indirect connection. Furthermore, those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or impose an alternate decomposition of functionality upon various logic blocks.

Method and apparatus for encoding information units in code word sequences avoiding reverse complementarity

Assignee

Inventors

Cpc classification

Classification Explorer

H03M7/46

ELECTRICITY

Classification Explorer

H03M5/145

ELECTRICITY

Classification Explorer

G06N3/123

PHYSICS

International classification

Classification Explorer

H03M7/38

ELECTRICITY

Classification Explorer

H03M7/46

ELECTRICITY

Abstract

Claims

Description