STEGANOGRAPHIC EMBEDDING OF INFORMATION IN CODING GENES

20220238184 · 2022-07-28

    Abstract

    The invention relates to the storage of information in nucleic acid sequences. The invention also relates to nucleic acid sequences containing desired information and to the design, production or use of sequences of this type.

    Claims

    1.-26. (canceled)

    27. A method for producing an engineered vector comprising an information containing nucleic acid molecule, the method comprising the steps: (a) generating the information containing nucleic acid molecule; and (b) introducing the information containing nucleic acid molecule into a vector, thereby generating the engineered vector, wherein the nucleotide sequence of a starting nucleic acid molecule encoding a selected protein is altered to generate the nucleotide sequence of the information containing nucleic acid molecule, wherein the information containing nucleic acid molecule contains encoded information, wherein the information containing nucleic acid molecule and the starting nucleic acid molecule encode the selected protein with no differences in amino acid sequence, wherein the only codons altered to generate the information containing nucleic acid molecule and read to disclose the information are codons for the following eight amino acids: arginine, valine, glycine, alanine, threonine, serine, leucine, and proline, wherein the encoded information is read from 5′ to 3′ and each codon encoding the eight amino acids is read as a zero or a one, wherein a set of zeros and ones represents a character of information, and wherein the expression level of the encoded protein produced in a human cell from the information containing nucleic acid molecule is not measurably decreased compared to the expression level of the starting nucleic acid molecule.

    28. The method of claim 27, wherein the most prevalent codon in FIG. 3 for each of the eight amino acids is read as a zero and the second most prevalent codon in FIG. 3 for each of the eight amino acids is read as a one.

    29. The method of claim 27, wherein more than one codon is selected to represent a zero and more than one codon is selected to represent a one.

    30. The method of claim 29, wherein codons selected to represent zeros and ones alternate based upon codon usage preference for a particular organism.

    31. The method of claim 30, wherein the codons for serine are read as the digits of either zero or one, wherein (i) AGC, TCT, and AGT are each read as a zero and TCC, TCA, and TCG are each read as a one.

    32. The method of claim 30, wherein the first most preferred codon encoding for an amino acid is read as a zero and the second most preferred amino acid is read as a one.

    33. The method of claim 30, wherein the first most preferred codon encoding for the amino acid serine is AGC and the second most preferred codon encoding for the amino acid serine is TCC.

    34. The method of claim 30, wherein the third most preferred codon encoding for an amino acid is read as a one and the fourth most preferred amino acid is read as a zero.

    35. The method of claim 34, wherein the third most preferred codon encoding for the amino acid serine is TCT and the fourth most preferred codon encoding for the amino acid serine is TCA.

    36. The method of claim 27, wherein the starting nucleic acid molecule is codon optimized for a particular organism.

    37. The method of claim 27, wherein the zeros and ones are read in groups of six or eight to represent a single character.

    38. The method of claim 27, further comprising introducing the engineered vector into a cell.

    39. A cell produced by the method of claim 38.

    40. The method of claim 38, wherein the cell is a human cell.

    41. An engineered vector produced by the method of claim 27.

    Description

    FIGURES

    [0047] FIG. 1: Extract from the international ASCII table.

    [0048] FIG. 2A shows the test gene used in Example 1 (mouse telomerase), optimized for H. sapiens (A) (SEQ ID NO:9)

    [0049] FIG. 2B shows the encoded protein for the test gene (mouse telomerase) used in Example 1 (SEQ ID NO:10)

    [0050] FIG. 3: Codon usage table (CUT) for Homo sapiens

    [0051] FIG. 4: Codon order of the permutations

    [0052] FIG. 5 shows an analysis of the modified sequence obtained in Example 1 in comparison with the starting sequence

    [0053] FIG. 6 shows an alignment of the sequences of eGFP(opt) (SEQ ID NO:12) and eGFP(msg) (SEQ ID NO:13) from Example 3. The translated amino acid sequence of the protein eGFP is shown above the alignment (SEQ ID NO:11). Silent substitutions arising from the use of alternative codons on embedding the message “AEQUOREA VICTORIA.” in eGFP(msg) are highlighted in black. Cloning sites are underlined, the vector content of the 6×His-tag (SEQ ID No:18) is also shown downstream of the 3′ HindIII restriction site.

    [0054] FIG. 7 shows the results of analysis of the expression of the genes eGFP(opt) and eGFP(msg) from Example 3 by Coomassie gel, Western blot (with a GFP-specific antibody) and fluorescence analysis.

    [0055] FIG. 8 shows an alignment of the sequences of EMG1(opt) (SEQ ID NO:15), EMG1(msg) (SEQ ID NO:16) and EMG1(enc) (SEQ ID NO:17) from Example 4. The translated amino acid sequence of the protein EMG1 is shown above the alignment (SEQ ID NO:14). Silent substitutions arising from the use of alternative codons on embedding the message “GENEART AG U.S. Pat. No. 1,234,567” in EMG1(msg) and the encrypted message “:JQWF&G%DY%$4Y#′XE%87G;K” in EMG1(enc) are highlighted in black. Cloning sites are underlined.

    [0056] FIG. 9 shows the result of the analysis of the expression of EMG1(opt), EMG1(msg) and EMG1(enc) by means of Western blot analysis using a His-specific antibody.

    EXAMPLES

    Example 1: Encryption of “GENE” in the N Terminus of M. musculus Telomerase (Optimized for H. sapiens)

    [0057] The N terminus of M. musculus telomerase was selected as the medium for encrypting the message “GENE”. M. musculus telomerase (1251 AA) comprises 360 four-fold degenerate, information-containing codons (ICCs) and 372 six-fold degenerate ICCs. The open reading frame (ORF) of the gene is first optimized in the conventional manner, i.e. codon selection is adapted to the specific circumstances of the target organism.

    [0058] Below, consideration is given only to those codons which are 4- and 6-fold degenerate, thus for the amino acids VPTAG (SEQ ID NO:19) (each 4 codons) and LSR (each 6 codons). These are designated ICC (information containing codons). (Amino acids for which there are only 2 or 3 codons (DEKNIQHCYF (SEQ ID NO:20)) may in principle also be used, but since gene performance suffers more severely, they are disregarded in the present example.)
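
    The number of bits a gene can carry under this definition equals its number of ICCs. The following is a minimal sketch of that count, assuming the protein is given as a one-letter amino acid string; the names `FOUR_FOLD`, `SIX_FOLD` and `count_iccs` are illustrative and do not appear in the original text.

```python
# Amino acids whose codons are 4-fold (V, P, T, A, G) or 6-fold (L, S, R) degenerate.
FOUR_FOLD = set("VPTAG")
SIX_FOLD = set("LSR")


def count_iccs(protein: str) -> dict:
    """Count information-containing codons (ICCs) in a one-letter protein sequence."""
    four = sum(1 for aa in protein if aa in FOUR_FOLD)
    six = sum(1 for aa in protein if aa in SIX_FOLD)
    return {"four_fold": four, "six_fold": six, "total": four + six}


# The two 20-residue segments of the telomerase N terminus used in Example 1.
print(count_iccs("MDAMKRGLCCVLLLCGAVFV"))
print(count_iccs("SPSEITRAPRCPAVRSLLRS"))
```

    With one bit per ICC and 6 bits per character, a segment with n ICCs can hold at most n // 6 characters.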

    [0059] The secret information (under certain circumstances previously encrypted) is now broken down into bits. 6 bits (2⁶ = 64 states) per character are sufficient here for letters, numbers and special characters; ideally the ASCII characters from 32 = 0010 0000 (space) to 95 = 0101 1111 (underscore) are used. This range includes capital letters, numbers and the most important special characters (see FIG. 1). The eight-digit ASCII code is reduced to a 6-bit code by the conventional operation: 6-bit value = 8-bit value − 32, or conversely 8-bit value = 6-bit value + 32.
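
    The reduction from 8 to 6 bits can be sketched as follows. This is an illustration only; the function names `to_six_bit` and `from_six_bit` are assumptions introduced here, and characters outside the ASCII range 32 to 95 are simply rejected.

```python
def to_six_bit(message: str) -> str:
    """Encode each character (ASCII 32..95) as a 6-bit group: value = ASCII code - 32."""
    groups = []
    for ch in message:
        code = ord(ch)
        if not 32 <= code <= 95:
            raise ValueError(f"character {ch!r} is outside ASCII 32-95")
        groups.append(format(code - 32, "06b"))
    return "".join(groups)


def from_six_bit(bits: str) -> str:
    """Decode 6-bit groups back to characters: ASCII code = value + 32."""
    if len(bits) % 6:
        raise ValueError("bit string length must be a multiple of 6")
    return "".join(chr(int(bits[i:i + 6], 2) + 32) for i in range(0, len(bits), 6))


print(to_six_bit("GENE"))  # 24 bits, four groups of six
```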

    [0060] The CUT below for Homo sapiens is used for encryption in this example:

    TABLE-US-00001 ICC CUT H. sapiens (sorted (1) by fraction and (2) alphabetically)

    AA  Codon  Fraction
    A   GCC    0.40
    A   GCT    0.26
    A   GCA    0.23
    A   GCG    0.11
    G   GGC    0.34
    G   GGA    0.25
    G   GGG    0.25
    G   GGT    0.16
    P   CCC    0.33
    P   CCT    0.28
    P   CCA    0.27
    P   CCG    0.11
    T   ACC    0.36
    T   ACA    0.28
    T   ACT    0.24
    T   ACG    0.11
    V   GTG    0.46
    V   GTC    0.24
    V   GTT    0.18
    V   GTA    0.12
    L   CTG    0.40
    L   CTC    0.20
    L   CTT    0.13
    L   TTG    0.13
    L   CTA    0.08
    L   TTA    0.07
    R   CGG    0.21
    R   AGA    0.20
    R   AGG    0.20
    R   CGC    0.19
    R   CGA    0.11
    R   CGT    0.08
    S   AGC    0.24
    S   TCC    0.22
    S   TCT    0.18
    S   AGT    0.15
    S   TCA    0.15
    S   TCG    0.06

    [0061] On the basis of the species-specific codon usage table (CUT), all ICCs from 5′ to 3′ are successively modified and the additional information introduced bit by bit. The following applies:

    [0062] Binary 1=first or third best codon

    [0063] Binary 0=second or fourth best codon

    [0064] The “first best” to “fourth best” codon weighting here reflects the frequency with which the respective codon is used in the target organism for encoding its amino acid. A database on this subject may be found at: http://www.kazusa.or.jp/codon/.

    [0065] The alternative of two possible codons per bit makes it possible, most probably in every case, to avoid unwanted sequence motifs during optimization. ICC-adjacent non-ICC codons may, of course, also be modified in order to exclude specific motifs.
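
    The embedding rule can be sketched as follows, using the codon order of the H. sapiens ICC table above. This is a simplified illustration and not the procedure actually used to build the sequences of the examples: it always writes binary 1 as the best codon and binary 0 as the second best, and it never falls back to the third or fourth best codon to avoid a motif. The names `RANKED`, `CODON_TO_AA` and `embed` are assumptions introduced here.

```python
# ICC codons for H. sapiens, best first (order taken from TABLE-US-00001).
RANKED = {
    "A": ["GCC", "GCT", "GCA", "GCG"],
    "G": ["GGC", "GGA", "GGG", "GGT"],
    "P": ["CCC", "CCT", "CCA", "CCG"],
    "T": ["ACC", "ACA", "ACT", "ACG"],
    "V": ["GTG", "GTC", "GTT", "GTA"],
    "L": ["CTG", "CTC", "CTT", "TTG", "CTA", "TTA"],
    "R": ["CGG", "AGA", "AGG", "CGC", "CGA", "CGT"],
    "S": ["AGC", "TCC", "TCT", "AGT", "TCA", "TCG"],
}
CODON_TO_AA = {codon: aa for aa, codons in RANKED.items() for codon in codons}


def embed(dna: str, bits: str) -> str:
    """Write bits into successive ICCs from 5' to 3' without changing the protein.

    Bit 1 selects the best codon, bit 0 the second best (simplified rule).
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    pending = list(bits)
    for i, codon in enumerate(codons):
        aa = CODON_TO_AA.get(codon)
        if aa is None or not pending:
            continue  # non-ICC codon, or the message is already fully written
        bit = pending.pop(0)
        codons[i] = RANKED[aa][0] if bit == "1" else RANKED[aa][1]
    if pending:
        raise ValueError("not enough ICCs for the message")
    return "".join(codons)


old = "ATGGATGCAATGAAGAGGGGCCTGTGCTGCGTGCTGCTGCTGTGTGGCGCCGTGTTTGTG"
print(embed(old, "100111100101"))  # 12 bits into the first 12 ICCs
```

    A production implementation would additionally check each candidate codon against a list of unwanted motifs and switch to the third or fourth best codon of the same bit value where necessary.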

    [0066] A defined CUT is necessary for definite encryption and decryption. However, especially for little investigated organisms, CUTs will still change in future. It is therefore necessary in many cases to deposit a dated CUT. However, only the order of the ICC codons is of relevance, not the actual frequency figures.

    [0067] The order may be deposited on paper or notarially. It is, of course, possible also to accommodate these data in the DNA itself, for example the 3′ UTR (immediately downstream from the gene). 22 nt are required for deposition of the ICC CUT (see Example 2).

    [0068] However, for the commonest target organisms (mammals, crop plants, E. coli, baker's yeast etc.), the codon tables are so complete that they will not change any further. If two or more codons have the same frequency in the CUT, the codons in question are sorted alphabetically: A>C>G>T.

    [0069] The end of a message may be marked with an agreed stop character for example “11 1111”, corresponding to the underscore character.
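
    Reading a message back, including the agreed stop character, can be sketched as follows. The sketch assumes the H. sapiens codon order of the ICC table above, 6-bit characters offset by 32, and that only codons of rank 1 to 4 carry a bit; how rank 5 and 6 codons are treated is not specified in the text, so ignoring them is an assumption. The names `RANKED`, `RANK_OF`, `STOP` and `extract` are illustrative.

```python
# ICC codons for H. sapiens, best first (order taken from TABLE-US-00001).
RANKED = {
    "A": ["GCC", "GCT", "GCA", "GCG"],
    "G": ["GGC", "GGA", "GGG", "GGT"],
    "P": ["CCC", "CCT", "CCA", "CCG"],
    "T": ["ACC", "ACA", "ACT", "ACG"],
    "V": ["GTG", "GTC", "GTT", "GTA"],
    "L": ["CTG", "CTC", "CTT", "TTG", "CTA", "TTA"],
    "R": ["CGG", "AGA", "AGG", "CGC", "CGA", "CGT"],
    "S": ["AGC", "TCC", "TCT", "AGT", "TCA", "TCG"],
}
RANK_OF = {c: i + 1 for codons in RANKED.values() for i, c in enumerate(codons)}
STOP = "111111"  # agreed end-of-message marker (underscore in the 6-bit code)


def extract(dna: str) -> str:
    """Read ICCs from 5' to 3' and decode 6-bit characters until the stop marker."""
    bits = ""
    for i in range(0, len(dna) - 2, 3):
        rank = RANK_OF.get(dna[i:i + 3])
        if rank in (1, 3):
            bits += "1"  # first or third best codon
        elif rank in (2, 4):
            bits += "0"  # second or fourth best codon
        # non-ICC codons and ranks 5-6 are skipped (assumption of this sketch)
    chars = []
    for i in range(0, len(bits) - 5, 6):
        group = bits[i:i + 6]
        if group == STOP:
            break
        chars.append(chr(int(group, 2) + 32))
    return "".join(chars)
```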

    [0070] The strategy of defining the first or third best codon as binary 1 and the second or fourth best codon as binary 0, i.e. in general of working with a codon usage table, gives rise to a gene which is firstly largely optimized and thus functions well in the target organism and secondly permits a watermark.

    [0071] Alternatively, it is in principle also possible to define all amino acids for which there are two or more codons as ICC and to agree on the following coding principle for steganographic data embedding:

    [0072] Binary 1=G or C at codon position 3

    [0073] Binary 0=A or T at codon position 3

    [0074] This is possible for the 18 amino acids GEDAVRSKNTIQHPLCYF (SEQ ID NO:21). (In the above method based on a quality ranking, there are only 8 ICCs.) In this manner, more than twice as much information may be accommodated in a gene and a definite CUT need not be deposited in any case. The disadvantage of this method is, however, that the resultant gene is not optimized or is scarcely so.
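
    Reading bits under this alternative principle needs no codon usage table, as the following sketch shows. It assumes the standard genetic code, in which methionine (ATG) and tryptophan (TGG) are the only single-codon amino acids; skipping stop codons is a further assumption, since the text lists only amino acids. The names `NON_CARRIERS` and `read_gc3_bits` are illustrative.

```python
# Codons that carry no bit: Met and Trp have a single codon each; stop codons
# are skipped as well (assumption of this sketch).
NON_CARRIERS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def read_gc3_bits(dna: str) -> str:
    """Return one bit per carrier codon: 1 if position 3 is G or C, 0 if A or T."""
    bits = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in NON_CARRIERS:
            continue
        bits.append("1" if codon[2] in "GC" else "0")
    return "".join(bits)


# Met Ala Ala Lys Lys Trp Phe Phe Stop
print(read_gc3_bits("ATGGCCGCAAAAAAGTGGTTTTTCTAA"))
```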

    [0075] In the present example, the message “GENE” was encrypted in the N terminus of M. musculus telomerase. This message contains 4×6=24 bits.

    TABLE-US-00002
                            G          E          N          E
    “GENE”, binary 8 bit:   0100 0111  0100 0101  0100 1110  0100 0101
                            (71)       (69)       (78)       (69)
    8 bit − 32:             (39)       (37)       (46)       (37)
    “GENE”, binary 6 bit:   10 0111    10 0101    10 1110    10 0101

    [0076] The 24 bits were written into the first 24 four-fold or six-fold degenerate ICCs in the N terminus of the telomerase; 10 of these ICCs had to be modified:

    TABLE-US-00003
    (SEQ ID No: 1)  M  D  A  M  K  R  G  L  C  C  V  L  L  L  C  G  A  V  F  V  (12 ICCs)
    Old sequence    ATGGATGCAATGAAGAGGGGCCTGTGCTGCGTGCTGCTGCTGTGTGGCGCCGTGTTTGTG  (SEQ ID No: 2)
    Old ranking           3        3  1  1        1  1  1  1     1  1  1     1
    Message bit           1        0  0  1        1  1  1  0     0  1  0     1
    New ranking           1        2  2  1        1  1  1  2     2  1  2     1
    New sequence    [00001] embedded image

    (SEQ ID No: 4)  S  P  S  E  I  T  R  A  P  R  C  P  A  V  R  S  L  L  R  S  (17 ICCs)
    Old sequence    AGCCCTAGCGAGATCACCAGAGCCCCCAGATGCCCTGCCGTGAGAAGCCTGCTGCGGAGC  (SEQ ID No: 5)
    Old ranking     1  2  1        1  2  1  1  2     2  1  1  2
    Message bit     1  0  1        1  1  0  1  0     0  1  0  1
    New ranking     1  2  1        1  1  2  1  2     2  1  2  1
    New sequence    [00002] embedded image

    [0077] Neither unwanted motifs nor an excessively high GC content occurred during coding. It was therefore not necessary to make use of the third best and fourth best codons. FIG. 5 shows a comparison of the analysis of the starting sequence and of the modified sequence.

    Example 2: Encryption of the Codon Usage Table for Escherichia coli and Deposition as a Nucleic Acid Sequence

    [0078] It is essential to know the coding used in order to decrypt the information embedded in the genes. This coding is the key for decoding and may preferably consist of the codon usage table predetermined by the organism. In principle, however, the key used may be selected at will from approx. 5.48×10¹⁹ possible combinations.

    [0079] It is possible likewise to encode this key in the form of a specific nucleotide sequence and so deposit it, for example, within the genome.

    [0080] The codon usage table is firstly sorted alphabetically by amino acid and then the codons of an amino acid are sorted alphabetically by codon:

    TABLE-US-00004

    Amino acid  Codon  Frequency  Rank
    A           GCA    0.22       3
    A           GCC    0.27       2
    A           GCG    0.35       1
    A           GCT    0.16       4
    C           TGC    0.55       1
    C           TGT    0.45       2
    D           GAC    0.37       2
    D           GAT    0.63       1
    E           GAA    0.68       1
    E           GAG    0.32       2
    F           TTC    0.42       2
    F           TTT    0.58       1
    G           GGA    0.12       4
    G           GGC    0.38       1
    G           GGG    0.16       3
    G           GGT    0.33       2
    H           CAC    0.42       2
    H           CAT    0.58       1
    I           ATA    0.09       3
    I           ATC    0.40       2
    I           ATT    0.50       1
    K           AAA    0.76       1
    K           AAG    0.24       2
    L           CTA    0.04       6
    L           CTC    0.10       5
    L           CTG    0.49       1
    L           CTT    0.11       4
    L           TTA    0.13       2
    L           TTG    0.13       3
    M           ATG    1.00       1
    N           AAC    0.53       1
    N           AAT    0.47       2
    P           CCA    0.19       2
    P           CCC    0.13       4
    P           CCG    0.51       1
    P           CCT    0.17       3
    Q           CAA    0.33       2
    Q           CAG    0.67       1
    R           AGA    0.05       5
    R           AGG    0.03       6
    R           CGA    0.07       4
    R           CGC    0.37       1
    R           CGG    0.11       3
    R           CGT    0.36       2
    S           AGC    0.27       1
    S           AGT    0.16       2
    S           TCA    0.14       6
    S           TCC    0.15       3
    S           TCG    0.15       4
    S           TCT    0.15       5
    T           ACA    0.15       4
    T           ACC    0.41       1
    T           ACG    0.27       2
    T           ACT    0.17       3
    V           GTA    0.16       4
    V           GTC    0.21       3
    V           GTG    0.37       1
    V           GTT    0.26       2
    W           TGG    1.00       1
    Y           TAC    0.43       2
    Y           TAT    0.57       1
    Stop        TAA    0.59       1
    Stop        TAG    0.09       3
    Stop        TGA    0.32       2

    [0081] The “Frequency” column contains the proportion of the respective codon relative to all codons for the respective amino acid, while the “Rank” column contains the rank of the respective codon by frequency within its amino acid (1 = most frequent). Where there are two or more identical frequency values within an amino acid, the ranks of the equally frequent codons are allocated alphabetically. The “Rank” column thus contains the key.

    [0082] In the example, the alphabetically sorted codons for alanine (GCA, GCC, GCG, GCT) have the order of precedence 3, 2, 1, 4 or 3214.
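
    Deriving the order of precedence from the frequencies can be sketched as follows. The function name `order_of_precedence` is an assumption introduced here; ties are broken alphabetically, as described for the “Rank” column.

```python
def order_of_precedence(fractions: dict) -> str:
    """Return the ranks of the alphabetically sorted codons as a digit string.

    Rank 1 is the most frequent codon; equal frequencies are ranked alphabetically.
    """
    best_first = sorted(fractions, key=lambda codon: (-fractions[codon], codon))
    rank = {codon: i + 1 for i, codon in enumerate(best_first)}
    return "".join(str(rank[codon]) for codon in sorted(fractions))


alanine = {"GCA": 0.22, "GCC": 0.27, "GCG": 0.35, "GCT": 0.16}
serine = {"AGC": 0.27, "AGT": 0.16, "TCA": 0.14,
          "TCC": 0.15, "TCG": 0.15, "TCT": 0.15}
print(order_of_precedence(alanine), order_of_precedence(serine))
```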

    [0083] For amino acids with one codon (M, W), there is only one possibility for order of precedence (1).

    [0084] For amino acids with two codons (C, D, E, F, H, K, N, Q, Y), there are two possibilities for order of precedence (12, 21).

    [0085] For amino acids with three codons (I, stop), there are six possibilities for order of precedence (123, 132, 213, 231, 312, 321).

    [0086] For amino acids with four codons (A, G, P, T, V), there are 24 possibilities for order of precedence (1234, 1243, 1324, . . . 4231, 4312, 4321).

    [0087] For amino acids with six codons (L, R, S), there are 720 possibilities for order of precedence (123456, 123465, 123546, . . . 654231, 654312, 654321).

    [0088] On the basis of these figures, it becomes clear that there are 1² × 2⁹ × 6² × 24⁵ × 720³ = 5.48 × 10¹⁹ different combinations of order of precedence. This is thus the number of possible keys.
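
    The size of the key space can be checked with a few lines of arithmetic. The sketch below takes the class sizes from paragraphs [0083] to [0087] (two entries with one codon, nine with two, two with three, five with four and three with six); `CLASS_SIZES` and `key_space` are illustrative names.

```python
from math import factorial, prod

# Codons per entry -> number of amino acids (or stop) in that class, per Example 2.
CLASS_SIZES = {1: 2, 2: 9, 3: 2, 4: 5, 6: 3}


def key_space(class_sizes: dict) -> int:
    """Multiply n! possible orders of precedence over every amino acid in every class."""
    return prod(factorial(n) ** count for n, count in class_sizes.items())


print(f"{key_space(CLASS_SIZES):.2e}")
```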

    [0089] For each amino acid group (one, two, three, four, six codons), an ascending list of all possible orders of precedence is drawn up and consecutively numbered in binary. This is shown by way of example for the 24 possible orders of precedence of the amino acids with four codons (A, G, P, T, V):

    TABLE-US-00005

    Order of precedence  Decimal  Binary
    1234                 00       00000
    1243                 01       00001
    1324                 02       00010
    1342                 03       00011
    1423                 04       00100
    1432                 05       00101
    2134                 06       00110
    2143                 07       00111
    2314                 08       01000
    2341                 09       01001
    2413                 10       01010
    2431                 11       01011
    3124                 12       01100
    3142                 13       01101
    3214                 14       01110
    3241                 15       01111
    3412                 16       10000
    3421                 17       10001
    4123                 18       10010
    4132                 19       10011
    4213                 20       10100
    4231                 21       10101
    4312                 22       10110
    4321                 23       10111

    [0090] 0 binary digits are required for the binary coding of the order of precedence of amino acids with one codon.

    [0091] 1 binary digit (decimal 0=binary 0 & decimal 1=binary 1) is required for the binary coding of the order of precedence of amino acids with two codons.

    [0092] 3 binary digits (decimal 0=binary 000 & decimal 5=binary 101) are required for the binary coding of the order of precedence of amino acids with three codons.

    [0093] 5 binary digits (decimal 0=binary 00000 & decimal 23=binary 10111) are required for the binary coding of the order of precedence of amino acids with four codons.

    [0094] 10 binary digits (decimal 0=binary 0000000000 & decimal 719=binary 1011001111) are required for the binary coding of the order of precedence of amino acids with six codons.
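
    The consecutive numbering of an ascending list of orders of precedence corresponds to the lexicographic index of a permutation, so the binary code can be computed directly instead of being looked up. A minimal sketch, with the illustrative names `permutation_index`, `bits_needed` and `encode_order`:

```python
from math import factorial


def permutation_index(order: str) -> int:
    """Zero-based position of `order` in the ascending list of all its permutations."""
    digits = list(order)
    index = 0
    for i, digit in enumerate(digits):
        smaller_later = sum(1 for later in digits[i + 1:] if later < digit)
        index += smaller_later * factorial(len(digits) - 1 - i)
    return index


def bits_needed(n_codons: int) -> int:
    """Binary digits needed to number all n! orders of precedence from zero."""
    return (factorial(n_codons) - 1).bit_length()


def encode_order(order: str) -> str:
    """Fixed-width binary code of an order of precedence ('' for a single codon)."""
    width = bits_needed(len(order))
    return format(permutation_index(order), f"0{width}b") if width else ""


print([bits_needed(n) for n in (1, 2, 3, 4, 6)])
print(encode_order("3214"), encode_order("651423"))
```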

    [0095] A specific binary number may accordingly be assigned to each order of precedence of the alphabetically sorted amino acids. The entirety of the binary numbers represents the specific codon usage table which is used for the steganographic method.

    TABLE-US-00006

    Amino acid  Order of precedence  Binary      Only 4 fold & 6 fold
    A           3214                 01110       01110
    C           12                   0
    D           21                   1
    E           12                   0
    F           21                   1
    G           4132                 10011       10011
    H           21                   1
    I           321                  101
    K           12                   0
    L           651423               1010111100  1010111100
    M           1
    N           12                   0
    P           2413                 01010       01010
    Q           21                   1
    R           564132               1001010011  1001010011
    S           126345               0000010010  0000010010
    T           4123                 10010       10010
    V           4312                 10110       10110
    W           1
    Y           21                   1
    Stop        132                  001

    [0096] The entire 70-digit binary sequence of the codon usage table of this example accordingly reads:

    [0097] 0111001011001111010101011110000101011001010011000001001010010101101001

    [0098] In order to translate this binary sequence into a nucleotide sequence, each nucleobase is assigned a fixed, two-digit binary value: A=00, C=01, G=10, T=11

    [0099] Using this key, the binary sequence can be translated into a 35-digit nucleotide sequence.

    TABLE-US-00007 (SEQ ID NO: 7) CTAGTATTCCCCTGACCCGCCATAACAGGCCCGGC
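
    The translation of the key into nucleotides can be sketched as follows, starting from the per-amino-acid binary codes of the table above. The names `KEY_PARTS`, `BASE_FOR` and `bits_to_nt` are illustrative; padding an odd-length bit string with one trailing zero is an assumption of this sketch.

```python
# Binary codes in alphabetical amino acid order: A C D E F G H I K L M N P Q R S T V W Y Stop
KEY_PARTS = ["01110", "0", "1", "0", "1", "10011", "1", "101", "0", "1010111100",
             "", "0", "01010", "1", "1001010011", "0000010010", "10010", "10110",
             "", "1", "001"]

BASE_FOR = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bits_to_nt(bits: str) -> str:
    """Translate a bit string into nucleotides, two bits per base."""
    if len(bits) % 2:
        bits += "0"  # assumption: pad with one trailing zero to complete the last base
    return "".join(BASE_FOR[bits[i:i + 2]] for i in range(0, len(bits), 2))


key_bits = "".join(KEY_PARTS)
print(len(key_bits), bits_to_nt(key_bits))
```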

    [0100] If only amino acids with four or six codons are used during the steganographic embedding of information into the coding sequence, it is sufficient to restrict oneself to these amino acids when depositing the codon usage table. The relevant binary numbers are stated in the above table in the “Only 4 fold & 6 fold” column and together give rise to the 56-digit binary sequence:

    [0101] 01110100111010111100010101001010011000001001010010101100

    [0102] Using the above-mentioned key, this may be translated into the following 28-digit nucleotide sequence:

    TABLE-US-00008 (SEQ ID NO: 8) CTCATGGTTACCCAGGCGAAGCCAGGTA

    [0103] As already mentioned, the binary sequence may furthermore be encrypted with a password using conventional encryption algorithms prior to translation into a nucleotide sequence.

    [0104] Translation of the nucleotide sequence back into a binary sequence and an order of precedence (key) proceeds in the reverse order in a similar manner to the described method.
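
    The reverse direction can be sketched as follows: the nucleotides are expanded to bits, the bit stream is cut into fields whose widths follow from the number of codons per amino acid, and each field is converted back into an order of precedence. The names `BITS_FOR`, `nt_to_bits`, `order_from_index` and `decode_key` are illustrative, and the reader is assumed to know which amino acids were deposited and in which order.

```python
from math import factorial

BITS_FOR = {"A": "00", "C": "01", "G": "10", "T": "11"}


def nt_to_bits(seq: str) -> str:
    """Expand a nucleotide sequence into its bit string, two bits per base."""
    return "".join(BITS_FOR[base] for base in seq)


def order_from_index(index: int, n: int) -> str:
    """Permutation of 1..n at zero-based position `index` of the ascending list."""
    remaining = [str(d) for d in range(1, n + 1)]
    order = []
    for i in range(n, 0, -1):
        pos, index = divmod(index, factorial(i - 1))
        order.append(remaining.pop(pos))
    return "".join(order)


def decode_key(seq: str, codon_counts: list) -> list:
    """Recover one order of precedence per amino acid from the deposited sequence."""
    bits, orders, cursor = nt_to_bits(seq), [], 0
    for n in codon_counts:
        width = (factorial(n) - 1).bit_length()
        field = bits[cursor:cursor + width]
        cursor += width
        orders.append(order_from_index(int(field, 2) if field else 0, n))
    return orders


# SEQ ID NO: 8 holds the key for A, G, L, P, R, S, T, V (4 or 6 codons each).
print(decode_key("CTCATGGTTACCCAGGCGAAGCCAGGTA", [4, 4, 6, 4, 6, 6, 4, 4]))
```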

    Example 3: Study of Expression in E. coli

    [0105] Construct eGFP(Opt):

    [0106] The open reading frame for enhanced green fluorescent protein (eGFP) was optimized for expression in E. coli. In so doing, a codon adaptation index (CAI) of 0.93 and a GC content of 53% were achieved.

    [0107] Construct eGFP(Msg):

    [0108] According to the invention, the message “AEQUOREA VICTORIA.” was embedded into the optimized DNA sequence, the key used being the codon usage table (CUT) of E. coli and the only codons used to accommodate the bits being those which have a degree of degeneracy of 4 or 6 and thus encode the amino acids A, G, P, T, V, L, R, S. Embedding the 18×6=108 bit long message results in 71 nucleotide substitutions, so modifying the sequence by 10%. The CAI changes to 0.84, the GC content to 47%.

    [0109] FIG. 6 shows an alignment of the two sequences eGFP(opt) and eGFP(msg).

    [0110] Both genes were produced synthetically and, via NdeI/HindIII, ligated into the expression vector pEG-His. The proteins consequently contain a C terminal 6×His-tag (SEQ ID NO:18).

    [0111] Both genes, eGFP(opt) and eGFP(msg) were expressed in E. coli and analysed by Coomassie gel, Western blot (with a GFP-specific antibody) and fluorescence. The results are shown in FIG. 7. It was found that eGFP(msg) exhibits expression which is better by a factor of approx. 2 than eGFP(opt). This increase in expression is a random effect and not the rule (according to studies with other genes). What is important to note is that expression does not suffer from the embedding of the message.

    Example 4: Study of Expression in Human Cells

    [0112] Construct EMG1(Opt):

    [0113] The open reading frame for the human gene EMG1 nucleolar protein homologue was optimized for expression in human cells. In so doing, a codon adaptation index (CAI) of 0.97 and a GC content of 64% were achieved.

    [0114] Construct EMG1(Msg):

    [0115] According to the invention, the message “GENEART AG U.S. Pat. No. 1,234,567” was embedded into the optimized DNA sequence, the key used being the codon usage table (CUT) of H. sapiens and the only codons used to accommodate the bits being those which have a degree of degeneracy of 4 or 6 and thus encode the amino acids A, G, P, T, V, L, R, S. Embedding the 24×6=144 bit long message results in 92 nucleotide substitutions, so modifying the sequence by 12%. The CAI changes to 0.87, the GC content to 59%.

    [0116] Construct EMG1(Enc):

    [0117] The message “GENEART AG U.S. Pat. No. 1,234,567” was firstly encrypted using the conventional polyalphabetic Vigenère method (after Blaise de Vigenère, 1586) with the password “Secret”, so generating the character string “:JQWF&G%DY%$4Y#′XE%87G;K” from the message. In addition to the very simple and insecure Vigenère method, in which a plaintext letter is replaced by different ciphertext letters depending on its position in the text, it is in principle possible to use any other encryption method. According to the invention, the encrypted character string “:JQWF&G%DY%$4Y#′XE%87G;K” was embedded into the optimized DNA sequence, the key used being the codon usage table (CUT) of H. sapiens and the only codons used to accommodate the bits being those which have a degree of degeneracy of 4 or 6 and thus encode the amino acids A, G, P, T, V, L, R, S. Embedding the 24×6=144 bit long message results in 93 nucleotide substitutions, so modifying the sequence by 12%. Here too, the CAI changes to 0.87, the GC content to 59%.

    [0118] FIG. 8 shows an alignment of the sequences of EMG1(opt), EMG1(msg) and EMG1(enc).

    [0119] All three genes were produced synthetically and, via NcoI/XhoI, ligated into the vector pTriEx1.1 which permits expression in mammalian cells.

    [0120] Human HEK-293T cells were transfected with the three constructs EMG1(opt), EMG1(msg) and EMG1(enc) and harvested after 36 h. Expression of EMG1 was detected by Western blot analysis (with a His-specific antibody). All three constructs exhibit a comparable strength of expression. The results are shown in FIG. 9.