NUCLEIC ACIDS ENCODING REPETITIVE AMINO ACID SEQUENCES RICH IN PROLINE AND ALANINE RESIDUES THAT HAVE LOW REPETITIVE NUCLEOTIDE SEQUENCES
20190010192 · 2019-01-10
Assignee
Inventors
Cpc classification
C07K2319/31
CHEMISTRY; METALLURGY
C07K14/5759
CHEMISTRY; METALLURGY
C12N15/64
CHEMISTRY; METALLURGY
C12N15/67
CHEMISTRY; METALLURGY
C12N15/11
CHEMISTRY; METALLURGY
C07K14/00
CHEMISTRY; METALLURGY
International classification
C07K14/00
CHEMISTRY; METALLURGY
C07K14/715
CHEMISTRY; METALLURGY
Abstract
The present invention relates to a nucleic acid molecule comprising a low repetitive nucleotide sequence encoding a proline/alanine-rich amino acid repeat sequence. The encoded polypeptide comprises a repetitive amino acid sequence that forms a random coil. The nucleic acid molecule comprising said low repetitive nucleotide sequences can further comprise a nucleotide sequence encoding a biologically or pharmacologically active protein. Further, the present invention provides for selection means and methods to identify said nucleic acid molecule comprising said low repetitive nucleotide sequence. The present invention also relates to a method for preparing said nucleic acid molecules. Also provided herein are methods for preparing the encoded polypeptide or drug conjugates with the encoded polypeptide using the herein provided nucleic acid molecules. The drug conjugate may comprise a biologically or pharmacologically active protein or a small molecule drug. Also provided herein are vectors and hosts comprising such nucleic acid molecules.
Claims
1. A nucleic acid molecule, wherein said nucleic acid molecule comprises a nucleotide sequence encoding a polypeptide consisting of proline, alanine and, optionally, serine, wherein the nucleotide sequence of said nucleic acid has a length of at least 300 nucleotides, wherein said nucleotide sequence has a Nucleotide Repeat Score (NRS) lower than 50,000, wherein said Nucleotide Repeat Score (NRS) is determined according to the formula:
2. The nucleic acid molecule of claim 1, wherein said encoded polypeptide consists of proline and alanine.
3. The nucleic acid molecule of claim 2, wherein said proline residues constitute more than about 10% and less than about 75% of said encoded polypeptide.
4. The nucleic acid molecule of claim 1, wherein said encoded polypeptide consists of proline, alanine and serine.
5. The nucleic acid molecule of claim 4, wherein said proline residues constitute more than 4% and less than 40% of said encoded polypeptide.
6. The nucleic acid molecule of any one of claims 1 to 5, wherein said Nucleotide Repeat Score (NRS) is lower than 100.
7. The nucleic acid molecule of any one of claims 1 to 6, wherein said Nucleotide Repeat Score (NRS) is lower than 50.
8. The nucleic acid molecule of any one of claims 1 to 7, wherein said Nucleotide Repeat Score (NRS) is lower than 35.
9. The nucleic acid molecule of any one of claims 1 to 8, wherein the nucleotide sequence of said nucleic acid has a length of at least 900 nucleotides.
10. The nucleic acid molecule of any one of claims 1 to 9, wherein said nucleic acid molecule has an enhanced genetic stability.
11. The nucleic acid molecule of any one of claims 1 to 10, wherein said nucleotide sequence comprises said repeats, wherein said repeats have a maximum length n.sub.max, wherein n.sub.max is determined according to the formula:
12. The nucleic acid molecule of any one of claims 1 to 11, wherein said repeats have a maximum length of about 14, 15, 16, or 17 nucleotides to about 55 nucleotides.
13. The nucleic acid molecule of any one of claims 1 to 12, wherein said encoded polypeptide comprises a repetitive amino acid sequence with a plurality of amino acid repeats, wherein no more than 9 consecutive amino acid residues are identical and wherein said polypeptide forms a random coil.
14. The nucleic acid molecule of any one of claims 1 to 3 and 6 to 13, wherein said nucleic acid molecule is selected from the group consisting of: (a) the nucleic acid molecule comprising at least one nucleotide sequence selected from the group consisting of SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 87, SEQ ID NO: 88, SEQ ID NO: 89, SEQ ID NO: 90, SEQ ID NO: 91, SEQ ID NO: 92, SEQ ID NO: 93, SEQ ID NO: 94, SEQ ID NO: 95, SEQ ID NO: 96, SEQ ID NO: 97, SEQ ID NO: 98, SEQ ID NO: 99, SEQ ID NO: 100, SEQ ID NO: 101, SEQ ID NO: 102, SEQ ID NO: 103, SEQ ID NO: 104, SEQ ID NO: 105, SEQ ID NO: 106, SEQ ID NO: 107, SEQ ID NO: 108, SEQ ID NO: 109, SEQ ID NO: 110, SEQ ID NO: 111, SEQ ID NO: 112, SEQ ID NO: 113, SEQ ID NO: 114, SEQ ID NO: 115, SEQ ID NO: 116, SEQ ID NO: 117, SEQ ID NO: 118, SEQ ID NO: 119, SEQ ID NO: 120, SEQ ID NO: 121, SEQ ID NO: 122, SEQ ID NO: 192 and SEQ ID NO: 193; (b) the nucleic acid molecule comprising the nucleotide sequence consisting of SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 153, SEQ ID NO: 154, SEQ ID NO: 155, SEQ ID NO: 156, SEQ ID NO: 157, SEQ ID NO: 158, SEQ ID NO: 159, SEQ ID NO: 160, SEQ ID NO: 161, SEQ ID NO: 162, SEQ ID NO: 163, SEQ ID NO: 164, SEQ ID NO: 165, SEQ ID NO: 166, SEQ ID NO: 167, SEQ ID NO: 168, SEQ ID NO: 169, SEQ ID NO: 170, SEQ ID NO: 171, SEQ ID NO: 172, and/or SEQ ID NO: 173; (c) the nucleic acid molecule hybridizing under stringent conditions to the complementary strand of the nucleotide sequence as defined in (a) or (b); (d) the nucleic acid molecule comprising the nucleotide sequence having at least 66.7% identity to the nucleotide sequence as defined in any one of (a), (b) and (c); and (e) the nucleic acid molecule being degenerate as a result of the genetic code to the nucleotide sequence as defined in (a) or (b).
15. The nucleic acid molecule of any one of claims 1 and 4 to 13, wherein said nucleic acid molecule is selected from the group consisting of: (a) the nucleic acid molecule comprising at least one nucleotide sequence selected from the group consisting of SEQ ID NO: 19, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24, SEQ ID NO: 25, SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 123, SEQ ID NO: 124, SEQ ID NO: 125, SEQ ID NO: 126, SEQ ID NO: 127, SEQ ID NO: 128, SEQ ID NO: 129, SEQ ID NO: 130, SEQ ID NO: 131, SEQ ID NO: 132, SEQ ID NO: 133, SEQ ID NO: 134, SEQ ID NO: 135, SEQ ID NO: 136, SEQ ID NO: 137, SEQ ID NO: 138, SEQ ID NO: 139, SEQ ID NO: 140, SEQ ID NO: 141, SEQ ID NO: 142, SEQ ID NO: 143, SEQ ID NO: 144, SEQ ID NO: 145, SEQ ID NO: 146, SEQ ID NO: 147, SEQ ID NO: 148, SEQ ID NO: 149, SEQ ID NO: 150, SEQ ID NO: 151, SEQ ID NO: 152, SEQ ID NO: 194 and SEQ ID NO: 195; (b) the nucleic acid molecule comprising the nucleotide sequence selected from the group consisting of SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 174, SEQ ID NO: 175, SEQ ID NO: 176, SEQ ID NO: 177, SEQ ID NO: 178, SEQ ID NO: 179, SEQ ID NO: 180, SEQ ID NO: 181, SEQ ID NO: 182, SEQ ID NO: 184, SEQ ID NO: 185, SEQ ID NO: 186, SEQ ID NO: 187, SEQ ID NO: 188, SEQ ID NO: 189, SEQ ID NO: 190, and SEQ ID NO: 191; (c) the nucleic acid molecule hybridizing under stringent conditions to the complementary strand of the nucleotide sequence as defined in (a) or (b); (d) the nucleic acid molecule comprising the nucleotide sequence having at least 56% identity to the nucleotide sequence as defined in any one of (a), (b) and (c); (e) the nucleic acid molecule being degenerate as a result of the genetic code to the nucleotide sequence as defined in (a) or (b).
16. The nucleic acid molecule of any one of claims 1 to 15 operably linked in the same reading frame to a nucleic acid encoding a biologically active protein.
17. The nucleic acid molecule of claim 16, wherein said biologically active protein is a therapeutically effective protein.
18. The nucleic acid molecule of claim 16 or 17, wherein said biologically active protein is selected from the group consisting of a binding protein, an antibody fragment, a cytokine, a growth factor, a hormone, an enzyme, a protein vaccine, a peptide vaccine, a peptide which consists of up to 50 amino acid residues or a peptidomimetic.
19. The nucleic acid molecule of claim 18, wherein said binding protein is selected from the group consisting of antibodies, Fab fragments, Fab′ fragments, F(ab′).sub.2 fragments, single chain variable fragments (scFv), (single) domain antibodies, isolated variable regions of antibodies (VL and/or VH regions), CDRs, immunoglobulin domains, CDR-derived peptidomimetics, lectins, protein scaffolds, fibronectin domains, tenascin domains, protein A domains, SH3 domains, ankyrin repeat domains, and lipocalins.
20. The nucleic acid molecule of any one of claims 16 to 18, wherein said biologically active protein is selected from the group consisting of interleukin 1 receptor antagonist, leptin, acid sphingomyelinase, adenosine deaminase, agalsidase alfa, alpha-1 antitrypsin, alpha atrial natriuretic peptide, alpha-galactosidase, alpha-glucosidase, alpha-N-acetylglucosaminidase, alteplase, amediplase, amylin, amylin analog, anti-HIV peptide fusion inhibitor, arginine deiminase, asparaginase, B domain deleted factor VIII, bone morphogenetic protein, bradykinin antagonist, B-type natriuretic peptide, bouganin, growth hormone, chorionic gonadotropin, CD3 receptor antagonist, CD19 antagonist, CD20 antagonist, CD40 antagonist, CD40L antagonist, cerebroside sulfatase, coagulation factor VIIa, coagulation factor XIII, coagulation factor IX, coagulation factor X, complement component C3 inhibitor, complement component 5a antagonist, C-peptide, CTLA-4 antagonist, C-type natriuretic peptide, defensin, deoxyribonuclease I, EGFR receptor antagonist, epidermal growth factor, erythropoietin, exendin-4, ezrin peptide 1, FcγIIB receptor antagonist, fibroblast growth factor 21, follicle-stimulating hormone, gastric inhibitory polypeptide (GIP), GIP analog, glucagon, glucagon receptor agonist, glucagon-like peptide 1 (GLP-1), GLP-1 analog, glucagon-like peptide 2 (GLP-2), GLP-2 analog, gonadorelin, gonadotropin-releasing hormone agonist, gonadotropin-releasing hormone antagonist, gp120, gp160, granulocyte colony stimulating factor (G-CSF), granulocyte macrophage colony stimulating factor (GM-CSF), ghrelin, ghrelin analog, growth hormone, growth hormone-releasing hormone, hematide, hepatocyte growth factor, hepatocyte growth factor receptor (HGFR) antagonist, hepcidin antagonist, hepcidin mimetic, Her2/neu receptor antagonist, histrelin, hirudin, hsp70 antagonist, humanin, hyaluronidase, hydrolytic lysosomal glucocerebroside-specific enzyme, iduronate-2-sulfatase, IgE antagonists, insulin, insulin analog, insulin-like growth factor 1, insulin-like growth factor 2, interferon-alpha, interferon-alpha antagonist, interferon-alpha superagonist, interferon-alpha-n3, interferon-beta, interferon-gamma, interferon-lambda, interferon tau, interleukin, interleukin 2 fusion protein, interleukin-22 receptor subunit alpha (IL-22ra) antagonist, irisin, islet neogenesis associated protein, keratinocyte growth factor, Kv1.3 ion channel antagonists, lanthipeptide, lipase, luteinizing hormone, lutropin alpha, lysostaphin, mannosidase, N-acetylgalactosamine-6-sulfatase, N-acetylglucosaminidase, neutrophil gelatinase-associated lipocalin, octreotide, ω-conotoxin, Ornithodoros moubata complement inhibitor, osteogenic protein-1, osteoprotegerin, oxalate decarboxylase, P128, parathyroid hormone, Phylomer, PD-1 antagonist, PDGF antagonist, phenylalanine ammonia lyase, platelet derived growth factor, proinsulin, protein C, relaxin, relaxin analog, secretin, RGD peptide, ribonuclease, senrebotase, serine protease inhibitor, soluble complement receptor type 1, soluble DCC receptor, soluble TACI receptor, soluble tumor necrosis factor I receptor (sTNF-RI), soluble tumor necrosis factor II receptor (sTNF-RII), soluble VEGF receptor Flt-1, soluble FcγIIB receptor, somatostatin, somatostatin analog, streptokinase, T-cell receptor ligand, tenecteplase, teriparatide, thrombomodulin alpha, thymosin alpha 1, toll like receptor inhibitor, tumor necrosis factor (TNFα), tumor necrosis factor α antagonist, uricase, vasoactive intestinal peptide, vasopressin, vasopressin analog, VEGF antagonist, von Willebrand factor.
21. A vector comprising the nucleic acid molecule of any one of claims 1 to 20.
22. A host or host cell comprising the nucleic acid molecule of any one of claims 1 to 20, a host or host cell comprising the vector of claim 21, or a host or host cell transformed with a vector of claim 21.
23. A method for preparing the nucleic acid molecule of any one of claims 1 to 20, wherein the method comprises culturing the host or host cell of claim 22 and optionally isolating the produced nucleic acid molecule.
24. A method for preparing the vector of claim 21, wherein the method comprises culturing the host or host cell of claim 22 and optionally isolating the produced vector.
25. A method for preparing a polypeptide encoded by the nucleic acid molecule of any one of claims 1 to 20, wherein the method comprises culturing/raising the host or host cell of claim 22 and optionally isolating the produced polypeptide.
26. A method for preparing a drug conjugate, wherein said drug conjugate comprises the polypeptide encoded by the nucleic acid molecule of any one of claims 1 to 15 and further comprises (i) a biologically active protein and/or (ii) a small molecule and/or (iii) a carbohydrate, wherein the method further comprises culturing the host or host cell of claim 22 and optionally isolating the produced polypeptide and/or drug conjugate.
27. The method for preparing the drug conjugate of claim 26, wherein said biologically active protein is a therapeutically effective protein.
28. The method for preparing the drug conjugate of claim 26 or 27, wherein said biologically active protein is selected from the group consisting of a binding protein, an antibody fragment, a cytokine, a growth factor, a hormone, an enzyme, a protein vaccine, a peptide vaccine, a peptide which consists of up to 50 amino acid residues or a peptidomimetic.
29. The method for preparing the drug conjugate of claim 28, wherein said binding protein is selected from the group consisting of antibodies, Fab fragments, Fab′ fragments, F(ab′).sub.2 fragments, single chain variable fragments (scFv), (single) domain antibodies, isolated variable regions of antibodies (VL and/or VH regions), CDRs, immunoglobulin domains, CDR-derived peptidomimetics, lectins, protein scaffolds, fibronectin domains, tenascin domains, protein A domains, SH3 domains, ankyrin repeat domains, and lipocalins.
30. The method for preparing the drug conjugate of any one of claims 26 to 28, wherein said biologically active protein is selected from the group consisting of interleukin 1 receptor antagonist, leptin, acid sphingomyelinase, adenosine deaminase, agalsidase alfa, alpha-1 antitrypsin, alpha atrial natriuretic peptide, alpha-galactosidase, alpha-glucosidase, alpha-N-acetylglucosaminidase, alteplase, amediplase, amylin, amylin analog, anti-HIV peptide fusion inhibitor, arginine deiminase, asparaginase, B domain deleted factor VIII, bone morphogenetic protein, bradykinin antagonist, B-type natriuretic peptide, bouganin, growth hormone, chorionic gonadotropin, CD3 receptor antagonist, CD19 antagonist, CD20 antagonist, CD40 antagonist, CD40L antagonist, cerebroside sulfatase, coagulation factor VIIa, coagulation factor XIII, coagulation factor IX, coagulation factor X, complement component C3 inhibitor, complement component 5a antagonist, C-peptide, CTLA-4 antagonist, C-type natriuretic peptide, defensin, deoxyribonuclease I, EGFR receptor antagonist, epidermal growth factor, erythropoietin, exendin-4, ezrin peptide 1, FcγIIB receptor antagonist, fibroblast growth factor 21, follicle-stimulating hormone, gastric inhibitory polypeptide (GIP), GIP analog, glucagon, glucagon receptor agonist, glucagon-like peptide 1 (GLP-1), GLP-1 analog, glucagon-like peptide 2 (GLP-2), GLP-2 analog, gonadorelin, gonadotropin-releasing hormone agonist, gonadotropin-releasing hormone antagonist, gp120, gp160, granulocyte colony stimulating factor (G-CSF), granulocyte macrophage colony stimulating factor (GM-CSF), ghrelin, ghrelin analog, growth hormone, growth hormone-releasing hormone, hematide, hepatocyte growth factor, hepatocyte growth factor receptor (HGFR) antagonist, hepcidin antagonist, hepcidin mimetic, Her2/neu receptor antagonist, histrelin, hirudin, hsp70 antagonist, humanin, hyaluronidase, hydrolytic lysosomal glucocerebroside-specific enzyme, iduronate-2-sulfatase, IgE antagonists, insulin, insulin analog, insulin-like growth factor 1, insulin-like growth factor 2, interferon-alpha, interferon-alpha antagonist, interferon-alpha superagonist, interferon-alpha-n3, interferon-beta, interferon-gamma, interferon-lambda, interferon tau, interleukin, interleukin 2 fusion protein, interleukin-22 receptor subunit alpha (IL-22ra) antagonist, irisin, islet neogenesis associated protein, keratinocyte growth factor, Kv1.3 ion channel antagonists, lanthipeptide, lipase, luteinizing hormone, lutropin alpha, lysostaphin, mannosidase, N-acetylgalactosamine-6-sulfatase, N-acetylglucosaminidase, neutrophil gelatinase-associated lipocalin, octreotide, ω-conotoxin, Ornithodoros moubata complement inhibitor, osteogenic protein-1, osteoprotegerin, oxalate decarboxylase, P128, parathyroid hormone, Phylomer, PD-1 antagonist, PDGF antagonist, phenylalanine ammonia lyase, platelet derived growth factor, proinsulin, protein C, relaxin, relaxin analog, secretin, RGD peptide, ribonuclease, senrebotase, serine protease inhibitor, soluble complement receptor type 1, soluble DCC receptor, soluble TACI receptor, soluble tumor necrosis factor I receptor (sTNF-RI), soluble tumor necrosis factor II receptor (sTNF-RII), soluble VEGF receptor Flt-1, soluble FcγIIB receptor, somatostatin, somatostatin analog, streptokinase, T-cell receptor ligand, tenecteplase, teriparatide, thrombomodulin alpha, thymosin alpha 1, toll like receptor inhibitor, tumor necrosis factor (TNFα), tumor necrosis factor α antagonist, uricase, vasoactive intestinal peptide, vasopressin, vasopressin analog, VEGF antagonist, von Willebrand factor.
31. The method for preparing the drug conjugate of claims 26 to 30, wherein said small molecule is selected from the group consisting of angiogenesis inhibitors, anti-allergic drugs, anti-emetic drugs, anti-depressant drugs, anti-hypertensive drugs, anti-inflammatory drugs, anti-infective drugs, anti-psychotic drugs, anti-proliferative (cytotoxic and cytostatic) drugs, calcium antagonists and other circulatory organ drugs, cholinergic agonists, drugs acting on the central nervous system, drugs acting on the respiratory system, hormones, steroids, polyketides, carbohydrates, oligosaccharides, nucleic acids, nucleic acid derivatives, antisense nucleic acids, small interference RNAs (siRNAs), micro RNA (miR) inhibitors, microRNA mimetics, DNA aptamers and RNA aptamers.
32. Method for sequencing of the nucleic acid molecule of any one of claims 1 to 20.
33. Method for amplification of the nucleic acid molecule of any one of claims 1 to 20.
34. Method for cloning of the nucleic acid molecule of any one of claims 1 to 20.
35. A method for selecting a genetically stable nucleic acid molecule, wherein said nucleic acid molecule comprises a nucleotide sequence encoding a polypeptide consisting of proline, alanine and, optionally, serine, wherein said nucleotide sequence has a length of at least 300 nucleotides, the method comprising a step of selecting a nucleic acid molecule comprising a nucleotide sequence having a Nucleotide Repeat Score (NRS) lower than 50,000, wherein said Nucleotide Repeat Score (NRS) is determined according to the formula:
Description
EXAMPLE 1
Synthesis of Low Repetitive Nucleotide Sequence Units Encoding Proline/Alanine-Rich Amino Acid Repeat Sequences
[0402] A set of different nucleotide sequences, each encoding a proline/alanine-rich amino acid repeat sequence of 200 residues, was optimized, including manual adjustment, with regard to low repetitivity on the nucleotide level, low GC content, low RNA secondary structure, preferred codon usage for expression in E. coli and avoidance of antiviral motifs as well as cis-acting elements. To this end, established algorithms such as the condition-specific codon optimization approach (Lanza (2014) BMC Syst Biol 8:33) or the GeneOptimizer algorithm (Raab (2010) Syst Synth Biol 4:215-225) were applied. The initial sequences obtained therefrom were manually adjusted in the following manner.
[0403] Repeats longer than a given threshold (e.g., 14 nucleotides) were identified using the Visual Gene Developer software version 1.2, which is freely available at http://visualgenedeveloper.net. Subsequently, codons within identified repeats were stepwise substituted. In particular, GC-rich codons within the identified repeats were replaced by AT-rich codons prevalent in highly expressed genes in a host organism of choice (e.g., E. coli, P. pastoris or CHO). After each substitution, the entire nucleotide sequence was again analyzed for repeats. In case the substitution led to a new repeat longer than the given threshold, the nucleotide exchange(s) was rejected and a different codon within the previously identified repeat was substituted. If this approach failed, two codons within the identified long repeat were substituted in parallel. In this way, all identified repeats above a given threshold were iteratively eliminated while maintaining the encoded proline/alanine-rich amino acid sequence.
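The iterative procedure of this paragraph can be illustrated by a minimal Python sketch. It is not the Visual Gene Developer workflow used here; the function names, the order of the synonymous codons (AT-rich first) and the acceptance rule (a substitution is kept only if the number of repeated k-mers strictly decreases) are assumptions made for illustration, and the parallel two-codon substitution is omitted.

```python
from collections import defaultdict

# Synonymous codons for Pro, Ala and Ser (standard genetic code). The order,
# AT-rich codons first, is an illustrative assumption, not a host codon table.
SYNONYMS = {
    "P": ["CCA", "CCT", "CCG", "CCC"],
    "A": ["GCA", "GCT", "GCG", "GCC"],
    "S": ["TCT", "TCA", "AGT", "AGC", "TCC", "TCG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}


def translate(seq):
    """Translate a Pro/Ala/Ser coding sequence into one-letter amino acids."""
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def find_repeats(seq, k):
    """Map every k-mer that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}


def candidates(seq, repeats, k):
    """Yield variants of seq with one synonymous codon changed inside a repeat."""
    for starts in repeats.values():
        for start in starts:
            # Walk over every codon that overlaps this repeat occurrence.
            for pos in range(start - start % 3, start + k, 3):
                old = seq[pos:pos + 3]
                for new in SYNONYMS[CODON_TO_AA[old]]:
                    if new != old:
                        yield seq[:pos] + new + seq[pos + 3:]


def eliminate_repeats(seq, threshold=14):
    """Greedily remove repeats longer than `threshold` nucleotides."""
    k = threshold + 1  # a repeated k-mer implies a repeat longer than threshold
    repeats = find_repeats(seq, k)
    while repeats:
        for cand in candidates(seq, repeats, k):
            cand_repeats = find_repeats(cand, k)
            # Keep a substitution only if the count of repeated k-mers strictly
            # drops; a simplification of the rule in the text that also
            # guarantees termination.
            if len(cand_repeats) < len(repeats):
                seq, repeats = cand, cand_repeats
                break
        else:
            raise ValueError("no single-codon substitution reduces the repeats")
    return seq
```

Calling eliminate_repeats on a coding sequence returns a sequence with the same translation; if no single silent substitution helps, the sketch raises an error instead of attempting the parallel two-codon exchange described above.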
[0404] In a second step, the codon adaptation index (CAI), GC content and stable mRNA structures of the optimized nucleotide sequence were analyzed using the Visual Gene Developer software and compared to the start sequence. Additional manual adjustments, again by codon substitution/silent mutation, were performed until the optimized nucleotide sequence reached a CAI, GC content or mRNA structure equal to or better than the start sequence. The repeat analysis from step 1 was carried out again, and, if necessary, other codons were exchanged in order to meet the objectives, which were repeat threshold, CAI, GC content and mRNA structures (secondary structures).
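Two of the metrics compared in this step can be sketched in Python as follows. The sketch assumes the usual definition of the CAI as the geometric mean of per-codon relative adaptiveness weights; the weights passed in are hypothetical placeholders to be taken from a codon table of the chosen host, and mRNA secondary structure prediction is not covered.

```python
import math


def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def cai(seq, weights):
    """Codon adaptation index: geometric mean of the relative adaptiveness
    weight of each codon. `weights` maps codon -> weight in (0, 1] and must
    be supplied by the user for the chosen host organism."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    log_sum = sum(math.log(weights[codon]) for codon in codons)
    return math.exp(log_sum / len(codons))


def compare(start_seq, optimized_seq, weights):
    """Return both metrics for the start and the optimized sequence."""
    return {
        "gc": (gc_content(start_seq), gc_content(optimized_seq)),
        "cai": (cai(start_seq, weights), cai(optimized_seq, weights)),
    }
```

A silent substitution would then be retained only if the metrics of the optimized sequence are at least as good as those of the start sequence, followed by a renewed repeat analysis.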
[0405] In a third step, different individually optimized nucleotide sequences, each encoding the same 200-residue proline/alanine-rich amino acid repeat sequence, were combined, i.e. appended to each other, and the resulting longer nucleotide sequence was optimized in the same manner as in steps 1 and 2. Finally, the resulting long nucleic acid sequence was divided into shorter DNA cassettes, e.g., of 600 nucleotides in length. For example, the 2400 nucleotide sequence PAS#1d/1f/1c/1b (SEQ ID NO: 39) was divided into four shorter cassettes (SEQ ID NO: 19, 20, 21, 23). Similarly, the 2400 nucleotide sequence PA#1e/1d/1c/1b (SEQ ID NO: 44) was divided into four shorter cassettes (SEQ ID NO: 28, 29, 30, 31), each comprising 600 nucleotides.
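The division of a long optimized sequence into cassettes can be sketched as follows. The function name and the checks are illustrative assumptions; the sketch only requires that the cassette length is a multiple of three, so that no codon is split, and that the sequence divides evenly.

```python
def split_into_cassettes(seq, cassette_len=600):
    """Cut an optimized coding sequence into consecutive cassettes of equal
    length without splitting a codon."""
    if cassette_len % 3 != 0:
        raise ValueError("cassette length must be a multiple of 3")
    if len(seq) % cassette_len != 0:
        raise ValueError("sequence length must be a multiple of the cassette length")
    return [seq[i:i + cassette_len] for i in range(0, len(seq), cassette_len)]
```

For a 2400 nucleotide sequence and the default cassette length, this yields four cassettes of 600 nucleotides each, which reassemble to the original sequence when joined in order.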
[0406] Flanked by two SapI recognition sites (5′-GCTCTTC-3′) in reverse complementary orientation, resulting in 5′-GCC/5′-GGC nucleotide overhangs after restriction enzyme digest, these optimized nucleotide sequence units were individually synthesized by different commercial vendors. Of note, due to the presence of the two GCC/GGC nucleotide overhangs, only the middle 597 nucleotides form a DNA double strand after excision and, hence, comprise base pairs (bp). Also, the optimized 600 nucleotide sequence is extended by an additional Ala codon due to the presence of the second SapI restriction site, thus leading to a cloned DNA cassette of overall 603 nucleotides encoding a proline/alanine-rich amino acid sequence. The presence of the two flanking SapI restriction sites enables precise excision and subcloning, e.g., on pXL2, of the entire DNA cassette of the invention.
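The nucleotide counts given in this paragraph can be reproduced with a simplified model of the excised fragment. The sketch assumes that the coding (top) strand of a unit starts with the GCC overhang and that the bottom strand carries the complementary 5′-GGC overhang at the opposite end; the function names and the placeholder sequence in the usage note are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def excised_fragment(unit, overhang="GCC"):
    """Model a SapI-excised unit as two strands with 3-nucleotide 5' overhangs.

    Returns (top strand, bottom strand, number of base pairs)."""
    if not unit.startswith(overhang):
        raise ValueError("unit is expected to start with the overhang codon")
    top = unit
    # The bottom strand pairs with everything after the top-strand overhang
    # and extends by the complement of the additional downstream Ala codon.
    bottom = reverse_complement(unit[len(overhang):] + overhang)
    base_pairs = len(unit) - len(overhang)
    return top, bottom, base_pairs
```

For a 600 nucleotide placeholder unit such as "GCC" followed by 199 GCA codons, both strands are 600 nucleotides long, 597 positions are base-paired, and the unit together with the additional GCC codon amounts to 603 nucleotides.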
[0407] Further sets of nucleotide sequence units encoding proline/alanine-rich amino acid repeat sequences, codon-optimized for expression in Escherichia coli, Pichia pastoris, human embryonic kidney (HEK) cells, Pseudomonas fluorescens, Corynebacterium glutamicum, Bacillus subtilis, Tetrahymena thermophila, Saccharomyces cerevisiae, Kluyveromyces lactis, Physcomitrella patens or Cricetulus griseus, were designed and synthesized in the same manner. Codon preference tables for these organisms are available for download at http://www.kazusa.or.jp/codon. The synthesized nucleic acid molecules according to the invention and their nucleotide sequence characteristics are summarized in Table 1.
EXAMPLE 2
Assembly of Low Repetitive Nucleotide Sequence Units to Longer Nucleotide Sequences Encoding Proline/Alanine-Rich Amino Acid Repeat Sequences
[0408] Plasmids obtained from commercial vendors, each carrying a cloned synthesized DNA fragment, were digested with SapI and the resulting 600 nucleotide DNA fragment was purified via agarose gel electrophoresis according to standard procedures (Sambrook (2001) loc. cit.). The individual nucleotide sequence units were assembled to longer nucleotide sequences using the plasmid pXL2 (SEQ ID NO: 48), a derivative of pUC19 (Yanisch-Perron (1985) Gene 33:103-119) shown in
[0409] As an example, first the nucleotide sequence unit PAS#1b(200) (SEQ ID NO: 19), then the sequence unit PAS#1c(200) (SEQ ID NO: 20), and subsequently the sequence unit PAS#1f(200) (SEQ ID NO: 23) were inserted into pXL2 via the SapI restriction site in the described manner, resulting in the plasmid pXL2-PAS#1f/1c/1b(600) (SEQ ID NO: 38). In a subsequent step, the sequence unit PAS#1d(200) (SEQ ID NO: 21) was additionally inserted in the same manner using the SapI restriction site. The resulting plasmid contained the assembled 2400 bp DNA cassette PAS#1d/1f/1c/1b(800) which in total revealed nucleotide sequence repeats with a maximum length of 14 nucleotides (SEQ ID NO: 39). As the recognition sequence of EarI (5′-CTCTTC-3′) downstream of the low repetitive DNA cassette cloned on pXL2 is also part of the recognition sequence of SapI, the entire assembled DNA cassette can be easily excised via restriction digest with EarI, thus cutting twice, allowing subsequent use for further subcloning.
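The stepwise assembly and the naming of the resulting cassette can be followed with a short Python sketch. It assumes, as inferred from the cassette names in this example, that each newly inserted unit ends up 5′ of the units already present; the helper names are illustrative and not part of the disclosure.

```python
SAPI_SITE = "GCTCTTC"
EARI_SITE = "CTCTTC"
RESIDUES_PER_UNIT = 200


def insert_unit(units, new_unit):
    """Return the unit order after one insertion via the SapI site
    (assumption: the new unit is placed upstream of the existing ones)."""
    return [new_unit] + units


def cassette_name(insertion_order, prefix="PAS#"):
    """Build a cassette name such as PAS#1f/1c/1b(600) from the order in
    which the units were inserted."""
    units = []
    for unit in insertion_order:
        units = insert_unit(units, unit)
    residues = RESIDUES_PER_UNIT * len(units)
    return f"{prefix}{'/'.join(units)}({residues})"


def cassette_length_nt(insertion_order):
    """Length of the coding cassette in nucleotides (three per residue)."""
    return 3 * RESIDUES_PER_UNIT * len(insertion_order)
```

With the insertion order 1b, 1c, 1f, 1d this gives the name PAS#1d/1f/1c/1b(800) and a length of 2400 nucleotides. The constants also show why EarI cuts at every SapI site: the EarI recognition sequence is contained in that of SapI.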
[0410] In the same manner, the low repetitive nucleotide sequence PA#1e/1d/1c/1b(800) (SEQ ID NO: 44) was assembled from the nucleotide sequence units PA#1b(200) (SEQ ID NO: 28), PA#1c(200) (SEQ ID NO: 29), PA#1d(200) (SEQ ID NO: 30) and PA#1e(200) (SEQ ID NO: 31) in the stated order. The described assembled nucleotide sequences as well as further exemplary low repetitive nucleic acid molecules encoding proline/alanine-rich amino acid repeat sequences according to this invention, also with codon usage optimized for host organisms different from E. coli, are summarized in Table 1. The disclosed cloning strategy offers a simple, stepwise assembly of complex gene cassettes comprising long low repetitive nucleic acid molecules encoding proline/alanine-rich amino acid repeat sequences, which cannot be directly obtained by common gene synthesis methods.
EXAMPLE 3
Repetitivity Analysis of Nucleotide Sequences Encoding Proline/Alanine-Rich Amino Acid Repeat Sequences
[0411] A dot plot analysis was performed for different nucleotide sequences encoding the proline/alanine-rich amino acid repeat sequences PA#3 (SEQ ID NO: 15) (
EXAMPLE 4
DNA-Sequencing of Low Repetitive Nucleic Acid Molecules Encoding Long Proline/Alanine-Rich Amino Acid Repeat Sequences
[0412] The low repetitive PAS#1f/1c/1b(600) DNA cassette (SEQ ID NO: 38) cloned on the plasmid pXL2 and described in Example 2 was sequenced by a DNA-sequencing service provider (Eurofins Genomics, Ebersberg, Germany) using Sanger cycle sequencing on an ABI 3730XL instrument (Thermo Fisher Scientific, Waltham, Mass.). To this end, 8 μl (150 ng/μl) of pXL2-PAS#1f/1c/1b(600) plasmid DNA, isolated from transformed E. coli XL1-blue cells using the QIAprep Spin Miniprep kit (Qiagen, Hilden, Germany), was mixed with 5 μl doubly distilled H.sub.2O and 2 μl primer XLP-1 (10 μM) (SEQ ID NO: 3), which hybridizes within the coding region of the PAS#1b(200) nucleotide sequence unit, and submitted to the DNA-sequencing service provider. As a result, an error-free electropherogram comprising more than 900 assignable nucleotides (
EXAMPLE 5
Construction of pASK75-PAS#1f/1c/1b(600), a Genetically Stable Expression Vector for the Bacterial Production of a Therapeutic PAS#1(600)-IL1Ra Fusion Protein
[0413] For the construction of an expression plasmid encoding the interleukin-1 receptor antagonist (IL-1Ra) as fusion with a 600 residue PAS#1 amino acid repeat sequence (SEQ ID NO: 38), the vector pASK75-IL1Ra (
EXAMPLE 6
Long-Term Genetic Stability Testing of a Plasmid Harboring a Low Repetitive Nucleic Acid Molecule Encoding a Proline/Alanine-Rich Amino Acid Repeat Sequence
[0414] The genetic stability of the plasmid pASK75-PAS#1f/1c/1b(600)-IL1Ra (SEQ ID NO: 50) was compared to the genetic stability of pASK75-PAS#1a(600)-IL1Ra (SEQ ID NO: 51), a derivative wherein the PAS#1f/1c/1b(600) DNA cassette was substituted by the repetitive nucleic acid PAS#1a(600) (SEQ ID NO: 12). To this end, E. coli KS272 (Strauch (1988) Proc. Natl. Acad. Sci. USA 85:1576-1580) was transformed with the respective plasmid using the calcium chloride method (Sambrook (2001) loc. cit.) and cultured for 7 days at 37° C., 170 rpm, in 50 ml Luria Bertani (LB) medium supplemented with 100 mg/mL ampicillin in a 100 mL shake flask without induction of gene expression. During this period, bacterial cells were transferred twice daily (in the morning and in the evening) into fresh medium using a 1:1000 dilution. On day 7, after a continuous growth over approximately 70 generations, the culture was finally grown to stationary phase and cells were plated on LB/Amp agar. Then, individual clones were picked, used for inoculation of 50 mL cultures in LB medium and, after growth to stationary phase overnight, plasmid DNA from five clones for each of the two plasmids was prepared using the Qiagen Miniprep Kit (Qiagen, Hilden, Germany) and analyzed by an XbaI/HindIII restriction digest (
[0415] Only 1 out of 5 analyzed clones of pASK75-PAS#1a(600)-IL1Ra showed the expected bands corresponding to 3093 bp and 2377 bp (
EXAMPLE 7
Seamless and Directed Cloning of a Low Repetitive Nucleotide Sequence Encoding Proline/Alanine-Rich Amino Acid Repeat Sequences on an Expression Plasmid Encoding the Biologically Active Protein IL-1Ra.
[0416] With the goal of pharmaceutical application, fusion proteins comprising solely the biologically active protein and a proline/alanine-rich amino acid repeat sequence are desired. The absence of additional amino acid linkers, e.g., introduced in order to provide or utilize restriction sites for cloning, may prevent potential immune responses during clinical use and/or avoid unintended interactions on the protein level. Therefore, a seamless cloning strategy was developed (
[0417] At first, a synthetic DNA fragment encoding the mature amino acid sequence of IL-1Ra (UniProt ID P18510) was obtained from a gene synthesis provider (Thermo Fisher Scientific, Regensburg, Germany). This gene fragment (SEQ ID NO: 46) comprised an XbaI restriction site, followed by a ribosomal binding site, the nucleotide sequence encoding the OmpA signal peptide, followed by a GCC alanine codon, a first SapI recognition sequence GCTCTTC on the non-coding strand, a GC dinucleotide spacer, and a second SapI restriction sequence in reverse complementary orientation, with its recognition sequence GCTCTTC on the coding strand, followed by a GCC alanine codon directly linked to the coding sequence for mature IL-1Ra (UniProt ID P18510), which was finally followed by a HindIII restriction site.
[0418] This gene fragment was cloned on pASK75 via the flanking restriction sites XbaI and HindIII according to standard procedures (Sambrook (2001) loc. cit.). The resulting plasmid (cf.
EXAMPLE 8
Bacterial Production and Purification of a Fusion Protein Between the PAS#1(600) Sequence and IL-1Ra Encoded on the Genetically Stable Plasmid pASK75-PAS#1f/1c/1b(600)-IL1Ra
[0419] The PAS#1(600)-IL1-Ra fusion protein (calculated mass: 68 kDa) was produced at 25° C. in E. coli KS272 harboring the genetically stable expression plasmid pASK75-PAS#1f/1c/1b(600)-IL1Ra from Example 6 and the folding helper plasmid pTUM4 (Schlapschy (2006) Protein Eng. Des. Sel. 20:273-284) using an 8 L bench top fermenter with a synthetic glucose mineral medium supplemented with 100 mg/L ampicillin and 30 mg/L chloramphenicol according to a published procedure (Schiweck (1995) Proteins 23:561-565). Recombinant gene expression was induced by addition of 500 µg/L anhydrotetracycline (Skerra (1994) loc. cit.) as soon as the culture reached OD.sub.550=28. After an induction period of 2.5 h, cells were harvested by centrifugation and resuspended over 10 min in ice-cold periplasmic fractionation buffer (500 mM sucrose, 1 mM EDTA, 100 mM Tris/HCl pH 8.0; 2 ml per L and OD.sub.550). After adding 15 mM EDTA and 250 µg/mL lysozyme, the cell suspension was incubated for 20 min on ice, centrifuged several times, and the cleared supernatant containing the recombinant protein was recovered.
[0420] The periplasmic extract was dialyzed four times at 4° C. against 5 L 40 mM Na-phosphate pH 7.5, 500 mM NaCl, respectively and purified by means of the His.sub.6-tag using an 80 ml HisTrap HP column (GE Healthcare, Freiburg, Germany). The protein was eluted with an imidazole/HCl pH 7.5 concentration gradient from 0 to 200 mM in 40 mM Na-phosphate pH 7.5, 0.5 M NaCl. The purified protein was pooled and dialyzed twice against 5 L 20 mM Tris/HCl pH 8.0, 1 mM EDTA at 4° C. for at least 6 h, respectively. The dialyzed protein solution was subjected to anion exchange chromatography using a 60 ml XK column (GE Healthcare, Freiburg, Germany) packed with Source15Q resin, connected to an Akta purifier system (GE Healthcare, Freiburg, Germany), using 20 mM Tris/HCl pH 8.0, 1 mM EDTA as running buffer. The protein was eluted using an NaCl concentration gradient from 0 to 200 mM in running buffer.
[0421] Eluted fractions were dialyzed twice against 10 mM MES/HCl pH 6.0, 1 mM EDTA at 4° C. for at least 6 h, respectively, and subsequently subjected to a cation exchange chromatography using an XK column packed with 36 ml Source15S resin (GE Healthcare, Freiburg, Germany). The cation exchange chromatography was performed on an Akta purifier system using 10 mM MES/HCl pH 6.0, 1 mM EDTA as running buffer and a NaCl concentration gradient from 0 to 500 mM in running buffer over 4 column volumes to elute the protein. The eluted protein fractions containing PAS#1(600)-IL1-Ra were again pooled, dialyzed against 5 L phosphate-buffered saline (PBS: 115 mM NaCl, 4 mM KH.sub.2PO.sub.4 and 16 mM Na.sub.2HPO.sub.4 pH 7.4) at 4° C. overnight, concentrated to 5 mg/ml using an Amicon Ultra centrifugal filter device (30000 MWCO; 15 mL; Millipore, Billerica, Mass.) and further purified via size exclusion chromatography using a HiLoad 26/60 Superdex 200 prepgrade column (GE Healthcare, Freiburg, Germany) equilibrated with PBS.
[0422] A homogeneous protein preparation without signs of aggregation was obtained with a final yield of 70 mg from one 8 L fermenter. Protein concentration was determined by measuring the absorption at 280 nm using a calculated extinction coefficient (Gill (1989) Anal. Biochem. 182:319-326) of 15720 M.sup.-1 cm.sup.-1. SDS-PAGE was performed using a high molarity Tris buffer system (Fling (1986) Anal. Biochem. 155:83-88) (
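The concentration determination at 280 nm follows the Beer-Lambert law. As a minimal sketch in Python 3 (not part of the original disclosure), the conversion from an absorbance reading to mg/mL could look as follows; the extinction coefficient is the calculated value stated above, the molar mass is the mass reported for the fusion protein in Example 9, and the function name, the default path length and the example reading of 0.5 are hypothetical.

```python
EXTINCTION_COEFF = 15720.0  # M^-1 cm^-1, calculated value stated in the text
MOLAR_MASS = 67994.8        # g/mol, mass reported for the fusion protein in Example 9


def concentration_mg_per_ml(a280, path_cm=1.0, dilution=1.0):
    """Convert an A280 reading into a protein concentration in mg/mL.

    Beer-Lambert law: A = epsilon * c * l, hence c = A / (epsilon * l).
    path_cm and dilution are assumptions of this sketch (1 cm cuvette,
    undiluted sample by default).
    """
    molar = a280 * dilution / (EXTINCTION_COEFF * path_cm)  # mol/L
    return molar * MOLAR_MASS  # g/L, numerically equal to mg/mL


conc = concentration_mg_per_ml(0.5)  # hypothetical absorbance reading
print("%.2f mg/mL" % conc)
```

With these values an absorbance of 1.0 in a 1 cm cuvette corresponds to roughly 4.3 mg/mL, so the hypothetical reading of 0.5 gives about 2.16 mg/mL.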
EXAMPLE 9
ESI-MS Analysis of the PAS#1(600)-IL1Ra Fusion Protein
[0423] PAS#1(600)-IL1Ra produced and purified as described in Example 8 was dialyzed twice against a 1000-fold volume of 10 mM ammonium acetate pH 6.8 and analyzed via ESI mass spectrometry on a Q-Tof Ultima instrument (Waters, Eschborn, Germany) using the positive ion mode. The deconvoluted spectrum of the PAS#1(600)-IL1Ra fusion protein revealed a mass of 67994.8 Da, which essentially coincides with the calculated mass of 67994.8 Da (
EXAMPLE 10
Construction of pASK37-MP-PA#1d/1c/1b(600), a Genetically Stable Plasmid for the Production of a Proline/Alanine-Rich Amino Acid Repeat Polypeptide in E. coli
[0424] For the construction of a stable expression plasmid encoding the pure PA#1(600) polypeptide, 100 pmol of the primers NdeI-MP-SapI-HindIIIfw (SEQ ID NO: 4) and NdeI-MP-SapI-HindIIIrev (SEQ ID NO: 5) were phosphorylated, mixed, heated up to 80° C. for 10 min and slowly cooled down to room temperature overnight to allow hybridization. The resulting double stranded DNA fragment exhibited sticky ends compatible with NdeI and HindIII overhangs. The plasmid pASK37 (Skerra (1991) loc. cit.) was cut with NdeI and HindIII and the backbone fragment was ligated with the hybridized primers.
[0425] The resulting plasmid was digested with SapI, which led to the liberation of a small (24 bp) insert containing two SapI recognition sites and a cleaved vector backbone with compatible sticky 5'-GCC/5'-GGC ends. These sticky ends are ideally suited for insertion of the low repetitive nucleotide sequence encoding the proline/alanine-rich amino acid repeat sequence at the position directly downstream of the N-terminal start methionine codon (ATG) followed by the proline codon CCA, which was found to allow efficient translational initiation. After isolation of the vector fragment using the QIAquick gel extraction kit and dephosphorylation with the thermosensitive alkaline phosphatase FastAP according to the manufacturer's instructions, it was ligated with the low repetitive gene cassette PA#1d/1c/1b(600) (SEQ ID NO: 42) excised from pXL2-PA#1d/1c/1b(600) via EarI restriction digest. The resulting plasmid (SEQ ID NO: 53) permits expression of a polypeptide comprising solely a proline/alanine-rich amino acid repeat sequence (
EXAMPLE 11
Bacterial Expression and Purification of a PA#1(600) Polypeptide Encoded on the Genetically Stable Plasmid pASK37-MP-PA#1d/1c/1b(600)
[0426] The PA#1(600) polypeptide, with an additional Pro residue at the N-terminus and an additional Ala residue at the C-terminus (calculated mass: 48302 Da), was produced in the cytoplasm of E. coli KS272 harboring the expression plasmid pASK37-PA#1d/1c/1b(600) described in Example 10. 4 ml LB medium in a sterile 13 mL polypropylene tube (Sarstedt, Nümbrecht, Germany), supplemented with 1% w/v glucose and 100 mg/L ampicillin, were inoculated with a colony of E. coli KS272 transformed with pASK37-PA#1d/1c/1b(600) and grown overnight at 37° C., 170 rpm. Bacterial protein production was performed at 30° C. in a 5 L shake flask with 2 L terrific broth (TB) medium (Sambrook (2001) loc. cit.) supplemented with 2.5 g/L D-glucose and 100 mg/L ampicillin.
[0427] E. coli cultures were inoculated with 2 ml overnight culture, cells were grown overnight and recombinant gene expression was induced at OD.sub.550=5 by addition of isopropyl-β-D-thiogalactopyranoside (IPTG) to a final concentration of 0.5 mM. Bacteria were harvested 3 h after induction, resuspended in 20 ml 40 mM Na-phosphate pH 7.2, 1 mM EDTA and lysed using a French pressure cell (Thermo Scientific, Waltham, MA). After centrifugation (17,000 rpm, 1 h, 4° C.) of the lysate, no inclusion bodies were observed. The supernatant containing the soluble PA#1(600) polypeptide was subjected to an ammonium sulfate precipitation by stepwise addition of solid (NH.sub.4).sub.2SO.sub.4 to a final concentration of 20% w/v under continuous stirring at room temperature. The supernatant was centrifuged at 17,000 rpm at room temperature for 20 min. The sediment containing the precipitated PA#1(600) polypeptide was dissolved in 20 mM Tris/HCl pH 8.0 and the solution was centrifuged (13,000 rpm, 10 min, room temperature) to remove insoluble contaminants.
[0428] Pure acetic acid (Sigma-Aldrich, Steinheim, Germany) was added to a final concentration of 1% v/v and impurities were sedimented by centrifugation at 13,000 rpm for 10 min. The supernatant containing the almost pure PA#1(600) polypeptide was dialyzed against a 100-fold volume of 1% v/v acetic acid overnight at 4° C. To remove residual impurities, the dialyzed protein was subjected to a subtractive cation exchange chromatography using a 1 ml Source15S column (GE Healthcare, Freiburg, Germany) connected to an Äkta purifier system using 1% v/v acetic acid as running buffer.
[0429] Samples from each purification step were analyzed by SDS-PAGE using a high molarity Tris buffer system (Fling (1986) loc. cit.). After SDS-PAGE, the gel was first stained with barium iodide as described for the analysis of PEG (Kurfurst (1992) Anal. Biochem. 200:244-248). Briefly, the polyacrylamide gel was rinsed with water and then incubated in a 2.5% w/v BaI.sub.2 (barium iodide dihydrate; Sigma-Aldrich, Steinheim, Germany) solution in water for 5 min. After rinsing with water, the gel was transferred into Lugol solution (10% w/v p.a. grade KI (AppliChem, Darmstadt, Germany) and 5% p.a. grade I.sub.2 (Riedel de Haen AG, Seelze, Germany) in water) for 5 min. After destaining in 10% v/v acetic acid, orange PA#1(600) polypeptide bands became visible (
EXAMPLE 12
ESI-MS Analysis of a Pure PA#1(600) Polypeptide
[0430] 200 µl of the isolated PA#1(600) polypeptide from Example 11 at a concentration of 5 mg/mL was applied to a 1 mL Resource RPC column (GE Healthcare, Freiburg, Germany) connected to an Akta purifier system using 2% v/v acetonitrile, 1% v/v formic acid as running buffer. The protein was eluted using an acetonitrile gradient from 2% v/v acetonitrile, 1% v/v formic acid to 80% v/v acetonitrile, 0.1% v/v formic acid over 20 column volumes. The eluted protein was directly analyzed via ESI mass spectrometry on a Q-Tof Ultima instrument using the positive ion mode. The deconvoluted spectrum of the PA#1(600) polypeptide revealed a mass of 48301.78 Da, which essentially coincides with the calculated mass of the PA#1(600) polypeptide, with an additional Pro residue at the N-terminus and an additional Ala residue at the C-terminus but devoid of the start methionine (48301.4 Da) (
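How closely the measured and calculated masses agree can be expressed as an absolute deviation and in parts per million. The following is a minimal sketch in Python 3, not part of the original disclosure; the function name is hypothetical and the two masses are the values stated in this paragraph.

```python
def mass_deviation(measured_da, calculated_da):
    """Return the deviation of a measured mass from the calculated mass,
    both as an absolute value in Da and as a relative value in ppm."""
    delta = measured_da - calculated_da
    ppm = delta / calculated_da * 1e6
    return delta, ppm


# Values stated in the text for the PA#1(600) polypeptide.
delta_da, delta_ppm = mass_deviation(48301.78, 48301.4)
print("deviation: %.2f Da (%.1f ppm)" % (delta_da, delta_ppm))
```

For the stated values the deviation is 0.38 Da, i.e. roughly 8 ppm of the calculated mass.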
EXAMPLE 13
Repeat Analysis of Nucleotide Sequences Encoding Proline/Alanine-Rich Amino Acid Sequences
[0431] As a measure to assess the quality of nucleic acid molecules encoding proline/alanine-rich sequences with regard to the frequency (occurrence) of nucleotide sequence repeats we have devised the Nucleotide Repeat Score (NRS), which is calculated according to the following formula:
[0432] In this formula, N.sub.tot is the total length of the nucleotide sequence analyzed, n is the length of a sequence repeat within the nucleotide sequence analyzed and the frequency f.sub.i(n) is the number of occurrences of this sequence repeat. In case there are several different sequence repeats with the same length n, these different sequence repeats are distinguished by the index i, and the number of different sequence repeats with the same length n is k(n). If there is just one type of sequence repeat with length n, k(n) equals 1. The NRS is defined as the sum, over all repeat lengths, of the squared repeat length multiplied by the square root of the respective overall frequency, divided by the total length of the analyzed nucleotide sequence. The minimal repeat length considered for the calculation of the NRS is 4 nucleotides, which includes all nucleotide sequences longer than one codon triplet, and the repeat length ranges up to N.sub.tot−1, that is, the length of the longest nucleotide sequence repeat that can occur more than once in the analyzed nucleotide sequence.
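The formula referred to in paragraph [0431] is not reproduced in this text. Based on the verbal definition given above and on the way the NRS-Calculator script of Example 14 sums the individual terms, it can presumably be written as follows; this is a reconstruction under those assumptions, not the original rendering:

```latex
\mathrm{NRS} \;=\; \frac{1}{N_{tot}} \sum_{n=4}^{N_{tot}-1} n^{2} \, \sqrt{\sum_{i=1}^{k(n)} f_{i}(n)}
```

Here the inner sum is the overall frequency of all repeats of length n, and the outer sum runs over all repeat lengths from 4 to N.sub.tot−1.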
[0433] In this context the term repeat means that a nucleotide sequence occurs at least twice within the nucleotide sequence analyzed. When counting the frequencies we have considered both nucleotide stretches with identical sequence that occur at least twice as well as different sequences of the same length which each also occur at least twice. For example, if the overall frequency of a 14mer repeat is five, this can mean either that the same 14mer nucleotide stretch occurs 5 times, or one 14mer nucleotide sequence occurs twice and a different 14 nucleotide sequence occurs three times in the analyzed nucleotide sequence.
[0434] Furthermore, each shorter repeat contained within a longer nucleotide sequence repeat is counted separately. For example, if the analyzed nucleotide sequence contains two GCACC nucleotide stretches (i.e., repeats), GCAC and CACC repeats are also counted individually, regardless of whether they occur within said GCACC nucleotide stretch or, possibly, in addition elsewhere within the analyzed nucleotide sequence. Of note, only repeats on the coding strand of the nucleic acid molecule are considered.
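The counting rules of paragraphs [0432] to [0434] can be illustrated with a short program. The following is a minimal sketch in Python 3, not the NRS-Calculator of Example 14: it counts overlapping windows with collections.Counter instead of a dot matrix comparison, and the function names and the 12-nucleotide toy sequence are hypothetical.

```python
import math
from collections import Counter


def repeat_frequencies(seq, min_len=4):
    """Return {n: overall frequency of all n-mers occurring at least twice}."""
    seq = seq.upper()
    freqs = {}
    for n in range(min_len, len(seq)):  # n = 4 ... N_tot - 1
        # Count every overlapping window of length n on the coding strand.
        counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
        total = sum(c for c in counts.values() if c >= 2)
        if total == 0:
            break  # no repeat of length n implies none of length n + 1
        freqs[n] = total
    return freqs


def nucleotide_repeat_score(seq, min_len=4):
    """Sum of n^2 * sqrt(overall frequency), divided by the sequence length."""
    freqs = repeat_frequencies(seq, min_len)
    return sum(n ** 2 * math.sqrt(f) for n, f in freqs.items()) / len(seq)


# Hypothetical toy sequence containing the GCACC stretch twice.
toy = "GCACCTTGCACC"
freqs = repeat_frequencies(toy)
nrs = nucleotide_repeat_score(toy)
print(freqs, round(nrs, 2))
```

For the toy sequence, the two GCACC stretches give an overall frequency of 2 for length 5, while the contained GCAC and CACC repeats are counted separately and give an overall frequency of 4 for length 4, so the score is (16·√4 + 25·√2)/12 ≈ 5.61.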
[0435] A person skilled in the art can identify nucleotide sequence repeats either manually or with the aid of generic software programs such as the Visual Gene Developer (Jung (2011) loc. cit.), available for download at http://www.visualgenedeveloper.net, or the Repfind tool (Betley (2002) loc. cit), available at http://zlab.bu.edu/repfind. However, not every algorithm detects each kind of repeat, e.g., the result of the Visual Gene Developer does not include overlapping repeats. Thus, results of software tools have to be checked and, if necessary, manually corrected. Alternatively, the algorithm termed NRS-Calculator described in Example 14 can be used to unambiguously identify nucleotide sequence repeats and to calculate the NRS automatically.
[0436] Natural as well as certain synthetic nucleic acids encoding proline/alanine-rich amino acid sequences are known in the art. However, all those sequences are highly repetitive on the genetic level, as becomes evident from the NRS analysis described below, and, thus, their use for biotechnological and/or biopharmaceutical applications is limited.
[0437] Several prior art nucleotide sequences encoding proline/alanine-rich amino acid sequences were compared to low repetitive nucleic acid molecules encoding proline/alanine-rich amino acid repeat sequences according to this invention using the NRS-Calculator described in Example 14: the nucleotide sequence PAS#1a(200) (SEQ ID NO: 11) disclosed in WO 2008/155134 (
[0438] The calculated repeat frequencies were plotted against the respective repeat length using Kaleidagraph V3.6 software (Synergy Software, Reading, PA) (
[0439] The difference in repetitivity between the prior art nucleotide sequences and the low repetitive nucleotide sequences of the invention becomes even more evident when comparing their Nucleotide Repeat Scores. Whereas all prior art sequences reveal an NRS above 80000 (Table 2), the 600 nucleotide sequence PAS#1b(200) and the 2400 nucleotide sequence PA#1e/1d/1c/1b(800) show NRS values of just 13 and 14, respectively (Table 1). This clearly demonstrates that the repeat quality of the low repetitive nucleotide sequences encoding proline/alanine-rich amino acid repeat sequences according to this invention is much higher compared to prior art sequences, with both fewer and shorter nucleotide sequence repeats.
TABLE-US-00004 TABLE 1 Characteristics of nucleic acid molecules according to this invention
No. | Low repetitive nucleotide sequence | SEQ ID: | Codon-optimized for: | Encoded amino acid repeat | n.sub.max | N.sub.tot | NRS
A: Nucleotide sequence units (building blocks)
1 | PAS#1b(200) | 19 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 13
2 | PAS#1c(200) | 20 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 12 | 600 | 12
3 | PAS#1d(200) | 21 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 12 | 600 | 11
4 | PAS#1e(200) | 22 | CHO (C. griseus) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 12 | 600 | 12
5 | PAS#1f(200) | 23 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 12 | 600 | 11
6 | PAS#1g(200) | 24 | Pichia pastoris | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 24
7 | PAS#1h(200) | 25 | CHO (C. griseus) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 12 | 600 | 20
8 | PAS#1i(200) | 26 | CHO (C. griseus) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 17
9 | PAS#1j(200) | 27 | CHO (C. griseus) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 16
10 | PA#1b(200) | 28 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 21
11 | PA#1c(200) | 29 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 18
13 | PA#1d(200) | 30 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 19
14 | PA#1e(200) | 31 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 22
15 | PA#1f(200) | 32 | CHO (C. griseus) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 24
16 | PA#1g(200) | 33 | CHO (C. griseus) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 24
17 | PA#1h(200) | 34 | CHO (C. griseus) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 32
18 | PA#1i(200) | 35 | CHO (C. griseus) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 17
19 | PA#3b(200) | 36 | E. coli | AAAPAAAPAAAPAAAPAAAP (SEQ ID NO: 57) | 14 | 600 | 26
20 | PA#5b(198) | 37 | E. coli | AAAAAPAAAAAPAAAAAP (SEQ ID NO: 58) | 14 | 594 | 27
101 | PA#1j(200) | 87 | P. pastoris | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 39
102 | PA#1k(200) | 88 | P. pastoris | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 29
103 | PA#1l(200) | 89 | P. pastoris | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 31
104 | PA#1m(200) | 90 | P. pastoris | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 24
105 | PA#1n(200) | 91 | S. cerevisiae | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 38
106 | PA#1o(200) | 92 | S. cerevisiae | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 20
107 | PA#1p(200) | 93 | S. cerevisiae | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 19
108 | PA#1q(200) | 94 | K. lactis | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 28
109 | PA#1r(200) | 95 | K. lactis | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 23
110 | PA#1s(200) | 96 | K. lactis | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 34
111 | PA#1t(200) | 97 | H. sapiens (HEK cells) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 25
112 | PA#1u(200) | 98 | H. sapiens (HEK cells) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 29
114 | PA#1v(200) | 99 | H. sapiens (HEK cells) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 31
114 | PA#1w(200) | 100 | Bacillus subtilis | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 23
115 | PA#1x(200) | 101 | Bacillus subtilis | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 16 | 600 | 27
116 | PA#1y(200) | 102 | Bacillus subtilis | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 32
117 | PA#1z(200) | 103 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 18 | 600 | 45
118 | PA#1aa(200) | 104 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 18
119 | PA#1ab(200) | 105 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 25
120 | PA#1ac(200) | 106 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 18
121 | PA#1ad(200) | 107 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 24
122 | PA#1ae(100) | 108 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 300 | 27
123 | PA#1af(200) | 109 | C. glutamicum | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 20
124 | PA#1ag(200) | 110 | C. glutamicum | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 24
125 | PA#1ah(200) | 111 | C. glutamicum | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 25
126 | PA#1ai(200) | 112 | C. glutamicum | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 16 | 600 | 21
127 | PA#1aj(200) | 113 | P. patens | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 30
128 | PA#1ak(200) | 114 | P. patens | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 31
129 | PA#1al(200) | 115 | P. patens | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 15 | 600 | 24
130 | PA#1am(200) | 116 | P. fluorescens | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 32
131 | PA#1an(200) | 117 | P. fluorescens | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 35
132 | PA#1ao(200) | 118 | P. fluorescens | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 18 | 600 | 41
133 | PA#1ap(200) | 119 | T. thermophila | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 37
134 | PA#1aq(200) | 120 | T. thermophila | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 34
135 | PA#1ar(200) | 121 | T. thermophila | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 600 | 22
136 | PA#1as(200) | 122 | T. thermophila | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 600 | 35
137 | PAS#1k(200) | 123 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 14
138 | PAS#1l(200) | 124 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 15 | 600 | 17
139 | PAS#1m(200) | 125 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 16
140 | PAS#1n(100) | 126 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 300 | 15
141 | PAS#1o(200) | 127 | P. pastoris | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 17
142 | PAS#1p(200) | 128 | P. pastoris | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 600 | 29
143 | PAS#1q(200) | 129 | P. fluorescens | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 600 | 25
144 | PAS#1r(200) | 130 | P. fluorescens | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 14
145 | PAS#1s(200) | 131 | P. fluorescens | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 600 | 24
146 | PAS#1t(200) | 132 | C. glutamicum | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 15
147 | PAS#1u(200) | 133 | C. glutamicum | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 12
148 | PAS#1v(200) | 134 | C. glutamicum | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 11
149 | PAS#1w(200) | 135 | P. patens | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 15
150 | PAS#1x(200) | 136 | P. patens | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 12 | 600 | 12
151 | PAS#1y(200) | 137 | P. patens | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 11 | 600 | 10
152 | PAS#1z(200) | 138 | K. lactis | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 15
153 | PAS#1aa(200) | 139 | K. lactis | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 15 | 600 | 17
154 | PAS#1ab(200) | 140 | K. lactis | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 16
155 | PAS#1ac(200) | 141 | S. cerevisiae | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 14
156 | PAS#1ad(200) | 142 | S. cerevisiae | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 14
157 | PAS#1ae(200) | 143 | S. cerevisiae | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 14
158 | PAS#1af(200) | 144 | T. thermophila | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 600 | 25
159 | PAS#1ag(200) | 145 | T. thermophila | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 600 | 25
160 | PAS#1ah(200) | 146 | T. thermophila | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 15 | 600 | 20
161 | PAS#1ai(200) | 147 | H. sapiens (HEK cells) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 13
162 | PAS#1aj(200) | 148 | H. sapiens (HEK cells) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 12 | 600 | 10
163 | PAS#1ak(200) | 149 | H. sapiens (HEK cells) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 11
164 | PAS#1al(200) | 150 | B. subtilis | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 12 | 600 | 11
165 | PAS#1am(200) | 151 | B. subtilis | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 13
166 | PAS#1an(200) | 152 | B. subtilis | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 600 | 14
167 | PA#1at(200) | 192 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 31 | 600 | 190
168 | PA#1au(200) | 193 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 26 | 600 | 105
169 | PAS#1ao(200) | 194 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 32 | 600 | 211
170 | PAS#1ap(200) | 195 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 26 | 600 | 105
B: Assembled low-repetitive nucleotide sequences
21 | PAS#1f/1c/1b(600) | 38 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 9
22 | PAS#1d/1f/1c/1b(800) | 39 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 2400 | 8
23 | PAS#1h/1e/1i(600) | 40 | CHO (C. griseus) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 14
24 | PAS#1j/1h/1e/1i(800) | 41 | CHO (C. griseus) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 2400 | 13
25 | PA#1d/1c/1b(600) | 42 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 1800 | 15
26 | PA#1i/1h/1g/1f(800) | 43 | CHO (C. griseus) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 2400 | 22
27 | PA#1e/1d/1c/1b(800) | 44 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 2400 | 14
28 | PA#1i/1h/1g/1f/1e/1d/1c/1b(1600) | 45 | E. coli/CHO (C. griseus) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 27 | 4800 | 24
171 | PA#1ae/1c(300) | 153 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 900 | 18
172 | PA#1ae/1d(300) | 154 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 900 | 17
173 | PA#1d/1c(400) | 155 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 1200 | 17
174 | PA#1b/1c/1d(600) | 156 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 14 | 1800 | 15
175 | PA#1d/1b/1c(600) | 157 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 20 | 1800 | 17
176 | PA#1c/1b/1d(600) | 158 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 1800 | 16
177 | PA#1c/1d/1b(600) | 159 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 20 | 1800 | 17
178 | PA#1b/1d/1c(600) | 160 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 1800 | 16
179 | PA#1aa/1e/1d/1c/1b(1000) | 161 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 20 | 3000 | 17
180 | PA#1ab/1aa/1e/1d/1c/1b(1200) | 162 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 20 | 3600 | 17
181 | PA#1ac/1ab/1aa/1e/1d/1c/1b(1400) | 163 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 20 | 4200 | 16
182 | PA#1ad/1ac/1ab/1aa/1e/1d/1c/1b(1600) | 164 | E. coli | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 20 | 4800 | 16
183 | PA#1ao/1an/1am(600) | 165 | P. fluorescens | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 19 | 1800 | 27
184 | PA#1ai/1ah/1ag/1af(800) | 166 | C. glutamicum | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 2400 | 17
185 | PA#1y/1x/1w(600) | 167 | B. subtilis | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 1800 | 24
186 | PA#1j/1k/1l/1m(800) | 168 | P. pastoris | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 2400 | 23
187 | PA#1p/1o/1n(600) | 169 | S. cerevisiae | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 18 | 1800 | 21
188 | PA#1s/1r/1q(600) | 170 | K. lactis | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 17 | 1800 | 23
189 | PA#1as/1ar/1aq/1ap(800) | 171 | T. thermophila | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 20 | 2400 | 30
190 | PA#1v/1u/1t(600) | 172 | H. sapiens (HEK cells) | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 19 | 1800 | 28
191 | PA#1al/1ak/1j(600) | 173 | P. patens | AAPAAPAPAAPAAPAPAAPA (SEQ ID NO: 2) | 18 | 1800 | 24
192 | PAS#1n/1b(300) | 174 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 900 | 12
193 | PAS#1n/1c(300) | 175 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 900 | 13
194 | PAS#1b/1f/1c(600) | 176 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 9
195 | PAS#1b/1c/1f(600) | 177 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 9
196 | PAS#1c/1b/1f(600) | 178 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 9
197 | PAS#1f/1b/1c(600) | 179 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 9
198 | PAS#1c/1f/1b(600) | 180 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 9
199 | PAS#1k/1d/1f/1c/1b(1000) | 181 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 20 | 3000 | 11
200 | PAS#1l/1k/1d/1f/1c/1b(1200) | 182 | E. coli | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 20 | 3600 | 12
201 | PAS#1s/1q/1r(600) | 183 | P. fluorescens | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 20 | 1800 | 21
202 | PAS#1v/1t/1u(600) | 184 | C. glutamicum | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 1800 | 13
203 | PAS#1an/am/1l(600) | 185 | B. subtilis | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 11
204 | PAS#1p/1o/1g(600) | 186 | P. pastoris | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 1800 | 20
205 | PAS#1ae/1ad/1ac(600) | 187 | S. cerevisiae | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 15 | 1800 | 12
206 | PAS#1ab/1aa/1z(600) | 188 | K. lactis | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 1800 | 15
207 | PAS#1ah/1ag/1af(600) | 189 | T. thermophila | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 1800 | 19
208 | PAS#1ak/aj/ah(600) | 190 | H. sapiens (HEK cells) | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 14 | 1800 | 10
209 | PAS#1y/1x/1w(600) | 191 | P. patens | ASPAAPAPASPAAPAPSAPA (SEQ ID NO: 1) | 17 | 1800 | 14
TABLE-US-00005 TABLE 2 Characteristics of prior art nucleotide sequences
No. | Sequence name | Organism | SEQ ID: | GenBank entry/patent no. | n.sub.max | N.sub.tot | NRS
1 | PAS#1a(200) | synthetic | 11 | WO 2008155134 | 540 | 600 | 1 127 680
2 | PA#1a(200) | synthetic | 14 | WO2011144756 | 540 | 600 | 1 127 680
3 | PA#3a(200) | synthetic | 15 | WO2011144756 | 540 | 600 | 1 127 680
4 | [(AP).sub.5].sub.20APA | synthetic | 16 | US2006/0252120 | 579 | 609 | 1 315 159
5 | [AAPAPAPAP].sub.10AS module of pBI-SS-(Tom)(AP)51-EGFP | synthetic | 17 | DQ399411.1 | 243 | 276 | 150 961
6 | Large tegument protein | Macacine Herpes virus 1 | 18 | NP_851896.1 | 197 | 225 | 81 858
EXAMPLE 14
NRS-Calculator, an Algorithm to Unambiguously Identify Nucleotide Sequence Repeats and to Calculate the Nucleotide Repeat Score
[0440] Generally available software programs such as the Visual Gene Developer (Jung (2011) loc. cit) or the Repfind tool (Betley (2002) loc. cit) do not always work reliably and may require manual corrections in order to calculate all sequence repeats within an analyzed nucleotide sequence properly. In addition, repeats have to be counted manually and the NRS must be calculated separately according to the formula described in Example 13. To provide an algorithm that yields unambiguous results and to facilitate the calculation of the NRS, a simple Python script termed NRS-Calculator is described here. This script, executed on the runtime environment Python 2.7.10 (http://www.python.org), is based on a dot matrix sequence comparison and identifies all forward repeats within a potentially long nucleotide sequence, including overlapping repeats, without considering gaps. The dot matrix sequence comparison is a method well known by a person skilled in the art and is described in common bioinformatics text books such as, e.g., Mount (2004) Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2.sup.nd edition, N.Y.
[0441] NRS-Calculator counts the frequencies for each repeat length and automatically calculates the NRS according to the formula described in Example 13. To execute the NRS-Calculator script the runtime environment Python version 2.7.10 was downloaded from https://www.python.org/downloads and installed on a ThinkPad L530 notebook (Lenovo, Stuttgart, Germany) running a Windows 7 operating system. The NRS-Calculator script listed below was saved as plain text file designated NRScalculator.py using Microsoft Windows Editor Version 6.1. The nucleotide sequence to be analyzed was saved as FASTA file named sequence.fas within the same folder. Subsequently, the command line shell was opened and the directory containing both the NRScalculator.py and the sequence.fas file was selected. To start the calculation, the following command line was executed: c:\user\admin\NRSfolder>c:\Python27\python.exe NRScalculator.py sequence.fas
[0442] This command resulted in a screen output of two columns: the left column indicating the repeat length (Length) and the right (second) column indicating the respective repeat frequency (Frequency). In addition, N.sub.tot and NRS (rounded to an integer) were stated at the beginning and the end of the output, respectively.
[0443] NRS-Calculator Script:
TABLE-US-00006
import math
import sys


class NRSCalculator:
    def __init__(self):
        self.repeats = dict()  # repeated subsequence -> set of start positions
        self.sums = dict()     # repeat length -> overall frequency
        self.seq = None
        self.range_min = None
        self.range_max = None

    def _match_at(self, row, column):
        return self.seq[row] == self.seq[column]

    def _get_repeats_at(self, row, column):
        # Follow one diagonal of the dot matrix, starting at (row, column).
        length = 1
        search_row = row
        search_column = column
        while True:
            if not 0 <= search_row < len(self.seq):
                break
            if not 0 <= search_column < search_row:
                break
            if length > self.range_max:
                break
            if not self._match_at(search_row, search_column):
                break
            if length >= self.range_min:
                repeats = self.repeats.setdefault(self.seq[row:row + length], set())
                repeats.add(row)
                repeats.add(column)
            search_row += 1
            search_column += 1
            length += 1

    def _get_repeats(self):
        # Compare every position with every earlier position (forward repeats only).
        self.repeats = dict()
        for row in xrange(len(self.seq)):
            for column in xrange(row):
                self._get_repeats_at(row, column)

    def _get_sums(self):
        # Add up the occurrences of all repeats that share the same length.
        self.sums = dict()
        for (seq, repeats) in self.repeats.iteritems():
            length = len(seq)
            self.sums[length] = self.sums.get(length, 0) + len(repeats)

    def set_range(self, range_min, range_max):
        self.range_min = range_min
        self.range_max = range_max

    def set_sequence(self, seq):
        self.seq = seq

    def work(self):
        if not self.seq and not self.range_min and not self.range_max:
            raise RuntimeError("Can not work without initialization")
        self._get_repeats()
        self._get_sums()

    def print_repeats(self):
        print("Sequence (Length bp) : NumRepeats (Positions)")
        for seq, repeats in sorted(self.repeats.iteritems(), key=lambda t: len(t[0])):
            list = [seq, len(seq), len(repeats)]
            list.extend(map(lambda value: value + 1, sorted(repeats)))
            print("%s Ntot = %u : %u (%s)" % (seq, len(seq), len(repeats), ", ".join(map(lambda value: str(value + 1), sorted(repeats)))))

    def print_sums(self):
        print("Length\tFrequency")
        for item in self.sums.iteritems():
            print("%u\t%u" % item)

    def print_score(self):
        # NRS: sum of squared length times root of frequency, divided by Ntot.
        sum = 0
        for length, count in self.sums.iteritems():
            sum += (length ** 2) * math.sqrt(count)
        print("NRS = %.0f" % (sum / len(self.seq)))


def handle_sequence(finder, name, sequence):
    finder.set_range(4, len(sequence))
    finder.set_sequence(sequence)
    finder.work()
    print("%s: Ntot = %u" % (name, len(sequence)))
    #finder.print_repeats()
    finder.print_sums()
    finder.print_score()


if len(sys.argv) != 2:
    print("Usage: %s FILENAME" % sys.argv[0])
    sys.exit(1)

finder = NRSCalculator()
with open(sys.argv[1], "r") as infile:
    name = "Unnamed"
    seq = ""
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            # A new FASTA header: evaluate the sequence collected so far.
            if len(seq) > 0:
                handle_sequence(finder, name, seq)
            name = line
            seq = ""
            continue
        seq += line.upper()
    handle_sequence(finder, name, seq)
[0444] Exemplary Output from NRS-Calculator:
TABLE-US-00007
>PAS#1b(200): Ntot = 600
Length  Frequency
4       587
5       547
6       478
7       388
8       281
9       158
10      90
11      45
12      6
13      4
14      2
NRS = 13
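The score in this exemplary output can be recomputed by hand from the listed frequencies. The following is a minimal sketch in Python 3, not part of the original disclosure; it only re-applies the summation used in the print_score method of the script above to the Length/Frequency pairs shown, and the variable names are hypothetical.

```python
import math

# Length -> frequency pairs copied from the exemplary output for PAS#1b(200).
frequencies = {4: 587, 5: 547, 6: 478, 7: 388, 8: 281, 9: 158,
               10: 90, 11: 45, 12: 6, 13: 4, 14: 2}
n_tot = 600  # Ntot stated in the first line of the output

# Sum of squared repeat length times square root of the overall frequency,
# divided by the total sequence length.
nrs_from_output = sum(n ** 2 * math.sqrt(f) for n, f in frequencies.items()) / n_tot
print("NRS = %.0f" % nrs_from_output)
```

The unrounded value is approximately 12.57, which rounds to the NRS of 13 stated in the output.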
EXAMPLE 15
Construction of pASK75-PA#1d/1c/1b(600)-IL1Ra, a Genetically Stable Expression Vector for the Bacterial Production of a Therapeutic PA#1(600)-IL1Ra Fusion Protein
[0445] For the construction of an expression plasmid encoding the interleukin-1 receptor antagonist (IL-1Ra) as fusion with a 600 residue PA#1 amino acid repeat sequence, the vector pASK75-IL1Ra (
EXAMPLE 16
Long-Term Genetic Stability Testing of the Plasmid pASK75-PA#1d/1c/1b(600)-IL1Ra Harboring the Low Repetitive Nucleic Acid Molecule PA#1d/1c/1b(600) Encoding a Proline/Alanine-Rich Amino Acid Repeat Sequence
[0446] The genetic stability of the plasmid pASK75-PA#1d/1c/1b(600)-IL1Ra (SEQ ID NO: 77) was compared to the genetic stability of pASK75-PA#1a(600)-IL1Ra (SEQ ID NO: 78), a derivative wherein the PA#1d/1c/1b(600) DNA cassette was replaced by the repetitive nucleic acid PA#1a(600) (SEQ ID NO: 80). To this end, E. coli JM83 (Yanisch-Perron C. (1985) loc. cit.) was transformed with the respective plasmid using the calcium chloride method (Sambrook (2001) loc. cit.) and cultured for 7 days at 37° C. and 170 rpm in 50 mL Luria Bertani (LB) medium supplemented with 100 mg/L ampicillin in a 100 mL shake flask without induction of gene expression. During this period, the bacterial cells were transferred into fresh medium twice daily (in the morning and in the evening) using a 1:1000 dilution. On day 7, after continuous growth over approximately 70 generations, the culture was finally grown to stationary phase and cells were plated on LB/Amp agar. Then, ten individual colonies for each of the two plasmids were picked, each was used to inoculate a 50 mL culture in LB/Amp medium and, after growth to stationary phase overnight, plasmid DNA was prepared using the Qiagen Miniprep Kit (Qiagen, Hilden, Germany) and analyzed via XbaI/HindIII restriction digest (
[0447] Only 6 out of 10 analyzed clones of pASK75-PA#1a(600)-IL1Ra showed the expected bands corresponding to 3093 bp and 2377 bp (
EXAMPLE 17
Construction of Genetically Stable Expression Vectors for the Bacterial Production of Human Leptin Fused with Proline/Alanine-Rich Amino Acid Repeat Sequences
[0448] For the construction of an expression plasmid encoding human Leptin (huLeptin) N-terminally fused with a 600 residue PA#1 amino acid repeat sequence (SEQ ID NO: 82), the vector pASK37-MP-huLeptin (
EXAMPLE 18
Bacterial Production, Purification and Characterization of a Fusion Protein Between a Proline/Alanine-Rich Amino Acid Repeat Sequence and a Human Leptin Mutant Encoded on the Genetically Stable Plasmid pASK37-PA#1d/1c/1b(600)hu-Leptin(W100Q)
[0449] PA#1(600)-huLeptin(W100Q), a fusion protein between a human Leptin mutant with a tryptophan to glutamine substitution at position 100 of the mature amino acid sequence (UniProtKB accession code P41159) and the proline/alanine-rich amino acid repeat sequence PA#1(600) (SEQ ID NO: 85) (calculated mass: 64.25 kDa), was produced at 30° C. in the cytoplasm of Origami B (Novagen/Merck Millipore, Billerica, Mass.), an E. coli strain which has an oxidizing cytoplasm due to trxB, gor and ahpC mutations (Bessette (1999) Proc. Natl. Acad. Sci. USA 96:13703-13708). To this end, 4 ml LB medium in a sterile 13 mL polypropylene tube (Sarstedt, Nümbrecht, Germany), supplemented with 1% w/v D-glucose and 100 mg/L ampicillin, was inoculated with a colony of E. coli Origami B transformed with the genetically stable expression plasmid pASK37-MP-PA#1d/1c/1b(600)-huLep(W100Q) (SEQ ID NO: 86). Bacterial cells were grown overnight at 30° C. in a shaker at 170 rpm.
[0450] Bacterial protein production was performed at 30° C. in a 5 L baffle flask with 2 L terrific broth (TB) medium (Sambrook (2001) loc. cit.) supplemented with 2.5 g/L D-glucose and 100 mg/L ampicillin, which was inoculated with 2 ml of the E. coli overnight culture. Bacterial cells were grown at 30° C. and recombinant gene expression was induced at OD.sub.550=0.85 by addition of isopropyl-β-D-thiogalactopyranoside (IPTG) to a final concentration of 0.5 mM. Bacteria were harvested 19 h after induction, resuspended in 3 ml PBS/E (PBS supplemented with 10 mM EDTA) per 1 g bacterial cell wet weight and lysed using a Panda cell homogenizer (GEA, Parma, Italy). After centrifugation (20,000 rpm, 30 min, 4° C.) of the lysate, no inclusion bodies were observed. 1 mM 2,2'-dithiodipyridine was added to the supernatant to promote disulfide bridge formation in the recombinant Leptin. The supernatant containing the soluble Leptin fusion protein was dialyzed overnight at 4° C. against a 100-fold volume of PBS. Subsequently, the fusion protein was precipitated at room temperature by dropwise addition of 4 M (NH.sub.4).sub.2SO.sub.4 (dissolved in water) under continuous stirring until a final concentration of 1 M (NH.sub.4).sub.2SO.sub.4 was reached. After centrifugation for 20 min at 17,000 rpm at room temperature, the sediment containing the precipitated PA#1(600)-hu-Leptin(W100Q) fusion protein was dissolved in PBS and the solution was centrifuged (13,000 rpm, 10 min, room temperature) to remove insoluble contaminants.
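The volume of 4 M ammonium sulfate stock required to reach a final concentration of 1 M follows from a simple mixing balance. The sketch below is illustrative only: it assumes ideal, additive volumes (concentrated salt solutions deviate somewhat from this), the function name stock_volume_for_final_conc is hypothetical, and the 300 mL sample volume is an arbitrary example that is not taken from the procedure.

```python
def stock_volume_for_final_conc(sample_volume, stock_conc, final_conc):
    """Volume of stock to add so that the mixture reaches final_conc.

    Assumes additive volumes, i.e.
        final_conc = stock_conc * v_add / (sample_volume + v_add)
    solved for v_add. Units of the result follow sample_volume.
    """
    if not 0 <= final_conc < stock_conc:
        raise ValueError("final_conc must be non-negative and below stock_conc")
    return sample_volume * final_conc / (stock_conc - final_conc)


# Illustrative sample volume (assumption, not stated in the text): 300 mL
v_add = stock_volume_for_final_conc(300.0, stock_conc=4.0, final_conc=1.0)
print("Add %.0f mL of 4 M stock" % v_add)  # prints "Add 100 mL of 4 M stock"
```

Under these assumptions, one third of the sample volume has to be added as 4 M stock, whatever the actual sample volume is.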
[0451] The PA#1(600)-hu-Leptin(W100Q) fusion protein was dialyzed twice against 5 L 20 mM Tris/HCl pH 8.5 at 4° C., each for at least 6 h. Then, the protein solution was subjected to anion exchange chromatography using a 6 ml ResourceQ column (GE Healthcare, Freiburg, Germany) connected to an Äkta purifier system (GE Healthcare, Freiburg, Germany), using 20 mM Tris/HCl pH 8.5 as running buffer. The fusion protein was subsequently eluted using a NaCl concentration gradient. Eluted fractions were collected and further purified via size exclusion chromatography using a Superdex 200 HR10/300 column (GE Healthcare, Freiburg, Germany) equilibrated with PBS.
[0452] By this procedure a homogeneous protein preparation without signs of aggregation was obtained with a final yield of 0.8 mg/L bacterial culture. Protein concentration was determined by measuring the absorption at 280 nm using a calculated extinction coefficient (Gill (1989) loc. cit.) of 8605 M.sup.-1 cm.sup.-1. SDS-PAGE was performed using a 10% high molarity Tris buffer system (Fling (1986) loc. cit.) (
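The conversion from absorption at 280 nm to protein concentration can be sketched with the Beer-Lambert law. This is a minimal illustration, not the procedure actually used: the molar mass is taken from the calculated mass of 64.25 kDa stated for the fusion protein in this Example, a 1 cm path length is assumed, the function name concentration_from_a280 is hypothetical, and the absorbance reading of 0.5 is an arbitrary example value.

```python
EXTINCTION_COEFF = 8605.0  # M^-1 cm^-1, calculated value quoted in the text
MOLAR_MASS = 64250.0       # g/mol, from the calculated mass of 64.25 kDa


def concentration_from_a280(a280, path_length_cm=1.0):
    """Return (molar concentration in M, mass concentration in mg/mL).

    Beer-Lambert law: A = epsilon * c * l, solved for c.
    """
    molar = a280 / (EXTINCTION_COEFF * path_length_cm)
    return molar, molar * MOLAR_MASS  # g/L is numerically equal to mg/mL


# Illustrative absorbance reading (assumption, not a measured value)
molar, mg_per_ml = concentration_from_a280(0.5)
print("%.1f uM, %.2f mg/mL" % (molar * 1e6, mg_per_ml))
```

Because the proline/alanine-rich sequence does not absorb at 280 nm, the extinction coefficient is low relative to the mass of the fusion protein, so an absorbance of 1 corresponds to roughly 7.5 mg/mL under these assumptions.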