STRATEGIES FOR HIGH THROUGHPUT IDENTIFICATION AND DETECTION OF POLYMORPHISMS

20210202035 · 2021-07-01

Abstract

The invention relates to a method for identifying one or more polymorphisms in nucleic acid samples, comprising: (a) performing a reproducible complexity reduction on a plurality of nucleic acid samples to provide a plurality of libraries of the nucleic acid samples comprising amplified fragments, wherein the reproducible complexity reduction comprises amplifying fragments of the nucleic acid samples using one or more primers to obtain the amplified fragments, and wherein the amplified fragments in each library comprise a unique identifier sequence to indicate origin of each library obtained by the reproducible complexity reduction; (b) combining the plurality of libraries to obtain a combined library and sequencing at least a portion of the combined library to obtain sequences; (c) aligning the sequences to obtain an alignment; and (d) identifying one or more polymorphisms in the plurality of nucleic acid samples.

Claims

1. A kit for use in a method for detecting one or more polymorphisms in a plurality of nucleic acid samples, comprising: (a) an adaptor comprising an overhang that can be ligated to a protruding end of a restriction fragment produced by digestion with one or more endonucleases; and (b) a set of primers for PCR amplification, wherein at least one of the adaptor and the primers comprises an identifier sequence.

2. The kit of claim 1, wherein the identifier sequence identifies the origin of a sample.

3. The kit of claim 1, wherein at least one primer of the set of primers comprises at least one selective oligonucleotide at its 3′-end for hybridizing to a subset of restriction fragments.

4. The kit of claim 1, wherein the adaptor comprises the identifier sequence.

5. The kit of claim 1, wherein at least one of the primers comprises the identifier sequence.

6. The kit of claim 1, wherein the adaptor comprises a PCR primer-binding sequence for hybridizing to a PCR primer.

7. The kit of claim 1, wherein the adaptor comprises a sequencing primer-binding sequence for hybridizing to a sequencing primer.

8. The kit of claim 1, wherein the adaptor comprises sequences complementary to sequences attached to a solid support for annealing the adaptor to the solid support.

9. The kit of claim 1, wherein the adaptor comprises sequences complementary to sequences attached to a bead for annealing the adaptor to the bead.

10. The kit of claim 1, wherein at least one of the primers is phosphorylated.

11. The kit of claim 1, wherein the kit further comprises a biotinylated capture oligonucleotide hybridization probe for capturing a subset of nucleic acid restriction fragments.

12. The kit of claim 1, wherein the kit further comprises a polymerase for a polymerase chain reaction (PCR).

13. The kit of claim 12, wherein the polymerase substantially lacks 3′-5′ exonuclease activity.

14. The kit of claim 1, wherein the adaptor comprises a non-ligatable 5′-end.

15. The kit of claim 1, wherein the adaptor comprises two at least partly complementary synthetic oligonucleotides.

16. The kit of claim 1, wherein the length of the adaptor is between 10 and 30 base pairs.

17. The kit of claim 1, wherein the adaptor comprises a PCR primer-binding sequence for hybridizing to a primer of the set of primers for PCR amplification.

18. The kit of claim 1, further comprising one or more endonucleases for producing a plurality of nucleic acid fragments having protruding ends.

19. The kit of claim 18, comprising the endonuclease EcoRI and/or MseI.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0128] FIG. 1A shows the primer sequences for preamplification of PSP-11 and PI201234. FIG. 1A also shows a fragment according to the present invention annealed onto a bead (‘454 bead’) and the sequence of the primer used for pre-amplification of the two pepper lines. ‘DNA fragment’ denotes the fragment obtained after digestion with a restriction endonuclease, ‘keygene adaptor’ denotes an adaptor providing an annealing site for the (phosphorylated) oligonucleotide primers used to generate a library, ‘KRS’ denotes an identifier sequence (tag), ‘454 SEQ. Adaptor’ denotes a sequencing adaptor, and ‘454 PCR adaptor’ denotes an adaptor to allow for emulsion amplification of the DNA fragment. The PCR adaptor allows for annealing to the bead and for amplification and may contain a 3′-T overhang.

[0129] FIG. 1B shows a schematic primer used in the complexity reduction step. Such a primer generally comprises a recognition site region indicated as (2), a constant region that may include a tag section indicated as (1), and one or more selective nucleotides in a selective region indicated as (3) at the 3′-end thereof.

[0130] FIGS. 2A and 2B show DNA concentration estimation using 2% agarose gel-electrophoresis. S1 denotes PSP11; S2 denotes PI201234. 50, 100, 250 and 500 ng denote, respectively, the 50 ng, 100 ng, 250 ng and 500 ng references used to estimate the DNA amounts of S1 and S2. FIGS. 2C and 2D show DNA concentration determination using Nanodrop spectrophotometry.

[0131] FIGS. 3A and 3B show the results of intermediate quality assessments of example 3. FIG. 3C shows DNA concentrations of each sample noted using Nanodrop.

[0132] FIG. 4A shows flow charts of the sequence data processing pipeline, i.e. the steps taken from the generation of the sequencing data to the identification of putative SNPs, SSRs and indels, via steps of the removal of known sequence information in Trimming & Tagging resulting in trimmed sequence data which are clustered and assembled to yield contigs and singletons (fragments that cannot be assembled in a contig) after which putative polymorphisms can be identified and assessed. FIG. 4B further elaborates on the process of polymorphisms mining.

[0133] FIGS. 5A, 5B and 5C address the problem of mixed tags. Panel 1 provides an example of a mixed tag, carrying tags associated with sample 1 (MS1) and sample 2 (MS2). Panel 2 provides a schematic explanation of the phenomenon. AFLP restriction fragments derived from sample 1 (S1) and from sample 2 (S2) are ligated with adaptors (“Keygene adaptor”) on both sides carrying sample-specific tags S1 and S2. After amplification and sequencing, the expected fragments are those with S1-S1 tags and S2-S2 tags. Additionally and unexpectedly, fragments that carry S1-S2 or S2-S1 tags are also observed. Panel 3 explains the hypothesized cause of the generation of mixed tags, whereby heteroduplex products are formed from fragments from samples 1 and 2. The heteroduplexes are subsequently rendered free from the 3′-protruding ends by the 3′-5′ exonuclease activity of T4 DNA polymerase or Klenow. During polymerization, the gaps are filled with nucleotides and the incorrect tag is introduced. This works for heteroduplexes of about the same length (top panel) but also for heteroduplexes of more varying length. Panel 4 shows the conventional protocol, which leads to the formation of mixed tags, alongside the modified protocol.

[0134] FIGS. 6A, 6B and 6C address the problem of concatamer formation. Panel 1 gives a typical example of a concatamer, in which the various adaptor and tag sections are underlined and labelled with their origin (i.e. MS1, MS2, ES1 and ES2, corresponding respectively to a MseI restriction site-adaptor from sample 1, a MseI restriction site-adaptor from sample 2, an EcoRI restriction site-adaptor from sample 1 and an EcoRI restriction site-adaptor from sample 2). Panel 2 demonstrates the expected fragments carrying S1-S1 tags and S2-S2 tags and the observed but unexpected S1-S1-S2-S2 fragment, being a concatamer of fragments from sample 1 and from sample 2. Panel 3 shows a solution to avoid the generation of concatamers as well as mixed tags by introducing an overhang in the AFLP adaptors, using modified sequencing adaptors and omitting the end-polishing step when ligating sequencing adaptors. No concatamer formation is found because the AFLP fragments cannot ligate to each other, and no mixed fragments occur because the end-polishing step is omitted. Panel 4 provides the modified protocol using modified adaptors to avoid concatamer formation as well as mixed tags.

[0135] FIG. 7. Multiple alignment “10037 CL989contig2” of pepper AFLP fragment sequences, containing a putative single nucleotide polymorphism (SNP). Note that the SNP (indicated by the black arrow) is defined by an A allele present in both reads of sample 1 (PSP11), denoted by the presence of the MS1 tag in the name of the top two reads, and a G allele present in sample 2 (PI201234), denoted by the presence of the MS2 tag in the name of the bottom two reads. Read names are shown on the left. The consensus sequence of this multiple alignment is (5′-3′):

TABLE-US-00001 (SEQ ID NO: 47) TAACACGACTTTGAACAAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACA[A/G]TGTTGGTTTTGGTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTG.
FIG. 7 discloses full-length sequences as SEQ ID NOS: 64-68, respectively, in order of appearance.

[0136] FIG. 8A. Schematic representation of enrichment strategy for targeting simple sequence repeats (SSRs) in combination with high throughput sequencing for de novo SSR discovery.

[0137] FIG. 8B. Validation of a G/A SNP in pepper using SNPWAVE (multiplexed SNP genotyping) detection. P1=PSP11; P2=PI201234. Eight RIL offspring are indicated by numbers 1-8.

EXAMPLES

Example 1

[0138] EcoRI/MseI restriction ligation mixture (1) was generated from genomic DNA of the pepper lines PSP-11 and PI201234. The restriction ligation mixture was diluted 10 times and 5 microliters of each sample was pre-amplified (2) with EcoRI+1(A) and MseI+1(C) primers (set I). After amplification, the quality of the pre-amplification product of the two pepper samples was checked on a 1% agarose gel. The pre-amplification products were diluted 20 times, followed by a KRS EcoRI+1(A) and KRS MseI+2(CA) AFLP pre-amplification. The KRS (identifier) sections are underlined and the selective nucleotides are in bold at the 3′-end in the primer sequences SEQ ID NOS: 1-4 below. After amplification, the quality of the pre-amplification product of the two pepper samples was checked on a 1% agarose gel and by an EcoRI+3(A) and MseI+3(C) (3) AFLP fingerprint (4). The pre-amplification products of the two pepper lines were separately purified on a Qiagen PCR column (5). The concentration of the samples was measured on the Nanodrop. A total of 5006.4 ng PSP-11 and 5006.4 ng PI201234 were mixed and sequenced.

TABLE-US-00002
Primer set I, used for preamplification of PSP-11
  E01LKRS1  [SEQ ID NO: 1]  5′-CGTCAGACTGCGTACCAATTCA-3′
  M15KKRS1  [SEQ ID NO: 2]  5′-TGGTGATGAGTCCTGAGTAACA-3′
Primer set II, used for preamplification of PI201234
  E01LKRS2  [SEQ ID NO: 3]  5′-CAAGAGACTGCGTACCAATTCA-3′
  M15KKRS2  [SEQ ID NO: 4]  5′-AGCCGATGAGTCCTGAGTAACA-3′

[0139] (1) EcoRI/MseI Restriction Ligation Mixture

TABLE-US-00003
Restriction mix (40 μl/sample)
  DNA            6 μl (±300 ng)
  EcoRI (5 U)    0.1 μl
  MseI (2 U)     0.05 μl
  5×RL           8 μl
  MQ             25.85 μl
  Total          40 μl
Incubation for 1 h at 37° C.

[0140] Addition of:

TABLE-US-00004
Ligation mix (10 μl/sample)
  10 mM ATP                      1 μl
  T4 DNA ligase                  1 μl
  EcoRI adapt. (5 pmol/μl)       1 μl
  MseI adapt. (50 pmol/μl)       1 μl
  5×RL                           2 μl
  MQ                             4 μl
  Total                          10 μl
Incubation for 3 h at 37° C.

TABLE-US-00005
EcoRI-adaptor 91M35/91M36:
  91M35  [SEQ ID NO: 5]  *-CTCGTAGACTGCGTACC
  91M36  [SEQ ID NO: 6]  ±bio CATCTGACGCATGGTTAA
MseI-adaptor 92A18/92A19:
  92A18  [SEQ ID NO: 7]  5′-GACGATGAGTCCTGAG-3′
  92A19  [SEQ ID NO: 8]  3′-TACTCAGGACTCAT-5′
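
As a quick arithmetic cross-check of the restriction and ligation mixes above, the listed component volumes can be summed and compared with the stated totals. This is only a minimal sketch: the dictionary and function names are illustrative and not part of the protocol, and decimal arithmetic is used solely to avoid floating-point rounding.

```python
from decimal import Decimal

# Volumes in microlitres, copied from the restriction and ligation mixes above.
restriction_mix = {
    "DNA": "6", "EcoRI": "0.1", "MseI": "0.05", "5xRL": "8", "MQ": "25.85",
}
ligation_mix = {
    "10 mM ATP": "1", "T4 DNA ligase": "1", "EcoRI adaptor": "1",
    "MseI adaptor": "1", "5xRL": "2", "MQ": "4",
}


def total_volume(mix):
    """Sum the component volumes of one mix exactly."""
    return sum(Decimal(volume) for volume in mix.values())


restriction_total = total_volume(restriction_mix)  # stated total: 40 ul
ligation_total = total_volume(ligation_mix)        # stated total: 10 ul
print(restriction_total, ligation_total)
```

The same helper could be pointed at any of the other reaction mixes in these examples.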

[0141] (2) Pre-Amplification

TABLE-US-00006
Preamplification (A/C):
  RL-mix (10×)                  5 μl
  EcoRI-pr E01L (50 ng/μl)      0.6 μl
  MseI-pr M02K (50 ng/μl)       0.6 μl
  dNTPs (25 mM)                 0.16 μl
  Taq.pol. (5 U)                0.08 μl
  10×PCR                        2.0 μl
  MQ                            11.56 μl
  Total                         20 μl/reaction

[0142] Pre-Amplification Thermal Profile

[0143] Selective pre-amplification was done in a reaction volume of 50 μl. The PCR was performed in a PE GeneAmp PCR System 9700 using a 20-cycle profile, each cycle consisting of a 94° C. denaturation step for 30 seconds, an annealing step of 56° C. for 60 seconds and an extension step of 72° C. for 60 seconds.

TABLE-US-00007
  EcoRI + 1(A).sup.1  E01L 92R11:  [SEQ ID NO: 9]   5′-AGACTGCGTACCAATTCA-3′
  MseI + 1(C).sup.1   M02K 93E42:  [SEQ ID NO: 10]  5′-GATGAGTCCTGAGTAAC-3′

[0144] Preamplification A/CA:

[0145] PA+1/+1-mix (20×): 5 μl

[0146] EcoRI-pr.: 1.5 μl

[0147] MseI-pr.: 1.5 μl

[0148] dNTPs (25 mM): 0.4 μl

[0149] Taq.pol. (5 U): 0.2 μl

[0150] 10×PCR: 5 μl

[0151] MQ: 36.3 μl

[0152] Total: 50 μl

[0153] Selective pre-amplification was done in a reaction volume of 50 μl. The PCR was performed in a PE GeneAmp PCR System 9700 using a 30-cycle profile, each cycle consisting of a 94° C. denaturation step for 30 seconds, an annealing step of 56° C. for 60 seconds and an extension step of 72° C. for 60 seconds.

[0154] (3) KRS EcoRI+1(A) and KRS MseI+2(CA).sup.2

TABLE-US-00008
  05F212  E01LKRS1  [SEQ ID NO: 11]  5′-CGTCAGACTGCGTACCAATTCA-3′
  05F213  E01LKRS2  [SEQ ID NO: 12]  5′-CAAGAGACTGCGTACCAATTCA-3′
  05F214  M15KKRS1  [SEQ ID NO: 13]  5′-TGGTGATGAGTCCTGAGTAACA-3′
  05F215  M15KKRS2  [SEQ ID NO: 14]  5′-AGCCGATGAGTCCTGAGTAACA-3′

[0155] selective nucleotides in bold and tags (KRS) underlined

[0156] Sample PSP11: E01LKRS1/M15KKRS1

[0157] Sample PI201234: E01LKRS2/M15KKRS2

[0158] (4) AFLP Protocol

[0159] Selective amplification was done in a reaction volume of 20 μl. The PCR was performed in a PE GeneAmp PCR System 9700. The first 13 cycles consisted of a 94° C. denaturation step for 30 seconds, an annealing step starting at 65° C. for 30 seconds (a touchdown phase in which the annealing temperature was lowered by 0.7° C. in each cycle) and an extension step of 72° C. for 60 seconds. This profile was followed by a 23-cycle profile with a 94° C. denaturation step for 30 seconds, an annealing step of 56° C. for 30 seconds and an extension step of 72° C. for 60 seconds.

TABLE-US-00009
EcoRI + 3(AAC) and MseI + 3(CAG)
  E32 92S02:  [SEQ ID NO: 15]  5′-GACTGCGTACCAATTCAAC-3′
  M49 92G23:  [SEQ ID NO: 16]  5′-GATGAGTCCTGAGTAACAG-3′
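
The touchdown profile of the AFLP protocol above can be written out cycle by cycle to see how the annealing temperature approaches 56° C. The sketch below is illustrative only: it assumes that the first cycle anneals at 65° C. and that every later touchdown cycle is 0.7° C. lower than the previous one, which the text does not state explicitly, and the function name is hypothetical.

```python
from decimal import Decimal


def touchdown_profile(start=Decimal("65"), step=Decimal("0.7"),
                      touchdown_cycles=13, final=Decimal("56"),
                      final_cycles=23):
    """Return the annealing temperature (deg C) of every cycle, in order."""
    # Assumption: cycle 1 anneals at `start`; each later touchdown cycle
    # is `step` lower than the cycle before it.
    touchdown = [start - step * i for i in range(touchdown_cycles)]
    return touchdown + [final] * final_cycles


profile = touchdown_profile()
print(len(profile), profile[0], profile[12], profile[13])
```

Under this assumption the last touchdown cycle anneals slightly above the 56° C. used in the remaining 23 cycles, giving 36 cycles in total.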

[0160] (5) Qiagen Column

[0161] Qiagen purification was performed according to the manufacturer's instructions: QIAquick® Spin Handbook.

Example 2: Pepper

[0162] DNA from the pepper lines PSP-11 and PI201234 was used to generate AFLP product by use of AFLP Keygene Recognition Site specific primers. (These AFLP primers are essentially the same as conventional AFLP primers, e.g. described in EP 0 534 858, and will generally contain a recognition site region, a constant region and one or more selective nucleotides in a selective region.)

From the pepper lines PSP-11 or PI201234, 150 ng of DNA was digested with the restriction endonucleases EcoRI (5 U/reaction) and MseI (2 U/reaction) for 1 hour at 37° C., followed by inactivation for 10 minutes at 80° C. The obtained restriction fragments were ligated with double-stranded synthetic oligonucleotide adaptor, one end of which is compatible with one or both of the ends of the EcoRI and/or MseI restriction fragments. AFLP preamplification reactions (20 μl/reaction) with +1/+1 AFLP primers were performed on 10 times diluted restriction-ligation mixture. PCR profile: 20*(30 s at 94° C. + 60 s at 56° C. + 120 s at 72° C.). Additional AFLP reactions (50 μl/reaction) with different +1 EcoRI and +2 MseI AFLP Keygene Recognition Site specific primers (Table below; tags are in bold, selective nucleotides are underlined) were performed on 20 times diluted +1/+1 EcoRI/MseI AFLP preamplification product. PCR profile: 30*(30 s at 94° C. + 60 s at 56° C. + 120 s at 72° C.). The AFLP product was purified using the QIAquick PCR Purification Kit (QIAGEN) following the QIAquick® Spin Handbook 07/2002 page 18, and the concentration was measured with a NanoDrop® ND-1000 Spectrophotometer. A total of 5 μg of +1/+2 PSP-11 AFLP product and 5 μg of +1/+2 PI201234 AFLP product was combined and dissolved in 23.3 μl TE. Finally, a mixture with a concentration of 430 ng/μl of +1/+2 AFLP product was obtained.

TABLE-US-00010
TABLE
  SEQ ID            PCR primer  Primer sequence (5′-3′)   Pepper  AFLP reaction
  [SEQ ID NO: 17]   05F21       CGTCAGACTGCGTACCAATTCA    PSP-    1
  [SEQ ID NO: 18]   05F21       TGGTGATGAGTCCTGAGTAACA    PSP-    1
  [SEQ ID NO: 19]   05F21       CAAGAGACTGCGTACCAATTCA    PI2023  2
  [SEQ ID NO: 20]   05F21       AGCCGATGAGTCCTGAGTAACA    PI2023  2

Example 3: Maize

[0163] DNA from the maize lines B73 and Mo17 was used to generate AFLP product by use of AFLP Keygene Recognition Site specific primers. (These AFLP primers are essentially the same as conventional AFLP primers, e.g. described in EP 0 534 858, and will generally contain a recognition site region, a constant region and one or more selective nucleotides at the 3′-end thereof.)

[0164] DNA from the maize lines B73 or Mo17 was digested with the restriction endonucleases TaqI (5 U/reaction) for 1 hour at 65° C. and MseI (2 U/reaction) for 1 hour at 37° C., followed by inactivation for 10 minutes at 80° C. The obtained restriction fragments were ligated with double-stranded synthetic oligonucleotide adaptor, one end of which is compatible with one or both of the ends of the TaqI and/or MseI restriction fragments.

[0165] AFLP preamplification reactions (20 μl/reaction) with +1/+1 AFLP primers were performed on 10 times diluted restriction-ligation mixture. PCR profile: 20*(30 s at 94° C. + 60 s at 56° C. + 120 s at 72° C.). Additional AFLP reactions (50 μl/reaction) with different +2 TaqI and MseI AFLP Keygene Recognition Site primers (Table below; tags are in bold, selective nucleotides are underlined) were performed on 20 times diluted +1/+1 TaqI/MseI AFLP preamplification product. PCR profile: 30*(30 s at 94° C. + 60 s at 56° C. + 120 s at 72° C.). The AFLP product was purified using the QIAquick PCR Purification Kit (QIAGEN) following the QIAquick® Spin Handbook 07/2002 page 18, and the concentration was measured with a NanoDrop® ND-1000 Spectrophotometer. A total of 1.25 μg of each different B73 +2/+2 AFLP product and 1.25 μg of each different Mo17 +2/+2 AFLP product was combined and dissolved in 30 μl TE. Finally, a mixture with a concentration of 333 ng/μl of +2/+2 AFLP product was obtained.

TABLE-US-00011
TABLE
  SEQ ID            PCR primer  Primer sequence          Maize  AFLP reaction
  [SEQ ID NO: 21]   05G360      ACGTGTAGACTGCGTACCGAAA   B73    1
  [SEQ ID NO: 22]   05G368      ACGTGATGAGTCCTGAGTAACA   B73    1
  [SEQ ID NO: 23]   05G362      CGTAGTAGACTGCGTACCGAAC   B73    2
  [SEQ ID NO: 24]   05G370      CGTAGATGAGTCCTGAGTAACA   B73    2
  [SEQ ID NO: 25]   05G364      GTACGTAGACTGCGTACCGAAG   B73    3
  [SEQ ID NO: 26]   05G372      GTACGATGAGTCCTGAGTAACA   B73    3
  [SEQ ID NO: 27]   05G366      TACGGTAGACTGCGTACCGAAT   B73    4
  [SEQ ID NO: 28]   05G374      TACGGATGAGTCCTGAGTAACA   B73    4
  [SEQ ID NO: 29]   05G361      AGTCGTAGACTGCGTACCGAAA   Mo17   5
  [SEQ ID NO: 30]   05G369      AGTCGATGAGTCCTGAGTAACA   Mo17   5
  [SEQ ID NO: 31]   05G363      CATGGTAGACTGCGTACCGAAC   Mo17   6
  [SEQ ID NO: 32]   05G371      CATGGATGAGTCCTGAGTAACA   Mo17   6
  [SEQ ID NO: 33]   05G365      GAGCGTAGACTGCGTACCGAAG   Mo17   7
  [SEQ ID NO: 34]   05G373      GAGCGATGAGTCCTGAGTAACA   Mo17   7
  [SEQ ID NO: 35]   05G367      TGATGTAGACTGCGTACCGAAT   Mo17   8
  [SEQ ID NO: 36]   05G375      TGATGATGAGTCCTGAGTAACA   Mo17   8
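
Because the tags are what later assigns each read to its AFLP reaction, it can be useful to check how well separated the eight tags in the table above are. The sketch below is not part of the described method; it assumes that the tag is the first four bases of each primer (the bold formatting that marks the tags is not reproduced here) and looks only at the TaqI-side primers.

```python
from itertools import combinations

TAG_LENGTH = 4  # assumption: the tag is the first four bases of each primer

# TaqI-side primers copied from the table above (one per AFLP reaction).
taq_primers = {
    "05G360": "ACGTGTAGACTGCGTACCGAAA",
    "05G362": "CGTAGTAGACTGCGTACCGAAC",
    "05G364": "GTACGTAGACTGCGTACCGAAG",
    "05G366": "TACGGTAGACTGCGTACCGAAT",
    "05G361": "AGTCGTAGACTGCGTACCGAAA",
    "05G363": "CATGGTAGACTGCGTACCGAAC",
    "05G365": "GAGCGTAGACTGCGTACCGAAG",
    "05G367": "TGATGTAGACTGCGTACCGAAT",
}

tags = {name: seq[:TAG_LENGTH] for name, seq in taq_primers.items()}


def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))


# Smallest pairwise distance: how many base errors are needed to turn
# one tag into another.
min_distance = min(hamming(a, b) for a, b in combinations(tags.values(), 2))
print(tags, min_distance)
```

A larger minimum distance would make tag assignment more tolerant of sequencing errors within the tag.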

[0166] Finally, the 4 P1-samples and the 4 P2-samples were pooled and concentrated. A total amount of 25 μl of DNA product and a final concentration of 400 ng/μl (total of 10 μg) was obtained. Intermediate quality assessments are given in FIGS. 3A-3C.

Sequencing by 454

[0167] Pepper and maize AFLP fragment samples prepared as described hereinbefore were processed by 454 Life Sciences as described (Margulies et al., 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437 (7057):376-80. Epub Jul. 31, 2005).

Data Processing

Processing Pipeline:

Input Data

[0168] Raw sequence data were received for each run:

[0169] 200,000-400,000 reads

[0170] base calling quality scores

Trimming and Tagging

[0171] These sequence data are analyzed for the presence of Keygene Recognition Sites (KRS) at the beginning and end of the read. These KRS sequences consist of both AFLP-adaptor and sample label sequence and are specific for a certain AFLP primer combination on a certain sample. The KRS sequences are identified by BLAST and trimmed and the restriction sites are restored. Reads are marked with a tag for identification of the KRS origin. Trimmed sequences are selected on length (minimum of 33 nt) to participate in further processing.
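
The trimming and tagging step can be illustrated with a much simplified sketch. It is not the pipeline described here: exact prefix matching stands in for the BLAST search, only the start of a read is examined, and the split of each pepper primer (SEQ ID NOS: 1-4) into a removable tag-plus-adaptor prefix and a single restored base (G for EcoRI, T for MseI) is an assumption made for illustration. The labels follow the ES1/MS1/ES2/MS2 naming used for the pepper samples.

```python
MIN_LENGTH = 33  # minimum trimmed read length stated in the text

# label -> (tag + adaptor prefix to strip, base that restores the site)
# Assumption: the prefix ends just before the AATTC / TAA remnant of the site.
KRS_PREFIXES = {
    "ES1": ("CGTCAGACTGCGTACC", "G"),   # EcoRI side, sample 1 -> GAATTC...
    "ES2": ("CAAGAGACTGCGTACC", "G"),   # EcoRI side, sample 2
    "MS1": ("TGGTGATGAGTCCTGAG", "T"),  # MseI side, sample 1 -> TTAA...
    "MS2": ("AGCCGATGAGTCCTGAG", "T"),  # MseI side, sample 2
}


def trim_and_tag(read):
    """Return (label, trimmed_read), or None if no KRS prefix matches
    or the trimmed read is shorter than MIN_LENGTH."""
    for label, (prefix, restore) in KRS_PREFIXES.items():
        if read.startswith(prefix):
            trimmed = restore + read[len(prefix):]
            return (label, trimmed) if len(trimmed) >= MIN_LENGTH else None
    return None


example = trim_and_tag("CGTCAGACTGCGTACCAATTCA" + "ACGT" * 10)
print(example)
```

A production implementation would also need to tolerate sequencing errors in the KRS and to trim the opposite end of reads that span a whole fragment.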

Clustering and Assembly

[0172] A MegaBlast analysis is performed on all size-selected, trimmed reads to obtain clusters of homologous sequences. Subsequently, all clusters are assembled with CAP3 to result in assembled contigs. From both steps, unique sequence reads are identified that do not match any other reads. These reads are marked as singletons.

[0173] The processing pipeline carrying out the steps described hereinbefore is shown in FIG. 4A.

Polymorphism Mining and Quality Assessment

[0174] The resulting contigs from the assembly analysis form the basis of polymorphism detection. Each ‘mismatch’ in the alignment of each cluster is a potential polymorphism. Selection criteria are defined to obtain a quality score:

[0175] number of reads per contig

[0176] frequency of ‘alleles’ per sample

[0177] occurrence of homopolymer sequence

[0178] occurrence of neighbouring polymorphisms

[0179] SNPs and indels with a quality score above the threshold are identified as putative polymorphisms. For SSR mining we used the MISA (MIcroSAtellite identification) tool. This tool identifies di-, tri-, tetranucleotide and compound SSR motifs with predefined criteria and summarizes occurrences of these SSRs.

[0180] The polymorphism mining and quality assignment process is shown in FIG. 4B.
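
To make the idea of SSR mining concrete, the following sketch finds perfect di-, tri- and tetranucleotide repeats with regular expressions. It is not the MISA tool and does not reproduce its criteria: the minimum repeat counts are hypothetical placeholders, compound SSRs are not handled, and the function name is illustrative.

```python
import re

# Hypothetical thresholds: unit length -> minimum number of repeats.
MIN_REPEATS = {2: 6, 3: 5, 4: 5}


def find_ssrs(seq):
    """Return sorted (start, unit, repeat_count) tuples for perfect SSRs."""
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves repeats of a shorter motif,
            # e.g. "AA" (homopolymer) or "ATAT" (really a dinucleotide).
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


ssrs = find_ssrs("GGC" + "AT" * 7 + "CCG")
print(ssrs)
```

Changing `MIN_REPEATS` changes what counts as an SSR, so any counts obtained this way would not be directly comparable with those reported here.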

Results

[0181] The table below summarizes the results of the combined analysis of sequences obtained from two 454 sequence runs for the combined pepper samples and two runs for the combined maize samples.

TABLE-US-00012
                                                                  Pepper   Maize
  Total number of reads                                           457178   492145
  Number of trimmed reads                                         399623   411008
  Number of singletons                                            105253   313280
  Number of contigs                                               31863    14588
  Number of reads in contigs                                      294370   97728
  Total number of sequences containing SSRs                       611      202
  Number of different SSR-containing sequences                    104      65
  Number of different SSR motifs (di, tri, tetra and compound)    49       40
  Number of SNPs with Q score ≥ 0.3 *                             1636     782
  Number of indels *                                              4090     943
* both with selection against neighboring SNPs, at least 12 bp flanking sequence and not occurring in homopolymer sequences larger than 3 nucleotides.
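
The read counts in the table above can be checked for internal consistency: every trimmed read should end up either in a contig or as a singleton. The short sketch below performs that check and also reports the share of trimmed reads that assembled into contigs; the variable and function names are illustrative only.

```python
# Counts copied from the results table above.
counts = {
    "pepper": {"trimmed": 399623, "singletons": 105253, "in_contigs": 294370},
    "maize": {"trimmed": 411008, "singletons": 313280, "in_contigs": 97728},
}


def is_balanced(c):
    """True if singletons plus reads in contigs account for all trimmed reads."""
    return c["singletons"] + c["in_contigs"] == c["trimmed"]


def fraction_in_contigs(c):
    """Share of trimmed reads that were assembled into contigs."""
    return c["in_contigs"] / c["trimmed"]


for species, c in counts.items():
    print(species, is_balanced(c), round(100 * fraction_in_contigs(c), 1))
```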

Example 4. Single Nucleotide Polymorphism (SNP) Discovery in Pepper

[0182] DNA Isolation

[0183] Genomic DNA was isolated from the two parental lines of a pepper recombinant inbred line (RIL) population and 10 RIL progeny. The parental lines are PSP11 and PI201234. Genomic DNA was isolated from leaf material of individual seedlings using a modified CTAB procedure described by Stuart and Via (Stuart, C. N., Jr and Via, L. E. (1993) A rapid CTAB DNA isolation technique useful for RAPD fingerprinting and other PCR applications. Biotechniques, 14, 748-750). DNA samples were diluted to a concentration of 100 ng/μl in TE (10 mM Tris-HCl pH 8.0, 1 mM EDTA) and stored at −20° C.

[0184] AFLP Template Preparation Using Tagged AFLP Primers

[0185] AFLP templates of the pepper parental lines PSP11 and PI201234 were prepared using the restriction endonuclease combination EcoRI/MseI as described by Zabeau & Vos (Zabeau & Vos, 1993: Selective restriction fragment amplification; a general method for DNA fingerprinting. EP 0534858-A1, B1; U.S. Pat. No. 6,045,994) and Vos et al. (Vos, P., Hogers, R., Bleeker, M., Reijans, M., van de Lee, T., Hornes, M., Frijters, A., Pot, J., Peleman, J., Kuiper, M. et al. (1995) AFLP: a new technique for DNA fingerprinting. Nucl. Acids Res., 21, 4407-4414).

[0186] Specifically, restriction of genomic DNA with EcoRI and MseI was carried out as follows:

[0187] DNA Restriction

TABLE-US-00013
  DNA             100-500 ng
  EcoRI           5 units
  MseI            2 units
  5×RL buffer     8 μl
  MilliQ water    to 40 μl

[0188] Incubation was for 1 hour at 37° C. After the enzyme restriction, enzymes were inactivated by incubation for 10 minutes at 80° C.

[0189] Ligation of Adapters

TABLE-US-00014
  10 mM ATP                       1 μl
  T4 DNA ligase                   1 μl
  EcoRI adaptor (5 pmol/μl)       1 μl
  MseI adaptor (50 pmol/μl)       1 μl
  5×RL buffer                     2 μl
  MilliQ water                    to 40 μl
Incubation was for 3 hours at 37° C.

[0190] Selective AFLP Amplification

[0191] Following restriction-ligation, the restriction/ligation reaction was diluted 10-fold with T.sub.10E.sub.0.1 and 5 μl of diluted mix was used as a template in a selective amplification step. Note that since a +1/+2 selective amplification was intended, first a +1/+1 selective pre-amplification step (with standard AFLP primers) was performed. Reaction conditions of the +1/+1 (+A/+C) amplification were as follows.

TABLE-US-00015
  Restriction-Ligation mix (10-fold diluted)    5 μl
  EcoRI-primer + 1 (50 ng/μl)                   0.6 μl
  MseI-primer + 1 (50 ng/μl)                    0.6 μl
  dNTPs (20 mM)                                 0.2 μl
  Taq.polymerase (5 U/μl Amplitaq, PE)          0.08 μl
  10×PCR buffer                                 2.0 μl
  MilliQ water                                  to 20 μl

TABLE-US-00016
Primer sequences were:
  EcoRI + 1: [SEQ ID NO: 9] 5′-AGACTGCGTACCAATTCA-3′ and
  MseI + 1: [SEQ ID NO: 10] 5′-GATGAGTCCTGAGTAAC-3′

[0192] PCR amplifications were performed using a PE9700 with a gold or silver block using the following conditions: 20 times (30 s at 94° C., 60 s at 56° C. and 120 s at 72° C.).

[0193] The quality of the generated +1/+1 preamplification products was checked on a 1% agarose gel using a 100 basepair ladder and a 1 Kb ladder to check the fragment length distribution. Following +1/+1 selective amplification, the reaction was diluted 20-fold with T.sub.10E.sub.0.1 and 5 μl of diluted mix was used as a template in the +1/+2 selective amplification step using tagged AFLP primers.

[0194] Finally, +1/+2 (+A/+CA) selective AFLP amplifications were performed:

TABLE-US-00017
  +1/+1 selective amplification product (20-fold diluted)    5.0 μl
  KRS EcoRI-primer +A (50 ng/μl)                             1.5 μl
  KRS MseI-primer +CA (50 ng/μl)                             1.5 μl
  dNTPs (20 mM)                                              0.5 μl
  Taq polymerase (5 U/μl Amplitaq, Perkin Elmer)             0.2 μl
  10× PCR buffer                                             5.0 μl
  MQ                                                         to 50 μl
Tagged AFLP primer sequences were:

TABLE-US-00018
PSP11:
  05F212: EcoRI + 1: [SEQ ID NO: 1] 5′-CGTCAGACTGCGTACCAATTCA-3′ and
  05F214: MseI + 2: [SEQ ID NO: 2] 5′-TGGTGATGAGTCCTGAGTAACA-3′
PI201234:
  05F213: EcoRI + 1: [SEQ ID NO: 3] 5′-CAAGAGACTGCGTACCAATTCA-3′ and
  05F215: MseI + 2: [SEQ ID NO: 4] 5′-AGCCGATGAGTCCTGAGTAACA-3′

[0195] Note that these primers contain 4 bp tags (underlined above) at their 5 prime ends to distinguish amplification products originating from the respective pepper lines at the end of the sequencing process.

[0196] Schematic representation of pepper AFLP+1/+2 amplification products after amplification with AFLP primers containing 4 bp 5 prime tag sequences.

TABLE-US-00019
              EcoRI tag                                MseI tag
  PSP11:      5′-CGTC ------------------------------- ACCA-3′
              3′-GCAG ------------------------------- TGGT-5′
  PI201234:   5′-CAAG ------------------------------- GGCT-3′
              3′-GTTC ------------------------------- CCGA-5′
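
In the schematic above, the MseI tag appears on the top strand as the reverse complement of the tag carried by the MseI primer, because that primer anneals to the opposite strand. A small sketch (the helper name is illustrative) makes this relationship explicit for the two pepper samples:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


# 4 bp tags at the 5' end of the MseI primers (SEQ ID NOS: 2 and 4).
mse_tags = {"PSP11": "TGGT", "PI201234": "AGCC"}

# What a read of the top strand shows at its 3' end.
top_strand_3prime = {line: reverse_complement(tag)
                     for line, tag in mse_tags.items()}
print(top_strand_3prime)
```

A read that runs through a whole fragment from the EcoRI side would therefore show the MseI tag in this reverse-complemented form at its far end.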

[0197] PCR amplifications (24 per sample) were performed using a PE9700 with a gold or silver block using the following conditions: 30 times (30 s at 94° C. +60 s at 56° C. +120 s at 72° C.).

[0198] The quality of the generated amplification products was checked on a 1% agarose gel using a 100 basepair ladder and a 1 Kb ladder to check the fragment length distribution.

[0199] AFLP Reaction Purification and Quantification.

[0200] After pooling two 50 microliter +1/+2 selective AFLP reactions per pepper sample, the resulting twelve 100 μl AFLP reaction products were purified using the QIAquick PCR Purification Kit

[0201] (QIAGEN), following the QIAquick® Spin handbook (Page 18). On each column a maximum of 100 μl product was loaded. Amplified products were eluted in T.sub.10E.sub.0.1. The quality of the purified products was checked on a 1% agarose gel and concentrations were measured on the Nanodrop (FIGS. 2A and 2B).

[0202] Nanodrop concentration measurements were used to adjust the final concentration of each purified PCR product to 300 nanograms per microliter. Five micrograms of purified amplified product of PSP11 and 5 micrograms of PI201234 were mixed to generate 10 micrograms of template material for preparation of the 454 sequencing library.

[0203] Sequence Library Preparation and High-Throughput Sequencing

[0204] Mixed amplification products from both pepper lines were subjected to high-throughput sequencing using 454 Life Sciences sequencing technology as described by Margulies et al., (Margulies et al., Nature 437, pp. 376-380 and Online Supplements). Specifically, the AFLP PCR products were first end-polished and subsequently ligated to adaptors to facilitate emulsion-PCR amplification and subsequent fragment sequencing as described by Margulies and co-workers. 454 adaptor sequences, emulsion PCR primers, sequence-primers and sequence run conditions were all as described by Margulies and co-workers. The linear order of functional elements in an emulsion-PCR fragment amplified on Sepharose beads in the 454 sequencing process was as follows as exemplified in FIG. 1A:

[0205] 454 PCR adaptor - 454 sequence adaptor - 4 bp AFLP primer tag 1 - AFLP primer sequence 1 including selective nucleotide(s) - AFLP fragment internal sequence - AFLP primer sequence 2 including selective nucleotide(s) - 4 bp AFLP primer tag 2 - 454 sequence adaptor - 454 PCR adaptor - Sepharose bead

[0206] Two high-throughput 454 sequence runs were performed by 454 Life Sciences (Branford, Conn.; United States of America).

[0207] 454 Sequence Run Data-Processing.

[0208] Sequence data resulting from two 454 sequence runs were processed using a bio-informatics pipeline (Keygene N.V.). Specifically, raw 454 basecalled sequence reads were converted to FASTA format and inspected for the presence of tagged AFLP adaptor sequences using a BLAST algorithm. Upon high-confidence matches to the known tagged AFLP primer sequences, sequences were trimmed, restriction endonuclease sites were restored, and the appropriate tags were assigned (sample 1 EcoRI (ES1), sample 1 MseI (MS1), sample 2 EcoRI (ES2) or sample 2 MseI (MS2), respectively). Next, all trimmed sequences larger than 33 bases were clustered using a megaBLAST procedure based on overall sequence homologies. Next, clusters were assembled into one or more contigs and/or singletons per cluster, using a CAP3 multiple alignment algorithm. Contigs containing more than one sequence were inspected for sequence mismatches, representing putative polymorphisms. Sequence mismatches were assigned quality scores based on the following criteria:

[0209] the number of reads in a contig

[0210] the observed allele distribution

[0211] The above two criteria form the basis for the so-called Q score assigned to each putative SNP/indel. Q scores range from 0 to 1; a Q score of 0.3 can only be reached in case both alleles are observed at least twice.

[0212] location in homopolymers of a certain length (adjustable; default setting to avoid polymorphisms located in homopolymers of 3 bases or longer).

[0213] number of contigs in cluster.

[0214] distance to nearest neighboring sequence mismatches (adjustable; important for certain types of genotyping assays probing flanking sequences)

[0215] the level of association of observed alleles with sample 1 or sample 2; in case of a consistent, perfect association between the alleles of a putative polymorphism and samples 1 and 2, the polymorphism (SNP) is indicated as an “elite” putative polymorphism (SNP).

An elite polymorphism is thought to have a high probability of being located in a unique or low-copy genome sequence in case two homozygous lines have been used in the discovery process. Conversely, a weak association of a polymorphism with sample origin bears a high risk of having discovered false polymorphisms arising from alignment of non-allelic sequences in a contig.
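
Two of the criteria above lend themselves to a short illustration: the "elite" test (perfect association of alleles with samples) and the homopolymer filter. The sketch below is a hedged reading of the text, not the actual pipeline; the Q score itself is not reproduced because its formula is not given, and the function names and input format are assumptions.

```python
from collections import defaultdict


def is_elite(observations):
    """observations: iterable of (sample, allele) pairs for one putative SNP.

    Returns True when exactly two samples are seen, each sample carries a
    single allele, and the two samples carry different alleles.
    """
    alleles_by_sample = defaultdict(set)
    for sample, allele in observations:
        alleles_by_sample[sample].add(allele)
    if len(alleles_by_sample) != 2:
        return False
    first, second = alleles_by_sample.values()
    return len(first) == 1 and len(second) == 1 and first != second


def homopolymer_run(seq, pos):
    """Length of the run of identical bases that contains position pos."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(seq) and seq[end + 1] == base:
        end += 1
    return end - start + 1


# Allele pattern described for FIG. 7: two sample 1 reads with A,
# two sample 2 reads with G.
fig7_like = [("MS1", "A"), ("MS1", "A"), ("MS2", "G"), ("MS2", "G")]
print(is_elite(fig7_like), homopolymer_run("ACGGGT", 3))
```

With the default setting described above, a mismatch would be rejected when `homopolymer_run` returns 3 or more at its position.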

[0216] Sequences containing SSR motifs were identified using the MISA search tool.

[0217] Overall statistics of the run are shown in the Table below.

TABLE-US-00020
TABLE Overall statistics of a 454 sequence run for SNP discovery in pepper.

Enzyme combination                                        Run

Trimming
  All reads                                               254308
  Fault                                                   5293 (2%)
  Correct                                                 249015 (98%)
  Concatamers                                             2156 (8.5%)
  Mixed tags                                              1120 (0.4%)

Correct reads
  Trimmed one end                                         240817 (97%)
  Trimmed both ends                                       8198 (3%)
  Number of reads sample 1                                136990 (55%)
  Number of reads sample 2                                112025 (45%)

Clustering
  Number of contigs                                       21918
  Reads in contigs                                        190861
  Average number reads per contig                         8.7

SNP mining
  SNPs with Q score ≥ 0.3 *                               1483
  Indel with Q score ≥ 0.3 *                              3300

SSR mining
  Total number of SSR motifs identified                   359
  Number of reads containing one or more SSR motifs       353
  Number of SSR motif with unit size 1 (homopolymer)      0
  Number of SSR motif with unit size 2                    102
  Number of SSR motif with unit size 3                    240
  Number of SSR motif with unit size 4                    17

* SNP/indel mining criteria were as follows: No neighbouring polymorphisms with Q score larger than 0.1 within 12 bases on either side, not present in homopolymers of 3 or more bases. Mining criteria did not take into account consistent association with sample 1 and 2, i.e. the SNPs and indels are not necessarily elite putative SNPs/indels.
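
The mining criteria in the table footnote (Q score of at least 0.3, no neighbouring polymorphism with Q score above 0.1 within 12 bases on either side, not in a homopolymer of 3 or more bases) can be sketched as a simple filter. The function names, the use of the consensus base to test the homopolymer condition, and the `(position, q_score)` layout for neighbours are assumptions of this sketch, not details taken from the pipeline described above.

```python
def in_homopolymer(seq, pos, min_len=3):
    """True if seq[pos] lies in a run of identical bases of length >= min_len."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end < len(seq) - 1 and seq[end + 1] == base:
        end += 1
    return end - start + 1 >= min_len


def passes_mining_criteria(seq, pos, q, neighbours,
                           min_q=0.3, window=12, neighbour_q=0.1):
    """Apply the footnote criteria to one putative polymorphism.

    seq        : contig consensus sequence (assumed input)
    pos        : 0-based position of the putative polymorphism
    q          : its Q score
    neighbours : list of (position, q_score) for other mismatches in the contig
    """
    if q < min_q:
        return False
    if in_homopolymer(seq, pos):
        return False
    for n_pos, n_q in neighbours:
        if n_pos != pos and abs(n_pos - pos) <= window and n_q > neighbour_q:
            return False
    return True


consensus = "ACGTACGTAAAGT"
print(passes_mining_criteria(consensus, 2, 0.5, [(10, 0.05)]))
```

The thresholds are exposed as keyword arguments because the text describes the homopolymer length and the neighbour distance as adjustable settings.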

[0218] An example of a multiple alignment containing an elite putative single nucleotide polymorphism is shown in FIG. 7.

Example 5. SNP Validation by PCR Amplification and Sanger Sequencing

[0219] In order to validate the putative A/G SNP identified in example 1, a sequence tagged site (STS) assay for this SNP was designed using flanking PCR primers. PCR primer sequences were as follows:

TABLE-US-00021
Primer_1.2f: [SEQ ID NO: 37]
5′-AAACCCAAACTCCCCCAATC-3′, and
Primer_1.2r: [SEQ ID NO: 38]
5′-AGCGGATAACAATTTCACACAGGACATCAGTAGTCACACTGGTACAAAAATAGAGCAAAACAGTAGTG-3′

[0220] Note that primer 1.2r contained an M13 sequence primer binding site and length stuffer at its 5′ end. PCR amplification was carried out using +A/+CA AFLP amplification products of PSP11 and PI201234, prepared as described in Example 4, as template. PCR conditions were as follows:

For 1 PCR reaction the following components were mixed:
5 μl 1/10 diluted AFLP mixture (approx. 10 ng/μl)
5 μl 1 pmol/μl primer 1.2f (diluted directly from a 500 μM stock)
5 μl 1 pmol/μl primer 1.2r (diluted directly from a 500 μM stock)
5 μl PCR mix, consisting of:
  2 μl 10×PCR buffer
  [0221] 1 μl 5 mM dNTPs
  [0222] 1.5 μl 25 mM MgCl.sub.2
  [0223] 0.5 μl H.sub.2O
5 μl Enzyme mix, consisting of:
  0.5 μl 10×PCR buffer (Applied Biosystems)
  [0224] 0.1 μl 5 U/μl AmpliTaq DNA polymerase (Applied Biosystems)
  [0225] 4.4 μl H.sub.2O
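
As a check on the recipe above, the final concentrations in the assembled reaction can be worked out with a few lines of arithmetic. The sketch assumes that the five 5 μl components are simply combined into one 25 μl reaction and that no component other than those listed contributes primers, dNTPs or added MgCl2; the helper name `final_conc` is hypothetical.

```python
# Five components of 5 µl each are assumed to be combined into one reaction.
TOTAL_UL = 5 * 5.0


def final_conc(stock_conc, volume_ul, total_ul=TOTAL_UL):
    """Dilution arithmetic: stock concentration * volume added / total volume."""
    return stock_conc * volume_ul / total_ul


# 1 pmol/µl is the same as 1 µM.
primer_uM = final_conc(1.0, 5.0)
dntp_mM = final_conc(5.0, 1.0)
mgcl2_mM = final_conc(25.0, 1.5)
# 10x buffer is added in both the PCR mix (2 µl) and the enzyme mix (0.5 µl).
buffer_x = final_conc(10.0, 2.0 + 0.5)
# Units of polymerase per reaction: 0.1 µl of a 5 U/µl stock.
taq_units = 5.0 * 0.1

print(f"total volume: {TOTAL_UL} µl")
print(f"each primer:  {primer_uM:.2f} µM")
print(f"dNTPs:        {dntp_mM:.2f} mM")
print(f"added MgCl2:  {mgcl2_mM:.2f} mM")
print(f"PCR buffer:   {buffer_x:.1f}x")
print(f"AmpliTaq:     {taq_units:.1f} U")
```

Under these assumptions the recipe corresponds to a 25 μl reaction with 0.2 μM of each primer, 0.2 mM dNTPs, 1.5 mM added MgCl2, 1× PCR buffer and 0.5 U of polymerase.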
The following PCR profile was used:

TABLE-US-00022
Cycle 1        2′; 94° C.
Cycle 2-34     20″; 94° C.
               30″; 56° C.
               2′30″; 72° C.
Cycle 35       7′; 72° C.
               ∞; 4° C.

[0226] PCR products were cloned into vector pCR2.1 (TA Cloning kit; Invitrogen) using the TA Cloning method and transformed into INVαF′ competent E. coli cells. Transformants were subjected to blue/white screening. Three independent white transformants each for PSP11 and PI-201234 were selected and grown O/N in liquid selective medium for plasmid isolation.

[0227] Plasmids were isolated using the QIAprep Spin Miniprep kit (QIAGEN). Subsequently, the inserts of these plasmids were sequenced according to the protocol below and resolved on the MegaBACE 1000 (Amersham). The sequences obtained were inspected for the presence of the SNP allele. Two independent plasmids containing the PI-201234 insert and one plasmid containing the PSP11 insert contained the expected consensus sequence flanking the SNP. The sequence derived from the PSP11 fragment contained the expected A allele (underlined) and the sequences derived from the PI-201234 fragment contained the expected G allele (double underlined):

TABLE-US-00023
PSP11 (sequence 1): (5′-3′) [SEQ ID NO: 39]
AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAATGTTGGTTTTGGTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTGT
PI-201234 (sequence 1): (5′-3′) [SEQ ID NO: 40]
AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAGTGTTGGTTTTGGTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTG
PI-201234 (sequence 2): (5′-3′) [SEQ ID NO: 41]
AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAGTGTTGGTTTTGGTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTG
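
The allele difference between the cloned sequences can be located programmatically. The sketch below is a plain position-by-position comparison of SEQ ID NO: 39 and SEQ ID NO: 40 as listed above, with the line-wrap space removed; the function name `mismatches` is hypothetical, and the comparison assumes the two sequences are already aligned from their first base, which holds here because both start with primer 1.2f.

```python
PSP11 = (
    "AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAATGTTGGTTTTGG"
    "TGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTGT"
)
PI_201234 = (
    "AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAGTGTTGGTTTTGG"
    "TGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTG"
)


def mismatches(seq_a, seq_b):
    """List (1-based position, base in seq_a, base in seq_b) for differing bases.

    Only the overlapping length is compared; zip stops at the shorter sequence.
    """
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


for position, allele_psp11, allele_pi in mismatches(PSP11, PI_201234):
    print(position, allele_psp11, allele_pi)
```

Over the overlapping length the two sequences differ at a single position, the A/G SNP, which lies directly 3′ of the ...CCTAGAACA stretch shared by both alleles.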

[0228] This result indicates that the putative pepper A/G SNP represents a true genetic polymorphism detectable using the designed STS assay.

Example 6: SNP Validation by SNPWAVE (Multiplexed SNP Genotyping) Detection

[0229] In order to validate the putative A/G SNP identified in example 1, SNPWAVE ligation probe sets were defined for both alleles of this SNP using the consensus sequence. Sequences of the ligation probes were as follows:

TABLE-US-00024
SNPWAVE probe sequences (5′-3′):
06A162 [SEQ ID NO: 42]
GATGAGTCCTGAGTAACCCAATCGATTTCAAACCTAGAACAA (42 bases)
06A163 [SEQ ID NO: 43]
GATGAGTCCTGAGTAACCACCAATCGATTTCAAACCTAGAACAG (44 bases)
06A164 [SEQ ID NO: 44]
Phosphate-TGTTGGTTTTGGTGCTAACTTCAACCAACATCTGGAATTGGTACGCAGTC (52 bases)

[0230] Note that the allele-specific probes 06A162 and 06A163 for the A and G alleles, respectively, differ by 2 bases in size, such that upon ligation to the common locus-specific probe 06A164, ligation product sizes of 94 (42+52) and 96 (44+52) bases result.

[0231] SNPWAVE ligation and PCR reactions were carried out as described by Van Eijk and co-workers (M. J. T. van Eijk, J. L. N. Broekhof, H. J. A. van der Poel, R. C. J. Hogers, H. Schneiders, J. Kamerbeek, E. Verstege, J. W. van Aart, H. Geerlings, J. B. Buntjer, A. J. van Oeveren, and P. Vos. (2004). Nucleic Acids Research 32: e47), using 100 ng genomic DNA of pepper lines PSP11 and PI201234 and 8 RIL offspring as starting material. Sequences of the PCR primers were:

TABLE-US-00025
93L01FAM (E00k): [SEQ ID NO: 45]
5′-GACTGCGTACCAATTC-3′
93E40 (M00k): [SEQ ID NO: 46]
5′-GATGAGTCCTGAGTAA-3′

[0232] Following PCR amplification, PCR product purification and detection on the MegaBACE 1000 were performed as described by van Eijk and co-workers (vide supra). A pseudo-gel image of the amplification products obtained from PSP11, PI201234 and 8 RIL offspring is shown in FIG. 8B.

[0233] The SNPWAVE results demonstrate clearly that the A/G SNP is detected by the SNPWAVE assay, resulting in 92 bp products (=AA homozygous genotype) for P1 (PSP11) and RIL offspring 1, 2, 3, 4, 6 and 7, and in 94 bp products (=GG homozygous genotype) for P2 (PI201234) and RIL offspring 5 and 8.

Example 7: Strategies for Enriching AFLP Fragment Libraries for Low-Copy Sequences

[0234] This example describes several enrichment methods to target low-copy or unique genome sequences in order to increase the yield of elite polymorphisms such as described in Example 4. The methods can be divided into four categories:

[0235] 1) Methods Aimed at Preparing High-Quality Genomic DNA, Excluding Chloroplast Sequences.

[0236] Here it is proposed to prepare nuclear DNA instead of whole genomic DNA as described in Example 4, to exclude co-isolation of abundant chloroplast DNA, which may result in a reduced number of plant genomic DNA sequences, depending on the restriction endonucleases and selective AFLP primers used in the fragment library preparation process. A protocol for isolation of highly pure tomato nuclear DNA has been described by Peterson, D. G., Boehm, K. S. & Stack, S. M. (1997). Isolation of Milligram Quantities of Nuclear DNA From Tomato (Lycopersicon esculentum), A Plant Containing High Levels of Polyphenolic Compounds. Plant Molecular Biology Reporter 15 (2), pages 148-153.

[0237] 2) Methods Aimed at Using Restriction Endonucleases in the AFLP Template Preparation Process which are Expected to Yield Elevated Levels of Low-Copy Sequences.

[0238] Here it is proposed to use certain restriction endonucleases in the AFLP template preparation process, which are expected to target low-copy or unique genome sequences, resulting in fragment libraries enriched for polymorphisms that are more readily converted into genotyping assays. An example of a restriction endonuclease targeting low-copy sequences in plant genomes is PstI. Other methylation-sensitive restriction endonucleases may also target low-copy or unique genome sequences preferentially.

[0239] 3) Methods Aimed at Selectively Removing Highly Duplicated Sequences Based on Re-Annealing Kinetics of Repeat Sequences Versus Low-Copy Sequences.

[0240] Here it is proposed to selectively remove highly duplicated (repeat) sequences from either the total genomic DNA sample or from the (cDNA-)AFLP template material prior to selective amplification.

[0241] 3a) High-Cot DNA preparation is a commonly used technique to enrich slowly annealing low-copy sequences from a complex plant genomic DNA mixture (Yuan et al. 2003; High-Cot sequence analysis of the maize genome. Plant Journal 34: 249-255). It is suggested to take High-Cot DNA instead of total genomic DNA as starting material to enrich for polymorphisms located in low-copy sequences.

[0242] 3b) An alternative to laborious high-Cot preparation may be to incubate denatured and reannealing dsDNA with a novel nuclease from the Kamchatka crab, which cleaves short, perfectly matched DNA duplexes at a higher rate than imperfectly matched DNA duplexes, as described by Zhulidov and co-workers (2004; Simple cDNA normalization using Kamchatka crab duplex-specific nuclease. Nucleic Acids Research 32, e37) and Shagin and co-workers (2002; A novel method for SNP detection using a new duplex-specific nuclease from crab hepatopancreas. Genome Research 12: 1935-1942). Specifically, it is proposed to incubate AFLP restriction/ligation mixtures with this endonuclease to deplete the mixture of highly duplicated sequences, followed by selective AFLP amplification of the remaining low-copy or unique genome sequences.

[0243] 3c) Methyl filtration is a method to enrich for hypomethylated genomic DNA fragments using the restriction endonuclease McrBC which cuts methylated DNA in the sequence [A/G]C, where the C is methylated (see Pablo D. Rabinowicz, Robert Citek, Muhammad A. Budiman, Andrew Nunberg, Joseph A. Bedell, Nathan Lakey, Andrew L. O'Shaughnessy, Lidia U. Nascimento, W. Richard McCombie and Robert A. Martienssen. Differential methylation of genes and repeats in land plants. Genome Research 15:1431-1440, 2005). McrBC may be used to enrich the low-copy sequence fraction of a genome as starting material for polymorphism discovery.

[0244] 4) The Use of cDNA as Opposed to Genomic DNA in Order to Target Gene Sequences.

[0245] Finally, here it is proposed to use oligodT-primed cDNA as opposed to genomic DNA as starting material for polymorphism discovery, optionally in combination with the use of the crab duplex-specific nuclease described in 3b above for normalization. Note that the use of oligodT-primed cDNA also excludes chloroplast sequences. Alternatively, cDNA-AFLP templates instead of oligodT-primed cDNA are used to facilitate amplification of the remaining low-copy sequences in analogy to AFLP (see also 3b above).

Example 8: Strategy for Simple-Sequence Repeat Enrichment

[0246] This example describes the proposed strategy for discovery of simple sequence repeat (SSR) sequences, in analogy to SNP discovery described in Example 4.

[0247] Specifically, restriction-ligation of genomic DNA of two or more samples is performed, e.g. using restriction endonucleases PstI/MseI. Selective AFLP amplification is performed as described in Example 4. Next, fragments containing the selected SSR motifs are enriched by one of two methods:

[0248] 1) Southern blot hybridization onto filters containing oligonucleotides matching the intended SSR motifs (e.g. (CA).sub.15 in case of enrichment for CA/GT repeats), followed by amplification of bound fragments in a similar fashion as described by Armour and co-workers (Armour, J., Sismani, C., Patsalis, P., and Cross, G. (2000). Measurement of locus copy number by hybridization with amplifiable probes. Nucleic Acids Research vol 28, no. 2, pp. 605-609) or by

[0249] 2) enrichment using biotinylated capture oligonucleotide hybridization probes to capture (AFLP) fragments in solution as described by Kijas and co-workers (Kijas, J. M., Fowler, J. C., Garbett, C. A., and Thomas, M. R. (1994). Enrichment of microsatellites from the citrus genome using biotinylated oligonucleotide sequences bound to streptavidin-coated magnetic particles. Biotechniques, vol. 16, pp. 656-662).

[0250] Next, the SSR-motif enriched AFLP fragments are amplified using the same AFLP primers as used in the preamplification step, to generate a sequence library. An aliquot of the amplified fragments is T/A cloned and 96 clones are sequenced to estimate the fraction of positive clones (clones containing the intended SSR motif, e.g. CA/GT motifs longer than 5 repeat units). Another aliquot of the enriched AFLP fragment mixture is detected by polyacrylamide gel electrophoresis (PAGE), optionally after further selective amplification to obtain a readable fingerprint, in order to visually inspect whether SSR-containing fragments are enriched. Following successful completion of these control steps, the sequence libraries are subjected to high-throughput 454 sequencing.
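
The control step of scoring sequenced clones as positive or negative for the intended motif can be sketched with a regular expression. This is an illustration only and is not the MISA tool; the function names are hypothetical, "longer than 5 repeat units" is read here as at least 6 consecutive units, and the motif is also searched as its reverse complement so that a clone sequenced from either strand is counted.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_ssr(seq, motif="CA", min_units=6):
    """True if seq contains at least min_units tandem copies of the motif
    or of its reverse complement (e.g. CA on one strand, TG on the other)."""
    for unit in {motif, reverse_complement(motif)}:
        if re.search(f"(?:{unit}){{{min_units},}}", seq):
            return True
    return False


def fraction_positive(clone_sequences, motif="CA", min_units=6):
    """Fraction of clone sequences that contain the intended SSR motif."""
    if not clone_sequences:
        return 0.0
    hits = sum(has_ssr(s, motif, min_units) for s in clone_sequences)
    return hits / len(clone_sequences)


clones = ["AAT" + "CA" * 8 + "GGT", "ACGTACGTACGT", "GT" * 7, "TT" + "CA" * 5 + "GG"]
print(fraction_positive(clones))
```

Applied to the 96 sequenced clones, such a function gives the fraction of positive clones used to decide whether enrichment succeeded before committing the library to 454 sequencing.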

[0251] The above strategy for de novo SSR discovery is schematically depicted in FIG. 8A, and can be adapted for other sequence motifs by substituting the capture oligonucleotide sequences accordingly.

Example 9. Strategy for Avoiding Mixed Tags

[0252] Mixed tags refers to the observation that besides the expected tagged AFLP primer combination per sample, a low fraction of sequences are observed which contain a sample 1 tag at one end, and a sample 2 tag at the other end (see also the table in Example 4). Schematically, the configuration of sequences containing mixed tags is depicted herein below.

TABLE-US-00026
Schematic representation of the expected sample tag combinations.

             EcoRI tag                              MseI tag
PSP11:       5′-CGTC ------------------------------ ACCA-3′
             3′-GCAG ------------------------------ TGGT-5′
PI-201234:   5′-CAAG ------------------------------ GGCT-3′
             3′-GTTC ------------------------------ CCGA-5′

Schematic representation of the mixed tags.

             EcoRI tag                              MseI tag
             5′-CGTC ------------------------------ GGCT-3′
             3′-GCAG ------------------------------ CCGA-5′
             5′-CAAG ------------------------------ ACCA-3′
             3′-GTTC ------------------------------ TGGT-5′
The observation of mixed tags precludes correct assignment of the sequence to either PSP11 or PI-201234.
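
The assignment problem caused by mixed tags can be made concrete with a small sketch that classifies a read by the tags at its two ends, using the 4-base tags from the schematic above. It is a simplification: it assumes the read is given as the top strand with the EcoRI tag as its first four bases and the MseI tag as its last four bases, whereas real reads also carry primer sequence and may be in either orientation. The names `classify_read`, `ECORI_TAGS` and `MSEI_TAGS` are hypothetical.

```python
# Tags as read 5'->3' on the top strand in the schematic.
ECORI_TAGS = {"CGTC": "PSP11", "CAAG": "PI-201234"}
MSEI_TAGS = {"ACCA": "PSP11", "GGCT": "PI-201234"}


def classify_read(read, tag_len=4):
    """Return the sample name, 'mixed', or 'unassigned' for one read."""
    left = ECORI_TAGS.get(read[:tag_len])
    right = MSEI_TAGS.get(read[-tag_len:])
    if left is None or right is None:
        return "unassigned"  # one or both tags not recognised
    if left != right:
        return "mixed"       # sample 1 tag at one end, sample 2 tag at the other
    return left


insert = "GAATTCACGTTAGCATTAA"  # placeholder fragment sequence
for read in ("CGTC" + insert + "ACCA",
             "CAAG" + insert + "GGCT",
             "CGTC" + insert + "GGCT"):
    print(classify_read(read))
```

Reads classified as mixed cannot be attributed to either sample and would be set aside, which is why the following paragraphs focus on preventing their formation during library preparation.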

[0253] An example of a mixed tag sequence observed in the pepper sequence run described in Example 4 is shown in FIG. 5A. An overview of the configuration of observed fragments containing expected tags and mixed tags is shown in panel 2 of FIG. 5A.

[0254] The proposed molecular explanation for mixed tags is that during the sequence library preparation step, DNA fragments are made blunt by using T4 DNA polymerase or Klenow enzyme to remove 3′ protruding ends, prior to adaptor ligation (Margulies et al., 2005). While this may work well when a single DNA sample is processed, when a mixture of two or more differently tagged DNA samples is used, fill-in by the polymerase results in incorporation of the wrong tag sequence whenever a heteroduplex has been formed between complementary strands derived from different samples (FIG. 5B panel 3 mixed tags). The solution that has been found is to pool samples after the purification step that follows adaptor ligation in the 454 sequence library construction step, as shown in FIG. 5C panel 4.

Example 10. Strategy for Avoiding Mixed Tags and Concatamers Using an Improved Design for 454 Sequence Library Preparation

[0255] Besides the low frequencies of sequence reads containing mixed tags described in Example 9, a low frequency of sequence reads derived from concatenated AFLP fragments has been observed.

[0256] An example of a sequence read derived from a concatamer is shown in FIG. 6A Panel 1. Schematically, the configuration of sequences containing expected tags and concatamers is shown in FIG. 6A Panel 2.

[0257] The proposed molecular explanation for the occurrence of concatenated AFLP fragments is that during the 454 sequence library preparation step, DNA fragments are made blunt using T4 DNA polymerase or Klenow enzyme to remove 3′ protruding ends, prior to adaptor ligation (Margulies et al., 2005). As a result, blunt-end sample DNA fragments are in competition with the adaptors during the ligation step and may be ligated to each other prior to being ligated to adaptors. This phenomenon is in fact independent of whether a single DNA sample or a mixture of multiple (tagged) samples is included in the library preparation step, and may therefore also occur during conventional sequencing as described by Margulies and co-workers. When multiple tagged samples are used as described in Example 4, concatamers complicate correct assignment of sequence reads to samples based on the tag information and are therefore to be avoided.

[0258] The proposed solution to the formation of concatamers (and mixed tags) is to replace blunt-end adaptor ligation with ligation of adaptors containing a 3′ T overhang, in analogy to T/A cloning of PCR products, as shown in FIG. 6B Panel 3. Conveniently, these modified 3′ T overhang-containing adaptors are proposed to contain a C overhang at the opposite 3′ end (which will not be ligated to the sample DNA fragment), to prevent blunt-end concatamer formation of adaptor sequences (see FIG. 6B Panel 3). The resulting adapted workflow in the sequence library construction process when using the modified adaptor approach is shown schematically in FIG. 6C Panel 4.