METHOD FOR HIGH-THROUGHPUT AFLP-BASED POLYMORPHISM DETECTION

20190144938 · 2019-05-16

Abstract

The invention relates to a method for the high throughput discovery, detection and genotyping of one or more genetic markers in one or more samples, comprising the steps of restriction endonuclease digest of DNA, adaptor-ligation, optional pre-amplification, selective amplification, pooling of the amplified products, sequencing the libraries with sufficient redundancy, clustering followed by identification of the genetic markers within the library and/or between libraries and determination of (co-)dominant genotypes of the genetic markers.

Claims

1. A kit for use in a method for detecting genetic variation in one or more members of a population, comprising: (a) an adaptor adapted for ligation to a plurality of nucleic acid fragments, and (b) a set of primers for PCR amplification having a 5'-end and a 3'-end, wherein the primers comprise one or more selective nucleotides at the 3'-end, and wherein at least one of the adaptor and the primers comprises a sample-specific identifier sequence capable of indicating sample origin of an amplification product.

2. The kit according to claim 1, wherein the adaptor comprises the sample-specific identifier sequence.

3. The kit according to claim 1, wherein at least one of the primers comprises the sample-specific identifier sequence.

4. The kit according to claim 1, further comprising one or more restriction enzymes for producing the plurality of nucleic acid fragments.

5. The kit according to claim 4, wherein the restriction enzymes comprise restriction enzymes producing blunt ended restriction fragments.

6. The kit according to claim 4, wherein the restriction enzymes comprise restriction enzymes producing sticky ended restriction fragments.

7. The kit according to claim 6, wherein the restriction enzymes comprise EcoRI and/or MseI.

8. The kit according to claim 1, wherein the adaptor comprises a 3'-T overhang.

9. The kit according to claim 1, wherein the adaptor comprises a PCR primer-binding sequence for hybridizing to a PCR primer.

10. The kit according to claim 1, wherein the adaptor comprises a sequencing primer-binding sequence for hybridizing to a sequencing primer.

11. The kit according to claim 1, wherein the adaptor comprises sequences complementary to sequences attached to a solid support for annealing the adaptor to a solid support.

12. The kit according to claim 1, wherein the adaptor comprises sequences complementary to sequences attached to a bead for annealing the adaptor to a bead.

13. The kit according to claim 1, wherein the adapter is composed of two synthetic oligonucleotides which have nucleotide sequences which are partially complementary to each other.

14. The kit according to claim 1, wherein at least one of the primers is phosphorylated.

15. The kit according to claim 1, wherein at least one of the primers comprises sequences at its 3'-end for hybridizing to a subset of nucleic acid sample fragments.

16. The kit according to claim 1, wherein the kit further comprises a biotinylated capture oligonucleotide hybridization probe for capturing a subset of nucleic acid sample fragments.

17. The kit according to claim 1, wherein the kit further comprises a polymerase for a polymerase chain reaction (PCR).

18. The kit according to claim 17, wherein the polymerase substantially lacks 3'-5' exonuclease activity.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0094] FIG. 1A shows a fragment according to the present invention annealed onto a bead (454 bead) and the sequence of the primers used for pre-amplification of the two pepper lines. DNA fragment denotes the fragment obtained after digestion with a restriction endonuclease, keygene adaptor denotes an adaptor providing an annealing site for the (phosphorylated) oligonucleotide primers (SEQ ID NOS 1-4, respectively, in order of appearance) used to generate a library, KRS denotes an identifier sequence (tag), 454 SEQ. Adaptor denotes a sequencing adaptor, and 454 PCR adaptor denotes an adaptor to allow for emulsion amplification of the DNA fragment. The PCR adaptor allows for annealing to the bead and for amplification and may contain a 3'-T overhang.

[0095] FIG. 1B shows a schematic primer used in the complexity reduction step. Such a primer generally comprises a recognition site region indicated as (2), a constant region that may include a tag section indicated as (1), and one or more selective nucleotides in a selective region indicated as (3) at the 3'-end thereof.

[0096] FIGS. 2A and 2B show DNA concentration estimation using 2% agarose gel-electrophoresis. S1 denotes PSP11; S2 denotes PI201234. 50, 100, 250 and 500 ng denote the 50 ng, 100 ng, 250 ng and 500 ng reference amounts used to estimate the DNA amounts of S1 and S2. FIGS. 2C and 2D show DNA concentration determination using Nanodrop spectrophotometry.

[0097] FIGS. 3A and 3B show the results of intermediate quality assessments of example 3. FIG. 3C shows DNA concentrations of each sample noted using Nanodrop.

[0098] FIG. 4A shows flow charts of the sequence data processing pipeline, i.e. the steps taken from the generation of the sequencing data to the identification of putative SNPs, SSRs and indels, via steps of the removal of known sequence information in Trimming & Tagging resulting in trimmed sequence data which are clustered and assembled to yield contigs and singletons (fragments that cannot be assembled in a contig) after which putative polymorphisms can be identified and assessed. FIG. 4B further elaborates on the process of polymorphisms mining.

[0099] FIG. 5: Multiple alignment 10037_CL989contig2 of pepper AFLP fragment sequences (SEQ ID NOS 39-43, respectively, in order of appearance), containing a putative single nucleotide polymorphism (SNP). Note that the SNP (indicated by the black arrow) is defined by an A allele present in both reads of sample 1 (PSP11), denoted by the presence of the MS1 tag in the name of the top two reads, and a G allele present in sample 2 (PI201234), denoted by the presence of the MS2 tag in the name of the bottom two reads. Read names are shown on the left. The consensus sequence of this multiple alignment is (5'-3'):

TABLE-US-00001 (SEQ ID NO: 38) TAACACGACTTTGAACAAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACA[A/G]TGTTGGTTTTGGTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTG.

[0100] FIG. 6: Graphic representation of the probability of correct classification of the genotype based on the number of observed reads per locus.

EXAMPLES

[0101] The method is exemplified as follows:

[0102] 1) AFLP templates are prepared according to a modified protocol of Vos et al. which involves a heat-denaturation step for 20 min at 80° C. between the restriction and ligation steps. After incubation for 20 min at 80° C., the restriction enzyme digest is cooled to room temperature and DNA ligase is added. The denaturation step leads to dissociation of the complementary strands of restriction fragments up to 120 bp such that no adaptors will be ligated to the ends. As a result, fragments smaller than 120 bp will not be amplified, hence size selection is achieved.

[0103] 2) Pre-amplification reactions, if applicable, are performed as in conventional AFLP.

[0104] 3) The last (selective) amplification step is performed using AFLP primers with unique identifier tags for every sample in the population/experiment (using a unique 4 bp identifier sequence; KIS). The identifier sequences are located at the 5'-end of the selective AFLP primers. One additional selective nucleotide will be used in comparison with the number of selective bases used in conventional AFLP detection by electrophoresis, e.g. +4/+3 for an EcoRI/MseI fingerprint in pepper (gel detection +3/+3) and +4/+4 for an EcoRI/MseI fingerprint in maize (gel detection +4/+3). The number of selective nucleotides to be applied needs to be determined empirically; it may be that the same number of selective nucleotides can be applied as used for gel detection. This number further depends on the number of samples included in the experiment, since the number of sequence traces is assumed to be fixed at 200,000 at the current status of sequencing technology, although this may and probably will increase. The preferred starting point is to achieve 10-fold sampling of AFLP fragments per sample library.

[0105] 4) The collection of samples prepared according to steps 1-4 is subjected to sequencing via 454 Life Sciences technology. This means that individual AFLP fragments are cloned on beads, PCR amplified and sequenced. An output of 200,000 sequences of 100 bp length is expected. For a collection of 100 samples, this equals an average of 2000 sequence traces/sample, traceable to the sample number via the 5' tag.

[0106] 5) Assuming the amplification of 100 AFLP fragments per primer combination (PC) when 1 additional selective nucleotide is used compared to the number used with gel detection, of which 90 percent are constant bands, the AFLP fragments are sampled with 20-fold average redundancy per fragment. However, since sequencing is non-directional and most bands are >200 bp, sequencing redundancy will be slightly over 10-fold for each fragment end.

[0107] 6) All sequences are clustered per sample using the KRS tag. Given a 10-fold over sampling, this means that 200 different sequence traces are expected per sample, representing 200 × 100 bp = 20 kb sequence/sample. When 10 percent of these sequences are derived from AFLP markers (i.e. 1 allele is amplified and the other is absent in the PCR reaction), 90 percent (18 kb) of the sequences are derived from constant bands.

[0108] 7) Two types of genetic markers are scored:

[0109] A) AFLP markers: these are sequences which are observed in some samples, but absent in others.

[0110] Inspection of the frequency of sequences in the collection of samples will reveal this category. Dominant scoring is performed depending on the presence/absence observation of these sequences in every sample. Reliable scoring of AFLP markers requires a statistical threshold to be set regarding the frequency with which other AFLP sequences are observed in the experiment. That is, an AFLP marker can be scored as present (dominant) if the AFLP marker sequence is observed in the sample, but the reliability of the absent score depends on the (average) frequency of (constant) AFLP fragments. Statistical threshold levels are required such that presence/absence scoring is performed with preferably at least 99.5% accuracy, depending on the acceptable level needed for the specific application. If a segregating population and its parents are analysed, these markers can possibly be scored co-dominantly as well by defining frequency categories of the marker sequences. The latter may be complicated by sampling variation of the AFLP marker, which differs between samples.

[0111] B) (SN) polymorphisms in constant AFLP fragments.

[0112] This is the most interesting (and abundant) category of genetic markers. The essence is that SNP markers contained in the internal sequences of constant AFLP fragments are scored as co-dominant SNP markers. Again, this preferably requires applying a statistical threshold level for accurate calling of the presence or absence of an allele. A 10-fold sequencing redundancy of the fragment library is expected to be sufficient, but a statistical analysis method is needed to determine the accuracy of the SNP marker genotypes depending on the number of times each allele sequence is observed. The rationale is that when a constant band contains a SNP and one allele is observed e.g. 5 times while (the sequence containing) the other allele is not observed, it is highly likely that the sample is homozygous for the observed allele. Conversely, when both alleles are observed, the sample is scored heterozygous for the SNP marker, irrespective of their frequencies.

[0113] 8) The result will be a genotyping table containing the genotypes of (co-)dominantly scored AFLP markers and co-dominantly scored SNPs, along with probabilities for correctness of the genotypes for all markers. Alternatively, a dataset is generated which contains genotypes which have surpassed the set statistical threshold level.
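The co-dominant scoring rule described under B) can be sketched in a few lines of Python. This is a minimal illustration only and not the statistical method the text calls for: the function name `call_snp_genotype`, the `min_reads` parameter and the "no_call" outcome are assumptions introduced here, with the default of 5 reads taken from the "e.g. 5 times" in the text.

```python
def call_snp_genotype(count_allele_1, count_allele_2, min_reads=5):
    """Score a SNP in a constant AFLP fragment from per-sample read counts.

    Both alleles observed         -> heterozygous, irrespective of frequency.
    Only one allele, >= min_reads -> homozygous for the observed allele.
    Otherwise                     -> no call (too few reads to be confident).
    min_reads is a hypothetical threshold, not a value fixed by the text.
    """
    if count_allele_1 > 0 and count_allele_2 > 0:
        return "heterozygous"
    if count_allele_1 >= min_reads:
        return "homozygous_allele_1"
    if count_allele_2 >= min_reads:
        return "homozygous_allele_2"
    return "no_call"


if __name__ == "__main__":
    for counts in [(5, 0), (3, 4), (0, 9), (2, 0)]:
        print(counts, call_snp_genotype(*counts))
```

In practice the fixed threshold would be replaced by a probability of correctness per genotype, as the genotyping table in step 8 requires.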

[0114] The approach assumes 10-fold over sampling of AFLP fragments per sample, yielding 18 kb of constant sequence/sample and 2 kb of AFLP marker sequences.

[0115] The numbers of genetic markers observed depends on the SNP rate in the germplasm investigated. Below, estimates of the numbers of genetic markers are provided at different germplasm SNP rates, when sampling 20 kb sequence. The average length of AFLP markers/fragments is assumed to be 200 bp:

TABLE-US-00002 TABLE 1 Expected numbers of genetic markers scored by sequencing AFLP fragments using 454 Life Sciences technology, assuming 10-fold over sampling, 200,000 sequence traces, and 90 percent constant bands/10 percent AFLP markers, at various SNP rates.

SNP rate     AFLP markers (2 kb)   SNPs in constant bands (18 kb)*
1/250 bp     8                     72
1/1000 bp    2                     18
1/2000 bp    1                     9
1/5000 bp    0.4                   3.6

*As the AFLP fragments may be sequenced from both ends, a proportion of the observed SNPs can be derived from the same loci.

[0116] It is important to note that the numbers provided in table 1 are averages, which may differ between combinations of different primers. Analogous to conventional AFLP typing, identification of top primer combinations (PC) may yield higher numbers of markers per PC. In addition, the numbers presented in Table 1 may change depending on the required level of over sampling needed in order to reach the required accuracy level.
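The arithmetic behind steps 4 to 6 and Table 1 can be reproduced with a short Python sketch. It assumes the figures stated in the text (200,000 traces, 100 samples, 100 fragments per primer combination, 100 bp reads, 90 percent constant bands); the function and key names are hypothetical and chosen only for illustration.

```python
def sampling_budget(total_reads=200_000, n_samples=100, fragments_per_pc=100,
                    read_length_bp=100, constant_percent=90):
    """Reproduce the read-budget arithmetic of steps 4) to 6)."""
    reads_per_sample = total_reads // n_samples                  # 2000 traces
    redundancy_per_fragment = reads_per_sample // fragments_per_pc  # 20-fold
    # Reads start from either fragment end, so each end gets about half.
    redundancy_per_end = redundancy_per_fragment // 2            # 10-fold
    distinct_traces = reads_per_sample // redundancy_per_end     # 200 traces
    total_bp = distinct_traces * read_length_bp                  # 20 kb
    constant_bp = total_bp * constant_percent // 100             # 18 kb
    return {
        "reads_per_sample": reads_per_sample,
        "redundancy_per_fragment_end": redundancy_per_end,
        "distinct_traces": distinct_traces,
        "constant_bp": constant_bp,
        "marker_bp": total_bp - constant_bp,                     # 2 kb
    }


def expected_markers(sequence_bp, bp_per_snp):
    """Expected marker count for a given amount of sequence and SNP rate."""
    return sequence_bp / bp_per_snp


if __name__ == "__main__":
    budget = sampling_budget()
    print(budget)
    for bp_per_snp in (250, 1000, 2000, 5000):
        print(bp_per_snp,
              expected_markers(budget["marker_bp"], bp_per_snp),
              expected_markers(budget["constant_bp"], bp_per_snp))
```

Changing `n_samples` or `total_reads` shows directly how the achievable over sampling, and hence the marker counts in Table 1, scale with the experiment size.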

[0117] The calculation of the correct classification of the genotype is as follows:


$$P(\text{correct}) = P(aa) + P(AA) + P(Aa) \cdot \left[1 - 0.5^{\,n-1}\right]$$

[0118] Here, P(aa) is the fraction of the population with genotype aa (set at 0.25 in FIG. 6), P(AA) is the fraction of the population with genotype AA (set at 0.25), and P(Aa) is the fraction of the population with genotype Aa (set at 0.5 in FIG. 6 and the table below). n equals the number of reads observed per locus.

TABLE-US-00003 TABLE

n     P
1     0.5
2     0.75
3     0.875
4     0.9375
5     0.96875
6     0.984375
7     0.992188
8     0.996094
9     0.998047
10    0.999023
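As a check on the formula, the following Python sketch recomputes the tabulated probabilities. It reads the bracketed term as 1 minus 0.5 raised to the power (n - 1), an interpretation that reproduces the table values; the function name `p_correct` and its keyword arguments are illustrative assumptions.

```python
def p_correct(n, p_aa=0.25, p_AA=0.25, p_Aa=0.5):
    """Probability of correct genotype classification after n observations.

    Homozygotes are always classified correctly; a heterozygote is
    misclassified only when all n observations show the same allele,
    which happens with probability 0.5 ** (n - 1).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return p_aa + p_AA + p_Aa * (1 - 0.5 ** (n - 1))


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, p_correct(n))
```

With the default genotype fractions, the 99.5% accuracy level mentioned for marker scoring is first exceeded at n = 8.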

Example 1 Pepper

[0119] DNA from the pepper lines PSP-11 and PI201234 was used to generate AFLP product by use of AFLP Keygene Recognition Site specific primers. (These AFLP primers are essentially the same as conventional AFLP primers, e.g. described in EP 0 534 858, and will generally contain a recognition site region, a constant region and one or more selective nucleotides in a selective region.)

[0120] From the pepper lines PSP-11 or PI201234, 150 ng of DNA was digested with the restriction endonucleases EcoRI (5 U/reaction) and MseI (2 U/reaction) for 1 hour at 37° C., followed by inactivation for 10 minutes at 80° C. The obtained restriction fragments were ligated with double-stranded synthetic oligonucleotide adapters, one end of which is compatible with one or both of the ends of the EcoRI and/or MseI restriction fragments. The restriction-ligation mixture was diluted 10 times and 5 microliters of each sample was pre-amplified (2) with EcoRI+1(A) and MseI+1(C) primers (set I). After amplification, the quality of the pre-amplification product of the two pepper samples was checked on a 1% agarose gel. The pre-amplification products were diluted 20 times, followed by a KRSEcoRI+1(A) and KRSMseI+2(CA) AFLP pre-amplification. The KRS (identifier) sections are underlined and the selective nucleotides are in bold at the 3'-end in the primer sequences SEQ ID 1-4 below. After amplification, the quality of the pre-amplification product of the two pepper samples was checked on a 1% agarose gel and by an EcoRI+3(A) and MseI+3(C) (3) AFLP fingerprint (4). The pre-amplification products of the two pepper lines were separately purified on a Qiagen PCR column (5). The concentration of the samples was measured on a NanoDrop ND-1000 Spectrophotometer. A total of 5 micrograms of PSP-11 and 5 micrograms of PI201234 PCR products were mixed and sequenced.

[0121] Primer Set I Used for Preamplification of PSP-11

TABLE-US-00004
[SEQ ID 1] E01LKRS1 5'-CGTCAGACTGCGTACCAATTCA-3'
[SEQ ID 2] M15KKRS1 5'-TGGTGATGAGTCCTGAGTAACA-3'

[0122] Primer Set II Used for Preamplification of PI201234

TABLE-US-00005
[SEQ ID 3] E01LKRS2 5'-CAAGAGACTGCGTACCAATTCA-3'
[SEQ ID 4] M15KKRS2 5'-AGCCGATGAGTCCTGAGTAACA-3'

[0123] (1) EcoRI/MseI Restriction Ligation Mixture

[0124] Restriction Mix (40 µl/Sample)

TABLE-US-00006
DNA              6 µl (300 ng)
EcoRI (5 U)      0.1 µl
MseI (2 U)       0.05 µl
5xRL             8 µl
MQ               25.85 µl
Total            40 µl

Incubation for 1 h at 37° C.

[0125] Addition of:

[0126] Ligation Mix (10 µl/Sample)

TABLE-US-00007
10 mM ATP                     1 µl
T4 DNA ligase                 1 µl
EcoRI adapt. (5 pmol/µl)      1 µl
MseI adapt. (50 pmol/µl)      1 µl
5xRL                          2 µl
MQ                            4 µl
Total                         10 µl

Incubation for 3 h at 37° C.

[0127] EcoRI-Adaptor

TABLE-US-00008 91M35/91M36:
[SEQ ID 5] *-CTCGTAGACTGCGTACC : 91M35
[SEQ ID 6] bio CATCTGACGCATGGTTAA : 91M36

[0128] MseI-Adaptor

TABLE-US-00009 92A18/92A19:
[SEQ ID 7] 5'-GACGATGAGTCCTGAG-3' : 92A18
[SEQ ID 8] 3'-TACTCAGGACTCAT-5' : 92A19

[0129] (2) Pre-Amplification

[0130] Preamplification (A/C):

TABLE-US-00010
RL-mix (10x)                   5 µl
EcoRI-pr E01L (50 ng/µl)       0.6 µl
MseI-pr M02K (50 ng/µl)        0.6 µl
dNTPs (25 mM)                  0.16 µl
Taq.pol. (5 U)                 0.08 µl
10XPCR                         2.0 µl
MQ                             11.56 µl
Total                          20 µl/reaction

[0131] Pre-Amplification Thermal Profile

[0132] Selective pre-amplification was done in a reaction volume of 50 µl. The PCR was performed in a PE GeneAmp PCR System 9700 and a 20 cycle profile was started with a 94° C. denaturation step for 30 seconds, followed by an annealing step of 56° C. for 60 seconds and an extension step of 72° C. for 60 seconds.

TABLE-US-00011
EcoRI+1(A)
[SEQ ID 9] E01L 92R11: 5'-AGACTGCGTACCAATTCA-3'
MseI+1(C)
[SEQ ID 10] M02K 93E42: 5'-GATGAGTCCTGAGTAAC-3'

[0133] Preamplification A/CA:

TABLE-US-00012
PA+1/+1-mix (20x):     5 µl
EcoRI-pr:              1.5 µl
MseI-pr.:              1.5 µl
dNTPs (25 mM):         0.4 µl
Taq.pol. (5 U):        0.2 µl
10XPCR:                5 µl
MQ:                    36.3 µl
Total:                 50 µl

[0134] Selective pre-amplification was done in a reaction volume of 50 µl. The PCR was performed in a PE GeneAmp PCR System 9700 and a 30 cycle profile was started with a 94° C. denaturation step for 30 seconds, followed by an annealing step of 56° C. for 60 seconds and an extension step of 72° C. for 60 seconds.

[0135] (3) KRSEcoRI+1(A) and KRSMseI+2(CA)

TABLE-US-00013
[SEQ ID 11] 05F212 E01LKRS1 5'-CGTCAGACTGCGTACCAATTCA-3'
[SEQ ID 12] 05F213 E01LKRS2 5'-CAAGAGACTGCGTACCAATTCA-3'
[SEQ ID 13] 05F214 M15KKRS1 5'-TGGTGATGAGTCCTGAGTAACA-3'
[SEQ ID 14] 05F215 M15KKRS2 5'-AGCCGATGAGTCCTGAGTAACA-3'

[0136] Selective Nucleotides in Bold and Tags (KRS) Underlined

TABLE-US-00014
Sample PSP11       E01LKRS1/M15KKRS1
Sample PI201234    E01LKRS2/M15KKRS2

[0137] (4) AFLP Protocol

[0138] Selective amplification was done in a reaction volume of 20 µl. The PCR was performed in a PE GeneAmp PCR System 9700. A 13 cycle profile was started with a 94° C. denaturation step for 30 seconds, followed by an annealing step of 65° C. for 30 seconds, with a touchdown phase in which the annealing temperature was lowered by 0.7° C. in each cycle, and an extension step of 72° C. for 60 seconds. This profile was followed by a 23 cycle profile with a 94° C. denaturation step for 30 seconds, followed by an annealing step of 56° C. for 30 seconds and an extension step of 72° C. for 60 seconds.

TABLE-US-00015 EcoRI+3(AAC) and MseI+3(CAG)
[SEQ ID 15] E32 92S02: 5'-GACTGCGTACCAATTCAAC-3'
[SEQ ID 16] M49 92G23: 5'-GATGAGTCCTGAGTAACAG-3'
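The touchdown profile of the AFLP protocol can be written out cycle by cycle with a small Python sketch. It assumes that the first touchdown cycle anneals at 65° C. and that the 0.7° C. decrement is applied after each cycle; the text does not state this explicitly, and the function name and parameters are hypothetical.

```python
def annealing_temperatures(start_c=65.0, step_c=0.7, touchdown_cycles=13,
                           final_c=56.0, final_cycles=23):
    """Annealing temperature for every cycle of the selective amplification.

    Assumption: cycle 1 anneals at start_c and each following touchdown
    cycle is step_c lower; the remaining cycles use a constant final_c.
    """
    touchdown = [round(start_c - step_c * i, 1) for i in range(touchdown_cycles)]
    return touchdown + [final_c] * final_cycles


if __name__ == "__main__":
    for cycle, temp in enumerate(annealing_temperatures(), start=1):
        print(f"cycle {cycle:2d}: anneal at {temp:.1f} C")
```

Under this reading the last touchdown cycle anneals at 56.6° C., one step above the constant 56° C. used for the remaining 23 cycles.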

[0139] (5) Qiagen Column

[0140] The AFLP product was purified by using the QIAquick PCR Purification Kit (QIAGEN) following the QIAquick Spin Handbook 07/2002 page 18 and the concentration was measured with a NanoDrop ND-1000 Spectrophotometer. A total of 5 µg of +1/+2 PSP-11 AFLP product and 5 µg of +1/+2 PI201234 AFLP product was combined and dissolved in 23.3 µl TE. Finally, a mixture with a concentration of 430 ng/µl of +1/+2 AFLP product was obtained.

[0141] Sequence Library Preparation and High-Throughput Sequencing

[0142] Mixed amplification products from both pepper lines were subjected to high-throughput sequencing using 454 Life Sciences sequencing technology as described by Margulies et al., (Margulies et al., Nature 437, pp. 376-380 and Online Supplements). Specifically, the AFLP PCR products were first end-polished and subsequently ligated to adaptors to facilitate emulsion-PCR amplification and subsequent fragment sequencing as described by Margulies and co-workers. 454 adaptor sequences, emulsion PCR primers, sequence-primers and sequence run conditions were all as described by Margulies and co-workers. The linear order of functional elements in an emulsion-PCR fragment amplified on Sepharose beads in the 454 sequencing process was as follows as exemplified in FIG. 1A:

[0143] 454 PCR adaptor-454 sequence adaptor-4 bp AFLP primer tag 1-AFLP primer sequence 1 including selective nucleotide(s)-AFLP fragment internal sequence-AFLP primer sequence 2 including selective nucleotide(s), 4 bp AFLP primers tag 2-454 sequence adaptor-454 PCR adaptor-Sepharose bead

[0144] Two high-throughput 454 sequence runs were performed by 454 Life Sciences (Branford, Conn.; United States of America).

[0145] 454 Sequence Run Data-Processing.

[0146] Sequence data resulting from one 454 sequence run were processed using a bio-informatics pipeline (Keygene N.V.). Specifically, raw 454 basecalled sequence reads were converted to FASTA format and inspected for the presence of tagged AFLP adaptor sequences using a BLAST algorithm. Upon high-confidence matches to the known tagged AFLP primer sequences, sequences were trimmed, restriction endonuclease sites were restored, and the reads were assigned the appropriate tags (sample 1 EcoRI (ES1), sample 1 MseI (MS1), sample 2 EcoRI (ES2) or sample 2 MseI (MS2), respectively). Next, all trimmed sequences larger than 33 bases were clustered using a megaBLAST procedure based on overall sequence homologies. Next, clusters were assembled into one or more contigs and/or singletons per cluster, using a CAP3 multiple alignment algorithm. Contigs containing more than one sequence were inspected for sequence mismatches, representing putative polymorphisms. Sequence mismatches were assigned quality scores based on the following criteria:

[0147] the numbers of reads in a contig

[0148] the observed allele distribution

[0149] The above two criteria form the basis for the so called Q score assigned to each putative SNP/indel. Q scores range from 0 to 1; a Q score of 0.3 can only be reached in case both alleles are observed at least twice.

[0150] location in homopolymers of a certain length (adjustable; default setting to avoid polymorphism located in homopolymers of 3 bases or longer).

[0151] number of contigs in cluster.

[0152] distance to nearest neighboring sequence mismatches (adjustable; important for certain types of genotyping assays probing flanking sequences)

[0153] the level of association of observed alleles with sample 1 or sample 2; in case of a consistent, perfect association between the alleles of a putative polymorphism and samples 1 and 2, the polymorphism (SNP) is indicated as an elite putative polymorphism (SNP). An elite polymorphism is thought to have a high probability of being located in a unique or low-copy genome sequence in case two homozygous lines have been used in the discovery process. Conversely, a weak association of a polymorphism with sample origin bears a high risk of having discovered false polymorphisms arising from alignment of non-allelic sequences in a contig.
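The "elite" criterion, a consistent and perfect association between alleles and samples, can be expressed as a small Python check. This is a sketch under the assumption that each read in a contig has already been reduced to a (sample, allele) pair at the polymorphic position; the function name `is_elite` and this input format are hypothetical.

```python
def is_elite(observations):
    """Return True if every sample shows exactly one allele and the two
    samples show different alleles (perfect allele/sample association).

    observations: iterable of (sample_label, allele) pairs, one per read.
    """
    alleles_by_sample = {}
    for sample, allele in observations:
        alleles_by_sample.setdefault(sample, set()).add(allele)
    if len(alleles_by_sample) != 2:
        return False  # the polymorphism must be seen in both samples
    first, second = alleles_by_sample.values()
    return len(first) == 1 and len(second) == 1 and first != second


if __name__ == "__main__":
    # Pattern as described for FIG. 5: two sample 1 reads with A,
    # two sample 2 reads with G.
    reads = [("sample1", "A"), ("sample1", "A"),
             ("sample2", "G"), ("sample2", "G")]
    print(is_elite(reads))
```

A polymorphism failing this check is not necessarily false, but, as noted above, a weak association raises the risk that non-allelic sequences were aligned in one contig.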

[0154] Sequences containing SSR motifs were identified using the MISA search tool (MIcroSAtellite identification tool; available from http://pgrc.ipk-gatersleben.de/misa/). Overall statistics of the run are shown in the Table below.

TABLE-US-00016 TABLE Overall statistics of a 454 sequence run for SNP discovery in pepper.

Enzyme combination                                        Run
Trimming
  All reads                                               254308
  Fault                                                   5293 (2%)
  Correct                                                 249015 (98%)
  Concatamers                                             2156 (8.5%)
  Mixed tags                                              1120 (0.4%)
Correct reads
  Trimmed one end                                         240817 (97%)
  Trimmed both ends                                       8198 (3%)
  Number of reads sample 1                                136990 (55%)
  Number of reads sample 2                                112025 (45%)
Clustering
  Number of contigs                                       21918
  Reads in contigs                                        190861
  Average number reads per contig                         8.7
SNP mining
  SNPs with Q score ≥ 0.3*                                1483
  Indel with Q score ≥ 0.3*                               3300
SSR mining
  Total number of SSR motifs identified                   359
  Number of reads containing one or more SSR motifs       353
  Number of SSR motif with unit size 1 (homopolymer)      0
  Number of SSR motif with unit size 2                    102
  Number of SSR motif with unit size 3                    240
  Number of SSR motif with unit size 4                    17

*SNP/indel mining criteria were as follows: no neighbouring polymorphisms with Q score larger than 0.1 within 12 bases on either side, and not present in homopolymers of 3 or more bases. Mining criteria did not take into account consistent association with sample 1 and 2, i.e. the SNPs and indels are not necessarily elite putative SNPs/indels.

[0155] An example of a multiple alignment containing an elite putative single nucleotide polymorphism is shown in FIG. 5.

Example 2: Maize

[0156] DNA from the Maize lines B73 and M017 was used to generate AFLP product by use of AFLP Keygene Recognition Site specific primers. (These AFLP primers are essentially the same as conventional AFLP primers, e.g. described in EP 0 534 858, and will generally contain a recognition site region, a constant region and one or more selective nucleotides at the 3-end thereof).

[0157] DNA from the maize lines B73 or M017 was digested with the restriction endonucleases TaqI (5 U/reaction) for 1 hour at 65° C. and MseI (2 U/reaction) for 1 hour at 37° C., followed by inactivation for 10 minutes at 80° C. The obtained restriction fragments were ligated with double-stranded synthetic oligonucleotide adapters, one end of which is compatible with one or both of the ends of the TaqI and/or MseI restriction fragments.

[0158] AFLP preamplification reactions (20 µl/reaction) with +1/+1 AFLP primers were performed on 10 times diluted restriction-ligation mixture. PCR profile: 20*(30 s at 94° C. + 60 s at 56° C. + 120 s at 72° C.). Additional AFLP reactions (50 µl/reaction) with different +2 TaqI and MseI AFLP Keygene Recognition Site primers (Table below; tags are in bold, selective nucleotides are underlined) were performed on 20 times diluted +1/+1 TaqI/MseI AFLP preamplification product. PCR profile: 30*(30 s at 94° C. + 60 s at 56° C. + 120 s at 72° C.). The AFLP product was purified by using the QIAquick PCR Purification Kit (QIAGEN) following the QIAquick Spin Handbook 07/2002 page 18 and the concentration was measured with a NanoDrop ND-1000 Spectrophotometer. A total of 1.25 µg of each different B73 +2/+2 AFLP product and 1.25 µg of each different M017 +2/+2 AFLP product was combined and dissolved in 30 µl TE. Finally, a mixture with a concentration of 333 ng/µl of +2/+2 AFLP product was obtained.

TABLE-US-00017 TABLE

SEQ ID        PCR primer   Primer sequence           Maize   AFLP reaction
[SEQ ID 17]   05G360       ACGTGTAGACTGCGTACCGAAA    B73     1
[SEQ ID 18]   05G368       ACGTGATGAGTCCTGAGTAACA    B73     1
[SEQ ID 19]   05G362       CGTAGTAGACTGCGTACCGAAC    B73     2
[SEQ ID 20]   05G370       CGTAGATGAGTCCTGAGTAACA    B73     2
[SEQ ID 21]   05G364       GTACGTAGACTGCGTACCGAAG    B73     3
[SEQ ID 22]   05G372       GTACGATGAGTCCTGAGTAACA    B73     3
[SEQ ID 23]   05G366       TACGGTAGACTGCGTACCGAAT    B73     4
[SEQ ID 24]   05G374       TACGGATGAGTCCTGAGTAACA    B73     4
[SEQ ID 25]   05G361       AGTCGTAGACTGCGTACCGAAA    M017    5
[SEQ ID 26]   05G369       AGTCGATGAGTCCTGAGTAACA    M017    5
[SEQ ID 27]   05G363       CATGGTAGACTGCGTACCGAAC    M017    6
[SEQ ID 28]   05G371       CATGGATGAGTCCTGAGTAACA    M017    6
[SEQ ID 29]   05G365       GAGCGTAGACTGCGTACCGAAG    M017    7
[SEQ ID 30]   05G373       GAGCGATGAGTCCTGAGTAACA    M017    7
[SEQ ID 31]   05G367       TGATGTAGACTGCGTACCGAAT    M017    8
[SEQ ID 32]   05G375       TGATGATGAGTCCTGAGTAACA    M017    8

[0159] Finally, the 4 P1-samples and the 4 P2-samples were pooled and concentrated. A total amount of 25 µl of DNA product and a final concentration of 400 ng/µl (total of 10 µg) was obtained. Intermediate quality assessments are given in FIGS. 3A, 3B and 3C.

Sequencing by 454

[0160] Pepper and maize AFLP fragment samples prepared as described hereinbefore were processed by 454 Life Sciences as described (Margulies et al., 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437 (7057):376-80. Epub Jul. 31, 2005).

Data Processing

Processing Pipeline:

Input Data

[0161] Raw sequence data were received for each run:

[0162] 200,000-400,000 reads

[0163] base calling quality scores

Trimming and Tagging

[0164] These sequence data are analyzed for the presence of Keygene Recognition Sites (KRS) at the beginning and end of the read. These KRS sequences consist of both AFLP-adaptor and sample label sequence and are specific for a certain AFLP primer combination on a certain sample. The KRS sequences are identified by BLAST and trimmed and the restriction sites are restored. Reads are marked with a tag for identification of the KRS origin. Trimmed sequences are selected on length (minimum of 33 nt) to participate in further processing.
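The trimming and tagging step can be illustrated with the following Python sketch. It is a simplification under stated assumptions: an exact prefix match stands in for the BLAST search used in the pipeline, only the start of the read is examined, and the restriction site is not restored. The tagged primer sequences are those of SEQ ID 11-14 from Example 1, paired with the ES1/ES2/MS1/MS2 labels on the assumption that KRS1 primers mark sample 1 and KRS2 primers mark sample 2; the function name `trim_and_tag` is hypothetical.

```python
# Tagged AFLP primers of Example 1 (SEQ ID 11-14), keyed by read label.
TAGGED_PRIMERS = {
    "ES1": "CGTCAGACTGCGTACCAATTCA",  # sample 1, EcoRI side
    "ES2": "CAAGAGACTGCGTACCAATTCA",  # sample 2, EcoRI side
    "MS1": "TGGTGATGAGTCCTGAGTAACA",  # sample 1, MseI side
    "MS2": "AGCCGATGAGTCCTGAGTAACA",  # sample 2, MseI side
}
MIN_TRIMMED_LENGTH = 33  # minimum length stated in the text


def trim_and_tag(read):
    """Return (label, trimmed_sequence), or None if the read is rejected.

    Simplification: exact prefix matching replaces the BLAST-based
    identification, so reads with sequencing errors in the tag are lost.
    """
    read = read.upper()
    for label, primer in TAGGED_PRIMERS.items():
        if read.startswith(primer):
            trimmed = read[len(primer):]
            if len(trimmed) >= MIN_TRIMMED_LENGTH:
                return label, trimmed
            return None  # tag found, but too short after trimming
    return None  # no known tagged primer at the start of the read


if __name__ == "__main__":
    example = TAGGED_PRIMERS["MS1"] + "ACGT" * 10
    print(trim_and_tag(example))
```

A production pipeline would tolerate mismatches in the tag and also trim a second tagged primer at the read end when the fragment is short enough to be read through.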

Clustering and Assembly

[0165] A MegaBlast analysis is performed on all size-selected, trimmed reads to obtain clusters of homologous sequences. Subsequently, all clusters are assembled with CAP3 to yield assembled contigs. In both steps, unique sequence reads that do not match any other reads are identified. These reads are marked as singletons.

[0166] The processing pipeline carrying out the steps described hereinbefore is shown in FIG. 4A.

Polymorphism Mining and Quality Assessment

[0167] The resulting contigs from the assembly analysis form the basis of polymorphism detection. Each mismatch in the alignment of each cluster is a potential polymorphism. Selection criteria are defined to obtain a quality score:

[0168] number of reads per contig

[0169] frequency of alleles per sample

[0170] occurrence of homopolymer sequence

[0171] occurrence of neighbouring polymorphisms
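Two of these criteria, homopolymer context and neighbouring polymorphisms, lend themselves to a compact Python sketch. The defaults (homopolymer runs of 3 or more bases, 12 flanking bases) follow the mining criteria stated for the pepper run; the function names, the use of 0-based positions on a consensus string, and the omission of the Q score itself are assumptions made for illustration.

```python
def in_homopolymer(consensus, pos, min_run=3):
    """True if the base at 0-based position pos lies in a run of at least
    min_run identical bases in the consensus sequence."""
    base = consensus[pos]
    start = pos
    while start > 0 and consensus[start - 1] == base:
        start -= 1
    end = pos
    while end < len(consensus) - 1 and consensus[end + 1] == base:
        end += 1
    return end - start + 1 >= min_run


def passes_context_filters(consensus, pos, other_positions, flank=12, min_run=3):
    """Reject polymorphisms in homopolymers or with a neighbouring
    polymorphism within `flank` bases on either side."""
    if in_homopolymer(consensus, pos, min_run):
        return False
    return all(abs(pos - other) > flank
               for other in other_positions if other != pos)


if __name__ == "__main__":
    consensus = "ACGTACGTAAAACGTACGTACGTACGTACGTACGT"
    print(passes_context_filters(consensus, 2, [30]))   # isolated position
    print(passes_context_filters(consensus, 9, [30]))   # inside the AAAA run
    print(passes_context_filters(consensus, 2, [10]))   # neighbour 8 bases away
```

Because the consensus carries only one allele at the tested position, a stricter variant would repeat the homopolymer test with each observed allele substituted in turn.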

[0172] SNPs and indels with a quality score above the threshold are identified as putative polymorphisms. For SSR mining, the MISA (MIcroSAtellite identification) tool is used (http://pgrc.ipk-gatersleben.de/misa). This tool identifies di-, tri-, tetranucleotide and compound SSR motifs with predefined criteria and summarizes occurrences of these SSRs. The polymorphism mining and quality assignment process is shown in FIG. 4B.

Results

[0173] The table below summarizes the results of the combined analysis of sequences obtained from two 454 sequence runs for the combined pepper samples and two runs for the combined maize samples.

TABLE-US-00018

                                                               Pepper    Maize
Total number of reads                                          457178    492145
Number of trimmed reads                                        399623    411008
Number singletons                                              105253    313280
Number of contigs                                              31863     14588
Number of reads in contigs                                     294370    97728
Total number of sequences containing SSRs                      611       202
Number of different SSR-containing sequences                   104       65
Number of different SSR motifs (di, tri, tetra and compound)   49        40
Number SNPs with Q score ≥ 0.3*                                1636      782
Number of indels*                                              4090      943

*Both with selection against neighboring SNPs, at least 12 bp flanking sequence and not occurring in homopolymer sequences larger than 3 nucleotides.

Example 3. SNP Validation by PCR Amplification and Sanger Sequencing

[0174] In order to validate the putative A/G SNP identified in example 1, a sequence tagged site (STS) assay for this SNP was designed using flanking PCR primers. PCR primer sequences were as follows:

TABLE-US-00019
Primer_1.2f: [SEQ ID 33] 5'-AAACCCAAACTCCCCCAATC-3', and
Primer_1.2r: [SEQ ID 34] 5'-AGCGGATAACAATTTCACACAGGACATCAGTAGTCACACTGGTACAAAAATAGAGCAAAACAGTAGTG-3'

[0175] Note that primer 1.2r contained an M13 sequence primer binding site and length stuffer at its 5' end. PCR amplification was carried out using +A/+CA AFLP amplification products of PSP11 and PI201234 prepared as described in example 1 as template. PCR conditions were as follows:

[0176] For 1 PCR reaction the following components were mixed:

5 µl 1/10 diluted AFLP mixture (app. 10 ng/µl)
5 µl 1 pmol/µl primer 1.2f (diluted directly from a 500 µM stock)
5 µl 1 pmol/µl primer 1.2r (diluted directly from a 500 µM stock)

TABLE-US-00020
5 µl PCR mix:
  2 µl      10× PCR buffer
  1 µl      5 mM dNTPs
  1.5 µl    25 mM MgCl2
  0.5 µl    H2O
5 µl Enzyme mix:
  0.5 µl    10× PCR buffer (Applied Biosystems)
  0.1 µl    5 U/µl AmpliTaq DNA polymerase (Applied Biosystems)
  4.4 µl    H2O

[0177] The following PCR profile was used:

TABLE-US-00021
Cycle 1       2 min at 94° C.
Cycle 2-34    20 s at 94° C.; 30 s at 56° C.; 2 min 30 s at 72° C.
Cycle 35      7 min at 72° C.; hold at 4° C.

[0178] PCR products were cloned into vector pCR2.1 (TA Cloning kit; Invitrogen) using the TA Cloning method and transformed into INVαF' competent E. coli cells. Transformants were subjected to blue/white screening. Three independent white transformants each for PSP11 and PI-201234 were selected and grown overnight in liquid selective medium for plasmid isolation.

[0179] Plasmids were isolated using the QIAprep Spin Miniprep kit (QIAGEN). Subsequently, the inserts of these plasmids were sequenced according to the protocol below and resolved on the MegaBACE 1000 (Amersham). The obtained sequences were inspected for the presence of the SNP allele. Two independent plasmids containing the PI-201234 insert and 1 plasmid containing the PSP11 insert contained the expected consensus sequence flanking the SNP. Sequence derived from the PSP11 fragment contained the expected A (underlined) allele and sequence derived from the PI-201234 fragment contained the expected G allele (double underlined):

TABLE-US-00022
PSP11 (sequence 1): (5'-3') [SEQ ID 35]
AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAATGTTGGTTTTGGTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTGT
PI-201234 (sequence 1): (5'-3') [SEQ ID 36]
AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAGTGTTGGTTTTGGTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTG
PI-201234 (sequence 2): (5'-3') [SEQ ID 37]
AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAGTGTTGGTTTTGGTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTG

[0180] This result indicates that the putative pepper A/G SNP represents a true genetic polymorphism detectable using the designed STS assay.
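The position of the validated SNP can be confirmed by a direct comparison of the Sanger sequences SEQ ID 35 (PSP11) and SEQ ID 36 (PI-201234). The Python sketch below is only a convenience check: it compares the two sequences base by base over their common length, which is sufficient here because they share the same start and contain no indels; the function name `mismatches` is hypothetical.

```python
# Sanger-derived sequences from Example 3 (SEQ ID 35 and SEQ ID 36).
PSP11 = ("AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAATGTTGGTTTTG"
         "GTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTGT")
PI201234 = ("AAACCCAAACTCCCCCAATCGATTTCAAACCTAGAACAGTGTTGGTTTTG"
            "GTGCTAACTTCAACCCCACTACTGTTTTGCTCTATTTTTG")


def mismatches(seq_a, seq_b):
    """List (1-based position, base_a, base_b) for every differing position
    over the common length of two ungapped sequences."""
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b))
            if a != b]


if __name__ == "__main__":
    for position, base_psp11, base_pi in mismatches(PSP11, PI201234):
        print(f"position {position}: PSP11={base_psp11}, PI-201234={base_pi}")
```

The single difference found is the A/G substitution, located directly after the AGAACA stretch that precedes the [A/G] site in the FIG. 5 consensus.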

REFERENCES

[0181] 1. Zabeau, M. and Vos, P. (1993) Selective restriction fragment amplification; a general method for DNA fingerprinting. EP 0534858-A1, B1, B2; U.S. Pat. No. 6,045,994.

[0182] 2. Vos, P., Hogers, R., Bleeker, M., Reijans, M., van de Lee, T., Hornes, M., Frijters, A., Pot, J., Peleman, J., Kuiper, M. et al. (1995) AFLP: a new technique for DNA fingerprinting. Nucl. Acids Res., 23(21), 4407-4414.

[0183] 3. van der Meulen, M., Buntjer, J., van Eijk, M. J. T., Vos, P. and van Schaik, R. (2002) Highly automated AFLP fingerprint analysis on the MegaBACE capillary sequencer. Plant, Animal and Microbial Genome X, San Diego, Calif., January 12-16, P228, pp. 135.

[0184] 4. Margulies et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature advanced online publication 03959, August 1.

[0185] 5. Michelmore, R. W., Paran, I. and Kesseli, R. V. (1991) Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc. Natl. Acad. Sci. USA 88(21):9828-32.

[0186] 6. Shendure et al. (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Scienceexpress Report, August 4.