High throughput screening of populations carrying naturally occurring mutations

Abstract

Efficient methods are disclosed for the high throughput identification of mutations in genes in members of mutagenized populations. The methods comprise DNA isolation, pooling, amplification, creation of libraries, high throughput sequencing of libraries, preferably by sequencing-by-synthesis technologies, identification of mutations, and identification of the member of the population carrying the mutation.

Claims

1. A composition comprising a set of primers for use in a method for detecting a mutation in one or more nucleic acid samples, wherein the set of primers comprises a forward primer comprising a tag sequence and a reverse primer comprising a tag sequence, wherein at least one of the primers comprises in a 5′ to 3′ direction: a primer binding site; the tag sequence for nucleic acid and/or sample identification upon sequencing; and a gene-specific sequence.

2. The composition of claim 1, wherein both primers comprise a primer binding site.

3. The composition of claim 1, wherein the tag consists of 2, 3, 4 or 5 nucleotides.

4. The composition of claim 1, wherein the length of the primer binding site is between 10-30 nucleotides.

5. The composition of claim 1, wherein the length of the gene-specific sequence is between 10-30 nucleotides.

6. The composition of claim 1, wherein the primer binding site is a sequence primer binding site.

7. The composition of claim 6, wherein the primer binding site is a sequence primer binding site for high-throughput sequencing.

8. The composition of claim 7, wherein the high-throughput sequencing is sequencing-by-synthesis.

9. The composition of claim 1, wherein each primer binding site is a sequencing primer binding site for high-throughput sequencing, and wherein the high-throughput sequencing is bi-directional sequencing.

10. The composition of claim 1, wherein the primer binding site can anneal to a nucleic acid coupled to a solid support for high-throughput sequencing.

11. The composition of claim 10, wherein the solid support is a bead.

12. The composition of claim 1, wherein the tag sequence is a nucleic acid identifier sequence and/or a sample identifier sequence.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1: Schematic representation of clustered sequences resulting from shotgun sequencing a gene to identify EMS-induced mutations. Mutations are lighter, sequence errors darker colored. Sequence errors are expected to be observed randomly and most often just once.

(2) FIG. 2: Schematic representation of clustered tagged sequencing resulting from a 100 bp gene region amplified with 4 bp-tagged PCR primers from a 3-D pooled library. Mutations are lighter, sequence errors darker colored. Plant IDs are known for mutations identified by 3 tags (1,2,3) and (4,5,6) but not for those identified by fewer than 2 tags (7,8). Sequence errors are expected to be observed randomly and just once.

(3) FIG. 3: Illustration of the system of long and short PCR primers to use in tagging the sequences.

(4) FIG. 4: Agarose gel estimation of the PCR amplification yield of eIF4E exon 1 amplification for each of the 28 3D pools.

DETAILED DESCRIPTION OF THE INVENTION

(5) In one aspect the invention is directed to a method for the detection of a mutation in a target sequence in a member of a mutagenized population comprising the steps of: (a) Isolating genomic DNA of each member of the mutagenized population to provide for DNA samples of each member in the population; (b) pooling the DNA obtained in step (a); (c) amplifying the target sequence with a pair of (optionally labeled) primers from the DNA pools; (d) pooling the amplification products of step (c) to create a library of amplification products; (e) optionally, fragmenting the amplification products in the library; (f) determining the nucleotide sequence of the products and/or fragments using high throughput sequencing; (g) identifying mutations by clustering (aligning) the sequences of the fragments; (h) screening the identified mutations for a modified function of the target sequence; (i) designing a primer directed to hybridize to the identified mutation; (j) amplifying the library of step (d) with the primer of step (i) and one of the primers of step (c); (k) identifying the member(s) carrying the mutation; (l) optionally, confirming the mutation by amplifying the target sequence from the member(s) of step (k) using the primers of step (c) and determining the sequence of the amplified product.

(6) The isolation of DNA is generally achieved using common methods in the art such as the collection of tissue from a member of the population, DNA extraction (for instance using the Q-Biogene fast DNA kit), quantification and normalization to obtain equal amounts of DNA per sample. As an example, the present invention is illustrated based on a TILLING population of 3072 plants and a gene of 1500 bp.

(7) The pooling of the isolated DNA can for instance be achieved using a 3-dimensional pooling scheme (Vandenbussche et al., 2003, The Plant Cell, 15: 2680-93). The pooling is achieved preferably using equal amounts of DNA. The 3D-pooling scheme may comprise 15×15×14, resulting in 44 pools (15+15+14) containing 3072/14=219 or 3072/15=205 different DNA samples per pool. Other pooling schemes can be used.
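The arithmetic of the illustrative 15×15×14 scheme can be checked with a short sketch. The function name `pooling_summary` and its return layout are illustrative assumptions, not part of the disclosed method; the sketch only reproduces the pool count and the average number of samples per pool stated above.

```python
from math import prod

def pooling_summary(dims, n_samples):
    """Summarize a multidimensional pooling grid (illustrative helper)."""
    capacity = prod(dims)  # number of grid positions available
    if n_samples > capacity:
        raise ValueError("grid too small for the population")
    return {
        "pools": sum(dims),  # one pool per plane in each dimension
        "capacity": capacity,
        # average number of samples in a pool of a dimension of size d
        "mean_samples_per_pool": {d: n_samples / d for d in sorted(set(dims))},
    }

summary = pooling_summary((15, 15, 14), 3072)
print(summary)
```

Under these assumptions the grid holds 3150 positions, enough for the 3072 samples, and the 44 pools contain on average about 219 or 205 samples depending on the dimension.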

(8) The pooling step typically serves to identify the plant containing an observed mutation after one round of PCR screening. Pooling of the DNA further serves to normalize the DNAs prior to PCR amplification to provide for a more equal representation in the libraries for sequencing. The additional advantage of the pooling of the DNA is that not all sequences have to be determined separately, but that the pools allow for rapid identification of the sequences of interest, in particular when tagged libraries are used. This facilitates the screening of large or complex populations in particular.

(9) The amplification of the target sequence with a pair of optionally labeled primers from the pools can be achieved by using a set of primers that have been designed to amplify the gene of interest. As stated, the primers may be labeled to visualize the amplification product of the gene of interest.

(10) The amplification products are pooled, preferably in equal or normalized amounts, to thereby create a library of amplification products. By way of example, the complexity of the library will be 3072 plants×1500 bp gene sequence=4.6 Mb sequence.

(11) The amplification products in the library may be randomly fragmented prior to sequencing of the fragments in case the PCR product length exceeds the average length of the sequence traces. Fragmentation can be achieved by physical techniques, e.g., shearing, sonication or other random fragmentation methods. In step (f), at least part, but preferably the entire, nucleotide sequence of at least part of, but preferably of all, the fragments contained in the libraries is determined. In certain embodiments, the fragmentation step is optional. For instance, when the read length of the sequencing technique and the PCR fragment length are about the same, there is no need for fragmentation. Also in the case of larger PCR products fragmentation may not be necessary if it is acceptable that only part of the PCR product is sequenced; for instance, in the case of a 1500 bp PCR product and a read length of 400 bp (from each side), 700 bp remain unsequenced.

(12) The sequencing may in principle be conducted by any means known in the art, such as the dideoxy chain termination method (Sanger sequencing), but this is less preferred given the large number of sequences that have to be determined. It is however preferred and more advantageous that the sequencing is performed using high-throughput sequencing methods, such as the methods disclosed in WO 03/004690, WO 03/054142, WO 2004/069849, WO 2004/070005, WO 2004/070007, and WO 2005/003375 (all in the name of 454 Life Sciences), by Seo et al. (2004) Proc. Natl. Acad. Sci. USA 101:5488-93, and technologies of Helicos, Solexa, US Genomics, etcetera, which are herein incorporated by reference. It is most preferred that sequencing is performed using the apparatus and/or method disclosed in WO 03/004690, WO 03/054142, WO 2004/069849, WO 2004/070005, WO 2004/070007, and WO 2005/003375 (all in the name of 454 Life Sciences), which are herein incorporated by reference. The technology described allows sequencing of 40 million bases in a single run and is 100 times faster and cheaper than competing technology. The sequencing technology roughly consists of 5 steps: 1) fragmentation of DNA and ligation of specific adaptors to create a library of single-stranded DNA (ssDNA); 2) annealing of ssDNA to beads, emulsification of the beads in water-in-oil microreactors and performing emulsion PCR to amplify the individual ssDNA molecules on beads; 3) selection of/enrichment for beads containing amplified ssDNA molecules on their surface; 4) deposition of DNA-carrying beads in a PicoTiterPlate®; and 5) simultaneous sequencing in at least 100,000 wells by generation of a pyrophosphate light signal. The method will be explained in more detail below.

(13) In a preferred embodiment, the sequencing comprises the steps of: (a) annealing adapted fragments to beads, with a single adapted fragment being annealed to each bead; (b) emulsifying the beads in water-in-oil microreactors, each water-in-oil microreactor comprising a single bead; (c) loading the beads in wells, each well comprising a single bead; and (d) generating a pyrophosphate signal.

(14) In the first step (a), sequencing adaptors are ligated to fragments within the library. The sequencing adaptor includes at least a “key” region for annealing to a bead, a sequencing primer region and a PCR primer region. Thus, adapted fragments are obtained.

(15) In a second step, adapted fragments are annealed to beads, each bead annealing with a single adapted fragment. To the pool of adapted fragments, beads are added in excess so as to ensure annealing of one single adapted fragment per bead for the majority of the beads (Poisson distribution).
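The effect of adding beads in excess can be illustrated with the Poisson distribution mentioned above. The following is a minimal sketch; the mean of 0.1 fragments per bead is an assumed value chosen only for illustration and is not taken from the disclosure.

```python
import math

def bead_occupancy(mean_fragments_per_bead):
    """Poisson probabilities of a bead carrying 0, exactly 1, or more than 1 fragment."""
    lam = mean_fragments_per_bead
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    p_multiple = 1.0 - p_empty - p_single
    return p_empty, p_single, p_multiple

# Assumed ratio: one fragment per ten beads (beads in 10-fold excess).
p_empty, p_single, p_multiple = bead_occupancy(0.1)
single_among_loaded = p_single / (p_single + p_multiple)
print(f"empty={p_empty:.3f} single={p_single:.3f} multiple={p_multiple:.4f}")
print(f"fraction of loaded beads with one fragment: {single_among_loaded:.3f}")
```

With this assumed ratio most beads stay empty, but the large majority of beads that do carry DNA carry a single fragment, which is the condition the excess of beads is meant to secure.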

(16) In a next step, the beads are emulsified in water-in-oil microreactors, each water-in-oil microreactor comprising a single bead. PCR reagents are present in the water-in-oil microreactors allowing a PCR reaction to take place within the microreactors. Subsequently, the microreactors are broken, and the beads comprising DNA (DNA positive beads) are enriched.

(17) In a following step, the beads are loaded in wells, each well comprising a single bead. The wells are preferably part of a PicoTiter™ Plate allowing for simultaneous sequencing of a large amount of fragments.

(18) After addition of enzyme-carrying beads, the sequence of the fragments is determined using pyrosequencing. In successive steps, the PicoTiter™ Plate and the beads as well as the enzyme beads therein are subjected to different deoxyribonucleotides in the presence of conventional sequencing reagents, and upon incorporation of a deoxyribonucleotide a light signal is generated which is recorded. Incorporation of the correct nucleotide will generate a pyrosequencing signal which can be detected.

(19) Pyrosequencing itself is known in the art and described in e.g., WO 03/004690, WO 03/054142, WO 2004/069849, WO 2004/070005, WO 2004/070007, and WO 2005/003375 (all in the name of 454 Life Sciences), which are herein incorporated by reference.

(20) The mutations are identified by clustering of the sequenced fragments in the amplified library. Identification of the mutations is achieved by aligning the determined sequences of the fragments of the libraries. The majority of the sequences are wild-type (not mutated) but the induced mutations and occasional sequencing errors are also observed. As the amplification libraries are sequenced with multifold redundancy (typically about 4- to 5-fold redundant), multiple observations of the same sequence change are indicative of a mutation rather than a sequencing error. See FIG. 1.
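The principle that a repeatedly observed change indicates a mutation, whereas a change seen once is more likely a sequencing error, can be sketched as follows. This is a simplified illustration and not the clustering software of the invention: it assumes reads that are already aligned to the wild-type sequence without gaps, and the function name, the toy sequences and the threshold of two observations are assumptions.

```python
from collections import Counter

def call_mutations(reference, aligned_reads, min_observations=2):
    """Report substitutions seen at least `min_observations` times.

    aligned_reads: iterable of (start_offset, read_sequence), gap-free.
    Returns a sorted list of (position, reference_base, observed_base).
    """
    counts = Counter()
    for start, read in aligned_reads:
        for i, base in enumerate(read):
            pos = start + i
            if pos < len(reference) and base != reference[pos]:
                counts[(pos, reference[pos], base)] += 1
    return sorted(change for change, n in counts.items() if n >= min_observations)

reference = "ACGTACGTAC"
reads = [
    (0, "ACGTACGTAC"),  # wild-type
    (0, "ACGTATGTAC"),  # C->T at position 5
    (2, "GTATGT"),      # same C->T at position 5, second observation
    (0, "ACCTACGTAC"),  # G->C at position 2, seen only once
]
print(call_mutations(reference, reads))
```

In this toy data set only the C to T change at position 5 is reported; the single G to C observation is treated as a probable sequencing error.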

(21) The clustering provides alignments of the fragments in the amplified library. In this way, for each PCR product in the library, a cluster is generated from sequenced fragments, i.e., a contig of the fragments is built up from the alignment of the sequences of the various fragments obtained from the fragmenting in step (e).

(22) Methods of alignment of sequences for comparison purposes are well known in the art. Various programs and alignment algorithms are described in: Smith and Waterman (1981) Adv. Appl. Math. 2:482; Needleman and Wunsch (1970) J. Mol. Biol. 48:443; Pearson and Lipman (1988) Proc. Natl. Acad. Sci. USA 85:2444; Higgins and Sharp (1988) Gene 73:237-244; Higgins and Sharp (1989) CABIOS 5:151-153; Corpet et al. (1988) Nucl. Acids Res. 16:10881-90; Huang et al. (1992) Computer Appl. in the Biosci. 8:155-65; and Pearson et al. (1994) Meth. Mol. Biol. 24:307-31, which are herein incorporated by reference. Altschul et al. (1994) Nature Genet. 6:119-29 (herein incorporated by reference) present a detailed consideration of sequence alignment methods and homology calculations.

(23) The NCBI Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990) is available from several sources, including the National Center for Biotechnology Information (NCBI, Bethesda, Md.) and on the Internet, for use in connection with the sequence analysis programs blastp, blastn, blastx, tblastn and tblastx.

(24) In the analysis of mutagenized populations, after the mutations have been identified, the identified mutations are assessed for a modified function of the associated gene, for instance the introduction of a stop codon. This assessment is performed on the sequence itself, for example by six-frame translation. Once the interesting mutations have been identified, the mutations are further investigated to identify the associated member of the population.
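An in silico assessment of whether a point mutation introduces a stop codon may be sketched as below. For brevity the sketch assumes that the reading frame of the coding sequence is already known, which is a simplification of the six-frame translation mentioned above; the function name and the example sequences are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def introduces_stop(coding_sequence, position, new_base):
    """True if substituting `new_base` at 0-based `position` creates a stop codon.

    Assumes `coding_sequence` starts in frame (position 0 is the first base
    of a codon).
    """
    codon_start = position - position % 3
    old_codon = coding_sequence[codon_start:codon_start + 3]
    mutated = coding_sequence[:position] + new_base + coding_sequence[position + 1:]
    new_codon = mutated[codon_start:codon_start + 3]
    return (len(new_codon) == 3
            and old_codon not in STOP_CODONS
            and new_codon in STOP_CODONS)

# CAA (Gln) -> TAA (stop) by a C-to-T transition
print(introduces_stop("ATGCAAGGT", 3, "T"))
# GGT (Gly) -> GGA (Gly), a silent change
print(introduces_stop("ATGCAAGGT", 8, "A"))
```

A check of this kind allows silent and missense changes to be set aside before any member of the population is traced.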

(25) For each mutation that has been classified as an interesting mutation, an allele-specific primer is designed that targets the mutation of interest. The allele-specific primer is then used in combination with one of the primers used in the amplification of the pooled DNA samples (either the reverse or the forward primer). One or both of the primers may be labeled. The set of primers is used to amplify the pools of DNA. The positive pools are identified and the mutant plant is identified. In the above-mentioned 3D pooling scheme, the allele-specific PCR with the set of primers to screen the 3D pooled DNA sample plates results in the identification of 3 positive pools (one in each dimension), which specifies the library address of the mutant plant.

(26) In certain embodiments, the allele-specific primers comprise alternative nucleotides such as Locked Nucleic Acids (LNA) or Peptide Nucleic Acids (PNA) to increase their specificity. Such nucleic acids are widely known in the art and are commercially available from a choice of suppliers.

(27) Confirmation of the mutation is achieved by amplification of the target sequence from the identified mutant plant. This amplification is performed with the primers from step (c). The nucleotide sequence of the amplified product is determined and, by comparison with the consensus sequence, the mutation is identified. The sequencing is preferably performed by Sanger sequencing.

(28) In one aspect the invention pertains to a method for the detection of a mutation in a target sequence in a member of a mutagenized population comprising the steps of: (a) isolating genomic DNA of each member of the mutagenized population to provide DNA samples of each member in the population; (b) pooling the DNA obtained in step (a); (c) amplifying a part or segment of the target sequence with a pair of tagged (optionally labeled) primers from the DNA pools, preferably wherein at least one of the primers comprises a gene-specific section, a tag and a sequence primer binding site; (d) pooling the amplification products of step (c) to create a library of amplification products; (e) determining the nucleotide sequence of the amplification products using high throughput sequencing; (f) identifying mutations by clustering (aligning) the sequences of the fragments; (g) identifying the member(s) having the mutation using the tags; (h) optionally, confirming the mutation by amplifying the target sequence from the member(s) of step (g) using the primers of step (c) and determining the sequence of the amplified product.

(29) The isolation of genomic DNA of the members of the mutagenized population and the pooling of the isolated DNA can be carried out essentially as described above.

(30) A part or segment of the target sequence is amplified using a pair of tagged primers that may be labeled. Preferably, for each pool of each dimension, a different primer is used. In the above illustration this means that 44 forward and 44 reverse primers are preferred. Preferably, each of the forward and reverse primers comprises (i) a sequence primer binding site that can be used in the following sequencing step, (ii) a tag that serves to link the primer (and the resulting amplification product) to the original member of the population, and (iii) a gene specific sequence that is capable of annealing to the target sequence of interest (i.e., the gene).

(31) In a typical embodiment the primer has the following order:

(32) 5′-Sequence primer binding site—Tag—Gene specific PCR primer sequence-3′

(33) The length of the sequence primer binding site and the gene specific PCR primer sequence are those that are conventional in common PCR use, i.e., independently from about 10 to about 30 bp with a preference for from 15 to 25 bp. Preferably the part or segment of the sequence that is amplified corresponds to a length that can be sequenced in one run using the high throughput sequencing technologies described below. In certain embodiments the part or segment has a length of between about 50 bp to about 500 bp, preferably from about 75 bp to about 300 bp and more preferably between about 90 bp and about 250 bp. As stated above, this length may vary with the sequencing technology employed including those yet to be developed.

(34) By using primers (forward and/or reverse) containing a tag sequence that is unique for each of the primers representing all pool dimensions, the specific plant origin of each tag sequence is known as the sequence primer anneals upstream of the tag and as a consequence, the tag sequence is present in each amplification product. In certain embodiments, both forward and reverse primers are tagged. In other embodiments, only one of the forward or reverse primers is tagged. The choice between one or two tags depends on the circumstances, in particular on the read length of the high throughput sequencing reaction and/or the necessity of independent validation. In the case of, e.g., a 100 bp PCR product that is sequenced unidirectionally, only one tag is needed. In the case of a 200 bp PCR product and a 100 bp read-length, double tagging is useful in combination with bi-directional sequencing as it improves efficiency 2-fold. It further provides the possibility of independent validation in the same step. When a 100 bp PCR product is sequenced bi-directionally with two tagged primers, all traces, regardless of orientation, will provide information about the mutation. Hence both primers provide “address information” about which plant contains which mutation.

(35) The tag can be any number of nucleotides, but preferably contains 2, 3, 4 or 5 nucleotides. With 4 nucleotides permuted, 256 tags are possible, whereas 3 nucleotides permuted provide 64 different tags. In the illustration used, the tags preferably differ by >1 base, so preferred tags are 4 bp in length. Amplification using these primers results in a library of tagged amplification products.
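The preference for 4 bp tags when tags must differ by more than one base can be illustrated with a small sketch. The greedy selection shown is merely one possible way of choosing such tags and is an assumption; the disclosure does not prescribe a selection procedure, and the function names are illustrative.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))

def pick_tags(n_tags, length=4, min_distance=2):
    """Greedily pick tags that pairwise differ in at least `min_distance` bases."""
    chosen = []
    for candidate in map("".join, product("ACGT", repeat=length)):
        if all(hamming(candidate, tag) >= min_distance for tag in chosen):
            chosen.append(candidate)
            if len(chosen) == n_tags:
                return chosen
    raise ValueError(f"only {len(chosen)} tags of length {length} satisfy the constraint")

tags = pick_tags(44)  # one tag per pool of the 15+15+14 scheme
print(tags[:6])
```

With this selection, 44 tags of 4 bp that all differ in at least two positions are readily found, whereas the same request with `length=3` fails because too few 3 bp tags satisfy the constraint.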

(36) In certain embodiments, a system of tags can be used wherein the amplification process includes (1) a long PCR primer comprising (a) a 5′-constant section linked to (b) a degenerate tag section (NNNN) linked to (c) a gene specific section-3′ and (2) a short PCR primer in subsequent amplifications that consists of (a) the 5′-constant section linked to (b) non-degenerate tag section-3′ (i.e., a selection amongst NNNN). The non-degenerate tag section can be unique for each sample, for example, ACTG for sample 1, AATC for sample 2, etc. The short primer anneals to a subset of the long primer. The constant section of the primer can be used as a sequence primer. See FIG. 3.

(37) The library preferably comprises equal amounts of PCR products from all amplified pools. In the illustrative example, the library contains 3072 plants×100 bp=307 kb sequence to be determined.

(38) The PCR products in the library are subjected to a sequencing process as disclosed above. In particular, the PCR products are attached to beads using the sequence primer binding site that corresponds to the sequence linked to the bead. Thus the present embodiment does not require fragmentation and adapter ligation. Rather, in this embodiment, the adapters have been introduced earlier via the PCR primer design. This improves the reliability of the method. Following the annealing to the beads, sequencing is performed as described above, i.e., (1) emulsification of the beads in water-in-oil microreactors, (2) emulsion PCR to amplify the individual ssDNA molecules on beads; (3) selection of/enrichment for beads containing amplified ssDNA molecules on their surface, (4) transfer of the DNA-carrying beads to a PicoTiterPlate®; and (5) simultaneous sequencing in 100,000 wells by a method that generates a pyrophosphate light signal. Typical output is about 200,000×100-200 bp sequences, representing 66-fold coverage of all PCR products in the library.
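The stated coverage follows from dividing the sequencing output by the library complexity. The sketch below reproduces that calculation for the tagged 100 bp library and, for comparison, for the 1500 bp shotgun library; the function name is an illustrative assumption and the figures are those of the illustrative example (200,000 reads of 100 bp, 3072 plants).

```python
def fold_coverage(n_reads, read_length_bp, n_plants, amplicon_bp):
    """Average redundancy: total sequenced bases divided by library complexity."""
    return (n_reads * read_length_bp) / (n_plants * amplicon_bp)

tagged_100bp = fold_coverage(200_000, 100, 3072, 100)
shotgun_1500bp = fold_coverage(200_000, 100, 3072, 1500)
print(f"100 bp amplicon:  {tagged_100bp:.1f}-fold")
print(f"1500 bp amplicon: {shotgun_1500bp:.1f}-fold")
```

The result for the 100 bp amplicon is about 65-fold, in line with the approximately 66-fold coverage stated above, and about 4.3-fold for the 1500 bp gene, in line with the 4- to 5-fold redundancy of the shotgun approach.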

(39) Clustering and alignment is performed essentially as described above. The individual plant containing the mutation can be identified using the tags. In the examples, the combination of the 3 tags denotes the positive pools and consequently the coordinates of the individual plant in the pools.

(40) Confirmation of the mutation by re-sequencing of the PCR product of the identified mutant sample is as described above.

(41) Various pooling strategies can be used with the present invention, examples of which are multidimensional pooling (including 3D pooling) or column-, row- or plate pooling.

(42) High throughput sequencing methods that can be used here are described, for example, in Shendure et al., Science 309:1728-32. Examples include microelectrophoretic sequencing, hybridization sequencing/sequencing by hybridization (SBH), cyclic-array sequencing on amplified molecules, cyclic-array sequencing on single molecules, non-cyclical, single-molecule, real-time methods, such as, polymerase sequencing, exonuclease sequencing, or nanopore sequencing.

(43) For optimal results, fragments or amplified products should be sequenced with sufficient redundancy. Redundancy permits distinction between a sequencing error and a genuine mutation. In certain embodiments, the redundancy of the sequencing is preferably at least 4, more preferably at least 5, but, as can be seen from the Examples, redundancies of more than 10, preferably more than 25 or even more than 50 are considered advantageous, although not essential for this invention.

(44) Advantages of the methods of the present invention reside inter alia in the fact that mutations can be assessed in silico for their impact on gene function, meaning that a selection is made for the active mutations. Mutations conferring only silent substitutions can be selected against, thereby making the overall process more economical and efficient. This is a particular advantage with regard to the known CEL I based TILLING technology because the majority of EMS-induced mutations are C/G to T/A transitions, of which only 5% commonly create stop codons (Colbert et al. 2001). The vast majority are missense mutations of reduced interest. Efficient recognition of members in a population with stop codon mutations economizes the process and obviates the need for additional screening of individual members of positive pools.

(45) All mutations can be found with equal probability, irrespective of their position in the PCR product, in particular when the whole target sequence is screened.

(46) The method further avoids the use of CEL I digestion, heteroduplex formation and cumbersome gel scoring. The invention is therefore insensitive to pooling limitations associated with CEL I technology.

(47) The invention further relates to kits that may contain one or more compounds selected from the group consisting of: one or more (labeled) primers for a particular gene or trait, mutation- or allele-specific primers. The kits may further contain beads, sequencing primers, software, descriptions for pooling strategies and other components that are known for kits per se. In certain embodiments, kits are provided that are dedicated to find specific mutations, for instance disease-related mutations.

(48) The invention is now illustrated hereinbelow.

EXAMPLES

(49) Screening a TILLING population can be advanced by using novel high-throughput sequencing methods, such as that of 454 Life Sciences (Margulies et al., 2005) or Polony Sequencing (Shendure et al., 2005). With the current state-of-the-art, 454 Life Sciences technology produces approximately 20 Mb sequence in a single sequencing run. Read lengths are approximately 100 bp per read. Assuming the screening of a population consisting of 3072 plants for mutations in a 1500 bp gene (as described in the above-cited reference in Chapter 2), two approaches are envisaged and described in more detail below: (1) an approach where the entire 1500 bp gene is investigated for the presence of EMS-induced mutations; and (2) an approach where one or several 100 bp stretches are investigated for the presence of EMS-induced mutations.

Example I

Screening the Entire 1500 bp Region

(50) Genomic DNA of 3072 plants of the TILLING population is isolated. A 3-D pooling scheme of equal amounts of DNA per plant is set up (e.g., 15×15×14), resulting in 44 pools (15+15+14=44) containing 3072/14=219 or 3072/15=205 different DNA samples (Vandenbussche et al., supra).

(51) This pooling step serves to permit identification of a plant containing an observed mutation after one round of PCR screening (step 8). Pooling of genomic DNAs further serves to normalize DNAs prior to PCR amplification to increase the probability that all DNAs are represented equally in the sequence library.

(52) The 1500 bp gene is amplified from the pooled DNA samples using 1 pair of unlabelled PCR primers.

(53) Equal amounts of PCR products from all pool wells are pooled to create a pooled PCR products library (complexity 3072 plants×1500 bp=4.6 Mb sequence).

(54) The pooled PCR product library is subjected to shotgun sequencing using conventional technologies (such as those provided by 454 Life Sciences) wherein PCR products are randomly fragmented, amplified on individual beads and sequenced on the bead. Output is approximately 200,000 100 bp sequences, representing 4- to 5-fold coverage of all PCR products in the library.

(55) All sequences are clustered. The majority of sequences are wild-type but EMS-induced mutations (and sequence errors) are observed as well. Since PCR products are sequenced with 4-5 fold redundancy, multiple observations of the same sequence change are indicative of a mutation rather than a sequencing error (FIG. 1).

(56) Mutations are assessed for their impact on gene function such as introduction of a stop-codon.

(57) An allele-specific primer targeting a mutation of interest (with 3′ Locked Nucleic Acid; LNA; or Peptide Nucleic Acid; PNA) is designed to be used in combination with either the forward or reverse primer used in step 3 to screen the 3-D pooled DNA sample plate. Allele-specific PCR will result in three positive pools (one of each dimension), which specifies the library address of the mutant plant.

(58) The mutation is confirmed by amplifying the 1500 bp gene using the primers of step 3, followed by (bi-directional) Sanger sequencing.

Example II

Screening 100 bp Stretches

(59) (100 bp is the read length of one 454 sequence run)

(60) Genomic DNA of 3072 plants of the TILLING population is isolated. A 3-D pooling scheme of equal amounts of DNA per plant is set up (e.g., 15×15×14), resulting in 44 pools (15+15+14=44) containing 3072/14=219 or 3072/15=205 different DNA samples (Vandenbussche et al., supra).

(61) This pooling step serves to permit identification of the plant containing an observed mutation directly from the sequence data. Pooling of genomic DNAs further serves to normalize DNAs prior to PCR amplification to increase the probability that all DNAs are represented equally in the sequence library.

(62) A 100 bp (or 200 bp) region of the gene is amplified from the pools by PCR using tagged unlabelled PCR primers. This requires 44 forward and 44 reverse primers (one for each pool of each dimension) with the following configuration:

(63) 5′-Sequence primer binding site—4 bp Tag—gene specific primer sequence-3′

(64) By using tailed forward and reverse primers containing a 4 bp sequence tag that is different for each of the 44 primers representing all pool dimensions, the specific plant origin of each sequence is known as the sequence primer anneals upstream of the tag. Hence the tag sequence is present in each sequence trace. A 4 bp tag allows 4^4=256 different tags. A 3 bp tag allows 64 different tag sequences, which is sufficient to distinguish 44 tags, but tag sequences differing by more than 1 base are preferred.

(65) Equal amounts of PCR products from all pool wells are pooled to create a pooled PCR products library (complexity 3072 plants×100 bp=307 kb sequence).

(66) The pooled PCR product library is provided to 454 for sequencing, i.e., PCR products are amplified and sequenced on the beads. Output is approximately 200,000 100 bp sequences, representing 66-fold coverage of all PCR products in the library.

(67) All sequences (from either direction) are clustered; the majority of sequences are wild-type sequences but EMS-induced mutations (and sequence errors) are observed as well. Since PCR products are sequenced with 66 fold redundancy, multiple observations of the same sequence change are indicative of a mutation rather than a sequencing error (FIG. 1).

(68) The coordinates of the individual plant containing the mutation will be known immediately based on the unique combination of 3 tag sequences that occur in the sequence traces harboring the mutation (FIG. 2).
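The derivation of plant coordinates from the tags can be sketched as follows. The tag-to-pool assignment shown is hypothetical, as are the function name and the labels X, Y and Z for the three pool dimensions; the sketch returns no coordinates when fewer than three dimensions are represented, corresponding to the situation of FIG. 2 in which the plant ID is not known.

```python
def plant_coordinates(tags_on_mutant_reads, tag_to_pool):
    """Return (x, y, z) if the tags name exactly one pool per dimension, else None."""
    seen = {}
    for tag in tags_on_mutant_reads:
        dimension, index = tag_to_pool[tag]
        seen.setdefault(dimension, set()).add(index)
    if set(seen) != {"X", "Y", "Z"}:
        return None  # at least one dimension is missing
    if any(len(indices) != 1 for indices in seen.values()):
        return None  # ambiguous: more than one pool in a dimension
    return tuple(next(iter(seen[d])) for d in "XYZ")

# Hypothetical assignment of 4 bp tags to pools.
TAG_TO_POOL = {
    "AACC": ("X", 3),
    "ATAT": ("X", 9),
    "ACAC": ("Y", 7),
    "AGAG": ("Z", 12),
}

print(plant_coordinates(["AACC", "ACAC", "AGAG", "AACC"], TAG_TO_POOL))
print(plant_coordinates(["AACC", "ACAC"], TAG_TO_POOL))
```

In the first call the three tags specify pool 3, pool 7 and pool 12 of the respective dimensions; in the second call only two dimensions are represented and no address is returned.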

(69) The mutation is confirmed by amplifying the 1500 bp gene using the primers of step 3, followed by (bi-directional) Sanger sequencing.

Example III. Identifying Specific Mutations in a Mutant Library of Tomato

(70) Mutant Library of Tomato

(71) This example describes the screening of a mutant library of tomato by massive parallel sequencing in order to identify point mutations in a specific locus (target gene). The mutant library used is an isogenic library of inbred determinate tomato cultivar M82 consisting of 5075 M2 families derived from EMS mutagenesis treatments. Seeds of each of the 5075 M2 families were stored at 10% RH and 7° C. The origin and characteristics of the library are described in Menda et al. (Plant J. 38: 861-872, 2004).

(72) DNA Isolation

(73) Leaf material was harvested from 5 individual greenhouse-grown plants of each of 3072 M2 families randomly chosen from the library. As any mutation occurring in the library will segregate in a Mendelian fashion in the M2 offspring, the pooling of the leaf material of 5 individual M2 plants reduced the likelihood of overlooking any mutation as a consequence of segregation to less than 0.1%. Genomic DNA was isolated from the pooled leaf material using a modified CTAB procedure described by Stuart and Via (Biotechniques, 14: 748-750, 1993). DNA samples were diluted to a concentration of 100 ng/μl in TE (10 mM Tris-HCl pH 8.0, 1 mM EDTA) and stored at −20° C. in 96-well microtitre plates.
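The figure of less than 0.1% can be reproduced under a simple assumption that is not spelled out in the text: the mutation is heterozygous in the M1 parent, so that each M2 plant has a probability of 1/4 of carrying no mutant allele. The function name below is illustrative.

```python
from fractions import Fraction

def prob_mutation_missed(n_plants, p_no_mutant_allele=Fraction(1, 4)):
    """Probability that none of `n_plants` sampled M2 plants carries the mutation."""
    return p_no_mutant_allele ** n_plants

missed = prob_mutation_missed(5)
print(missed, f"= {float(missed):.4%}")
```

Under this assumption, sampling 5 plants per family gives a probability of 1/1024 of missing a segregating mutation, just below 0.1%, whereas sampling 4 plants would give 1/256, or about 0.4%.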

(74) 3D Pooling of the DNA Samples

(75) The isolated DNA samples were normalized to a concentration of 20 ng/μl and subsequently pooled 4-fold, resulting in 768 samples comprised in eight 96-well microtitre plates. Subsequently, these eight microtitre plates were subjected to a 3D pooling strategy, resulting in 28 pools of DNA. The 3D pooling strategy consisted of pooling together all DNAs in three different manners, thus ensuring that each single 4-fold pool occurs only once in an X-coordinate pool, only once in a Y-coordinate pool and only once in a Z-coordinate pool. X-pools were assembled by pooling all DNA samples together per column of eight wells (e.g. A1-H1) from all eight microtitre plates, resulting in 12 X-pools. Each X-pool therefore held 8 (wells in a column)×8 (plates)=64 samples of 4-fold pools, representing 256 M2 families. Y-pools were assembled by pooling all DNA samples together per row of twelve wells (e.g. A1-A12) from all eight microtitre plates, resulting in 8 Y-pools. Each Y-pool therefore held 12 (wells in a row)×8 (plates)=96 samples of 4-fold pools, representing 384 M2 families. Z-pools were assembled by pooling all DNA samples together from an entire microtitre plate, resulting in 8 Z-pools. Each Z-pool therefore held 12×8=96 samples of 4-fold pools, representing 384 M2 families.
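The 3D pooling layout can be sketched in a few lines. This is an illustration only: the numbering of the pools (X1 for column 1, Y1 for row A, Z1 for plate 1) is an assumption, since the text does not state which column, row or plate maps to which pool name, and `pools_for_well` is a hypothetical helper.

```python
from collections import defaultdict
from string import ascii_uppercase

ROWS = ascii_uppercase[:8]   # plate rows A-H
COLUMNS = range(1, 13)       # plate columns 1-12
PLATES = range(1, 9)         # eight 96-well plates

def pools_for_well(plate: int, row: str, column: int) -> tuple[str, str, str]:
    """Return the (X, Y, Z) pools that receive the 4-fold sample in one well.

    Assumed numbering: X = column, Y = row (A -> 1), Z = plate.
    """
    return f"X{column}", f"Y{ROWS.index(row) + 1}", f"Z{plate}"

# Collect the wells contributing to every pool.
pool_contents = defaultdict(list)
for plate in PLATES:
    for row in ROWS:
        for column in COLUMNS:
            for pool in pools_for_well(plate, row, column):
                pool_contents[pool].append((plate, row, column))

print(len(pool_contents))              # total number of 3D pools
print(len(pool_contents["X1"]) * 4)    # M2 families represented in one X-pool
print(len(pool_contents["Y1"]) * 4)    # M2 families represented in one Y-pool
```

Because every well maps to exactly one X-, one Y- and one Z-pool, a combination of three pool names identifies a single well, and hence a single 4-fold sample.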

(76) Target Locus

(77) The target locus in this example was part of the tomato gene for eukaryotic initiation factor 4E (eIF4E). This gene has been shown to be involved in susceptibility to infection by potyviruses in Arabidopsis (Duprat et al., Plant J. 32: 927-934, 2002), lettuce (Nicaise et al., Plant Physiol. 132: 1272-1282, 2003) and Solanaceae (Ruffel et al., Plant J. 32: 1067-1075, 2002; Mol. Gen. Genomics 274: 346-353, 2005), and specific mutations in this gene are associated with recessive potyvirus resistance. The mutation screening described in this example was aimed at identifying additional mutations in the tomato eIF4E gene as possible sources of new potyvirus resistance. For the tomato eIF4E, only the cDNA sequence was known (NCBI accession numbers AY723733 and AY723734). Using a PCR approach with primers designed on the basis of the cDNA sequence, fragments of the genomic sequence of the eIF4E locus of tomato cultivar Moneyberg were amplified and sequenced. This resulted in a sequence of most of the genomic locus of tomato eIF4E. The locus consists of 4 exons and 3 introns. For the mutation screening, exon 1 of the gene was chosen as the target sequence (SEQ ID NO: 57).

(78) SEQ ID NO: 57: Sequence of exon 1 of tomato Moneyberg eIF4E:

ATGGCAGCAGCTGAAATGGAGAGAACGATGTCGTTTGATGCAGCTGAGAA
GTTGAAGGCCGCCGATGGAGGAGGAGGAGAGGTAGACGATGAACTTGAAG
AAGGTGAAATTGTTGAAGAATCAAATGATACGGCATCGTATTTAGGGAAA
GAAATCACAGTGAAGCATCCATTGGAGCATTCATGGACTTTTTGGTTTGA
TAACCCTACCACTAAATCTCGACAAACTGCTTGGGGAAGCTCACTTCGAA
ATGTCTACACTTTCTCCACTGTTGAAAATTTTTGGGG

Primer Design for Target Locus Amplification

(79) Primers were designed for the PCR amplification of exon 1 of tomato eIF4E. The forward primers were designed to correspond to the ATG start codon of the Open Reading Frame of exon 1, with 5′ of the ATG a tag sequence of four bases, providing a unique identifier for each of the 28 pools. At the far 5′ end of the forward PCR primers, a 5′-C was added. All primers were phosphorylated at their 5′ end to facilitate subsequent ligation of adaptors. The sequence and names of the 28 forward primers are listed in Table 1. The tag sequences are underlined.

(80) TABLE 1. Forward primers, sequences and pool identification for exon 1 amplification.

name    sequence                  3D pool  SEQ ID NO:
06I009  CACACATGGCAGCAGCTGAAATGG  X1       1
06I010  CACAGATGGCAGCAGCTGAAATGG  X2       2
06I011  CACGAATGGCAGCAGCTGAAATGG  X3       3
06I012  CACGTATGGCAGCAGCTGAAATGG  X4       4
06I013  CACTCATGGCAGCAGCTGAAATGG  X5       5
06I014  CACTGATGGCAGCAGCTGAAATGG  X6       6
06I015  CAGACATGGCAGCAGCTGAAATGG  X7       7
06I016  CAGAGATGGCAGCAGCTGAAATGG  X8       8
06I017  CAGCAATGGCAGCAGCTGAAATGG  X9       9
06I018  CAGCTATGGCAGCAGCTGAAATGG  X10      10
06I019  CAGTCATGGCAGCAGCTGAAATGG  X11      11
06I020  CAGTGATGGCAGCAGCTGAAATGG  X12      12
06I021  CATCGATGGCAGCAGCTGAAATGG  Y1       13
06I022  CATGCATGGCAGCAGCTGAAATGG  Y2       14
06I023  CTACGATGGCAGCAGCTGAAATGG  Y3       15
06I024  CTAGCATGGCAGCAGCTGAAATGG  Y4       16
06I025  CTCACATGGCAGCAGCTGAAATGG  Y5       17
06I026  CTCAGATGGCAGCAGCTGAAATGG  Y6       18
06I027  CTCGAATGGCAGCAGCTGAAATGG  Y7       19
06I028  CTCGTATGGCAGCAGCTGAAATGG  Y8       20
06I029  CTCTCATGGCAGCAGCTGAAATGG  Z1       21
06I030  CTCTGATGGCAGCAGCTGAAATGG  Z2       22
06I031  CTGACATGGCAGCAGCTGAAATGG  Z3       23
06I032  CTGAGATGGCAGCAGCTGAAATGG  Z4       24
06I033  CTGCAATGGCAGCAGCTGAAATGG  Z5       25
06I034  CTGCTATGGCAGCAGCTGAAATGG  Z6       26
06I035  CTGTCATGGCAGCAGCTGAAATGG  Z7       27
06I036  CTGTGATGGCAGCAGCTGAAATGG  Z8       28

(81) The reverse primers were designed to correspond to basepair positions 267 to 287 of exon 1 in the non-coding strand. Again, 5′ of the priming part the same series of tag sequences of four bases was included, providing an identifier for each of the 28 pools. At the far 5′ end of the reverse PCR primers, a 5′-C was added. All primers were phosphorylated at their 5′ end to facilitate subsequent ligation of adaptors. The sequence and names of the 28 reverse primers are listed in Table 2. The tags are underlined.

(82) TABLE 2. Reverse primers, sequences and pool identification for exon 1 amplification.

name    sequence                    3D pool  SEQ ID NO:
06I037  CACACCCCCAAAAATTTTCAACAGTG  X1       29
06I038  CACAGCCCCAAAAATTTTCAACAGTG  X2       30
06I039  CACGACCCCAAAAATTTTCAACAGTG  X3       31
06I040  CACGTCCCCAAAAATTTTCAACAGTG  X4       32
06I041  CACTCCCCCAAAAATTTTCAACAGTG  X5       33
06I042  CACTGCCCCAAAAATTTTCAACAGTG  X6       34
06I043  CAGACCCCCAAAAATTTTCAACAGTG  X7       35
06I044  CAGAGCCCCAAAAATTTTCAACAGTG  X8       36
06I045  CAGCACCCCAAAAATTTTCAACAGTG  X9       37
06I046  CAGCTCCCCAAAAATTTTCAACAGTG  X10      38
06I047  CAGTCCCCCAAAAATTTTCAACAGTG  X11      39
06I048  CAGTGCCCCAAAAATTTTCAACAGTG  X12      40
06I049  CATCGCCCCAAAAATTTTCAACAGTG  Y1       41
06I050  CATGCCCCCAAAAATTTTCAACAGTG  Y2       42
06I051  CTACGCCCCAAAAATTTTCAACAGTG  Y3       43
06I052  CTAGCCCCCAAAAATTTTCAACAGTG  Y4       44
06I053  CTCACCCCCAAAAATTTTCAACAGTG  Y5       45
06I054  CTCAGCCCCAAAAATTTTCAACAGTG  Y6       46
06I055  CTCGACCCCAAAAATTTTCAACAGTG  Y7       47
06I056  CTCGTCCCCAAAAATTTTCAACAGTG  Y8       48
06I057  CTCTCCCCCAAAAATTTTCAACAGTG  Z1       49
06I058  CTCTGCCCCAAAAATTTTCAACAGTG  Z2       50
06I059  CTGACCCCCAAAAATTTTCAACAGTG  Z3       51
06I060  CTGAGCCCCAAAAATTTTCAACAGTG  Z4       52
06I061  CTGCACCCCAAAAATTTTCAACAGTG  Z5       53
06I062  CTGCTCCCCAAAAATTTTCAACAGTG  Z6       54
06I063  CTGTCCCCCAAAAATTTTCAACAGTG  Z7       55
06I064  CTGTGCCCCAAAAATTTTCAACAGTG  Z8       56
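The primer layout described above (5′-C, four-base tag, gene-specific sequence) can be reproduced from the two ends of SEQ ID NO: 57. The sketch below is illustrative only; `tagged_primer` and `reverse_complement` are hypothetical helper names, and the example uses the tag ACAC assigned to pool X1.

```python
EXON1_START = "ATGGCAGCAGCTGAAATGG"      # first 19 bases of SEQ ID NO: 57
EXON1_END = "CACTGTTGAAAATTTTTGGGG"      # bases 267-287 of SEQ ID NO: 57

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def tagged_primer(tag: str, gene_specific: str) -> str:
    """Assemble a primer as 5'-C + four-base tag + gene-specific sequence."""
    if len(tag) != 4:
        raise ValueError("tag must be four bases long")
    return "C" + tag + gene_specific

forward_x1 = tagged_primer("ACAC", EXON1_START)
reverse_x1 = tagged_primer("ACAC", reverse_complement(EXON1_END))
print(forward_x1)   # CACACATGGCAGCAGCTGAAATGG   (06I009)
print(reverse_x1)   # CACACCCCCAAAAATTTTCAACAGTG (06I037)
```

The reverse primer anneals to the coding strand, so its priming part is the reverse complement of positions 267 to 287.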
Target Locus Amplification

(83) The exon 1 of the target locus was amplified from the 3D pooled DNAs using the forward and reverse primers described above. For each PCR reaction, a forward and a reverse primer were used with identical tags. For the amplification of exon 1 from each of the 28 3D pools, a different set of forward and reverse primers was used.

(84) The PCR amplification reaction conditions for each sample were as follows:

(85) 2.5 μl DNA (=50 ng); 5 μl RNase-mix; 10 μl 5× Herculase PCR-buffer; 0.6 μl of the four dNTPs (20 mM); 1.25 μl forward primer (50 ng/μl); 1.25 μl reverse primer (50 ng/μl); 0.5 μl Herculase DNA polymerase; 28.9 μl milliQ-purified water. The RNase-mix consisted of 157.5 μl milliQ-purified water+17.5 μl RNase.

(86) PCR amplifications were performed in a PE9600 thermocycler with a gold or silver block using the following conditions: 2 minutes hot-start of 94° C., followed by 35 cycles of 30 sec at 94° C., 30 sec at 53° C., 1 min at 72° C., and a final stationary temperature of 4° C. The PCR amplification efficiency was checked by analysis of 10 μl of PCR products on a 1% agarose gel. FIG. 4 shows the efficient amplification of exon 1 PCR products from each of the 28 3D pools in comparison to a concentration range of lambda DNA on the same gel.

(87) Following amplification, equal amounts of PCR products were mixed and purified using the QIAquick PCR Purification Kit (QIAGEN), according to the QIAquick® Spin handbook (page 18). On each column a maximum of 100 μl of product was loaded. Products were eluted in 10 mM Tris-EDTA.

(88) Sequence Library Preparation and High-Throughput Sequencing

(89) Mixed amplification products from the 3D pools were subjected to high-throughput sequencing on a GS20 sequencer using 454 Life Sciences sequencing technology as described by Margulies et al. (Nature 437: 376-380, 2005, and Online Supplements). Specifically, the PCR products were ligated to adaptors to facilitate emulsion-PCR amplification and subsequent fragment sequencing as described by Margulies et al. The 454 adaptor sequences, emulsion PCR primers, sequence primers and sequence run conditions were all as described by Margulies et al. The linear order of functional elements in an emulsion-PCR fragment amplified on Sepharose beads in the 454 sequencing process was as follows:

(90) 454 PCR adaptor - 454 sequence adaptor - C-nucleotide - 4 bp tag - target amplification primer sequence 1 - target fragment internal sequence - target amplification primer sequence 2 - 4 bp tag - G-nucleotide - 454 sequence adaptor - 454 PCR adaptor - Sepharose bead.

(91) 454 Sequence Run Data-Processing.

(92) After base calling with 454 software, a file with FASTA-formatted sequences was produced for each region of the microtiter plate. These were concatenated into one file. Within this file a search was conducted with a regular expression for a 100% match to the forward primer preceded by 5 nucleotides (C plus four bp tag sequence). The same was done with the reverse primer extended with 5 nucleotides (C plus tag sequence). All sequences were then grouped by their tag sequence (pool identifiers) in separate files. Each file was analysed with the ssahaSNP tool using the known exon 1 nucleotide sequence as a reference. The ssahaSNP tool reported all single nucleotide sequence differences and “indels” (single base insertions or deletions as a result of either mutagenesis or erroneous base-calling) of the 454 sequences versus the reference sequence. These single nucleotide sequence difference and indel statistics were saved in a database and used for error rate analysis and point mutation identification.
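The tag-based grouping step can be illustrated as follows. This is a minimal sketch and not the script actually used: it assumes that reads are already available as plain strings (FASTA parsing is omitted) and that each read begins at the 5′-C, so the pattern is anchored at the start of the read. `group_reads_by_tag` is a hypothetical name.

```python
import re
from collections import defaultdict

FORWARD_PRIMING = "ATGGCAGCAGCTGAAATGG"      # priming part of the Table 1 primers
REVERSE_PRIMING = "CCCCAAAAATTTTCAACAGTG"    # priming part of the Table 2 primers

# C, then a four-base tag, then an exact (100%) match to one priming sequence.
READ_PATTERN = re.compile(
    rf"^C([ACGT]{{4}})({FORWARD_PRIMING}|{REVERSE_PRIMING})"
)

def group_reads_by_tag(reads):
    """Group reads by (tag, direction); reads without an exact match are dropped."""
    groups = defaultdict(list)
    for read in reads:
        match = READ_PATTERN.match(read)
        if match is None:
            continue
        tag, priming = match.groups()
        direction = "forward" if priming == FORWARD_PRIMING else "reverse"
        groups[(tag, direction)].append(read)
    return groups

reads = [
    "CACACATGGCAGCAGCTGAAATGGAGAG",     # tag ACAC, forward primer
    "CAGTGCCCCAAAAATTTTCAACAGTGGAG",    # tag AGTG, reverse primer
    "GGGG",                             # no primer match, discarded
]
groups = group_reads_by_tag(reads)
print(sorted(groups))   # [('ACAC', 'forward'), ('AGTG', 'reverse')]
```

Each group would then be written to its own file and compared against the exon 1 reference; a production version would also check that the tag is one of the 28 tags actually used.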

(93) 454 Sequencing Error Rate

(94) The total number of correct sequences obtained from the data processing for all 28 pools combined was 247,052. The sequences were divided in two groups, those that aligned with the forward primer and coding strand (5′ end) of the exon 1 PCR product (128,594=52%), and those that aligned with the reverse primer and the complementary strand of the PCR product (118,458=48%). The number of sequences obtained from each of the different pools and alignment groups ranged from 69 to 7269. On average, each of the 3072 M2 families should be represented 80 times in the total collection of sequences, and each allele 40 times.

(95) Within the alignment group corresponding to the forward primer, 1338 sequences out of 128,594 (1.0%) showed one or more single nucleotide sequence differences in relation to the eIF4E reference sequence along a stretch of 63 bases of aligned target sequence. For the reverse primer group, 743 sequences out of 118,458 (0.6%) showed one or more single nucleotide sequence differences in relation to the eIF4E reference sequence along a stretch of 102 bases of aligned target sequence. Therefore, the single base substitution error rate for both sequence groups combined equals 0.84% for a 165 base stretch, or 0.0051% per base position (0.5 errors per 10,000 bases). This error rate is similar to the one reported by Margulies et al. of 0.004% for individual read substitution errors in test sequences, but much lower than for whole-genome resequencing (0.68%).

(96) A similar analysis of the occurrence of indels in both alignment groups revealed an indel incidence of 3883 (forward primer group) and 3829 (reverse primer group) in a total of 247,052 sequences (i.e. 3.1% in a 165 bp stretch). The indel occurrence rate therefore equals 0.01891% per base position (1.89 indels per 10,000 bases). The indel rate is significantly higher than the base substitution error rate. Both types of sequencing error combined occur on average at a frequency of 2.39 per 10,000 bases, or 0.024% per base position. This error rate is much lower than reported by Margulies et al., and may be explained by the absence of long homopolymer stretches in the eIF4E exon 1 sequence.
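The per-base rates can be recomputed from the read counts given above. The sketch follows the same simplification as the text, counting each affected read as a single event and spreading the events over the 165 aligned bases (63 forward plus 102 reverse); the variable and function names are illustrative.

```python
forward_reads, reverse_reads = 128_594, 118_458
total_reads = forward_reads + reverse_reads
aligned_bases = 63 + 102     # forward and reverse stretches combined

def rate_per_base(events: int, reads: int, bases: int) -> float:
    """Fraction of reads showing an event, divided by the aligned length."""
    return events / reads / bases

substitution = rate_per_base(1338 + 743, total_reads, aligned_bases)
indel = rate_per_base(3883 + 3829, total_reads, aligned_bases)
combined = substitution + indel

print(f"substitutions per 10,000 bases: {substitution * 1e4:.2f}")
print(f"indels per 10,000 bases:        {indel * 1e4:.2f}")
print(f"combined per 10,000 bases:      {combined * 1e4:.2f}")
```

Computed this way, the substitution and indel rates come out at about 0.51 and 1.89 per 10,000 bases, and the combined rate agrees with the quoted figure to within rounding.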

(97) Detection of a Mutation in the Target Locus

(98) Because the objective of this screen is the identification of (EMS)-induced point mutations (preferentially C→T and G→A mutations), all sequences representing indels in comparison to the reference sequence were discarded for the sake of the analysis in this example. Most of the single base substitutions occurred only once in any given 3D pool, some occurred 2 or 3 times, or rarely more often. Since these single base substitutions occur more or less uniformly at every position of the aligned sequence, and at a more or less uniform frequency of 0.005% per base, they were assumed to represent sequencing errors, and not specific mutations that exist in the mutant library. However, at a few specific base positions in the scanned sequence, a much higher incidence of a specific single base sequence difference occurs. Such single base sequence differences reveal mutations in the library, when the following criteria are fulfilled: 1. the single base sequence difference represents a C→T or G→A mutation; 2. the incidence is higher than 20 per 10,000 sequence reads per 3D pool; 3. the single base sequence difference occurs in precisely one and not more than one X-pool, Y-pool and Z-pool.

(99) In this example, one such mutation was found in the alignment group corresponding to the reverse primer, at base position 221 of the eIF4E exon 1 sequence. This mutation, a G→A mutation (corresponding to C→T in the complementary strand), occurred in pool X12 at a frequency of 70 per 10,000 sequences, in pool Y3 at a frequency of 33 per 10,000 and in pool Z6 at 62 per 10,000 sequences. This same mutation at the same position did not occur in any of the other pools, not even at background error rates.
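The three selection criteria can be expressed as a small filter. This is a sketch under stated assumptions: criterion 3 is interpreted as "exactly one pool per axis exceeds the threshold", and the function name and input format (a mapping from pool name to incidence per 10,000 reads) are hypothetical.

```python
THRESHOLD_PER_10K = 20                       # criterion 2
EMS_CHANGES = {("C", "T"), ("G", "A")}       # criterion 1

def candidate_coordinates(ref: str, alt: str, incidence_per_10k: dict):
    """Return the (X, Y, Z) pools if all three criteria hold, otherwise None."""
    if (ref, alt) not in EMS_CHANGES:
        return None
    hits = [pool for pool, rate in incidence_per_10k.items()
            if rate > THRESHOLD_PER_10K]
    by_axis = {axis: [p for p in hits if p.startswith(axis)] for axis in "XYZ"}
    # Criterion 3: precisely one X-pool, one Y-pool and one Z-pool.
    if any(len(pools) != 1 for pools in by_axis.values()):
        return None
    return tuple(by_axis[axis][0] for axis in "XYZ")

observed = {"X12": 70, "Y3": 33, "Z6": 62}   # incidences reported for G221A
print(candidate_coordinates("G", "A", observed))   # ('X12', 'Y3', 'Z6')
```

The returned triple identifies one well of the eight 96-well plates, that is, one 4-fold pool of M2 families.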

(100) The unique occurrence of this G221A mutation in only the three pools allowed the identification of the original 4-fold pool of DNA, representing four M2 families. DNA of each of these four M2 families was amplified individually with the primers 06F598 and 06F599 that are identical to the forward and reverse primers of Tables 1 and 2, but without the 5′ five base sequence tags. The amplified PCR products were subjected to conventional Sanger sequencing. The sequence of the eIF4E gene in one of the four families (coded “24”) revealed a dual peak at position 221, corresponding to an overlapping G and A. This is indicative of an M2 family pool, in which half the alleles are wild-type, and the other half carry the G221A point mutation (FIG. 2). The sequences of the other M2 families around base position 221 were according to the reference (wild-type).

(101) The mutation causes an arginine to glutamine substitution. Seeds of this particular M2 family were planted in the greenhouse in order to select for homozygous mutant individuals that will be used for phenotyping.
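The amino acid change can be checked against SEQ ID NO: 57. The sketch below assumes that the reading frame starts at position 1 (the ATG) and uses a deliberately partial codon table containing only the two codons involved; `codon_at` is a hypothetical helper.

```python
EXON1 = (
    "ATGGCAGCAGCTGAAATGGAGAGAACGATGTCGTTTGATGCAGCTGAGAA"
    "GTTGAAGGCCGCCGATGGAGGAGGAGGAGAGGTAGACGATGAACTTGAAG"
    "AAGGTGAAATTGTTGAAGAATCAAATGATACGGCATCGTATTTAGGGAAA"
    "GAAATCACAGTGAAGCATCCATTGGAGCATTCATGGACTTTTTGGTTTGA"
    "TAACCCTACCACTAAATCTCGACAAACTGCTTGGGGAAGCTCACTTCGAA"
    "ATGTCTACACTTTCTCCACTGTTGAAAATTTTTGGGG"
)

# Partial codon table: only the entries needed for this check.
CODONS = {"CGA": "Arg", "CAA": "Gln"}

def codon_at(seq: str, position: int) -> str:
    """Return the codon containing a 1-based position, frame starting at base 1."""
    start = (position - 1) // 3 * 3
    return seq[start:start + 3]

# Introduce the G221A point mutation (position 221 is index 220).
mutant = EXON1[:220] + "A" + EXON1[221:]

wild_type_codon = codon_at(EXON1, 221)
mutant_codon = codon_at(mutant, 221)
print(wild_type_codon, CODONS[wild_type_codon])   # CGA Arg
print(mutant_codon, CODONS[mutant_codon])         # CAA Gln
```

Position 221 is the middle base of the codon spanning positions 220 to 222, so the G→A change converts CGA (arginine) into CAA (glutamine).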

(102) In a similar manner, two other point mutations were identified in the 454 sequence reads. An estimation of the mutation density of the M82 tomato mutant library therefore equals 3 mutations per 165 bp of scanned sequence, or 18 mutations per 1000 bases in 3072 M2 families. This corresponds to mutation densities reported for Arabidopsis (Greene et al., Genetics 164: 731-740, 2003).

REFERENCES

(103) Colbert et al., 2001. High-throughput screening for induced point mutations. Plant Physiology 126: 480-484.
Duprat et al., 2002. The Arabidopsis eukaryotic initiation factor (iso)4E is dispensable for plant growth but required for susceptibility to potyviruses. Plant J. 32: 927-934.
Epinat et al., 2003. A novel engineered meganuclease induces homologous recombination in yeast and mammalian cells. Nucleic Acids Research 31(11): 2952-2962.
Havre et al., 1993. Targeted mutagenesis of DNA using triple helix-forming oligonucleotides linked to psoralen. Proc. Natl. Acad. Sci. USA 90: 7879-7883.
McCallum et al., 2000. Targeted screening for induced mutations. Nature Biotechnology 18: 455-457.
Greene et al., 2003. Spectrum of chemically induced mutations from a large-scale reverse-genetic screen in Arabidopsis. Genetics 164: 731-740.
Lloyd et al., 2005. Targeted mutagenesis using zinc-finger nucleases in Arabidopsis. Proc. Natl. Acad. Sci. USA 102: 2232-2237.
Margulies et al., 2005. Genome sequencing in microfabricated high-density picolitre reactions. Nature 437: 376-380.
Menda et al., 2004. In silico screening of a saturated mutation library of tomato. Plant J. 38: 861-872.
Nicaise et al., 2003. The eukaryotic translation initiation factor 4E controls lettuce susceptibility to the potyvirus lettuce mosaic virus. Plant Physiol. 132: 1272-1282.
Ruffel et al., 2002. A natural recessive resistance gene against potato virus Y in pepper corresponds to the eukaryotic initiation factor 4E (eIF4E). Plant J. 32: 1067-1075.
Ruffel et al., 2005. The recessive potyvirus resistance gene pot-1 is the tomato orthologue of the pepper pvr2-eIF4E gene. Mol. Gen. Genomics 274: 346-353.
Shendure et al., 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Scienceexpress Report, August 4.
Stuart and Via, 1993. A rapid CTAB DNA isolation technique useful for RAPD fingerprinting and other PCR applications. Biotechniques 14: 748-750.
Vandenbussche et al., 2003. Toward the analysis of the petunia MADS box gene family by reverse and forward transposon insertion mutagenesis approaches: B, C, and D floral organ identity functions require SEPALLATA-like MADS box genes in petunia. The Plant Cell 15: 2680-2693.
