HIGH-THROUGHPUT GENOTYPING BY SEQUENCING LOW AMOUNTS OF GENETIC MATERIAL

20220186291 · 2022-06-16

    Abstract

    The present invention provides a method for analysis of target nucleic acids which are present in low amounts. In particular, the method comprises the following steps: i. providing a sample wherein target nucleic acids are present in a low amount, ii. generating a reduced representation library of said target nucleic acids by a method comprising: fragmenting said target nucleic acids; ligating adaptors to said fragments; and selecting a subset of said adaptor-ligated fragments, iii. massively parallel sequencing said reduced representation library, and iv. identifying variants in said target nucleic acids by analyzing results obtained by said sequencing.

    Claims

    1-18. (canceled)

    19. A method for analysis of target nucleic acids, the method comprising the following steps: i. in a sample that has been provided and wherein target nucleic acids are present in an amount of 100 pg or less, wherein said target nucleic acids originate from an embryo or fetus or from a cancer or tumor cell, wherein said target nucleic acids are DNA, ii. generating a reduced representation library of said target nucleic acids by a method comprising fragmenting said target nucleic acids using one or more restriction enzymes; ligating adaptors to said fragments; and selecting a subset of said adaptor-ligated fragments based on the size of said fragments, wherein generating a reduced representation library comprises amplifying a subset of fragments which, when combined, comprise only a part of the target nucleic acids, and wherein said generating a reduced representation library reduces the complexity at least 5 times, iii. massively parallel sequencing said reduced representation library, and iv. identifying variants in said target nucleic acids by analyzing results obtained by said sequencing.

    20. The method of claim 19, wherein said selecting a subset is performed using PCR-amplification.

    21. The method of claim 19, wherein said selecting a subset includes PCR amplification using a selective primer.

    22. The method of claim 19, further comprising v. constructing a genotype and/or haplotype based on identified variants in said target nucleic acid.

    23. The method of claim 19, further comprising v. identifying a genetic aberration in said sample based on identified variants in said target nucleic acid.

    24. The method of claim 19, wherein providing a sample comprises isolating one or a few target cells.

    25. The method of claim 24, wherein providing a sample further comprises lysing said one or few target cells.

    26. The method of claim 19, further comprising whole genome amplification (WGA) of said target nucleic acids.

    27. The method of claim 19, wherein sequencing said reduced representation library assures that each variant position in said library is sampled with high redundancy.

    28. The method of claim 19, wherein said sequencing is performed with a depth of at least 5×.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0229] Further features of the present invention will become apparent from the examples and figures, wherein:

    [0230] FIG. 1 illustrates the accuracy of the WGA nucleotide-copying methods used in embodiments of the present invention.

    [0231] FIG. 2: Size distribution of the genomic library of 1 horse after restriction digestion with ApeKI. The X-axis shows the fragment length in base pairs and the Y-axis shows the fluorescence units. The two peaks at 35 bp and 10380 bp correspond to the lower and upper marker, respectively.

    [0232] FIG. 3: Size distribution of the genomic library of 1 horse after sequencing with a peak around 110 bp. X-axis shows the fragment length in basepairs and the Y-axis shows the number of fragments called at that particular length.

    [0233] FIG. 4: This figure shows an improvement of the complexity reduction of the horse genome when using the standard versus the selective method. The black boxes indicate the average sample (meaning the average of 56 samples) sequenced with the standard method. The transparent boxes indicate the average sample sequenced with the selective method. The Y-axis shows the number of reads.

    [0234] FIG. 5: This snapshot of the IGV browser zooms into a particular region of 288 bp on chromosome 31. The upper box indicates the chromosome location and the genomic size of the window. Lane 1 visualizes the pooled data of the 56 samples sequenced via the standard method whereas lane 2 visualizes the pooled data of the 56 samples sequenced via the selective method. Lane 3 shows the locations of the recognition sites of the ApeKI enzyme. The black bars in lanes 1 and 2 indicate the presence of a nucleotide difference with the reference sequence (EquCab2). Each horizontal bar/dot in lanes 1 and 2 refers to a sequence difference in one individual sample.

    DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

    [0235] The present invention will be described with respect to particular embodiments and with reference to certain drawings but the invention is not limited thereto but only by the claims. The drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn to scale for illustrative purposes. Where the term “comprising” is used in the present description and claims, it does not exclude other elements or steps. Where an indefinite or definite article is used when referring to a singular noun, e.g. “a”, “an” or “the”, this includes a plural of that noun unless something else is specifically stated.

    [0236] The term “comprising”, used in the claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. Thus, the scope of the expression “a system comprising means A and B” should not be limited to systems consisting only of components A and B. It means that with respect to the present invention, the relevant components of the system are A and B.

    [0237] Furthermore, the terms first, second, third and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

    [0238] In the drawings, like reference numerals indicate like features; and, a reference numeral appearing in more than one figure refers to the same element. The drawings and the following detailed descriptions show specific embodiments of the system and method for high-throughput genotyping by sequencing of single cells.

    [0239] Embodiments of the invention advantageously provide a method whereby at least a single cell DNA-isolation, with or without (n/mtDNA) amplification, can be combined with a complexity reduction of the target, e.g. single cell, DNA product, PCR-based amplification and next generation sequencing to produce a set of markers for genotyping and haplotyping complete genomes, or parts of it, of one to multiple cells. In addition to the novel combination of those steps, other embodiments of the present invention advantageously provide a novel method to filter by for instance bioinformatics/statistical means the artifacts generated by any whole- or partial-genome amplification (WGA or PGA respectively) or PCR of (reduced representation) sequencing library as well as sequencing method.

    [0240] The advent of next generation sequencing (NGS) technologies has revolutionized the way biologists produce, analyze and interpret data. Although NGS platforms provide a cost-effective way to discover genome-wide variants from a single experiment, variants discovered by NGS need follow-up validation due to the high error rates associated with various sequencing chemistries. In addition, molecular analysis of single cells is challenging due to the low amounts of DNA available. Whole exome sequencing has been proposed as an affordable option compared to whole genome runs, but it still requires follow-up validation of all the novel exomic variants. Customarily, a consensus approach is used to overcome the systematic errors inherent to the sequencing technology, alignment and post-alignment variant detection algorithms. However, this approach warrants the use of multiple sequencing chemistries, multiple alignment tools and multiple variant callers, which may not be viable in terms of time and money for individual investigators with limited informatics know-how. Biologists often lack the requisite training to deal with the huge amount of data produced by NGS runs and face difficulty in choosing from the list of freely available analytical tools for NGS data analysis. Hence, there is a need to customize the NGS data analysis pipeline to preferentially retain true variants by minimizing the incidence of false positives, and to make the choice of the right analytical tools easier. To this end, embodiments of the present invention advantageously provide methods which can overcome these drawbacks by providing advanced data correction methods, resulting in efficient and robust results.

    [0241] In addition, current single-cell genotyping problems, mainly due to allele drop out and drop in and/or preferential allele amplification bias following single-cell DNA-amplification methods can be largely overcome by deep sequencing according to preferred embodiments of the present invention to assure that each base pair is sampled with high redundancy. Embodiments of the method and related bioinformatic means advantageously enable one to identify those (rare) variants.
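    The redundancy argument above can be made concrete with a small calculation. The following is a minimal sketch, not part of the described method, of how sequencing depth affects the chance of observing the under-represented allele of a heterozygous site. It assumes that reads sample the two alleles independently (a binomial idealisation); the function name `prob_minor_allele_seen` and the 2% allele fraction are illustrative assumptions.

```python
from math import comb


def prob_minor_allele_seen(depth: int, minor_fraction: float, min_reads: int = 1) -> float:
    """Probability that at least `min_reads` of `depth` reads carry the minor allele.

    Assumes each read independently reports the minor allele with
    probability `minor_fraction` (binomial sampling, an idealisation).
    """
    # Sum the probabilities of seeing fewer than `min_reads` minor-allele reads.
    p_too_few = sum(
        comb(depth, k) * minor_fraction**k * (1.0 - minor_fraction) ** (depth - k)
        for k in range(min(min_reads, depth + 1))
    )
    return 1.0 - p_too_few


if __name__ == "__main__":
    # Hypothetical amplification bias: the minor allele makes up 2% of the product.
    for depth in (5, 50, 500):
        print(depth, round(prob_minor_allele_seen(depth, 0.02, min_reads=2), 3))
```

    Under this model the detection probability rises steadily with depth, which is the reason a strongly biased allele can still be recovered when each position is sampled with high redundancy.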

    [0242] A method according to embodiments of the invention can comprise at least one of the following steps:

    [0243] (i) Isolate single cells, DNA extraction and whole genome amplification (WGA). Briefly, when one or more cells are isolated, either by picking or by FACS-sorting cells, the DNA contained in their nuclei and the mitochondrial DNA may then be amplified after cell lysis via genome-wide amplification methods based on Multiple Displacement Amplification (MDA) or PCR-based genome-wide amplification. The result is a collection of fragments (large or small depending on the WGA method used). This collection will then be processed for genotyping by sequencing (GBS) using restriction enzymes to construct a reduced representation library (RRL) for high-throughput massively parallel sequencing. In an optional step, WGA of the single-cell DNA is omitted and only particular or desired fractions of the single-cell genome are amplified. These partial genome amplification (PGA) methods already significantly reduce the complexity of the single-cell genome before massively parallel sequencing/GBS. In another optional step, WGA and PGA of the single-cell DNA are omitted, and the single-cell DNA following cell lysis is immediately processed for GBS (i.e. direct GBS).

    [0244] (ii) In silico digestion and enzyme selection. Restriction Enzymes can be selected preferably based upon following criteria:

    [0245] (1) predicted fragments length/nr of restriction sites,

    [0246] (2) the proportion of overlap with repetitive elements/methylation sites,

    [0247] (3) the putative SNP content,

    [0248] (4) the frequency of enzyme cutting,

    [0249] (5) predicted coverages of single-cell whole-genome amplification methods. Embodiments of the present invention advantageously provide means to construct and integrate ‘zero-coverage’ maps of a genome, i.e. maps highlighting those bases that are recurrently missed by sequences of single-cell amplification products.

    [0250] Each single-cell WGA-library sequenced for a particular amount of bases preferably produces a WGA-characteristic pattern of sequence coverage breadth and depth across the reference genome. E.g. single-cell PCR-based sequences recurrently miss more parts of the genome than sequences of multiple displacement amplified (MDAed) cells, but loci covered by single-cell PCR-based sequences are often covered deeper when compared to sequences of MDAed cells although both have been sequenced for the same amount of bases.

    [0251] Preferred embodiments of the invention provide a combination of Restriction Enzymes which preferably can be chosen to perform double or more digests to increase SNP discovery rates and thus increase the overall sensitivity of genotyping assays. When the enzymes are chosen, a digest is preferably prepared on the WGA samples followed by a fragment selection based upon size.
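    As an illustration of the ‘zero-coverage’ maps mentioned under criterion (5), the sketch below merges per-base depth from several single-cell libraries into intervals that are recurrently missed. It is a simplified example under stated assumptions, not the implementation of the invention: depth is given as one list of integers per cell over the same reference coordinates, and the name `zero_coverage_intervals` and the `min_fraction` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def zero_coverage_intervals(
    depth_per_cell: Sequence[Sequence[int]], min_fraction: float = 1.0
) -> List[Tuple[int, int]]:
    """Return half-open intervals [start, end) of positions that have zero
    depth in at least `min_fraction` of the cells."""
    if not depth_per_cell:
        return []
    n_cells = len(depth_per_cell)
    n_pos = len(depth_per_cell[0])
    intervals: List[Tuple[int, int]] = []
    start = None
    for pos in range(n_pos):
        # Count the cells in which this base was not sequenced at all.
        missed = sum(1 for cell in depth_per_cell if cell[pos] == 0)
        recurrent = missed / n_cells >= min_fraction
        if recurrent and start is None:
            start = pos  # open a new zero-coverage interval
        elif not recurrent and start is not None:
            intervals.append((start, pos))  # close the current interval
            start = None
    if start is not None:
        intervals.append((start, n_pos))
    return intervals


if __name__ == "__main__":
    cells = [
        [3, 0, 0, 5, 0, 0],
        [1, 0, 0, 0, 2, 0],
        [0, 0, 4, 2, 1, 0],
    ]
    print(zero_coverage_intervals(cells))
    print(zero_coverage_intervals(cells, min_fraction=0.6))
```

    Such intervals could then be intersected with predicted restriction fragments, so that enzymes whose fragments fall mostly in recurrently missed regions are deprioritised.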

    [0252] (iii) Library construction and DNA sequencing

    [0253] Next a purification of the chosen fragments is preferably performed followed by the addition of adaptors with (preferably) a single nucleotide overhang.

    [0254] (iv) SNP calling (e.g. identification and/or typing) and data handling

    [0255] Results of using a method according to embodiments of the invention advantageously demonstrate that sequencing of single-cell WGA-products makes it possible to determine digital frequencies of both alleles of a genetic marker (SNP, Indel . . . ) in the WGA-DNA. This has the advantage that e.g. SNPs in single cells may be typed more accurately when compared to conventional methods that use e.g. SNP-arrays. Indeed, preferential amplification of one allele of a heterozygous SNP will for instance result in a homozygous SNP-call when analyzed on a SNP-array because of the overwhelming signal of this preferentially amplified allele on the SNP-probes of the array. In contrast, in the sequencing approach the heterozygous SNP can be called with much more accuracy and confidence because e.g. hundreds to thousands of sequence reads report the preferentially amplified allele, but also a minority of reads will report the other allele of the SNP. Hence, this insight will allow a genotyping algorithm according to embodiments of the invention (see below) to tilt with statistical confidence the single-cell SNP-call towards a correct heterozygous instead of a false homozygous call. Similar rules apply when single-cell DNA is processed via PGA or direct GBS without intervening WGA/PGA. Although nucleotide substitutions can be identified in single-cell WGA-sequences, WGA-polymerases do not copy every base correctly during the amplification. Those errors may be mistaken for genuine nucleotide substitutions in the cell's genome. To investigate the base-fidelity of WGA-polymerases, the mismatch frequency of bases (having a base-call quality of 30) with the reference genome has been charted across the entire length of reads (having a mapping quality of 30). Strikingly, the mismatch frequency was significantly higher following single-cell PCR-based WGA-sequencing than following single-cell MDA-based or non-WGA DNA-sequencing (as illustrated in FIG. 2 which shows a two-tailed Kolmogorov-Smirnov test, with p-values <2.2e-16), suggesting that certain PCR-based polymerase(s) make significantly more nucleotide copy-errors. The MDA's phi29 polymerase applies 3′->5′ proofreading exonuclease activity and preliminary results indicate that the MDA-sequence error-rate is very low and almost comparable to conventional non-WGA DNA-sequencing when applying base-call and mapping qualities of 30 or more as shown in FIG. 2.

    [0256] FIG. 1 moreover illustrates nucleotide mismatch frequency with the hg19-reference genome at each base of the read. Only bases with a base-call quality of 30 or more in reads having a minimum mapping quality of 30 were considered. It is clear that the single-cell PCR-based WGA-method introduces significantly more WGA-nucleotide errors than single-cell MDA-WGA and non-WGA DNA sequencing.
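    The per-base mismatch profile described here can be sketched as follows. This is an illustrative simplification rather than the pipeline actually used: it assumes ungapped alignments that are already available as tuples of read bases, reference bases, base qualities and a mapping quality, and the type alias `AlignedRead` and function name `mismatch_frequency_by_position` are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

# One ungapped alignment: (read bases, reference bases, base qualities, mapping quality).
AlignedRead = Tuple[str, str, Sequence[int], int]


def mismatch_frequency_by_position(
    reads: Iterable[AlignedRead], min_base_q: int = 30, min_map_q: int = 30
) -> List[float]:
    """Fraction of mismatching bases at each read position, after quality filtering."""
    mismatches: List[int] = []
    totals: List[int] = []
    for read_seq, ref_seq, quals, map_q in reads:
        if map_q < min_map_q:
            continue  # skip poorly mapped reads entirely
        for i, (base, ref_base, q) in enumerate(zip(read_seq, ref_seq, quals)):
            if q < min_base_q:
                continue  # skip low-confidence base calls
            while len(totals) <= i:
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if base != ref_base:
                mismatches[i] += 1
    # Positions without any passing base are reported as NaN.
    return [m / t if t else float("nan") for m, t in zip(mismatches, totals)]


if __name__ == "__main__":
    example = [
        ("ACGT", "ACGA", [40, 40, 40, 40], 60),
        ("ACGT", "ACGT", [40, 10, 40, 40], 60),
        ("TTTT", "ACGT", [40, 40, 40, 40], 5),  # dropped: mapping quality below 30
    ]
    print(mismatch_frequency_by_position(example))
```

    Two such profiles, for instance one per amplification method, could then be compared with a two-sample Kolmogorov-Smirnov test such as `scipy.stats.ks_2samp`, if SciPy is available.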

    [0257] Besides the fidelity of single-cell WGA-polymerases, the precision of GBS-PCR polymerases and of sequencing chemistry reactions (e.g. bridge-PCR polymerases) also has to be taken into account in the methods for genotyping following single-cell (WGA/PGA-)GBS.

    [0258] There are two main approaches for interpreting the sequence reads resulting from a single-cell (WGA/PGA-)GBS method according to preferred embodiments of the invention:

    [0259] (1) Genotyping of the cells for a known set of polymorphic markers (SNPs, Indels, . . . ) or DNA-mutations covered by the single-cell (WGA/PGA-)GBS reads. Although the workflow can be applied for any nucleotide genetic variant that one wishes to genotype in the resulting single-cell sequences, currently known SNP positions in the human genome hg19 can for instance be retrieved from databases such as dbSNP or from the 1000 Genomes project. Similar databases exist for other species. The physical positions of the nucleotide genetic variants are preferably applied to generate pileups of the bases covering a particular position. Although there may be various algorithmic methods to achieve this, preferred embodiments of the invention provide a pipeline based on e.g. Burrows Wheeler Alignment (BWA), SAMtools, Perl and R-scripts. In brief, for each position that is interrogated by the algorithm according to embodiments of the invention, a list of the numbers of A-, C-, G- and T-bases covering that position is preferably generated, and the reference allele is preferably identified as well as all putative alternative (variant) alleles for that position. Thresholds on read mapping quality, base call quality, start and end of reads (e.g. FIG. 2 indicates that the first and last bases of sequence reads should be omitted from the analysis as they contain more mismatch errors with the reference genome) can be applied to increase accuracy at a cost of coverage. If the reference and alternative allele of the SNP are known (e.g. cytosine and thymine bases for the major and minor allele of the SNP in the general population respectively), the algorithm according to preferred embodiments of the invention advantageously will return the number of sequence reads carrying the reference allele (e.g. 20 reads reporting a C-base at that position in the WGA-sequence) and similarly for the alternative allele (e.g. 980 reads reporting a T-base at that position in the WGA-sequence). Subsequently, for instance by using statistical testing, these digital allelic counts can be evaluated to be significantly different from a situation where sequence error and/or putative WGA nucleotide-copy error would lead to a similar observation if the underlying SNP is homozygous. Based on subsequent P-value thresholds, heterozygous, homozygous and SNP-No calls may be established. Considering that WGA allele drop-out and preferential amplification artifacts often encompass multiple kilobases, SNPs or nucleotide genetic variants in the haplotype of a near variant are expected to have similar allelic variant frequencies in the single-cell WGA-GBS product. By applying this principle, according to preferred embodiments of the invention, advantageously the accuracy of the final genotype calls is further increased. Similar rules apply when single-cell DNA undergoes PGA-GBS or direct GBS without intervening WGA. For direct GBS, single-cell DNA is immediately digested following lysis, adaptors are ligated, DNA fragments are amplified by PCR and size-selected, and the amplicons are massively parallel sequenced. In this process, allele amplification bias as well as nucleotide copy errors will also be introduced when starting from a single cell. Hence, the same algorithmic pipelines, according to embodiments of the invention, can be applied. As the algorithms, according to embodiments of the invention, enable detecting variant alleles with (ultra) low frequencies in the sequences, this pipeline has tremendous value for the detection of (ultra) low-grade genetic mosaicism in deep-sequenced samples as well.
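    The statistical evaluation of allelic counts described above could look like the following minimal sketch. It is one possible reading under explicit assumptions, not the algorithm of the invention: the minor-allele count is tested against a binomial model in which all minor-allele reads arise from a combined sequencing and nucleotide-copy error rate, and the function names, the error rate of 0.001 and the two P-value thresholds are illustrative choices.

```python
from math import exp, lgamma, log


def binom_sf(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1.0 - p)
        )
        total += exp(log_term)
    return min(total, 1.0)


def call_genotype(
    ref_count: int,
    alt_count: int,
    error_rate: float = 0.001,  # assumed combined error rate (hypothetical value)
    p_het: float = 1e-6,        # below this P-value: heterozygous
    p_hom: float = 0.01,        # above this P-value: homozygous
    min_depth: int = 10,
) -> str:
    depth = ref_count + alt_count
    if depth < min_depth:
        return "no-call"
    minor = min(ref_count, alt_count)
    # Null hypothesis: the site is homozygous and every minor-allele read is an error.
    p_value = binom_sf(minor, depth, error_rate)
    if p_value < p_het:
        return "heterozygous"
    if p_value > p_hom:
        return "homozygous-ref" if ref_count >= alt_count else "homozygous-alt"
    return "no-call"  # evidence is ambiguous between error and a second allele


if __name__ == "__main__":
    # Counts from the text: 20 reads with the reference C-base, 980 with the T-base.
    print(call_genotype(20, 980))
```

    With the counts used in the text, 20 minor-allele reads out of 1000 are far more than the roughly one read expected from a 0.1% error rate, so the sketch favours a heterozygous call despite the strong amplification bias.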

    [0260] (2) De Novo Discovery of Genetic Variants in the Cell.

    [0261] The principles presented above may be applied, according to embodiments of the invention, to all bases covered by the single-cell (WGA/PGA-)GBS for de novo discovery of SNPs in single-cell (WGA/PGA-)GBS products. In addition, these pipelines, according to preferred embodiments of the invention, may be supplemented with standard genetic variant callers (e.g. SAMtools with BCFtools, SOAPsnp, GATK, . . . ), but because of discrepancies in the frequencies of both alleles of a SNP in the single-cell amplification sequences, as well as WGA/PGA-GBS sequence errors, off-the-shelf available variant callers may produce less accurate single-cell genotypes.

    [0262] Some exemplary, numbered embodiments for carrying out the invention are detailed hereunder:

    [0263] 1. A method for genotyping and/or haplotyping at least one cell, the method comprising following steps:

    [0264] i. isolating and lysing the at least one cell,

    [0265] ii. amplifying DNA fragments of the at least one cell,

    [0266] iii. massively parallel (genome-wide) genetic polymorphism typing (genotyping) by deep sequencing a reduced representation library of said amplification product,

    [0267] iv. a pipeline for variant discovery, genotyping and/or haplotyping.

    [0268] 2. The method of embodiment 1, whereby said amplifying is performed on the whole genome.

    [0269] 3. The method according to any of embodiments 1 or 2, whereby said amplifying is performed using whole-genome multiple displacement amplification or any whole-genome amplification method.

    [0270] 4. The method according to any of embodiments 1 to 3, the method further comprising constructing a reduced representation library of the amplification product for massively parallel sequencing and subsequent genotyping and/or haplotyping using bioinformatics and statistical means.

    [0271] 5. The method according to embodiment 4, whereby the reduced representation library of the at least one cell's amplification product is produced by restriction digestion using at least one or a combination of restriction enzymes and subsequent adaptor ligation and size-selection by PCR-amplification, or any sequence library reduction method.

    [0272] 6. The method according to embodiment 5, whereby said sequence library reduction method is exon capture.

    [0273] 7. The method according to any one of embodiments 1 to 6, whereby said method further comprises the step of deep sequencing of the reduced representation library to assure that each variant position is sampled with high redundancy.

    [0274] 8. The method of any of embodiments 1 to 7, whereby the pipeline for variant calling is based on the detection of variant allele frequencies in the sequence reads that are discriminated from sequencing and/or amplification inconsistencies using a pipeline of sequence alignment, bioinformatics and statistics.

    [0275] 9. The method according to embodiment 8, whereby said variant allele frequencies are rare variant allele frequencies.

    [0276] 10. The method according to any of embodiments 8 or 9, whereby using a pipeline of sequence alignment is performed using a reference genome.

    [0277] 11. The method according to any one of embodiments 1 to 10, whereby said method further comprises the step of inferring genotype calls from detected variant allele frequencies.

    [0278] 12. The method according to any one of embodiments 1 to 11, whereby said method further comprises haplotype assessment and/or prediction of the at least one cell's genotype.

    [0279] 13. The method according to embodiment 1, whereby said amplifying amplifies only part of the genome.

    [0280] 14. The method according to embodiment 13, whereby said partial genome amplifying (PGA) is performed using multiple displacement amplification or any DNA-amplification method.

    [0281] 15. The method according to embodiment 14, whereby said multiple displacement amplification method can be any of PicoPlex, GenomePlex, SurePlex and/or AmpliOne.

    [0282] 16. The method according to any of embodiments 13 to 15, the method further comprising the construction of a (reduced representation) library of the PGA-product for massively parallel sequencing and subsequent genotyping and/or haplotyping using bioinformatics and statistical means.

    [0283] 17. The method according to embodiment 16, whereby the reduced representation library of the at least one cell's PGA-product is produced by restriction digestion using one or a combination of restriction enzymes and subsequent adaptor ligation and size-selection by PCR-amplification, or any sequence library production method with or without further representation reduction method.

    [0284] 18. The method according to any one of embodiments 13 to 17, whereby said method further comprises the step of deep sequencing of the reduced representation library to assure that each variant position is sampled with high redundancy.

    [0285] 19. The method of any of embodiments 13 to 18, whereby the pipeline for variant calling is based on the detection of variant allele frequencies in the sequence reads that can be discriminated from sequencing and/or amplification artifacts using a pipeline of sequence alignment, bioinformatics and statistics.

    [0286] 20. The method according to embodiment 19, whereby said variant allele frequencies are rare variant allele frequencies.

    [0287] 21. The method according to any of embodiments 19 or 20, whereby using a pipeline of sequence alignment is performed using a reference genome.

    [0288] 22. The method according to any one of embodiments 13 to 21, whereby said method further comprises the step of inferring genotype calls from detected variant allele frequencies.

    [0289] 23. The method according to any one of embodiments 13 to 22, whereby said method further comprises haplotype assessment or prediction of the at least one cell's genotype.

    [0290] 24. The method according to embodiment 1, whereby said amplifying involves immediate reduced representation sequence library production from the DNA present in the at least one cell's lysate.

    [0291] 25. The method according to embodiment 24, whereby following lysis, the at least one cell's DNA is immediately digested by one or a combination of restriction enzymes and subsequent adaptor ligation and size-selection by PCR-amplification, or any sequence library production and/or further reduction method.

    [0292] 26. The method according to embodiment 25, whereby said any sequence library production and/or further reduction method is amplicon sequencing libraries produced from DNA following single-cell lysis.

    [0293] 27. The method according to any one of embodiments 24 to 26, whereby said method further comprises the step of deep sequencing of the reduced representation library to assure that each variant position is sampled with high redundancy.

    [0294] 28. The method of any of embodiments 24 to 27, whereby a pipeline for variant calling is based on the detection of variant allele frequencies in the sequence reads that can be discriminated from sequencing and/or amplification artifacts using a pipeline of sequence alignment, bioinformatics and statistics.

    [0295] 29. The method according to embodiment 28, whereby said variant allele frequencies are rare variant allele frequencies.

    [0296] 30. The method according to any of embodiments 28 or 29, whereby using a pipeline of sequence alignment is performed using a reference genome.

    [0297] 31. The method according to any one of embodiments 24 to 30, whereby said method further comprises the step of inferring genotype calls from detected variant allele frequencies.

    [0298] 32. The method according to any one of embodiments 24 to 31, whereby said method further comprises haplotype assessment or prediction of the at least one cell's genotype.

    [0299] 33. The method according to embodiment 1, whereby said amplifying is performed on any desired part of the genome by rolling circle amplification.

    [0300] 34. The method according to embodiment 33, wherein said rolling circle amplification is performed on the circular mitochondrial DNA.

    [0301] 35. The method of any of the previous embodiments wherein the at least one cell is a human or animal blastomere.

    [0302] 36. A computer program comprising computer program code means adapted to perform all the steps of the method of any of embodiments 1 to 35 when the computer program is run on a computer.

    [0303] 37. The computer program according to embodiment 36 embodied on a computer readable medium.

    [0304] 38. A system for haplotyping at least one cell, whereby the system comprises a control unit, said control unit adapted to:

    [0305] isolate and lyse the at least one cell,

    [0306] amplify DNA fragments of the at least one cell,

    [0307] massively parallel (genome-wide) genetic polymorphism type (genotype) by deep sequencing a reduced representation library of said amplification product,

    [0308] provide a pipeline for variant discovery, genotyping and/or haplotyping.

    [0309] Various modifications and variations of the process described within the embodiments of this invention are possible, which can be made without departing from the scope or spirit of the invention. Other embodiments will be apparent to those skilled in the practice of the invention, and the illustration, examples and specifications described herein can be considered as exemplary only.

    [0310] It is to be understood that this invention is not limited to the particular features of the means and/or the process steps of the methods described as such means and methods may vary. It is also to be understood that the terminology used herein is for purposes of describing particular embodiments only, and is not intended to be limiting. It must be noted that, as used in the specification and the appended claims, the singular forms “a” “an” and “the” include singular and/or plural referents unless the context clearly dictates otherwise. It is also to be understood that plural forms include singular and/or plural referents unless the context clearly dictates otherwise. It is moreover to be understood that, in case parameter ranges are given which are delimited by numeric values, the ranges are deemed to include these limitation values.

    EXAMPLES

    Example 1

    SNP Identification via Genotyping-By-Sequencing (GBS) in Arabian Horse

    [0311] The aim is to determine the genetic diversity within Arabian purebred horses based on large-scale SNP identification using GBS. To this end, we collected 56 blood samples. DNA extractions were done with the Puregene kit (Qiagen). Sample concentrations were checked with the NanoDrop and fragmentation was checked on agarose gel.

    [0312] In silico digestion based on the EquCab2 reference sequence using ApeKI was performed using custom Perl/BioPerl scripts and predicted 2,937,656 fragments <=500 bp or 3,766,233 fragments <=1000 bp. This number reflects the efficiency of the genome complexity reduction. However, this prediction does not take methylation patterns into consideration.
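    The in silico digestion step can be illustrated with a short sketch. The original analysis used custom Perl/BioPerl scripts that are not reproduced here; the Python below is a simplified stand-in that assumes a single linear sequence, ignores methylation, and uses hypothetical function names. It relies on the ApeKI recognition sequence GCWGC (W = A or T), cut after the first G.

```python
import re
from typing import List

# ApeKI site GCWGC (W = A or T). The lookahead also finds overlapping sites.
# The site is its own reverse complement, so scanning one strand is sufficient.
APEKI_SITE = re.compile(r"(?=GC[AT]GC)")


def digest(sequence: str) -> List[str]:
    """Cut a linear sequence after the first G of every ApeKI site."""
    seq = sequence.upper()
    cuts = [m.start() + 1 for m in APEKI_SITE.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    # The first and last fragments carry a sequence end rather than a cut site.
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def count_fragments_up_to(fragments: List[str], max_len: int) -> int:
    """Number of fragments no longer than `max_len` bases."""
    return sum(1 for f in fragments if len(f) <= max_len)


def fraction_in_window(fragments: List[str], min_len: int, max_len: int) -> float:
    """Share of all bases that lie in fragments within the size window."""
    total = sum(len(f) for f in fragments)
    kept = sum(len(f) for f in fragments if min_len <= len(f) <= max_len)
    return kept / total if total else 0.0


if __name__ == "__main__":
    toy = "AAGCAGCTTTTGCTGCAA"  # toy sequence with two ApeKI sites
    fragments = digest(toy)
    print(fragments)
    print(count_fragments_up_to(fragments, 5))
    print(fraction_in_window(fragments, 1, 6))
```

    Applied per chromosome of a reference assembly, counts of this kind correspond to the fragment numbers quoted above, and the reciprocal of `fraction_in_window` gives a rough estimate of the complexity reduction for a chosen size window.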

    [0313] DNA libraries were prepared as described (Elshire et al. PLoS One. 2011 6(5):e19379. doi: 10.1371/journal.pone.0019379) with minor modifications. Restriction enzyme ApeKI was used to reduce the genome complexity per sample. ApeKI is a type II restriction endonuclease that recognizes the DNA target sequence 5′-G^CWGC-3′ (where W=A or T and ^ marks the cleavage site) and cleaves after the first G to produce fragments with three-base 5′-overhangs. The adapters comprised a set of 56 different barcode-containing adapters and a common adapter and had a concentration of 0.3 ng/μl instead of 0.6 ng/μl. Quality control was done for 4 samples: horses 1, 2, 9 and 10. Fragment size and the presence of adaptor dimers were determined via the Agilent 2100 Bioanalyzer (FIG. 2). After determining the concentration of the samples via a PicoGreen assay, the library was paired-end sequenced on one lane of the Illumina HiSeq2000.

    [0314] The FASTQ Illumina DNA sequences were processed via our data-analysis pipeline. With custom scripts, data were sorted by sample based on the inline barcode (first 6-8 bp of read 1). After trimming, the reads were aligned with BWA v0.6.2 to EquCab2, and regions with a peak coverage >5× were identified with SNIFER and custom scripts. Sequence results showed on average 1.8 million reads per sample and on average 1× coverage per sample. Table 1 provides an overview of the data generated after sequencing the standard library of 56 Arabian horses. The sample number is shown in column 1. Column 2 shows the number of raw reads per sample; column 3 shows the number of processed reads per sample, counting all regions per sample larger than 80 bp.
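
    The sorting of reads by inline barcode can be sketched as follows. This is a minimal illustration and not the custom scripts referred to above; the function name `demultiplex`, the barcode sequences and the sample labels are hypothetical. Because the barcodes vary in length (6-8 bp), the sketch tries longer barcodes first so that a short barcode cannot shadow a longer one sharing the same prefix.

```python
def demultiplex(reads, barcode_to_sample):
    """Group read 1 sequences by inline barcode and trim the barcode off.

    Reads that match no barcode are collected under the key None.
    """
    by_sample = {}
    # Longest barcodes first, so the most specific match wins.
    ordered = sorted(barcode_to_sample, key=len, reverse=True)
    for read in reads:
        for barcode in ordered:
            if read.startswith(barcode):
                sample = barcode_to_sample[barcode]
                by_sample.setdefault(sample, []).append(read[len(barcode):])
                break
        else:
            by_sample.setdefault(None, []).append(read)
    return by_sample


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGTAC": "horse_1", "TTGACCAG": "horse_2"}
reads = ["ACGTACCAGCTTTT", "TTGACCAGCTGCAAAA", "GGGGGGGGGGGG"]
sorted_reads = demultiplex(reads, barcodes)
```

    A production pipeline would typically also tolerate sequencing errors in the barcode and carry the quality strings along; both are omitted here for brevity.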

    [0315] Fragment size distributions of the samples digested with ApeKI showed a similar pattern amongst all samples (FIG. 3). The BAM files of all 56 samples were combined and uploaded in the Integrative Genomics Viewer (IGV). SNPs were analysed by visual inspection (FIG. 5).

    TABLE-US-00001 TABLE 1
    Sample     Raw reads      Processed reads
               total count    count >80 bp
    1          2505434        1582990
    2          2844952        1809662
    3          1790474        1132522
    4          735215         458867
    5          3276748        2101719
    6          2558348        1625285
    7          2858394        1799838
    8          2610522        1651114
    9          2658906        1661994
    10         2321770        1496646
    11         3229270        2047758
    12         1760285        1109438
    13         1392134        878969
    14         3270777        2154840
    15         3354984        2199428
    16         2742378        1759003
    17         1167670        729718
    18         1507787        910192
    19         799647         533114
    20         1373434        884782
    21         1113017        708423
    22         765382         470352
    23         154144         96367
    24         334883         200191
    25         2831872        1780018
    26         2856180        1813744
    27         1889402        1141160
    28         487088         294142
    29         1381170        909013
    30         3267380        2118613
    31         897341         585076
    32         611723         389776
    33         2758005        1806251
    34         3654815        2487642
    35         2299255        1565585
    36         2640480        1765888
    37         531810         349391
    38         1740781        1165509
    39         1172703        778117
    40         153333         100180
    41         2368131        1580705
    42         1582386        1048634
    43         3178144        2162268
    44         1911276        1253344
    45         895756         595325
    46         1170332        778099
    47         1324443        885272
    48         134803         89902
    49         2299009        1531017
    50         3403674        2320288
    51         1421098        953557
    52         1436544        975807
    53         1673991        1134550
    54         848254         556281
    55         413481         278444
    56         274165         178610
    total      100635380      65375420
    average    1797060        1167418

    Example 2

    Further Improvement of Genome Complexity Reduction Using a Selective Primer

    [0316] In addition to the above reduced representation library (further referred to as the “standard” library) generated using the ApeKI restriction enzyme and the sample set of the same 56 Arabian horses, we reduced genome complexity further by using a selective primer. This selective primer covers the entire common adapter and the 3′ restriction site, and extends 2 bases into the insert region. Due to the 2 selective bases at the 3′ end of the primer, only a subset of adaptor-ligated fragments is amplified.

    TABLE-US-00002
    selective reverse primer (5'-3'): CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTCAGCAC
    standard reverse primer (5'-3'): CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    common forward primer (5'-3'): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
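
    The relation between the two reverse primers listed above can be checked with a short sketch. The helper name `split_selective_primer` is hypothetical. Reading the first four extra bases as the restriction-site remnant and the last two as the selective bases is an interpretation based on the description of the selective primer given above, not a statement taken from the primer listing itself.

```python
import re

# Primer sequences as listed above (5'-3'), with the line wrap removed.
STANDARD_REVERSE = (
    "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCG"
    "CTCTTCCGATCT"
)
SELECTIVE_REVERSE = (
    "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCG"
    "CTCTTCCGATCTCAGCAC"
)


def split_selective_primer(standard, selective, n_selective=2):
    """Split the 3' extension of the selective primer into two parts:
    the presumed restriction-site remnant and the selective bases."""
    if not selective.startswith(standard):
        raise ValueError("selective primer does not extend the standard primer")
    extension = selective[len(standard):]
    return extension[:-n_selective], extension[-n_selective:]


site_remnant, selective_bases = split_selective_primer(
    STANDARD_REVERSE, SELECTIVE_REVERSE
)
# Assumption: after ApeKI digestion (G^CWGC) the remnant reads CWGC.
remnant_fits_apeki = re.fullmatch(r"C[AT]GC", site_remnant) is not None
```

    In this reading, the selective primer is the standard reverse primer extended at its 3′ end by six bases: four that fit the ApeKI remnant with W=A, followed by the two selective bases.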

    [0317] Furthermore, the library preparation was single-end sequenced on a single lane of an Illumina HiSeq2500. Raw sequence reads were processed similarly to the above pipeline. Quality control was performed to check the correct organisation of the barcode and the restriction site. Poor-quality reads not conforming to our standards were discarded. Overall, the results show a reduction by half of the genomic complexity in the selective library compared with that of the standard library (FIG. 4) and an improvement of the average coverage up to 7× sequencing depth.

    [0318] SNP identification was done similarly to the above example, and the results were subsequently visualised in the Integrative Genomics Viewer (IGV) (FIG. 5). The efficiency of the primer is shown by the fact that fewer regions are called in the selective library than in the standard library.

    Example 3

    Multi Cell and Single Cell Genotyping-By-Sequencing

    [0319] A skin biopsy of a male horse was taken and cultured in a standard incubator at 37° C. and 5% CO2. Fibroblasts were cultivated in a large T175 Falcon flask and washed, and DNA was extracted using the Blood and Tissue kit (Qiagen). The concentration was checked via the NanoDrop and DNA fragmentation was checked on agarose gel.

    [0320] From the same cell line, a single fibroblast was used for further downstream processing. The cell was lysed and DNA amplified according to WO2011/157846.

    [0321] Library preparations were done using the PstI restriction enzyme and further processed following a procedure similar to that of Example 1. PstI was predicted to generate 968,569 fragments in the horse genome (the EquCab2 reference sequence), whereas ApeKI was predicted to generate 4,461,178 fragments in total. Since we wanted to maximise the sequencing power, we decided to test the PstI digestion on the horse genome. The PstI enzyme recognises the sequence 5′-CTGCA^G-3′ and is methylation sensitive. Further in silico predictions estimated 238,405 fragments and 388,822 fragments smaller than 500 bp and 1000 bp, respectively.

    [0322] Both the multicell and the single cell sample were sequenced on an Illumina HiSeq2000. This resulted in 52K paired-end 100 bp reads for the multicell sample and 144K for the single cell sample. Sequence data were processed as described in Example 1. The coverage analyses revealed 15K and 19K regions with a depth of at least 5× in the multicell and single cell sample, respectively, of which 2585 regions overlapped between both samples. The latter is within expectations, given that the total number of predicted regions is in the range of 250K, of which we observed less than 10% because of the low number of bases sequenced per sample. Although only a low number of bases is sequenced per sample, applying the RRL can lead to local deep-sequencing coverage (e.g. >5× in this example). Samtools v0.1.17 was used for SNP calling in both samples. The positions for which a SNP call was observed in both samples were 99% concordant.
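
    The coverage analysis described above can be illustrated with a minimal sketch. It assumes, purely for the example, that a region is a maximal run of consecutive positions whose depth is at least the threshold, and that two regions overlap when they share at least one position; the actual definitions used by SNIFER and the custom scripts are not given in this document. The function names and the depth profiles are hypothetical.

```python
def covered_regions(depths, min_depth=5):
    """Return half-open (start, end) intervals where depth >= min_depth."""
    regions, start = [], None
    for pos, depth in enumerate(depths):
        if depth >= min_depth and start is None:
            start = pos  # a new region opens here
        elif depth < min_depth and start is not None:
            regions.append((start, pos))  # the region closes before pos
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions


def count_overlapping(regions_a, regions_b):
    """Count regions in regions_a sharing at least one position with regions_b."""
    return sum(
        1
        for s1, e1 in regions_a
        if any(s1 < e2 and s2 < e1 for s2, e2 in regions_b)
    )


# Hypothetical per-position depth profiles for two samples.
multicell = [0, 5, 6, 7, 0, 0, 9, 9, 0, 0]
single_cell = [0, 0, 5, 5, 5, 0, 0, 0, 5, 5]
regions_multi = covered_regions(multicell)
regions_single = covered_regions(single_cell)
shared = count_overlapping(regions_multi, regions_single)
```

    The pairwise comparison in `count_overlapping` is quadratic and is adequate only for a toy example; for tens of thousands of regions per sample, a sorted sweep or an interval index would be the more usual choice.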