METHOD FOR NEXT GENERATION SEQUENCING BASED GENETIC TESTING
20180327865 · 2018-11-15
Inventors
Cpc classification
G16B40/00
PHYSICS
C12Q1/6888
CHEMISTRY; METALLURGY
G16B10/00
PHYSICS
G16B20/00
PHYSICS
International classification
Abstract
A next generation sequencing (NGS) based method includes applying, for one or more genetic loci, respective NGS data for genotype of a first subject, genotype of a second subject, and genotype of an alleged offspring of the first and second subjects to a statistical model calculating a value representing a likelihood the offspring is a true offspring of the first and second subjects. The NGS data includes genotype and sequencing read of the first tested subject; genotype and sequencing read of the second tested subject; and genotype and sequencing read of the alleged offspring. The statistical model utilizes a probability of the genotype of the first tested subject in a subject population; a probability of the genotype of the second tested subject in a subject population; and a probability of the genotype of the alleged offspring in a subject population.
Claims
1. A next generation sequencing (NGS) based method for genetic testing, comprising: applying, for one or more genetic loci, respective NGS data related to genotype of a first tested subject, genotype of a second tested subject, and genotype of an alleged offspring of the first and second tested subjects to a statistical model for calculating a value representing a likelihood that the alleged offspring is a true offspring of the first and second subjects; and determining, based on the respective values calculated for the one or more genetic loci, a likelihood that the alleged offspring is a true offspring of the first and second tested subjects; wherein the NGS data includes: genotype and sequencing read of the first tested subject; genotype and sequencing read of the second tested subject; and genotype and sequencing read of the alleged offspring; wherein the statistical model utilizes: a probability of the genotype of the first tested subject in a subject population; a probability of the genotype of the second tested subject in a subject population; and a probability of the genotype of the alleged offspring in a subject population.
2. The method of claim 1, wherein the method is applied to a plurality of genetic loci.
3. The method of claim 1, wherein the statistical model utilizes the respective probability of the genotype of the first tested subject, the second tested subject, and the alleged offspring as posterior probability with the sequencing read of the first tested subject, the second tested subject, and the alleged offspring.
4. The method of claim 1, wherein the statistical model applies the following for calculating the value of the respective genetic loci:
5. The method of claim 4, wherein the first tested subject is a mother of the offspring and the second tested subject is an alleged father of the offspring.
6. The method of claim 1, further comprising the step of: obtaining raw NGS data from the first tested subject, the second tested subject, and the alleged offspring.
7. The method of claim 6, wherein in the raw NGS data, a sequencing coverage of the first tested subject is above or equal to 0.5×.
8. The method of claim 6, wherein in the raw NGS data, a sequencing coverage of the second tested subject is above or equal to 0.5×.
9. The method of claim 6, wherein in the raw NGS data, a sequencing coverage of the alleged offspring is above or equal to 0.5×.
10. The method of claim 6, further comprising: prior to the application step, filtering the raw NGS data to remove markers with more than two alleles to obtain the respective NGS data for the one or more genetic loci.
11. The method of claim 1, further comprising: dividing respective genomes in the corresponding NGS data of the first tested subject, the second tested subject, and the alleged offspring into a plurality of segments; sorting markers in each of the plurality of segments based on a probability of exclusion; selecting a plurality of markers based on the sorting result for application to the statistical model.
12. The method of claim 11, wherein the selection step comprises: selecting a plurality of markers with the highest probability of exclusion.
13. A next generation sequencing (NGS) based system for genetic testing, comprising: means for applying, for one or more genetic loci, respective NGS data related to genotype of a first tested subject, genotype of a second tested subject, and genotype of an alleged offspring of the first and second tested subjects to a statistical model for calculating a value representing a likelihood that the alleged offspring is a true offspring of the first and second subjects; and means for determining, based on the respective values calculated for the one or more genetic loci, a likelihood that the alleged offspring is a true offspring of the first and second tested subjects; wherein the NGS data includes: genotype and sequencing read of the first tested subject; genotype and sequencing read of the second tested subject; and genotype and sequencing read of the alleged offspring; wherein the statistical model utilizes: a probability of the genotype of the first tested subject in a subject population; a probability of the genotype of the second tested subject in a subject population; and a probability of the genotype of the alleged offspring in a subject population.
14. The system of claim 13, wherein the statistical model utilizes the respective probability of the genotype of the first tested subject, the second tested subject, and the alleged offspring as posterior probability with the sequencing read of the first tested subject, the second tested subject, and the alleged offspring.
15. The system of claim 13, wherein the statistical model applies the following for calculating the value of the respective genetic loci:
16. The system of claim 15, wherein the first tested subject is a mother of the offspring and the second tested subject is an alleged father of the offspring.
17. A non-transitory computer readable medium for storing computer instructions that, when executed by one or more processors, causes the one or more processors to perform a next generation sequencing (NGS) based method for genetic testing, comprising: applying, for one or more genetic loci, respective NGS data related to genotype of a first tested subject, genotype of a second tested subject, and genotype of an alleged offspring of the first and second tested subjects to a statistical model for calculating a value representing a likelihood that the alleged offspring is a true offspring of the first and second subjects; and determining, based on the respective values calculated for the one or more genetic loci, a likelihood that the alleged offspring is a true offspring of the first and second tested subjects; wherein the NGS data includes: genotype and sequencing read of the first tested subject; genotype and sequencing read of the second tested subject; and genotype and sequencing read of the alleged offspring; wherein the statistical model utilizes: a probability of the genotype of the first tested subject in a subject population; a probability of the genotype of the second tested subject in a subject population; and a probability of the genotype of the alleged offspring in a subject population.
18. The non-transitory computer readable medium of claim 17, wherein the statistical model utilizes the respective probability of the genotype of the first tested subject, the second tested subject, and the alleged offspring as posterior probability with the sequencing read of the first tested subject, the second tested subject, and the alleged offspring.
19. The non-transitory computer readable medium of claim 17, wherein the statistical model applies the following for calculating the value of the respective genetic loci:
20. The non-transitory computer readable medium of claim 17, wherein the first tested subject is a mother of the offspring and the second tested subject is an alleged father of the offspring.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings in which:
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0030] The inventors of the present invention have found, through research, experiments, and trials, that next-generation sequencing (NGS), with its high throughput and relatively low cost compared to other sequencing techniques, has considerable potential in forensic studies. Since the first pyrosequencing-based high-throughput sequencing system, the 454 Genome Sequencing System, was introduced by Roche in 2005, NGS technology has gradually matured. The throughput of a single sequencing run has increased significantly and the cost per base has fallen significantly. For paternity testing, whole genome sequencing provides redundant marker information that is capable of handling complex scenarios with high accuracy.
[0031] The inventors of the present invention have also found, through research, experiments, and trials, that in order to acquire a reliable result at low cost using NGS-based methods and systems, a minimum sequencing coverage must be set. However, when the sequencing coverage is low, the genotypes of the tested individuals are subject to statistical uncertainty, for two main reasons. First, for diploid individuals, both alleles may not be sampled. Second, in most NGS data, the error rate is at least 0.1% even after filtering out low-quality base pairs. This may result in many homozygous loci being wrongly inferred as heterozygous.
[0032] The most widely applied method for paternity testing nowadays is the likelihood method. Given the genotypes of a tested trio, this method relies on calculating the ratio of the likelihoods of two hypotheses, called the Paternity Index (PI):
(1) X, the likelihood that the tested man is the biological father of the child (True Trio);
(2) Y, the likelihood that a random man is the biological father of the child (False Trio).
For each locus, denote $g_{af}$, $g_m$ and $g_c$ as the genotypes of the alleged father, mother and child respectively; then the PI value can be written as
$$PI = \frac{X}{Y} = \frac{T(g_c \mid g_m, g_{af})}{T(g_c \mid g_m)} \qquad (1)$$
where $T(g_c \mid g_m, g_{af})$ is the likelihood of the true trio, which means that both alleles of the child are inherited from the mother and the alleged father; $T(g_c \mid g_m)$ is the likelihood that the tested man is not the biological father of the child.
[0033] Ranging from 0 to infinity, the PI value provides DNA evidence of paternity for each locus. Specifically, PI>1 indicates that the genetic evidence at this locus supports the tested man being the biological father; PI=1 indicates that the genetic evidence at this locus provides no information on paternity; and PI<1 indicates that the genetic evidence at this locus is more consistent with non-paternity than paternity. Low PI values result primarily from inconsistency in genetic markers, which may be caused by non-paternity, mutations in the offspring, or wrong genotype calls caused by sequencing errors.
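As a minimal illustration of the per-locus likelihood ratio described above, the following Python sketch computes PI for a biallelic locus from called genotypes. The function names, the tuple representation of genotypes, the allele frequencies, and the choice to multiply per-locus values across loci (which assumes the loci are independent) are all assumptions made for illustration and are not taken from the embodiments.

```python
from itertools import product
from math import prod


def norm(genotype):
    """Order-independent form of a genotype such as ("a", "A")."""
    return tuple(sorted(genotype))


def true_trio(child, mother, father):
    """T(g_c | g_m, g_af): one allele from each parent, each chosen with probability 1/2."""
    hits = sum(1 for m, p in product(mother, father) if norm((m, p)) == norm(child))
    return hits / 4.0


def false_trio(child, mother, freqs):
    """T(g_c | g_m): maternal allele from the mother, paternal allele from the population."""
    return sum(
        0.5 * f
        for m in mother
        for p, f in freqs.items()
        if norm((m, p)) == norm(child)
    )


def paternity_index(child, mother, father, freqs):
    """Per-locus PI = X / Y; raises if the child is incompatible with the mother."""
    y = false_trio(child, mother, freqs)
    if y == 0:
        raise ValueError("child genotype cannot be derived from the mother at this locus")
    return true_trio(child, mother, father) / y


freqs = {"A": 0.6, "a": 0.4}  # hypothetical allele frequencies
loci = [
    # (child, mother, alleged father), all hypothetical
    (("A", "a"), ("A", "A"), ("a", "a")),
    (("A", "a"), ("A", "A"), ("A", "a")),
]
per_locus = [paternity_index(c, m, f, freqs) for c, m, f in loci]
combined = prod(per_locus)  # assumption: loci are independent
```

With these hypothetical inputs both loci give PI>1, so each supports paternity. An alleged father with genotype AA at the first locus would give PI=0, because he cannot transmit allele a.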
[0034] The following embodiment of the present invention provides a method that reduces the errors caused by sequencing errors.
[0035]
[0036] Preferably, in the method 100 of the present embodiments, the statistical model utilizes the respective probability of the genotype of the first tested subject, the second tested subject, and the alleged offspring as posterior probability with the sequencing read of the first tested subject, the second tested subject, and the alleged offspring.
[0037] In some examples, the method 100 may further include, prior to step 102, obtaining raw NGS data from the first tested subject, the second tested subject, and the alleged offspring. Preferably, in the raw NGS data, the respective sequencing coverage of the first tested subject, the second tested subject, and the alleged offspring may each be as low as 0.5×. In one example, in the raw NGS data, the respective sequencing coverage of the first tested subject, the second tested subject, and the alleged offspring is each between 0.5× and 2×. The method may further include, prior to step 102, filtering, either automatically or manually, the raw NGS data to remove markers with more than two alleles to obtain the respective NGS data for the genetic loci.
[0038] In one embodiment, the method 100 may include dividing respective genomes in the corresponding NGS data of the first tested subject, the second tested subject, and the alleged offspring into a plurality of segments. The markers in each of the plurality of segments are then sorted based on a probability of exclusion, and afterwards, one or more markers may be selected based on the sorting result for application to the statistical model. In one example, the markers with the highest probability of exclusion are selected.
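The filtering and marker selection steps described above can be sketched in Python as follows. This is only an illustrative sketch: the record fields (`id`, `chrom`, `pos`, `alleles`, `pe`), the fixed-width segments, and the assumption that a probability of exclusion has already been computed for each marker are hypothetical choices, not details given in the embodiments.

```python
from collections import defaultdict


def select_markers(markers, segment_size, per_segment=1):
    """Drop markers with more than two alleles, then keep the markers with the
    highest probability of exclusion ("pe") in each genomic segment."""
    biallelic = [m for m in markers if len(set(m["alleles"])) <= 2]

    # Group markers by (chromosome, segment index).
    segments = defaultdict(list)
    for m in biallelic:
        segments[(m["chrom"], m["pos"] // segment_size)].append(m)

    selected = []
    for key in sorted(segments):
        ranked = sorted(segments[key], key=lambda m: m["pe"], reverse=True)
        selected.extend(ranked[:per_segment])
    return selected


# Hypothetical markers; "pe" is assumed to be precomputed elsewhere.
markers = [
    {"id": "m1", "chrom": "1", "pos": 100, "alleles": ["A", "G"], "pe": 0.10},
    {"id": "m2", "chrom": "1", "pos": 900, "alleles": ["C", "T"], "pe": 0.18},
    {"id": "m3", "chrom": "1", "pos": 1500, "alleles": ["A", "C", "T"], "pe": 0.30},
    {"id": "m4", "chrom": "1", "pos": 1700, "alleles": ["G", "T"], "pe": 0.05},
]
chosen = [m["id"] for m in select_markers(markers, segment_size=1000)]
```

In this example the marker with three alleles is removed before ranking, so the second segment contributes its only remaining biallelic marker.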
[0039] In the method 100 of the present embodiment, to reduce the errors caused by sequencing errors, the probabilities of the genotypes are modelled as posterior probabilities given the observed reads, using Bayes' rule. In the present embodiment, the PI value is defined as
where $D_c$, $D_m$ and $D_{af}$ represent the observed sequencing reads for the tested offspring, the mother and the alleged father, respectively.
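One way posterior genotype probabilities could enter a paternity index is sketched below in Python. The weighting shown here (summing the true-trio and false-trio likelihoods over all candidate genotypes, each weighted by its posterior probability) is an assumption made for illustration and is not taken from the definition above; the function names and the dictionary representation of the posteriors are likewise hypothetical.

```python
from itertools import product


def norm(genotype):
    """Order-independent form of a genotype."""
    return tuple(sorted(genotype))


def true_trio(child, mother, father):
    """Mendelian probability of the child's genotype given both parents."""
    hits = sum(1 for m, p in product(mother, father) if norm((m, p)) == norm(child))
    return hits / 4.0


def false_trio(child, mother, freqs):
    """Probability of the child's genotype when the paternal allele is drawn
    from the population allele frequencies."""
    return sum(0.5 * f for m in mother for p, f in freqs.items()
               if norm((m, p)) == norm(child))


def weighted_pi(post_c, post_m, post_af, freqs):
    """Assumed form: ratio of posterior-weighted true-trio and false-trio terms.
    Each post_* maps a genotype tuple to P(genotype | reads)."""
    numerator = 0.0
    denominator = 0.0
    for (g_c, p_c), (g_m, p_m) in product(post_c.items(), post_m.items()):
        denominator += false_trio(g_c, g_m, freqs) * p_c * p_m
        for g_af, p_af in post_af.items():
            numerator += true_trio(g_c, g_m, g_af) * p_c * p_m * p_af
    if denominator == 0:
        raise ValueError("no genotype combination is compatible with the mother")
    return numerator / denominator


freqs = {"A": 0.6, "a": 0.4}  # hypothetical allele frequencies
certain = weighted_pi({("A", "a"): 1.0}, {("A", "A"): 1.0}, {("a", "a"): 1.0}, freqs)
uncertain = weighted_pi({("A", "a"): 1.0}, {("A", "A"): 1.0},
                        {("a", "a"): 0.5, ("A", "A"): 0.5}, freqs)
```

When every posterior puts all of its mass on a single genotype, this sketch reduces to the ordinary genotype-based PI. Uncertainty in the alleged father's genotype pulls the value toward the less informative case rather than forcing a hard call.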
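[0040] According to Bayes' rule, the conditional probability that the individual's real genotype is $g_{i,j}$, with alleles i and j, given the observed reads D at that locus is
$$P(g_{i,j} \mid D) = \frac{P(D \mid g_{i,j})\, P(g_{i,j})}{\sum_{k,l} P(D \mid g_{k,l})\, P(g_{k,l})} \qquad (3)$$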
where $P(g_{i,j})$ is the genotype frequency in the subject population. Under the assumption of Hardy-Weinberg equilibrium, it can be calculated that
$$P(g_{i,j}) = \begin{cases} f(i)^2, & i = j \\ 2 f(i) f(j), & i \neq j \end{cases} \qquad (4)$$
where f(i) and f(j) are the allele frequencies for allele i and j respectively.
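The combination of Bayes' rule with a Hardy-Weinberg genotype prior described above can be sketched in Python as follows. The function names, the dictionary layout, and the numeric likelihoods are hypothetical; the read likelihoods P(D|g) are simply taken as given inputs here.

```python
from itertools import combinations_with_replacement


def hwe_prior(freqs):
    """Genotype frequencies under Hardy-Weinberg equilibrium:
    f(i)^2 for homozygotes and 2 f(i) f(j) for heterozygotes."""
    prior = {}
    for i, j in combinations_with_replacement(sorted(freqs), 2):
        prior[(i, j)] = freqs[i] ** 2 if i == j else 2 * freqs[i] * freqs[j]
    return prior


def genotype_posterior(likelihood, freqs):
    """P(g | D) for every genotype g, given P(D | g) and allele frequencies."""
    prior = hwe_prior(freqs)
    joint = {g: likelihood[g] * prior[g] for g in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("the reads are impossible under every genotype")
    return {g: value / total for g, value in joint.items()}


freqs = {"A": 0.6, "a": 0.4}  # hypothetical allele frequencies
# Hypothetical read likelihoods P(D | g) for a locus where only allele a was seen.
likelihood = {("A", "A"): 0.0, ("A", "a"): 0.5, ("a", "a"): 1.0}
posterior = genotype_posterior(likelihood, freqs)
```

In this example the heterozygote keeps a substantial posterior probability even though only allele a was observed, because the heterozygous genotype is more common in the hypothetical population than the aa homozygote.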
[0041] In the method of the present embodiment, $P(D \mid g_{i,j})$ is the likelihood of observing the allele types supported by the reads if the genotype is $g_{i,j}$. Assuming that the reads are independent of each other in the sequencing process, then
$$P(D \mid g_{i,j}) = \prod_{k} P(d_k \mid g_{i,j}) \qquad (5)$$
where $d_k$ is the k-th read that covers the corresponding locus.
[0042] The present embodiment models the sequencing process as a random process following a binomial distribution, which means that a sequenced read is equally likely to come from either allele. Thus
$$P(D \mid g_{i,j}) = C_{D}^{d_i}\, P(i \mid g_{i,j})^{d_i}\, P(j \mid g_{i,j})^{d_j} \qquad (6)$$
where $P(i \mid g_{i,j})$ is the probability that a sequenced read carries allele i in one sampling, given that the individual's genotype is $g_{i,j}$. In one example, if $g_{i,j}$ is Aa, then $P(A \mid g_{Aa}) = P(a \mid g_{Aa}) = 0.5$.
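The binomial read model described above, without sequencing errors, can be sketched in Python as follows. The function names and the tuple representation of genotypes are assumptions made for illustration.

```python
from math import comb


def allele_prob(allele, genotype):
    """P(allele | genotype) for one error-free read: 1, 0.5, or 0."""
    return genotype.count(allele) / 2.0


def read_likelihood(d_i, d_j, allele_i, allele_j, genotype):
    """Binomial probability of seeing d_i reads of allele_i and d_j reads of
    allele_j when each read samples one of the two alleles of the genotype."""
    p_i = allele_prob(allele_i, genotype)
    p_j = allele_prob(allele_j, genotype)
    return comb(d_i + d_j, d_i) * p_i ** d_i * p_j ** d_j


# Hypothetical read counts at one locus with alleles A and a.
het_10_reads = read_likelihood(4, 6, "A", "a", ("A", "a"))
hom_10_reads = read_likelihood(4, 6, "A", "a", ("A", "A"))
het_1_read = read_likelihood(1, 0, "A", "a", ("A", "a"))
hom_1_read = read_likelihood(1, 0, "A", "a", ("A", "A"))
```

With a single read, the heterozygous genotype explains the observation only half as well as the matching homozygote, which illustrates why an allele can easily be missed at low coverage. Without an error term, the AA genotype is given zero likelihood as soon as any read shows allele a.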
[0043] Taking sequencing errors into account, denote the observed read counts for alleles i and j as $d_i$ and $d_j$ respectively, and the real read counts (without error) for alleles i and j as $r_i$ and $r_j$. Then it can be determined that
$$P(d_i, d_j \mid g_{i,j}) = \sum_{r_i, r_j} \ldots \qquad (7)$$
[0044] Suppose, in one example, that 4 reads supporting allele i and 6 reads supporting allele j are observed at an SNP locus. The real situation may be 4 reads for i and 6 reads for j without sequencing error, or 3 reads for i and 7 reads for j with 1 sequencing error. If the real situation is 4 reads for i and 6 reads for j, the number of errors may be 0, 2, 4, . . . In other words, in this example, the errors must occur in opposite pairs, so their number is even: if one read is incorrectly sequenced as i instead of j, there must be another read incorrectly sequenced as j instead of i in order to produce the final observation.
[0045] To convert the theoretical sequencing scenario (without sequencing errors) to the observed case (with sequencing errors), the minimum number of sequencing errors at this locus is $e_{min} = |d_i - r_i| = |d_j - r_j|$. Under the assumption that each read can be incorrectly sequenced only once, the total number of errors e at this locus must satisfy
[0046] After clarifying the rules for errors, equation (7) may be expanded by listing out all the cases with sequencing errors. Subsequently,
$$P(D \mid g_{i,j}) = \sum_{d_{\ldots}} \ldots \qquad (9)$$
where e is subject to the inequalities set out in equation (8).
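The parity argument and the enumeration of error cases can be illustrated with the following Python sketch. It is not the expansion used in the embodiment. It assumes a biallelic locus, a single symmetric per-read error rate `eps` that turns one allele into the other, and at most one error per read; the function names are hypothetical.

```python
from math import comb


def error_patterns(d_i, d_j, r_i, r_j):
    """List (a, b) pairs, where a true-i reads are miscalled as j and b true-j
    reads are miscalled as i, that turn true counts (r_i, r_j) into (d_i, d_j)."""
    if d_i + d_j != r_i + r_j:
        return []
    patterns = []
    for a in range(r_i + 1):
        b = d_i - r_i + a  # follows from d_i = r_i - a + b
        if 0 <= b <= r_j:
            patterns.append((a, b))
    return patterns


def observed_given_true(d_i, d_j, r_i, r_j, eps):
    """P(d_i, d_j | r_i, r_j) when each read is miscalled independently with rate eps."""
    depth = d_i + d_j
    return sum(
        comb(r_i, a) * comb(r_j, b) * eps ** (a + b) * (1 - eps) ** (depth - a - b)
        for a, b in error_patterns(d_i, d_j, r_i, r_j)
    )


def likelihood_with_errors(d_i, d_j, p_i, eps):
    """Sum over all true read counts; p_i is P(i | genotype) for an error-free read."""
    depth = d_i + d_j
    total = 0.0
    for r_i in range(depth + 1):
        r_j = depth - r_i
        p_true = comb(depth, r_i) * p_i ** r_i * (1 - p_i) ** r_j
        total += p_true * observed_given_true(d_i, d_j, r_i, r_j, eps)
    return total


# Total error counts for the example of 4 observed i reads and 6 observed j reads.
same = [a + b for a, b in error_patterns(4, 6, 4, 6)]     # true counts are 4 and 6
shifted = [a + b for a, b in error_patterns(4, 6, 3, 7)]  # true counts are 3 and 7
```

For true counts of 4 and 6 the feasible error totals are all even, and for true counts of 3 and 7 they start at 1 and increase in steps of 2, in line with the example above. Under the stated assumptions, summing over all true counts gives the same value as a plain binomial in which a read shows allele i with probability p_i(1 - eps) + (1 - p_i)eps, which offers a quick consistency check on any enumeration of this kind.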
[0047] Referring to
[0048] To verify the performance of the method in the above embodiments of the present invention, the following experiments were performed.
[0049] One experiment used genetic data of 320 Chinese individuals from the 1000 Genomes Project Phase 3. In the experiment, the allele frequencies for both SNP and STR markers in the Chinese sub-population were counted. Then, NGS data for 8 Chinese family trios with an average sequencing coverage of 32× were collected. After stringently filtering out the markers with more than two alleles, the statistical model in the above embodiments of the present invention was applied.
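Counting allele frequencies from a reference population, as mentioned above, can be sketched in Python as follows. The input format (one genotype tuple per individual at a single marker) and the function name are assumptions made for illustration; the genotypes shown are invented and are not data from the experiment.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Estimate allele frequencies at one marker by counting the alleles
    carried by every individual in the reference population."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes of four individuals at one biallelic marker.
population = [("A", "A"), ("A", "a"), ("A", "A"), ("a", "a")]
freqs = allele_frequencies(population)
```

Frequencies estimated in this way are what the Hardy-Weinberg genotype prior takes as input.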
[0050]
[0051] A further experiment was performed by randomly subsampling reads to reduce the sequencing coverage of the samples. In this experiment, the overall coverage was reduced to 2×, 1×, 0.5× and 0.3× respectively. At each sequencing coverage, 800 experiments for both true trios and false trios (100 per family trio) were processed. As shown in
[0052] Embodiments of the present invention have provided a statistical model based method for genetic testing with NGS data. By considering the probability of sequencing errors and missing alleles, the likelihoods of the genotypes for individuals in the tested trio are calculated and then combined to obtain the overall probability that the tested subject is biologically related to the alleged offspring (e.g., the tested man is the true biological father of the alleged offspring). The method in some embodiments of the present invention requires a minimum of 0.5× NGS sequencing data of a family trio to perform an accurate determination. As a result, a reliable result can be obtained at relatively low cost.
[0053] It should be noted that the methods of the present invention can be applied not only to paternity testing, but also to genetic analysis for individual identification. Also, the present invention is not limited in its application to human beings, but may also apply to other animals, plants, etc.
[0054] Although not required, the embodiments described with reference to the Figures can be implemented as an application programming interface (API) or as a series of libraries for use by a developer or can be included within another software application, such as a terminal or personal computer operating system or a portable computing device operating system. Generally, as program modules include routines, programs, objects, components and data files assisting in the performance of particular functions, the skilled person will understand that the functionality of the software application may be distributed across a number of routines, objects or components to achieve the same functionality desired herein.
[0055] It will also be appreciated that where the methods and systems of the present invention are either wholly implemented by a computing system or partly implemented by computing systems, then any appropriate computing system architecture may be utilized. This will include stand-alone computers, network computers and dedicated hardware devices. Where the terms computing system and computing device are used, these terms are intended to cover any appropriate arrangement of computer hardware capable of implementing the function described.
[0056] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
[0057] Any reference to prior art contained herein is not to be taken as an admission that the information is common general knowledge, unless otherwise indicated.