Method and system for estimating a gender of a foetus of a pregnant female
11155854 · 2021-10-26
Assignee
Inventors
Cpc classification
C12Q2537/165
CHEMISTRY; METALLURGY
G16H10/40
PHYSICS
C12Q2537/16
CHEMISTRY; METALLURGY
C12Q1/6809
CHEMISTRY; METALLURGY
G16H50/20
PHYSICS
International classification
G16H10/40
PHYSICS
C12Q1/6809
CHEMISTRY; METALLURGY
G16H50/20
PHYSICS
Abstract
A method for estimating a gender of a foetus of a pregnant female, said method comprising measuring allele presences (D.sub.X) for a first plurality of genetic markers of the X-chromosome and allele presences (D.sub.R) for a second plurality of genetic markers of at least one reference chromosome, different from the X and Y chromosome, in a sample of cell-free DNA from a pregnant female; based on said measured allele presences for said first plurality, determining a first fraction thereof which is associated with purely homozygous genetic markers; based on said measured allele presences for said second plurality, determining a second fraction thereof which is associated with purely homozygous genetic markers; and estimating a gender of said foetus based on said first and second fraction.
Claims
1. A method for estimating a gender of a fetus of a pregnant female, said method comprising the following steps: measuring allele presences (D.sub.X) for a first plurality of genetic markers of the X-chromosome and allele presences (D.sub.R) for a second plurality of genetic markers of at least one reference chromosome, different from the X and Y chromosome, in a sample of cell-free DNA from a pregnant female; each allele presence representing the presence at a genetic marker of at least one of: a reference allele of maternal or fetal origin or an alternative allele of maternal or fetal origin; determining, by a computer device, a first fraction (F.sub.X) thereof which is associated with purely homozygous genetic markers based on said measured allele presences for said first plurality; determining, by the computer device, a second fraction (F.sub.R) thereof which is associated with purely homozygous genetic markers based on said measured allele presences for said second plurality; determining, by the computer device, a first gender estimator (E.sub.H1) based on said first and second fractions; determining, by the computer device, a ratio (N.sub.Xi) between the measured allele presences of the first plurality and the measured allele presences of the second plurality of genetic markers, and determining statistical distribution parameters of the ratio associated with a first gender estimator which indicates that the fetus is female, and determining, by the computer device, a second gender estimator (E.sub.H2) using the statistical distribution; and estimating, by the computer device, a gender of said fetus, using the first and second gender estimator; wherein the first gender estimator (E.sub.H1), is calculated as follows:
E.sub.H1=(E.sub.H0−1)/(E.sub.He−1), with E.sub.H0=F.sub.X/F.sub.R; with F.sub.X the first fraction; with F.sub.R the second fraction; and with E.sub.He a predicted value for E.sub.H0 in case of a male fetus, and wherein E.sub.H1 is approximately 0 for a female fetus, and approximately 1 for male fetus.
2. The method of claim 1, wherein the determining of the first fraction comprises: calculating a corresponding number of allele frequencies for said first plurality based on said measured allele presences for the first plurality of genetic markers; and determining as the first fraction the fraction of said measured allele presences for which the allele frequency is 0 or 1 within a predetermined error margin; and/or wherein the determining of the second fraction comprises: calculating a corresponding number of allele frequencies for said second plurality based on said measured allele presences for the second plurality of genetic markers; and determining as the second fraction the fraction of said measured allele presences for which the allele frequency is 0 or 1 within a predetermined error margin.
3. The method of claim 1, wherein the measuring and determining steps are performed, by the computer device, for a batch comprising a plurality of samples; wherein for each sample in the batch, the first and second fraction are calculated.
4. The method of claim 1, wherein the second gender estimator E.sub.H2 is calculated as follows
E.sub.H2=Z.sub.Oi/Z.sub.Mi, with Z.sub.Oi a z-score calculated for the sample i in an analysis batch as:
5. The method of claim 1, wherein the sample is obtained from maternal blood, plasma, urine, cerebrospinal fluid, serum, saliva or is transcervical lavage fluid.
6. The method of claim 1, wherein said measuring step comprises at least one of the following: polymerase chain reaction (PCR), ligase chain reaction, nucleic acid sequence based amplification (NASBA), and/or branched DNA methods.
7. The method of claim 6, wherein said measuring step comprises PCR.
Description
BRIEF DESCRIPTION OF THE FIGURES
(1) The accompanying drawings are used to illustrate presently preferred non-limiting exemplary embodiments of a method and system of the present invention. The above and other advantages of the features and objects of the invention will become more apparent and the invention will be better understood from the following detailed description when read in conjunction with the accompanying drawings, in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
(12) In a Non-Invasive Prenatal Test (NIPT), known in the prior art, cell-free DNA (cfDNA) in a maternal serum or plasma sample of a pregnant female is sequenced in order to screen for the presence of chromosomal aneuploidies in the foetus, such as trisomy of chromosome 21.
(13) According to exemplary embodiments of the invention, there is provided a method to estimate the gender of the foetus.
(14) In a typical embodiment, a maternal serum or plasma sample is derived from the maternal blood. This may be a small amount of serum or plasma, e.g. 1 to 20 ml. Depending on the desired accuracy it may be preferred to use larger volumes. The preparation of the serum or plasma from the maternal blood sample may be carried out using standard techniques. Suitable techniques include centrifugation and/or matrix based techniques. In possible embodiments, a sequence-based enrichment method may be used on the maternal serum or plasma to specifically enrich for foetal nucleic acid sequences.
(15) Embodiments of the method of the invention may be carried out for a sample in which the foetal DNA, expressed as a foetal fraction of the total amount of DNA, is above a predetermined threshold. In preferred embodiments, an amplification of the foetal DNA sequences in the sample is carried out. Any amplification method known to the skilled person may be used, such as a PCR method.
(16) In a preferred embodiment, data from Applicant's Clarigo test are used. This test does not involve the detection of SNP (single-nucleotide polymorphism, i.e. a genetic marker that comprises a single variable nucleotide) alleles on the foetal DNA that are not present in the DNA of the pregnant female. The Clarigo test consists of targeted sequencing of a number of regions of the human genome (in other words, targeting specific genetic markers), using known SNPs with high (e.g. greater than 1%, preferably greater than 10%) population prevalence and two possible alleles (sc. a reference allele a.k.a. REF; and an alternative allele a.k.a. ALT). More details about the Clarigo test can be found on the Internet at multiplicom.com/product/clarigo, and in WO 2013/057568, which was filed in the name of the Applicant.
(17) Now an exemplary embodiment of a method for estimating a gender of a foetus of a pregnant female will be discussed in detail. In a first measurement step, allele presences for a first plurality (D.sub.X) of genetic markers of the X-chromosome and for a second plurality (D.sub.R) of genetic markers of at least one reference chromosome, different from the X and Y chromosome, are measured in a sample of cell-free DNA from a pregnant female. Each allele presence represents the presence at a genetic marker of at least one of: a reference allele of maternal or foetal origin, and an alternative allele of maternal or foetal origin. In a second calculating step, based on said measured allele presences for said first plurality, a first fraction (F.sub.X) thereof which is associated with purely homozygous genetic markers is determined, and based on said measured allele presences for said second plurality, a second fraction (F.sub.R) thereof which is associated with purely homozygous genetic markers is determined. In a third step a gender of the foetus is estimated based on said first and second fraction. In the exemplary embodiment set out below, the estimating step comprises calculating a first gender estimator for a specific sample, calculating a second gender estimator using also data from other samples, and calculating a combined gender estimator for the specific sample based on the first and second gender estimator.
(18) Measurement and Fraction Determining
(19) An advantageous way to represent the results of measuring allele presences for a genetic marker is to associate the following information with a variant data point for that genetic marker. A variant data point (being a data point associated with a number of variants, such as alleles) is used in this specification as a convenient representation for a genetic marker, and thus represents the result of measuring allele presences in a number of amplicons for genetic markers. An amplicon is a piece of DNA or RNA that is the (source and/or) product of amplification or replication events. In other words, an amplicon is a biophysical piece of replication material, designed to contain a known SNP position with high population prevalence. Each variant data point is thus associated with a known SNP with high population prevalence and with two possible alleles (sc. a reference allele a.k.a. REF; and an alternative allele a.k.a. ALT). For each variant data point A.sub.i, the following numbers can be determined using e.g. a standard bioinformatics pipeline applied to the sequencing data:
the number of reads containing the REF allele on the known SNP position, C.sub.Ri;
the number of reads containing the ALT allele on the known SNP position, C.sub.Ai;
the total coverage, C.sub.Ti=C.sub.Ri+C.sub.Ai; and
the allele frequency, or the fraction of ALT allele reads on the total coverage, F.sub.i=C.sub.Ai/(C.sub.Ri+C.sub.Ai).
(20) Therefore, for a given genetic marker i, the allele presences can be measured for the REF allele, for the ALT allele, and for both alleles together, by counting the reads containing the REF allele, the reads containing the ALT allele, and the reads containing either allele (the total coverage), respectively. Based on the measured allele presences, a corresponding number of allele frequencies are calculated for the predetermined number of genetic markers.
(21) For each position in the genome (i.e. for each genetic locus) outside the X and Y chromosomes, and assuming that the position is not part of an aneuploidy region or otherwise affected by a relevant chromosome disorder, there are four copies present in the sample, which determine the total number of reads: two copies from the maternal DNA and two copies from the foetal DNA.
(22) For an individual variant data point (i.e. for an individual genetic marker), let A and B denote the REF and ALT allele for the known SNP on the maternal DNA for that genetic marker, and a and b the corresponding states for the foetal DNA. This means that the variant data point can be in the possible states listed in Table 1:
(23) TABLE 1
Variant data point state | Expected fraction of ALT reads (F.sub.i)
AAaa | 0
AAab | FF/2
ABaa | 0.5 − FF/2 = (1 − FF)/2
ABab | 0.5
ABbb | 0.5 + FF/2 = (1 − FF)/2 + FF
BBab | 1 − FF/2
BBbb | 1
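The expected fractions of Table 1 follow from mixing the maternal and foetal genotypes in the proportions (1 − FF) and FF, with FF the foetal fraction of the sample. The following is a minimal Python sketch of that mixing rule for a locus with two maternal and two foetal copies; the function name `expected_alt_fraction` and the four-letter state strings are illustrative assumptions, not part of the described method.

```python
def expected_alt_fraction(state: str, ff: float) -> float:
    """Expected ALT read fraction for a state such as 'ABab'.

    Upper-case letters are the two maternal alleles, lower-case letters
    the two foetal alleles; 'B' and 'b' denote the ALT allele.
    `ff` is the foetal fraction of the sample.
    """
    maternal, foetal = state[:2], state[2:]
    maternal_alt = maternal.count("B") / 2  # ALT share of the maternal copies
    foetal_alt = foetal.count("b") / 2      # ALT share of the foetal copies
    return (1 - ff) * maternal_alt + ff * foetal_alt


# Illustrative foetal fraction of 10 % (an assumed value).
for s in ["AAaa", "AAab", "ABaa", "ABab", "ABbb", "BBab", "BBbb"]:
    print(s, round(expected_alt_fraction(s, ff=0.10), 3))
```

Each printed value can be compared with the corresponding row of Table 1 for FF = 0.10.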
(24) As an illustration of measured allele presences, the scatter plots illustrated in
(25) It can be seen from
(26) In
(27) In
(28) Therefore, three groups of variant data points (11A and 11B, 12A and 12B, and 13) can be distinguished: variant data points 11A and 11B that are homozygous in the maternal and foetal DNA (AAaa, BBbb); variant data points 12A and 12B that are homozygous in the maternal DNA, and heterozygous in the foetal DNA (AAab, BBab). Note that in these cases the foetal DNA contains an allele that was inherited from the father and that is not present in the maternal DNA. In other words, for a male foetus this group of variant data points will not be present for the X-chromosome, see
(29) It is noted that multiple variant data points may have the same (or very nearly the same) allele frequency, especially when they are part of the same group. This means that (very nearly) the same number of allele presences has been measured for them, relative to the total number of reads.
(30) It is also noted that, in
(31) Homozygous Fraction Gender Estimator (First Gender Estimator)
(32) A first gender estimator is calculated by investigating the fraction of variant data points that are purely homozygous in the sample for the X-chromosome and for the at least one reference chromosome which does not include the X/Y-chromosome. In a preferred embodiment all reference chromosomes may be used.
(33) The following parameters are calculated from the obtained measurement results: the total number D.sub.X of reads over all variant data points on chromosome X, i.e. the total number of measured allele presences for the first plurality of genetic markers of the chromosome X; the total number D.sub.Xh of reads on chromosome X, corresponding to variant data points that are purely homozygous, i.e. allele frequency either 0 or 1, within a predetermined error margin, i.e. groups 11A and 11B in
(34) For sample M, the following set of values is obtained by the method (see
(35) For sample F, the following set of values are identified by the method (see
(36) Next, the following fractions are calculated:
a first fraction F.sub.X of reads for the first plurality of genetic markers, which is associated with purely homozygous genetic markers:
F.sub.X=D.sub.Xh/D.sub.X,
a second fraction F.sub.R of reads for the second plurality of genetic markers, which is associated with purely homozygous genetic markers:
F.sub.R=D.sub.Rh/D.sub.R.
(37) A first preliminary gender estimator may be calculated as:
E.sub.H0=F.sub.X/F.sub.R.
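The calculation of the two fractions and of the preliminary estimator E.sub.H0 can be sketched as follows. This is an illustration under stated assumptions only: the read counts are invented, the error margin of 0.02 is a hypothetical choice for the "predetermined error margin", and the name `homozygous_fraction` does not come from the described method.

```python
from typing import List, Tuple

Reads = List[Tuple[int, int]]  # (C_R, C_A) read counts per variant data point


def homozygous_fraction(points: Reads, margin: float = 0.02) -> float:
    """Fraction of reads on points whose allele frequency is ~0 or ~1."""
    total = 0      # D: reads over all variant data points
    total_hom = 0  # D_h: reads on purely homozygous variant data points
    for c_ref, c_alt in points:
        coverage = c_ref + c_alt
        if coverage == 0:
            continue  # no reads, so no allele frequency can be computed
        freq = c_alt / coverage
        total += coverage
        if freq <= margin or freq >= 1 - margin:
            total_hom += coverage
    return total_hom / total


# Invented read counts, for illustration only.
chr_x = [(1000, 0), (0, 900), (520, 480), (995, 5)]
reference = [(1000, 0), (480, 520), (950, 50), (0, 1100)]

f_x = homozygous_fraction(chr_x)      # F_X = D_Xh / D_X
f_r = homozygous_fraction(reference)  # F_R = D_Rh / D_R
e_h0 = f_x / f_r                      # E_H0 = F_X / F_R
print(f_x, f_r, e_h0)
```

Weighting by reads rather than by number of markers follows the definitions of D.sub.X and D.sub.Xh as read totals.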
(38) In case of a female foetus, there are two copies present of chromosome X and of the at least one reference chromosome. Hence, E.sub.H0 is expected to be 1. In case of a male foetus, there is only one copy of chromosome X, but two copies of all reference chromosomes. Hence, E.sub.H0 is expected to be larger than 1. In other words, the first preliminary gender estimator allows the gender to be estimated.
(39) For the exemplary embodiment of
F.sub.X=D.sub.Xh/D.sub.X≅0.71,
F.sub.R=D.sub.Rh/D.sub.R≅0.55,
E.sub.H0=F.sub.X/F.sub.R≅1.28.
(40) Because E.sub.H0 is well above 1, sample M can be estimated to be male.
(41) For the exemplary embodiment of
F.sub.X=D.sub.Xh/D.sub.X≅0.56,
F.sub.R=D.sub.Rh/D.sub.R≅0.57,
E.sub.H0=F.sub.X/F.sub.R≅0.97.
(42) Because E.sub.H0 is approximately 1, sample F can be estimated to be female.
(43) This can be further understood as follows. Given two copies of chromosome X, REF allele A and ALT allele B, there are four different combinations of those alleles possible for the mother that are expected to appear: AA, AB, BA, BB. Based on the gender of the foetus, the above combinations can be further divided due to the presence of foetal DNA: 1. Male foetus (
(44) From the above, it follows that the expected first fraction of purely homozygous SNPs on the X-chromosomes should be higher in case of a male foetus compared to a female foetus. For a female foetus, this first fraction is expected to be identical to the second fraction obtained from non-sex chromosomes. For a male foetus, the first fraction is expected to be higher than the second fraction.
(45) The first preliminary gender estimator E.sub.H0 may be standardised to a first gender estimator E.sub.H1 which is expected to be 0 for a female foetus, and 1 for a male foetus:
E.sub.H1=(E.sub.H0−1)/(E.sub.He−1),
with E.sub.He a predicted value for E.sub.H0 in case of a male foetus.
(46) This predicted value E.sub.He may be obtained by estimating the population SNP heterozygosity level using the fraction of SNPs that is observed as heterozygous in the mother's DNA in the sample. This estimate can be obtained as follows. The following definitions are introduced:
(47) the total number D.sub.Rhet of reads on the at least one reference chromosome, corresponding to variant data points that are heterozygous in the maternal DNA. For the heterozygous variant data points, there are 8 possibilities: ABaa, ABbb, ABab, ABba, BAaa, BAbb, BAab and BAba.
(48) The fraction of heterozygous variant data points for the at least one reference chromosome can then be calculated as:
F.sub.Rhet=D.sub.Rhet/D.sub.R.
(49) This can be further understood as follows. Suppose that, on average, the REF allele A is found in a fraction F.sub.A of all occurrences of an SNP and the ALT allele B in a fraction F.sub.B=1−F.sub.A. If it is assumed that every SNP in the variant data points shares the same REF allele fraction F.sub.A, expected levels of homozygosity and heterozygosity can be estimated. The level of homozygosity, or equivalently, the level of occurrence of the combinations AA, BB is estimated by:
F.sub.A.sup.2+F.sub.B.sup.2=F.sub.A.sup.2+(1−F.sub.A).sup.2.
(50) The level of heterozygosity, or equivalently, the level of occurrence of the combinations AB, BA is
F.sub.Rhet=2F.sub.AF.sub.B=2F.sub.A(1−F.sub.A).
(51) The combinations AAa, BBb have the same level of occurrence as the combinations AA, BB.
(52) The occurrence of the combinations AAaa, BBbb is F.sub.A.sup.3+F.sub.B.sup.3=F.sub.A.sup.3+(1−F.sub.A).sup.3.
(53) The occurrence of the other combinations can be estimated in a similar way.
(54) From the above, it follows that a predicted value for E.sub.H0 in case of a male foetus can be estimated as:
E.sub.He=(F.sub.A.sup.2+(1−F.sub.A).sup.2)/(F.sub.A.sup.3+(1−F.sub.A).sup.3).
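As a cross-check, the predicted value can be rewritten in terms of the heterozygosity level alone. The shorthand H = 2F.sub.A(1−F.sub.A) is introduced here for illustration and is not part of the original text; the derivation only uses F.sub.A+F.sub.B=1.

```latex
\begin{aligned}
F_A^2 + F_B^2 &= (F_A + F_B)^2 - 2 F_A F_B = 1 - H,\\
F_A^3 + F_B^3 &= (F_A + F_B)\,(F_A^2 - F_A F_B + F_B^2) = 1 - \tfrac{3}{2} H,\\
E_{He} &= \frac{F_A^2 + F_B^2}{F_A^3 + F_B^3} = \frac{1 - H}{1 - \tfrac{3}{2} H}.
\end{aligned}
```

In this form E.sub.He equals 1 for H = 0 and increases with H, reaching 2 at H = 0.5.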
(55) The method estimates the value F.sub.A from the observed level of heterozygosity F.sub.Rhet on the at least one reference chromosome by solving the following quadratic (second-degree) equation and taking the largest root:
2F.sub.A(1−F.sub.A)−F.sub.Rhet=0.
(56) If no real root exists, F.sub.A is set to 0.5.
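A minimal Python sketch of this step is given below, assuming the hypothetical function names `ref_allele_fraction` and `predicted_male_e_h0`. The input 0.29 is a rounded heterozygosity level, so results may differ in the last digit from figures computed from unrounded read counts.

```python
import math


def ref_allele_fraction(f_rhet: float) -> float:
    """Largest root of 2*F_A*(1 - F_A) - F_Rhet = 0, or 0.5 if none is real."""
    disc = 1 - 2 * f_rhet  # discriminant of F_A**2 - F_A + F_Rhet/2 = 0
    if disc < 0:
        return 0.5  # no real root: fall back to F_A = 0.5
    return (1 + math.sqrt(disc)) / 2


def predicted_male_e_h0(f_a: float) -> float:
    """E_He = (F_A^2 + F_B^2) / (F_A^3 + F_B^3) with F_B = 1 - F_A."""
    f_b = 1 - f_a
    return (f_a**2 + f_b**2) / (f_a**3 + f_b**3)


f_a = ref_allele_fraction(0.29)
e_he = predicted_male_e_h0(f_a)
print(round(f_a, 2), round(e_he, 2))
```

A real root exists only for F.sub.Rhet ≤ 0.5, which is the largest heterozygosity level that the single-frequency model can produce.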
(57) A first gender classification may be executed based on E.sub.H1, using a fixed threshold T.sub.0 (≤0.5):
E.sub.H1<T.sub.0: female,
E.sub.H1≥T.sub.0 and E.sub.H1≤1−T.sub.0: unknown,
E.sub.H1>1−T.sub.0: male.
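The standardisation and the three-way classification can be sketched as follows. The default threshold T.sub.0 = 0.3 and the function names are assumptions made for illustration; the text only requires T.sub.0 ≤ 0.5.

```python
def first_gender_estimator(e_h0: float, e_he: float) -> float:
    """Standardise E_H0 so that ~0 means female and ~1 means male.

    Undefined when E_He equals 1 (no heterozygosity observed).
    """
    return (e_h0 - 1) / (e_he - 1)


def classify_first(e_h1: float, t0: float = 0.3) -> str:
    """Three-way call with a fixed threshold T_0 <= 0.5 (0.3 is assumed)."""
    if not 0 <= t0 <= 0.5:
        raise ValueError("T_0 must lie in [0, 0.5]")
    if e_h1 < t0:
        return "female"
    if e_h1 > 1 - t0:
        return "male"
    return "unknown"


# Illustrative inputs: E_H0 well above 1, and E_H0 close to 1.
print(classify_first(first_gender_estimator(1.28, 1.25)))
print(classify_first(first_gender_estimator(0.97, 1.25)))
```

With T.sub.0 = 0.5 the "unknown" band shrinks to the single value E.sub.H1 = 0.5; smaller thresholds widen it.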
(58) For sample M, the first gender estimator can be calculated as follows. First D.sub.Rhet is determined: D.sub.Rhet=259134 (divided among 498 amplicons), and next the fraction is calculated:
F.sub.Rhet=D.sub.Rhet/D.sub.R≅0.29.
(59) From this fraction the predicted value E.sub.He can be calculated:
F.sub.A≅0.82,
E.sub.He≅1.25.
(60) The first gender estimator then becomes:
E.sub.H1=(E.sub.H0−1)/(E.sub.He−1)≅1.07.
(61) Based on this value, a first classification of sample M is male.
(62) For sample F, the first gender estimator can be calculated as follows. First D.sub.Rhet is determined: D.sub.Rhet=250259 (divided among 436 amplicons), and next the fraction is calculated:
F.sub.Rhet=D.sub.Rhet/D.sub.R≅0.26.
(63) From this fraction the predicted value E.sub.He can be calculated:
F.sub.A≅0.84,
E.sub.He≅1.25.
(64) The first gender estimator then becomes:
E.sub.H1=(E.sub.H0−1)/(E.sub.He−1)≅−0.10.
(65) Based on this value, a first classification of sample F is female.
(66)
(67) Coverage Gender Estimator (Second Gender Estimator)
(68) A second gender estimator is obtained by comparing the coverage of the X-chromosome to the coverage of a set of at least one reference chromosome different from the X and Y chromosomes. This set may be the same set as the set used for determining the first gender estimator, or a different set.
(69) The following parameters are calculated from the variant data points of a sample i of a batch:
(70) the total number D.sub.Xi of reads over all variant data points on chromosome X, i.e. the total number of measured allele presences for the first plurality of genetic markers of the X-chromosome; and
(71) the total number of reads D.sub.Ri over all variant data points on the at least one reference chromosome, i.e. the total number of measured allele presences for the second plurality of genetic markers of the at least one reference chromosome.
(72) Next, a ratio N.sub.Xi of the reads D.sub.Xi and the reads D.sub.Ri is calculated as follows for sample i:
N.sub.Xi=D.sub.Xi/D.sub.Ri.
(73) The ratio N.sub.Xi may be calculated for all samples in an analysis batch, e.g. all samples sequenced during a single run. Next, the values N.sub.Xi for the samples that were determined “female” using the first gender classification may be selected, and normal distribution parameters may be calculated for those values, e.g. a mean value μ and a standard deviation σ.
(74) In the event that the number of samples in the analysis batch that were classified “female” is too small to obtain a reliable estimate of the distribution, this set can be augmented by using samples classified as “male”, using a corrected value:
N′.sub.Xi=N.sub.Xi×1/(1−FF/2),
(75) with FF the estimated foetal fraction of the sample. The foetal fraction may be estimated using an embodiment of the method disclosed in patent application BE 2015/5460 in the name of the Applicant, which is included herein by reference. Other existing methods may also be used to estimate the foetal fraction.
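The selection of the "female" samples and the optional augmentation with corrected "male" samples can be sketched as follows. Several points are assumptions made for illustration: the tuple layout of a sample, the minimum count `min_female`, the invented batch values, and the use of the sample standard deviation (the text does not state which estimator of σ is used).

```python
from statistics import mean, stdev
from typing import List, Optional, Tuple

# (N_Xi, first classification, estimated foetal fraction FF or None)
Sample = Tuple[float, str, Optional[float]]


def female_distribution(batch: List[Sample], min_female: int = 3) -> Tuple[float, float]:
    """Mean and standard deviation of N_Xi over a female-like reference set."""
    values = [n for n, call, _ in batch if call == "female"]
    if len(values) < min_female:
        # Too few female calls: add male calls, corrected to the female level
        # with N'_Xi = N_Xi / (1 - FF/2).
        values += [
            n / (1 - ff / 2)
            for n, call, ff in batch
            if call == "male" and ff is not None
        ]
    return mean(values), stdev(values)


# Invented batch, for illustration only.
batch = [
    (0.1280, "female", None),
    (0.1278, "female", None),
    (0.1242, "male", 0.06),
    (0.1281, "unknown", None),
]
mu, sigma = female_distribution(batch)
print(mu, sigma)
```

Samples with an "unknown" first classification are left out of the reference set in this sketch.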
(76) Using this distribution, a z-score Z.sub.Oi can be calculated for all samples i in the analysis batch:
(77) Z.sub.Oi=(N.sub.Xi−μ)/σ.
(78) For each sample, the expected z-score value Z.sub.Mi under the assumption that the sample is “male” is determined. This value is based on the estimated foetal fraction of the sample:
(79) Z.sub.Mi=(μ×(1−FF/2)−μ)/σ=−μ×FF/(2σ).
(80) A second gender estimator is calculated as:
E.sub.H2=Z.sub.Oi/Z.sub.Mi.
(81) This value is expected to be 0 for a female foetus, and 1 for a male foetus.
(82) For samples M and F the ratio N.sub.Xi can be calculated using the formulas above:
N.sub.XM=0.1225806583663057,
N.sub.XF=0.12775398163696797.
(83) Next all values N.sub.Xi are calculated for all samples in the analysis batch.
(84) The samples classified as female (i.e. the points in contour 401) are used to determine a normal distribution. These are the samples for which E.sub.H1<T.sub.0. These samples can be found on the left of the vertical separation line 301 in
(85) In the example of
N′.sub.XM=N.sub.XM×1/(1−FF/2)=0.12626348981879823.
(86)
(87) Based on the values N.sub.Xi, the normal distribution is calculated with:
mean value μ=0.12792346387904036, and
standard deviation σ=0.0011732807075874412.
(88) Using this distribution, a z-score Z.sub.OM=−4.55 for sample M and Z.sub.OF=−0.14 for sample F are calculated. The z-score for all samples in the analysis batch is shown in
(89) The expected z-score values under the assumption that the sample is “male” are Z.sub.MM=−3.18 and Z.sub.MF=−2.84 for the two example samples. Using both z-scores, the second gender estimators of samples M and F are:
Sample M: E.sub.H2=Z.sub.OM/Z.sub.MM=1.43,
Sample F: E.sub.H2=Z.sub.OF/Z.sub.MF=0.05.
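The worked values above can be approximately reproduced with the sketch below. It uses a standard z-score for Z.sub.Oi and, for Z.sub.Mi, the z-score of the ratio μ×(1−FF/2) expected for a male foetus; both forms are assumptions that have only been checked against the values quoted in this example. The foetal fraction of sample M is not quoted, so it is recovered here from the ratio of N.sub.XM to its corrected value N′.sub.XM.

```python
MU = 0.12792346387904036       # mean of N_Xi over the female set
SIGMA = 0.0011732807075874412  # standard deviation of that set


def z_observed(n_x: float, mu: float = MU, sigma: float = SIGMA) -> float:
    """z-score of the coverage ratio N_Xi (assumed standard form)."""
    return (n_x - mu) / sigma


def z_expected_male(ff: float, mu: float = MU, sigma: float = SIGMA) -> float:
    """z-score of mu*(1 - FF/2), the ratio assumed for a male foetus."""
    return (mu * (1 - ff / 2) - mu) / sigma


n_xm = 0.1225806583663057
n_xf = 0.12775398163696797

z_om = z_observed(n_xm)
z_of = z_observed(n_xf)

# FF is not quoted for sample M; recover it from N_XM / N'_XM = 1 - FF/2.
ff_m = 2 * (1 - n_xm / 0.12626348981879823)
e_h2_m = z_om / z_expected_male(ff_m)
e_h2_f = z_of / -2.84  # Z_MF as quoted in the text

print(round(z_om, 2), round(z_of, 2), round(e_h2_m, 2), round(e_h2_f, 2))
```

Dividing by Z.sub.Mi rescales the observed z-score so that a male foetus is expected at 1 regardless of the foetal fraction of the sample.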
(90) The second gender estimator values for all samples in the analysis batch are shown on the x-axis of
(91) Combined Gender Estimator
(92) Both the first and the second gender estimators may be combined into a single joint estimator.
(93) In case of a female foetus, the combined estimator pair is expected to have the following values:
(E.sub.H1,E.sub.H2)=(0,0).
(94) In case of a male foetus, the combined estimator pair is expected to have the following values:
(E.sub.H1,E.sub.H2)=(1,1).
(95) For a specific sample, the Euclidean distances between the observed estimator value pair and both expected value pairs may be calculated:
D.sub.F=√(E.sub.H1.sup.2+E.sub.H2.sup.2),
D.sub.M=√((E.sub.H1−1).sup.2+(E.sub.H2−1).sup.2).
(96) From this, a single combined estimator is calculated as
E.sub.H=(D.sub.M−D.sub.F)/√2.
(97) This value is expected to be −1 for a male foetus, and +1 for a female foetus.
(98) Using the final gender estimator, a final estimation may be performed as follows, using a predetermined threshold T (≥0):
E.sub.H<−T: male,
E.sub.H≥−T and E.sub.H≤+T: unknown,
E.sub.H>+T: female.
(99) The gender estimators resulted in values (1.07, 1.43) for sample M and values (−0.10, 0.05) for sample F.
(100) Finally, a combined estimator value for samples M and F can be calculated:
Sample M: E.sub.H=−0.95,
Sample F: E.sub.H=0.94.
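The combination of both estimators and the final classification can be sketched as follows, using the rounded estimator pairs quoted above as inputs. Because the inputs are rounded, the last digit of the result can differ from the quoted combined values. The function names are illustrative assumptions.

```python
import math


def combined_estimator(e_h1: float, e_h2: float) -> float:
    """E_H = (D_M - D_F) / sqrt(2): about -1 for male, +1 for female."""
    d_f = math.hypot(e_h1, e_h2)          # distance to the female point (0, 0)
    d_m = math.hypot(e_h1 - 1, e_h2 - 1)  # distance to the male point (1, 1)
    return (d_m - d_f) / math.sqrt(2)


def classify_combined(e_h: float, t: float = 0.3) -> str:
    """Final call with a predetermined threshold T >= 0."""
    if e_h < -t:
        return "male"
    if e_h > t:
        return "female"
    return "unknown"


for name, pair in {"M": (1.07, 1.43), "F": (-0.10, 0.05)}.items():
    e_h = combined_estimator(*pair)
    print(name, round(e_h, 2), classify_combined(e_h))
```

The division by √2, the distance between the points (0, 0) and (1, 1), is what places the two expected pairs at exactly +1 and −1.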
(101) Based on these values and threshold T=0.3, sample M is classified as male and sample F is classified as female. The value of E.sub.H for all samples in the example analysis batch is shown in
(102)
(103) A person of skill in the art would readily recognize that steps of various above-described methods can be performed by programmed computers. Herein, some embodiments are also intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable programs of instructions, wherein said instructions perform some or all of the steps of said above-described methods. The program storage devices may be, e.g., digital memories, magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media. The embodiments are also intended to cover computers programmed to perform said steps of the above-described methods.
(104) The functions of the various elements shown in the figures, including any functional blocks labelled as “modules”, may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “module” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.
(105) Whilst the principles of the invention have been set out above in connection with specific embodiments, it is to be understood that this description is merely made by way of example and not as a limitation of the scope of protection, which is determined by the appended claims.