METHOD FOR DETERMINING GENOTYPE OF PARTICULAR GENE LOCUS GROUP OR INDIVIDUAL GENE LOCUS, DETERMINATION COMPUTER SYSTEM AND DETERMINATION PROGRAM

20170351805 · 2017-12-07

Assignee

Inventors

Cpc classification

International classification

Abstract

It is an invention aimed at providing methods for optimizing read information mapped to the particular gene loci such as MHC loci under a framework of probabilistic statistical processing. In the present invention, a step in which for all reads, calculation of the expected number of mappings to alleles of the particular gene loci is performed for read information in which mapping of reads to alleles of the particular gene loci is identified, a step in which the total number of expected mappings for each allele is calculated, and a step in which the fraction of reads allocated to each allele is calculated, are repeatedly executed, in which the optimization of read information is performed in a computer, and based on the optimized information, a method of easily and accurately estimating the genotypes of the particular gene loci, a computer system, and a computer program capable of easily and accurately estimating the genotype of the particular gene loci, are provided.

Claims

1. A method for optimizing read information of DNA, which is characterized by executing all or a part of the following steps (1) to (6) using read information which contains mapping correspondence of each read to alleles of selected gene loci or an individual gene locus obtained by mapping a nucleotide sequence of read data where read information of DNA derived from alleles of the gene loci or an individual gene locus is mixed: (1) a step in which, for each read, quantification of an expected number of mappings to each allele of the gene loci or individual gene locus, is performed; (2) a step in which the expected number of mappings quantified in step (1) is summed for each allele of the gene loci or individual gene locus to calculate a total number of expected mappings; (3) a step in which the total number of expected mappings calculated in step (2) is divided by the sum of the total number of expected mappings of all the alleles of the gene loci or individual gene locus, to calculate a fraction of reads allocated to each allele of the gene loci or individual gene locus relative to a total read amount mapped to alleles of the gene loci or individual gene locus; (4) a step in which the fraction of reads obtained in step (3) is assigned to each allele of the gene loci or individual gene locus as frequency, and based on the assigned frequencies, for each read, the expected number of mappings to each allele of the gene loci or individual gene locus is calculated again; (5) a step in which step (2) or (3) is executed again based on the updated expected number of mappings obtained in step (4), to calculate a fraction of the number of reads allocated to an allele of the gene loci or individual gene locus relative to the total read amount mapped to alleles of the gene loci or individual gene locus; and (6) a step in which steps (4) and (5) are repeatedly executed, until difference between the expected number of mappings to each allele of the gene loci or individual gene locus calculated in step (4) and that calculated in the previous step (4) is not recognized for all the reads, or difference between the fraction of reads calculated in step (5) and that calculated in the previous step (5), is not recognized for all the alleles in the gene loci or individual gene locus, and the converged expected number of mappings for each read for each allele of the gene loci or individual gene locus, or, the converged fraction of reads for alleles of the gene loci or individual gene locus, is recognized as optimized data.

2. An optimization method in which mapping of the read information of DNA from a subject to alleles of selected gene loci or an individual gene locus is optimized by a computer, which includes a step to determine for each read an expected number of mappings to alleles of the gene loci or individual gene locus, and a step to estimate allele frequencies θ of the gene loci or individual gene locus, θ being an objective parameter, where θ is T dimensional vector, the T being the number of alleles of the gene loci or individual gene locus, given nucleotide sequences of all reads in data in which read information of DNA derived from alleles of the gene loci or individual gene locus is mixed as observed data R, and which is characterized in that, with respect to (a) a latent variable T.sub.n dependent on θ with respect to allele selection of the gene loci or individual gene locus of read n, and (b) a latent variable S.sub.n dependent on T.sub.n with respect to starting position of read n, and given nucleotide sequence of read n as observed data R.sub.n, at least (i) variables T.sub.n and S.sub.n or (ii) variable T.sub.n is incorporated to calculate an estimate of the parameter in an estimation process of the objective parameter θ, so that observed data R.sub.n depends on the latent variables during the estimation process of the objective parameter θ from observed data R.sub.n.

3. The optimization method according to claim 1 or 2, which is characterized by calculating for each read the expected number of mappings to each allele of the selected gene loci or individual gene locus, and calculating estimated value of allele frequency 0 of the gene loci or individual gene locus, θ being an objective parameter, by executing steps based on a maximum likelihood estimation method, or Bayesian estimation method.

4. The optimization method according to claim 2 or 3, which is characterized by using an indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes latent variables T.sub.n and S.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes the latent variable T.sub.n.

5. The optimization method according to claim 4, which is characterized by executing the following steps (1) to (5): (1) a step in which a first update value of a posterior probability of Z.sub.nts=1 or Z.sub.nt=1 is calculated based on a given initial value of θ.sub.t, and a first updated value of a maximum likelihood estimate of θ.sub.t is further calculated, based on the first updated value of the posterior probability of Z.sub.nts=1 or Z.sub.nt=1, or, (2) a step in which a first updated value of a maximum likelihood estimate of θ.sub.t is calculated, based on a given initial value of a posterior probability of Z.sub.nts=1 or Z.sub.nt=1; (3) a step in which an updated value of the posterior probability of Z.sub.nts=1 or Z.sub.nt=1 is calculated, based on the updated value of the maximum likelihood estimate of θ.sub.t calculated by a previous step (4) (however, for the first time, by step (1) or step (2)); (4) a step in which an updated value of the maximum likelihood estimate of θ.sub.t is calculated, based on the updated value of the posterior probability of Z.sub.nts=1 or Z.sub.nt=1 calculated in step (3); and (5) (i) a step in which log likelihood is calculated based on the updated value of the posterior probability of Z.sub.nts=1 or Z.sub.nt=1 calculated in step (3) and the updated value of the maximum likelihood estimate of θ.sub.t calculated in step (4), to evaluate convergence of the log likelihood, (ii) a step in which convergence of the updated value of the posterior probability of Z.sub.nts=1 or Z.sub.nt=1 calculated in step (3) is evaluated, or (iii) a step in which convergence of the updated value of the maximum likelihood estimate of θ.sub.t calculated in step (4), and, if convergence is recognized, θ.sub.t in each step is determined as a final estimate value, and if convergence is not recognized, iteration of steps (3), (4) and (5) is determined.

6. The optimization method according to claim 4, which is characterized by executing the following steps (1) to (5): (1) a step in which a first update posterior distribution of Z.sub.nts or Z.sub.nt is calculated based on a given initial posterior distribution of θ.sub.t based on the hyperparameter α.sub.0 representing prior information of allele frequencies of the selected gene loci or individual gene locus, and a first updated posterior distribution of θ.sub.t is further calculated based on the first updated posterior distribution of Z.sub.nts or Z.sub.nt, or, (2) a step in which an updated posterior distribution of θ.sub.t is calculated based on a given initial distribution of the Z.sub.nts or Z.sub.nt; (3) a step in which an updated posterior distribution of Z.sub.nts or Z.sub.nt is calculated, based on the updated posterior distribution of θ.sub.t, calculated by the previous step (4) (however, for the first time, by step (1) or step (2)); (4) a step in which an updated posterior distribution of θ.sub.t is calculated, based on the updated posterior distribution of Z.sub.nts or Z.sub.nt calculated in step (3); and (5) a step in which convergence of the updated posterior distribution of θ.sub.t calculated in step (4) is evaluated, and if convergence is recognized, the expected value of θ.sub.t is determined as a final estimate value, and if convergence is not recognized, iteration of steps (3), (4) and (5) is determined.

7. The optimization method according to any one of claims 1 to 6, in which the data in which read information of DNA derived from alleles of the selected gene loci or individual gene locus is mixed, is read information in which mapping of each read to each allele of the gene loci or individual gene locus is identified by mapping read information of a subject to reference nucleotide sequences of alleles of the gene loci or individual gene locus registered in a database, and the mapping is characterized by executing the following steps (a) and (b): (a) a step in which with respect to nucleotide sequence information of reads obtained from the subject, reads are mapped to a human genome nucleotide sequence, and then those mapped to alleles of the gene loci or individual gene locus are extracted; and (b) a step in which the read sequence information mapped to alleles of the gene loci or individual gene locus obtained in step (a) is mapped to the reference nucleotide sequences of alleles of the gene loci or individual gene locus registered in a database, and reads mapped to each allele of the gene loci or individual locus are extracted, and read information is obtained, in which mapping of each read to each allele of the gene loci or individual gene locus is identified.

8. The optimization method according to claim 7, which is characterized in that the mapping performed in steps (a) and (b) allows one read to be mapped to multiple alleles of the selected gene loci or individual gene locus.

9. The optimization method according to claim 7 or 8, which is characterized in that in addition to the reads mapped to alleles of the selected gene loci or individual gene locus obtained in step (a), reads that are not mapped to whole human genome are extracted, and it becomes the target of the re-mapping in step (b).

10. The optimization method according to any one of claims 1 to 9, which is characterized in that the selected gene loci or individual gene locus are MHC gene loci or individual MHC gene locus.

11. The optimization method according to claim 10, which is characterized in that MHC is HLA.

12. A genotype determination method for selected gene loci or an individual gene locus, which is characterized by calculating individual depth of coverage of reads for each allele of the gene loci or individual gene locus from allele frequency of the selected gene loci or individual gene locus obtained from the optimization method recited in any one of claims 1 to 11, selecting top two or less than two alleles of the gene loci or individual gene locus from a list of alleles that are sorted based on the individual depth of coverage by descending order, and determining the alleles as candidates for genotyping of each gene locus of the selected gene loci or the individual gene locus.

13. A genotype determination method for selected gene loci or an individual gene locus, in which individual depth of coverage of reads for each allele of the gene loci or individual gene locus is calculated from allele frequency of the gene loci or individual gene locus obtained from the optimization method recited in any one of claims 1 to 12, top two or less than two alleles of the gene loci or individual gene locus are selected from a list of alleles that are sorted based on the individual depth of coverage by descending order, and the alleles are determined as candidates for genotyping of each gene locus of the selected gene loci or the individual gene locus, and which is characterized by setting a threshold of depth of coverage of alleles at 5-50% of the depth of coverage of total reads as a rejection threshold, and eliminating alleles of the gene loci or gene locus whose individual depth of coverage are less than the threshold from candidates for genotyping of each gene locus of the selected gene loci or the individual gene locus.

14. The determination method according to claim 13, which is characterized in that after eliminating the alleles from candidates for genotyping of each gene locus of the selected gene loci or the individual gene locus, the following (i) or (ii) is determined: (i) for a case that there is one allele that is selected for genotyping of the gene locus of the gene loci or the individual gene locus, if the individual depth of coverage of the allele is twice or more of the rejection threshold, it is determined that the genotype is homozygous of the allele, and if it is less than twice of the rejection threshold, it is determined that the genotype is heterozygous of the allele, and (ii) for the case that there are two alleles that are selected for genotyping of the gene locus of the gene loci or the individual gene locus, if the individual depth of coverage of the one whose individual depth of coverage is larger is less than twice of the smaller one, it is determined that the genotype is heterozygous of the two alleles, or if the individual depth of coverage of the one whose individual depth of coverage is larger is twice or more of the smaller one, it is determined that the genotype is homozygous of the allele whose individual depth of coverage is larger.

15. The genotype determination method according to any one of claims 12 to 14, which is characterized by that the selected gene loci or individual gene locus is MHC gene loci or individual MHC gene locus.

16. The genotype determination method according to claim 15, which is characterized in that MHC is HLA.

17. A computer system for optimizing read information in which read mapping information of each read to each allele of gene loci or an individual locus to be optimized is identified by mapping nucleotide sequence of reads in data in which read information of DNA derived from alleles of the gene loci or individual gene locus is mixed, comprising a recording unit and an arithmetic processing unit, which is characterized in that all or a part of the following processes (A) to (G) is executed: (A) in the recording unit, read information of DNA derived from a subject is recorded as data of nucleotide sequence of read and alleles of the gene loci or individual gene locus to which the read is mapped; (B) in the arithmetic processing unit, based on the information of the recording section, for each read, an expected number of mappings for each allele of the gene loci or individual gene locus is calculated by executing numerical processing; (C) for each allele, the expected number of mappings calculated in the process (B) is summed for each allele of the gene loci or individual gene locus to calculate a total number of expected mappings; (D) for each allele, the total number of expected mappings calculated in the process (C) is divided by the sum of the total number of expected mappings for all alleles of the gene loci or individual gene loci, and the process of calculating the fraction of reads allocated to each allele of the gene loci or individual gene locus with respect to the total amount of reads mapped to alleles of the gene loci or individual gene locus is executed; (E) the fraction of reads calculated in the above process (C) is assigned as frequency to each allele of the gene loci or individual gene locus, and given the assignment frequency, the process (B) is executed again in which for each read, the expected number of mappings to each allele of the gene loci or individual gene locus is calculated; (F) the process (C) or (D) is executed again for the new expected number of mappings calculated by the process (E), and the process of calculating the fraction of reads allocated to each allele of the gene loci or individual gene locus with respect to the total amount of reads mapped to alleles of the gene loci or the individual locus is executed; and (G) the processes (E) and (F) are repeatedly executed, until for all reads, difference between the number of expected mappings to each allele of the gene loci or individual gene locus calculated in the process (E) and that calculated in the previous process (E) is not recognized, or for all alleles of the gene loci or individual gene locus, difference between the fraction of reads calculated in the process (F) and that calculated in the previous process (F) is not recognized, and a converged number of expected mappings for each read to each allele of the gene loci or individual gene locus, or the converged value of the fraction of reads for each allele of the gene loci or individual gene locus, is recognized as the optimized data.

18. A computer system for optimizing, for each read, the number of expected mappings to each allele of selected gene loci or individual gene locus for data in which read information of DNA derived from alleles of the gene loci or individual locus is mixed, comprising a recording unit and an arithmetic processing unit, which is characterized in that all or a part of the following processes (A) to (E) is executed: (A) in the recording unit, read information of DNA derived from a subject is recorded as observed data of nucleotide sequence of read and alleles of the gene loci or individual gene locus to which the read is mapped; (B) in the arithmetic processing unit, based on the information of the recording unit, either one of the following initialization processes (B)-1 and (B)-2 is executed: (B)-1: the process of calculating an initial value of θ, which represents allele frequency of the gene loci or individual gene locus, and (B)-2: the process of calculating initial values of two latent variables, (a) and (b) described below, that are related to the θ described above and the data of DNA read information of the subject, which is the observed data: (a) a variable T.sub.n that is dependent on θ, which represents allele selection of read n in the gene loci or individual gene locus, (b) a posterior probability of an indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes S.sub.n and a start position of read n that is dependent on T.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes the latent variable T.sub.n; (C) in the arithmetic processing unit, based on the parameter θ calculated in the process (B)-1, the calculation process of the posterior probability of indicator variable Z.sub.nts or Z.sub.nt=1 is executed; (D) in the arithmetic processing unit, a first updated value of the maximum likelihood estimate of parameter θ is calculated based on the posterior probability of indicator variable Z.sub.nts or Z.sub.nt=1 calculated in the process (B)-2 or (C); and (E) in the arithmetic processing unit, the processes (C) and (D) are executed again based on the first updated value of the maximum likelihood estimate of parameter θ calculated in the process (D), and an iterative loop process for calculating a second updated value of parameter θ is repeatedly executed, until difference between the new updated value of θ and the previous updated value of θ is not recognized, and the converged parameter θ is recorded as an optimized value in the recording unit.

19. A computer system for optimizing, for each read, the number of expected mappings to each allele of selected gene loci or an individual gene locus for data where read mapping information of the gene loci or individual gene locus is mixed, comprising a recording unit and an arithmetic processing unit, which is characterized in that all or a part of the following processes (A) to (E) is executed: (A) in the recording unit, read information of DNA derived from the subject is recorded as observed data of nucleotide sequence of read and alleles of the gene loci or individual gene locus to which the read is mapped, (B) in the arithmetic processing unit, based on the observed data retrieved from the recording unit, either one of the following initialization processes (B)-1 and (B)-2 is executed: (B)-1: the process of calculating an initial value of posterior distribution of θ.sub.t, based on an initial value of hyperparameter α.sub.t which represents prior information of allele frequency of the gene loci or individual gene locus, and (B)-2: the process of calculating initial distribution of two latent variables, (a) and (b) described below, that are related to the distribution of θ described-above and the data of DNA read information of the subject, which is observed data: (a) a variable T.sub.n that is dependent on θ, which represents allele selection of read n in the gene loci or individual gene locus, and (b) posterior distribution of an indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes S.sub.n, the start position of read n that is dependent on T.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes the latent variable T.sub.n; (C) in the arithmetic processing unit, based on the distribution of parameter θ calculated in the process (B)-1, the calculation process of the posterior distribution of indicator variable Z.sub.nts or Z.sub.nt is executed; (D) in the arithmetic processing unit, the first updated posterior distribution of parameter θ is calculated based on the posterior distribution of Z.sub.nts or Z.sub.nt calculated in the process (B)-2 or (C), (E) in the arithmetic processing unit, the processes (C) and (D) are executed again based on the first updated value of the posterior distribution of θ, and an iterative loop process for calculating a second updated posterior distribution of θ is repeatedly executed, until difference between the expected value of the new updated posterior distribution of θ and the expected value of the previous updated posterior distribution of θ is not recognized, and the converged expected value of the posterior distribution of θ is recorded as the optimized value in the recording unit.

20. The computer system according to any one of claims 17 to 19, which is characterized in that the data in which read information of DNA derived from alleles of the selected gene loci or individual gene locus is mixed is read information in which mapping of each read to each allele of the gene loci or individual gene locus is identified by mapping read information of the subject to nucleotide sequences of alleles of the gene loci or individual gene locus registered in a database, and the mapping is executed by the following processes (a) and (b): (a) a process in which with respect to nucleotide sequence information of reads obtained from a subject, reads are mapped to human genome nucleotide sequence, and then those mapped to alleles of the gene loci or individual gene locus are extracted; and (b) a process in which the read sequence information mapped to alleles of the gene loci or individual gene locus obtained in process (a) is mapped to reference nucleotide sequences of alleles of the gene loci or individual gene locus that are registered in a database, and read information is obtained, in which mapping information and mapping states for each read to each allele of the gene loci or individual gene locus are identified.

21. The computer system according to claim 20, which is characterized in that the mapping performed in process (b) allows one read to be mapped to multiple alleles of the selected gene loci or individual gene locus.

22. The computer system according to claim 20 or 21, which is characterized in that in addition to the reads mapped to alleles of the selected gene loci or individual gene locus obtained in process (a), reads that are not mapped to the whole human genome are extracted, and it becomes the target of the re-mapping in process (b).

23. A computer system for determining genotype of each gene locus of selected gene loci or individual gene locus of a subject, comprising a recording unit and an arithmetic processing unit, which is characterized in that all or a part of the following processes (α) to (δ) is executed: (α) in the recording unit, at least allele frequencies and depth of coverage of total reads of the gene loci or individual gene locus of the subject that are obtained by the optimization method recited in any one of claims 1 to 11 are recorded; (β) in the arithmetic processing unit, based on the allele frequencies of the gene loci or the individual gene locus recorded in the recording unit, processing of calculating individual depth of coverage of each allele of the gene loci or individual gene locus, and processing of allocating the calculated individual depth of coverage to the alleles of the gene loci or the individual gene locus, are executed; (γ) given a rejection threshold value, which is set as a frequency number of 5 to 50% of the average depth of coverage of all reads, a process in which alleles of the gene loci or individual gene locus whose individual depth of coverage is smaller than the threshold value are excluded from candidates for genotyping, is executed; (δ): (δ)-1 after the exclusion process of (γ), for a case that there is one allele that is selected for genotyping the gene locus of the selected gene loci or the individual gene locus, if the individual depth of coverage of the allele is twice or more of the rejection threshold value, it is determined that the genotype is homozygous of the allele, and if it is less than twice of the rejection threshold value, it is determined that the genotype is heterozygous of the allele, and (δ)-2 after the exclusion process of (γ), for a case that there are two alleles that are selected for genotyping the gene locus of the selected gene loci or the individual gene locus, if the individual depth of coverage of the one whose individual depth of coverage is larger is less than twice of the smaller one, it is determined that the genotype is heterozygous of the two alleles, or if the individual depth of coverage of the one whose individual depth of coverage is larger is twice or more of the smaller one, it is determined that the genotype is homozygous of the allele whose individual depth of coverage is larger.

24. The computer system according to any one of claims 17 to 23, which is characterized in that the selected gene loci or individual gene locus is MHC gene loci or individual MHC gene locus.

25. The computer system according to claim 24, which is characterized in that MHC is HLA.

26. A computer program for optimizing read information in which mapping correspondence of each read to alleles of a selected gene loci or an individual gene locus is obtained by mapping a nucleotide sequence of read data where read information of DNA derived from alleles of the selected gene loci or an individual gene locus is mixed, which is characterized by including algorithms to realize all or a part of the following functions (1) to (7) in a computer: (A) the function (1) in which read information of DNA derived from a subject recorded in a recording unit as data of nucleotide sequence of read and alleles of the gene loci or individual gene locus to which the read is mapped is retrieved from the recording unit; (B) the function (2) in which based on the read information retrieved in the function (1), for each read, an expected number of mappings for each allele of the gene loci or individual gene locus is calculated by executing numerical processing; (C) the function (3) in which for each allele, the expected number of mappings quantified in the function (2) is summed for each allele of the gene loci or individual gene locus to calculate the total number of expected mappings; (D) the function (4) in which for each allele, the total number of expected mappings calculated in the function (3) is divided by the sum of the total number of expected mappings for all alleles of the gene loci or individual gene loci, and the process of calculating the fraction of reads allocated to each allele of the gene loci or individual gene locus with respect to the total amount of reads mapped to alleles of the gene loci or individual gene locus is executed; (E) the function (5) in which the fraction of reads calculated in the function (4) is assigned as frequency to each allele of the gene loci or individual gene locus, and given the assignment frequency, the function (2) is executed again in which for each read, the expected number of mappings to each allele of the gene loci or individual gene locus is calculated; (F) the function (6) in which the function (3) or (4) is executed again for the newly obtained expected number of mappings calculated by the function (5), and the process of calculating the fraction of reads allocated to each allele of the gene loci or individual gene locus with respect to the total amount of reads mapped to alleles of the gene loci or the individual locus is executed; and (G) the function (7) in which the functions (5) and (6) are repeatedly executed, until for all reads, difference between the number of expected mappings to each allele of the gene loci or individual gene locus calculated in the function (5) and that calculated in the previous function (5), is not recognized, or for all alleles of the gene loci or individual gene locus, difference between the value of the fraction of reads calculated in the function (6) and that calculated in the previous function (6), is not recognized, and the converged number of expected mappings for each read to each allele of the gene loci or individual gene locus, or the converged value of the fraction of reads for each allele of the gene loci or individual gene locus, is recognized as an optimized data.

27. A computer program for optimizing, for each read, the number of expected mappings to each allele of selected gene loci or individual gene locus for data in which read information of DNA derived from alleles of the gene loci or gene locus is mixed, which is characterized by including algorithms for realizing all or a part of the following functions 1 to 5: (A) function 1 in which read information of DNA derived from a subject recorded in a recording unit as observed data of nucleotide sequence of read and alleles of the gene loci or individual gene locus to which the read is mapped, is retrieved from the recording unit; (B) function 2 in which based on the observed data retrieved by function 1, either one of the following initialization processes (B)-1 and (B)-2 is executed: (B)-1: the process of calculating an initial value of θ, which represents allele frequency of the gene loci or individual gene locus, and (B)-2: the process of calculating initial values of two latent variables, (a) and (b) described below, that are related to the θ described above and the data of DNA read information of the subject, which is the observed data: (a) a variable T.sub.n that is dependent on θ, which represents allele selection of read n in the gene loci or individual gene locus, and (b) posterior probability of an indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes S.sub.n which is start position of read n that is dependent on T.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes a latent variable T.sub.n; (C) function 3 in which based on the parameter θ calculated in the process (B)-1, the calculation process of the posterior probability of indicator variable Z.sub.nts or Z.sub.nt=1 is executed; (D) function 4 in which a first updated value of maximum likelihood estimate of parameter θ is calculated based on the posterior probability of Z.sub.nts or Z.sub.nt=1 calculated in process (B)-2 or function 3; and (E) function 5 in which function 3 and 4 are executed again based on the first updated value of the maximum likelihood estimate of θ, and the iterative loop process for calculating a second updated value of θ, is repeatedly executed, until difference between the new updated value of θ and the previous updated value of θ is not recognized, and the converged parameter θ is recorded as an optimized value in the recording unit.

28. A computer program for optimizing, for each read, the number of expected mappings to each allele of selected gene loci or individual gene locus for data in which read information of DNA derived from alleles of the gene loci or gene locus is mixed, which is characterized by including algorithms for realizing all or a part of the following functions 1 to 5 in a computer: (A) function 1 in which read information of DNA derived from a subject recorded in a recording unit as observed data of nucleotide sequence of read and alleles of the gene loci or individual gene locus to which the read is mapped, is retrieved from the recording unit; (B) function 2 in which based on the observed data retrieved by function 1, either one of the following initialization processes (B)-1 and (B)-2 is executed: (B)-1: a process of calculating an updated value of posterior distribution of θ.sub.t, based on an initial value of hyperparameter α.sub.t which represents prior information of allele frequency of the gene loci or individual gene locus, and (B)-2: a process of calculating initial distributions of two latent variables, (a) and (b) described below, that are related to the distribution of θ described above and the data of DNA read information of the subject, which is observed data: (a) a variable T.sub.n that is dependent on θ, which represents allele selection of read n in the gene loci or individual gene locus, (b) posterior distribution of indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes S.sub.n which is start position of read n that is dependent on T.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes latent variable T.sub.n; (C) function 3 in which based on the distribution θ calculated in the process (B)-1, the calculation process of the posterior distribution of indicator variable Z.sub.nts or Z.sub.nt is executed; (D) function 4 in which a first updated value of the posterior distribution of θ is calculated based on the posterior distribution of Z.sub.nts or Z.sub.nt calculated in function (B)-2 or function 3; and (E) function 5 in which function 3 and 4 are executed again based on the first updated value of the posterior distribution of θ, and an iterative loop process for calculating a second updated value of the posterior distribution of θ is repeatedly executed, until difference between an expected value of the new updated posterior distribution of θ and an expected value of the previous updated posterior distribution of θ is not recognized, and the converged expected value of posterior distribution of θ is recorded as an optimized value in the recording unit.

29. The computer program according to any one of claims 26 to 28, in which the data in which read information of DNA derived from alleles of the selected gene loci or individual gene locus is mixed, is read information in which mapping of each read to each allele of the gene loci or individual gene locus is identified by mapping read information of the subject to nucleotide sequences of alleles of the gene loci or individual gene locus registered in a database, which is characterized in that the mapping is realizing by executing the following functions (a) and (b): (a) a function in which with respect to nucleotide sequence information of reads obtained from the subject, reads are mapped to human genome nucleotide sequence, and then those mapped to alleles of the gene loci or individual gene locus are extracted; and (b) a function in which with respect to the read sequence information mapped to alleles of the gene loci or individual gene locus obtained in function (a), reads are mapped to reference nucleotide sequences of alleles of the gene loci or individual gene locus that are registered in a database, and reads mapped to each allele of the gene loci or individual locus are extracted, and read information is obtained, in which mapping of each read to each allele of the gene loci or individual gene locus is identified.

30. The computer program according to claim 29, which is characterized in that the mapping performed in function (b) allows one read to be mapped to multiple alleles of the selected gene loci or individual gene locus.

31. The computer program according to claim 29 or 30, which is characterized in that in addition to the reads mapped to alleles of the selected gene loci or individual gene locus obtained in function (a), reads that are not mapped to the whole human genome are extracted, and it becomes the target of the re-mapping in function (b).

32. A computer program for determining genotype of a gene locus of selected gene loci or an individual gene locus of a subject, which is characterized by including algorithms for realizing the following functions (α) to (δ) in a computer: (α) a function α, in which at least allele frequencies and depth of coverage of total reads of the gene loci or individual gene locus of the subject that are obtained by executing the computer program recited in any one of claims 26 to 31, are retrieved; (β) a function β, in which based on the allele frequencies of the gene loci or the individual gene locus retrieved by executing function α, calculation of individual depth of coverage of each allele of the gene loci or individual gene locus, and allocation of the calculated individual depth of coverage to the alleles of the gene loci or the individual gene locus, are executed; (γ) a function γ, in which given a rejection threshold value, which is set as a frequency number of 5 to 50% of average depth of coverage of all reads, a process in which alleles of the gene loci or individual gene locus whose individual depth of coverage specified by the function β is smaller than the threshold value are excluded from candidates for genotyping, is executed; and (δ): a function δ specified as (δ)-1 and (δ)-2 below: (δ)-1 after the exclusion process in function (γ), for a case that there is one allele that is selected for genotyping the gene locus of the selected gene loci or the individual gene locus, if the individual depth of coverage of the allele is twice or more of the rejection threshold value, a process in which it is determined that the genotype is homozygous of the allele, and if it is less than twice of the rejection threshold value, it is determined that the genotype is heterozygous of the allele, and (δ)-2 after the exclusion process in function (γ), for a case that there are two alleles that are selected for genotyping the gene locus of the selected gene loci or the individual gene locus, if the individual depth of coverage of the one whose individual depth of coverage is larger is less than twice of the smaller one, a process in which it is determined that the genotype is heterozygous of the two alleles, or if the individual depth of coverage of the one whose individual depth of coverage is larger is twice or more of the smaller one, it is determined that the genotype is homozygous of the allele whose individual depth of coverage is larger, is executed.

33. The computer program according to any one of claims 26 to 32, which is characterized in that the selected gene loci or individual gene locus is MHC gene loci or individual MHC gene locus.

34. The computer program according to claim 33, which is characterized in that MHC is HLA.

35. A computer-readable recording medium, wherein the computer program recited in any one of claims 26 to 34 is recorded.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0146] FIG. 1 is a diagram showing a process flow of the present invention.

[0147] FIG. 2-1 is a flow sheet showing one embodiment of the optimization process of HLA-read mapping information with a latent variable Z.sub.nt.

[0148] FIG. 2-2 is a flow sheet of the EM algorithm showing one embodiment of the optimization process of HLA-read mapping information with a latent variable Z.sub.nts.

[0149] FIG. 2-3 is a flow sheet of the variational Bayesian algorithm showing one embodiment of the optimization process of HLA-read mapping information with a latent variable Z.sub.nts.

[0150] FIG. 3 is a flow sheet showing one example of the HLA typing process based on “the fraction of reads allocated to each HLA allele” obtained through the above optimization process.

[0151] FIG. 4 is a figure showing prediction accuracy of HLA types at four-digit resolution at various depths of coverage using the simulation data.

[0152] FIG. 5-1 is a graph showing the results of examining allele frequencies of HLA-A by the system of the present invention and another method.

[0153] FIG. 5-2 is a graph showing the results of examining allele frequencies of HLA-B by the system of the present invention and another method.

[0154] FIG. 5-3 is a graph showing the results of examining allele frequencies of HLA-C by the system of the present invention and another method.

MODES OF CARRYING OUT THE INVENTION

[0155] [A] Estimation Methods with the EM Algorithm and Variational Bayesian Method

[0156] Here, the contents of the EM algorithm and the variational Bayesian method disclosed in the previous application will be described in more detail. Symbols to be used are as described before unless otherwise noted.

(1) EM (Expectation-Maximization) Algorithm

[0157] The EM algorithm is a method for obtaining the maximum likelihood estimate of the parameter in the probability model in which the latent variable exists. In the estimation method of the present invention, for example, it can be done with the following contents.

[0158] That is, the estimation method of the present invention by the EM method is exemplified as the estimation method of the present invention characterized by performing the following steps (a) to (e) (this method is also referred to as “the EM method of the present invention”). [0159] (a) A step in which the first updated value of a posterior probability of Z.sub.nts=1 or Z.sub.nt=1 is calculated, based on a given initial value of θ.sub.t, and the first updated value of the maximum likelihood estimate of θ.sub.t is further calculated, based on the first updated value of the posterior probability of Z.sub.nts=1 or Z.sub.nt=1, or, (b) A step in which an updated value of the maximum likelihood estimate of θ.sub.t is calculated, based on a given initial value of the posterior probability of the Z.sub.nts=1 or Z.sub.nt=1, [0160] (c) A step in which an updated value of the posterior probability of Z.sub.nts=1 or Z.sub.nt=1 is calculated, based on the updated value of the maximum likelihood estimate of θ.sub.t calculated by the previous step (d) (however, for the first time, it is step (a) or step (b)), [0161] (d) A step in which an updated value of the maximum likelihood estimate of θ.sub.t is calculated, based on the updated value of the posterior probability of Z.sub.nts=1 or Z.sub.nt=1 calculated in step (c), and [0162] (e) (i) A step in which convergence of the log likelihood is evaluated by calculating the log likelihood based on the posterior probability of Z.sub.nts=1 or Z.sub.nt=1 calculated in step (c) and θ.sub.t calculated in step (d),

[0163] (ii) A step in which convergence of the updated posterior probability of Z.sub.nts=1 or Z.sub.nt=1 calculated in step (c) is evaluated, or

[0164] (iii) A step in which convergence of the updated maximum likelihood estimate of θ.sub.t calculated in step (d) is evaluated, and,

if convergence is recognized, θ.sub.t in each step is determined as the final estimate value, and if convergence is not recognized, iteration of steps (c), (d) and (e) is determined.

[0165] The step (a) described above is a step of defining the initial value of this method for iterative optimization.

[0166] The initial value of n in Z.sub.nts=1 or Z.sub.nt=1 is a number from 1 to N, t is a number from 1 to T. The value of s is a number 1≦s≦l.sub.t−L+1, where l.sub.t is the length of allele t of the particular gene locus, L is the read length. If a read is mapped to multiple locations, it is preferable to set the initial value of Z.sub.nts=1 or Z.sub.nt=1 as 1/[The number of locations to which read n is mapped] by assuming that the mappings are equally likely. It is preferable to set the initial value of θ.sub.t as 1/T by assuming that an allele frequency of each gene locus is the same. When the prior information of allele frequency of a specific gene locus is already known, it is also possible to use this information as the initial value of θ.sub.t, for example, when the subject population is a specific racial group.

[0167] The step (b) is a step in which r.sub.nts, the posterior probability of Z.sub.nts=1 or r.sub.nt, the posterior probability of Z.sub.nt=1, is calculated based on the current estimated value (for the first iteration, it is calculated based on the initial value obtained in the step (a)).

[0168] That is, it is a step of calculating P (Z.sub.nts=1|θ*.sub.t) or P (Z.sub.nt=1|θ*.sub.t) (where * indicates that θ.sub.t has been updated. The same notation follows below.), in which r.sub.nts is calculated as

[00006] [ Mathematical .Math. .Math. 13 ] r nts = { ρ nts .Math. ( t , s ) .Math. π n .Math. .Math. ρ nt .Math. s if .Math. .Math. ( t , s ) π n 0 otherwise .

[0169] Here, for the single-end reads, ρ.sub.nts is calculated as


log ρ.sub.nts=log θ.sub.t+log p(S.sub.n|T.sub.n)+log p(R.sub.n|T.sub.n,S.sub.n)   [Mathematical 14]

(where the first term on the right side can be calculated by taking the logarithm of Expression (1) (a), the second term can be calculated by taking the logarithm of Expression (1) (c), and the third term can be calculated by taking the logarithm of Expression (2).)

[0170] For the paired-end reads, ρ.sub.nts is calculated as

[00007] .Math. [ Mathematical .Math. .Math. 15 ] log .Math. .Math. ρ nts = .Math. f = L l t - s + 1 .Math. .Math. { log .Math. .Math. θ t + log .Math. .Math. p ( F n = f T n ) + log .Math. .Math. p ( S n T n , F n = f ) + log .Math. .Math. p ( R na T n , S n ) + log .Math. .Math. p ( R nb T n , F n , S n ) }

(where the first term within the brackets on the right side can be calculated by taking the logarithm of Expression (3) (a), the second term can be calculated by taking the logarithm of Expression (3) (a′), the third term can be calculated by taking the logarithm of Expression (3) (b), the fourth term can be calculated by taking the logarithm of Expression (4-1), and the fifth term can be calculated by taking the logarithm of Expression (4-2).)

[0171] On the other hand, because Z.sub.nt considers all the Z.sub.nts described above with respect to all the possible s for each t, r.sub.nt can be calculated as

[00008] [ Mathematical .Math. .Math. 16 ] r nt = .Math. s .Math. .Math. r nts

[0172] This corresponds to the E step of the EM method of the present invention.

[0173] The step (c) is a step corresponding to the M step of EM method of the present invention, in which based on r.sub.nts or r.sub.nt which is the posterior probability of Z.sub.nts=1 or Z.sub.nt=1 obtained in step (b), the maximum likelihood estimate θ.sub.t that maximizes the log likelihood is calculated as follows.

[00009] [ Mathematical .Math. .Math. 17 ] θ t = .Math. n , t = t , s .Math. .Math. r n .Math. t .Math. s N .Math. .Math. or , [ Mathematical .Math. .Math. 18 ] θ t = .Math. n , t = t .Math. .Math. r n .Math. t N

[0174] The steps (d) and (e) are as described above. As a preferred example of the convergence criterion, the relative change 10.sup.−3 is used as the convergence criterion for allele frequencies of the particular gene loci or individual gene locus for the abundance parameter θ.sub.t>10.sup.−7, but different convergence criteria can be also used.

[0175] Usually, “iteration of steps (c), (d), and (e)” in step (e) is performed once or more, but it is also possible to do genotyping of alleles of the particular gene loci or individual gene locus without executing the iterative step, or with terminating the step when parameters are not converged.

(2) Variational Bayesian Method

[0176] Variational Bayesian method is a method for estimating the posterior probability distribution of parameters by Bayesian estimation method, which enables robust estimation of parameters against noise.

[0177] The mathematical expression of the complete likelihood and posterior joint distribution of the estimation method of the present invention represented by the expression (1) involves integration over all possible latent variables Z and can not be solved analytically. Therefore, in the complete likelihood and posterior joint distribution, to obtain approximate solutions, the factorization of the latent variables and the model parameters is assumed as below.


q(θ, Z)≈q(θ).Math.q(Z)   [Mathematical 19]

[0178] In the present invention, we use Dirichlet distribution as a prior distribution of θ, as described above.

[00010] [ Mathematical .Math. .Math. 20 ] p ( θ ) = 1 G ( α ) .Math. .Math. t = 1 T .Math. .Math. θ t α t - 1 ( III )

[where Π.sup.T.sub.t=1θ.sub.t=1, T is the number of alleles of the particular gene loci, α.sub.t is a hyperparameter,

[00011] [ Mathematical .Math. .Math. 21 ] G ( α ) = .Math. t .Math. .Math. Γ ( α t ) Γ ( .Math. t .Math. .Math. α t ) ,

and Γ(.Math.) is gamma function.]

[0179] In the present invention, based on the assumption that there is no prior knowledge on the relative differences in the allele frequencies of particular gene loci or an individual gene locus such as HLA loci or locus, HLA being human a MHC, it is preferred that a single hyperparameter α.sub.0 is used for all the alleles of the particular gene loci or individual gene locus. The single hyperparameter α.sub.0 controls the complexity of the target parameter, that is, the number of θ.sub.t>0. When α.sub.0≧1, α.sub.0−1 can be interpreted as a prior count of the read assigned to the allele of the particular gene loci or individual gene locus, and when α.sub.0<1, the prior distribution favors a tendency for some of the allele frequencies at the particular gene loci or individual gene locus to be zero. Specifically, it is preferable to set 0<α.sub.0≦0.1, or α.sub.0 that maximizes the lower bound of the log marginal likelihood. When prior information of allele frequencies is known, it is also possible to weight α.sub.t>0 so that α.sub.t whose allele frequency is higher in prior information is given a larger value in descending order for each t. Given the hyperparameter α.sub.0, the lower bound of the log marginal likelihood is iteratively maximized by the variational Bayesian estimation algorithm.

[0180] The estimation method of the present invention by variational Bayesian method is exemplified as the estimation method of the present invention characterized by carrying out the following steps (a) to (d) (this method is referred to as a “variational Bayesian method of the present invention”). [0181] (a) A step in which the first update posterior distribution of Z.sub.nts or Z.sub.nt is calculated, based on a given initial posterior distribution of θ.sub.t based on a given initial value of the hyperparameter α.sub.0 representing prior information of allele frequencies of the particular gene loci or individual gene locus, and the first updated posterior distribution of θ.sub.t is further calculated, based on the first updated posterior distribution of Z.sub.nts or Z.sub.nt, or, (b) A step in which the first updated posterior distribution of θ.sub.t is calculated, based on a given initial distribution of Z.sub.nts or Z.sub.nt, [0182] (c) A step in which an updated posterior distribution of Z.sub.nts or Z.sub.nt is calculated, based on the updated posterior distribution of θ.sub.t calculated by the previous step (d) (however, for the first time, it is step (a) or step (b)), [0183] (d) A step in which an updated posterior distribution of θ.sub.t is calculated, based on the updated posterior distribution of Z.sub.nts or Z.sub.nt calculated in step (c), and [0184] (e) A step in which convergence of an expected value of the updated posterior distribution of θ.sub.t calculated in step (d) is evaluated, and if convergence is recognized, the expected value of θ.sub.t is determined as the final estimate value, and if convergence is not recognized, iteration of steps (c), (d) and (e) is determined.

[0185] The step (a) described above is a step of defining the initial value of this method for iterative optimization. Specifically, it is preferable to set α*.sub.t=(N/T+α.sub.0)/Σ.sub.t(N/T+α.sub.0) for each allele t of the particular gene loci or individual gene locus, where α*.sub.t is a parameter of q*(θ), assuming that each allele frequency of the particular gene loci or individual gene locus is equal.

[0186] The step (b) is a step to calculate E.sub.z [Z.sub.nts] or E.sub.z [Z.sub.nt] given the current estimate of q*(θ). The value for n in E.sub.z [Z.sub.nts] is the number from 1 to N, the value for t is the number from 1 to T, the value for s is the number 1≦s≦l.sub.t−L+1, where l.sub.t is the length of allele t of the particular gene loci or individual gene locus, and L is the read length. E.sub.z[Z.sub.nts] is calculated based on the current estimate of q*(θ) as below.

[00012] [ Mathematical .Math. .Math. 22 ] E Z ( Z nts ] = { ρ nts .Math. ( t , s ) .Math. π n .Math. .Math. ρ nt .Math. s if .Math. .Math. ( t , s ) π n 0 otherwise

[0187] Here, for the single-end reads, ρ.sub.nts is calculated as


log ρ.sub.nts=E.sub.θ[log θ.sub.t]+log p(S.sub.n|T.sub.n)+log p(R.sub.n|T.sub.n,S.sub.n)   [Mathematical 23]

(where the first term on the right side can be calculated by taking an expected value of the logarithm of Expression (1)(a), the second term can be calculated by taking the logarithm of Expression (1)(c), and the third term can be calculated by taking the logarithm of Expression (2).)

[0188] For the paired-end reads, ρ.sub.nts is calculated as

[00013] .Math. [ Mathematical .Math. .Math. 24 ] log .Math. .Math. ρ nts = .Math. f = L l t - s + 1 .Math. .Math. { E θ [ log .Math. .Math. θ t ] + log .Math. .Math. p ( F n = f T n ) + log .Math. .Math. p ( S n T n , F n = f ) + log .Math. .Math. p ( R na T n , S n ) + log .Math. .Math. p ( R nb T n , F n , S n ) }

(where the first term within the brackets on the right side can be calculated by taking an expected value of the logarithm of Expression (3)(a), the second term can be calculated by taking the logarithm of Expression (3)(a′), the third term can be calculated by taking the logarithm of Expression (3)(b), the fourth term can be calculated by taking the logarithm of Expression (4-1), and the fifth term can be calculated by taking the logarithm of Expression (4-2).)

[0189] On the other hand, because Z.sub.nt considers all the Z.sub.nts described above with respect to all the possible s for each t, r.sub.nt can be calculated as

[00014] [ Mathematical .Math. .Math. 25 ] r nt = .Math. s .Math. .Math. r nts

[0190] It is also possible to perform a series of calculations using Z.sub.nt instead of Z.sub.nts, and to execute the subsequent steps.

[0191] Here,

[00015] [ Mathematical .Math. .Math. 26 ] E θ [ log .Math. .Math. θ t ] = ψ ( α t * ) - ψ ( .Math. t .Math. .Math. α t * )

(where ψ(.Math.) is digamma function.)

[0192] The step (c) is a step of calculating E.sub.θ[θ.sub.t] given the current estimate of q*(Z).

[0193] E.sub.θ[θ.sub.t] is calculated based on the current estimate of q*(Z) as follows.

[00016] [ Mathematical .Math. .Math. 27 ] E θ [ θ t ] = α t * .Math. t .Math. .Math. α t * .Math. .Math. ( where .Math. .Math. α t * = α 0 + .Math. n , t = t , s .Math. .Math. E Z [ Z n , t , s ] )

[0194] Therefore, q*(θ) is also a Dirichlet distribution, and the prior distribution p(θ) is a conjugate prior distribution.

[0195] In the steps (d), as a preferred example of the convergence criterion, the relative change 10.sup.−3 is used as the convergence criterion for allele frequencies of the particular gene loci or individual gene locus for the expected value of the abundance parameter, E.sub.θ[θ.sub.t]>10.sup.−7, but different convergence criteria can be also used.

[0196] Here, we investigate the lower bound of the log marginal likelihood and show that the converged value obtained by updating the variational Bayesian method of the present invention approximates the true objective parameter θ.

[0197] The log marginal likelihood of the present invention is decomposed as

[00017] [ Mathematical .Math. .Math. 28 ] log .Math. .Math. p ( R ) = L ( q ) + KL ( q .Math. .Math. p ) .Math. .Math. ( where [ Mathematical .Math. .Math. 29 ] L ( q ) = q ( θ , Z ) .Math. log .Math. .Math. p ( R , θ , Z ) q ( θ , Z ) .Math. d .Math. .Math. θ .Math. .Math. dZ , .Math. KL ( q .Math. .Math. p ) = - q ( θ , Z ) .Math. log .Math. .Math. p ( θ , Z R ) q ( θ , Z ) .Math. d .Math. .Math. θ .Math. .Math. dZ )

[0198] Since KL (q∥p) is the Kullback-Leibler divergence between q(θ, Z) and p(θ, Z|R), the lower bound of the log marginal likelihood is given by L(q). That is, since the Kullback-Leibler divergence is always 0 or more, L(q) constitutes the lower bound of the log marginal likelihood. It is generally proved that L(q) (the lower bound of the log marginal likelihood) is increased each time the steps (b) and (c) are repeatedly updated (Bishop C M. Pattern Recognition and Machine Learning. Springer Science: Business Media, LLC, New York, N.Y., USA; 2006).

[0199] Based on the factorization assumption


q(θ, Z)≈q(θ) q(Z)   [Mathematical 30]

as described above, L(q) is calculated as follows.

[00018] [ Mathematical .Math. .Math. 31 ] L ( q ) = E [ log .Math. .Math. p ( R , θ , Z ) ] - E [ log .Math. .Math. p ( θ , Z ) ] = E [ log .Math. .Math. p ( R , Z θ ) ] + E [ log .Math. .Math. p ( θ ) ] - E [ log .Math. .Math. q ( θ ) ] - E [ log .Math. .Math. q ( Z ) ] .Math. .Math. ( where [ Mathematical .Math. .Math. 32 ] E [ log .Math. .Math. p ( R , Z θ ) ] = .Math. n , t , s .Math. .Math. E Z [ Z nts ] .Math. log .Math. .Math. ρ nts , .Math. E [ log .Math. .Math. p ( θ ) ] = .Math. t .Math. .Math. ( α 0 - 1 ) .Math. E θ [ log .Math. .Math. θ t ] - log .Math. .Math. G ( α ) , .Math. E [ log .Math. .Math. q ( θ ) ] = .Math. t .Math. .Math. ( α t * - 1 ) .Math. E θ [ log .Math. .Math. θ t ] - log .Math. .Math. G ( α * ) , .Math. E [ log .Math. .Math. q ( Z ) ] = .Math. n , t , s .Math. .Math. E Z [ Z nts ] .Math. log .Math. .Math. E Z [ Z nts ] )

[0200] Usually, “iteration of steps (c), (d), and (e)” in step (e) is performed once or more, but it is also possible to do genotyping of alleles of the particular gene loci or individual gene locus without executing the iterative step, or with terminating the step when parameters are not converged.

(3) The Computer System and Computer Program using the EM Algorithm [0201] (i) As a computer system using the EM algorithm described above, for example, the following computer system can be described. An embodiment of the computer system of the present invention is a computer system characterized in that a computer executes an estimation method using the EM algorithm described above.

[0202] The computer system of the present invention is a computer system for optimizing, for each read, the number of expected mappings to each allele of the gene loci or individual gene locus for the data where read mapping information of the particular gene loci or individual gene locus is mixed, comprising a recording unit and an arithmetic processing unit, which is characterized in that all or a part of the following processes (A) to (E), is executed: [0203] (A) In the recording unit, read information of DNA derived from the subject is recorded as observed data of nucleotide sequence of read and alleles of the gene loci or individual gene locus to which the read is mapped to, [0204] (B) In the arithmetic processing unit, based on the observed data retrieved from the recording unit, either one of the following initialization processes (B)-1 and (B)-2 is executed:

[0205] (B)-1: The process of calculating the initial value of θ, which represents allele frequency of the gene loci or individual gene locus, and

[0206] (B)-2: The process of calculating the initial values of two latent variables, (a) and (b) described below, that are related to the θ described above and the data of DNA read information of the subject, which is the observed data:

[0207] (a) A variable T.sub.n that is dependent on θ, which represents allele selection of read n in the gene loci or individual gene locus, and

[0208] (b) The posterior probability of an indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes S.sub.n, the start position of read n that is dependent on T.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes the latent variable T.sub.n, [0209] (C) In the arithmetic processing unit, based on the parameter θ calculated in the process (B)-1 described above, the calculation process of the posterior probability of indicator variable Z.sub.nts or Z.sub.nt=1 is exerted, [0210] (D) In the arithmetic processing unit, the first updated value of the maximum likelihood estimate of parameter θ is calculated based on the posterior probability of indicator variable Z.sub.nts or Z.sub.nt=1 calculated in the process (B)-2 or (C), [0211] (E) In the arithmetic processing unit, the processes (C) and (D) described above are executed again based on the first updated value of the maximum likelihood estimate of θ calculated in the process (D) described above, and the iterative loop process for calculating the second updated value of θ is repeatedly executed, until difference between the new updated value of θ and the previous updated value of θ is not recognized, and the converged parameter θ is recognized as the optimized value and recorded in the recording unit described above. [0212] (ii) As a computer program using the EM algorithm described above, for example, the following computer program can be considered. The embodiment of the computer program of the present invention is a computer program characterized by realizing the estimation method using the EM algorithm in a computer.

[0213] The computer program is a computer program for optimizing, for each read, the number of expected mappings to each allele of the gene loci or locus for the data where read mapping information of the particular gene loci or individual locus is mixed, which is characterized in that all or a part of the following functions 1 to 5, is realized: [0214] (A) Function 1 in which read information of DNA derived from the subject recorded as observed data of nucleotide sequence of read and alleles of the loci or locus to which the read is mapped to, is retrieved from the recording unit, [0215] (B) Function 2 in which based on the observed data retrieved by function 1 described above, either one of the following initialization processes (B)-1 and (B)-2 is executed:

[0216] (B)-1: A process of calculating the initial value of θ, which represents allele frequency of the gene loci or individual gene locus, and

[0217] (B)-2: A process of calculating the initial values of two latent variables, (a) and (b) described below, that are related to the θ described above and the data of DNA read information of the subject, which is observed data:

[0218] (a) A variable T.sub.n that is dependent on θ, which represents allele selection of read n in the gene loci or individual gene locus, and

[0219] (b) The posterior probability of an indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes S.sub.n, the start position of read n that is dependent on T.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes the latent variable T.sub.n, [0220] (C) Function 3 in which based on the parameter θ calculated in the process (B)-1 described above, the calculation process of the posterior probability of indicator variable Z.sub.nts or Z.sub.nt is executed, [0221] (D) Function 4 in which the first updated value of the maximum likelihood estimate of parameter θ is calculated based on the posterior probability of indicator variable Z.sub.nts=1 or Z.sub.nt=1 calculated in the process (B)-2 or function 3, [0222] (E) Function 5 in which function 3 and 4 described above are executed again based on the first updated value of the maximum likelihood estimate of θ calculated in the function 4 described above, and the iterative loop process for calculating the second updated value of θ is repeatedly executed, until difference between the new updated value of θ and the previous updated value of θ is not recognized, and the converged parameter θ is recognized as the optimized value and recorded in the recording unit described above.
(4) The Computer System and Computer Program using the Variational Bayesian Method [0223] (i) As a computer system using the variational Bayesian method described above, for example, the following computer system can be considered. The embodiment of the computer system of the present invention is a computer system characterized by realizing the estimation method using the variational Bayesian method in a computer.

[0224] The computer system of the present invention is a computer system for optimizing, for each read, the number of expected mappings to each allele of the gene loci or locus for the data where read mapping information of the particular gene loci or individual gene locus is mixed, comprising a recording unit and an arithmetic processing unit, which is characterized in that all or a part of the following processes (A) to (E), is executed: [0225] (A) In the recording unit, read information of DNA derived from the subject is recorded as observed data of nucleotide sequence of read and alleles of the loci or locus to which the read is mapped to, [0226] (B) In the arithmetic processing unit, based on the observed data retrieved from the recording unit, either one of the following initialization processes (B)-1 and (B)-2 is executed:

[0227] (B)-1: A process of calculating the updated value of the posterior distribution of θ.sub.t, based on the initial value of hyperparameter α.sub.t which represents prior information of allele frequency of the gene loci or individual gene locus, and

[0228] (B)-2: A process of calculating the initial distribution of two latent variables, (a) and (b) described below, that are related to the distribution of θ described above and the data where DNA read information of the subject is mixed, which is the observed data:

[0229] (a) A variable T.sub.n that is dependent on θ, which represents allele selection of read n in the gene loci or individual gene locus,

[0230] (b) The posterior distribution of an indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes S.sub.n, the start position of read n that is dependent on T.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes the latent variable T.sub.n, [0231] (C) In the arithmetic processing unit, based on the distribution of parameter θ calculated in the process (B)-1 described above, the calculation process of the posterior distribution of indicator variable Z.sub.nts or Z.sub.nt is exerted, [0232] (D) In the arithmetic processing unit, the first updated posterior distribution of parameter θ is calculated based on the updated posterior distribution of Z.sub.nts or Z.sub.nt calculated in the process (B)-2 or (C), [0233] (E) In the arithmetic processing unit, the processes (C) and (D) described above are executed again based on the first updated posterior distribution of θ calculated in the process (D), and the iterative loop process in which the second updated posterior distribution of θ is calculated is repeatedly executed, until difference between the expected value of the new updated posterior distribution of θ and the expected value of the previous updated posterior distribution of θ is not recognized, and the converged expected value of the posterior distribution of θ is recognized as the optimized value and recorded in the recording unit described above. [0234] (ii) As a computer program using the variational Bayesian method described above, for example, the following computer program can be considered. The embodiment of the computer program of the present invention is a computer program characterized by realizing the estimation method using the variational Bayesian method in a computer.

[0235] The computer program is a computer program for optimizing, for each read, the number of expected mappings to each allele of particular gene loci or individual gene locus for the data where read mapping information of the particular gene loci or individual gene locus is mixed, which is characterized in that all or a part of the following functions 1 to 5, is realized: [0236] (A) Function 1 in which read information of DNA derived from the subject recorded as observed data of nucleotide sequence of read and alleles of the loci or locus to which the read is mapped, is retrieved from the recording unit, [0237] (B) Function 2 in which based on the observed data retrieved by function 1 described above, either one of the following initialization processes (B)-1 and (B)-2 is executed:

[0238] (B)-1: A process of calculating the updated posterior distribution of θ.sub.t, based on the initial value of hyperparameter α.sub.t which represents the prior information of allele frequency of the gene loci or individual gene locus, and

[0239] (B)-2: A process of calculating the posterior distribution of two latent variables, (a) and (b) described below, that are related to the distribution of θ described above and the data where DNA read information of the subject is mixed, which is the observed data:

[0240] (a) A variable T.sub.n that is dependent on θ, which represents allele selection of read n in the gene loci or individual gene locus,

[0241] (b) The initial posterior distribution of an indicator variable Z.sub.nts (Z.sub.nts equals to one if (T.sub.n, S.sub.n)=(t, s), and zero otherwise), which summarizes S.sub.n, the start position of read n that is dependent on T.sub.n, or Z.sub.nt (Z.sub.nt equals to one if T.sub.n=t, and zero otherwise), which summarizes the latent variable T.sub.n, [0242] (C) Function 3 in which based on the distribution of θ calculated in the process (B)-1 described above, the calculation process of the posterior distribution of indicator variable Z.sub.nts or Z.sub.nt is executed, [0243] (D) Function 4 in which the first updated value of the posterior distribution of θ is calculated based on the updated posterior distribution of Z.sub.nts or Z.sub.nt calculated in the process (B)-2 or function 3, [0244] (E) Function 5 in which function 3 and 4 described above are executed again based on the first updated value of the posterior distribution of θ calculated in the function 4 described above, and the iterative loop process in which the second updated value of the posterior distribution of θ is calculated is repeatedly executed, until difference between the expected value of the new updated posterior distribution of θ and the expected value of the previous updated posterior distribution of θ is not recognized, and the converged expected value of θ is recognized as the optimized value and recorded in the recording unit described above.

[B] A More Specific Form of the Method of the Present Invention

[0245] Here, we disclose an example where the particular gene loci are HLA loci, HLA being a human MHC, but the algorithm disclosed herein may be applied to other loci, cytochrome P450 loci, immunoglobulin gene loci, T cell receptor gene loci, and olfactory receptor gene loci. The method of the present invention is used as a generic name of the optimization method and determination method, as described above.

[0246] FIG. 1 is a process flow of the method of the present invention for predicting the genotype of a subject's HLA locus. Each element of this flow of processing is all performed electronically on the computer based on an instruction to realize an algorithm of computer software via a computer network or a database as necessary.

[0247] Box B1-1 indicates the presence of read data and box B1-2 indicates the presence of a human genome reference sequence. As described above, the “read data” (box B1-1) is electronic information of nucleotide sequence of a part or all of the individual DNA fragments (reads) obtained by processing the subject's DNA sample with a high-throughput sequencer, and normally, information indicating the reading accuracy is also added (FASTQ: a term derived from the FASTA format representing the nucleotide sequence of DNA). At this stage, “depth of coverage of total reads” is determined. The “human genome reference sequence” (box B1-2) is constructed from one or multiple human genome sequence information as described above, and its source of supply is not particularly limited here. It is also assumed that the ethnicity of the subject and that of individuals from which the human genome reference sequence is constructed are different, and there exists heterogeneity with respect to the genome sequence in HLA genes. The substantial effect of the present invention does not depend on the version change of the reference sequence, and it is always possible to determine HLA types of the subject in correspondence with the changes of the reference sequence, and as the latest reference sequence provided at present or the one that will be provided in the future can be used. Examples of human genome reference sequences include the human genome reference sequence (such as hg19) managed by UCSC (University of California Santa Cruz), the human genomic reference sequence (such as GRCh37) managed by the Genome Reference Consortium, and others, but it is not limited to these examples. As described above, for example, sequence data obtained from targeted sequencing, Exome sequencing, RNA sequencing, and long-read sequencing such as with PacBio RSII and Oxford Nanopore and others can also be used as a reference sequence.

[0248] The step S1 is a step of performing the first mapping, or, a step of mapping the “read data” of the box B1-1 to the “human genome reference sequence” of the box B1-2. With regard to this mapping, those skilled in this field can create computer programs including algorithms capable of realizing the mapping with computers, but it is also possible to use existing software. The existing mapping software includes, for example, BWA-MEM (http://bio-bwa.source.net/), Bowtie 2 (http://bowtie-bio.sourceforge.net/bowtie 2/index.shtml), Novoalign (http://www.novocraft.com/products/novoalign/), and others. Also, when the read data is paired-end sequence data, if one of the nucleotide sequences (pairs) of both ends of the read is mapped to the HLA locus and the other is not mapped, it is preferable to extract both reads of the pair and use them in the following processes.

[0249] Furthermore, it is preferable that a “decoy sequence” is included in addition to the human genome reference sequence described above. A “decoy sequence” is a genomic sequence that does not exist in a reference sequence prepared in advance. This is because, without using a decoy sequence, there is a high risk that reads that are not derived from sequences registered in the human genome reference sequence will be mapped to some places in the existing human genomic reference sequences, possibly reducing mapping accuracy. As a decoy sequence, hs37d5 (ftp: //ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz) and others can be used.

[0250] The box B2 is a box showing the existence of “the result of the first mapping” in the above S1, and the mapping result exists in the format such as the SAM format or the BAM format.

[0251] Step S2-1 is a step of “extracting reads mapped to HLA loci” from “the result of the first mapping” in the box B2, and step S2-2 is a step of performing “extraction of unmapped reads” that are not mapped to any portion of the human genome reference sequence. With respect to the execution of these extraction, it is possible for those skilled in the field to create computer programs including algorithms that can realize the extraction with a computer, but it is also possible to use existing software. As existing extracting software, for example, SAMtools (http://samtools.sourceforge.net/) can be mentioned. In this extraction step, it is preferable to carry out calculation as efficient as possible because the necessary processing to be performed spans at least all HLA loci, which are many loci. Therefore, it is preferable that the computer program mentioned above is designed so that the extraction is carried out on all HLA loci (currently, HLA-A, -B, -C, -DM, -DO, -DP, -DQ, -DR, -E, -F, -G, -H, -J, -K, -L, -P, -V, -MIC, and -TAP are registered in the IMGT/HLA database). With SAMtools, extraction processing can be easily performed. It is preferable that “Extraction of unmapped reads” in step S2-2 is set as a step to prevent the loss of reads derived from HLA loci that were not mapped to the reference genomic sequence due to polymorphism in the subject's genomic sequence as described above.

[0252] Through these steps in S2, the information of reads that were mapped to regions other than HLA loci are excluded.

[0253] The box B3 is a box showing the presence of the read information (extracted reads) extracted in the above step S2 and the box B4 is a box showing the existence of the information of “reference sequence of HLA allele”. Reference sequence information of HLA allele is electronic information derived from a database in which genome information of HLA allele is stored. As the database, for example, IMGT/HLA (http://www.ebi.ac.uk/ipd/imgt/hla/), and others can be mentioned. It is preferable to acquire the latest update information from these database information as much as possible. In addition, it is preferable that HLA alleles of “pseudogenes” which are genes that do not actually code for proteins are also included in the genomic information of HLA allele.

[0254] Step S3 is a step in which “second mapping” to “reference sequence of HLA allele” in box B4 of “extracted read information” in box B3 is executed. It is preferable that the mapping software is designed so that the execution mode of the second mapping is performed so as to allow one read to be mapped to multiple HLA alleles, and it is possible by specifying “-a option” with BWA-MEM mentioned above and “-k option” with Bowtie 2, which make it possible for the computer to realize the multiple mapping of reads to HLA alleles.

[0255] Box B5 is a box showing “the result of the second mapping” in step S3, that is, the presence of “HLA mapping read information”, and the information exists in a format such as the SAM format or BAM format.

[0256] The “mapping process of the present invention” described above is shown until this step in FIG. 1.

[0257] Hereinafter, the step S4 is a step of optimizing the “HLA mapping read information” described above, and details thereof are shown in FIG. 2 (FIGS. 2-1, 2-2, 2-3) as a flow sheet. FIG. 2-1 shows embodiments (variational Bayesian method and EM algorithm) of the optimizing method of the present invention using the latent variable Z.sub.nt, FIG. 2-2 shows an embodiment using the EM algorithm in the optimization method of the present invention using the latent variable Z.sub.nts, and FIG. 2-3 shows an embodiment using the variational Bayesian method in the optimization method of the present invention using the latent variable Z.sub.nts. The detailed calculation steps involved in these embodiments are as described above for alleles of the particular gene loci. Furthermore, the box B6 is a box showing the presence of electronic information of the desired “Determination of HLA type”, but in the process from step S4 to box B6, the HLA typing process is preferably executed using the rejection threshold shown in FIG. 3 (described later).

[An Embodiment of the Optimization Method using the Latent Variable Z.sub.nt]

[0258] FIG. 2-1 is a flow sheet showing one embodiment of the optimization process using the latent variable Z.sub.nt of “HLA mapping read information” as described above. This flow sheet is an embodiment in which prediction by the variational Bayesian method is performed for Bayesian estimation to obtain expected read counts on HLA alleles, but it is also possible to perform prediction by the EM algorithm in order to perform maximum likelihood estimation, which does not need incorporation of the hyperparameter α.sub.0 as described below.

[0259] Step S4-11 is a step of retrieving “HLA allele reference sequence” shown in the box B4 and retrieving “HLA mapping read information” shown in the box B5.

[0260] Step S4-12 is a step in which the number of expected mappings E[Z.sub.nt] to each HLA allele of the read n used in the optimization process by executing the computer program of the present invention, and the expected value of the fraction of the total number of expected mappings allocated to each HLA allele relative to the total read amount, E[θ.sub.t], are initialized.

[0261] Step S4-13 is a step in which the total number of expected mappings for each HLA allele for all reads based on the initialized values obtained in step S4-12 described above is calculated. The definition of the mathematical expression described here is as described above. The r.sub.t is “the total number of expected mappings assigned to HLA allele t”.

[0262] Step S4-14 is a step in which the “total number of expected mappings” for each HLA allele, calculated in step S4-13 described above is divided by the sum of the total number of expected mappings for all HLA alleles, and the expected value of the fraction of the total number of expected mappings allocated to each HLA allele relative to the total read amount, E[θ.sub.t], is calculated. In this step S4-14, the hyperparameter α.sub.0 has been incorporated as described above, in which an embodiment of the variational Bayesian method is described. For example, α.sub.0=1 which gives a uniform distribution is used as the hyper parameter α.sub.0 assuming that there is no prior knowledge, but α.sub.0 that maximizes the lower bound of the log likelihood, for example, α.sub.0=0.01 is also preferably used. Furthermore, as described above, it is also possible to use it as a flow sheet of an embodiment of the EM algorithm for performing maximum likelihood estimation of the parameter θ.sub.t without incorporating this hyperparameter.

[0263] Step S4-15 is a step related to loop processing, and [0264] (i) The function in which the distribution of the fraction of the total number of expected mappings calculated in step S4-14 described above is assigned to each HLA allele as a frequency distribution, and based on the allocated frequency distribution, again in step S4-13 described above, the number of expected mappings for each HLA allele for each read is calculated, [0265] (ii) The function in which based on the number of expected mappings for each HLA allele for each read calculated in the function (i) described above, step S4-13 or S4-14 is executed again for calculating the updated distribution of the fraction of reads assigned to each HLA allele relative to the total read amount mapped to HLA alleles, [0266] (iii) The function in which the calculation processes in the (i) and (ii) described above are repeatedly executed, until the difference between the number of expected mappings to each HLA allele for each read calculated by executing the current (i) and that calculated by executing the previous (i) is not recognized for all the reads, or, the difference between the expected value of the distribution of the fraction of the total number of expected mappings calculated by executing the current (ii) and that calculated by executing the previous (ii) is not recognized for all HLA alleles, and the converged expected number of mappings to each HLA allele for each read or the converged expected value of the distribution of the fraction of reads for each HLA allele is recognized as the optimized data as “convergence value”;
are included in the description of step S4-15.
[Embodiment of the Optimization Method using Latent Variable Z.sub.nts (1)]

[0267] FIG. 2-2 is a flow sheet showing an embodiment of the optimization process using the latent variable Z.sub.nts of “HLA mapping read information” as described above. This flow sheet describes prediction of the expected read counts on HLA alleles by maximum likelihood estimation with the EM algorithm.

[0268] Step S4-21 is a step of retrieving the “HLA mapping read information” in the same manner as in step S4-11 described above.

[0269] Step S4-22 is a step in which a latent variable Z.sub.nts (a latent variable taking 1 in the case where the read n is generated from the position s of the HLA allele t) and θ.sub.t, the fraction of reads allocated to each HLA allele relative to the total read amount, used in the optimization process by executing the computer program, are initialized. The expected value of the latent variable Z.sub.nts is described as E[Z.sub.ntss]. Here, N is the “total number of reads”, and T is the “total number of HLA alleles”. l.sub.t is the length of the HLA allele t, and L is the read length. M.sub.n is the total number of observed mapped locations of the read n. E[Z.sub.nts] is initialized as 1/M.sub.n, given an assumption of no prior information/uniform distribution. θ.sub.t is given as 1/T as an initial value, given an assumption of no prior information/uniform distribution.

[0270] Step S4-23 is a step in which for all reads, the number of expected mappings to each HLA allele based on the initialized information retrieved in step S4-22 is calculated. That is, this is a step in which the expected value of the latent variable Z.sub.nts (E.sub.z [Z.sub.nts]) indicating the allocation of read n to the position s of the HLA allele t, based on the θ.sub.t updated immediately before (for the first time, it is the initial values in step S4-22 described above), that is, the “expected number of mappings”, is calculated, which corresponds to the E step of the EM method.

[0271] Step S4-24 is a step in which the “expected number of mappings” calculated in step S4-23 described above is summed for each HLA allele and the “total number of expected mappings” (r.sub.t=Σ.sub.n′,t′=t,s′ E.sub.z[Z.sub.n′t′s′]) is calculated, and the process in which the total number of expected mappings is divided by the sum of the total number of expected mappings for all HLA alleles to calculate the maximum likelihood estimate of the fraction of reads assigned to each allele relative to the total read amount “θ.sub.t” (θ.sub.t=r.sub.t/Σ.sub.t′r.sub.t′), is executed, which corresponds to the M step of the EM method. This step corresponds to calculating θ.sub.t that locally maximizes the log likelihood.

[0272] Step S4-25 is a step related to loop processing. That is, steps S4-23 and S4-24 are executed again based on the first updated value of the maximum likelihood estimate of θ.sub.t calculated in step S4-24, and the loop processing in which the second updated value of the parameter θ is calculated is repeatedly executed until the significant difference between the new updated value and the previously updated value is not recognized for all HLA alleles, and the converged parameter θ is recorded on the recording unit as the optimized θ.

[Embodiment of the Optimization Method using Latent Variable Z.sub.nts (2)]

[0273] FIG. 2-3 is a flow sheet showing an embodiment of the optimization process using the latent variable Z.sub.nts of “HLA mapping read information” as described above. This flow sheet describes prediction of the expected read counts on HLA alleles and HLA allele frequency by variational Bayesian method.

[0274] Step S4-31 is a step of retrieving the “HLA mapping read information” in the same manner as in step S4-11 described above.

[0275] Step S4-32 is a step in which the posterior distribution of a latent variable Z.sub.nts (a latent variable taking 1 in the case where the read n is generated from the position s of the HLA allele t, and 0 otherwise) and the posterior distribution of θ.sub.t (the fraction of reads allocated to each HLA allele relative to the total read amount), used in the optimization process by executing the computer program, are initialized. The expected value of the posterior distribution of the latent variable Z.sub.nts is described as E[Z.sub.nts]. Here, N is the “total number of reads”, and T is the “total number of HLA alleles”. l.sub.t is the length of the HLA allele t, and L is the read length. M.sub.n is the total number of observed mapped locations of the read n. E[Z.sub.nts] is initialized as 1/M.sub.n, given an assumption of no prior information/uniform distribution. The expected value of θ.sub.t, E.sub.θ[θ.sub.t], is given as α*.sub.t/Σ.sub.t′α*t′. Here, α*.sub.t=α.sub.0+N/T. As described above, α.sub.0 is a hyperparameter. For example, assuming that there is no prior information, α.sub.0=1 that gives a uniform distribution is used, but it is also preferable to use α.sub.0, which maximizes the lower bound of log likelihood, for example, α.sub.0=0.01.

[0276] Step S4-33 is a step in which for all reads, the number of expected mappings to each HLA allele based on the initialized information retrieved in step S4-32 is calculated. That is, this is a step in which the expected value of the posterior distribution of the latent variable Z.sub.nts, E.sub.z[Z.sub.nts], indicating the allocation of read n to the position s of the HLA allele t, based on the posterior distribution of θ.sub.t updated immediately before (for the first time, it is the initial values in step S4-32 described above), that is, the “expected number of mappings”, is calculated, which corresponds to the E step of the variational Bayesian method.

[0277] Step S4-34 is a step in which the posterior distribution of θ.sub.t is calculated based on the “expected number of mappings” (E.sub.z[Z.sub.nts]) calculated in step S4-33 described above, which corresponds to the M step of the variational Bayesian method, and the expected value of the posterior distribution of θ.sub.t, E.sub.θ[θ.sub.t], is calculated based on the E.sub.z[Z.sub.nts] calculated in step S4-34 as

[00019] [ Mathematical .Math. .Math. 33 ] E θ [ θ t ] = α t * .Math. t .Math. .Math. α t * .Math. .Math. ( where .Math. .Math. α t * = α 0 + .Math. n , t = t , s .Math. .Math. E Z [ Z n .Math. t .Math. s ] )

[0278] Step S4-35 is a step of selecting whether to repeat steps S4-33 and S4-34 described above, or to end the loop process.

[0279] That is, steps S4-33 and S4-34 are executed again based on the first updated posterior distribution of the parameter θ.sub.t calculated in step S4-24, and the loop processing of calculating the second updated posterior distribution of the parameter θ is iteratively executed until the significant difference between the expected value of the newly updated posterior distribution and the expected value of the posterior distribution updated last time is not recognized, and the converged expected value of θ is recorded in the recording unit as the data of optimized θ.

[0280] It is possible to use the obtained “r.sub.t” or “θ.sub.t” as an indicator of HLA typing, but it is preferable that the following HLA typing process is performed.

[0281] FIG. 3 is a flow sheet showing an example of the HLA typing process using a rejection threshold for the individual depth of coverage that is determined relative to the depth of coverage of all reads, based on “the fraction of reads assigned to each HLA allele t “θ.sub.t”, or the expected value of the read fraction “θ.sub.t””, obtained through the optimization step described above.

[0282] Step S5-1 is a step in which the “individual depth of coverage” of each HLA allele is calculated from “the fraction of reads assigned to each HLA allele”. The variables “t” and “n” are initialized, and the process of calculating the individual depth “d.sub.t” is performed using the mathematical expression indicated in box S5-1. The details of the calculation process here have already been described.

[0283] Step S5-2 is a step in which, as described in the box, for each HLA locus, two of the HLA alleles with the two largest individual depth of coverage “d.sub.t” are selected, and the allele with the largest individual depth of coverage, and the allele with the second largest individual depth of coverage are specified.

[0284] Step S5-3 is a step in which a selection process to determine whether the largest individual depth of coverage “d.sub.first” of the allele described above is smaller than the rejection depth “D” or not is executed. The rejection depth “D” is a numerical value selected between 5 and 50%, preferably between 10 and 30% of the “depth of coverage of total reads” as described above. If the individual depth of coverage “d.sub.first” is smaller than the rejection depth “D”, a decision is made that “HLA type is not determined” (conclusion D5-1). If it is not smaller, the next selection step S5-4 is executed.

[0285] Step S5-4 is a step in which a selection process to determine whether the second largest individual depth of coverage “d.sub.second” of the allele described above is smaller than the rejection depth “D” or not is executed. If the individual depth of coverage “d.sub.second” is smaller than the rejection depth “D”, a decision is made to go to the selection step S5-5, and if it is not smaller, a decision is made to go to the selection step S5-6.

[0286] Step S5-5 is a step that is applied when the individual depth of coverage “d.sub.second” is smaller than the rejection depth “D”. That is, a determination process in which if the “d.sub.first” described above is greater than twice the rejection depth “D”, “the HLA type is homozygous of the HLA allele with the largest individual depth of coverage” (conclusion D5-2), and if it is not larger than 2 times, it is assumed that “the HLA type is determined as heterozygous of the HLA allele with the largest individual depth of coverage and the other allele is not determined” (conclusion D5-3), is performed.

[0287] Step S5-6 is a further selection step that is applied when “d.sub.second<D” is not satisfied in step S5-4, and a determination process in which if “d.sub.first” is more than twice the individual depth of coverage “d.sub.second”, “the HLA type is homozygous of the HLA allele with the highest individual depth of coverage” (conclusion D5-4), and if it is not more than twice, “the HLA type is heterozygous of the HLA alleles with the largest and the second largest individual depth of coverage” (conclusion D5-5), is performed.

[0288] As described above, the system of the present invention can determine the HLA type in the preferred form (box B6 in FIG. 1).

EXAMPLES

[0289] Hereinafter, examples of the present invention performed in which the target is HLA, which is human MHC, are disclosed.

[A] The Source used in the Examples

[0290] The sources used in executing the computer system and the computer program (hereinafter collectively referred to as the “system of the present invention”) used in the present examples are as follows.

(1) HLA Mapping Process of the Present Invention

[0291] HiSeq2000 (Illumina, Inc.) was used as the next-generation sequencer that provides the “read information” of box B1-1 in FIG. 1. The read information is provided in the FASTAQ format, and the subsequent read information is likewise in the FASTAQ format.

[0292] As the “human genome reference sequence” in box B1-2, hg19 (UCSC) or GRCh37 (Genome Reference Consortium) was used in conjunction with decoy sequence (hs37d5).

[0293] The mapping of step S1/S3 was performed by BWA-MEM. For step S3, the option “-a” was specified and all possible mapping locations were output for each read.

[0294] The “first mapping result” in box B2 and the “second mapping result” in box B5 were used in the BAM format.

[0295] As software for extracting the mapping result in step S2, “SAMtools” was used.

[0296] “Reference sequence of HLA allele” in box B4 was provided in the FASTA format obtained from the IMGT/HLA database.

(2) The Optimization Process

[0297] The optimization process in step S4 is executed by applying the algorithm described in flow sheet in FIG. 2-1 using the variational Bayesian method to the paired-end data, the rejection depth is set to 20% of the depth of coverage of total reads, and the rejection process was executed by applying the algorithm described in the flow sheet in FIG. 3.

[B] Performance Measurement of HLA Typing

(1) Overview of the Simulation Experiment

[0298] The prediction performance of the system of the present invention was evaluated from the viewpoint of prediction accuracy. The prediction accuracy is defined as the fraction of the true positive prediction relative to the predicted HLA types. In this simulation experiment, two HLA alleles (either heterozygous or homozygous) for the six HLA loci (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) were evaluated in each individual. Predictive performance was evaluated separately for each method with two-digit, four-digit, six-digit and eight-digit resolution.

(2) Simulation Data Analysis

[0299] With the simulation data analysis, the performance of the system of the present invention to predict HLA types was evaluated compared to other systems. As the comparable systems, (a) PHLAT (Bai et al., BMC genomics, 15: 325 (2014)) and (b) HLAminer (Warren et al., Genome medicine, 4 (12): 102 (2013)) were used. It is known that these comparable systems can classify HLA class I loci (HLA-A, -B, and -C) and the class II loci (HLA-DQA1, -DQB1, and -DRB1) at a four-digit resolution from human whole-genome sequence data.

[0300] First, simulation data of 1,000 individuals of human samples was prepared. For each individual, HLA diplotype of the six HLA loci was randomly selected from HLA alleles registered in the IMGT/HLA database release 3.15.0. Once the HLA type was established for each individual, one SNP for 1,000 bp was incorporated in each individual's HLA allele based on the average polymorphism in the human genome. Next, 100 bp of the paired-end read data (average depths of coverage were 5×, 10×, 20× and 30×, and the mean fragment length and standard deviation of the distribution were set as 300 bp and 40 bp, respectively) was uniformly produced across the HLA allele sequence with 0.1% substitution, deletion and insertion errors.

[0301] Table 1 shows the prediction accuracy of the system of the present invention and the existing tool using the 30× simulation data.

TABLE-US-00001 TABLE 1 HLA System of the resolution present invention PHLAT HLAminer 8-digit 99.94% — — 6-digit 99.95% 80.80% — 4-digit 99.95% 88.75% 50.12% 2-digit   100% 96.39% 77.82%

[0302] In Table 1, the system of the present invention was able to estimate the HLA type with an especially high accuracy of 99.94% at eight-digit resolution which can not be conducted with existing PHLAT and HLAminer. Comparing the estimation accuracy of the system of the present invention with that of the existing PHLAT and HLAminer, it became clear that the system of the present invention achieved better prediction performance at the four-digit resolution and the six-digit resolution than the existing software. FIG. 4 is a drawing showing prediction accuracy of HLA types at four-digit resolution at various depth of coverage in the simulation data. In FIG. 4, the prediction accuracy when using the system of the present invention was always better than the accuracy with PHLAT and HLAminer throughout the examined depth of coverage.

[0303] In particular, in the system of the present invention, HLA type prediction was made at an accuracy of 99.36% at four-digit resolution using the simulation data of average depth of 5×. In PHLAT, only SNP sites are examined in order to compute likelihood, but with only the limited information, in case if other polymorphic sites (such as deletions or insertions) are important for determining the HLA type, it is not effective. Another possible disadvantage of PHLAT is that the system needs prior information about HLA allele frequency. However, the HLA allele frequency is particularly diverse among human populations and it is not always possible to guess the racial roots of the provided gene specimens. In contrast, the system of the present invention does not require prior information on HLA allele frequency.

(3) Real Data Analysis (1)

[0304] The system of the present invention is applied to whole-genome sequencing data of CEU trio samples (NA12878 (child), NA12891 (father) and NA12892 (mother)) used in the international HapMap project, which were not amplified by PCR method. Here, 100 bp paired-end data was generated using HiSeq2000, the average insertion length was 300 bp, and the depth of coverage of these data was 45× for each sample (all data was provided by Illumina, Inc.). Table 2 shows the predicted HLA types for the CEU trio.

TABLE-US-00002 TABLE 2 Sample Predicted HLA types NA12878 (child) A*01:01:01:01 A*11:01:01 B*08:01:01 B*56:01:01 C*01:02:01 C*07:01:01:01 DQA1*01:01:02 DQA1*05:01:01:02 DQB1*02:01:01 DQB1*05:01:01:02 DRB1*03:01:01:01 DRB1*01:01:01 NA12891 (father) A*01:01:01:01 A*24:02:01:01 B*07:02:01 B*08:01:01 C*07:01:01:01 C*07:02:01:03 DQA1*01:02:01:01 DQA1*05:01:01:02 DQB1*02:01:01 DQB1*06:02:01 DRB1*03:01:01:01 DRB1*15:01:01:02 NA12892 (mother) A*02:01:01:01 A*11:01:01 B*15:01:01:01 B*56:01:01 C*01:02:01 C*04:01:01:01 DQA1*01:01:02 DQA1*01:01:02 DQB1*05:01:01:02 DQB1*05:01:01:01 DRB1*01:01:01 DRB1*01:01:01

[0305] In Table 2, the HLA types for the class I locus (HLA-A, -B, and -C) predicted by using the system of the present invention are matched with HLA types experimentally verified at four-digit resolution. These are shown in bold text in Table 2. Many of the HLA types were predicted at 8-digit resolution by the system of the present invention. The HLA types of HLA-A, -B and -C loci predicted in the system of the present invention are consistent with HLA types predicted with PHLAT at six-digit resolution except for “B*07:02:01” (one allele in NA12891). Another literature (Major et al., PloS one, 8(11): e78410 (2013)) also reported one of the HLA-B alleles of NA12891 as “B*07:02:01”, which is consistent with the prediction with the system of the present invention. PHLAT, on the other hand, predicted this HLA type as “B*07:02:29”. Overall, the HLA types of the trio (child, father and mother) predicted in the system of the present invention was more consistent than those predicted by PHLAT.

[0306] The HLA types of the HLA-DQA1, -DQB1 and -DRB1 loci predicted with the system of the present invention are consistent with those predicted with PHLAT at six-digit resolution, except for DQA1*01:01:02 (one allele in NA12878 and two alleles in NA12892). PHLAT predicted the HLA allele as “DQA1*01:01:01”, which was predicted as “DQA1*01:01:02” with the system of the present invention. However, its genomic sequence itself was missing in the IMGT database release 3.15.0, so it was not possible to determine the success or failure between the two systems.

(4) Real Data Analysis (2)

[0307] The system of the present invention was applied to the 1KJPN population (1,070 healthy Japanese who participated in the cohort study of Tohoku Medical Megabank Project) and HLA allele at HLA-A, HLA-B and HLA-C loci were estimated. This example is based on mapping of sequence reads to genomic HLA allele sequences registered in the IMGT/HLA database. This analysis confirmed that it is possible to classify HLA-A alleles for 2,063 alleles out of 2,140 alleles at full resolution (eight-digits in HLA nomenclature).

[0308] Also, it is confirmed that the allele frequencies of HLA-A, HLA-B, and HLA-C at the predicted four-digit resolution (nucleotide difference that changes the amino acid sequence) are very similar to those determined at 4-digit resolution using the method PCR-SSOP (Itoh, Y. et al., Immunogenetics 57, 717-729 (2005)) in another Japanese population of 1018 people. Since both of the two Japanese populations which were compared include a sufficient number of samples of 1,000 or more, it can be assumed that the allele frequency distribution is close to the actual HLA allele frequency of a Japanese population. The fact that the estimated result with the system of the present invention and the estimated result with PCR-SSOP are very similar suggests that the allele frequency of a Japanese population could be correctly estimated at four-digit resolution whichever method was used.

[0309] FIG. 5-1 is a graph showing the results of examining HLA allele frequencies calculated for HLA-A, FIG. 5-2 for HLA-B, and FIG. 5-3 for HLA-C with both methods. The vertical axis shows allele frequency and the horizontal axis shows HLA alleles at four-digit resolution, and a graph bar on the left side of the same HLA allele type is the analysis result estimated using the system of the present invention in this 1KJPN population, and a graph bar on the right shows the analysis result with PCR-SSOP applied to another Japanese population of 1,018 people.

INDUSTRIAL APPLICABILITY

[0310] The system of the present invention realizes efficient and accurate genotyping of the gene loci in which there are similar nucleotide sequences in multiple locations in genome or there are many polymorphisms known, such as HLA loci, HLA being a human MHC, cytochrome P450 loci, gene loci encoding an immunoglobulin, gene loci encoding T cell receptors, gene loci encoding olfactory receptors, using whole-genome sequencing data without the need of primer design that is specific to the gene locus or prior knowledge of the allele frequency of the gene loci. Regardless of individual or population scale sequence data, it is possible to easily apply the system of the present invention for genotyping each locus at any selected gene loci, which is useful for studies to identify the relationship between genotype and phenotype and for clinical applications such as HLA type matching of donors and recipients during organ transplantation. In addition, by applying the present invention to other MHC other than HLA, for example, mammalian MHC such as H-2 (histocompatibility-2) which is a mouse MHC, and avian MHC such as chicken B locus, the present invention can be used as providing a more accurate and detailed variety discrimination, and further, providing basic knowledge for establishing therapeutic guidelines for diseases of pets and protected animals.

EXPLANATION OF REFERENCE NUMERALS

[0311] B1-1 A box indicating the presence of read data [0312] B1-2 A box indicating the presence of human genome reference [0313] S1 A step of performing the first mapping [0314] B2 A box indicating the presence of the result from the first mapping [0315] S2-1 A step of extracting reads from the result of the first mapping [0316] S2-2 A step of extracting unmapped reads [0317] B3 A box indicating the presence of the read information extracted by S2 [0318] B4 A box indicating the presence of HLA allele reference sequence [0319] S3 A step of performing the second mapping [0320] B5 A box indicating the presence of HLA mapping read information [0321] B6 A box indicating the presence of electronic information in which HLA type determination has been made [0322] S4-11 A step of describing a function of retrieval for performing optimization [0323] S4-12 A step of initializing parameters for performing optimization processes [0324] S4-13 A step of describing a function of E step for optimization [0325] S4-14 A step of describing a function of M step for optimization [0326] S4-15 A step of describing a function of loop and convergence for optimization [0327] S4-21 A step of describing a function of retrieval for performing optimization [0328] S4-22 A step of initializing parameters for performing optimization processes [0329] S4-23 A step of describing a function of E step for optimization [0330] S4-24 A step of describing a function of M step for optimization [0331] S4-25 A step of describing a function of loop and convergence for optimization [0332] S4-31 A step of describing a function of retrieval for performing optimization [0333] S4-32 A step of initializing parameters for performing optimization processes [0334] S4-33 A step of describing a function of E step for optimization [0335] S4-34 A step of describing a function of M step for optimization [0336] S4-35 A step of describing a function of loop and convergence for optimization [0337] S5-1 A step of performing calculation of individual depth of coverage [0338] S5-2 A step of performing selection of the two with the largest individual depth of coverage [0339] S5-3 A step of performing selection based on the rejection depth and the largest individual depth of coverage [0340] D5-1 A box indicating a conclusion that the HLA type is not determined [0341] S5-4 A step of performing selection based on the rejection depth and the second largest individual depth of coverage [0342] S5-5 A step performed when the second largest individual depth of coverage is smaller than the rejection depth [0343] D5-2 A box indicating a conclusion that it is homozygous of the HLA allele with the largest individual depth of coverage [0344] D5-3 A box indicating a conclusion that it is heterozygous of the HLA alleles with the largest individual depth of coverage and another allele that is not determined [0345] S5-6 A step performed when the second largest individual depth of coverage is not smaller than the rejection depth [0346] D5-4 A box indicating a conclusion that it is homozygous of the HLA allele with the largest individual depth of coverage [0347] D5-5 A box indicating a conclusion that it is heterozygous of the HLA alleles with the largest and the second largest individual depth of coverage