METHOD AND SYSTEM FOR NUCLEIC ACID SEQUENCING
20210164033 · 2021-06-03
Inventors
Cpc classification
G16B30/00
PHYSICS
C12Q2537/165
CHEMISTRY; METALLURGY
International classification
Abstract
The present invention relates to methods and systems for nucleic acid sequencing. In particular, the present invention relates to methods and systems for reducing the number of false-positives in nucleic acid sequencing. The method comprises: aligning a plurality of genetic reads to a reference genetic sequence; grouping the genetic reads into a plurality of groups; creating a consensus sequence for each group of the plurality of groups by setting a representation of the most abundant nucleotide man_p or a tag N based on a ratio r; and identifying a variation as a true variation if a ratio r* between the number of consensus sequences comprising the tag N at a specific position p and the number of the consensus sequences comprising the variation at the specific position p is below a threshold t*.
Claims
1. A method for nucleic acid sequencing comprising the following steps: (a) obtaining a plurality of genetic reads by sequencing of a nucleic acid sample; (b) aligning the plurality of genetic reads to at least one reference genetic sequence; (c) grouping the genetic reads sharing a genetic position on a reference genetic sequence of the at least one reference genetic sequence into a plurality of groups; (d) creating a consensus sequence for each group of the plurality of groups, wherein one corresponding consensus sequence is created by determining a most abundant nucleotide man_p at each specific position p of a plurality of positions within the one corresponding group of genetic reads and (i) setting a representation of the most abundant nucleotide man_p at each specific position p of the consensus sequence if a ratio r between the number of genetic reads within the one corresponding group having the most abundant nucleotide man_p at the specific position p and the number of genetic reads within the one corresponding group is above or equal a predetermined threshold t; and (ii) setting a tag N if the ratio r is below the predetermined threshold t; (e) comparing the consensus sequences of the plurality of groups to the reference genetic sequence at each specific position p of a plurality of positions of the consensus sequences, and wherein a difference at a specific position between the consensus sequences and the reference genetic sequence indicates a genetic variation at the specific position; (f) determining the number of consensus sequences comprising the variation at each specific position p of a plurality of positions, and determining the number of consensus sequences comprising the tag N at each specific position p of a plurality of positions; and (g) identifying the genetic variation at each specific position p of a plurality of positions as a true genetic variation if a ratio r* between the number of consensus sequences comprising the tag N at the 
specific position p and the number of the consensus sequences comprising the genetic variation at the specific position p is below a threshold t*.
2. The method of claim 1, wherein the ratio r is equal or above 76%.
3. The method of claim 1, wherein the ratio r* is equal or above 1.8, is equal or above 2, or is equal or above 4.
4. The method according to claim 1, wherein in step (c) each genetic read in a corresponding group of the plurality of groups comprises at least one particular nucleic acid sequence.
5. The method according to claim 4, wherein each particular nucleic acid sequence corresponds to a respective molecule.
6. The method according to claim 1, wherein the genetic reads of step (c) are grouped based on their genetic position and their barcode sequence.
7. The method according to claim 1, wherein in step (d) one corresponding group of the plurality of groups share at least one particular nucleic acid sequence.
8. The method according to claim 1, wherein step (d) is performed for all respective positions within the group, wherein step (e) is performed for all respective positions within the group, wherein step (f) is performed for all respective positions within the group, or wherein step (g) is performed for all respective positions within the group.
9. The method according to claim 1, wherein the number of positions in the genetic reads is 72.
10. The method according to claim 1, wherein one corresponding group comprises at least 3 genetic reads.
11. The method according to claim 1, wherein the plurality of groups is at least two groups.
12. The method according to claim 1, wherein the plurality of groups comprises a group′ and group″, and wherein the genetic reads of the group″ at least partially overlap with the genetic reads of the group′.
13. The method according to claim 1, wherein the plurality of groups comprises a group′ and group″, and wherein the genetic reads of the group″ do not overlap with the genetic reads of the group′.
14. The method according to claim 1, wherein the plurality of groups comprises a group′ and group″, and wherein the genetic reads of the group″ fully overlap with the genetic reads of the group′.
15. The method according to claim 1, wherein the plurality of groups comprises a group′ and group″, and wherein the genetic reads of the group″ correspond to the reverse complement of the genetic reads of the group′.
16. The method according to claim 15, wherein the genetic reads of the group′ correspond to a first strand of a double-stranded nucleic acid and the genetic reads of the group″ correspond to the complementary second strand of the double-stranded nucleic acid.
17. The method according to claim 15, wherein a single strand consensus sequence is created for the group′ and wherein a single strand consensus sequence is created for the group″.
18. The method according to claim 1, further comprising: creating a double-stranded consensus sequence by (i) setting a representation of the most abundant nucleotide man_p or the tag “N” at each specific position p of a plurality of positions in the double strand consensus sequence if the representation or the tag at the specific position p is respectively present in both of the single strand consensus sequences of the group′ and the group″; and (ii) setting the tag “N” at each specific position p of a plurality of positions in the double strand consensus sequence if the tag “N” is present at the specific position p in one of the single strand consensus sequences of the group′ or the group″, or if the representation of the most abundant nucleotide man_p is not identical at the specific position in both of the single strand consensus sequences of the group′ or the group″.
19. The method according to claim 18, wherein in step (e) double-stranded consensus sequences are compared.
20. The method according to claim 19, wherein steps (e), (f), and (g) are performed with the double-stranded consensus sequences.
21. The method according to claim 16, wherein each position corresponds to a base pair.
22. The method according to claim 16, wherein the genetic reads of the one corresponding group have the same length.
23. The method according to claim 16, wherein the sequencing is next generation sequencing.
24. A system for nucleic acid sequencing comprising: (a) an obtaining unit configured to obtain a plurality of genetic reads by sequencing of a nucleic acid sample; and (b) a computation unit configured to align the plurality of genetic reads to at least one reference genetic sequence; (c) the computation unit configured to group the genetic reads sharing a genetic position on a reference genetic sequence of the at least one reference genetic sequence into a plurality of groups; (d) the computation unit configured to create a consensus sequence for each group of the plurality of groups, wherein one corresponding consensus sequence is created by determining a most abundant nucleotide man_p at each specific position p of a plurality of positions within the one corresponding group of genetic reads and (i) setting a representation of the most abundant nucleotide man_p at each specific position p of the consensus sequence if a ratio r between the number of genetic reads within the one corresponding group having the most abundant nucleotide man_p at the specific position p and the number of genetic reads within the one corresponding group is above or equal a predetermined threshold t; and (ii) setting a tag N if the ratio is below the predetermined threshold t; (e) the computation unit configured to compare the consensus sequences of the plurality of groups to the reference genetic sequence at each specific position p of a plurality of positions of the consensus sequences, and wherein a difference at a specific position between the consensus sequences and the reference genetic sequence indicates a variation at the specific position; (f) the computation unit configured to determine the number of consensus sequences comprising the variation at each specific position p of a plurality of positions, and determining the number of consensus sequences comprising the tag N at each specific position p of a plurality of positions; and (g) the computation unit configured to identify 
the variation at each specific position p of a plurality of positions as a true variation if a ratio r* between the number of consensus sequences comprising the tag N at the specific position p and the number of the consensus sequences comprising the variation at the specific position p is below a threshold t*.
25. The system according to claim 24, wherein the ratio r is equal or above 76%.
26. The system according to claim 24, wherein the ratio r* is equal or above 1.8, is equal or above 2, or is equal or above 4.
27. The system according to claim 24, wherein in step (c) each genetic read in a corresponding group of the plurality of groups comprises at least one particular nucleic acid sequence.
28. The system according to claim 24, wherein each particular nucleic acid sequence corresponds to a respective molecule.
29. The system according to claim 24, wherein the genetic reads of step (c) are grouped based on their genetic position and their barcode sequence.
30. The system according to claim 24, wherein in step (d) one corresponding group of the plurality of groups share at least one particular nucleic acid sequence.
31. The system according to claim 24, wherein the computation unit is configured to create the consensus sequences and is configured to set a respective representation or a respective tag N for all respective positions within the one corresponding group, wherein the computation unit is configured to compare the consensus sequences to the reference genetic sequence at all positions, wherein the computation unit is configured to determine the number of consensus sequences comprising the variation and to determine the number of consensus sequences comprising the tag N for all respective positions of the consensus sequences, or wherein the computation unit is configured to identify the variation at all positions and is configured to set a respective representation or a respective tag N for all respective positions of the consensus sequences.
32. The system according to claim 24, wherein the number of positions in the genetic reads is 72.
33. The system according to claim 24, wherein one corresponding group comprises at least 3 genetic reads.
34. The system according to claim 24, wherein the plurality of groups is at least two groups.
35. The system according to claim 24, wherein the plurality of groups comprises a group′ and group″, and wherein the genetic reads of the group″ at least partially overlap with the genetic reads of the group′.
36. The system according to claim 24, wherein the plurality of groups comprises a group′ and group″, and wherein the genetic reads of the group″ do not overlap with the genetic reads of the group′.
37. The system according to claim 24, wherein the plurality of groups comprises a group′ and group″, and wherein the genetic reads of the group″ fully overlap with the genetic reads of the group′.
38. The system according to claim 24, wherein the plurality of groups comprises a group′ and group″, and wherein the genetic reads of the group″ correspond to the reverse complement of the genetic reads of the group′.
39. The system according to claim 38, wherein the genetic reads of the group′ correspond to a first strand of a double-stranded nucleic acid and the genetic reads of the group″ correspond to the complementary second strand of the double-stranded nucleic acid.
40. The system according to claim 38, wherein a single strand consensus sequence is created for the group′ and wherein a single strand consensus sequence is created for the group″.
41. The system according to claim 24, wherein the computation unit is configured to create a double-stranded consensus sequence by (i) setting a representation of the most abundant nucleotide man_p or the tag “N” at each specific position p of a plurality of positions in the double strand consensus sequence if the representation or the tag at the specific position p is respectively present in both of the single strand consensus sequences of the group′ and the group″; and (ii) setting the tag “N” at each specific position p of a plurality of positions in the double strand consensus sequence if the tag “N” is present at the specific position p in one of the single strand consensus sequences of the group′ or the group″, or if the representation of the most abundant nucleotide man_p is not identical at the specific position in both of the single strand consensus sequences of the group′ or the group″.
42. The system according to claim 41, wherein the computation unit is configured to compare double-stranded consensus sequences.
43. The system according to claim 41, wherein the computation unit is configured to compare the double-stranded consensus sequences, to determine the number of the double-stranded consensus sequences, and to identify the variation in double-stranded consensus sequences.
44. The system according to claim 24, wherein each position corresponds to a base pair.
45. The system according to claim 24, wherein the genetic reads of the one corresponding group have the same length.
46. The system according to claim 24, wherein the sequencing is next generation sequencing.
47. A computer program product comprising one or more computer readable media having computer executable instructions for performing the steps of the method of claim 1.
48. The method of claim 4, wherein the at least one particular nucleic acid sequence includes at least one barcode sequence.
49. The method of claim 7, wherein the at least one particular nucleic acid sequence includes at least one barcode sequence.
50. The system of claim 27, wherein the at least one particular nucleic acid sequence includes at least one barcode sequence.
51. The system of claim 30, wherein the at least one particular nucleic acid sequence includes at least one barcode sequence.
Description
BRIEF DESCRIPTION OF THE FIGURES
[0352]
[0353]
[0354]
[0355]
[0356]
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
[0357]
[0358] In the inventive method as shown in
[0359] After the alignment, read pairs sharing the same genomic position and barcodes are grouped into “families” as shown in
[0360] The aim is to generate a synthetic single strand consensus (sscs) read by using the information contained in all members of the family, i.e. by considering the most abundant sequencing information as true. Therefore, the reads belonging to the same family are compared nucleotide by nucleotide and the most abundant nucleotide man_p is determined at each position.
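The grouping of aligned reads into families can be sketched as follows. This is a minimal illustration only: the `Read` record, its field names, and the example values are hypothetical, and single reads stand in for the read pairs described above.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical minimal read record; field names are illustrative only.
    position: int   # alignment start on the reference sequence
    barcode: str    # barcode sequence attached to the original molecule
    sequence: str   # aligned bases of the read


def group_into_families(reads):
    """Group reads that share both genomic position and barcode."""
    families = defaultdict(list)
    for read in reads:
        families[(read.position, read.barcode)].append(read.sequence)
    return dict(families)


reads = [
    Read(100, "ACGT", "TTGCA"),
    Read(100, "ACGT", "TTGCA"),
    Read(100, "ACGT", "TTGAA"),
    Read(100, "GGCC", "TTGCA"),  # same position, different barcode
    Read(250, "ACGT", "CCATG"),  # same barcode, different position
]
families = group_into_families(reads)
for key, members in families.items():
    print(key, members)
```

Keying on the (position, barcode) pair keeps reads from different original molecules apart even when they align to the same position.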
[0361] A respective most abundant nucleotide man_p is written into the sscs synthetic read at the respective position only if the nucleotide appears in at least, e.g., 76% (i.e. in more than 3 out of 4) of the members of the family at the investigated genomic position. In other words, the most abundant nucleotide man_p is determined at each specific position p of a plurality of positions within the respective family. A representation of the most abundant nucleotide man_p at each specific position p of the consensus sequence is only set if the ratio r between the number of nucleotides of the family at the specific position being the most abundant nucleotide man_p and the total number of reads of the family, i.e. the number of family members, is above or equal to the predetermined threshold t, i.e. 76% in the present inventive embodiment. Otherwise, i.e. if the ratio r between the number of nucleotides being the most abundant nucleotide man_p at the specific position p and the number of reads is below 76%, a tag N is set.
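The consensus rule of this paragraph can be illustrated with a short sketch. It assumes that all reads of a family have the same length and are already aligned column by column. The function name and the example reads are hypothetical and not part of the described embodiment.

```python
from collections import Counter


def single_strand_consensus(family, t=0.76):
    """Build an sscs string from the equal-length reads of one family.

    At each position p the most abundant nucleotide man_p is kept only
    if its ratio r (count of man_p / number of reads) is >= t;
    otherwise the tag "N" is written.
    """
    n_reads = len(family)
    consensus = []
    for column in zip(*family):  # one column per position p
        man_p, count = Counter(column).most_common(1)[0]
        r = count / n_reads
        consensus.append(man_p if r >= t else "N")
    return "".join(consensus)


# Four reads: at the third position G occurs in 3 of 4 reads (r = 0.75 < 0.76).
print(single_strand_consensus(["ACGT", "ACGT", "ACGT", "ACTT"]))
# Five reads: at the third position G occurs in 4 of 5 reads (r = 0.80 >= 0.76).
print(single_strand_consensus(["ACGT", "ACGT", "ACGT", "ACGT", "ACTT"]))
```

With t = 0.76, a family of four reads needs all four members to agree, which matches the "more than 3 out of 4" wording above. A family of five tolerates one deviating read.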
[0362] The upper part of
[0363] Such variation is only contained in one of the genetic reads. Thus, this nucleotide is not the most abundant nucleotide and the predetermined threshold, e.g. at least 76%, is not fulfilled. Therefore, the most abundant wild-type nucleotide (at least 76%) comprised in the other genetic reads at this position is set in the consensus sequence.
[0364] At the lower part of
[0365]
[0366]
[0367]
[0368] After the single strand consensus sequence (sscs) has been determined as shown in
[0369] A nucleotide is written into a double strand consensus read only if this nucleotide is the most abundant nucleotide man_p in both sscs reads; otherwise the assessed position in the double strand consensus read is set with an “N” tag.
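A minimal sketch of this double strand rule is given below. It assumes that both sscs reads have already been brought into the same orientation (e.g. by alignment to the reference) and have equal length. The function name is hypothetical.

```python
def double_strand_consensus(sscs_first, sscs_second):
    """Combine two same-orientation sscs reads into one dscs read.

    A nucleotide is kept only if both sscs reads agree on it; any
    disagreement, or an "N" in either read, yields "N".
    """
    if len(sscs_first) != len(sscs_second):
        raise ValueError("sscs reads must have the same length")
    return "".join(
        a if a == b else "N"  # if both are "N", a is already "N"
        for a, b in zip(sscs_first, sscs_second)
    )


print(double_strand_consensus("ACGT", "ACTT"))  # strands disagree at the third position
print(double_strand_consensus("ANGT", "ACGT"))  # "N" in one strand propagates
```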
[0370]
[0371]
[0372] In
EXAMPLES
[0373] To assess the sensitivity and specificity (PPV) of the method according to the present invention in detecting variants at low MAF (minor allele frequency 1%), four dilutions of three HapMap normal cell lines are generated. Details can be taken from Table 1.
[0374] DNA from these four dilutions is analyzed using the method of creating a single strand consensus as described above.
TABLE 1 Details of the HapMap normal cell lines dilutions generated in the laboratory in order to assess the sensitivity and specificity of our approach in detecting variants at low MAF

HapMap Normal Name   Dilution 1   Dilution 2   Dilution 3   Dilution 4
GM19194B (%)         99.4         99.4         99.4         99.4
GM19153B (%)         0.2          0.4          na           na
GM12144C (%)         0.4          0.2          na           na
GM19137B (%)         na           na           0.2          0.4
GM19142B (%)         na           na           0.4          0.2
[0375] A set of single nucleotide polymorphisms (SNPs) specific to the individual HapMap cell lines in use is determined from each non-diluted cell line. These were sub-divided into a set of unique SNPs, hereinafter referred to as “private SNPs”, which are specific to one of the HapMap cell lines in use. SNPs present in more than one of the cell lines in use are referred to as non-private SNPs. The sum of private and non-private SNPs is referred to as total SNPs.
[0376] HapMap cell line dilutions are used to determine the limit of detection of our assay. Because only heterozygous private SNPs were considered, all dilutions were characterized by private SNPs with expected 0.1%=<MAF<0.2% and MAF~50%, as can be seen in Table 1.
[0377] Due to experimental error, the detected MAF is in some cases different from that expected. Taking this intrinsic experimental error into consideration, sensitivity and specificity (PPV) were calculated as follows:
[0378] Sensitivity=True positive private SNPs/(True positive private SNPs+False negative)
[0379] Specificity (PPV)=Total true positive/(Total true positive+False positive)
[0380] Where true positive, false negative and false positive are defined as:
[0381] True positive private SNPs: private SNPs with detected 0.1%=<MAF<0.2%
[0382] True positive not private SNPs: not private SNPs with detected 0.1%=<MAF<0.2%
[0383] Total true positive: True positive private SNPs+True positive not private SNPs
[0384] False negative: private SNPs characterized by a detected 0.1%=<MAF<0.2% in the aligned sequencing read raw data but not called by our analysis algorithm.
[0385] False positive: SNV with detected 0.1%=<MAF<0.2% classified as “true mutations” by our analysis algorithm but not present in any of the non-diluted HapMap normal cell lines.
[0386] In Table 2 the details of the number of true positive, false negative and false positive with detected 0.1%=<MAF<0.2% are reported for the four dilutions of HapMap normal cell lines presented in Table 1 at a mean sequencing coverage of 3000×.
TABLE 2 Number true positive, false negative, false positive SNPs detected using the method presented in this document.

             Mean       Number of false   Number of false   Number of true   Total true
             Coverage   negative          positive          positive         positive
                        private SNPs      variants          private SNPs     SNPs
Dilution 1   2943.56    6                 16                25               91
Dilution 2   3042.1     2                 12                27               49
Dilution 3   2995.51    4                 18                30               58
Dilution 4   2934.2     4                 25                34               66
total                   22                172               141              326
[0387] In the present invention a filter is developed that uses the level of reliability of each nucleotide call and thus exploits the nature of the sscs and/or dscs reads. As mentioned above, sscs reads are made of only those nucleotides present in at least 76% of the reads of a family at the investigated position.
[0388] If this condition is not fulfilled, an “N” is placed at the investigated position in the sscs read to indicate that the consensus was not reached. It is understood by the skilled person that the present invention also works where only a single strand consensus (sscs) is considered.
[0389] However, in order to further improve the method of the present invention, the double strand consensus was also considered. Similarly, a nucleotide is written into the synthetic dscs read only if it is present in both sscs reads aligning at the same genomic position and showing complementary barcodes; an “N” is placed at the investigated position in the dscs read if the consensus is not reached. Thus, positions with “N” represent regions for which a consensus was not reached and the true nature of the sequence is unknown.
[0390] It was observed that sscs and dscs reads aligning at genomic regions containing substitutions erroneously introduced during the sequencing workflow exhibit a lower rate of consensus nucleotides, and therefore a higher number of “N”, compared to regions containing true variants. These regions may be defined by sequence repeats, which frequently lead to the incorporation of incorrect nucleotides and are therefore not removed by the 76% consensus cutoff applied in the previous step.
[0391] Based on this information, the abundance of “N” at a variant position across all reads covering the respective position (representing the sequencing depth at this position) can be used to discriminate a true mutation from erroneous substitution. In particular, the approach that has been implemented according to the present invention employs the ratio between the number of reads containing “N” and those containing the variant to discern a true call from false positive calls (referred to as “N filter” hereinafter).
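The N filter can be sketched as follows, under stated assumptions: consensus reads are represented as equal-length strings covering the same region, positions are 0-based, and the function name and example reads are hypothetical. A ratio exactly equal to the threshold is treated here as not passing, following the "below a threshold t*" wording of the claims; that boundary choice is an assumption.

```python
def passes_n_filter(consensus_reads, position, variant_base, t_star=2.0):
    """Return True if the variant at `position` is kept as a true call.

    r* = (number of consensus reads with "N" at the position)
         / (number of consensus reads with the variant at the position).
    The call is kept only if r* is below t_star.
    """
    n_count = sum(1 for read in consensus_reads if read[position] == "N")
    variant_count = sum(
        1 for read in consensus_reads if read[position] == variant_base
    )
    if variant_count == 0:
        return False  # nothing to call at this position
    r_star = n_count / variant_count
    return r_star < t_star


# Variant T at index 1: two reads carry T, one carries N, so r* = 0.5.
supported = ["ATG", "ATG", "ANG", "ACG", "ACG", "ACG"]
# Variant T at index 1: one read carries T, three carry N, so r* = 3.0.
noisy = ["ATG", "ANG", "ANG", "ANG"]

print(passes_n_filter(supported, 1, "T"))  # kept
print(passes_n_filter(noisy, 1, "T"))      # rejected
```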
Example 1
[0392] To assess the validity of the present invention, data obtained from the dilutions of HapMap normal cell lines, previously analyzed with the computational pipeline as used before the invention, were further processed using the N filter, with a threshold of 2 applied to the ratio of the number of reads containing an “N” at a defined position to the number of reads containing the variant at the defined position. Thus, if at a defined position more than twice as many “N” are present (compared to the variant), no mutation is called; see also
[0393] The number of true positive, false negative and false positive calls detected at 0.1%=<MAF<0.2% using this further improved analysis is presented in Table 3.
TABLE 3 Number true positive, false negative, false positive SNPs detected using the optimized computational pipeline including the N filter.

             Mean       Number of false   Number of false   Number of true   Total true positive
             Coverage   negative          positive          positive         SNPs (private +
                        private SNPs      variants          private SNPs     non-private SNPs)
Dilution 1   2943.56    10                9                 21               42
Dilution 2   3042.1     5                 9                 24               35
Dilution 3   2995.51    4                 10                30               54
Dilution 4   2934.2     4                 18                34               60
total                   23                46                109              191

[0394] The sensitivity and specificity (PPV) estimated for the workflow presented in detail above were 82.5% and 80.5%, respectively, with the N filter with a ratio>2, compared with 86.5% and 65.5% without the N filter (see Table 5).
Example 2
[0395] To assess the validity of the present invention, data obtained from the dilutions of HapMap normal cell lines, previously analyzed with the computational pipeline as used before the invention, were further processed using the N filter, with a threshold of 4 applied to the ratio of the number of reads containing an “N” at a defined position to the number of reads containing the variant at the defined position. Thus, if at a defined position more than four times as many “N” are present (compared to the variant), no mutation is called; see
[0396] The number of true positive, false negative and false positive calls detected at 0.1%=<MAF<0.2% using this further improved analysis is presented in Table 4 below.
TABLE 4 Number true positive, false negative, false positive SNPs detected using the optimized computational pipeline including the N filter.

             Mean       Number of false   Number of false   Number of true   Total true positive
             Coverage   negative          positive          positive         SNPs (private +
                        private SNPs      variants          private SNPs     non-private SNPs)
Dilution 1   2943.56    6                 15                25               47
Dilution 2   3042.1     2                 11                27               44
Dilution 3   2995.51    4                 14                30               54
Dilution 4   2934.2     4                 20                34               60
total                   16                60                116              205

[0397] The sensitivity and specificity (PPV) estimated for the workflow presented in detail above were 87.9% and 77.4%, respectively, with the N filter with a ratio>4, compared with 86.5% and 65.5% without the N filter (see Table 5 below).

TABLE 5 Sensitivity and specificity (PPV) obtained for the 4 dilutions of HapMap normal cell lines analyzed using the computational pipeline with two different N-filter ratios as well as without the N filter

                    Sensitivity   Specificity (PPV)
without N filter    86.5%         65.5%
with N filter >4    87.9%         77.4%
with N filter >2    82.5%         80.5%
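As a cross-check, the sensitivity and specificity (PPV) formulas given above can be applied to the "total" rows of Tables 2 to 4. The sketch below is illustrative only; the function and dictionary key names are hypothetical. The computed values agree with Table 5 to within about 0.1 percentage points, the small differences being attributable to rounding.

```python
def sensitivity(tp_private, fn_private):
    """True positive private SNPs / (true positive private SNPs + false negative)."""
    return tp_private / (tp_private + fn_private)


def ppv(total_tp, fp):
    """Total true positive / (total true positive + false positive)."""
    return total_tp / (total_tp + fp)


# Counts copied from the "total" rows of Tables 2, 4 and 3, respectively.
totals = {
    "without N filter": {"fn": 22, "fp": 172, "tp_private": 141, "total_tp": 326},
    "with N filter >4": {"fn": 16, "fp": 60, "tp_private": 116, "total_tp": 205},
    "with N filter >2": {"fn": 23, "fp": 46, "tp_private": 109, "total_tp": 191},
}

for name, c in totals.items():
    sens = sensitivity(c["tp_private"], c["fn"])
    spec = ppv(c["total_tp"], c["fp"])
    print(f"{name}: sensitivity = {sens:.3f}, PPV = {spec:.3f}")
```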
[0398] Thus, the present examples show that using the above-described N filter yields a significant reduction in false positives.
[0399] As the present invention may be embodied in several forms without departing from the scope or essential characteristics thereof, it should be understood that the above-described embodiments are not limited by any of the details of the foregoing descriptions, unless otherwise specified, but rather should be construed broadly within the scope as defined in the appended claims, and all changes and modifications that fall within the present invention are therefore intended to be embraced by the appended claims.