Method for encoding and decoding of quality values of a data structure

Abstract

Method for encoding of quality values of a data structure, whereby said data structure comprises a set of genomic reads, wherein the method comprises the following steps executable by a data processing system: ascertain the quality values of each read covering a certain index locus, determine a codebook identifier identifying a specific codebook from a plurality of codebooks for said certain index locus based on the ascertained quality values of said certain index locus, whereby each codebook provides a mapping from a quality value of a quality value alphabet to a corresponding quantized quality value of a quantized quality value alphabet, quantize all ascertained quality values at said certain index locus using the specific codebook identified by the codebook identifier at said certain index locus in order to obtain for each quality value at said certain index locus a corresponding quantized quality value, and encode all determined codebook identifiers using a first entropy encoder and encode all quantized quality values using a second entropy encoder or a set of encoders.

Claims

1. A method for encoding of quality values of a data structure, whereby said data structure comprises a set of genomic reads, whereby each genomic read in said set of genomic reads comprises an actual sequenced nucleotide sequence as a local part of a donor sequence or genome, wherein said nucleotide sequence includes a sequence of symbols derived from a nucleotide alphabet, a mapping position indicating an alignment of said nucleotide sequence relating to at least one reference nucleotide sequence of the donor sequence or genome, a CIGAR string indicating similarities and/or differences of said nucleotide sequence relating to said at least one reference nucleotide sequence of the donor sequence or genome, and a sequence of quality values, each quality value being derived from a quality value alphabet, whereby a quality value at an index locus of said sequence of quality values is assigned to a corresponding symbol of said nucleotide sequence at said index locus and indicates a likelihood that the corresponding symbol is correct in view of said at least one reference nucleotide sequence of the donor sequence or genome, wherein the method is executed by a computer including a storage medium having stored thereon processor-executable instructions to cause the computer to perform the following operations: ascertain the quality values of each genomic read covering a certain index locus, determine a codebook identifier identifying a specific codebook from a plurality of codebooks for said certain index locus based on the ascertained quality values of said certain index locus, whereby each codebook provides a mapping from a quality value of said quality value alphabet to a corresponding quantized quality value of a quantized quality value alphabet, quantize all ascertained quality values at said certain index locus using the specific codebook identified by the codebook identifier at said certain index locus in order to obtain for each quality value at said certain index locus a corresponding quantized quality value, and encode all determined codebook identifiers using a first entropy encoder and encode all quantized quality values using a second entropy encoder or a set of encoders, wherein the encoded codebook identifiers and the encoded quantized quality values are stored in a memory space in place of storing the quality values or the actual sequenced nucleotide sequence.

2. The method according to claim 1, wherein the quantization step is performed for each index locus.

3. The method according to claim 1, wherein a genotype uncertainty for said certain index locus is computed based on the ascertained quality values at said certain index locus and the corresponding nucleotide symbols of each quality value at said certain index locus are obtained using a statistical model in order to obtain a likeliness that a unique genotype is the correct one.

4. The method according to claim 1, further comprising the following steps executed by the data processing system: input the determined codebook identifier at said certain index locus into a quality value codebook stream and input the quantized quality values at said certain index locus into a quality value index stream or a set of streams; encode the codebook identifiers of the quality value codebook stream using the first entropy encoder and encode the quantized quality values of the quality value index stream using the second entropy encoder or a set of encoders.

5. The method according to claim 4 further comprising the following steps executed on the data processing system: decompose the quality value descriptor stream into subsequence streams corresponding to the provided codebook identifiers such that each subsequence stream is assigned to one codebook identifier of the codebook identifiers, input the quantized quality values into that subsequence stream which corresponds to the respective codebook identifier, and encode each subsequence stream separately using the second entropy encoder or set of encoders.

6. The method according to claim 5, wherein for each subsequence stream, a probability distribution is computed based on the quality values of the respective subsequence stream and a separate second entropy encoder modelling the probability distribution of the respective subsequence stream is used for encoding the respective subsequence stream.

7. A method for decoding of encoded quality values, whereby the encoded quality values were encoded by a method according to claim 1, wherein the method is executed by a computer including a storage medium having stored thereon processor-executable instructions to cause the computer to perform the following operations: decode the encoded codebook identifiers and the encoded quantized quality values using an entropy decoder corresponding to the entropy encoders of the encoding method; ascertain a codebook identifier for a certain index locus from the decoded codebook identifiers and quantized quality values for said certain index locus from the decoded quantized quality values; determine a specific codebook of the plurality of codebooks based on the ascertained codebook identifier; and reconstruct the ascertained quantized quality values using the determined specific codebook.

8. The method according to claim 7, wherein the steps are performed for each index locus.

9. A computer program encoded on a non-transient computer readable medium having instructions which are executable on a data processing system for executing a decoding method according to claim 7.

10. A hardware device arranged to execute the decoding method according to claim 7.

11. A computer program encoded on a non-transient computer readable medium having instructions which are executable on a data processing system for executing a method according to claim 1.

12. A hardware device arranged to execute the encoding method according to claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The present invention is described in more detail by reference to the following figures:

(2) FIG. 1: overview of the coding structure;

(3) FIG. 2: detailed description of the encoding method using a simple example.

DETAILED DESCRIPTION

(4) FIG. 1 shows the basic coding structure for encoding and decoding. The encoder receives as input quality values q, mapping positions p, CIGAR strings c, nucleotide sequences s and optionally the reference sequence(s) r, as defined e.g. in the SAM format specification. The codebook identifiers k are computed by module G, which receives as input the quality values q, the mapping positions p, the CIGAR strings c, the nucleotide sequences s and optionally the reference sequence(s) r. The codebook identifiers k then control the quantization module Q, which quantizes the quality values q and outputs quantized quality values i.

(5) The codebook identifier k is used to quantize all quality values associated with index locus l, whereby a high codebook identifier k is associated with a codebook comprising a high number of representative values. In other words, a high codebook identifier k yields fine quantization, and a low one yields coarse quantization.

(6) To compute the codebook identifier k the proposed method infers the genotype uncertainty at locus l from the observable data using a statistical model. Given the sequencing depth N at locus l, the immediate observable data are the read-out nucleotides and the associated quality values of all reads overlapping locus l considering the information in the CIGAR strings. The genotype uncertainty can be regarded as a metric M that measures the likeliness that a unique genotype is the correct one.

(7) More specifically, assume a set of reads that are aligned to a reference sequence or that were aligned by a de-novo assembler. Further assume that the reads are sorted by their mapping positions. Given such a set of reads, let N denote the number of reads covering locus l. Let $n_j$ be the symbol from read j covering locus l and $q_j$ the corresponding quality value. The observable data at locus l can be written as $(n, q) = \{(n_j, q_j)\}_{j=1}^{N}$.

(8) For each locus l, a metric M=M(n, q) (the genotype uncertainty) can be computed. Then, the codebook identifier k is computed from the metric M as
$$k = f(M(n, q)),$$
where f is a monotonically increasing function.

(9) That is, if the method believes that two or more different genotypes are likely to be true, then the genotype uncertainty will be high and hence k will be high, which will yield less compression at locus l. However, if there is enough evidence in the data that a particular genotype is likely the correct one, then the genotype uncertainty will be low and therefore k will be low, which will yield more compression.
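
The relationship k=f(M(n, q)) can be illustrated with a minimal sketch. The description does not fix a particular statistical model or function f, so everything below is an assumption made for illustration only: a haploid genotype model with a uniform prior, Phred-scaled quality values, M defined as one minus the largest genotype posterior, and a step function f mapping M onto the identifiers 1 to 7 (identifier 0 being reserved for loci with zero sequencing depth). The names `genotype_uncertainty` and `codebook_identifier` are hypothetical.

```python
import math

ALPHABET = "ACGT"
MAX_UNCERTAINTY = 0.75  # 1 - 1/4: all four genotypes equally likely


def genotype_uncertainty(nucleotides, qualities):
    """Metric M: one minus the largest genotype posterior (assumed model)."""
    log_like = {}
    for genotype in ALPHABET:
        total = 0.0
        for n, q in zip(nucleotides, qualities):
            # Assumption: Phred scale; clamp so q = 0 is merely uninformative.
            err = min(10.0 ** (-q / 10.0), 0.75)
            total += math.log(1.0 - err) if n == genotype else math.log(err / 3.0)
        log_like[genotype] = total
    # Normalise in a numerically stable way (uniform prior over genotypes).
    best = max(log_like.values())
    weights = [math.exp(v - best) for v in log_like.values()]
    return 1.0 - max(weights) / sum(weights)


def codebook_identifier(nucleotides, qualities, num_codebooks=7):
    """k = f(M), with f a non-decreasing step function (assumed form)."""
    if not nucleotides:
        return 0  # zero sequencing depth
    m = genotype_uncertainty(nucleotides, qualities)
    step = int(m / MAX_UNCERTAINTY * num_codebooks)
    return 1 + min(num_codebooks - 1, step)


if __name__ == "__main__":
    print(codebook_identifier(list("AAAA"), [40, 40, 40, 40]))  # unanimous reads
    print(codebook_identifier(["A", "C"], [30, 30]))            # conflicting reads
```

Under these assumptions, unanimous high-quality reads give a low k (coarse quantization), while conflicting reads of equal quality give a higher k (finer quantization), which matches the behaviour described in paragraph (9).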

(10) The quantization indexes i and the codebook identifiers k are entropy encoded by separate modules: the quantization indexes i are encoded by entropy encoder module E2, while the codebook identifiers k are encoded by entropy encoder module E1.

(11) After transmission over the transmission channel, the decoder decodes the quantization indexes using entropy decoder module D2 and decodes the codebook identifiers using entropy decoder module D1. The alignment information, i.e. the mapping positions, the CIGAR strings, and the reference sequence must be transmitted as side information to the decoder. Subsequently, using the quantization indexes, the codebook identifiers and the side information, the reconstruction module r reconstructs the quality values.
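
A minimal sketch of the reconstruction step may help show why the side information is needed: the decoder has to know how many quantized values belong to each locus, which follows from the mapping positions and CIGAR strings. The sketch assumes the per-locus sequencing depths have already been derived from that side information, and that each codebook is available as a list of representative quality values. The function name `reconstruct` and the representative values are hypothetical and are not taken from the description.

```python
def reconstruct(codebook_ids, indexes, depths, representatives):
    """Rebuild per-locus quality values from the two decoded streams.

    codebook_ids    -- one decoded codebook identifier k per locus
    indexes         -- flat list of decoded quantization indexes i
    depths          -- reads covering each locus (assumed derived from
                       mapping positions and CIGAR strings)
    representatives -- dict: k -> list of representative quality values
    """
    reconstructed = []
    pos = 0
    for k, depth in zip(codebook_ids, depths):
        locus_indexes = indexes[pos:pos + depth]
        reconstructed.append([representatives[k][i] for i in locus_indexes])
        pos += depth
    return reconstructed


if __name__ == "__main__":
    # Hypothetical codebooks: codebook k holds k + 1 representative values.
    reps = {1: [5, 30], 2: [7, 12, 25]}
    print(reconstruct([2, 0, 1], [1, 2, 0, 0, 0, 1], [4, 0, 2], reps))
```

A locus with depth 0 consumes no indexes, so its codebook identifier 0 never needs a codebook entry.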

(12) The quantized quality values (referred to above as quantization indexes i) are inputted into a quality value index stream. The codebook identifiers k are inputted into a quality value codebook stream. In single-stream entropy encoding, the quality value codebook stream and the quality value index stream are compressed blockwise with two arithmetic encoders. Here, in an example, the first arithmetic encoder models the probability distribution P(k) for the quality value codebook stream symbols K={0, ..., k}, thereby approaching the memoryless entropy of the quality value codebook stream signal. The second arithmetic encoder models the probability distribution P(i) for the symbols of the quantized quality value alphabet I, thereby approaching the memoryless entropy of the quality value index stream signal.
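
The memoryless entropy that such an arithmetic encoder approaches can be estimated from the empirical symbol frequencies of a block. The following sketch is illustrative only; the function name `memoryless_entropy` is hypothetical, and it computes the bound in bits per symbol rather than performing any arithmetic coding.

```python
from collections import Counter
from math import log2


def memoryless_entropy(symbols):
    """Empirical zeroth-order entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    index_stream = [1, 2, 0, 0]  # quantized quality values of one block
    bits = memoryless_entropy(index_stream) * len(index_stream)
    print(f"lower bound for this block: {bits:.1f} bits")
```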

(13) In context-based entropy encoding, as shown in a simple example in FIG. 2, the quality value index stream is decomposed (demultiplexed) into disjoint subsequence streams corresponding to the number of codebooks K. For example, the number of codebooks could be 7, so that the quality value index stream is decomposed into subsequence streams corresponding to the codebook identifier symbols k ∈ {1, 2, 3, 4, 5, 6, 7}. The codebook identifier symbol 0 is sent at loci with 0 sequencing depth.

(14) The example in FIG. 2 shows four reads at the specific locus l, whereby the first read at locus l has a nucleotide A, the second read a nucleotide C, the third read a nucleotide T and the last read a nucleotide T. The quality value at locus l is 10 for the first read, 21 for the second, 7 for the third and 8 for the last.

(15) Based on the quality values 10, 21, 7, 8 at the specific locus l and the nucleotides A, C, T, T, the codebook identifier k is computed. Based on the codebook identifier k, a codebook from a plurality of codebooks is determined, whereby the determined codebook relates to the codebook identifier k. In the example of FIG. 2, k=2 and the codebook with the number 2 is chosen. This codebook has the quantization indexes or quantized quality values I={0, 1, 2}. Based on the chosen codebook number 2 and the quality values {10, 21, 7, 8} at locus l, the quantized quality values i are computed using codebook number 2. Therefore, for the quality value 10 a quantized quality value 1 is determined. For the quality value 21, a quantized quality value 2 is used. For the quality values 7 and 8, the quantized quality value 0 is determined from codebook number 2.
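
The quantization in this example can be sketched as a threshold lookup. The description gives only the inputs and outputs of codebook number 2, not its decision boundaries, so the thresholds below are hypothetical values chosen solely so that the sketch reproduces the mapping of FIG. 2 (10 to 1, 21 to 2, 7 and 8 to 0). The names `CODEBOOK_THRESHOLDS` and `quantize` are likewise assumptions.

```python
from bisect import bisect_right

# Hypothetical decision thresholds: codebook k has k thresholds and
# therefore k + 1 quantization indexes {0, ..., k}.
CODEBOOK_THRESHOLDS = {
    1: [15],
    2: [9, 16],
}


def quantize(quality_values, k):
    """Map each quality value to its quantization index under codebook k."""
    thresholds = CODEBOOK_THRESHOLDS[k]
    # bisect_right counts how many thresholds are <= q, i.e. the cell index.
    return [bisect_right(thresholds, q) for q in quality_values]


if __name__ == "__main__":
    print(quantize([10, 21, 7, 8], 2))  # [1, 2, 0, 0]
```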

(16) Furthermore, seven disjoint quality value index streams (known as subsequence streams) exist. The quantized quality values {1, 2, 0, 0} at locus l are inputted into the subsequence stream corresponding to codebook number 2.

(17) Quantized quality values quantized by another codebook are inputted into the stream relating to the corresponding codebook. Hence, the quantized quality values i are grouped into seven subsequence streams, which are then separately compressed by seven arithmetic encoders that model the conditional probability distributions $p(i \mid k_i)$.
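
The grouping into subsequence streams can be sketched as a simple demultiplexing step. This is an illustrative sketch under the assumption that the quantized quality values are available per locus alongside the codebook identifier of that locus; the function name `demultiplex` is hypothetical, and the entropy coding of each resulting stream is omitted.

```python
from collections import defaultdict


def demultiplex(codebook_ids, indexes_per_locus):
    """Group quantized quality values into one stream per codebook identifier."""
    streams = defaultdict(list)
    for k, locus_indexes in zip(codebook_ids, indexes_per_locus):
        streams[k].extend(locus_indexes)
    return dict(streams)


if __name__ == "__main__":
    # First locus is the FIG. 2 example (k = 2); the others are made up.
    ids = [2, 1, 2]
    indexes = [[1, 2, 0, 0], [0, 1], [2]]
    print(demultiplex(ids, indexes))  # {2: [1, 2, 0, 0, 2], 1: [0, 1]}
```

Each resulting stream would then be passed to its own entropy encoder, so that every encoder only sees symbols produced by a single codebook.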

(18) Every codebook identifier k is associated with a specific genomic locus l. Likewise, every quantized quality value symbol i is associated with a specific genomic locus. Given a codebook identifier $k(l) = k_l$ at locus l, the possible values for all quantized quality value symbols at this locus l are also determined by $i(l) = i(k_l) \in \{0, \ldots, k_l\}$.

(19) In the current implementation, seven arithmetic encoders are used, each modelling a different conditional probability distribution. However, other entropy encoder architectures might as well be used to exploit the statistics of the quantized quality value and codebook identifier streams.
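
The benefit of modelling conditional distributions can be motivated by a standard information-theoretic inequality, given here as background rather than as a statement from the description. Treating the quantized quality value and the codebook identifier as random variables I and K (notation assumed for this note), conditioning on K never increases the entropy:

$$H(I \mid K) = \sum_{k} P(k)\, H(I \mid K = k) \le H(I).$$

Encoding each subsequence stream with its own model therefore needs, in the ideal case, no more bits per symbol than a single model of P(i) over the whole quality value index stream.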

CPC classification

G16B50/00 (PHYSICS)
H03M7/4006 (ELECTRICITY)
H03M7/30 (ELECTRICITY)
H03M7/70 (ELECTRICITY)
H03M7/40 (ELECTRICITY)
G16B30/10 (PHYSICS)
H03M7/3091 (ELECTRICITY)

International classification

H03M7/30 (ELECTRICITY)
G16B30/10 (PHYSICS)
H03M7/40 (ELECTRICITY)
G16B50/00 (PHYSICS)
