GENE SEQUENCING DATA COMPRESSION METHOD AND DECOMPRESSION METHOD, SYSTEM AND COMPUTER-READABLE MEDIUM

20200294629 · 2020-09-17

Abstract

The invention discloses a gene sequencing data compression method and decompression method, a system, and a computer-readable medium. The compression method includes: comparing a read sequence R with a reference genome to obtain the most approximate position p of the read sequence R in the reference genome and the equal-length gene character sequence CS at that position; coding the read sequence R and the equal-length gene character sequence CS and performing reversible computing on them by means of a reversible function; and compressing the most approximate position p and the reversible computing result as two data streams and outputting the compressed data streams. The decompression method is the reverse processing of the compression method. The invention further reduces the compression ratio and shortens the compression/decompression time of the algorithm, so that a better compression ratio is obtained in less time. The invention is compatible with algorithms for making comparisons between read sequences and reference genomes.

Claims

1. A gene sequencing data compression method, comprising the following implementation steps: A1) traversing a gene sequencing data sample (data) to obtain a read sequence R with a length of Lr; A2) comparing every read sequence R with the reference genome to obtain a most approximate position p of every read sequence from the reference genome, so as to obtain a most approximate equal-length gene character sequence CS of the read sequence R; coding the read sequence R and the equal-length gene character sequence CS, and then performing reversible computing by means of a reversible function, wherein output computing results coded by any pair of same characters are identical by virtue of the reversible function; and compressing the most approximate position p of the read sequence R in the reference genome and the reversible computing result that serve as two data streams, and outputting the compressed data streams.

2. The gene sequencing data compression method as recited in claim 1, wherein the step A2) comprises the following detailed steps: A2.1) traversing the gene sequencing data sample (data) to obtain a read sequence R with the length of Lr; A2.2) comparing the read sequence R with the reference genome to obtain a most approximate position p thereof from the reference genome, so as to obtain a most approximate equal-length gene character sequence CS of the read sequence R; A2.3) coding the read sequence R and the equal-length gene character sequence CS, and then performing reversible computing by means of a reversible function, wherein the output computing results coded by any pair of same characters are identical by virtue of the reversible function; A2.4) compressing the most approximate position p of the read sequence R in the reference genome and the reversible computing result that serve as two data streams, and outputting the compressed data streams; A2.5) judging whether every read sequence R in the gene sequencing data sample (data) has been traversed; if not, jumping to step A2.1); otherwise ending and exiting.

3. The gene sequencing data compression method as recited in claim 1, wherein a XOR computing or a bit subtraction is specifically applied for the reversible function.

4. The gene sequencing data compression method as recited in claim 1, wherein the compression in step A2) specifically refers to a compression using a statistical model and entropy coding.

5. A gene sequencing data decompression method, comprising the following implementation steps: B1) traversing gene sequencing data data.sub.c to be decompressed to obtain a read sequence R.sub.c to be decompressed; B2) decompressing and reconstructing every read sequence R.sub.c to be decompressed to be a most approximate position p in the reference genome and a reversible computing result CS1 with a length of Lr bit; obtaining a gene character string CS2 with the length of Lr bit in the reference genome according to the most approximate position p in the reference genome; performing reverse computing for the reversible computing result CS1 and the gene character string CS2 by virtue of an inverse function of the reversible function, so as to obtain and output an original read sequence R of the corresponding read sequence R.sub.c to be decompressed, wherein the output computing results coded by any pair of same characters are identical by virtue of the reversible computing.

6. The gene sequencing data decompression method as recited in claim 5, wherein the step B2) comprises the following detailed steps: B2.1) traversing gene sequencing data data.sub.c to be decompressed to obtain a read sequence R.sub.c to be decompressed; B2.2) decompressing and reconstructing the read sequence R.sub.c to be decompressed to a most approximate position p in the reference genome and the reversible computing result CS1 with a length of Lr bit; B2.3) obtaining a gene character string CS2 with the length of Lr bit from the reference genome according to the most approximate position p in the reference genome; B2.4) performing reverse computing for the reversible computing result CS1 and the gene character string CS2 by virtue of an inverse function of the reversible function, so as to obtain and output an original read sequence R of the corresponding read sequence R.sub.c to be decompressed, wherein the output computing results coded by any pair of same characters are identical by virtue of the reversible computing; B2.5) judging whether every read sequence R.sub.c to be decompressed in the gene sequencing data sample data.sub.c to be decompressed has been traversed; if not, jumping to step B2.1); otherwise ending and exiting.

7. The gene sequencing data decompression method as recited in claim 5, wherein an XOR function or a bit subtraction function is specifically applied for the reversible function; an inverse function of the XOR function is the XOR function, and an inverse function of the bit subtraction function is a bit addition function.

8. The gene sequencing data decompression method as recited in claim 5, wherein the decompression and reconstruction in step B2) specifically refer to decompression and reconstructing using inverse algorithms of a statistical model and entropy coding.

9. A gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the gene sequencing data compression method as recited in claim 1.

10. A computer-readable medium on which a computer program is stored, wherein the computer program enables a computer to perform the steps of the gene sequencing data compression method as recited in claim 1.

11. The gene sequencing data compression method as recited in claim 2, wherein a XOR computing or a bit subtraction is specifically applied for the reversible function.

12. The gene sequencing data decompression method as recited in claim 6, wherein an XOR function or a bit subtraction function is specifically applied for the reversible function; an inverse function of the XOR function is the XOR function, and an inverse function of the bit subtraction function is a bit addition function.

13. A gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the gene sequencing data compression method as recited in claim 2.

14. A gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the gene sequencing data compression method as recited in claim 3.

15. A gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the gene sequencing data compression method as recited in claim 4.

16. A gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the gene sequencing data decompression method as recited in claim 5.

17. A gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the gene sequencing data decompression method as recited in claim 6.

18. A gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the gene sequencing data decompression method as recited in claim 7.

19. A gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the gene sequencing data decompression method as recited in claim 8.

20. A computer-readable medium on which a computer program is stored, wherein the computer program enables a computer to perform the steps of the gene sequencing data compression method as recited in claim 2.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0038] FIG. 1 is a basic schematic diagram of a compression method in the embodiments of the present invention.

[0039] FIG. 2 is a basic schematic diagram of a decompression method in the embodiments of the present invention.

DETAILED DESCRIPTION

[0040] By referring to FIG. 1, the gene sequencing data compression method of this embodiment comprises the following implementation steps:

[0041] A1) traversing a gene sequencing data sample (data) to obtain a read sequence R with a length of Lr;

[0042] A2) comparing every read sequence R with the reference genome to obtain the most approximate position p of every read sequence from the reference genome, so as to obtain the most approximate equal-length gene character sequence CS of the read sequence R; coding the read sequence R and the equal-length gene character sequence CS, and then performing reversible computing by means of the reversible function, wherein the output computing results coded by any pair of same characters are identical by virtue of the reversible function; and compressing the most approximate position p of the read sequence R in the reference genome and the reversible computing result that serve as two data streams, and outputting the compressed data streams.

[0043] The gene sequencing data compression method in this embodiment further reduces the compression ratio and shortens the compression/decompression time of the algorithm, so that a better compression ratio is obtained in less time. The present invention is also compatible with algorithms for making comparisons between read sequences and reference genomes.

[0044] In this embodiment, step A2) comprises the following detailed steps:

[0045] A2.1) traversing the gene sequencing data sample (data) to obtain the read sequence R with the length of Lr;

[0046] A2.2) comparing the read sequence R with the reference genome to obtain the most approximate position p thereof from the reference genome, so as to obtain the most approximate equal-length gene character sequence CS of the read sequence R;

[0047] A2.3) coding the read sequence R and the equal-length gene character sequence CS, and then performing reversible computing by means of a reversible function, wherein the output computing results coded by any pair of same characters are identical by virtue of the reversible function;

[0048] A2.4) compressing the most approximate position p of the read sequence R in the reference genome and the reversible computing result that serve as two data streams, and outputting the compressed data streams;

[0049] A2.5) judging whether the read sequence R in the gene sequencing data sample (data) is traversed, if not, jumping to step A2.1); otherwise ending and exiting.
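Steps A2.1) to A2.5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the helper names (`closest_position`, `compress_reads`), the brute-force mismatch search that stands in for a real comparison algorithm, and the sample reference and reads are all hypothetical. The statistical model and entropy coding of step A2.4) are omitted, so the two streams are returned uncompressed.

```python
# Hypothetical sketch of steps A2.1) to A2.5); names and data are illustrative.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # 2-bit coding of gene letters


def mismatches(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_position(read: str, reference: str) -> int:
    """Brute-force stand-in for a comparison algorithm: the first position p
    whose equal-length reference substring has the fewest mismatches."""
    n = len(read)
    return min(range(len(reference) - n + 1),
               key=lambda p: mismatches(read, reference[p:p + n]))


def compress_reads(reads, reference):
    """Return the two streams of step A2.4): positions and XOR results."""
    positions, residuals = [], []
    for read in reads:                                   # A2.1) next read R
        p = closest_position(read, reference)            # A2.2) position p
        cs = reference[p:p + len(read)]                  # A2.2) sequence CS
        residuals.append([CODE[r] ^ CODE[c]              # A2.3) XOR of codes
                          for r, c in zip(read, cs)])
        positions.append(p)                              # A2.4) two streams
    return positions, residuals                          # A2.5) all reads done


reference = "ACGTACGTTAGC"
reads = ["GTAC", "TTAG", "ACGA"]
positions, residuals = compress_reads(reads, reference)
print(positions)   # position stream
print(residuals)   # reversible computing result stream
```

Reads that match the reference exactly produce an all-zero result, and a read with one substitution produces a single non-zero code, which is what makes the second stream easy to compress.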

[0050] In this embodiment, XOR computing or bit subtraction is specifically applied for the reversible function.

[0051] In this embodiment, compression in step A2) specifically refers to compression using a statistical model and entropy coding.
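The source does not specify which statistical model or entropy coder is used. As a hedged illustration of why the reversible computing result suits such a compressor, the sketch below estimates the Shannon entropy of an order-0 frequency model, which is the bound an ideal entropy coder would approach. The function name `order0_entropy` and both sample streams are assumptions made for illustration only.

```python
# Hypothetical illustration: order-0 model entropy of raw codes vs. XOR results.
import math
from collections import Counter


def order0_entropy(symbols):
    """Shannon entropy in bits per symbol under an order-0 frequency model."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


# Assumed sample data: 16 raw 2-bit codes with all four letters equally likely,
# and 16 XOR results in which only one position differs from the reference.
raw_codes = [0, 1, 2, 3] * 4
residual_codes = [0] * 15 + [3]

raw_bits = order0_entropy(raw_codes)
residual_bits = order0_entropy(residual_codes)
print(raw_bits, residual_bits)
```

Under these assumptions the raw codes need 2 bits per symbol, while the mostly-zero result stream needs well under 1 bit per symbol.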

[0052] By referring to FIG. 2, the gene sequencing data decompression method of this embodiment comprises the following implementation steps:

[0053] B1) traversing gene sequencing data (data.sub.c) to be decompressed to obtain a read sequence R.sub.c to be decompressed;

[0054] B2) decompressing and reconstructing every read sequence R.sub.c to be decompressed to be a most approximate position p in the reference genome and a reversible computing result CS1 with a length of Lr bit; obtaining a gene character string CS2 with the length of Lr bit in the reference genome according to the most approximate position p in the reference genome; performing reverse computing for the reversible computing result CS1 and the gene character string CS2 by virtue of an inverse function of the reversible function, so as to obtain and output an original read sequence R of the corresponding read sequence R.sub.c to be decompressed, wherein the output computing results coded by any pair of same characters are identical by virtue of the reversible computing.

[0055] In this embodiment, step B2) comprises the following detailed steps:

[0056] B2.1) traversing gene sequencing data (data.sub.c) to be decompressed to obtain the read sequence R.sub.c to be decompressed;

[0057] B2.2) decompressing and reconstructing the read sequence R.sub.c to be decompressed to the most approximate position p in the reference genome and the reversible computing result CS1 with the length of Lr bit;

[0058] B2.3) obtaining the gene character string CS2 with the length of Lr bit from the reference genome according to the most approximate position p in the reference genome;

[0059] B2.4) performing reverse computing for the reversible computing result CS1 and the gene character string CS2 by virtue of the inverse function of the reversible function, so as to obtain and output the original read sequence R of the corresponding read sequence R.sub.c to be decompressed, wherein the output computing results coded by any pair of same characters are identical by virtue of the reversible computing;

[0060] B2.5) judging whether the read sequence R.sub.c to be decompressed in the gene sequencing data sample (data.sub.c) to be decompressed is traversed, if not, jumping to step B2.1); otherwise ending and exiting.
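Steps B2.1) to B2.5) can be sketched in the same way. This is a minimal illustration that assumes the entropy decoding of step B2.2) has already produced the position stream and the result stream; the function name `decompress_reads` and the sample data are hypothetical, and XOR is used as the reversible function.

```python
# Hypothetical sketch of steps B2.1) to B2.5); names and data are illustrative.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}   # 2-bit coding
LETTER = {value: letter for letter, value in CODE.items()}


def decompress_reads(positions, residuals, reference):
    """Rebuild each original read R from its position p and XOR result CS1."""
    reads = []
    for p, cs1 in zip(positions, residuals):           # B2.1), B2.2)
        cs2 = reference[p:p + len(cs1)]                 # B2.3) reference string
        read = "".join(LETTER[x ^ CODE[c]]              # B2.4) XOR is its own
                       for x, c in zip(cs1, cs2))       #       inverse
        reads.append(read)
    return reads                                        # B2.5) all reads done


reference = "ACGTACGTTAGC"
positions = [2, 7, 0]
residuals = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 3]]
restored = decompress_reads(positions, residuals, reference)
print(restored)
```

The decompressor needs the same reference genome and the same letter coding as the compressor; with those, the original reads are restored exactly.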

[0061] An XOR function or a bit subtraction function is specifically applied as the reversible function. The inverse function of the XOR function is the XOR function itself, and the inverse function of the bit subtraction function is a bit addition function. In this embodiment, XOR computing is specifically applied for the reversible computing, and the gene letters A, C, G and T are coded as 00, 01, 10 and 11, respectively. For instance, if a certain gene letter is A and the prediction character c is also A, the XOR operation result (reversible computing result) for this position is 00; otherwise, the XOR operation result varies with the input characters. In decompression, the XOR operation (the inverse function of the XOR function) is performed again on the coding of the prediction character c and the XOR operation result (reversible computing result), which restores the original gene character. Coding A, C, G and T as 00, 01, 10 and 11 is a preferred, streamlined coding; other binary codings may also be applied, as needed, for the reversible conversion among the gene characters, prediction characters and reversible computing results. Subtraction may be applied for the reversible computing instead of XOR computing, in which case the inverse computing is addition, and the same reversible conversion among the gene characters, prediction characters and reversible computing results is achieved.
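The two reversible functions can be checked over all sixteen letter pairs with a short sketch. The function names are hypothetical, and reading "bit subtraction" as subtraction modulo 4 on the 2-bit codes is an assumption, since the source does not define the operation in detail.

```python
# Hypothetical sketch: XOR and modulo-4 subtraction on 2-bit letter codes.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTER = {value: letter for letter, value in CODE.items()}


def xor_forward(r: int, c: int) -> int:
    return r ^ c


def xor_inverse(x: int, c: int) -> int:
    return x ^ c            # XOR is its own inverse


def sub_forward(r: int, c: int) -> int:
    return (r - c) % 4      # assumed meaning of "bit subtraction"


def sub_inverse(x: int, c: int) -> int:
    return (x + c) % 4      # addition undoes the subtraction


for gene in "ACGT":
    for prediction in "ACGT":
        r, c = CODE[gene], CODE[prediction]
        # Both functions restore the original gene letter from the result.
        assert LETTER[xor_inverse(xor_forward(r, c), c)] == gene
        assert LETTER[sub_inverse(sub_forward(r, c), c)] == gene
        # Identical characters always give the same result, 00.
        if gene == prediction:
            assert xor_forward(r, c) == 0 and sub_forward(r, c) == 0

print(format(xor_forward(CODE["A"], CODE["A"]), "02b"))
print(format(xor_forward(CODE["A"], CODE["T"]), "02b"))
```

With either function, a position where the read agrees with the reference maps to 00, so the choice between them does not change the property that the method relies on.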

[0062] In this embodiment, decompression and reconstruction in step B2) specifically refer to decompression and reconstructing using inverse algorithms of a statistical model and entropy coding.

[0063] Besides, this embodiment further provides a gene sequencing data decompression system, comprising a computer system, wherein the computer system is programmed to perform the steps of the aforesaid gene sequencing data compression method or the aforesaid gene sequencing data decompression method of the present invention.

[0064] Besides, this embodiment further provides a computer-readable medium on which a computer program is stored, wherein the computer program enables a computer to perform the steps of the aforesaid gene sequencing data compression method or the aforesaid gene sequencing data decompression method of the present invention.

[0065] The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the embodiments described above. All technical solutions under the ideas of the present invention fall within the protection scope of the present invention. It should be pointed out that improvements and modifications made by a person of ordinary skill in the art without departing from the principle of the present invention shall also be deemed to fall within the protection scope of the present invention.