GENE SEQUENCING QUALITY LINE DATA COMPRESSION PRE-PROCESSING AND DECOMPRESSION AND RESTORATION METHODS, AND SYSTEM
20200402618 · 2020-12-24
Assignee
Inventors
- Yanhuang Jiang (Hunan, CN)
- Zhuo Song (Hunan, CN)
- Gen Li (Hunan, CN)
- Qiangli ZHAO (Hunan, CN)
- Bolun Feng (Hunan, CN)
- Hongwei TANG (Hunan, CN)
- Xiali Xu (Hunan, CN)
- Haibo Mao (Hunan, CN)
Cpc classification
H03M7/30
ELECTRICITY
International classification
G16B30/00
PHYSICS
Abstract
This invention relates to a gene sequencing quality line data compression pre-processing and decompression and restoration method, and a system, wherein the basic principle of the gene sequencing quality line data compression pre-processing and decompression and restoration is to extract several columns from an inputted quality line document or data block to act as index columns, and then perform rearrangement on all quality line data, all quality lines having a same index column being one group and being arranged together according to their relative positions in the original data block. Since quality line data having a same index column is usually more similar, the data reorganization means can arrange similar gene sequencing data together, so as to increase local similarity of the data.
Claims
1. A method of gene sequencing quality line data compression pre-processing, wherein the implementation steps comprise: 1) reading an original data block (Data) of the quality line data and determining an index column numbers (Index_No) thereof; 2) establishing an index information table (IIT) according to an index columns of the original data block (Data); 3) according to the index information table (IIT), regrouping quality lines in the original data block (Data) according to an index column information, and deleting portion of an index column data to obtain a regrouped data (Grouped_Data); 4) extracting the index column data (Index_Data) of the original data block (Data), and exporting the index column numbers (Index_No), the index column data (Index_Data) of the original data block (Data) and the regrouped data (Grouped_Data) as compression pre-processing results.
2. The method of gene sequencing quality line data compression pre-processing of claim 1, wherein step 2) comprises the following detailed steps: 2.1) initializing the number of entries of the index information table (IIT) to be 0, and including serial numbers, the index column information (Index), and variables (num, start and temp) in entries of the index information table (IIT) structurally, wherein the variable (num) is a number of quality lines having the corresponding index column information; the variable (start) indicates initial locations of the quality lines having the index column information after grouping; the variable (temp) is the number of quality lines having the corresponding index column information in regrouping; 2.2) initializing a current quality line number (i) of the original data block (Data) to be 0; 2.3) sequentially scanning a current quality line (Data[i]) in the original data block (Data), and jumping to execute step 2.6) if reaching the end of the original data block (Data); otherwise, taking out the index column information (Index) of the current quality line (Data[i]), wherein (Data[i]) refers to contents of the current quality line (i) in the original data block (Data); adding 1 to the current quality line number (i); 2.4) searching all entries in the index information table (IIT), adding 1 to the variable (num) of an entry (j) if the index column information of a certain entry (j) of the index information table (IIT) is equal to the index column information (Index) of the current quality line (Data[i]), and jumping to execute step 2.3); otherwise, jumping to execute step 2.5); 2.5) establishing a new entry (k) in the index information table (IIT), setting index column information (IIT[k].Index) of an entry (k) to be equal to the index column information (Index) of the current quality line (Data[i]), and the variable (num) of the entry (k) to be equal to 1, and adding 1 to a serial number (k); jumping to execute step 2.3); 2.6) initializing the current entry (j) of the index information table (IIT) to be 0; 2.7) sequentially scanning the entries of the index information table (IIT), setting corresponding grouping start positions for all index column information, and ending this step and jumping to execute step 3) if reaching the end of the index information table (IIT); otherwise, with respect to the entry (j) scanned currently in the index information table (IIT), setting a value of the variable (start) of the entry (j) to be 0 and a value of variable (temp) to be 0 if a serial number (j) of the entry is 0, and adding 1 to the serial number (j) of the current entry; jumping to continue with step 2.7); otherwise setting the value of the variable (start) of the entry (j) to be the sum of the variable (start) and the variable (num) of the previous entry (j-1) and the value of the variable (temp) of the entry (j) to be 0, adding 1 to the serial number (j) of the current entry, and jumping to continue with step 2.7).
3. The method of gene sequencing quality line data compression pre-processing of claim 1, wherein step 3) comprises the following detailed steps: 3.1) allocating a space for the regrouped data (Grouped_Data), wherein a number of lines thereof is the same as that of the original data block (Data); 3.2) initializing a value of a current quality line number (i) of the original data block (Data) to be 0; 3.3) scanning a current quality line of the original data block (Data), wherein a current quality line data is Data[i], and (i) is the current quality line number; taking out the index column information (Index) of the current quality line Data[i]; 3.4) searching an entry (j), an index information of which is the same as Index, in the index information table (IIT); 3.5) inserting the quality line data, the index column information of which is deleted, into the regrouped data (Grouped_Data), wherein a value of an insertion position (k) is the sum of the variables (start and temp) of the entry (j); adding 1 to a value of the variable (temp) of the entry (j); 3.6) adding 1 to the line number (i), judging whether the line number (i) is more than a total line number of the original data block (Data), and jumping to execute step 3.3) if the total line number of the original data block (Data) is not exceeded; otherwise, jumping to execute step 4).
4. A method of gene sequencing quality line data decompression and restoration, wherein the implementation steps comprise: S1) reading decompressed index column data (Index_Data), regrouped data (Grouped_Data) and index column numbers (Index_No), determining a number of the quality line of an original data block (Data) and character data of each line based on the regrouped data (Grouped_Data) and index column number (Index_No), and allocating a space for storage of the original data block (Data); S2) according to the index column numbers (Index_No), respectively assigning each column of data among the index column data (Index_Data) to the corresponding column, the number of which is recorded by Index_No, in the original data block (Data); S3) establishing an index information table (IIT) according to the index column data (Index_Data); S4) sequentially scanning each line of data among the regrouped data (Grouped_Data) according to the index information table (IIT), determining a position of the line in the original data block according to the index information table (IIT) and the index column data (Index_Data), and writing the same into the corresponding quality line of the original data block (Data); S5) exporting the original data block (Data).
5. The method of gene sequencing quality line data decompression and restoration of claim 4, wherein step S3) comprises the following detailed steps: S3.1) initializing a value of an entry number (k) of the index information table (IIT) to be 0, and including serial numbers, index column information (Index), and variables (num, start and temp) in entries of the index information table (IIT) structurally, wherein the variable (num) is a number of quality lines having the corresponding index column information; the variable (start) indicates initial locations of the quality lines having the index column information after grouping; the variable (temp) is the number of quality lines having the corresponding index column information in data restoration; S3.2) initializing a value of a current line number (i) of the index column data (Index_Data) to be 0; S3.3) sequentially scanning the index column data (Index_Data), and jumping to execute step S3.6) if reaching the end of the index column data (Index_Data); otherwise, taking out a current index column information (Index_Data[i]) corresponding to the current line in the index column data (Index_Data); S3.4) searching all entries in the index information table (IIT), adding 1 to the variable (num) of an entry (j) if the index column information (Index) of the entry (j) is the same as the current index column information (Index_Data[i]), and jumping to execute step S3.3); otherwise, jumping to execute step S3.5); S3.5) establishing a new entry (k) for the index information table (IIT), wherein the index column information (Index) of the entry (k) is equal to the current index column information (Index_Data[i]), and the variable (num) is equal to 1; adding 1 to the entry number (k), and jumping to execute step S3.3); S3.6) initializing a current entry (j) of the index information table (IIT) to be 0; S3.7) sequentially scanning the index information table (IIT), and setting corresponding grouping start position for the current index column information; if reaching the end of the index information table (IIT), jumping to step S4); otherwise, with respect to the entry (j) in the index information table (IIT): if a serial number (j) of the entry (j) is 0, setting the variables (start and temp) to be 0, adding 1 to the serial number (j), and jumping to continue with step S3.7); otherwise, setting the variable (start) of the entry (j) to be the sum of the variable (start) and the variable (num) of the previous entry (j-1), wherein the variable (temp) of the entry (j) is 0, adding 1 to the serial number (j), and jumping to continue with step S3.7).
6. The method of gene sequencing quality line data decompression and restoration of claim 4, wherein step S4) comprises the following detailed steps: S4.1) initializing a value of a current line number (k) of the regrouped data (Grouped_Data) to be 0; S4.2) obtaining the index column information of the regrouped data (Grouped_Data[k]): if reaching the end of regrouped data (Grouped_Data), jumping to execute step S5); otherwise, scanning the index information table (IIT) to find out an entry (j) of the index information table (IIT) to make it conform to that: a value of a line number (k) is more than or equal to a value of the variable (start) of the entry (j), and less than or equal to the sum of values of the variable (start) of the entry (j) and the variable (num) thereof, wherein the index column information corresponding to data (Grouped_Data[k]) of the current line in the regrouped data (Grouped_Data) is the index column information (Index) of the entry (j); S4.3) combining the data (Grouped_Data[k]) of the current line in the regrouped data (Grouped_Data) and the index column information (Index) of the entry (j) to generate a complete quality line (Temp_Read); S4.4) obtaining an occurrence order (r) in the quality line having the same index column information of the complete quality line (Temp_Read) in the original data block (Data), wherein a value of the occurrence order (r) is a differential value between the current line number (k) and the variable (start) of the entry (j); S4.5) sequentially scanning the index column data (Index_Data) to find out the r-th index column information to be an entry (t) of the index column information (Index) of the entry (j) in the index information table (IIT), so as to determine a line number (t) of the complete quality line (Temp_Read) in the original data block; S4.6) writing the complete quality line (Temp_Read) to the line number (t) of the original data block (Data); S4.7) adding 1 to the current line number (k) of the regrouped data (Grouped_Data); S4.8) judging whether the current line number (k) is more than the maximum line number of the regrouped data (Grouped_Data), and jumping to execute step S4.2) if failing to exceed the maximum line number of the regrouped data (Grouped_Data); otherwise, jumping to execute step S5).
7. A gene sequencing quality line data compression system, comprising a computer system, wherein the computer equipment is programmed to execute the steps of the method of gene sequencing quality line data compression pre-processing of claim 1.
8. A gene sequencing quality line data compression system, comprising a computer system, wherein the computer equipment is programmed to execute the steps of the method of gene sequencing quality line data decompression and restoration of claim 4.
9. A gene sequencing quality line data compression system, comprising a computer system, wherein the computer equipment is programmed to execute the steps of the method of gene sequencing quality line data compression pre-processing of claim 2.
10. A gene sequencing quality line data compression system, comprising a computer system, wherein the computer equipment is programmed to execute the steps of the method of gene sequencing quality line data compression pre-processing of claim 3.
11. A gene sequencing quality line data compression system, comprising a computer system, wherein the computer equipment is programmed to execute the steps of the method of gene sequencing quality line data decompression and restoration of claim 5.
12. A gene sequencing quality line data compression system, comprising a computer system, wherein the computer equipment is programmed to execute the steps of the method of gene sequencing quality line data decompression and restoration of claim 6.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0067]
[0068]
DESCRIPTION OF THE EMBODIMENTS
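[0069] As shown in the drawings, the implementation steps of the gene sequencing quality line data compression pre-processing method in this embodiment comprise: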
[0070] 1) reading an original data block (Data) of the quality line data and determining the index column numbers (Index_No) thereof;
[0071] 2) establishing an index information table (IIT) according to the index columns of the original data block (Data);
[0072] 3) according to the index information table (IIT), regrouping the quality lines in the original data block (Data) according to the index column information, and deleting the index column portion of the data to obtain the regrouped data (Grouped_Data);
[0073] 4) extracting the index column data (Index_Data) of the original data block (Data), and exporting the index column numbers (Index_No), the index column data (Index_Data) of the original data block (Data) and the regrouped data (Grouped_Data) as the compression pre-processing results.
[0074] In this embodiment, the index column numbers (Index_No) in step 1) are determined by the function:
[0075] Get_Index_Column(Data)
[0076] By default, the function (Get_Index_Column) directly returns the first 5 columns of the quality line data as the index columns, that is, Index_No={0,1,2,3,4}. Other columns, or a different number of columns, can also be specified as needed.
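The default behaviour described above can be sketched in Python as follows. This is only an illustration: the lower-case name `get_index_column`, the hypothetical `n_cols` parameter, the guard against short lines, and the representation of quality lines as Python strings are assumptions, not part of the original text. The helper `get_index` mirrors the `get_index(Data[i], Index_No)` notation used in the detailed steps.

```python
def get_index_column(data, n_cols=5):
    """Return the default index column numbers: the first n_cols columns.

    n_cols is a hypothetical parameter; the text only fixes the default of 5.
    """
    if not data:
        return []
    # Assumption: never request more columns than the shortest line has.
    width = min(len(line) for line in data)
    return list(range(min(n_cols, width)))


def get_index(line, index_no):
    """Pick the characters of one quality line at the index column numbers."""
    return "".join(line[c] for c in index_no)


data = ["IIIIIHHHGG", "IIIIIFFFFF", "HHHHHGGGFF"]
index_no = get_index_column(data)
print(index_no)                      # [0, 1, 2, 3, 4]
print(get_index(data[0], index_no))  # IIIII
```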
[0077] In this embodiment, step 2) includes the following detailed steps:
[0078] 2.1) initializing the number of entries of the index information table (IIT) to be 0, and including serial numbers, index column information (Index), and variables (num, start and temp) in entries of the index information table (IIT) structurally, wherein the variable (num) is the number of quality lines having the corresponding index column information; the variable (start) indicates the initial locations of the quality lines having the index column information after grouping; the variable (temp) is the number of quality lines having the corresponding index column information that have been placed so far during regrouping;
[0079] 2.2) initializing the current quality line number (i) of the original data block (Data) to be 0;
[0080] 2.3) sequentially scanning the current quality line (Data[i]) in the original data block (Data), and jumping to execute step 2.6) if reaching the end of the original data block (Data); otherwise, taking out the index column information (Index) of the current quality line (Data[i]), wherein (Data[i]) refers to the contents of the current quality line (i) in the original data block (Data), namely Index=get_index(Data[i], Index_No); adding 1 to the current quality line number (i);
[0081] 2.4) searching all entries in the index information table (IIT), adding 1 to the variable (num) of the entry (j) (IIT[j].num=IIT[j].num+1) if the index column information of a certain entry (j) of the index information table (IIT) is equal to the index column information (Index) of the current quality line (Data[i]) (IIT[j].Index=Index), and jumping to execute step 2.3); otherwise, jumping to execute step 2.5);
[0082] 2.5) establishing a new entry (k) in the index information table (IIT), setting the index column information (IIT[k].Index) of the entry (k) to be equal to the index column information (Index) of the current quality line (Data[i]) (IIT[k].Index=Index), and the variable (num) of the entry (k) to be equal to 1 (IIT[k].num=1), and adding 1 to the serial number (k) (k=k+1); jumping to execute step 2.3);
[0083] 2.6) initializing the current entry (j) of the index information table (IIT) to be 0;
[0084] 2.7) sequentially scanning the entries of the index information table (IIT), setting corresponding grouping start positions for all index column information, and ending this step and jumping to execute step 3) if reaching the end of the index information table (IIT); otherwise, with respect to the entry (j) scanned currently in the index information table (IIT), setting the value of the variable (start) of the entry (j) to be 0 and the value of variable (temp) to be 0 if the serial number of the entry (j) is 0, and adding 1 to (j), namely:
[0085] IIT[j].start=0; IIT[j].temp=0; j=j+1; jumping to continue with step 2.7);
[0086] otherwise setting the value of the variable (start) of the entry (j) to be the sum of the variables (start and num) of the previous entry (j-1) and the variable (temp) of the entry (j) to be 0, adding 1 to (j), namely:
[0087] IIT[j].start=IIT[j-1].start+IIT[j-1].num; IIT[j].temp=0; j=j+1; jumping to continue with step 2.7).
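Steps 2.1) to 2.7) can be sketched in Python as below. This is a minimal illustration under stated assumptions: the `Entry` dataclass, the function names `get_index` and `build_iit`, and the use of strings for quality lines are hypothetical choices made for the example. The linear search over the table follows the text; a hash map keyed by the index column information would be an obvious alternative in practice.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    index: str      # index column information of the group
    num: int = 0    # number of quality lines in the group
    start: int = 0  # first position of the group after regrouping
    temp: int = 0   # lines of the group already placed


def get_index(line, index_no):
    return "".join(line[c] for c in index_no)


def build_iit(data, index_no):
    iit = []                                  # 2.1) empty table
    for line in data:                         # 2.2), 2.3) scan every line
        index = get_index(line, index_no)
        for entry in iit:                     # 2.4) search existing entries
            if entry.index == index:
                entry.num += 1
                break
        else:                                 # 2.5) no match: new entry
            iit.append(Entry(index=index, num=1))
    for j, entry in enumerate(iit):           # 2.6), 2.7) start positions
        entry.start = 0 if j == 0 else iit[j - 1].start + iit[j - 1].num
        entry.temp = 0
    return iit


data = ["AAx1", "BBy1", "AAx2", "CCz1", "BBy2", "AAx3"]
iit = build_iit(data, [0, 1])
print([(e.index, e.num, e.start) for e in iit])
# [('AA', 3, 0), ('BB', 2, 3), ('CC', 1, 5)]
```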
[0088] In this embodiment, step 3) includes the following detailed steps:
[0089] 3.1) allocating a space for the regrouped data (Grouped_Data), wherein the number of lines thereof is the same as that of the original data block (Data);
[0090] 3.2) initializing the value of the current quality line number (i) of the original data block (Data) to be 0;
[0091] 3.3) scanning the current quality line of the original data block (Data), wherein the data of the current quality line is Data[i], and (i) is the current quality line number; taking out the index column information (Index) of the current quality line Data[i];
[0092] 3.4) searching the entry (j), the index information of which is the same as (Index), in the index information table (IIT) (namely in conformity with IIT[j].Index=Index);
[0093] 3.5) inserting the quality line data, the index column information of which is deleted, into the regrouped data (Grouped_Data) (Grouped_Data[k]=delete_index(Data[i], Index_No)), wherein the value of the insertion position (k) is the sum of the variables (start and temp) of the entry (j) (k=IIT[j].start+IIT[j].temp); adding 1 to the value of the variable (temp) of the entry (j) (IIT[j].temp=IIT[j].temp+1);
[0094] 3.6) adding 1 to the line number (i) (i=i+1), judging whether the line number (i) is more than the total line number of the original data block (Data), and jumping to execute step 3.3) if the total line number of the original data block (Data) is not exceeded; otherwise, jumping to execute step 4).
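The regrouping loop of steps 3.1) to 3.6) can be sketched as follows. The sketch makes some assumptions for brevity: the index information table is held in a Python dict keyed by the index column information (insertion-ordered, so groups keep their first-seen order), and the helper names `delete_index`, `build_iit` and `regroup` are illustrative rather than taken from the text.

```python
def get_index(line, index_no):
    return "".join(line[c] for c in index_no)


def delete_index(line, index_no):
    """Return the quality line with its index columns removed."""
    drop = set(index_no)
    return "".join(ch for c, ch in enumerate(line) if c not in drop)


def build_iit(data, index_no):
    """Compact stand-in for step 2): index -> num, start, temp."""
    iit = {}
    for line in data:
        entry = iit.setdefault(get_index(line, index_no),
                               {"num": 0, "start": 0, "temp": 0})
        entry["num"] += 1
    pos = 0
    for entry in iit.values():   # dicts keep first-seen order
        entry["start"] = pos
        pos += entry["num"]
    return iit


def regroup(data, index_no, iit):
    grouped = [None] * len(data)                  # 3.1) same line count
    for line in data:                             # 3.2), 3.3), 3.6)
        entry = iit[get_index(line, index_no)]    # 3.4) find entry j
        k = entry["start"] + entry["temp"]        # 3.5) insertion position
        grouped[k] = delete_index(line, index_no)
        entry["temp"] += 1
    return grouped


data = ["AAx1", "BBy1", "AAx2", "CCz1", "BBy2", "AAx3"]
index_no = [0, 1]
grouped = regroup(data, index_no, build_iit(data, index_no))
print(grouped)  # ['x1', 'x2', 'x3', 'y1', 'y2', 'z1']
```

Within each group the lines keep their relative order from the original block, which is the property the restoration method relies on.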
[0095] In this embodiment, when the index column data (Index_Data) of the original data block (Data) is extracted in step 4), the index columns of all quality lines are taken out from the original data block (Data) in ascending order of the index column numbers (Index_No) to obtain the index column data (Index_Data), namely Index_Data=get_index_all(Data, Index_No); finally, the index column numbers (Index_No), the index column data (Index_Data) of the original data block (Data) and the regrouped data (Grouped_Data) are exported as the compression pre-processing results.
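A possible reading of get_index_all is sketched below. The lower-case Python function and the list-of-strings result are assumptions made for illustration; the only behaviour taken from the text is that the columns are read in ascending order of their column numbers and that one entry is produced per quality line, in the original line order.

```python
def get_index_all(data, index_no):
    """Extract the index columns of every quality line, in original line order."""
    cols = sorted(index_no)  # ascending column numbers, as the text requires
    return ["".join(line[c] for c in cols) for line in data]


data = ["ABx1", "CDy1", "ABx2"]
index_data = get_index_all(data, [1, 0])  # column order in Index_No does not matter
print(index_data)  # ['AB', 'CD', 'AB']
```

Because Index_Data is kept in the original line order while Grouped_Data is reordered, the two streams together carry enough information to undo the regrouping.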
[0096] The gene sequencing quality line data compression pre-processing method in this embodiment is a Grouped by Index Columns (GIC) based compression pre-processing method. Its basic idea is to extract several columns from an input quality line document or data block to act as index columns, and then to rearrange all quality line data so that all quality lines having the same index column form one group and are arranged together according to their relative positions in the original data block. Since quality lines having the same index column are usually more similar, this reorganization places similar quality line data together in the gene sequencing result and so increases the local similarity of the data. The compression efficiency of the gene sequencing data can be further improved by performing BWT conversion and subsequent compression on the data produced by the GIC based compression pre-processing method in this embodiment. The present invention does not introduce additional storage overhead, and uses only a small computational overhead to rearrange data within large data windows, so as to improve compression efficiency. The method is suitable for compression pre-processing of quality line data in a gene sequencing result document (FASTQ), and the bigger the data block, the more significant the advantage. In this embodiment, the input to the compression pre-processing portion is the quality line data obtained by gene sequencing. The volume of such data, which is composed of many quality lines, is high, generally hundreds of MBs every minute. Once the index columns have been determined, the GIC based compression pre-processing method rearranges the quality lines according to the index column information of each quality line to obtain the converted quality line data, which is then subject to the subsequent compression processing. For gene sequencing quality line data, the method thereby improves the local similarity of the data over a large data block range, and in turn the gene sequencing data compression efficiency.
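To illustrate how the pre-processed streams might be handed to a BWT-based compressor, the sketch below uses Python's standard `bz2` module (bzip2 is built on the Burrows-Wheeler transform). Everything specific here is an assumption for illustration: the text does not name a compressor, the newline-joined serialization is hypothetical, and the dict-based grouping is a shortcut that yields the same line order as the index information table. No compression ratio is claimed for this toy input.

```python
import bz2


def gic_preprocess(data, index_no):
    """Return (Index_Data, Grouped_Data) for a block of quality lines."""
    cols = sorted(index_no)
    drop = set(cols)
    groups = {}        # index -> remaining characters, first-seen order
    index_data = []
    for line in data:
        index = "".join(line[c] for c in cols)
        index_data.append(index)
        rest = "".join(ch for c, ch in enumerate(line) if c not in drop)
        groups.setdefault(index, []).append(rest)
    grouped = [rest for rests in groups.values() for rest in rests]
    return index_data, grouped


def compress_streams(index_data, grouped):
    """Compress each stream separately (hypothetical newline framing)."""
    return (bz2.compress("\n".join(index_data).encode("ascii")),
            bz2.compress("\n".join(grouped).encode("ascii")))


data = ["IIIIIHHHGG", "FFFFFFF###", "IIIIIHHGGG", "FFFFFF####"]
index_data, grouped = gic_preprocess(data, [0, 1, 2, 3, 4])
print(grouped)  # ['HHHGG', 'HHGGG', 'FF###', 'F####']

packed_index, packed_grouped = compress_streams(index_data, grouped)
restored = bz2.decompress(packed_grouped).decode("ascii").split("\n")
print(restored == grouped)  # True
```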
[0097] The decompression portion provided by the present invention is required to restore the original data block (Data) based on the index column data (Index_Data), the regrouped data (Grouped_Data) and the index column numbers (Index_No). Since the contents of the index column data (Index_Data) are the index column contents of the original data block (Data), the index information table can easily be obtained from the index column data (Index_Data). The contents of the regrouped data (Grouped_Data) can then be restored to their corresponding lines in the original data block (Data) by means of the index information table and combined with the index column data (Index_Data), whereby the original data block (Data) is restored. As shown in the drawings, the implementation steps of the gene sequencing quality line data decompression and restoration method in this embodiment comprise:
[0098] S1) reading decompressed index column data (Index_Data), regrouped data (Grouped_Data) and index column numbers (Index_No), determining the quality line number of the original data block (Data) and character data of each line based on the regrouped data (Grouped_Data) and index column number information (Index_No), and allocating the space for the storage of the original data block (Data);
[0099] S2) according to the index column numbers (Index_No), respectively assigning each column of data among the index column data (Index_Data) to the corresponding column, the number of which belongs to Index_No, in the original data block (Data);
[0100] S3) establishing the index information table (IIT) according to the index column data (Index_Data);
[0101] S4) sequentially scanning each line of data among the regrouped data (Grouped_Data) according to the index information table (IIT), determining the position of the line in the original data block according to the index information table (IIT) and the index column data (Index_Data), and writing the same into the corresponding quality line of the original data block (Data);
[0102] S5) exporting the original data block (Data).
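The column bookkeeping behind steps S1) and S2), and behind the later recombination of a regrouped line with its index column information, can be sketched as below. The function name `merge_columns` and the string representation are assumptions for illustration; the sketch also assumes that every index column number is smaller than the length of the restored line.

```python
def merge_columns(rest, index, index_no):
    """Rebuild one full quality line.

    rest     -- characters of the line with the index columns deleted
    index    -- index column characters, in ascending column order
    index_no -- index column numbers (Index_No)
    """
    cols = sorted(index_no)
    by_col = dict(zip(cols, index))   # column number -> index character
    rest_iter = iter(rest)
    total = len(rest) + len(cols)     # S1) length of the restored line
    # S2) index characters go to their recorded columns; every other
    # column is filled from the regrouped data, left to right.
    return "".join(by_col[c] if c in by_col else next(rest_iter)
                   for c in range(total))


print(merge_columns("x1", "AA", [0, 1]))  # AAx1
print(merge_columns("ac", "b", [1]))      # abc
```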
[0103] In this embodiment, step S3) includes the following detailed steps:
[0104] S3.1) initializing the value of the entry number (k) of the index information table (IIT) to be 0, and including serial numbers, index column information (Index), and variables (num, start and temp) in the entries of the index information table (IIT) structurally, wherein the variable (num) is the number of quality lines having the corresponding index column information; the variable (start) indicates the initial locations of the quality lines having the index column information after grouping; the variable (temp) is the number of quality lines having the corresponding index column information in data restoration;
[0105] S3.2) initializing the value of the current line number (i) of the index column data (Index_Data) to be 0;
[0106] S3.3) sequentially scanning the index column data (Index_Data), and jumping to execute step S3.6) if reaching the end of the index column data (Index_Data); otherwise, taking out the current index column information (Index_Data[i]) corresponding to the current line in the index column data (Index_Data);
[0107] S3.4) searching all entries in the index information table (IIT), adding 1 to the variable (num) of the entry (j) (IIT[j].num=IIT[j].num+1) if the index column information (Index) of the entry (j) is the same as the current index column information (Index_Data[i]), and jumping to execute step S3.3); otherwise, jumping to execute step S3.5);
[0108] S3.5) establishing a new entry (k) for the index information table (IIT), wherein the index column information (Index) of the entry (k) is equal to the current index column information (Index_Data[i]) (IIT[k].index=Index_Data[i]), and the variable (num) is equal to 1 (IIT[k].num=1); adding 1 to the entry number (k) (k=k+1), and jumping to execute step S3.3);
[0109] S3.6) initializing the current entry (j) of the index information table (IIT) to be 0;
[0110] S3.7) sequentially scanning the index information table (IIT), and setting the corresponding grouping start position of the current index column information; in case of reaching the end of the index information table (IIT), jumping to step S4); otherwise, with respect to the entry (j) in the index information table (IIT): if the serial number (j) of the entry (j) is 0, setting the variables (start and temp) to be 0, and adding 1 to the serial number (j), namely:
[0111] IIT[j].start=0; IIT[j].temp=0; j=j+1; jumping to continue with step S3.7);
[0112] otherwise setting the value of the variable (start) of the entry (j) to be the sum of the variables (start and num) of the previous entry (j-1), setting the variable (temp) of the entry (j) to be 0, and adding 1 to the serial number (j), namely:
[0113] IIT[j].start=IIT[j-1].start+IIT[j-1].num; IIT[j].temp=0; j=j+1; jumping to continue with step S3.7).
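Steps S3.1) to S3.7) rebuild the table from Index_Data alone. A minimal Python sketch follows, assuming Index_Data is a list with one index string per original line and representing each entry as a plain dict; the function name is hypothetical.

```python
def build_iit_from_index_data(index_data):
    iit = []                                        # S3.1) empty table
    for index in index_data:                        # S3.2), S3.3) scan
        for entry in iit:                           # S3.4) search entries
            if entry["index"] == index:
                entry["num"] += 1
                break
        else:                                       # S3.5) new entry
            iit.append({"index": index, "num": 1, "start": 0, "temp": 0})
    for j in range(len(iit)):                       # S3.6), S3.7)
        if j == 0:
            iit[j]["start"] = 0
        else:
            iit[j]["start"] = iit[j - 1]["start"] + iit[j - 1]["num"]
        iit[j]["temp"] = 0
    return iit


index_data = ["AA", "BB", "AA", "CC", "BB", "AA"]
iit = build_iit_from_index_data(index_data)
print([(e["index"], e["num"], e["start"]) for e in iit])
# [('AA', 3, 0), ('BB', 2, 3), ('CC', 1, 5)]
```

Since Index_Data lists the index column information in original line order, the entries appear in the same first-seen order, with the same counts and start positions, as in the table built during compression pre-processing.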
[0114] In this embodiment, step S4) includes the following detailed steps:
[0115] S4.1) initializing the value of the current line number (k) of the regrouped data (Grouped_Data) to be 0;
[0116] S4.2) obtaining the index column information of the regrouped data (Grouped_Data[k]): if reaching the end of regrouped data (Grouped_Data), jumping to execute step S5); otherwise, scanning the index information table (IIT) to find out the entry (j) of the index information table (IIT) to make it conform to that: the value of the line number (k) is more than or equal to the value of the variable (start) of the entry (j), and less than or equal to the sum of the values of the variable (start) of the entry (j) and the variable (num) thereof (IIT[j].start≤k≤IIT[j].start+IIT[j].num), wherein the index column information corresponding to the data (Grouped_Data[k]) of the current line in the regrouped data (Grouped_Data) is the index column information (Index) (IIT[j].index) of the entry (j);
[0117] S4.3) combining the data (Grouped_Data[k]) of the current line in the regrouped data (Grouped_Data) and the index column information (Index) of the entry (j) (IIT[j].index) to generate a complete quality line (Temp_Read);
[0118] S4.4) obtaining an occurrence order (r) in the quality line having the same index column information of the complete quality line (Temp_Read) in the original data block (Data), wherein the value of the occurrence order (r) is a differential value between the current line number (k) and the variable (start) of the entry (j) (namely: r=k-IIT[j].start);
[0119] S4.5) sequentially scanning the index column data (Index_Data) to find the entry (t) whose index column information is the r-th occurrence of the index column information (Index) of the entry (j) in the index information table (IIT) (IIT[j].index), so as to determine the line number (t) of the complete quality line (Temp_Read) in the original data block;
[0120] S4.6) writing the complete quality line (Temp_Read) to the line number (t) of the original data block (Data) (Data[t]=Temp_Read);
[0121] S4.7) adding 1 to the current line number (k) of the regrouped data (Grouped_Data) (k=k+1);
[0122] S4.8) judging whether the current line number (k) is more than the maximum line number of the regrouped data (Grouped_Data), and jumping to execute step S4.2) if failing to exceed the maximum line number of the regrouped data (Grouped_Data); otherwise, jumping to execute step S5).
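The restoration loop of steps S4.1) to S4.8) can be sketched as follows. Several points are assumptions of this sketch rather than statements of the text: lines and occurrence orders are counted from 0, the group test uses the half-open range start <= k < start + num so that a line on a group boundary matches exactly one entry, and the helper names are hypothetical.

```python
def build_iit(index_data):
    """Compact stand-in for step S3): list of index/num/start entries."""
    counts = {}
    for index in index_data:
        counts[index] = counts.get(index, 0) + 1
    table, pos = [], 0
    for index, num in counts.items():   # first-seen order
        table.append({"index": index, "num": num, "start": pos})
        pos += num
    return table


def merge_columns(rest, index, index_no):
    cols = sorted(index_no)
    by_col = dict(zip(cols, index))
    rest_iter = iter(rest)
    return "".join(by_col[c] if c in by_col else next(rest_iter)
                   for c in range(len(rest) + len(cols)))


def restore(index_data, grouped, index_no):
    iit = build_iit(index_data)
    data = [None] * len(grouped)
    for k, rest in enumerate(grouped):              # S4.1), S4.7), S4.8)
        # S4.2) entry j whose group covers position k (half-open range)
        entry = next(e for e in iit
                     if e["start"] <= k < e["start"] + e["num"])
        temp_read = merge_columns(rest, entry["index"], index_no)  # S4.3)
        r = k - entry["start"]                      # S4.4) 0-based order
        seen = -1
        for t, index in enumerate(index_data):      # S4.5) r-th occurrence
            if index == entry["index"]:
                seen += 1
                if seen == r:
                    break
        data[t] = temp_read                         # S4.6)
    return data


index_data = ["AA", "BB", "AA", "CC", "BB", "AA"]
grouped = ["x1", "x2", "x3", "y1", "y2", "z1"]
print(restore(index_data, grouped, [0, 1]))
# ['AAx1', 'BBy1', 'AAx2', 'CCz1', 'BBy2', 'AAx3']
```

Rescanning Index_Data for every line, as in S4.5), makes this sketch quadratic in the number of lines. A single pass over Index_Data that keeps a per-group cursor would give the same result, though the text does not describe that variant.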
[0123] This embodiment further provides a gene sequencing quality line data compression system, including a computer system, wherein the computer equipment is programmed to execute the steps of the aforesaid gene sequencing quality line data compression pre-processing method in this embodiment.
[0124] This embodiment further provides a gene sequencing quality line data compression system, including a computer system, wherein the computer equipment is programmed to execute the steps of the aforesaid gene sequencing quality line data decompression and restoration method in this embodiment.
[0125] The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the embodiments mentioned above. All technical solutions under the ideas of the present invention fall within the protection scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be deemed to fall within the protection scope of the present invention.