Data compression method and apparatus, and computer readable storage medium
11398832 · 2022-07-26
Assignee
Inventors
CPC classification
H03M7/30
ELECTRICITY
H03M7/70
ELECTRICITY
International classification
Abstract
A data compression method, comprising: obtaining a plurality of values of a parameter and an occurrence probability of each of the plurality of values (S101); comparing the occurrence probabilities with a predetermined threshold, wherein values with an occurrence probability less than the predetermined threshold form a first set of values, and values with an occurrence probability greater than or equal to the predetermined threshold form a second set of values (S102); preprocessing the first set of values (S103); and encoding the second set of values and the preprocessed first set of values (S104). By means of the data compression method, the maximum codeword length can be effectively reduced, thereby reducing the storage space required for the code table.
Claims
1. A data compression method, comprising: obtaining a plurality of values of a parameter and respective occurrence probabilities of the plurality of values; comparing the respective occurrence probabilities with a predetermined threshold, to divide the plurality of values into a first set of values with the respective occurrence probabilities less than the predetermined threshold and a second set of values with the respective occurrence probabilities greater than or equal to the predetermined threshold; preprocessing the first set of values; and encoding the second set of values and the preprocessed first set of values, wherein the preprocessing the first set of values comprises obtaining respective original encoded codewords of the first set of values, and the encoding the second set of values and the preprocessed first set of values comprises: performing variable length encoding on the second set of values, and obtaining respective encoded codewords of the second set of values; and padding with one of 0 and 1 the end of the shortest encoded codeword from the encoded codewords obtained by the variable length encoding to obtain a padded codeword of a value corresponding to the shortest encoded codeword, and padding with the other of 0 and 1 the end of the shortest encoded codeword from the encoded codewords obtained by the variable length encoding to obtain a prefix, and combining the prefix with the obtained respective original encoded codewords of the first set of values to obtain respective combined codewords of the first set of values.
2. The data compression method according to claim 1, further comprising: forming a code table, which at least comprises: each value in the first set of values, and their respective combined codewords; each value in the second set of values, and their respective encoded codewords or padded codewords.
3. The data compression method according to claim 2, further comprising: when decoding an input bitstream according to the code table, if the prefix is obtained from the decoding, taking and outputting preset N bits following the prefix in the input bitstream as an original encoded codeword.
4. The data compression method according to claim 1, wherein the variable length encoding is Shannon encoding, Fano encoding or Huffman encoding.
5. A data compression device, comprising: an obtaining unit, a comparing unit, a preprocessing unit and an encoding unit, wherein the obtaining unit is connected with the comparing unit, the comparing unit is connected with the preprocessing unit and the encoding unit, and the preprocessing unit is connected with the encoding unit, wherein: the obtaining unit is configured to obtain a plurality of values of a parameter and respective occurrence probabilities of the plurality of values; the comparing unit is configured to compare the respective occurrence probabilities obtained by the obtaining unit with a predetermined threshold, and divide the plurality of values obtained from the obtaining unit into a first set of values with the respective occurrence probabilities less than the predetermined threshold, and a second set of values with the respective occurrence probabilities greater than or equal to the predetermined threshold; the preprocessing unit is configured to preprocess the first set of values obtained from the comparing unit; and the encoding unit is configured to encode the second set of values obtained from the comparing unit and the first set of values preprocessed by the preprocessing unit, wherein the preprocessing comprises obtaining respective original encoded codewords of the first set of values, and the encoding performed by the encoding unit comprises: performing variable length encoding on the second set of values, and obtaining respective encoded codewords of the second set of values; and padding with one of 0 and 1 the end of the shortest encoded codeword from the encoded codewords obtained by the variable length encoding to obtain a padded codeword of a value corresponding to the shortest encoded codeword, padding with the other of 0 and 1 the end of the shortest encoded codeword from the encoded codewords obtained by the variable length encoding to obtain a prefix, and combining the prefix with the obtained respective original encoded 
codewords of the first set of values to obtain respective combined codewords of the first set of values.
6. The data compression device according to claim 5, wherein: the encoding unit is further configured to form a code table, the code table at least comprising: each value in the first set of values and their respective combined codewords; and each value in the second set of values and their respective encoded codewords or padded codewords.
7. The data compression device according to claim 5, wherein the variable length encoding is Shannon encoding, Fano encoding or Huffman encoding.
8. A non-transitory computer-readable storage medium, having a computer instruction stored therein, wherein, when executed by a processor, the computer instruction implements the data compression method according to claim 1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) By reading the detailed description of the exemplary embodiments below, those of ordinary skill in the art will understand the advantages and benefits described herein and other advantages and benefits. The drawings are only for the purpose of illustrating exemplary embodiments, and shall not be considered as any limitation to the present disclosure. Moreover, the same reference signs are used to denote the same components throughout the description.
(2)
(3)
(4)
(5)
(6)
(7)
DETAILED DESCRIPTION OF THE EMBODIMENTS
(8) Exemplary embodiments of the disclosure will be described below in more detail in conjunction with the drawings. Although exemplary embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be implemented in various forms, rather than being limited to the embodiments illustrated here. On the contrary, these embodiments are provided for more thoroughly understanding the disclosure and fully conveying the scope of the disclosure to one skilled in the art.
(9) In the disclosure, it should be understood that terms such as “include” or “have” are intended to indicate the existence of a characteristic, a number, a step, an action, a component, a part disclosed in the specification, or any combination thereof, without excluding the possibility of the existence of one or more other characteristics, numbers, steps, actions, components, parts or any combination thereof.
(10) Additionally, it should be noted that, in the case of no conflict, the embodiments in the disclosure and the characteristics in the embodiments may be combined with each other. The disclosure will be illustrated in detail below by referring to the drawings and in conjunction with the embodiments.
(11) In
(12) S101: a plurality of values of a parameter and respective occurrence probabilities of the plurality of values are obtained;
(13) S102: the respective occurrence probabilities are compared with a predetermined threshold, wherein a part from the plurality of values with the respective occurrence probabilities less than the predetermined threshold are taken as a first set of values, and the other from the plurality of values with the respective occurrence probabilities greater than or equal to the predetermined threshold are taken as a second set of values;
(14) S103: the first set of values is preprocessed; and
(15) S104: the second set of values and the preprocessed first set of values are encoded.
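The partitioning at the heart of steps S101 to S104 can be sketched in a few lines of Python. The function name, the example parameter values, and their probabilities below are illustrative assumptions for this sketch and are not taken from the disclosure.

```python
def split_by_threshold(probs, threshold):
    """S102: values whose occurrence probability is below the threshold
    form the first set; the remaining values form the second set."""
    first = {v: p for v, p in probs.items() if p < threshold}
    second = {v: p for v, p in probs.items() if p >= threshold}
    return first, second

# S101: a hypothetical distribution of parameter values.
probs = {"a1": 0.4, "a2": 0.3, "a3": 0.2, "a4": 0.06, "a5": 0.04}
first, second = split_by_threshold(probs, 0.1)
# first  -> {"a4": 0.06, "a5": 0.04}   (low-probability values, S103)
# second -> {"a1": 0.4, "a2": 0.3, "a3": 0.2}   (encoded directly, S104)
```

The first set then undergoes the preprocessing of S103 before both sets are encoded in S104.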
(16) Here, the type of the parameter is not specifically limited; it may be any of various parameters, for example, parameters in an artificial neural network (such as parameters involved in convolution calculation), or parameters in other mathematical models or algorithms. The efficiency achieved by the data compression method according to the embodiments of the disclosure is positively correlated with the differences between the probability distributions of the values of the respective parameters.
(17) In this application, illustration is given by taking parameters in an artificial neural network as an example, but the method is not limited to such parameters. In the embodiments of the disclosure, the parameter values are not processed in a uniform manner as in conventional approaches; instead, the parameter values are grouped, and values in different groups are processed differently, so that the maximum codeword length can be controlled and shortened, thereby reducing the storage space required for the code table.
(18) It may be understood by one skilled in the art that the larger the predetermined threshold is, the greater the number of values in the first set will be, and the more apparent the reduction of the maximum codeword length, but the more the overall encoding efficiency may be affected; conversely, the smaller the predetermined threshold is, the fewer values the first set will contain, and the smaller the reduction of the maximum codeword length, but the less the overall encoding efficiency will be affected.
(19) Additionally, the purpose of preprocessing and then encoding the first set of values is to reduce the encoded codeword length of this set of values.
(20) The embodiments of the disclosure will be further described below in conjunction with
(21) As shown in
(22) In the case that the data compression method according to the embodiments of the disclosure is employed, as shown by (b) and (c) in
(23) The code table formed in the data compression method in this embodiment is as shown in Table 1 below. The code table may at least include: each value in the first set of values, and their respective combined codewords; each value in the second set of values, and their respective encoded codewords or padded codewords.
(24) TABLE 1

    Encoded codewords    Values before being encoded    Codeword lengths
    10                   a1                             2
    01                   a2                             2
    001                  a3                             3
    0001                 a4                             4
    0000                 a5                             4
    11101                a6                             5
    11110                a7                             5
    11111                a8                             5
(25) Alternatively, in the above mode for obtaining the prefix, the end of the shortest encoded codeword 1 in the obtained encoded codewords of the second set of values a1, a2, a3, a4, a5 may be padded with 1 to obtain the padded codeword 11 of the value a1 corresponding to the shortest encoded codeword, and the end of the shortest encoded codeword 1 may be padded with 0 to obtain the prefix 10; the prefix 10 is then combined with the original encoded codewords 101, 110, 111 of the first set of values a6, a7, a8 to obtain 10101, 10110, 10111 as the respective combined codewords of the first set of values a6, a7, a8.
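The padding-and-prefix construction described in this paragraph can be sketched as follows. The helper name is an assumption of this sketch; the input values reproduce the example (shortest second-set codeword 1, original first-set codewords 101, 110, 111).

```python
def build_first_set_codewords(shortest, originals, pad_bit="1"):
    """Pad the shortest second-set codeword with one bit to keep it for
    its own value, and with the other bit to form the escape prefix that
    precedes each original codeword of the first (low-probability) set."""
    other = "0" if pad_bit == "1" else "1"
    padded = shortest + pad_bit   # padded codeword of the shortest-codeword value
    prefix = shortest + other     # escape prefix announcing a first-set value
    combined = {v: prefix + cw for v, cw in originals.items()}
    return padded, prefix, combined

padded, prefix, combined = build_first_set_codewords(
    "1", {"a6": "101", "a7": "110", "a8": "111"})
# padded   -> "11"
# prefix   -> "10"
# combined -> {"a6": "10101", "a7": "10110", "a8": "10111"}
```

Swapping the `pad_bit` argument yields the other mode, in which the prefix and the padded codeword exchange roles.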
(26) When an input bitstream is decoded according to the obtained code table, if the prefix is obtained from the decoding, the preset N bits following the prefix in the input bitstream are taken and output as an original encoded codeword, wherein N is the length of the original encoded codeword. N is preferably a natural number greater than 2, and the larger N is, the more apparent the advantages of the data compression method of the disclosure will be.
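The prefix-escape decoding rule can be sketched as below. The function name is an assumption; the code table and prefix correspond to the alternative mode of the example above (second-set codewords 11, 01, 001, 0001, 0000; prefix 10; N = 3).

```python
def decode(bits, table, prefix, n):
    """Decode a bitstream bit by bit against the code table; when the
    escape prefix is recognized, the next n bits are taken and output
    verbatim as an original encoded codeword of a first-set value."""
    out, buf, i = [], "", 0
    while i < len(bits):
        buf += bits[i]
        i += 1
        if buf == prefix:
            out.append(bits[i:i + n])  # raw original codeword, length n
            i += n
            buf = ""
        elif buf in table:
            out.append(table[buf])
            buf = ""
    return out

table = {"11": "a1", "01": "a2", "001": "a3", "0001": "a4", "0000": "a5"}
decoded = decode("10" + "101" + "11" + "001", table, prefix="10", n=3)
# decoded -> ["101", "a1", "a3"]
```

Greedy bit-by-bit matching is sufficient here because the prefix and the second-set codewords together form a prefix-free set.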
(27) In another example, as shown in
(28) In the case that the data compression method according to the embodiments of the disclosure is employed, as shown by (b) and (c) in
(29) The code table formed in the data compression method in this embodiment is as shown in Table 2. The code table may at least include: each value in the first set of values, and their respective combined codewords; each value in the second set of values, and their respective encoded codewords or padded codewords.
(30) TABLE 2

    Encoded codewords    Values before being encoded    Codeword lengths
    10                   a1                             2
    01                   a2                             2
    001                  a3                             3
    0001                 a4                             4
    00001                a5                             5
    00000                a6                             5
    110110               a7                             6
    110111               a8                             6
    111000               a9                             6
    111001               a10                            6
    111010               a11                            6
    111011               a12                            6
    111100               a13                            6
    111101               a14                            6
    111110               a15                            6
    111111               a16                            6
(31)
(32) As shown in the example of
(33) In the case that the data compression method according to the embodiments of the disclosure is employed, as shown by (b) in
(34) Next, the second set of values a5, a6, a7, a8, and the first set of values a1, a2, a3, a4 with reset occurrence probabilities are encoded together, for example, by variable length encoding, as shown by (b) in
(35) TABLE 3

    Encoded codewords    Values before being encoded    Codeword lengths
    1                    a1                             1
    01                   a2                             2
    001                  a3                             3
    0001                 a4                             4
    000011               a5                             6
    000010               a6                             6
    000001               a7                             6
    000000               a8                             6
(36) Thus, the length of the longest encoded codeword finally obtained in this example is 6, which is less than the length 7 of the longest encoded codeword corresponding to conventional encoding.
(37) Here, the variable length encoding may be Shannon encoding, Fano encoding or Huffman encoding, or another encoding scheme.
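Of the named schemes, Huffman coding is perhaps the most common; a minimal textbook construction is sketched below as an illustration of the variable-length step. The function name and the example probabilities are assumptions of this sketch.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Textbook Huffman construction: repeatedly merge the two least
    probable subtrees, prepending one bit to every codeword inside
    each merged subtree."""
    tie = count()  # tiebreaker so the heap never compares value lists
    heap = [(p, next(tie), [v]) for v, p in probs.items()]
    heapq.heapify(heap)
    codes = {v: "" for v in probs}
    while len(heap) > 1:
        p0, _, vs0 = heapq.heappop(heap)
        p1, _, vs1 = heapq.heappop(heap)
        for v in vs0:
            codes[v] = "0" + codes[v]
        for v in vs1:
            codes[v] = "1" + codes[v]
        heapq.heappush(heap, (p0 + p1, next(tie), vs0 + vs1))
    return codes

codes = huffman_code({"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125})
# Codeword lengths follow the probabilities: 1, 2, 3 and 3 bits.
```

The resulting code is prefix-free, which is what allows the bit-by-bit decoding described earlier.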
(38) A data compression device for implementing the above data compression method will be described below in conjunction with
(39) Wherein, the obtaining unit 501 is configured to obtain a plurality of values of a parameter and respective occurrence probabilities of the plurality of values;
(40) the comparing unit 502 is configured to compare the respective occurrence probabilities obtained by the obtaining unit 501 with a predetermined threshold, and divide the plurality of values obtained from the obtaining unit 501 into a first set of values with the respective occurrence probabilities less than the predetermined threshold and a second set of values with the respective occurrence probabilities greater than or equal to the predetermined threshold;
(41) the preprocessing unit 503 is configured to preprocess the first set of values obtained from the comparing unit 502; and
(42) the encoding unit 504 is configured to encode the second set of values obtained from the comparing unit 502 and the first set of values preprocessed by the preprocessing unit 503.
(43) Here, the predetermined threshold may be determined by one skilled in the art according to the practical application scenario. In general, the larger the predetermined threshold is, the greater the number of values in the first set will be, and the more apparent the reduction of the maximum codeword length, but the more the overall encoding efficiency may be affected; conversely, the smaller the predetermined threshold is, the fewer values the first set will contain, and the smaller the reduction of the maximum codeword length, but the less the overall encoding efficiency will be affected.
(44) Additionally, the purpose of preprocessing and then encoding the first set of values is to reduce the encoded codeword length of this set of values.
(45) In a specific example, the preprocessing may include obtaining respective original encoded codewords of the first set of values.
(46) The encoding by the encoding unit 504 may include:
(47) performing variable length encoding on the second set of values to obtain respective encoded codewords of the second set of values; and padding with one of 0 and 1 the end of the shortest encoded codeword from the encoded codewords obtained by variable length encoding to obtain a padded codeword of the value corresponding to the shortest encoded codeword, and padding with the other of 0 and 1 the end of the shortest encoded codeword from the encoded codewords obtained by the variable length encoding to obtain a prefix, and combining the prefix with the obtained original encoded codeword of each value in the first set of values to obtain a combined codeword of each value in the first set of values.
(48) Additionally, the encoding unit 504 may be further configured to form a code table, wherein the code table at least includes: each value in the first set of values, and their respective combined codewords; and each value in the second set of values, and their respective encoded codewords or padded codewords.
(49) In another specific example, preprocessing the first set of values by the preprocessing unit 503 includes:
(50) setting the respective occurrence probabilities of the values in the first set, so that: the sum of the respective occurrence probabilities of the first set of values is no greater than the minimum occurrence probability in the second set of values, and the first set of values can form a balanced binary tree.
(51) Or, the preprocessing unit 503 may be further configured to:
(52) calculate an average value of the respective occurrence probabilities of the first set of values, and set the respective occurrence probabilities of the first set of values as equal to each other and less than or equal to the average value.
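The averaging variant can be sketched as follows; the function name and example probabilities are assumptions of this sketch. Setting every first-set probability exactly equal to the average satisfies the condition of being equal to each other and no greater than the average.

```python
def reset_probabilities(first_set_probs):
    """Replace each first-set probability with their common average, so
    that a subsequent variable-length encoding assigns the first set
    equal-length codewords (a balanced subtree of the code tree)."""
    avg = sum(first_set_probs.values()) / len(first_set_probs)
    return {v: avg for v in first_set_probs}

reset = reset_probabilities({"a1": 0.03, "a2": 0.02, "a3": 0.01})
# every first-set value now carries the same (average) probability
```

Because equal probabilities receive equal-length codewords under Huffman-style construction, this preprocessing bounds the longest codeword assigned to the first set.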
(53) The encoding unit 504 may be further configured to: perform variable length encoding on the second set of values and the preprocessed first set of values together.
(54) The variable length encoding may be Shannon encoding, Fano encoding or Huffman encoding, or another encoding mode.
(55) According to another embodiment of the disclosure, there further provides a computer-readable storage medium. As shown in
(56) By the above solution, the values with the lowest occurrence probabilities are preprocessed before being encoded, so as to reduce the maximum encoded codeword length. Because the occurrence probabilities of a part of the values of the parameters of an artificial neural network are much less than those of the remaining values, processing this part of the values affects the compression efficiency only slightly, while significantly reducing the complexity and implementation cost of decoding, especially the cost of the storage space.
(57) The flowcharts and block diagrams in the drawings show some implementable architectures, functions and operations of the method, device and computer-readable storage medium according to various embodiments of the disclosure. It should be noted that, the steps represented by each block in the flowchart are not necessarily carried out in the order shown by the labels, and the steps sometimes may be carried out substantially in parallel, and sometimes may be carried out in a reverse order, which is determined by the function concerned. It should be further noted that, each block in the block diagrams and/or the flowcharts and a combination of the blocks in the block diagrams and/or the flowcharts may be implemented by hardware for executing a specified function or operation, or may be implemented by a combination of hardware and a computer instruction.
(58) The units or modules concerned in the embodiments described in the disclosure may be implemented in software or in hardware.
(59) From the above description of the embodiments, one skilled in the art may clearly understand that each embodiment may be implemented with the aid of software plus a necessary general hardware platform, and of course, each embodiment may be implemented via hardware. Based on such an understanding, the essential part of the technical solutions in the embodiments of the disclosure, or in other words, the part that contributes to the prior art, may be embodied in the form of a software product that is stored in a computer-readable storage medium, for example, ROM/RAM, magnetic disc or compact disc, etc., and includes several instructions that can make a computer device (which may be a personal computer, a server or a network device, etc.) implement the method according to each embodiment or some parts of the embodiment.
(60) Finally, it should be noted that, the above embodiments are merely provided for illustrating, rather than limiting, the technical solutions of the disclosure. Although the disclosure has been illustrated in detail referring to the above embodiments, it should be understood by one of ordinary skills in the art that, the technical solutions recorded in each of the above embodiments may still be modified, or a part of the technical features may be equivalently substituted; however, these modifications or substitutions will not make the corresponding technical solutions depart from the spirit and scope of the technical solutions in each embodiment of the disclosure.