Block Storage Device and Method for Data Compression

20230153005 · 2023-05-18

    Inventors

    Cpc classification

    International classification

    Abstract

    A block storage device, for data compression is configured to, in a first operating phase, if it is determined, by the block storage device, that a data block is to be written to a large block storage area of the block storage device, determine if the data block can be de-duplicated. If the data block cannot be de-duplicated, the block storage device stores the data block using large block compression.

    Claims

    1. A block storage device for data compression and comprising: an interface; and a processor coupled to the interface and configured to: determine whether a data block can be de-duplicated in response to determining that the data block is to be written to a large block storage area of the block storage device, and store the data block using large block compression when the data block cannot be de-duplicated.

    2. The block storage device of claim 1, wherein the large block storage area comprises data blocks that have an average size larger than a predefined threshold and that are written to or read from the large block storage area.

    3. The block storage device of claim 2, wherein the processor is further configured to determine the average size based on a read statistic or a write statistic.

    4. The block storage device of claim 1, wherein an operation of determining whether the data block can be de-duplicated is an in-line phase.

    5. The block storage device of claim 1, wherein the processor is further configured to de-duplicate and compress the data block to obtain a resulting data block.

    6. The block storage device of claim 5, wherein the processor is further configured to compare a first size resulting from de-duplication and compression of the data block with a second size resulting from large block compression of the data block.

    7. The block storage device of claim 6, wherein the processor is further configured to store the data block using the large block compression in response to the first size being larger than or equal to the second size.

    8. The block storage device of claim 7, wherein the processor is further configured to: de-duplicate a first part of sub blocks of the data block and compress a second part of the sub blocks to obtain the resulting data block; and store the resulting data block to de-duplicate and compress the data block.

    9. The block storage device of claim 8, wherein the processor is further configured to write, for each of the sub blocks, a similarity hash to a similarity hash table.

    10. The block storage device of claim 1, wherein the processor is further configured to determine, based on a similarity hash, whether the data block stored in the block storage device can be further reduced in size using similarity de-duplication.

    11. The block storage device of claim 10, wherein the processor is further configured to: determine a first space required for storing the data block using the similarity de-duplication; and, determine a second space required for storing the data block in its present format.

    12. The block storage device of claim 11, wherein the processor is further configured to store the data block in the large block storage area using the similarity de-duplication when the first space is less than the second space.

    13. The block storage device of claim 10, wherein an operation of determining whether the data block stored in the block storage device can be further reduced in size using the similarity de-duplication is an offline phase.

    14. A method implemented by a block storage device, wherein the method comprises: determining whether a data block can be de-duplicated in response to determining that the data block is to be written to a large block storage area of the block storage device; and storing the data block using large block compression when the data block cannot be de-duplicated.

    15. The method of claim 14, wherein the large block storage area comprises data blocks having an average size larger than a predefined threshold and are written to or read from the large block storage area.

    16. The method of claim 15, further comprising determining the average size based on a read statistic and a write statistic.

    17. The method of claim 14, wherein an operation of determining whether the data block can be de-duplicated is an in-line phase.

    18. The method of claim 14, further comprising de-duplicating and compressing the data block to obtain a resulting data block.

    19. The method of claim 18, further comprising comparing a first size resulting from de-duplication and compression of the data block with a second size resulting from the large block compression of the data block.

    20. The method of claim 19, further comprising storing the data block using the large block compression in response to the first size is larger than or equal to the second size.

    Description

    BRIEF DESCRIPTION OF DRAWINGS

    [0072] The above-described aspects and implementation forms of the present disclosure will be explained in the following description of embodiments in relation to the enclosed drawings, in which:

    [0073] FIG. 1 shows a schematic view of a device according to an embodiment of the present disclosure;

    [0074] FIG. 2 shows a schematic view of a device according to an embodiment of the present disclosure in more detail;

    [0075] FIG. 3 shows a schematic view of an operating manner according to the present disclosure;

    [0076] FIG. 4 shows a schematic view of an operating manner according to the present disclosure;

    [0077] FIG. 5 shows another schematic view of a method according to an embodiment of the present disclosure.

    DETAILED DESCRIPTION OF EMBODIMENTS

    [0078] FIG. 1 shows a schematic view of a block storage device 100 according to an embodiment of the present disclosure. The block storage device 100 is for data compression and thus is configured to, if it is determined, by the block storage device 100, that a data block 101 is to be written to a large block storage area 102 of the block storage device 100, determine if the data block 101 can be de-duplicated.

    [0079] Further, the block storage device 100 is configured to, if the data block 101 cannot be de-duplicated, store the data block 101 using large block compression. The data block 101 is in particular stored in the large block storage area 102.

    [0080] All these steps are performed during a first operating phase. The first operating phase in particular can be an in-line phase.

    [0081] Optionally, the large block storage area 102 can be an area where data blocks 101 having an average size larger than a predefined threshold are written to, and/or are read from. The large block storage area in particular may be part of a physical disc, a physical volume, a virtual disc, or a virtual volume. The average size of the to be written or to be read data block can optionally be based on a read statistic and/or a write statistic.

    [0082] FIG. 2 shows a schematic view of a block storage device 100 according to an embodiment of the present disclosure in more detail. The device 100 shown in FIG. 2 comprises all features and functionality of the device 100 of FIG. 1, as well as the following optional features.

    [0083] As it is illustrated in FIG. 2, the block storage device 100 optionally can be configured to, if the data block 101 can be de-duplicated, de-duplicate and compress the data block 101 and store the resulting data block 201. That is, instead of large block compression, de-duplication and compression of the data block 101 is performed. This kind of compression is different from large block compression in particular in that it is applied to data blocks which are smaller than the large blocks.

    [0084] Further optionally, if the data block 101 can be de-duplicated, the block storage device 100 can compare a size resulting from de-duplication and compression of the data block 101 with a size resulting from large block compression of the data block 101. This evaluation assists in determining the more effective way of data reduction. If the size resulting from de-duplication and compression is larger than or equal to the size resulting from large block compression, the block storage device 100 can store the data block 101 using large block compression.

    [0085] As it is further illustrated in FIG. 2, the block storage device 100 can de-duplicate a first part of sub blocks 202 of the data block 101. The block storage device 100 also can compress a second part of sub blocks 203 of the data block 101. The data block 201 which results from de-duplication and compression can then be stored by the block storage device 100. This can be done in the large block storage area 102, or in a conventional storage area of the block storage device 100. For example, if the data block 101 has a size of 1 MB, the first part of sub blocks 202 may comprise two sub blocks with a size of 256 KB each, and the second part of sub block 203 may comprise four sub blocks with a size of 128 KB each. However, any kind of distribution of sub block which follows this principle is possible.

    [0086] In an embodiment, two levels of block sizes are used. In this embodiment, the data block 101 has a size of 512 KB, and a sub block has a size of 8 KB. Thus, a 512 KB data block 101 can either be compressed as a single block or as 64 sub blocks of 8 KB size, wherein each 8 KB sub block can either de-duplicated or compressed alone.

    [0087] As it is further illustrated in FIG. 2, to prepare for similarity deduplication (which can be performed in a second operating phase of the block storage device 100), for each sub block 202, 203 of the data block 101, the block storage device 100 can write a similarity hash 204 to a similarity hash table 205. This step is in particular performed during the first operating phase of the block storage device 100.

    [0088] Further optionally, in a second operating phase, the block storage device 100 can determine, based on a similarity hash 204, if a data block 101 stored in the first operating phase can be further reduced in size using similarity de-duplication. This similarity hash 204 can be the similarity hash stored in the similarity hash table 205 during the first operating phase.

    [0089] The block storage device 100 optionally can determine space which is required for storing the data block 101 using similarity de-duplication. The block storage device 100 optionally can also determine space which is required for storing the data block 101 in its present format. This enables to only store the data block 101 in the data storage 102 using similarity de-duplication, if the space required for storing the data block 101 using similarity de-duplication is less than the space required for storing the data block 101 in its present format.

    [0090] In particular, the second operating phase can be an offline phase.

    [0091] FIG. 3 describes an operating scenario which is performed during the first operating phase in more detail.

    [0092] In step 301, the block storage device 100 checks, if a data block 101 can be de-duplicated (based on a hash fingerprint, it can be determined that an identical data block is already written to a disk), and if so, performs deduplication (see step 302). In any case, in step 301, the block storage device 100 generates a min-hash (i.e. the similarity hash 204) which is to be added to an opportunity table (i.e. the similarity hash table 205) to enable the possibility of similarity de-duplication later in a second operating phase.

    [0093] If the data block 101 cannot be de-duplicated, the block storage device 100 checks (in step 303) a prediction table to determine, if an available range of physical addresses is usually used for writing and reading small blocks of data or large blocks of data (in other words, it is determined if the data block 101 is to be written to a large block storage area 102). If the current range is usually used for writing and reading small blocks of data, the system proceeds as usual (i.e. it stores small blocks of data), as it is illustrated in step 304.

    [0094] If it is identified that the data block 101 is to be written to an area where most reads and write are done in large blocks, then the block storage device 100 analyses (in step 305), if large-block compression would yield to better compression than small-block compression. If large block compression would not lead to better compression, the block storage device proceeds as usual (i.e. it stores the compressed data block 101), see step 306. If large block compression would lead to better compression, then the block storage device 100 stores the compressed data block 101 set using large block compression (see step 307). This can be implemented by one of the following methods. Large block compression can be used to tag large block storage areas 102 as containing identical data. The block storage device 100 can write a same grain (physical) address at a same offset in each relevant large block storage area 102, or the block storage device 100 can set a bit indicating the start and end of a range of identical large block storage areas 102.

    [0095] In addition, if the data block 101 can be de-duplicated, the block storage device 100 may also determine if large block compression of the data block 101 leads to a better result, or if a combination of de-duplicating and compressing the data block 101 leads to a better result, and choose the better option.

    [0096] Thus, at the end of the first operating phase, the block storage device 100 generated and saved a min-hash and will have either have de-duplicated the data block 101 and/or have stored the data block 101 compressed in small blocks, or will have stored the data block 101 using large block compression.

    [0097] FIG. 4 describes an operating scenario which is performed during the first operating phase in more detail.

    [0098] As it was explained above, the inline processing (i.e. the first operating phase) prioritized reducing CPU usage over minimizing disk usage. Then, during background processing (i.e. the second operating phase), when CPU usage is less critical, the block storage device 100 attempts to further optimize disk usage as described in the following. Step 401 of FIG. 4 is in particular performed after step 307 of FIG. 3.

    [0099] As shown in step 402, in the second operating phase, for data blocks 401 compressed in large blocks of data, the block storage device 100 checks an opportunity table (i.e. the similarity hash table 205) to see, if there is an option for similarity de-duplication (i.e. if there is another data block 101, with the same min-hash).

    [0100] If similarity de-duplication is not possible, the block storage device 100 does nothing, and the data block 401 remains large-block compressed, see step 403.

    [0101] If similarity de-duplication is possible, then the block storage device 100 analyzes if similarity de-duplication would lead to a better result than large-block compression, see step 404.

    [0102] If similarity de-duplication would lead to lower data reduction than large-block compression, the system does nothing and the data block 101 remains large-block compressed, see step 405.

    [0103] If similarity de-duplication would lead to better data reduction than large-block compression, then the system performs similarity de-duplication and the data block 101 is stored in 8000 (8 k) data blocks with similarity de-duplication pointers where relevant.

    [0104] FIG. 5 shows a schematic view of a method 500 according to an embodiment of the present disclosure. The method 500 is for data compression. In a first operating phase, the method comprises a step of, if it is determined, by a block storage device 100, that a data block 101 is to be written to a large block storage area 102 of the block storage device 100, determining 501, by the block storage device 100, if the data block 101 can be de-duplicated. The method comprises (still in the first operating phase) a further step of, if the data block 101 cannot be de-duplicated, storing 502, by the block storage device 100, the data block 100 using large block compression.

    [0105] The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed disclosure, from the studies of the drawings, this disclosure, and the independent claims. In the claims as well as in the description, the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in the mutual different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.