H03M7/3091

Scalable binning for big data deduplication
11204707 · 2021-12-21

Fast record deduplication is accomplished by providing, as input, data records having multiple attributes and local similarity functions for individual attributes, each with a local similarity threshold. Bin IDs are then generated based on the local similarity functions and the local similarity thresholds. Each Bin ID is a unique identifier of a respective bin of records, and a bin of records is a set of records that are possibly pairwise similar. Local candidate pairs are identified based on data records that share Bin IDs. The local candidate pairs are aggregated to produce a set of global candidate pairs. The set of global candidate pairs is filtered by deciding whether each pair of data records represents a duplicate.
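The binning flow in this abstract can be sketched in Python as follows. This is a minimal illustration, not the patented method: the record layout, the two binning rules (`name_bins`, `zip_bins`), the choice to aggregate local pairs by intersection, and the final `is_duplicate` check are all assumptions. In particular, the placeholder binning rules are not derived from a similarity threshold as the abstract describes.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: (record id, name, zip code).
RECORDS = [
    (1, "Jon Smith", "02139"),
    (2, "John Smith", "02139"),
    (3, "Jane Doe", "94105"),
    (4, "Jon Smith", "94105"),
]

def name_bins(name):
    # Assumed binning rule for names: first letter plus last token.
    tokens = name.lower().split()
    return {("name", tokens[0][0], tokens[-1])}

def zip_bins(zip_code):
    # Assumed binning rule for zip codes: the first three digits.
    return {("zip", zip_code[:3])}

def local_candidate_pairs(records, attr_index, bin_fn):
    # Group record ids by Bin ID, then pair up records sharing a bin.
    bins = defaultdict(set)
    for rec in records:
        for bin_id in bin_fn(rec[attr_index]):
            bins[bin_id].add(rec[0])
    pairs = set()
    for ids in bins.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def global_candidate_pairs(records):
    # Assumed aggregation: keep pairs that are local candidates
    # on every attribute (set intersection).
    by_name = local_candidate_pairs(records, 1, name_bins)
    by_zip = local_candidate_pairs(records, 2, zip_bins)
    return by_name & by_zip

def is_duplicate(a, b):
    # Placeholder final filter: similar names and identical zip codes.
    name_score = SequenceMatcher(None, a[1], b[1]).ratio()
    return name_score >= 0.8 and a[2] == b[2]

def find_duplicates(records):
    by_id = {rec[0]: rec for rec in records}
    return sorted(
        pair for pair in global_candidate_pairs(records)
        if is_duplicate(by_id[pair[0]], by_id[pair[1]])
    )

print(find_duplicates(RECORDS))
```

Only records that share a bin are ever compared, which is what keeps the number of pairwise checks well below the full quadratic count.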

Additional compression for existing compressed data
11368167 · 2022-06-21

Techniques are provided for implementing additional compression for existing compressed data. Format information stored within a data block is evaluated to determine whether the data block is compressed or uncompressed. In response to the data block being compressed according to a first compression format, the data block is decompressed using the format information. The data block is compressed with one or more other data blocks to create compressed data having a second compression format different than the first compression format.
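A rough Python sketch of this idea follows. The one-byte format tag, the length-prefixed grouping, and the choice of zlib as the first format and LZMA as the second are assumptions made for illustration; the abstract does not name specific formats.

```python
import lzma
import zlib

# Hypothetical one-byte format tags stored at the start of each block.
RAW, ZLIB = b"\x00", b"\x01"

def make_block(payload, compress):
    # Build a block whose leading tag records its format.
    return (ZLIB + zlib.compress(payload)) if compress else (RAW + payload)

def read_block(block):
    # Evaluate the format information, decompressing when needed.
    tag, body = block[:1], block[1:]
    if tag == ZLIB:
        return zlib.decompress(body)
    if tag == RAW:
        return body
    raise ValueError("unknown format tag")

def recompress_together(blocks):
    # Recover each payload, then compress the payloads as one group
    # in a second format (LZMA here). Each payload is length-prefixed
    # so the group can be split apart again.
    payloads = [read_block(b) for b in blocks]
    joined = b"".join(len(p).to_bytes(4, "big") + p for p in payloads)
    return lzma.compress(joined)

def unpack(group):
    # Inverse of recompress_together: return the original payloads.
    data = lzma.decompress(group)
    payloads, pos = [], 0
    while pos < len(data):
        size = int.from_bytes(data[pos:pos + 4], "big")
        payloads.append(data[pos + 4:pos + 4 + size])
        pos += 4 + size
    return payloads

blocks = [make_block(b"log line A\n" * 50, True),
          make_block(b"log line B\n" * 50, False)]
group = recompress_together(blocks)
```

Compressing several blocks together lets the second compressor exploit redundancy across blocks, which per-block compression cannot see.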

CONTENT-ADAPTIVE TILING SOLUTION VIA IMAGE SIMILARITY FOR EFFICIENT IMAGE COMPRESSION
20220189069 · 2022-06-16

Techniques are provided herein for more efficiently storing images that have a common subject, such as product images that share the same product in the image. Each image undergoes an adaptive tiling procedure to split the image into a plurality of tiles, with each tile identifying a region of the image having pixels with the same content. The tiles across multiple images can then be clustered together and those tiles having identical content are removed. Once all duplicate tiles have been removed from the set of all tiles across the images, the tiles are once again clustered based on their encoding scheme and certain encoding parameters. Tiles within each cluster are compressed using the best compression technique for the tiles in each corresponding cluster. By removing duplicative tile content between numerous images of the same subject, the total amount of data that needs to be stored is reduced.
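The duplicate-tile removal step can be illustrated with a simplified Python sketch. It departs from the abstract in two labeled ways: tiles come from a fixed grid rather than an adaptive tiling procedure, and only byte-identical tiles are merged (no clustering or per-cluster codec selection). The images, function names, and tile size are hypothetical.

```python
import hashlib

def split_tiles(image, size):
    # image is a list of pixel rows. A fixed size x size grid is used
    # here as a stand-in for the adaptive tiling in the abstract.
    tiles = []
    for r in range(0, len(image), size):
        for c in range(0, len(image[0]), size):
            tiles.append(tuple(tuple(row[c:c + size])
                               for row in image[r:r + size]))
    return tiles

def dedupe_tiles(images, size):
    store = {}    # digest -> tile, each distinct tile kept once
    layouts = []  # per image: ordered digests needed to rebuild it
    for image in images:
        layout = []
        for tile in split_tiles(image, size):
            digest = hashlib.sha256(repr(tile).encode()).hexdigest()
            store.setdefault(digest, tile)
            layout.append(digest)
        layouts.append(layout)
    return store, layouts

# Two hypothetical 4x4 images sharing the same background.
IMAGE_A = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
IMAGE_B = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]]

store, layouts = dedupe_tiles([IMAGE_A, IMAGE_B], 2)
print(len(store), "unique tiles for", sum(len(l) for l in layouts), "tile slots")
```

Each image is then represented by its layout of tile digests, so the shared background tile is stored once no matter how many images use it.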

Lossless reduction of data by using a prime data sieve and performing multidimensional search and content-associative retrieval on data that has been losslessly reduced using a prime data sieve
11363296 · 2022-06-14

Input data can be losslessly reduced by using a data structure that organizes prime data elements based on their contents. Alternatively, the data structure can organize prime data elements based on the contents of a name that is derived from the prime data elements. Specifically, video data can be losslessly reduced by (1) using the data structure to identify a set of prime data elements, and (2) using the set of prime data elements to losslessly reduce intra-frames. The input data can be dynamically partitioned based on the memory usage of components of the data structure. Parcels can be created based on the partitions to facilitate archiving and movement of the data. The losslessly reduced data can be stored using a set of distilled files and a set of prime data element files.

System and method for hash-based entropy calculation

A method, computer program product, and computing system for receiving a candidate data portion; calculating a distance-preserving hash for the candidate data portion; and performing an entropy analysis on the distance-preserving hash to generate a hash entropy for the candidate data portion.
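As a loose illustration, the two steps might look like the Python sketch below. The abstract does not say which distance-preserving hash is used; the 16-bucket byte histogram here is a stand-in chosen only because small edits to the input change it only slightly. The function names and bucket count are assumptions.

```python
import math

BUCKETS = 16  # assumed hash width: one counter per high nibble

def histogram_hash(data):
    # Stand-in for a distance-preserving hash: count bytes by their
    # high nibble. Changing k bytes of the input changes the counters
    # by at most 2k in total, so similar inputs get nearby hashes.
    counts = [0] * BUCKETS
    for byte in data:
        counts[byte >> 4] += 1
    return counts

def hash_entropy(counts):
    # Shannon entropy (in bits) of the bucket distribution.
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

flat = hash_entropy(histogram_hash(b"\x00" * 100))
spread = hash_entropy(histogram_hash(bytes(range(256))))
print(flat, spread)
```

Running the entropy analysis on the compact hash rather than on the full data portion keeps the cost independent of the size of the data once the hash exists.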

CLIENT-SIDE COMPRESSION
20220171555 · 2022-06-02

A method of sending blocks of data from a client to be stored at a storage server, wherein for each block compression and encryption are performed at the client, and deduplication is performed at the server. Security is thus enhanced because the block is compressed and encrypted when it is sent over an unsecured network and when it is stored in what may be a third-party backup system. Provisions are made to enable the addition of new compression algorithms and the retirement of old compression algorithms, while ensuring that a client does not receive a block that was compressed using an unsupported, e.g., retired, compression algorithm. In some examples, a compression algorithm ID is tied to an encryption key version to enable refresh of blocks compressed with an old algorithm.

Data transmission method and device
11349962 · 2022-05-31

Provided are a data transmission method and device. The method includes: processing a first data packet to be sent by using a compression strategy obtained in advance from a receiving end, the processing deleting from the first data packet the duplicated data specified in the compression strategy; generating a second data packet to be sent from the processed first data packet, where the second data packet includes a modification record field indicating the deleted duplicated data; and sending the second data packet to the receiving end.
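One way to picture the sender and receiver sides is the Python sketch below. The form of the compression strategy (a list of byte strings the receiver already holds), the dictionary used as the second packet, and the `(offset, index)` layout of the modification record are assumptions for illustration only.

```python
def compress_packet(packet, strategy):
    # strategy: byte strings the receiver already holds (assumed form).
    # Every occurrence is deleted from the packet and noted in the
    # modification record as (offset in original packet, strategy index).
    body = bytearray()
    record = []
    pos = 0
    while pos < len(packet):
        for idx, pattern in enumerate(strategy):
            if pattern and packet.startswith(pattern, pos):
                record.append((pos, idx))
                pos += len(pattern)
                break
        else:
            body.append(packet[pos])
            pos += 1
    return {"record": record, "body": bytes(body)}

def restore_packet(second, strategy):
    # Receiver side: re-insert the deleted data at the recorded offsets.
    out = bytearray()
    body = second["body"]
    taken = 0
    for offset, idx in second["record"]:
        need = offset - len(out)
        out += body[taken:taken + need]
        taken += need
        out += strategy[idx]
    out += body[taken:]
    return bytes(out)

STRATEGY = [b"HDR-v1|"]  # hypothetical duplicated data
first = b"HDR-v1|temp=21|HDR-v1|temp=22"
second = compress_packet(first, STRATEGY)
print(second)
```

Because the strategy comes from the receiving end in advance, the record only needs to carry positions and indexes, not the deleted bytes themselves.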

Systems and methods for version chain clustering

A system, a method and a computer program product for storing data, which include receiving a data stream having a plurality of transactions that include at least one portion of data, determining whether at least one portion of data within at least one transaction is substantially similar to at least another portion of data within at least one transaction, clustering together at least one portion of data and at least another portion of data within at least one transaction, selecting one of at least one portion of data and at least another portion of data as a representative of at least one portion of data and at least another portion of data in the received data stream, and storing each representative of a portion of data from each transaction in the plurality of transactions, wherein a plurality of representatives is configured to form a chain representing the received data stream.

METHOD AND APPARATUS FOR COMPRESSING DATA OF STORAGE SYSTEM, DEVICE, AND READABLE STORAGE MEDIUM

In a method of storing a data block, a storage device has stored a plurality of data block groups, each data block group having a common part that is contained in another data block in that group. For a target data block to be stored, the storage device selects from the data block groups a target data block group that has a data block whose common part is identical to a part of the target data block. The storage device then saves the target data block by storing a target reference block of the target data block group and differential data between the target data block and the target reference block.
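The group selection and differential storage described here can be approximated in Python as follows. The group table, the substring test used to pick a group, and the copy/data delta format built with `difflib.SequenceMatcher` are illustrative assumptions, not details taken from the abstract.

```python
from difflib import SequenceMatcher

# Hypothetical groups: a common part plus one reference block each.
GROUPS = {
    "g1": {"common": b"<html><body>",
           "reference": b"<html><body>hello</body></html>"},
    "g2": {"common": b"%PDF-1.7",
           "reference": b"%PDF-1.7 sample"},
}

def make_delta(reference, target):
    # Differential data: copy ranges from the reference, plus literal
    # bytes for whatever the reference does not contain.
    ops = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", target[j1:j2]))
        # "delete" needs no operation: those bytes are simply skipped.
    return ops

def apply_delta(reference, ops):
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

def store_block(target):
    # Select the first group whose common part appears in the target.
    for group_id, group in GROUPS.items():
        if group["common"] in target:
            return group_id, make_delta(group["reference"], target)
    return None, [("data", target)]  # no group matched: store literally

target = b"<html><body>goodbye</body></html>"
group_id, delta = store_block(target)
print(group_id, delta)
```

Reading the block back only requires the group's reference block and the stored delta, via `apply_delta(GROUPS[group_id]["reference"], delta)`.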

DATA STORAGE ARRANGEMENT AND METHOD FOR ANONYMIZATION AWARE DIFFERENTIAL COMPRESSION
20230259655 · 2023-08-17

An example data storage device includes a memory and a controller. The controller is configured to store at least one of the one or more data elements utilizing differential compression. The controller is further configured to receive a data element to be stored, generate a copy of the data element to be stored, and mask data to be anonymized by deleting one or more portions to be anonymized. The controller is further configured to generate similarity hashes for one or more portions of the copy of the data element with masked data for finding one or more reference portions, and compress the data element to be stored utilizing differential compression with reference to the one or more reference portions.
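A small Python sketch of this flow is given below, under several assumptions: e-mail addresses stand in for the data to be anonymized, a bottom-k set of CRC32 values over byte shingles stands in for the similarity hashes, and zlib with a preset dictionary (`zdict`) stands in for differential compression. None of these choices comes from the abstract.

```python
import re
import zlib

# Hypothetical anonymization rule: e-mail addresses are sensitive.
EMAIL = re.compile(rb"[\w.]+@[\w.]+")

def mask(data):
    # Work on a copy with the portions to be anonymized deleted.
    return EMAIL.sub(b"", data)

def similarity_sketch(data, k=8, keep=16):
    # Assumed similarity hash: the `keep` smallest CRC32 values
    # taken over all k-byte shingles of the masked copy.
    masked = mask(data)
    shingles = {masked[i:i + k] for i in range(max(1, len(masked) - k + 1))}
    return set(sorted(zlib.crc32(s) for s in shingles)[:keep])

def similarity(a, b):
    sa, sb = similarity_sketch(a), similarity_sketch(b)
    return len(sa & sb) / len(sa | sb)

def compress_against(element, references):
    # Pick the most similar reference, then compress the unmasked
    # element with that reference as a preset dictionary.
    ref = max(references, key=lambda r: similarity(element, r))
    comp = zlib.compressobj(zdict=ref)
    return ref, comp.compress(element) + comp.flush()

def decompress_against(blob, ref):
    decomp = zlib.decompressobj(zdict=ref)
    return decomp.decompress(blob) + decomp.flush()

REFERENCES = [
    b"Order 1001 shipped to alice@example.com via ground freight.",
    b"Quarterly revenue figures are attached for internal review only.",
]
element = b"Order 1001 shipped to bob@example.com via ground freight."
ref, blob = compress_against(element, REFERENCES)
```

Masking before hashing means two elements that differ only in their sensitive fields still map to the same reference, while the compressed output still encodes the full, unmasked element.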