G06F16/1752

STORAGE OF A SMALL OBJECT REPRESENTATION IN A DEDUPLICATION SYSTEM
20220342902 · 2022-10-27

Examples may include storage of a small object representation in a deduplication system. Examples may store the small object representation of an object in the deduplication system based on a determination that the object is smaller than a threshold size. In examples, the small object representation may include a direct reference from a top-level data structure to small object metadata in a bottom-level data structure of the small object representation.
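The size-based choice described above can be sketched roughly as follows. This is an illustrative sketch only, not the disclosed implementation: the `Node` class, the `SMALL_OBJECT_THRESHOLD` value, the helper names, and the single intermediate level used for larger objects are all assumptions introduced here; the abstract states only that the small object representation links the top level directly to small object metadata at the bottom level.

```python
from dataclasses import dataclass, field

# Hypothetical threshold in bytes; the abstract does not give a value.
SMALL_OBJECT_THRESHOLD = 4096


@dataclass
class Node:
    level: str                                   # "top", "intermediate", or "bottom"
    children: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def build_representation(data: bytes, threshold: int = SMALL_OBJECT_THRESHOLD) -> Node:
    """Build a toy hierarchical representation of an object."""
    bottom = Node("bottom", metadata={"size": len(data)})
    if len(data) < threshold:
        # Small object: the top level references the bottom level directly.
        return Node("top", children=[bottom])
    # Assumed layout for larger objects: one intermediate level in between.
    middle = Node("intermediate", children=[bottom])
    return Node("top", children=[middle])


def depth(node: Node) -> int:
    """Number of levels from this node down to its deepest descendant."""
    return 1 + max((depth(child) for child in node.children), default=0)


small = build_representation(b"x" * 10)
large = build_representation(b"x" * 5000)
```

In this sketch `small` has two levels and `large` has three, so a lookup of the small object skips the intermediate level.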

System and method for random-access manipulation of compacted data files

A system and method for random-access manipulation of compacted data files, utilizing a reference codebook, a random-access engine, a data deconstruction engine, and a data reconstruction engine. The system may receive a data query pertaining to a data read or data write request, wherein the data file to be read from or written to is a compacted data file. The random-access engine may facilitate data manipulation processes by accessing a reference codebook associated with the compacted data file, a frequency table used to construct the reference codebook, and data query details. A data read request is supported by random-access search capabilities that may enable the locating and decoding of the bits corresponding to the data query details. The random-access engine also facilitates data write processes: it may encode the data to be written, insert the encoded data into the compacted data file, and update the codebook as needed.

System and method for computer data type identification

A system and method for file type identification involving extraction of a file-print of a file, the file-print being a unique or practically-unique representation of statistical characteristics associated with the distribution of bits in the binary contents of the file, similar to a fingerprint. The file-print is then passed to a machine learning algorithm that has been trained to recognize file types from their file-prints. The machine learning algorithm returns a predicted file type and, in some cases, a probability of correctness of the prediction. The file may then be encoded using an encoding algorithm chosen based on the predicted file type.
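A minimal sketch of the idea, under stated assumptions: the file-print is taken here to be a normalized byte-frequency histogram, and the trained machine learning algorithm is replaced by a simple nearest-centroid comparison. Both choices, and the names `file_print` and `predict_type`, are hypothetical stand-ins; the abstract does not specify the statistics or the model used.

```python
import math
from collections import Counter


def file_print(data: bytes) -> list[float]:
    """Normalized byte-frequency histogram (256 bins) of the file contents."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero for empty input
    return [counts.get(byte, 0) / total for byte in range(256)]


def predict_type(fp: list[float], centroids: dict[str, list[float]]) -> str:
    """Return the label whose reference file-print is closest (Euclidean)."""
    def distance(a: list[float], b: list[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(centroids, key=lambda label: distance(fp, centroids[label]))


# Toy reference file-prints standing in for a trained model.
centroids = {
    "text": file_print(b"hello world, plain ascii text"),
    "binary": file_print(bytes(range(256))),
}
predicted = predict_type(file_print(b"some other ascii words"), centroids)
```

The predicted label could then be used to select an encoding algorithm, as the abstract describes.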

DELETING SNAPSHOTS VIA COMPARING FILES AND DELETING COMMON EXTENTS
20220342847 · 2022-10-27

The present disclosure is related to methods, systems, and machine-readable media for deleting snapshots. A deletion process can be performed responsive to receiving a request to delete a snapshot of a virtual computing instance (VCI) in a file system. The deletion process can include performing a first file comparison between the snapshot and a previous snapshot to determine first extents exclusive to the snapshot, performing a second file comparison between the snapshot and a subsequent snapshot to determine second extents exclusive to the snapshot, performing a third file comparison between the first extents and the second extents to determine common extents, wherein the common extents are common to the first extents and the second extents, and deleting the common extents from the file system.
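The three comparisons can be illustrated with plain set operations. This is a sketch assuming each snapshot is modeled as a set of hashable extent identifiers; the function name and the sample identifiers are hypothetical, and a real file system would compare extent maps rather than Python sets.

```python
def extents_to_delete(prev_snap: set, snap: set, next_snap: set) -> set:
    """Return the extents that can be deleted along with `snap`."""
    # First comparison: extents in the snapshot but not in the previous one.
    first = snap - prev_snap
    # Second comparison: extents in the snapshot but not in the subsequent one.
    second = snap - next_snap
    # Third comparison: extents common to both results, i.e. shared with
    # neither neighboring snapshot.
    return first & second


prev_snap = {"e1", "e2", "e3"}
snap = {"e2", "e3", "e4", "e5"}
next_snap = {"e3", "e5", "e6"}
doomed = extents_to_delete(prev_snap, snap, next_snap)
```

Here only `e4` is referenced by neither neighbor, so it is the only extent selected for deletion.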

MANAGING OBJECTS STORED AT A REMOTE STORAGE

An indication to store to a remote storage a new archive of a snapshot of a source storage is received. At least one shared data chunk of the new archive is determined to be already stored in an existing chunk object of the remote storage storing data chunks of a previous archive. One or more evaluation metrics for the existing chunk object are determined based at least in part on a retention period associated with one or more individual chunks stored in the chunk object and a data lock period associated with the entire existing chunk object. It is determined based on the one or more evaluation metrics whether to reference the at least one shared data chunk of the new archive from the existing chunk object or store the at least one shared data chunk in a new chunk object of the remote storage.

MANAGING OBJECTS STORED AT A REMOTE STORAGE

A first archive of a first snapshot of a source storage is caused to be stored to a remote storage. At least a portion of content of the first archive is stored in data chunks stored in a first chunk object of the remote storage and the first archive is associated with a first data policy. A second archive of a second snapshot of the source storage is caused to be stored to the remote storage. At least a portion of content of the second archive is referenced from data chunks stored in the first chunk object and the second archive is associated with a second data policy. Policy compliance of the chunk object, which stores data chunks referenced by the two different archives, is automatically managed based on the first data policy and the second data policy, which differ from each other.

Reference set construction for data deduplication

By way of example, a data storage system may comprise a non-transitory storage device storing data blocks in chunks, and a storage logic coupled to the non-transitory storage device that manages storage of data on the storage device. The storage logic is executable to receive a data stream for storage in a non-transitory storage device, the data stream including one or more data blocks, analyze the data stream to determine a domain, retrieve a pre-configured reference set based on the domain, and deduplicate the one or more data blocks of the data stream using the pre-configured reference set.

Method and device for detecting duplicate content

Provided is a method for detecting duplicate audio content in an electronic device. The method includes receiving, by the electronic device, a plurality of audio content, decoding, by the electronic device, each of the audio content to extract a plurality of byte streams of each of the audio content and audio feature information, generating, by the electronic device, a unique signature for each of the audio content based on the plurality of byte streams of each of the audio content, and storing, by the electronic device, the unique signature of each of the audio content in the electronic device to identify duplicate audio content.

PARTIAL IN-LINE DEDUPLICATION AND PARTIAL POST-PROCESSING DEDUPLICATION OF DATA CHUNKS
20230062644 · 2023-03-02

Data is ingested from a source system. Ingesting the data includes determining corresponding chunk identifiers for a plurality of data chunks corresponding to the ingested data and, for each of the plurality of data chunks, verifying whether the corresponding chunk identifier is included in a data structure tracking identifiers of data chunks that were already stored in a storage of a storage system before the data ingestion started, and storing the data chunk in a storage based on the verification. After the ingesting is completed, deduplication of the ingested data chunks stored in the storage having a same chunk identifier is performed and the data structure is updated based on the deduplication.
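The two phases might be sketched as below. The assumptions are labeled: chunk identifiers are taken to be SHA-256 digests, the tracking data structure is a Python set, and the storage is a list of `(identifier, data)` pairs; the function names are hypothetical and the abstract prescribes none of these choices.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """Assumed chunk identifier: SHA-256 hex digest of the chunk contents."""
    return hashlib.sha256(data).hexdigest()


def ingest(chunks, known_ids: set, storage: list) -> None:
    """In-line phase: skip only chunks already stored before ingestion began."""
    before = frozenset(known_ids)  # identifiers known when ingestion started
    for data in chunks:
        cid = chunk_id(data)
        if cid not in before:
            # Duplicates within the same ingest are still written here.
            storage.append((cid, data))


def post_process(known_ids: set, storage: list) -> None:
    """Post-processing phase: drop repeated identifiers, then update tracking."""
    seen = set()
    kept = []
    for cid, data in storage:
        if cid not in seen:
            seen.add(cid)
            kept.append((cid, data))
    storage[:] = kept
    known_ids.update(seen)


storage = [(chunk_id(b"a"), b"a")]        # chunk stored before ingestion
known_ids = {chunk_id(b"a")}
ingest([b"a", b"b", b"b", b"c"], known_ids, storage)
after_ingest = len(storage)               # old "a" plus "b", "b", "c"
post_process(known_ids, storage)
```

Deferring the check for duplicates among newly ingested chunks keeps the in-line path to a lookup against a fixed set, at the cost of temporarily storing repeats.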

METHOD, COMPUTER-READABLE MEDIUM AND FILE SYSTEM FOR DEDUPLICATION
20230063119 · 2023-03-02

A method for deduplication applicable to a file chunked into a plurality of deduplicated chunks is provided and includes: defining a calculation range in the file according to types of the chunks in the file, where the calculation range includes a plurality of consecutive chunks in the file; generating an evaluation value according to the types of the chunks in the calculation range to determine whether to mark the chunks in the calculation range; and re-chunking and deduplicating the marked chunks in the file. A computer-readable medium and a file system corresponding to the method for deduplication are also provided.