G06F16/1748

Managing deduplication characteristics in a storage system

A method is used in managing deduplication characteristics in a storage system. Deduplication entries stored in a deduplication cache are categorized into a set of deduplication groups based on a data deduplication probability associated with the deduplication entries. A machine learning system is used to dynamically adjust deduplication characteristics associated with the set of deduplication groups based on an I/O workload associated with the storage system.
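
The grouping step described above can be illustrated with a minimal sketch. It assumes each cache entry already carries an estimated deduplication probability and that groups are defined by probability thresholds; the function name `categorize_entries`, the threshold values, and the sample digests are hypothetical, and the machine learning component that would tune the thresholds against the I/O workload is not shown.

```python
from bisect import bisect_right
from collections import defaultdict


def categorize_entries(entries, thresholds=(0.25, 0.75)):
    """Split {digest: probability} into len(thresholds) + 1 groups.

    Group 0 holds the entries least likely to deduplicate. The
    thresholds are assumed values that an adaptive component could
    adjust as the workload changes.
    """
    groups = defaultdict(list)
    for digest, probability in entries.items():
        # bisect_right returns the index of the first threshold that is
        # strictly greater than the probability, i.e. the group number.
        groups[bisect_right(thresholds, probability)].append(digest)
    return dict(groups)


# Hypothetical cache entries: digest -> estimated deduplication probability.
entries = {"a1": 0.05, "b2": 0.50, "c3": 0.90, "d4": 0.75}
groups = categorize_entries(entries)
```

Keeping the thresholds as a plain parameter means an external tuner only has to supply a new tuple to regroup the cache.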

Data deduplication in data platforms

One embodiment of the invention provides a method for data deduplication storage management in a data platform including a plurality of data stores. The method comprises, for each data store of the plurality of data stores, determining a corresponding multi-level signature mapping data content of the data store into an ordered logical form comprising a plurality of data abstraction levels, determining a data similarity between the data store and each other data store of the plurality of data stores based on the multi-level signature corresponding to the data store and another multi-level signature corresponding to the other data store, and determining data usage of the data content of the data store. The method further comprises improving storage in the data platform by detecting duplicate data across the plurality of data stores based on each data similarity determined and each data usage determined.
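
One possible reading of a multi-level signature is sketched below. It assumes a data store can be modeled as tables containing columns of values, with three abstraction levels (table names, column names, hashed column contents), and it uses the mean Jaccard similarity across levels as the similarity measure. The model, the level choice, and the names `multi_level_signature`, `jaccard`, and `similarity` are all illustrative assumptions, not the method claimed in the abstract.

```python
import hashlib


def multi_level_signature(store):
    """store: {table: {column: [values]}} -> list of sets, coarse to fine."""
    tables = set(store)
    columns = {f"{t}.{c}" for t, cols in store.items() for c in cols}
    contents = {
        # Hash the sorted values so that row order does not matter.
        hashlib.sha256(repr(sorted(map(str, values))).encode()).hexdigest()
        for cols in store.values()
        for values in cols.values()
    }
    return [tables, columns, contents]


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def similarity(sig_a, sig_b):
    """Mean Jaccard similarity across the signature levels."""
    return sum(jaccard(x, y) for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical stores that share a schema and one identical column.
store_a = {"users": {"id": [1, 2], "name": ["ann", "bob"]}}
store_b = {"users": {"id": [2, 1], "name": ["ann", "cat"]}}
score = similarity(multi_level_signature(store_a), multi_level_signature(store_b))
```

Comparing coarse levels first would let a real system skip the expensive content level for stores whose schemas do not overlap at all.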

Method and system for large scale data curation

Described are an end-to-end data curation system and the various methods used in linking, matching, and cleaning large-scale data sources. The goal of this system is to provide scalable and efficient record deduplication. A crowd of experts is used to train the system. The system operator can optionally provide a set of hints to reduce the number of questions sent to the experts. The system solves the problems of schema mapping and record deduplication in a holistic way by unifying them into a single linkage problem.

System and method for file system metadata file region segmentation for deduplication

A method for managing file based backups (FBBs) includes: obtaining, by a backup agent, a backup request for an FBB; in response to the backup request, generating an FBB; generating an FBB metadata file corresponding to the FBB, wherein the FBB metadata file comprises a set of attribute regions; performing, using the set of attribute regions, a deduplication on the FBB metadata file to obtain a deduplicated FBB metadata file; and storing the deduplicated FBB metadata file in a backup storage system.
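
Region-level deduplication of a metadata file can be sketched as follows. The sketch assumes the metadata file is available as a mapping from attribute name to the raw bytes of that attribute region, and that regions are deduplicated by content hash against a shared store; the layout, the `region_store` dictionary, and the function name are hypothetical.

```python
import hashlib


def deduplicate_metadata(regions, region_store):
    """regions: {attribute_name: bytes} -> {attribute_name: digest}.

    Each unique region is written to region_store once; the returned
    mapping stands in for the deduplicated metadata file.
    """
    deduplicated = {}
    for name, blob in regions.items():
        digest = hashlib.sha256(blob).hexdigest()
        region_store.setdefault(digest, blob)  # keep the first copy only
        deduplicated[name] = digest
    return deduplicated


# Two hypothetical backups whose permission regions are identical.
region_store = {}
first = deduplicate_metadata(
    {"names": b"a.txt\nb.txt", "permissions": b"0644\n0644"}, region_store
)
second = deduplicate_metadata(
    {"names": b"a.txt\nc.txt", "permissions": b"0644\n0644"}, region_store
)
```

Grouping like attributes into their own regions is what makes this effective: attributes that rarely change between backups hash to the same region even when file names or sizes differ.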

Instant replay of a file to a cloud tier on a deduplication file system

Embodiments of an instant recall process and system for long-term data stored on a cloud storage tier. Embodiments include saving a content handle of a file in a cloud storage tier as an extended attribute in a single file system namespace; moving the file from the cloud storage tier to an active storage tier for data processing; recalling the file from the active storage tier to the cloud storage tier upon completion of the data processing; using the content handle from hidden metadata for a working copy of the file; and saving a hash of a segment reference as part of the extended attribute.

System and method for network policy simulation

This disclosure generally relates to a method and system for network policy simulation in a distributed computing system. The present technology relates to techniques that enable simulation of a new network policy with regard to its effects on the network data flow. By enabling a simulation data flow that is parallel to and independent from the regular data flow, the present technology can provide optimized network security management with improved efficiency.

Push-based piggyback system for source-driven logical replication in a storage environment

The disclosed techniques enable push-based piggybacking of a source-driven logical replication system. Logical replication of a data set (e.g., a snapshot) from a source node to a destination node can be achieved from a source-driven system while preserving the effects of storage efficiency operations (deduplication) applied at the source node. However, if missing data extents are detected at the destination, the destination has an extent pulling problem as the destination may not have knowledge of the physical layout on the source-side and/or mechanisms for requesting extents. The techniques overcome the extent pulling problem in a source-driven replication system by introducing specific protocols for obtaining missing extents within an existing replication environment by piggybacking data pushes from the source.

Dynamic management of expandable cache storage for multiple network shares configured in a file server

Expandable cache management dynamically manages cache storage for multiple network shares configured in a file server. Once a file is written to a directory or folder on a specially designated network share, such as one that is configured for “infinite backup,” an intermediary pre-backup copy of the file is created in an expandable cache in the file server that hosts the network share. On write operations, cache storage space can be dynamically expanded or freed up by pruning previously backed up data. This advantageously creates flexible storage caches in the file server for each network share, each cache managed independently of other like caches for other network shares on the same file server. On read operations, intermediary file storage in the expandable cache gives client computing devices speedy access to data targeted for backup, which is generally quicker than restoring files from backed up secondary copies.
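
The write-path behavior described above (prune data that has already been backed up, then expand if space is still short) can be sketched with one cache object per network share. The class name `ShareCache`, the size units, the oldest-first pruning order, and the `max_capacity` limit are assumptions made for illustration only.

```python
from collections import OrderedDict


class ShareCache:
    """Hypothetical expandable cache for a single network share."""

    def __init__(self, capacity, max_capacity):
        self.capacity = capacity
        self.max_capacity = max_capacity
        self.files = OrderedDict()  # name -> [size, backed_up]

    def used(self):
        return sum(size for size, _ in self.files.values())

    def mark_backed_up(self, name):
        self.files[name][1] = True

    def write(self, name, size):
        self.files.pop(name, None)  # an overwrite replaces the old copy
        # First free space by pruning backed-up files, oldest first.
        for old in [n for n, (_, done) in self.files.items() if done]:
            if self.used() + size <= self.capacity:
                break
            del self.files[old]
        # If pruning was not enough, expand the cache up to its limit.
        needed = self.used() + size
        if needed > self.capacity:
            if needed > self.max_capacity:
                raise OSError("cache cannot be expanded far enough")
            self.capacity = needed
        self.files[name] = [size, False]


cache = ShareCache(capacity=10, max_capacity=20)
cache.write("a", 6)
cache.mark_backed_up("a")
cache.write("b", 6)  # "a" is pruned because it is already backed up
cache.write("c", 8)  # nothing left to prune, so the cache expands
```

Because each share owns its own `ShareCache` instance, pruning or expanding one cache never touches the data cached for another share on the same file server.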

Virtual machine backup and restoration

Reversing deletion of a virtual machine including managing, by a storage system, a repository of virtual machine snapshots on a datastore; receiving, by the storage system, a request to recover a deleted virtual machine from the datastore; accessing, by the storage system, the repository of virtual machine snapshots on the datastore to generate a list of deleted virtual machines associated with virtual machine snapshots in the repository of virtual machine snapshots; receiving, by the storage system, a selection of one of the deleted virtual machines in the list of deleted virtual machines; and recovering, by the storage system, the selected deleted virtual machine using a virtual machine snapshot for the selected deleted virtual machine.
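
Generating the list of deleted virtual machines can be sketched as a comparison between the snapshot repository and the current inventory. The sketch assumes snapshots are available as `(vm_id, snapshot_id, created_at)` tuples and that a virtual machine counts as deleted when it has snapshots but no longer appears in the active inventory; both the record layout and the function name are hypothetical.

```python
def list_deleted_vms(snapshots, active_vm_ids):
    """Return {vm_id: latest_snapshot_id} for VMs absent from the inventory."""
    latest = {}
    for vm_id, snapshot_id, created_at in snapshots:
        if vm_id in active_vm_ids:
            continue  # the VM still exists, so there is nothing to recover
        if vm_id not in latest or created_at > latest[vm_id][1]:
            latest[vm_id] = (snapshot_id, created_at)
    return {vm_id: snap for vm_id, (snap, _) in latest.items()}


# Hypothetical repository contents and inventory.
snapshots = [
    ("vm-1", "snap-10", 100),
    ("vm-2", "snap-20", 110),
    ("vm-2", "snap-21", 150),
    ("vm-3", "snap-30", 120),
]
deleted = list_deleted_vms(snapshots, active_vm_ids={"vm-1"})
```

A recovery step would then take the selected entry from this mapping and restore the virtual machine from the referenced snapshot.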

Techniques for data deduplication
11573928 · 2023-02-07

Techniques for processing data may include: receiving a data block stored in a data set, wherein a hash value is derived from the data block; determining, in accordance with selection criteria, whether the hash value is included in a subset; responsive to determining the hash value is included in the subset, performing processing that updates a table in accordance with the hash value and the data set, and determining, in accordance with the information in the table, whether to perform deduplication processing for the data block to determine whether the data block is a duplicate of another stored data block. The table may include an entry for the hash value. The entry may include information identifying data sets referencing the data block and, for each of the data sets, may specify a reference count denoting a number of times the data set references the data block.
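
The sampled table described above can be sketched as follows. The sketch assumes the selection criterion is "low bits of the hash are zero" and that deduplication processing is attempted once the sampled block has more than one reference across all data sets; the mask value, the decision rule, and the names `hash_of`, `new_table`, and `process_block` are hypothetical. The abstract does not say what happens to hashes outside the subset, so the sketch returns `None` for them.

```python
import hashlib
from collections import defaultdict


def hash_of(block):
    """Derive an integer hash value from a data block (bytes)."""
    return int.from_bytes(hashlib.sha256(block).digest()[:8], "big")


def new_table():
    """hash value -> {data set: reference count}."""
    return defaultdict(lambda: defaultdict(int))


def process_block(table, data_set, hash_value, sample_mask=0x3):
    """Update the table for sampled hashes and decide on deduplication.

    Returns None when the hash is outside the sampled subset, otherwise
    True if deduplication processing should be attempted for the block.
    """
    if hash_value & sample_mask:
        return None  # assumed criterion: only hashes with zero low bits
    entry = table[hash_value]
    entry[data_set] += 1  # per-data-set reference count
    return sum(entry.values()) > 1  # assumed rule: block was seen before


table = new_table()
first = process_block(table, "vol1", 0x10)    # sampled, first reference
second = process_block(table, "vol2", 0x10)   # sampled, second reference
skipped = process_block(table, "vol1", 0x11)  # not in the sampled subset
```

In practice the hash value would come from `hash_of(block)`; the integer literals above only keep the example deterministic.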