Patent classifications
G06F16/174
SYSTEMS AND METHODS FOR PHYSICAL CAPACITY ESTIMATION OF LOGICAL SPACE UNITS
Systems and methods of determining physical capacity of logical space units are disclosed. The method populates a first smart filter to track a physical capacity of a first logical space unit (LSU). The method adds fingerprints from the first LSU to register(s) of the first smart filter. The method populates a second smart filter to track fingerprints deleted by garbage collection (GC). The method adds the deleted fingerprints to register(s) of the second smart filter. Using the first and second smart filters, the method determines an intersection cardinality of the first LSU and the deleted fingerprints. The method determines a cardinality of unique fingerprints in the first LSU based on the intersection cardinality of the first LSU and the deleted fingerprints. The method determines the physical capacity of the first LSU based at least on the cardinality of unique fingerprints in the first LSU.
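The estimation flow above can be sketched as follows. This is a minimal illustration, not the patented method: the abstract does not name the structure behind a "smart filter", so the sketch assumes a HyperLogLog-style register array. The names `RegisterSketch`, `intersection_estimate`, `live_fingerprint_estimate`, and `physical_capacity_estimate`, as well as the average segment size, are hypothetical. The intersection is derived by inclusion-exclusion over merged registers, and subtracting it from the LSU estimate is one plausible reading of how the unique-fingerprint count is obtained.

```python
import hashlib
import math


class RegisterSketch:
    """HyperLogLog-style stand-in for a 'smart filter' (an assumption)."""

    def __init__(self, p=14):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, fingerprint: bytes) -> None:
        # Derive a 64-bit hash; top p bits pick the register,
        # the remaining bits determine the rank (leading zeros + 1).
        h = int.from_bytes(hashlib.sha1(fingerprint).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "RegisterSketch") -> "RegisterSketch":
        # Register-wise maximum yields a sketch of the union.
        out = RegisterSketch(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def intersection_estimate(a: RegisterSketch, b: RegisterSketch) -> float:
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    return max(0.0, a.estimate() + b.estimate() - a.merge(b).estimate())


def live_fingerprint_estimate(lsu: RegisterSketch, deleted: RegisterSketch) -> float:
    # Assumed interpretation: fingerprints in the LSU minus those GC has deleted.
    return lsu.estimate() - intersection_estimate(lsu, deleted)


def physical_capacity_estimate(lsu, deleted, avg_segment_bytes: float) -> float:
    # Hypothetical conversion from fingerprint count to bytes.
    return live_fingerprint_estimate(lsu, deleted) * avg_segment_bytes
```

Because the estimates are probabilistic, the intersection inherits the error of all three cardinality estimates; it is least reliable when the overlap is small relative to the two sets.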
DEDUPLICATING DATA INTEGRITY CHECKS ACROSS SYSTEMS
A computer-implemented method, according to one embodiment, includes: receiving, at a clustered filesystem from a formatted filesystem, a request to perform a data integrity check for a portion of data. A determination is made as to whether the request includes a filesystem type of the portion of data, and in response to determining that the request includes a filesystem type of the portion of data, another determination is made as to whether the clustered filesystem supports the data integrity check for the filesystem type. In response to determining the clustered filesystem supports the data integrity check, another determination is made as to whether the portion of data is currently available. Furthermore, the computer-implemented method includes causing the data integrity check to be performed in response to determining that the portion of data is currently available. Results of performing the data integrity check are also sent to the formatted filesystem.
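The chain of determinations above can be sketched as a single handler. This is an illustrative sketch only: `IntegrityRequest`, `handle_integrity_request`, the callback parameters, and the returned status strings are hypothetical, and the abstract does not say what happens on the negative branches, so the "rejected" and "deferred" outcomes are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntegrityRequest:
    data_id: str
    filesystem_type: Optional[str] = None  # may be absent from the request


def handle_integrity_request(request, supported_types, is_available, run_check):
    """Run the check only if every determination in the abstract succeeds."""
    # 1. Does the request include a filesystem type?
    if request.filesystem_type is None:
        return {"status": "rejected", "reason": "missing filesystem type"}
    # 2. Does the clustered filesystem support checks for that type?
    if request.filesystem_type not in supported_types:
        return {"status": "rejected", "reason": "unsupported filesystem type"}
    # 3. Is the portion of data currently available?
    if not is_available(request.data_id):
        return {"status": "deferred", "reason": "data not currently available"}
    # 4. Perform the check; the caller sends the result back to the
    #    formatted filesystem that issued the request.
    return {"status": "done", "result": run_check(request.data_id)}
```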
DUPLICATE FILE MANAGEMENT FOR CONTENT MANAGEMENT SYSTEMS AND FOR MIGRATION TO SUCH SYSTEMS
In large installations of document management systems, files are often duplicated. Users may place their own copies of files in convenient locations, or files may be duplicated unintentionally for other reasons. Duplication of files causes many problems for systems reliant on document management, chiefly because the additional (identical) files occupy extra storage space and must be handled like all other files, which results in greater network and resource utilization (with a concomitant increase in processing, search and retrieval times). Disclosed is a tool that standardizes the identification of duplicate files (based on their binary contents), as well as the identification of a primary duplicate (the original file), across multiple repositories in a manner that minimizes identification time.
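A minimal sketch of content-based duplicate identification follows. The abstract specifies only that duplicates are identified by binary contents and that a primary duplicate is chosen; hashing with SHA-256 and picking the earliest creation timestamp as the primary are assumptions, and `find_duplicates` is a hypothetical name. Files are passed in memory so the example stays self-contained.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group files by content hash and pick a primary per group.

    files: dict mapping path -> (content bytes, creation timestamp).
    Returns a dict mapping primary path -> list of duplicate paths.
    """
    groups = defaultdict(list)
    for path, (content, created) in files.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append((created, path))

    result = {}
    for members in groups.values():
        if len(members) > 1:
            members.sort()  # earliest timestamp first (assumed rule for "original")
            primary = members[0][1]
            result[primary] = [path for _, path in members[1:]]
    return result
```

For example, two repositories holding byte-identical copies of a report would yield one entry keyed by the older copy, with the newer copy listed as its duplicate.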
Distributed storage device and data management method in distributed storage device
The number of inter-node communications in inter-node deduplication can be reduced, achieving both stable performance and high capacity efficiency. A storage drive of each storage node stores files that are not deduplicated in the plurality of storage nodes, duplicate data storage files in which deduplicated duplicate data is stored, and cache data storage files in which cache data of duplicate data stored in another storage node is stored. When a read access request for the cache data is received, the processor of the storage node reads the cache data if it is stored in the cache data storage file, and requests another storage node to read the duplicate data related to the cache data if the cache data has been discarded.
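The read path above can be sketched as follows. This is an illustrative model, not the described device: `StorageNode`, its attributes, and the choice to re-populate the cache after a remote read are assumptions. The `remote_reads` counter stands in for inter-node communication.

```python
class StorageNode:
    def __init__(self, name):
        self.name = name
        self.duplicate_data = {}  # fingerprint -> bytes owned by this node
        self.cache = {}           # fingerprint -> bytes cached from another node
        self.remote_reads = 0     # counts inter-node requests issued by this node

    def read_duplicate(self, fingerprint, owner):
        """Read deduplicated data, preferring the local cache."""
        if owner is self:
            return self.duplicate_data[fingerprint]
        if fingerprint in self.cache:
            # Cache hit: no inter-node communication needed.
            return self.cache[fingerprint]
        # Cache data was discarded (or never present): ask the owning node.
        self.remote_reads += 1
        data = owner.duplicate_data[fingerprint]
        self.cache[fingerprint] = data  # assumed: re-cache after a remote read
        return data

    def discard_cache(self, fingerprint):
        self.cache.pop(fingerprint, None)
```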
Utilizing data source identifiers to obtain deduplication efficiency within a clustered storage environment
Described is a system (and method) that intelligently distributes data within a clustered storage environment. To provide such a capability, the system may distribute backup files by considering the source of the data to be backed up. In particular, the system may leverage the ability of front-end components such as a backup application to perform a granular data source identification of data. Such information may be propagated to back-end components such as a storage filesystem in the form of a data source identifier (e.g. placement tag). The data source identifiers may then be accessed by the clustered storage system to intelligently distribute backup files amongst a set of storage nodes forming a cluster. For example, backup files from the same data source may be stored on the same storage node to obtain the same deduplication efficiency as a single storage system.
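One simple way to keep backups from the same data source on the same node is to map the placement tag deterministically to a node. The sketch below assumes a hash-modulo mapping, which the abstract does not specify; `node_for_placement_tag` and the tag format are hypothetical.

```python
import hashlib


def node_for_placement_tag(tag: str, nodes: list) -> str:
    """Map a data source identifier (placement tag) to one storage node.

    The same tag always yields the same node, so successive backups of a
    data source land together and can deduplicate against each other.
    """
    digest = hashlib.sha256(tag.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```

A plain modulo mapping reshuffles most tags when the node list changes; a real cluster would likely need a more stable scheme, which is outside what the abstract describes.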
CONTENT-BASED DYNAMIC HYBRID DATA COMPRESSION
An information handling system includes a processor configured to process a training data file to determine an optimal data compression algorithm. The processor may also perform a compression ratio analysis that includes compressing the training data file using data compression algorithms, calculating a compression ratio associated with each of the data compression algorithms, determining an optimal compression ratio from the compression ratios associated with the data compression algorithms, and determining a desirable data compression algorithm associated with the training data file based on the optimal compression ratio. The processor may also perform a probability analysis that includes generating a symbol transition matrix based on the desirable data compression algorithm, extracting statistical feature data based on the symbol transition matrix, and generating probability matrices based on the statistical feature data to determine the optimal data compression algorithm for each segment of a working data file.
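The compression ratio analysis step can be sketched with Python's standard-library codecs. The choice of zlib, bz2, and lzma as candidate algorithms is an assumption (the abstract names none), as are the function names; the probability analysis is not covered here.

```python
import bz2
import lzma
import zlib

# Candidate algorithms (assumed; the abstract does not list any).
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratios(data: bytes) -> dict:
    """Compress the training data with each codec and compute original/compressed size."""
    return {name: len(data) / len(compress(data)) for name, compress in CODECS.items()}


def best_codec(data: bytes):
    """Return (codec name, ratio) for the highest compression ratio."""
    ratios = compression_ratios(data)
    name = max(ratios, key=ratios.get)
    return name, ratios[name]
```

Calling `best_codec(training_bytes)` yields the "desirable" algorithm in the abstract's terms, that is, the one with the optimal ratio on the training file.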
REDUCING BANDWIDTH DURING SYNTHETIC RESTORES FROM A DEDUPLICATION FILE SYSTEM
A request is received to restore a file at a deduplicated storage system to a client. The file resides at the storage system as a synthetic file based on a base file at the storage system. The request includes an indication that the base file is also present at the client. Metadata generated during a backup of the file to the storage system is reviewed. The metadata includes references to data determined to be in the base file at the storage system, and references to other data determined to not be in the base file at the storage system. The other data determined to not be in the base file is read from the storage system and transmitted to the client. Upon receipt, the client assembles the requested file using the base file present at the client and the other data determined to not be in the base file.
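The restore flow above can be sketched as two functions, one per side. The recipe format (a list of `("base", offset, length)` and `("new", key)` references) and the names `server_payload` and `client_assemble` are hypothetical stand-ins for the backup metadata the abstract describes.

```python
def server_payload(recipe, new_data_store):
    """Server side: collect only the data that is NOT in the base file."""
    return {ref[1]: new_data_store[ref[1]] for ref in recipe if ref[0] == "new"}


def client_assemble(recipe, base_file: bytes, payload: dict) -> bytes:
    """Client side: rebuild the file from the local base file plus received data."""
    parts = []
    for ref in recipe:
        if ref[0] == "base":
            _, offset, length = ref
            parts.append(base_file[offset:offset + length])  # already on the client
        else:
            parts.append(payload[ref[1]])                    # transmitted by the server
    return b"".join(parts)
```

Only the entries returned by `server_payload` cross the network, which is where the bandwidth saving comes from when most of the file is unchanged relative to the base.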
SYSTEM AND METHOD FOR FILE SYSTEM METADATA FILE REGION SEGMENTATION FOR DEDUPLICATION
A method for managing file based backups (FBBs) includes: obtaining, by a backup agent, a backup request for an FBB; in response to the backup request, generating an FBB; generating an FBB metadata file corresponding to the FBB, wherein the FBB metadata file comprises a set of attribute regions; performing, using the set of attribute regions, a deduplication on the FBB metadata file to obtain a deduplicated FBB metadata file; and storing the deduplicated FBB metadata file in a backup storage system.
Deduplicated data distribution techniques
In connection with a data distribution architecture, client-side “deduplication” techniques may be utilized for data transfers occurring among various file system nodes. In some examples, these deduplication techniques involve fingerprinting file system elements that are being shared and transferred, and dividing each file into separate units referred to as “blocks” or “chunks.” These separate units may be used for independently rebuilding a file from local and remote collections, storage locations, or sources. The deduplication techniques may be applied to data transfers to prevent unnecessary data transfers, and to reduce the amount of bandwidth, processing power, and memory used to synchronize and transfer data among the file system nodes. The described deduplication concepts may also be applied for purposes of efficient file replication, data transfers, and file system events occurring within and among networks and file system nodes.
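A minimal sketch of the chunk-and-fingerprint transfer described above. Fixed-size chunking, SHA-256 fingerprints, the tiny chunk size, and the function names are all assumptions made for illustration; the abstract does not specify how blocks are delimited or hashed.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny for illustration


def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and fingerprint each one."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]


def plan_transfer(data: bytes, receiver_store: dict, chunk_size: int = CHUNK_SIZE):
    """Return the file recipe and only the chunks the receiver lacks."""
    recipe, to_send = [], {}
    for fingerprint, chunk in chunk_fingerprints(data, chunk_size):
        recipe.append(fingerprint)
        if fingerprint not in receiver_store:
            to_send[fingerprint] = chunk
    return recipe, to_send


def rebuild(recipe, receiver_store: dict, received: dict) -> bytes:
    """Receiver side: merge new chunks into the local store and reassemble."""
    receiver_store.update(received)
    return b"".join(receiver_store[fp] for fp in recipe)
```

In this sketch the sender is assumed to know which fingerprints the receiver holds; in practice that would require an exchange of fingerprint lists before any chunk data moves.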