G06F11/1471

DATABASE RAPID RESTORE AFTER MEDIA FAILURE

A computer program product, system, and computer-implemented method for rapid database restoration. The restore and recovery process restores one or more sparse data files and/or blocks, provides a mechanism to redirect requests against those sparse data files and/or blocks to a backup copy of the actual data files and/or blocks, and populates the sparse data files and/or blocks while the database remains operational for servicing user requests. The approach includes the creation and population of the one or more sparse data files and/or blocks, a redirection mechanism to service read operations where necessary, and a process to restore the data to the sparse data files and/or blocks over time, while the database maintains operability.
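
To illustrate the mechanism, here is a minimal Python sketch, with all names hypothetical and block granularity simplified: reads of still-unpopulated sparse blocks are redirected to the backup copy, writes land in the restore target, and a background step populates outstanding blocks while the database keeps serving requests.

    BLOCK_SIZE = 4096

    class SparseRestoreTarget:
        def __init__(self, backup_blocks):
            self.backup = backup_blocks          # backup copy: block_no -> bytes
            self.local = {}                      # sparse file: only populated blocks
            self.pending = set(backup_blocks)    # blocks still to be restored

        def read_block(self, block_no):
            # Redirect to the backup copy while the sparse block is unpopulated.
            if block_no in self.local:
                return self.local[block_no]
            return self.backup[block_no]

        def write_block(self, block_no, data):
            # New writes land in the restore target and need no later population.
            self.local[block_no] = data
            self.pending.discard(block_no)

        def populate_step(self):
            # Background restore: copy one outstanding block from the backup.
            if self.pending:
                block_no = self.pending.pop()
                self.local[block_no] = self.backup[block_no]
            return bool(self.pending)

    # The database stays operational: reads and writes interleave with populate_step().
    target = SparseRestoreTarget({0: b"a" * BLOCK_SIZE, 1: b"b" * BLOCK_SIZE})
    assert target.read_block(1) == b"b" * BLOCK_SIZE   # served via redirection
    while target.populate_step():
        pass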

Optimized disaster-recovery-as-a-service system

Methods, computer program products, and systems are presented. The methods include, for instance: analyzing a dataset associated with a service provided by a data protection service provider in order to determine a policy for when and how to replicate the respective components of the dataset corresponding to the service from a source site to a target site, such that the target site may perform the service at minimum cost.
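
A minimal sketch of the policy determination, assuming a hypothetical menu of replication methods and per-component recovery requirements (the figures and names are illustrative, not from the patent): each component gets the cheapest method that still meets its recovery deadline.

    REPLICATION_METHODS = [
        # (name, recovery_seconds, cost_per_gb) -- illustrative figures only
        ("continuous", 60, 0.10),
        ("hourly-snapshot", 3600, 0.03),
        ("daily-backup", 86400, 0.01),
    ]

    def plan_replication(components):
        """components: list of (name, size_gb, max_recovery_seconds)."""
        policy = {}
        for name, size_gb, rto in components:
            eligible = [m for m in REPLICATION_METHODS if m[1] <= rto]
            method = min(eligible, key=lambda m: m[2] * size_gb)  # minimize cost
            policy[name] = method[0]
        return policy

    print(plan_replication([("orders-db", 200, 120), ("cold-archive", 5000, 90000)]))
    # {'orders-db': 'continuous', 'cold-archive': 'daily-backup'}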

CHECKPOINT STATE STORAGE FOR MACHINE-LEARNING MODEL TRAINING
20230229905 · 2023-07-20

A method for training a machine-learning model. A plurality of nodes are assigned for training the machine-learning model. Nodes include agents comprising at least an agent processing unit and local memory. Each agent manages, via a local network, one or more workers that include a worker processing unit. Shards of a training data set are distributed for parallel processing by workers at different nodes. Each worker processing unit is configured to iteratively train on minibatches of a shard, and to report checkpoint states indicating updated parameters for storage in local memory. Based at least on recognizing that a worker processing unit has failed, the failed worker processing unit is reassigned and initialized based at least on a checkpoint state stored in local memory.
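
A minimal sketch of the checkpoint-and-recover flow, with a toy parameter update standing in for real training; the Agent/worker structure and names are assumptions for illustration only.

    class Agent:
        def __init__(self):
            self.checkpoints = {}          # worker_id -> (step, params) in local memory

        def report(self, worker_id, step, params):
            self.checkpoints[worker_id] = (step, dict(params))

        def recover(self, worker_id):
            return self.checkpoints.get(worker_id, (0, {}))

    def run_worker(agent, worker_id, shard, checkpoint_every=2):
        step, params = agent.recover(worker_id)      # resume from the last checkpoint
        params = dict(params) or {"w": 0.0}
        for i, minibatch in enumerate(shard[step:], start=step):
            params["w"] += sum(minibatch) * 1e-3     # stand-in for a gradient step
            if (i + 1) % checkpoint_every == 0:
                agent.report(worker_id, i + 1, params)
        return params

    agent = Agent()
    shard = [[1, 2], [3, 4], [5, 6], [7, 8]]
    run_worker(agent, "w0", shard)    # first assignment trains, checkpointing as it goes
    run_worker(agent, "w0", shard)    # a reassigned unit resumes from the stored state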

Fast migration of metadata

One or more buckets of key-value pairs of a first node of a distributed storage system are selected to be migrated to a second node of the distributed storage system. One or more underlying database files corresponding to the one or more selected buckets are identified. The one or more identified underlying database files are directly copied from a storage of the first node to a storage of the second node. The copied underlying database files are linked in a database of the second node to implement the one or more selected buckets in the second node.
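
A minimal sketch, under the assumed (hypothetical) layout that each bucket is backed by exactly one database file named after it: migration copies the file wholesale and links it into the destination catalog instead of replaying individual key-value writes.

    import shutil
    from pathlib import Path

    def migrate_buckets(buckets, src_root, dst_root, dst_db_catalog):
        for bucket in buckets:
            # 1. Identify the underlying database file backing this bucket.
            src_file = Path(src_root) / f"{bucket}.db"
            dst_file = Path(dst_root) / src_file.name
            # 2. Copy the file directly between the nodes' storage.
            shutil.copyfile(src_file, dst_file)
            # 3. Link the copied file into the destination database so the
            #    bucket becomes live on the second node.
            dst_db_catalog[bucket] = dst_file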

Methods and systems for power failure resistance for a distributed storage system

A plurality of computing devices are communicatively coupled to each other via a network, and each of the plurality of computing devices is operably coupled to one or more of a plurality of storage devices. One or more of the computing devices and/or the storage devices may be used to rebuild data that may be lost due to a power failure.

Database recovery time objective optimization with synthetic snapshots

Methods and systems for reducing the amount of time to restore a database or other application by dynamically generating and storing synthetic snapshots are described. When backing up a database, an integrated data management and storage system may acquire snapshots of the database at a snapshot frequency and acquire database transaction logs at a frequency that is greater than the snapshot frequency. In response to detecting that the database is unable to provide a database snapshot, the integrated data management and storage system may generate a synthetic snapshot of the database by instantiating a compatible version of the database locally, acquiring a previously stored snapshot of the database, applying data changes from one or more database transaction logs to the previously stored snapshot to generate the synthetic snapshot, and storing the synthetic snapshot of the database within the integrated data management and storage system.
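
A minimal sketch of the log-replay step that produces the synthetic snapshot, using a hypothetical put/delete log format rather than any real transaction-log encoding.

    def generate_synthetic_snapshot(stored_snapshot, transaction_logs):
        synthetic = dict(stored_snapshot)            # start from the prior snapshot
        for log in transaction_logs:                 # logs are ordered by time
            for op, key, value in log:
                if op == "put":
                    synthetic[key] = value
                elif op == "delete":
                    synthetic.pop(key, None)
        return synthetic

    snapshot_t0 = {"a": 1, "b": 2}
    logs = [[("put", "b", 3)], [("delete", "a", None), ("put", "c", 4)]]
    print(generate_synthetic_snapshot(snapshot_t0, logs))   # {'b': 3, 'c': 4}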

CONTINUOUS DATA PROTECTION USING A WRITE FILTER

A reference snapshot of a storage is stored. Data changes that modify the storage are received. The data changes are captured by a write filter of the storage. The received data changes are logged. The data changes occurring after an instance time of the reference snapshot are applied to the reference snapshot to generate a first incremental snapshot corresponding to a first intermediate reference restoration point. The data changes occurring after an instance time of the first incremental snapshot are applied to the first incremental snapshot to generate a second incremental snapshot corresponding to a second intermediate reference restoration point.
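
A minimal sketch of chaining intermediate reference restoration points from a logged change stream; the (time, offset, data) layout of the change log is an assumption for illustration.

    def apply_changes(snapshot, change_log, after_time, until_time):
        result = dict(snapshot["data"])
        for t, offset, data in change_log:
            if after_time < t <= until_time:
                result[offset] = data                 # replay the logged write
        return {"time": until_time, "data": result}

    reference = {"time": 0, "data": {0: "A", 1: "B"}}
    log = [(1, 0, "A'"), (2, 1, "B'"), (3, 0, "A''")]  # captured by the write filter

    first = apply_changes(reference, log, reference["time"], 2)   # 1st restore point
    second = apply_changes(first, log, first["time"], 3)          # 2nd restore point
    print(first["data"], second["data"])  # {0: "A'", 1: "B'"} {0: "A''", 1: "B'"}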

Creating database clones at a specified point-in-time

A point-in-time clone may be created for a database. A request to create the point-in-time clone may be received. The clone may be provided with access to a storage for the database that stores a history of modifications to the database, which can be applied to return the data of the database according to the state of the data at the specified point in time. The clone may then be updated, with the updates made to the clone stored for subsequent access by the clone.
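
A minimal sketch, assuming the modification history is kept as time-ordered versions per key: reads resolve against the history as of the clone's point in time, and the clone's own updates live in an overlay for its subsequent access.

    class PointInTimeClone:
        def __init__(self, history, point_in_time):
            self.history = history        # key -> [(time, value), ...] ascending
            self.t = point_in_time
            self.overlay = {}             # updates made to the clone

        def get(self, key):
            if key in self.overlay:
                return self.overlay[key]
            # Return the value as of the clone's point in time.
            value = None
            for t, v in self.history.get(key, []):
                if t <= self.t:
                    value = v
            return value

        def put(self, key, value):
            self.overlay[key] = value     # stored for subsequent access by the clone

    history = {"x": [(1, "old"), (5, "new")]}
    clone = PointInTimeClone(history, point_in_time=3)
    print(clone.get("x"))   # 'old' -- state of the data at the specified time
    clone.put("x", "clone-local")
    print(clone.get("x"))   # 'clone-local'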

Intelligently adaptive log level management of a service mesh

Systems, methods and/or computer program products for dynamically managing log levels of microservices in a service mesh based on predicted error rates of calls made to the service mesh. A first AI module predicts the health status and/or failures of microservices, individually or as part of microservice chains, with a particular confidence level. Using the health status mapped to the microservices and historical information inputted into a knowledge base (including error rates), the first AI module predicts error rates of API calls for each user profile or for the service mesh generally. A second AI module analyzes the predictions provided by the first AI module and determines whether the predictions meet threshold levels of confidence. To improve the confidence of predictions that fall below those thresholds, the second AI module dynamically adjusts the application logs of the microservices and/or their proxies to an appropriate level to capture more detailed information within the logs.
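
A minimal sketch of the second module's adjustment rule, with a hypothetical confidence threshold and log-level ladder; the AI predictions themselves are stubbed as inputs.

    LOG_LEVELS = ["ERROR", "WARN", "INFO", "DEBUG", "TRACE"]

    def adjust_log_levels(predictions, current_levels, min_confidence=0.8):
        """predictions: {service: (predicted_error_rate, confidence)}."""
        new_levels = dict(current_levels)
        for service, (_error_rate, confidence) in predictions.items():
            if confidence < min_confidence:
                # Capture more detail where the prediction confidence is low.
                idx = LOG_LEVELS.index(new_levels.get(service, "INFO"))
                new_levels[service] = LOG_LEVELS[min(idx + 1, len(LOG_LEVELS) - 1)]
        return new_levels

    preds = {"checkout": (0.02, 0.95), "inventory": (0.10, 0.55)}
    print(adjust_log_levels(preds, {"checkout": "INFO", "inventory": "INFO"}))
    # {'checkout': 'INFO', 'inventory': 'DEBUG'}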

System and method for maintaining a distributed ledger

A method of maintaining a distributed ledger at a client node includes: storing a distributed ledger defining a plurality of records each containing a set of values; storing (i) a local voting weight corresponding to the client node, and (ii) respective remote voting weights for a plurality of remote client nodes; obtaining a proposed update to a record of the distributed ledger; generating a local vote to apply or discard the proposed update and transmitting the local vote to the remote client nodes; receiving remote votes to apply or discard the proposed update from the remote client nodes; determining whether to permit the proposed update based on (i) the local vote and the local voting weight, and (ii) the remote votes and the corresponding remote voting weights; and according to the determination, applying the proposed update to the distributed ledger or discarding the proposed update.
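
A minimal sketch of the weighted-vote determination, assuming a strict weighted-majority quorum rule (the abstract leaves the exact threshold unspecified).

    def decide_update(local_vote, local_weight, remote_votes, remote_weights):
        """Votes are True (apply) or False (discard); weights are keyed by node id."""
        apply_weight = local_weight if local_vote else 0.0
        total_weight = local_weight
        for node, vote in remote_votes.items():
            w = remote_weights[node]
            total_weight += w
            if vote:
                apply_weight += w
        return apply_weight > total_weight / 2     # permit on weighted majority

    remote_votes = {"n1": True, "n2": False, "n3": True}
    remote_weights = {"n1": 2.0, "n2": 5.0, "n3": 1.0}
    if decide_update(True, 1.5, remote_votes, remote_weights):
        print("apply proposed update to the ledger")
    else:
        print("discard proposed update")   # n2's weight outvotes three lighter nodes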