G06F11/2056

High reliability fault tolerant computer architecture

A fault tolerant computer system and method are disclosed. The system may include a plurality of CPU nodes, each including: a processor and a memory; at least two IO domains, wherein at least one of the IO domains is designated an active IO domain performing communication functions for the active CPU nodes; and a switching fabric connecting each CPU node to each IO domain. One CPU node is designated a standby CPU node and the remainder are designated as active CPU nodes. If a failure, a beginning of a failure, or a predicted failure occurs in an active node, the state and memory of the active CPU node are transferred to the standby CPU node which becomes the new active CPU node. If a failure occurs in an active IO domain, the communication functions performed by the failing active IO domain are transferred to the other IO domain.
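
As a rough illustration of the standby takeover described in this abstract, the following minimal Python sketch copies a failing active node's memory to the standby node and swaps their roles. The names `CpuNode` and `fail_over`, the role strings, and the dictionary used as "memory" are all hypothetical; the abstract does not specify any data structures, and IO domain failover is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class CpuNode:
    name: str
    role: str = "active"  # hypothetical roles: "active", "standby", "failed"
    memory: dict = field(default_factory=dict)


def fail_over(nodes, failing_name):
    """Transfer the failing active node's memory to the standby node.

    The standby becomes the new active node; the failing node is marked failed.
    """
    failing = next(n for n in nodes if n.name == failing_name and n.role == "active")
    standby = next(n for n in nodes if n.role == "standby")
    standby.memory = dict(failing.memory)  # stand-in for state and memory transfer
    standby.role = "active"
    failing.role = "failed"
    return standby


nodes = [
    CpuNode("cpu0", memory={"pc": 42}),
    CpuNode("cpu1"),
    CpuNode("cpu2", role="standby"),
]
new_active = fail_over(nodes, "cpu0")
```

In this sketch the system is left without a standby after the takeover; how a replacement standby is designated is outside what the abstract states.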

Continuous data protection
11500740 · 2022-11-15

Providing continuous data protection includes maintaining a database having substantially all data modifications made to a primary volume over a recovery interval. The database is maintained in conjunction with a copying operation where the data of the primary volume are mirrored to a remote volume to permit recovery of mirrored data in the event of loss of primary volume data. The contents of the remote volume generally lag behind the contents of the primary volume by substantially the recovery interval. Providing continuous data protection also includes providing data roll-back to a precise point in time within the recovery interval by applying, to the contents of the remote volume, all data modifications in the database that occurred between the latest data modification to the remote volume and the precise point in time within the recovery interval. A time stamp mechanism of sufficient precision and granularity may be used.
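
The point-in-time recovery step can be sketched as follows, assuming the modification database is a list of time-stamped block writes and the remote volume is a simple block-to-data mapping. The function name `roll_to_point_in_time` and the tuple layout are assumptions made for illustration only.

```python
def roll_to_point_in_time(remote_volume, remote_applied_up_to, journal, target_time):
    """Return the volume contents as of target_time.

    remote_volume: dict mapping block number to data (the lagging mirror).
    remote_applied_up_to: time stamp of the latest modification already
        present on the remote volume.
    journal: list of (time_stamp, block, data) tuples sorted by time stamp,
        standing in for the database of modifications.
    """
    volume = dict(remote_volume)
    for time_stamp, block, data in journal:
        # Apply only modifications newer than the remote copy and no later
        # than the requested point in time.
        if remote_applied_up_to < time_stamp <= target_time:
            volume[block] = data
    return volume


# The remote volume already reflects the write made at time 10.
remote = {0: "a1", 1: "b0"}
journal = [(10, 0, "a1"), (20, 1, "b1"), (30, 0, "a2"), (40, 1, "b2")]
restored = roll_to_point_in_time(remote, remote_applied_up_to=10,
                                 journal=journal, target_time=30)
```

The write at time 40 is skipped because it falls after the requested point in time, which is why the precision of the time stamps matters.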

Dynamically allocating streams during restoration of data

The systems and methods described herein dynamically allocate streams when restoring data from databases. In some embodiments, the systems and methods restore data from a database by determining a number of streams to allocate to the database for restoring files of data from the database. The determined number of streams may be based on a total amount of data within the database, and/or may be based, at least in part, on the previous number of streams used during backup operations, in order to balance the benefit of allocating streams to a restoration of data with any detriments associated with changing the number of streams from the number used during previous backup operations.
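
One way to picture the trade-off is the heuristic below. It is a sketch under stated assumptions, not the method claimed: the per-stream capacity, the stream cap, and the change threshold are invented parameters, and the abstract does not say how the two factors are weighed.

```python
import math


def choose_restore_streams(total_bytes, backup_streams,
                           bytes_per_stream=100 * 2**30,
                           max_streams=16,
                           change_threshold=2):
    """Pick a stream count for a restore (hypothetical heuristic).

    The ideal count grows with the amount of data, but the count used during
    backup is kept unless the ideal differs from it by at least
    change_threshold, modeling a cost of changing the stream count.
    """
    ideal = max(1, min(max_streams, math.ceil(total_bytes / bytes_per_stream)))
    if abs(ideal - backup_streams) < change_threshold:
        return backup_streams
    return ideal


small_change = choose_restore_streams(450 * 2**30, backup_streams=4)
large_change = choose_restore_streams(1000 * 2**30, backup_streams=4)
```

With these assumed defaults, a 450 GiB database keeps the four streams used at backup time, while a 1000 GiB database moves to ten.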

CLOUD BLOCK MAP FOR CACHING DATA DURING ON-DEMAND RESTORE
20230032522 · 2023-02-02

Techniques are provided for caching data during an on-demand restore using a cloud block map. A client may be provided with access to an on-demand volume during a restore process that copies backup data from a snapshot within a remote object store to the on-demand volume stored within local storage. In response to receiving a request from the client for a block of the backup data not yet restored from the snapshot to the on-demand volume, the block may be retrieved from the snapshot in the remote object store. The block may be cached within a cloud block map stored within the local storage as a cached block. The client may be provided with access to the cached block.
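
The read path during an on-demand restore might look like the sketch below. The class `OnDemandVolume`, its attributes, and the use of plain dictionaries for the snapshot and the cloud block map are assumptions for illustration; a real remote object store would be accessed over a network API.

```python
class OnDemandVolume:
    def __init__(self, snapshot):
        self.snapshot = snapshot       # stand-in for the snapshot in the remote object store
        self.restored = {}             # blocks already copied by the restore process
        self.cloud_block_map = {}      # locally cached blocks fetched on demand
        self.on_demand_fetches = 0     # counts on-demand reads of the snapshot

    def restore_block(self, block_id):
        """Background restore: copy one block from the snapshot to the volume."""
        self.restored[block_id] = self.snapshot[block_id]

    def read(self, block_id):
        """Serve a client read, fetching and caching unrestored blocks."""
        if block_id in self.restored:
            return self.restored[block_id]
        if block_id not in self.cloud_block_map:
            self.on_demand_fetches += 1
            self.cloud_block_map[block_id] = self.snapshot[block_id]
        return self.cloud_block_map[block_id]


volume = OnDemandVolume({0: b"alpha", 1: b"beta"})
volume.restore_block(0)
first = volume.read(1)    # not yet restored: fetched from the snapshot and cached
second = volume.read(1)   # served from the cloud block map
```

The second read of block 1 does not touch the snapshot again, which is the benefit the cache is meant to provide.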

RUNTIME SPARING FOR UNCORRECTABLE ERRORS BASED ON FAULT-AWARE ANALYSIS
20220350715 · 2022-11-03

A system can respond to detection or prediction of an uncorrectable error (UE) in memory based on fault-aware analysis. The fault-aware analysis enables the system to generate a determination of a specific hardware element of the memory that is faulty. In response to detection of an error, the system can correlate a hardware configuration of the memory device with historical data indicating memory faults for hardware elements of the hardware configuration. Based on a determination of the specific component that likely caused the UE, the system can identify a region of memory associated with the detected UE and mirror the faulty region to a reserved memory space of the memory device for access to data of the faulty region.

EVENT-DRIVEN SYSTEM FAILOVER AND FAILBACK
20230083450 · 2023-03-16

A system determines that a primary event processor, included in a primary data center, is associated with a failure. The primary event processor is included in the primary data center and configured to process first events stored in a main event store of the primary data center. The system identifies a secondary event processor, in a secondary data center, that is to process one or more first events based on the failure. The primary event processor and the secondary event processor are configured to process a same type of event. The system causes, based on a configuration associated with the primary or secondary event processor, the one or more first events to be retrieved from one of the main event store or a replica event store. The replica event store is included in the secondary data center and mirrors the main event store of the primary data center.
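
The choice of event store during failover can be sketched as a small configuration-driven function. The configuration key `failover_source` and the `main_reachable` flag are hypothetical; the abstract only states that the choice is based on a configuration associated with the primary or secondary event processor.

```python
def events_for_failover(config, main_store, replica_store, main_reachable):
    """Return the events the secondary event processor should handle.

    Reads from the main event store only when the (hypothetical) configuration
    asks for it and the main store is reachable; otherwise reads from the
    replica in the secondary data center.
    """
    if config.get("failover_source") == "main" and main_reachable:
        return list(main_store)
    return list(replica_store)


main_store = ["order-1", "order-2"]
replica_store = ["order-1", "order-2"]  # mirrors the main event store
from_replica = events_for_failover({"failover_source": "replica"},
                                   main_store, replica_store, main_reachable=True)
from_main = events_for_failover({"failover_source": "main"},
                                main_store, replica_store, main_reachable=True)
```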

Storage based file FTP

Transferring files directly from a storage system to a backup storage system includes determining identifiers for blocks on the storage system that correspond to files that are to be backed up, providing the identifiers for the blocks to the storage system, and the storage system pushing the blocks indicated by the identifiers directly from the storage system to the backup storage system. The identifiers may be logical block addresses. Determining the logical block addresses may vary according to a file system for files that are to be backed up. Determining the logical block address may include determining an inode value for each of the files that are to be backed up or may include determining a logical cluster number for each of the files that are to be backed up. The backup storage system may include a media server and a storage device.
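
A minimal sketch of the two steps, resolving files to logical block addresses and having the storage system push those blocks, is shown below. The lookup table stands in for a file-system-specific inode or logical cluster number lookup, and `StorageSystem.push` is a hypothetical interface, not a real product API.

```python
def blocks_for_files(file_table, filenames):
    """Collect logical block addresses for the files to back up.

    file_table maps a file name to its list of logical block addresses; in
    practice this lookup would depend on the file system.
    """
    lbas = []
    for name in filenames:
        lbas.extend(file_table[name])
    return lbas


class StorageSystem:
    def __init__(self, blocks):
        self.blocks = blocks  # logical block address -> data

    def push(self, lbas, backup):
        """Copy the identified blocks directly to the backup storage system."""
        for lba in lbas:
            backup[lba] = self.blocks[lba]


storage = StorageSystem({100: b"a", 101: b"b", 200: b"c", 300: b"d"})
file_table = {"report.txt": [100, 101], "notes.txt": [200], "tmp.bin": [300]}
backup = {}
storage.push(blocks_for_files(file_table, ["report.txt", "notes.txt"]), backup)
```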

Using data mirroring across multiple regions to reduce the likelihood of losing objects maintained in cloud object storage
11481319 · 2022-10-25

Techniques for using data mirroring across regions to reduce the likelihood of losing objects in a cloud object storage platform are provided. In one set of embodiments, a computer system can upload first and second copies of a data object to first and second regions of the cloud object storage platform respectively, where the first and second copies are identical. The computer system can then attempt to read the first copy of the data object from the first region. If the read attempt fails, the computer system can retrieve the second copy of the data object from the second region.
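
The upload and fallback read can be sketched with in-memory stand-ins for two regions. The `Region` class and the use of `KeyError` to signal a failed read are assumptions; a real cloud object storage client would expose its own calls and error types.

```python
class Region:
    """In-memory stand-in for one region of a cloud object storage platform."""

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        if key not in self.objects:
            raise KeyError(key)  # models a failed read
        return self.objects[key]


def upload_mirrored(key, data, first, second):
    """Upload identical copies of the object to both regions."""
    first.put(key, data)
    second.put(key, data)


def read_with_fallback(key, first, second):
    """Try the first region; on failure, retrieve the copy from the second."""
    try:
        return first.get(key)
    except KeyError:
        return second.get(key)


region_a, region_b = Region("region-a"), Region("region-b")
upload_mirrored("segment-1", b"payload", region_a, region_b)
del region_a.objects["segment-1"]  # simulate loss of the object in the first region
recovered = read_with_fallback("segment-1", region_a, region_b)
```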

Event-driven system failover and failback
11636013 · 2023-04-25

A system determines that a primary event processor, included in a primary data center, is associated with a failure. The primary event processor is included in the primary data center and configured to process first events stored in a main event store of the primary data center. The system identifies a secondary event processor, in a secondary data center, that is to process one or more first events based on the failure. The primary event processor and the secondary event processor are configured to process a same type of event. The system causes, based on a configuration associated with the primary or secondary event processor, the one or more first events to be retrieved from one of the main event store or a replica event store. The replica event store is included in the secondary data center and mirrors the main event store of the primary data center.

Automatic objective-based compression level change for individual clusters

A method, computer system, and a computer program product for objective-based compression level change is provided. The present invention may include storing a volume in a storage device, wherein the stored volume is compressed using an initial compression level. The present invention may also include checking a last access time of the stored volume in the storage device at a regular interval. The present invention may further include, in response to determining, based on the checked last access time, that the stored volume is not accessed at the regular interval, recompressing the stored volume in the storage device using a higher compression level, wherein the higher compression level includes a higher compression ratio than a compression ratio associated with the initial compression level.
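
The periodic check and recompression can be sketched with Python's standard `zlib` module. The class `StoredVolume`, the explicit `now` arguments used in place of a real clock, and the choice of zlib levels 1 and 9 are assumptions for illustration; the abstract does not name a compression algorithm, and a higher zlib level usually, but not always, yields a higher compression ratio.

```python
import zlib


class StoredVolume:
    def __init__(self, data, level=1, now=0.0):
        self.level = level                      # initial compression level
        self.blob = zlib.compress(data, level)
        self.last_access = now

    def read(self, now):
        """Return the volume data and record the access time."""
        self.last_access = now
        return zlib.decompress(self.blob)


def check_and_recompress(volume, now, interval, higher_level=9):
    """Recompress at a higher level if the volume was idle for a full interval."""
    idle = now - volume.last_access >= interval
    if idle and volume.level < higher_level:
        data = zlib.decompress(volume.blob)
        volume.blob = zlib.compress(data, higher_level)
        volume.level = higher_level
        return True
    return False


data = b"cold block of data " * 5000
volume = StoredVolume(data, level=1, now=0.0)
changed_early = check_and_recompress(volume, now=10.0, interval=60.0)
changed_late = check_and_recompress(volume, now=100.0, interval=60.0)
```

The first check leaves the volume alone because it was accessed within the interval; the second finds it idle and raises the level.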