G06F11/2048

Managing data center failure events

Managing data center recovery from failure events can include a failure event platform having aspects provided via a user interface that integrates multiple failure and recovery management and execution features. The features can include, among others, application drift monitoring between production and recovery environments, real-time health checks of system components, user-modifiable scripting for prioritizing and customizing data center recovery actions, and a recovery execution tool.

Resource manager for transaction processing systems

A resource manager (RM) instance is associated with each transaction processing system (TPS) member of a TPS group. Each RM instance monitors performance of the associated TPS member. If a TPS member becomes unavailable for any reason (a failing TPS), the associated RM instance broadcasts the status of the failing TPS to the RM instances associated with the “surviving” members of the group. RM instances associated with surviving members initiate a series of actions that reduce the resources used by the surviving TPS members. Consequently, the surviving TPS members are better able to process the additional workload imposed on them due to the unavailability of the failing TPS. Once the failing TPS is brought back online and made available again (or a replacement TPS is brought online), RM instances associated with the surviving members perform actions to undo the resource usage reduction tasks, and the TPS group returns to a nominal configuration.
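The broadcast, reduce, and undo cycle described in this abstract can be sketched as follows. This is a minimal illustration only: the class `ResourceManager`, the helper `make_group`, and the boolean `reduced` flag (standing in for whatever resource-reduction actions a real RM would take) are hypothetical names, not taken from the patent.

```python
class ResourceManager:
    """Hypothetical RM instance paired with one TPS member."""

    def __init__(self, member):
        self.member = member
        self.peers = []            # RM instances of the other group members
        self.failed_peers = set()  # members currently reported as unavailable
        self.reduced = False       # True while resource reduction is in effect

    def report_failure(self):
        """Broadcast that this RM's own TPS member has become unavailable."""
        for peer in self.peers:
            peer.on_peer_failed(self.member)

    def report_recovery(self):
        """Broadcast that this RM's TPS member (or a replacement) is back."""
        for peer in self.peers:
            peer.on_peer_recovered(self.member)

    def on_peer_failed(self, member):
        # Surviving member: cut resource usage to absorb the extra workload.
        self.failed_peers.add(member)
        self.reduced = True

    def on_peer_recovered(self, member):
        # Undo the reduction only once no failed peers remain.
        self.failed_peers.discard(member)
        if not self.failed_peers:
            self.reduced = False


def make_group(names):
    """Create one RM per TPS member and wire each RM to all the others."""
    rms = [ResourceManager(name) for name in names]
    for rm in rms:
        rm.peers = [p for p in rms if p is not rm]
    return rms
```

In this sketch, calling `report_failure()` on one RM puts every other RM in the group into the reduced state, and `report_recovery()` returns the group to its nominal configuration.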

Systems and methods for managing a highly available and scalable distributed database in a cloud computing environment

Systems and methods for managing a highly available distributed database comprising: a memory storing instructions; and one or more processors configured to execute the instructions to: determine that a source node, in a distributed database comprising the source node and one or more replica nodes, is not available; select a most-updated replica node from the one or more replica nodes; switch a role of the most-updated replica node to source; update a data store to label the source node as unavailable and the selected replica node as being a promoted source node; send a notification to a user device to update a database topology based on the updated data store; determine whether the user device has updated the database topology; and upon determining the user device has not updated the database topology, continue to send the notification to the user device until the user device has updated the database topology.
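The failover sequence in this abstract can be sketched as follows. The function `fail_over`, the `applied_seq` field used to decide which replica is "most-updated", the `notify` callback, and the `max_attempts` bound are all assumptions made for illustration; the abstract describes resending the notification until the topology is updated and does not specify a retry limit.

```python
def fail_over(nodes, data_store, notify, max_attempts=5):
    """Promote the most-updated replica when the source node is unavailable.

    nodes:      dict of name -> {"role", "available", "applied_seq"} (hypothetical shape)
    data_store: dict that receives the availability labels
    notify:     callable(topology) -> bool, True once the user device has
                updated its database topology (hypothetical callback)
    """
    source = next(name for name, n in nodes.items() if n["role"] == "source")
    if nodes[source]["available"]:
        return None  # nothing to do

    candidates = [name for name, n in nodes.items()
                  if n["role"] == "replica" and n["available"]]
    if not candidates:
        raise RuntimeError("no available replica to promote")

    # Assumption: the highest applied sequence number marks the most-updated replica.
    promoted = max(candidates, key=lambda name: nodes[name]["applied_seq"])
    nodes[promoted]["role"] = "source"

    data_store[source] = "unavailable"
    data_store[promoted] = "promoted_source"

    # Keep notifying until the device confirms the update.
    # Assumption: bounded here so the sketch always terminates.
    for _ in range(max_attempts):
        if notify(dict(data_store)):
            break
    return promoted
```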

AUTOMATIC REPLACEMENT OF COMPUTING NODES IN A VIRTUAL COMPUTER NETWORK

Techniques are described for providing managed computer networks, such as for managed virtual computer networks overlaid on one or more other underlying computer networks. In some situations, the techniques include facilitating replication of a primary computing node that is actively participating in a managed computer network, such as by maintaining one or more other computing nodes in the managed computer network as replicas, and using such replica computing nodes in various manners. For example, a particular managed virtual computer network may span multiple broadcast domains of an underlying computer network, and a particular primary computing node and a corresponding remote replica computing node of the managed virtual computer network may be implemented in distinct broadcast domains of the underlying computer network, with the replica computing node being used to transparently replace the primary computing node in the virtual computer network if the primary computing node becomes unavailable.

ENHANCED FILE INDEXING, LIVE BROWSING, AND RESTORING OF BACKUP COPIES OF VIRTUAL MACHINES AND/OR FILE SYSTEMS BY POPULATING AND TRACKING A CACHE STORAGE AREA AND A BACKUP INDEX

An illustrative approach accelerates file indexing operations for block-level backup copies in a data storage management system. A cache storage area is maintained for locally storing and serving key data blocks, thus relying less on retrieving data on demand from the backup copy. File indexing operations are used for populating the cache storage area for speedier retrieval during subsequent live browsing of the same backup copy, and vice versa. The key data blocks cached while file indexing and/or live browsing an earlier backup copy help to pre-fetch corresponding data blocks of later backup copies, thus producing a beneficial learning cycle. The approach is especially beneficial for cloud and tape backup media, and is available for a variety of data sources and backup copies, including block-level backup copies of virtual machines (VMs) and block-level backup copies of file systems, including UNIX-based and Windows-based operating systems and corresponding file systems.
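The idea of a cache that file indexing populates and live browsing later reuses can be sketched as follows. The class `BlockCache` and the `read_from_backup` callable are hypothetical; the sketch does not model which blocks count as "key" blocks or the pre-fetching for later backup copies.

```python
class BlockCache:
    """Hypothetical local cache of data blocks read from one backup copy."""

    def __init__(self, read_from_backup):
        self.read_from_backup = read_from_backup  # callable(block_id) -> bytes
        self.blocks = {}
        self.backup_reads = 0  # number of slow reads that went to backup media

    def get(self, block_id):
        # Serve locally when possible; otherwise fetch from the backup copy
        # and keep the block for later operations.
        if block_id not in self.blocks:
            self.blocks[block_id] = self.read_from_backup(block_id)
            self.backup_reads += 1
        return self.blocks[block_id]


def index_files(cache, block_ids):
    """Stand-in for a file indexing pass; reading blocks populates the cache."""
    return [cache.get(b) for b in block_ids]


def live_browse(cache, block_ids):
    """Stand-in for live browsing; reuses blocks cached during indexing."""
    return [cache.get(b) for b in block_ids]
```

If indexing and browsing touch the same blocks, the browse pass in this sketch causes no additional reads from the backup media.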

Hardware-Assisted Memory Disaggregation with Recovery from Network Failures Using Non-Volatile Memory
20230205649 · 2023-06-29

Techniques for implementing hardware-assisted memory disaggregation with recovery from network failures/problems are provided. In one set of embodiments, a hardware controller of a computer system can maintain a copy of a “remote memory” of the computer system (i.e., a section of the physical memory address space of the computer system that maps to a portion of the physical system memory of a remote computer system) in a local backup memory. The backup memory may be implemented using a non-volatile memory that is slower, but also less expensive, than conventional dynamic random-access memory (DRAM). Then, if the hardware controller is unable to retrieve data in the remote memory from the remote computer system within a specified time window due to, e.g., a network failure or other problem, the hardware controller can retrieve the data from the backup memory, thereby avoiding a hardware error condition (and potential application/system crash).
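The fallback path described here can be modeled in software as follows. This is only a behavioral sketch of what the abstract attributes to a hardware controller: the class `RemoteMemoryController`, the `fetch_remote` and `store_remote` callables, and the use of `TimeoutError` to signal a missed time window are assumptions for illustration.

```python
class RemoteMemoryController:
    """Hypothetical software model of the controller's read and write paths."""

    def __init__(self, fetch_remote, store_remote, timeout_s=0.001):
        self.fetch_remote = fetch_remote  # callable(addr, timeout_s) -> bytes, raises TimeoutError
        self.store_remote = store_remote  # callable(addr, data)
        self.timeout_s = timeout_s
        self.backup = {}     # stands in for the local non-volatile backup memory
        self.fallbacks = 0   # reads served from the backup copy

    def write(self, addr, data):
        # Keep the local backup copy in step with remote memory.
        self.backup[addr] = data
        self.store_remote(addr, data)

    def read(self, addr):
        try:
            return self.fetch_remote(addr, self.timeout_s)
        except TimeoutError:
            # Remote system did not answer within the window: serve the
            # backup copy instead of surfacing a hardware error.
            self.fallbacks += 1
            return self.backup[addr]
```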

METERING FRAMEWORK FOR IMPROVING RESOURCE UTILIZATION FOR A DISASTER RECOVERY ENVIRONMENT
20230205653 · 2023-06-29

A framework is described that improves resource utilization during operations executing within workflows of a distributed data processing system (e.g., having a plurality of interconnected nodes) in a disaster recovery (DR) environment. The environment is configured to support synchronous and asynchronous (i.e., heterogeneous) DR workflows (e.g., generating snapshots and replicating data), including synchronous replication, asynchronous replication, nearsync (i.e., short duration snapshots of metadata) replication, and migration of data objects associated with the workflows for failover (e.g., replication and/or migration) to a secondary site in the event of failure of the primary site. The framework meters (regulates) execution of the operations directed to the workloads so as to use the resources efficiently, allowing timely progress (completion) of certain (e.g., high-frequency) operations and reducing blocking (stalling) of other (e.g., low-frequency) operations by avoiding unnecessary resource hoarding/consumption and contention. Notably, the framework also provides metering and tuning of properties during execution of the workflows and maintains their state to provide for recovery.

Fault tolerant system, server, and operation method of fault tolerant system
11687425 · 2023-06-27

A first server and a second server use a virtual address to mount the storage synchronous area in a storage by NFS. The first server obtains a snapshot of memory content of a virtual system operated as an active system and transmits the snapshot to the second server. The first server replicates content of the storage synchronous area in the storage to a storage synchronous area in a storage. When a failure occurs in the first server, the second server sets a virtual address to the storage and uses the virtual address to mount the storage synchronous area in the storage by NFS. The second server uses the snapshot received from the first server to execute the application on the virtual system.

MANAGING HEALTH CONDITIONS TO DETERMINE WHEN TO RESTART REPLICATION AFTER A SWAP TRIGGERED BY A STORAGE HEALTH EVENT

Provided are a computer program product, system, and method for managing health conditions to determine when to restart replication after a swap triggered by a storage health event. A determination is made of a health condition with respect to access to a first storage that triggers a swap operation. The swap operation redirects host Input/Output (I/O) requests to data from a first server to a second server in response to determining the health condition. After the swap operation, the I/O requests are directed to the second server and a second storage. The second server is instructed to mirror data in the second storage to the first server, to store in the first storage, in response to determining that the health condition is resolved.
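The swap and the delayed restart of replication can be sketched as a small state machine. The class `SwapManager` and its method names are hypothetical; the sketch tracks only where host I/O is routed and whether reverse mirroring has been started.

```python
class SwapManager:
    """Hypothetical controller for the swap and the restart of replication."""

    def __init__(self):
        self.active = "first"            # server currently receiving host I/O
        self.mirroring_to_first = False  # second -> first mirroring in progress
        self.health_ok = True

    def on_health_condition(self):
        # A health condition on access to the first storage triggers the swap.
        self.health_ok = False
        self.active = "second"
        self.mirroring_to_first = False

    def on_health_resolved(self):
        # Only once the condition is resolved is the second server told to
        # mirror its data back to the first server and first storage.
        self.health_ok = True
        if self.active == "second":
            self.mirroring_to_first = True

    def route(self, request):
        """Return the server that should handle a host I/O request."""
        return (self.active, request)
```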

HARDWARE ASSIST MECHANISMS FOR ALIVE DETECTION OF REDUNDANT DEVICES

An apparatus includes a first hardware assist device having at least one transmitter, at least one receiver, and a timer. The at least one transmitter is configured to transmit at least one first signal to a second hardware assist device of a redundant second apparatus. The at least one first signal indicates that the apparatus is functional. The at least one receiver is configured to receive at least one second signal from the second hardware assist device. The at least one second signal indicates that the second apparatus is functional. The timer is configured to control a driver to block transmission of the at least one first signal in response to a fault associated with the apparatus. The apparatus also includes at least one processing device configured to perform one or more actions in response to a loss of the at least one second signal from the second apparatus.
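The alive-signal exchange can be modeled as follows. The abstract does not say how the timer detects a fault; this sketch assumes a watchdog-style timer that the local processing device must refresh periodically, and which blocks transmission once it expires. The class `HardwareAssist`, the function `exchange`, and both timeout parameters are hypothetical, and time is passed in explicitly so the model stays deterministic.

```python
class HardwareAssist:
    """Hypothetical model of one hardware assist device in a redundant pair."""

    def __init__(self, watchdog_s, peer_timeout_s):
        self.watchdog_s = watchdog_s          # max gap between refreshes (assumed)
        self.peer_timeout_s = peer_timeout_s  # max gap between peer signals (assumed)
        self.last_refresh = 0.0
        self.last_peer_signal = None

    def refresh(self, now):
        """Called by the local processing device while it is healthy."""
        self.last_refresh = now

    def transmit_enabled(self, now):
        # Once the watchdog expires, the driver is blocked and no further
        # "I am functional" signal leaves this apparatus.
        return now - self.last_refresh <= self.watchdog_s

    def receive_peer_signal(self, now):
        self.last_peer_signal = now

    def peer_lost(self, now):
        # Assumption: a peer that has never been heard is not reported as lost.
        if self.last_peer_signal is None:
            return False
        return now - self.last_peer_signal > self.peer_timeout_s


def exchange(sender, receiver, now):
    """Deliver the sender's alive signal to the receiver if transmission is allowed."""
    if sender.transmit_enabled(now):
        receiver.receive_peer_signal(now)
```

In this model, a device that stops refreshing its timer falls silent, and the redundant device observes the loss of signal after its own timeout and can then take over.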