G06F11/2046

File service auto-remediation in storage systems

A system and method for automatic remediation of a distributed file system use a file system (FS) remediation module running in a cluster management server and FS remediation agents running in a cluster of host computers. The FS remediation module monitors the cluster of host computers for related events. When a first file system service (FSS)-impacting event is detected, the FS remediation module executes a cluster-level remediation action at the cluster management server in response. When a second FSS-impacting event is detected, the FS remediation agents execute a host-level remediation action at one or more of the host computers in the cluster in response.
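
The split between cluster-level and host-level remediation can be sketched as a small dispatcher. This is a minimal illustration only: the event names, the `EVENT_LEVELS` table, and the `dispatch` function are hypothetical and do not come from the abstract, which does not say which events map to which level.

```python
CLUSTER_LEVEL = "cluster"
HOST_LEVEL = "host"

# Hypothetical mapping of FSS-impacting events to the level that handles them.
EVENT_LEVELS = {
    "host_added": CLUSTER_LEVEL,
    "fs_service_crashed": HOST_LEVEL,
}


def dispatch(event_type, affected_hosts, log):
    """Record where a remediation action would run for the given event."""
    level = EVENT_LEVELS.get(event_type)
    if level == CLUSTER_LEVEL:
        # Executed once, by the remediation module on the management server.
        log.append(("management_server", event_type))
    elif level == HOST_LEVEL:
        # Executed by the remediation agent on each affected host.
        for host in affected_hosts:
            log.append((host, event_type))
    return level


log = []
first_level = dispatch("host_added", ["host-1", "host-2"], log)
second_level = dispatch("fs_service_crashed", ["host-2"], log)
```

Unknown event types return `None` and trigger nothing, which keeps the monitor from acting on events it does not recognize.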

Automatic configuration of a recovery service

A secondary location is configured as a recovery service for a primary location of the service. The secondary location is maintained in a warm state that is configured to replace the primary location in case of a failover. During normal operation, the secondary location is automatically updated to reflect a current state of the primary location that is actively servicing user load. Content changes to the primary location are automatically reflected to the secondary location. System changes applied to the primary location are automatically applied to the secondary location. For example, changes such as removing/adding machines, updating machine/role assignments, and removing/adding databases are automatically applied to the secondary location such that the secondary location substantially mirrors the primary location. After a failover to the secondary location, the secondary location becomes the primary location and begins to actively service the user load.
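
As a rough sketch of how system changes might be mirrored, the example below applies each change to both locations. The topology dictionaries, the change tuple format, and `apply_change` are assumptions made for illustration; the abstract does not describe a data model.

```python
def apply_change(topology, change):
    """Apply one (operation, kind, name) system change to a location."""
    op, kind, name = change
    items = topology.setdefault(kind, set())
    if op == "add":
        items.add(name)
    elif op == "remove":
        items.discard(name)


# Both locations start from the same (hypothetical) topology.
primary = {"machines": {"web-1"}, "databases": {"db-1"}}
secondary = {"machines": {"web-1"}, "databases": {"db-1"}}

changes = [
    ("add", "machines", "web-2"),
    ("remove", "databases", "db-1"),
    ("add", "databases", "db-2"),
]

for change in changes:
    apply_change(primary, change)
    # The same change is replayed on the warm secondary.
    apply_change(secondary, change)
```

Replaying the identical change stream keeps the secondary substantially in step with the primary, so it can take over the user load after a failover.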

Remote direct memory access (RDMA)-based recovery of dirty data in remote memory

Techniques for implementing RDMA-based recovery of dirty data in remote memory are provided. In one set of embodiments, upon occurrence of a failure at a first (i.e., source) host system, a second (i.e., failover) host system can allocate a new memory region corresponding to a memory region of the source host system and retrieve a baseline copy of the memory region from a storage backend shared by the source and failover host systems. The failover host system can further populate the new memory region with the baseline copy and retrieve one or more dirty page lists for the memory region from the source host system via RDMA, where the one or more dirty page lists identify memory pages in the memory region that include data updates not present in the baseline copy. For each memory page identified in the one or more dirty page lists, the failover host system can then copy the content of that memory page from the memory region of the source host system to the new memory region via RDMA.
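
The recovery sequence described above can be illustrated with plain byte buffers. This is a simulation under stated assumptions: slicing a `bytes` object stands in for the RDMA reads, `PAGE_SIZE` is artificially small, and `recover_region` is a hypothetical name.

```python
PAGE_SIZE = 4  # unrealistically small page size, for illustration only


def recover_region(baseline, source_region, dirty_page_lists):
    """Rebuild a source memory region on the failover host."""
    # Allocate the new region and populate it with the baseline copy
    # fetched from the shared storage backend.
    new_region = bytearray(baseline)
    # Copy every page named in the dirty page lists from the source region.
    # In a real system each slice read would be an RDMA read.
    for page_list in dirty_page_lists:
        for page in page_list:
            start = page * PAGE_SIZE
            end = start + PAGE_SIZE
            new_region[start:end] = source_region[start:end]
    return bytes(new_region)


baseline = b"AAAABBBBCCCCDDDD"   # copy held in the storage backend
source = b"AAAAXXXXCCCCYYYY"     # source host memory; pages 1 and 3 are dirty
recovered = recover_region(baseline, source, [[1], [3]])
```

Only the pages listed as dirty cross the network; clean pages come from the baseline copy.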

Access Consistency in High-Availability Databases
20220121510 · 2022-04-21

Techniques are disclosed relating to maintaining a high availability (HA) database. In some embodiments, a computer system receives, from a plurality of host computers, a plurality of requests to access data stored in a database implemented using a plurality of clusters. In some embodiments, the computer system responds to the plurality of requests by accessing data stored in an active cluster. The computer system may then determine, based on the responding, health information for ones of the plurality of clusters, wherein the health information is generated based on real-time traffic for the database. In some embodiments, the computer system determines, based on the health information, whether to switch from accessing the active cluster to accessing a backup cluster. In some embodiments, the computer system stores, in respective clusters of the database, a changeover decision generated based on the determining.
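
One way to picture a traffic-driven changeover decision is a sliding window over recent request outcomes. The sketch below is an assumption-laden illustration: the window size, the failure-rate threshold, and the names `ClusterHealth` and `store_decision` are hypothetical, since the abstract does not specify how health information is computed.

```python
from collections import deque


class ClusterHealth:
    """Tracks recent request outcomes observed while serving real traffic."""

    def __init__(self, window=5, max_failure_rate=0.5):
        self.outcomes = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate

    def record(self, success):
        self.outcomes.append(success)

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_switch(self):
        # Decide only once the window is full, so one early error
        # does not trigger a changeover.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.failure_rate() > self.max_failure_rate


def store_decision(cluster_stores, decision):
    """Write the changeover decision into every cluster's store."""
    for store in cluster_stores.values():
        store["changeover"] = decision


health = ClusterHealth()
for ok in [True, False, False, False, False]:
    health.record(ok)

cluster_stores = {"active": {}, "backup": {}}
if health.should_switch():
    store_decision(cluster_stores, "switch_to_backup")
```

Writing the decision to each cluster lets every host that reads either cluster see the same changeover state.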

Managing access of multiple executing programs to nonlocal block data storage

Techniques are described for managing access of executing programs to non-local block data storage. In some situations, a block data storage service uses multiple server storage systems to reliably store network-accessible block data storage volumes that may be used by programs executing on other physical computing systems. A group of multiple server block data storage systems that store block data volumes may in some situations be co-located at a data center, and programs that use volumes stored there may execute on other physical computing systems at that data center. If a program using a volume becomes unavailable, another program (e.g., another copy of the same program) may in some situations obtain access to and continue to use the same volume, such as in an automatic manner in some such situations.

System and method for secure connections in a high availability industrial controller

Secure data transmission between an input device and both industrial controllers in a high-availability system utilizes a secure connection established between the primary industrial controller and the input device. Data required to establish the secure connection is stored on the primary controller as part of the connection data corresponding to the secure connection. The input device transmits data to the primary controller over the secure connection according to the desired level of security. The primary controller transmits the connection data defining the secure connection to the secondary controller. If a failure occurs in the primary controller, the secondary controller establishes a connection to the input device using the connection data for the secure connection, such that the secondary controller may assume responsibility for the controller end of the secure connection. The primary controller transmits the input signals to the secondary controller via the dedicated connection between controllers.

Virtualized file server data sharing

In one embodiment, a system for managing a virtualization environment includes a set of host machines, each of which includes a hypervisor, virtual machines, and a virtual machine controller, and a first virtualized file server configured to receive a request to access a storage item located at a second virtualized file server, determine that the storage item is designated as being accessible by other virtualized file servers, identify an FSVM of the second virtualized file server at which the storage item is located, and forward the request to the FSVM of the second virtualized file server. The storage item may be designated as being accessible by other virtualized file servers when the storage item is associated with a predetermined tag value indicating that the storage item is shared among virtualized file servers. The predetermined tag value may be stored in a sharding map in association with the storage item.
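
The tag check and forwarding step can be sketched as a lookup in a sharding map. The map layout, the paths, the FSVM names, and the `"shared"` tag value below are all hypothetical; the abstract only says that a predetermined tag value is stored in a sharding map in association with the storage item.

```python
SHARED_TAG = "shared"  # assumed predetermined tag value

# Hypothetical sharding map: storage item -> tag and owning FSVM.
sharding_map = {
    "/projects/report.doc": {"tag": "shared", "fsvm": "fsvm-2b"},
    "/private/notes.txt": {"tag": "local", "fsvm": "fsvm-2a"},
}


def route_request(path, sharding_map):
    """Return the FSVM to forward to, or None if the item is not shared."""
    entry = sharding_map.get(path)
    if entry is None or entry["tag"] != SHARED_TAG:
        return None
    return entry["fsvm"]


shared_target = route_request("/projects/report.doc", sharding_map)
private_target = route_request("/private/notes.txt", sharding_map)
```

Requests for items without the shared tag are not forwarded, so the tag acts as the access gate between file servers.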

High Availability For Persistent Memory

Techniques for implementing high availability for persistent memory are provided. In one embodiment, a first computer system can detect an alternating current (AC) power loss/cycle event and, in response to the event, can save data in a persistent memory of the first computer system to a memory or storage device that is remote from the first computer system and is accessible by a second computer system. The first computer system can then generate a signal for the second computer system subsequent to initiating or completing the save process, thereby allowing the second computer system to restore the saved data from the memory or storage device into its own persistent memory.

Distributed workload reassignment following communication failure

A generation identifier is employed with various systems and methods in order to identify situations where a workload has been reassigned to a new node and where a workload is still being processed by an old node during a failure between nodes. A master node may assign a workload to a worker node. The worker node sends a request to access target data. The request may be associated with a generation identifier and workload identifier that identifies the node and workload. At some point, a failure occurs between the master node and worker node. The master node reassigns the workload to another worker node. The new worker node accesses the target data with a different generation identifier, indicating to the storage system that the workload has been reassigned. The old worker node receives an indication from the storage system that the workload has been reassigned and stops processing the workload.
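
The fencing behavior can be sketched with a storage system that remembers the latest generation it has seen for each workload. This sketch assumes generation identifiers are increasing integers, which the abstract does not state (it says only that the new worker uses a different identifier); `StorageSystem` and `access` are hypothetical names.

```python
class StorageSystem:
    """Rejects requests that carry an outdated generation identifier."""

    def __init__(self):
        self._generations = {}  # workload id -> highest generation seen

    def access(self, workload_id, generation):
        current = self._generations.get(workload_id, -1)
        if generation < current:
            # The workload has been reassigned; the caller should stop.
            return False
        self._generations[workload_id] = generation
        return True


storage = StorageSystem()
old_ok = storage.access("wl-1", 1)    # old worker, before the failure
new_ok = storage.access("wl-1", 2)    # new worker, after reassignment
stale_ok = storage.access("wl-1", 1)  # old worker retries and is fenced off
```

The old worker learns of the reassignment from the rejected request itself, without needing a working connection to the master node.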
