G06F2201/85

CLUSTER SYSTEM AND RESTORATION METHOD
20230229572 · 2023-07-20

A cluster system includes a plurality of nodes, a plurality of clusters included in each node, and, in each of the clusters, a management module that manages the cluster system and an arithmetic module. Among all the management modules included in the cluster system, one is set as a representative management module; in the individual clusters, one is set as a master management module and another is set as a standby management module. Each of the management modules includes a failure monitoring unit and a failover control unit. When a failure in the representative management module is detected by any of the failure monitoring units, one of the non-representative management modules is set as a new representative management module. A recovery unit restores the failure monitoring unit and the failover control unit in the management module in which a failure is detected.
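
The promotion step in this abstract can be illustrated with a minimal Python sketch. The class name `ManagementModule`, the function `promote_new_representative`, and the selection order (healthy master modules first, then standby, then by name) are assumptions made for illustration; the abstract only says that one of the non-representative modules becomes the new representative.

```python
from dataclasses import dataclass


@dataclass
class ManagementModule:
    name: str
    role: str                     # "master" or "standby" within its cluster
    representative: bool = False
    healthy: bool = True          # stands in for the failure monitoring result


def promote_new_representative(modules):
    """Return the representative, promoting a new one if the current one failed."""
    current = next((m for m in modules if m.representative), None)
    if current is not None and current.healthy:
        return current            # nothing to do
    if current is not None:
        current.representative = False
    # Assumed policy (not from the abstract): masters before standbys, then by name.
    candidates = sorted(
        (m for m in modules if m.healthy and m is not current),
        key=lambda m: (m.role != "master", m.name),
    )
    if not candidates:
        raise RuntimeError("no healthy management module available")
    candidates[0].representative = True
    return candidates[0]
```

A real system would also need agreement among the surviving modules on which one was promoted; the sketch leaves that out.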

Management of microservices failover

Embodiments described herein are generally directed to intelligent management of microservices failover. In an example, responsive to an uncorrectable hardware error associated with a processing resource of a platform on which a task of a service is being performed by a primary microservice, a failover trigger is received by a failover service. A secondary microservice that is operating in lockstep mode with the primary microservice is identified by the failover service. The secondary microservice is caused by the failover service to take over performance of the task in non-lockstep mode based on failover metadata persisted by the primary microservice. The primary microservice is caused by the failover service to be taken offline.

Systems and methods for transitioning from legacy computer systems

A method may include receiving a communication from a user device, determining whether to forward the communication to a first computer system or a second computer system and forwarding the communication to the first computer system based on the determining. The method may also include generating, by the first computer system, a first response to the communication, determining whether an error occurred when processing the communication at the first computer system and forwarding the communication to the second computer system, in response to determining that an error occurred. The method may further include generating, by the second computer system, a second response to the communication and comparing the first response from the first computer system to the second response from the second computer system.
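
The flow in this abstract (route, respond, fall back on error, compare) might look roughly like the Python sketch below. The names `Outcome` and `handle`, the convention that the first system returns a `(response, had_error)` pair, and the choice to return the second system's response after a fallback are all assumptions; the abstract does not specify them.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Outcome(NamedTuple):
    response: str
    fallback_used: bool
    responses_match: Optional[bool]   # None when no comparison took place


def handle(
    communication: str,
    first_system: Callable[[str], Tuple[str, bool]],   # returns (response, had_error)
    second_system: Callable[[str], str],
    use_first: Callable[[str], bool] = lambda c: True,  # hypothetical routing rule
) -> Outcome:
    """Route a communication, fall back on error, and compare both responses."""
    if not use_first(communication):
        return Outcome(second_system(communication), False, None)
    first_response, had_error = first_system(communication)
    if not had_error:
        return Outcome(first_response, False, None)
    # An error occurred at the first system: forward to the second and compare.
    second_response = second_system(communication)
    return Outcome(second_response, True, first_response == second_response)
```

The comparison result could feed a log that tracks how often the two systems disagree during the transition.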

COMPUTER SYSTEM INSTALLED ON BOARD A CARRIER IMPLEMENTING AT LEAST ONE SERVICE CRITICAL FOR THE OPERATING SAFETY OF THE CARRIER
20230012925 · 2023-01-19

A computer system installed on board a carrier, communicating in a network with a data concentrator and with a monitor, and implementing at least one service that is critical for the operating safety of the carrier, the critical service being redundant in at least two instances (δ_1, ..., δ_m) on different respective computers (C_1, ..., C_m) connected to the network, each computer (C_k) implementing at least one software task implementing an instance (δ_k) of the critical service being configured to implement the critical service by way of time control.

Storage system and information processing method
11704208 · 2023-07-18

A storage system includes a plurality of controllers and a relay device and is configured to perform mirror transfer between the plurality of controllers. The relay device detects an abnormality in each device located between the plurality of controllers and aggregates error information in a register. After the mirror transfer is performed, a first controller module that is the source of the mirror transfer among the plurality of controllers reads the content of the register and determines whether the mirror transfer completed normally.
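
The register check described here can be pictured with a small Python sketch. Everything in it is hypothetical: the `RelayDevice` class, the one-bit-per-device register layout, and the `faulty_devices` parameter that simulates abnormalities along the path are illustrative assumptions, not details from the abstract.

```python
class RelayDevice:
    """Aggregates abnormalities of devices on the mirror path into one register."""

    def __init__(self):
        self.error_register = 0

    def record_abnormality(self, device_index):
        # Assumed layout: one bit per device located between the controllers.
        self.error_register |= 1 << device_index

    def read_register(self):
        return self.error_register


def mirror_transfer(source_data, destination, relay, faulty_devices=()):
    """Copy data to the peer, then let the source check the relay's register."""
    destination.extend(source_data)
    for index in faulty_devices:          # simulated abnormalities during transfer
        relay.record_abnormality(index)
    # The source controller reads the register after the transfer is performed.
    return relay.read_register() == 0
```

Reading a single aggregated register after the transfer means the source controller does not have to query each intermediate device separately.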

SITE LOCALITY SUPPORT FOR FILE SERVICES IN A STRETCHED CLUSTER ENVIRONMENT
20230021195 · 2023-01-19

Resources for file services are located within the same site, thereby eliminating or reducing performance issues caused by cross-site accesses in a stretched cluster environment. A file server placement algorithm initially places file servers at a site based at least in part on host workload and affinity settings, and can perform failover to move the file servers to a different location (e.g., to a different host on the same site or to another site) in the event of a failure of the host where the file servers were initially placed. File servers may be co-located with clients at a location based on client latencies and site workload. Failover support is also provided in the event that the sites in the stretched cluster have different subnet addresses.

Error-handling flows in memory devices based on bins

An example memory sub-system includes a memory device and a processing device, operatively coupled to the memory device. The processing device is configured to detect a power-up state of the memory device following a power loss event; detect a read error with respect to data residing in a block of the memory device, wherein the block is associated with a current voltage offset bin; and perform temporal voltage shift (TVS)-oriented calibration for associating the block with a new voltage offset bin.

TRANSFERRING TASK DATA BETWEEN EDGE DEVICES IN EDGE COMPUTING
20220413974 · 2022-12-29

Edge device task management by receiving an indicator corresponding to a first container running a task on a first edge device of a cluster of edge devices, wherein the indicator indicates an error status of the first container, and wherein task data of the task is stored in a first local storage of the first edge device, selecting a second edge device from the cluster of edge devices, wherein a second container on the second edge device is to run the task, instructing the first and second edge devices to transfer the task data from the first local storage of the first edge device to a second local storage of the second edge device, and in response to receiving a notification that indicates the task data has been transferred from the first local storage to the second local storage, sending the task to the second container.
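
The sequence in this abstract (error indicator, device selection, data transfer, then task dispatch) can be sketched in Python as follows. The `EdgeDevice` class, the `reassign_task` function, and the least-loaded selection rule are assumptions for illustration; the abstract does not say how the second edge device is selected.

```python
class EdgeDevice:
    def __init__(self, name):
        self.name = name
        self.local_storage = {}   # task id -> task data
        self.running = []         # task ids handled by this device's container


def reassign_task(task_id, failed_device, cluster):
    """Move a task and its data away from a device whose container reported an error."""
    candidates = [d for d in cluster if d is not failed_device]
    if not candidates:
        raise RuntimeError("no other edge device in the cluster")
    # Assumed policy (not from the abstract): pick the least-loaded device.
    target = min(candidates, key=lambda d: len(d.running))
    # Transfer the task data from the first local storage to the second.
    target.local_storage[task_id] = failed_device.local_storage.pop(task_id)
    if task_id in failed_device.running:
        failed_device.running.remove(task_id)
    # Stand-in for the "transfer complete" notification before sending the task.
    if task_id in target.local_storage:
        target.running.append(task_id)
    return target
```

Sending the task only after the transfer is confirmed keeps the second container from starting without its data.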

Memory apparatus capable of autonomously detecting and repairing fail word line and memory system including the same
11531606 · 2022-12-20

A memory apparatus comprising: a cell array comprising multiple first and second word lines, a fuse array configured to substitute a selection word line of the multiple first word lines with the multiple second word lines, a fail determination unit configured to determine, as a fail word line, a word line matched with a first condition during an access operation for the multiple first word lines and to determine a fail grade of the fail word line based on a second condition, an information storage unit configured to store a physical address, fail grade and access count of the fail word line as determination information for the fail word line, and a rupture operation unit configured to select the selection word line from the fail word lines based on a result of an analysis of the determination information, and to rupture the selection word line into the fuse array.

Distributed Application Orchestration Management in a Heterogeneous Distributed Computing Environment

Distributed application orchestration management is provided. A first passive member of a set of passive members sends a notification message to other members indicating that the first passive member is initiating start of a distributed application in response to the first passive member validating that a self-restart by a leader member failed. The first passive member compares timestamps associated with an attempt to start the distributed application by other passive members in the set of passive members. The first passive member stops a particular attempt to start the distributed application in response to the first passive member determining that a timestamp associated with the particular attempt to start the distributed application by the first passive member is newer than another timestamp of another passive member. The first passive member designates the other passive member having an older timestamp as a new leader member to continue starting the distributed application.
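
The timestamp comparison in this abstract amounts to letting the oldest start attempt win. A minimal Python sketch, assuming each passive member's attempt is represented as an entry in a dictionary and that ties are broken by member name (a tie-break the abstract does not mention):

```python
def resolve_start_attempts(attempts):
    """Pick the member with the oldest start attempt as the new leader.

    `attempts` maps a passive member's name to the timestamp of its attempt to
    start the distributed application. Returns (new_leader, members_that_stop).
    """
    if not attempts:
        raise ValueError("no start attempts to compare")
    # Oldest timestamp wins; the name tie-break is an assumption for determinism.
    leader = min(attempts, key=lambda member: (attempts[member], member))
    stopped = sorted(member for member in attempts if member != leader)
    return leader, stopped
```

For example, with attempts at 105.0, 101.5, and 103.0 from members `p1`, `p2`, and `p3`, the sketch designates `p2` as leader and has `p1` and `p3` stop their attempts.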