G06F11/0757

Proactive cluster compute node migration at next checkpoint of cluster upon predicted node failure

While scheduled checkpoints are being taken of a cluster of active compute nodes distributively executing an application in parallel, a likelihood of failure of the active compute nodes is periodically and independently predicted. Responsive to the likelihood of failure of a given active compute node exceeding a threshold, the given active compute node is proactively migrated to a spare compute node of the cluster at a next scheduled checkpoint. Another spare compute node of the cluster can perform prediction and migration. Prediction can be based on both hardware events and software events regarding the active compute nodes.
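
A minimal sketch of the checkpoint-time migration step, assuming a hypothetical prediction score in [0, 1] and an illustrative threshold; the abstract specifies neither the predictor nor the threshold value:

```python
FAILURE_THRESHOLD = 0.8  # illustrative; the abstract does not specify a value

def predict_failure_likelihood(node):
    """Placeholder predictor. A real one would combine hardware events
    (e.g., correctable-memory-error counts) and software events (e.g., logs)."""
    return node.get("risk_score", 0.0)

def on_scheduled_checkpoint(active_nodes, spare_nodes):
    """At each scheduled checkpoint, proactively migrate any active node
    whose predicted failure likelihood exceeds the threshold to a spare."""
    for node in list(active_nodes):
        if predict_failure_likelihood(node) > FAILURE_THRESHOLD and spare_nodes:
            spare = spare_nodes.pop(0)
            active_nodes.remove(node)
            active_nodes.append(spare)
            print(f"migrated {node['name']} -> {spare['name']}")

active = [{"name": "n1", "risk_score": 0.9}, {"name": "n2", "risk_score": 0.1}]
spares = [{"name": "s1"}]
on_scheduled_checkpoint(active, spares)  # migrates n1 -> s1
```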

LEADER ELECTION IN A DISTRIBUTED SYSTEM BASED ON NODE WEIGHT AND LEADERSHIP PRIORITY BASED ON NETWORK PERFORMANCE

Example implementations relate to consensus protocols in a stretched network. According to an example, network performance and/or network latency among a cluster of a plurality of nodes in a distributed computer system is continuously monitored. A leadership priority for each node is set based at least in part on the monitored network performance or network latency, and each node has a vote weight based at least in part on its leadership priority. Each node's vote is biased by its vote weight. The node whose biased vote total exceeds the maximum possible biased vote total received by any other node in the cluster is selected as the leader node.
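
A minimal sketch of the biased-vote tally, assuming a hypothetical mapping from monitored latency to leadership priority (the abstract does not give the mapping); a candidate wins once its biased total exceeds the best total any other node could receive:

```python
def leadership_priority(latency_ms):
    # Hypothetical mapping: lower monitored latency -> higher priority.
    return 1.0 / (1.0 + latency_ms)

def elect_leader(votes, latencies):
    """votes: {voter: candidate}; latencies: {voter: measured latency in ms}.
    Each vote is biased by the voter's weight (here, its leadership priority)."""
    weights = {n: leadership_priority(ms) for n, ms in latencies.items()}
    totals = {}
    for voter, candidate in votes.items():
        totals[candidate] = totals.get(candidate, 0.0) + weights[voter]
    total_weight = sum(weights.values())
    for candidate, tally in totals.items():
        if tally > total_weight - tally:  # exceeds any other node's possible total
            return candidate
    return None

latencies = {"a": 1.0, "b": 5.0, "c": 20.0}
print(elect_leader({"a": "a", "b": "a", "c": "c"}, latencies))  # -> "a"
```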

Systems and methods for cross-referencing forensic snapshots over time for root-cause analysis

Aspects of the disclosure describe methods and systems for cross-referencing forensic snapshots over time. In one exemplary aspect, a method may comprise receiving a first snapshot of a computing device at a first time and a second snapshot of the computing device at a second time, and applying a pre-defined filter to the first snapshot and the second snapshot, wherein the pre-defined filter includes a list of files that are to be extracted from each snapshot. The method may comprise, subsequent to applying the pre-defined filter, identifying differences in the list of files extracted from the first snapshot and the second snapshot. The method may comprise creating a change map for the computing device that comprises the differences in the list of files over a period of time, wherein the period of time comprises the first time and the second time, and outputting the change map in a user interface.
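
A minimal sketch of the filter-then-diff step, assuming snapshots are represented as path-to-hash mappings (the abstract does not prescribe a representation):

```python
def apply_filter(snapshot, file_filter):
    """snapshot: {path: content hash}; keep only the files named in the
    pre-defined filter (the list of files to extract from each snapshot)."""
    return {p: h for p, h in snapshot.items() if p in file_filter}

def diff_snapshots(first, second, file_filter):
    a, b = apply_filter(first, file_filter), apply_filter(second, file_filter)
    return {
        "added":   sorted(set(b) - set(a)),
        "removed": sorted(set(a) - set(b)),
        "changed": sorted(p for p in set(a) & set(b) if a[p] != b[p]),
    }

def build_change_map(timed_snapshots, file_filter):
    """timed_snapshots: list of (timestamp, snapshot) covering the period;
    the change map records per-interval differences over that period."""
    return [
        {"from": t1, "to": t2, **diff_snapshots(s1, s2, file_filter)}
        for (t1, s1), (t2, s2) in zip(timed_snapshots, timed_snapshots[1:])
    ]
```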

ERROR DETECTION AND CORRECTION METHOD AND CIRCUIT

An error detection and correction method is provided. The method includes: when a pipeline stage error is detected, correcting the pipeline stage error; when it is determined that a plurality of cascaded pipeline stage circuits have continuous pipeline stage errors, stopping all operations of all pipeline stage circuits; flushing the data of the pipeline stage circuits; and re-processing the data of the pipeline stage circuits at a downclocked frequency.
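
A minimal sketch of the two-level recovery policy, assuming a halve-the-clock downclock and simple stage objects (both illustrative; the abstract does not fix a ratio):

```python
class Stage:
    def __init__(self, name):
        self.name = name
    def correct(self):         print(f"{self.name}: error corrected in place")
    def stop(self):            print(f"{self.name}: stopped")
    def flush(self):           print(f"{self.name}: data flushed")
    def reprocess(self, freq): print(f"{self.name}: replay at {freq} MHz")

class PipelineController:
    def __init__(self, stages, base_freq_mhz):
        self.stages, self.freq_mhz = stages, base_freq_mhz

    def on_stage_error(self, stage):
        stage.correct()  # isolated error: correct without draining the pipeline

    def on_cascaded_errors(self):
        """Continuous errors across cascaded stages: stop all stages,
        flush their data, then re-process at a downclocked frequency."""
        for s in self.stages:
            s.stop()
        for s in self.stages:
            s.flush()
        self.freq_mhz //= 2  # illustrative downclock ratio
        for s in self.stages:
            s.reprocess(self.freq_mhz)

ctrl = PipelineController([Stage("fetch"), Stage("decode")], base_freq_mhz=1000)
ctrl.on_cascaded_errors()  # stops, flushes, and replays both stages at 500 MHz
```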

Information processing apparatus, method of controlling information processing apparatus, and storage medium
11550594 · 2023-01-10

An information processing apparatus includes a storage unit configured to store at least a first boot program and a second boot program corresponding to the first boot program, and a controller configured to read and execute a program. When a read error occurs while reading the first boot program, the controller detects the address of the storage area storing the portion of the first boot program in which the read error occurred and, from the address of the storage area storing the second boot program, specifies an address corresponding to the detected address. The controller then reads and executes the second boot program stored at the specified address.
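
A minimal sketch of the address correspondence, assuming the two boot programs are identical copies at hypothetical fixed base addresses:

```python
FIRST_BOOT_BASE  = 0x0000  # hypothetical layout of the two boot copies
SECOND_BOOT_BASE = 0x8000

def corresponding_address(error_addr):
    """Given the address at which the read error occurred within the first
    boot program, specify the matching address in the second boot program."""
    return SECOND_BOOT_BASE + (error_addr - FIRST_BOOT_BASE)

# A read error at 0x01F4 in the first copy resumes from 0x81F4 in the second.
assert corresponding_address(0x01F4) == 0x81F4
```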

Query watchdog
11693723 · 2023-07-04

A system for monitoring job execution includes an interface and a processor. The interface is configured to receive an indication to start a cluster processing job. The processor is configured to determine whether processing a data instance associated with the cluster processing job satisfies a watchdog criterion; and in the event that processing the data instance satisfies the watchdog criterion, cause the processing of the data instance to be killed.
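
A minimal sketch of the watchdog loop, assuming a hypothetical per-instance time budget as the criterion (the abstract leaves the criterion open):

```python
import threading
import time

def watch(task, max_seconds, poll_interval=0.1):
    """Kill the processing of a data instance once it satisfies the
    watchdog criterion (here: running longer than max_seconds)."""
    started = time.monotonic()
    while not task["done"]:
        if time.monotonic() - started > max_seconds:
            task["killed"] = True  # stand-in for killing the real worker
            return False
        time.sleep(poll_interval)
    return True

task = {"done": False, "killed": False}
threading.Timer(0.05, lambda: task.__setitem__("done", True)).start()
print(watch(task, max_seconds=1.0))  # -> True: finished within budget
```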

Techniques for preventing concurrent execution of declarative infrastructure provisioners

Techniques for preventing concurrent execution of an infrastructure orchestration service are described. Worker nodes can receive instructions, or tasks, for deploying infrastructure resources and can provide heartbeat notifications to scheduler nodes; these heartbeats are also considered a lease. A signing proxy can track the heartbeat notifications sent from the worker nodes to the scheduler node. The signing proxy can receive requests corresponding to a performance of the tasks assigned to the worker nodes. The signing proxy can determine whether the lease between each worker node and the scheduler is valid. If the lease is valid, the signing proxy may make a call to services on behalf of the worker node; if the lease is not valid, the signing proxy may not make a call to services on behalf of the worker node. Instead, the signing proxy may cut off all outgoing network traffic, blocking the worker node's access to services.
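
A minimal sketch of the lease check in the signing proxy, assuming a hypothetical fixed lease lifetime derived from the heartbeat interval:

```python
import time

LEASE_TTL = 10.0  # hypothetical lease lifetime in seconds

class SigningProxy:
    def __init__(self):
        self.last_heartbeat = {}  # worker id -> time last heartbeat was seen

    def observe_heartbeat(self, worker_id):
        self.last_heartbeat[worker_id] = time.monotonic()

    def lease_valid(self, worker_id):
        seen = self.last_heartbeat.get(worker_id)
        return seen is not None and time.monotonic() - seen < LEASE_TTL

    def handle_request(self, worker_id, call_service):
        """Call services on the worker's behalf only while its lease is valid;
        otherwise cut off the worker's outgoing traffic."""
        if self.lease_valid(worker_id):
            return call_service()
        raise PermissionError(f"lease invalid: traffic blocked for {worker_id}")
```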

REDUCING FALSE POSITIVE FAILURE EVENTS IN A HEALTH MONITORING SYSTEM USING DIRECTIONAL GRAPHS

Embodiments for reducing panic shutdown of components in a pipelined data processing system. Components are monitored for health, processing progress, and dependencies during normal system operation. A directed graph is generated showing non-circular dependencies of components in the pipeline. Deadlock of a particular component may or may not signal a panic condition, depending on whether any of its presently downstream, depended-on components are operating properly. The continuously monitored knowledge of the proper operation of all downstream components is thus used to intelligently apply or defer panic alerts, keeping the system running uninterrupted through panic conditions that continued operation of the system pipeline may soon or eventually resolve.
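
A minimal sketch of the deferral decision over the directed dependency graph; treating a component with no downstream dependencies as an immediate panic is an assumption, since the abstract does not address that case:

```python
def should_panic(component, graph, healthy):
    """graph: {component: [downstream components it depends on]};
    healthy: {component: bool}, kept current by continuous monitoring.
    Defer the panic while every downstream, depended-on component is
    operating properly, since continued operation may clear the deadlock."""
    downstream = graph.get(component, [])
    if not downstream:
        return True  # assumption: nothing downstream could unblock it
    return not all(healthy[d] for d in downstream)

graph = {"ingest": ["parse"], "parse": ["store"], "store": []}
healthy = {"ingest": False, "parse": True, "store": True}
# "ingest" is deadlocked, but its downstream dependencies are healthy,
# so the panic alert is deferred rather than raised immediately.
assert should_panic("ingest", graph, healthy) is False
```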

LOW LATENCY PARITY FOR A MEMORY DEVICE

Apparatuses, systems, and methods for low latency parity for a memory device include a controller configured to accumulate, in a memory buffer, combined parity data for a plurality of regions of memory of a memory device in response to write operations for the plurality of regions of memory. The controller is configured to perform a recovery operation for a region of memory in response to determining that a latency setting for the region satisfies a latency threshold. The controller is configured to service a read request for data from the region based on a recovery operation to satisfy the latency setting.
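
A minimal sketch of combined-parity accumulation and recovery, using byte-wise XOR over equal-sized regions (an illustrative granularity; the abstract does not fix one):

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def accumulate_parity(parity_buffer, written_data):
    """XOR newly written region data into the combined parity buffer that
    is shared across a plurality of regions of memory."""
    return xor_bytes(parity_buffer, written_data)

def recover_region(parity, surviving_regions):
    """Rebuild one region from the combined parity and the other regions."""
    data = parity
    for region in surviving_regions:
        data = xor_bytes(data, region)
    return data

r1, r2, r3 = b"\x01\x02", b"\x04\x08", b"\x10\x20"
parity = b"\x00\x00"
for r in (r1, r2, r3):
    parity = accumulate_parity(parity, r)
assert recover_region(parity, [r2, r3]) == r1  # serve the read via recovery
```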

RE-INITIATION OF MICROSERVICES UTILIZING CONTEXT INFORMATION PROVIDED VIA SERVICE CALLS
20230004427 · 2023-01-05

An apparatus comprises a processing device configured to identify, at a first microservice, a service call that is to be transmitted to a second microservice, and to modify the service call to include context information, the context information characterizing a current state of execution of one or more tasks by one of the first microservice and the second microservice. The processing device is further configured to provide, from the first microservice to the second microservice, the modified service call including the context information. The context information enables re-initiation of said one of the first microservice and the second microservice to continue execution of the one or more tasks from the current state.
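
A minimal sketch of context propagation in a service call, assuming a dictionary-style call payload with a hypothetical "context" field:

```python
def add_context(service_call, task_state):
    """Modify the outgoing call from the first microservice to carry the
    current state of execution of its tasks."""
    call = dict(service_call)
    call["context"] = {"tasks": task_state}
    return call

def resume_from(service_call):
    """On re-initiation, read the propagated context and continue the
    tasks from the recorded state instead of starting over."""
    return service_call.get("context", {}).get("tasks", {})

call = add_context({"target": "service-b", "op": "process"},
                   {"order-123": {"step": 3, "status": "in-progress"}})
assert resume_from(call)["order-123"]["step"] == 3
```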