G06F11/1616

Node Anomaly Event Processing Method, Network Interface Card, and Storage Cluster
20250284602 · 2025-09-11

A node anomaly event processing method is applied to a network interface card in a storage device. The storage device further includes a plurality of nodes configured to manage storage. The network interface card is communicatively connected to a first node in the plurality of nodes. When it detects an anomaly event related to the first node, the network interface card actively sends a notification message to a host to notify the host that an anomaly has occurred on the path on which the first node is located, so that the host can perform path switching.
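
The notification flow in this abstract can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class names `Host` and `NetworkInterfaceCard`, the method names, and the "first healthy path" selection policy are all assumptions introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host that reaches the storage device over one path per node."""
    paths: list[str]
    active: str = ""
    failed: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Start on the first configured path.
        self.active = self.paths[0]

    def on_path_anomaly(self, node_id: str) -> None:
        """Handle a notification from the NIC: mark the path failed, then switch."""
        self.failed.add(node_id)
        if self.active == node_id:
            healthy = [p for p in self.paths if p not in self.failed]
            if not healthy:
                raise RuntimeError("no healthy path left")
            # Assumed policy: take the first remaining healthy path.
            self.active = healthy[0]


@dataclass
class NetworkInterfaceCard:
    """Hypothetical NIC communicatively connected to one node of the storage device."""
    node_id: str
    host: Host
    events: list[str] = field(default_factory=list)

    def report_anomaly(self, event: str) -> None:
        # Record the event locally, then actively notify the host.
        self.events.append(event)
        self.host.on_path_anomaly(self.node_id)


host = Host(paths=["node-a", "node-b"])
nic = NetworkInterfaceCard(node_id="node-a", host=host)
nic.report_anomaly("heartbeat lost")
print(host.active)
```

In this sketch the host changes its active path only when the failed node is the one it is currently using; a notification about an inactive path just records the failure.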

Input/output system interconnect redundancy and failover

A system and method for achieving peripheral component interconnect express (PCIe) redundancy and recovery are disclosed. In some embodiments, the system comprises an accelerated compute fabric (ACF) comprising a PCIe switch, an application host communicatively coupled to the ACF using one or more upstream PCIe links, and an endpoint device communicatively coupled to the ACF using one or more downstream PCIe links. The application host is configured to send PCIe transaction layer packets (TLPs) addressed to ghosted endpoint devices through the one or more upstream PCIe links, and the ACF is configured to redirect the PCIe TLPs to the endpoint device through the one or more downstream PCIe links.
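
The redirection described in this abstract can be pictured as a lookup from a ghosted device to a physical endpoint and one of its downstream links. The sketch below is only a software model under stated assumptions: the names `DownstreamLink`, `AcceleratedComputeFabric`, `map_ghost`, and `redirect` are hypothetical, and choosing the first link that is up is an assumed failover policy, not one taken from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DownstreamLink:
    """Hypothetical model of one downstream PCIe link to an endpoint device."""
    name: str
    up: bool = True


class AcceleratedComputeFabric:
    """Toy model of an ACF that redirects TLPs addressed to ghosted devices."""

    def __init__(self) -> None:
        # ghost id -> (physical endpoint id, downstream links to that endpoint)
        self._ghost_map: dict[str, tuple[str, list[DownstreamLink]]] = {}

    def map_ghost(self, ghost_id: str, endpoint_id: str,
                  links: list[DownstreamLink]) -> None:
        self._ghost_map[ghost_id] = (endpoint_id, links)

    def redirect(self, ghost_id: str, payload: bytes) -> tuple[str, str, bytes]:
        """Return (endpoint, link name, payload) for a TLP sent to a ghost device."""
        endpoint_id, links = self._ghost_map[ghost_id]
        for link in links:
            if link.up:
                # Assumed policy: use the first downstream link that is up.
                return endpoint_id, link.name, payload
        raise ConnectionError(f"no downstream link available for {endpoint_id}")


acf = AcceleratedComputeFabric()
links = [DownstreamLink("down0"), DownstreamLink("down1")]
acf.map_ghost("ghost-ssd0", "ssd0", links)
print(acf.redirect("ghost-ssd0", b"read"))
links[0].up = False  # simulate a failed downstream link
print(acf.redirect("ghost-ssd0", b"read"))
```

Because the host only ever addresses the ghosted device in this model, a change of downstream link is invisible to it.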

System And Method Of Drainless Link Repair For Increased Accelerator Utilization During Failure

A system for repair of an accelerator architecture includes a plurality of accelerator hosts, each having a plurality of accelerator elements, and high-speed interconnects between the accelerator elements. A link health monitor service in the host determines the status of each interconnect internal and external to the host and communicates that status to a pod manager responsible for managing a number of interconnected hosts. The pod manager communicates with a scheduler that schedules processing tasks to each accelerator element or group of elements. Upon detection of an interconnect problem, the pod manager identifies the least common ancestor, in a binary tree of elements, of the elements adjacent to the failed interconnect, and flags that ancestor and its parent nodes as unavailable. The pod manager further generates a corrective action for repairing the problem interconnect and communicates the action to a technician.
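
The flagging step in this abstract can be illustrated with a complete binary tree stored in heap order. This is a sketch under assumptions made here, not details from the abstract: the root is node 1, node `i` has parent `i // 2`, accelerator elements are the leaves, and the function names are hypothetical.

```python
from __future__ import annotations


def least_common_ancestor(a: int, b: int) -> int:
    """Lowest node that has both a and b in its subtree (heap-indexed tree)."""
    while a != b:
        # The larger index is at the same depth or deeper, so move it up.
        if a > b:
            a //= 2
        else:
            b //= 2
    return a


def flag_unavailable(element_a: int, element_b: int) -> set[int]:
    """Flag the least common ancestor of the two elements and all nodes above it."""
    node = least_common_ancestor(element_a, element_b)
    flagged: set[int] = set()
    while node >= 1:
        flagged.add(node)
        node //= 2
    return flagged


# Eight elements occupy leaves 8..15; the interconnect between 10 and 11 fails.
print(sorted(flag_unavailable(10, 11)))  # [1, 2, 5]
```

In this model, groups rooted at unflagged nodes, such as node 4 (leaves 8 and 9) or node 3 (leaves 12 to 15), are not marked unavailable, so a scheduler consulting the flags could still place work on them.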