G06F11/076

Method for a reliability, availability, and serviceability-conscious huge page support

A method includes, in response to a memory error indication indicating an uncorrectable error in a faulted segment, associating in a remapping table the faulted segment with a patch segment in a patch memory region, and in response to receiving from a processor a memory access request directed to the faulted segment, servicing the memory access request from the patch segment by performing the requested memory access at the patch segment based on a patch segment address identifying the location of the patch segment. The patch segment address is determined from the remapping table and corresponds to a requested memory address specified by the memory access request.
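The remapping flow in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `PatchedMemory`, its method names, the free-list allocation policy, and the segment size are all hypothetical, and the abstract does not say whether the table lives in hardware or software.

```python
class PatchedMemory:
    """Toy model: main memory, a patch region, and a remapping table."""

    def __init__(self, num_segments, num_patch_segments, segment_size=4096):
        self.segment_size = segment_size
        self.main = [bytearray(segment_size) for _ in range(num_segments)]
        self.patch = [bytearray(segment_size) for _ in range(num_patch_segments)]
        self.remap = {}  # faulted segment index -> patch segment index
        self.free_patch = list(range(num_patch_segments))  # assumed allocation policy

    def report_uncorrectable_error(self, address):
        """Associate the faulted segment with a patch segment in the table."""
        seg = address // self.segment_size
        if seg not in self.remap:
            if not self.free_patch:
                raise RuntimeError("patch region exhausted")
            self.remap[seg] = self.free_patch.pop(0)
        return self.remap[seg]

    def _locate(self, address):
        """Return the backing buffer and offset for a requested address."""
        seg, offset = divmod(address, self.segment_size)
        if seg in self.remap:
            # Service the access from the patch segment instead.
            return self.patch[self.remap[seg]], offset
        return self.main[seg], offset

    def write(self, address, value):
        region, offset = self._locate(address)
        region[offset] = value

    def read(self, address):
        region, offset = self._locate(address)
        return region[offset]
```

After `report_uncorrectable_error(addr)` is called, reads and writes to any address in the same segment are served from the patch region, while accesses to other segments are unaffected.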

INFORMATION PROCESSING DEVICE AND METHOD OF TESTING
20180011756 · 2018-01-11

An information processing device includes a first port and a processor coupled to the first port and configured to transmit, via the first port, a first signal to a first device coupled to the first port, cause a second device coupled to the first port to determine whether a failure is present in the first port when the information processing device does not receive a first response signal in response to the first signal, and determine that the failure is present in the first device when the second device does not determine that the failure is present in the first port.
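The decision logic described here reduces to a small function. The sketch below is an illustration only: the function name and return strings are hypothetical, and the "first port" branch is inferred from the abstract rather than stated in it.

```python
def locate_failure(first_response_received, second_device_reports_port_failure):
    """Decide where a failure lies after probing the first device.

    first_response_received: True if the first device answered the first signal.
    second_device_reports_port_failure: result of the check performed by the
        second device coupled to the same port.
    """
    if first_response_received:
        return "none"
    if second_device_reports_port_failure:
        # Inferred branch: the port itself is at fault.
        return "first port"
    # The port was not found faulty, so the first device is blamed.
    return "first device"
```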

Adaptive memory performance control by thread group
11709748 · 2023-07-25

A device implementing adaptive memory performance control by thread group may include a memory and at least one processor. The at least one processor may be configured to execute a group of threads on one or more cores. The at least one processor may be configured to monitor a plurality of metrics corresponding to the group of threads executing on one or more cores. The metrics may include, for example, a core stall ratio and/or a power metric. The at least one processor may be configured to determine, based at least in part on the plurality of metrics, a memory bandwidth constraint with respect to the group of threads executing on the one or more cores. The at least one processor may be configured to, in response to determining the memory bandwidth constraint, increase a memory performance corresponding to the group of threads executing on the one or more cores.

Mitigating read disturb effects in memory devices

A die read counter and a block read counter are maintained for a specified block of a memory device. An estimated number of read events associated with the specified block is determined based on a value of the block read counter and a value of the die read counter. Responsive to determining that the estimated number of read events satisfies a criterion, a media management operation of one or more pages associated with the specified block is performed.

ADAPTIVE READ THRESHOLD VOLTAGE TRACKING WITH BIT ERROR RATE ESTIMATION BASED ON NON-LINEAR SYNDROME WEIGHT MAPPING

Adaptive read threshold voltage tracking techniques are provided that employ bit error rate estimation based on a non-linear syndrome weight mapping. An exemplary device comprises a controller configured to determine a bit error rate for at least one of a plurality of read threshold voltages in a memory using a non-linear mapping of a syndrome weight to the bit error rate for the at least one of the plurality of read threshold voltages.
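One way to picture a non-linear mapping from syndrome weight to bit error rate is a calibration table with interpolation between points. The sketch below assumes this form; the abstract does not specify how the mapping is represented, and the calibration values, `estimate_ber`, and `best_threshold` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical calibration points: (syndrome weight, bit error rate).
# The slope grows with the weight, so the overall mapping is non-linear.
CALIBRATION = [(0, 0.0), (100, 0.001), (200, 0.003), (300, 0.008), (400, 0.02)]


def estimate_ber(syndrome_weight, table=CALIBRATION):
    """Map a syndrome weight to an estimated bit error rate."""
    weights = [w for w, _ in table]
    if syndrome_weight <= weights[0]:
        return table[0][1]
    if syndrome_weight >= weights[-1]:
        return table[-1][1]
    i = bisect_right(weights, syndrome_weight)
    (w0, b0), (w1, b1) = table[i - 1], table[i]
    # Linear interpolation inside one segment of the table.
    return b0 + (b1 - b0) * (syndrome_weight - w0) / (w1 - w0)


def best_threshold(syndrome_weight_by_threshold):
    """Pick the read threshold voltage with the lowest estimated BER."""
    return min(
        syndrome_weight_by_threshold,
        key=lambda v: estimate_ber(syndrome_weight_by_threshold[v]),
    )
```

A tracking loop could call `best_threshold` on the syndrome weights measured at several candidate voltages and move the read threshold toward the winner.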

Mitigating a voltage condition of a memory cell in a memory sub-system

A determination that a first programming operation has been performed on a particular memory cell can be made. A determination can be made, based on one or more threshold criteria, whether the particular memory cell has transitioned from a state associated with a decreased error rate to another state associated with an increased error rate. In response to determining that the particular memory cell has transitioned from the state associated with the decreased error rate to the other state associated with the increased error rate, an operation can be performed on the particular memory cell to transition the particular memory cell from the other state associated with the increased error rate back to the state associated with the decreased error rate.

Method of detecting faults in a fault tolerant distributed computing network system

The present disclosure provides methods for detecting faults in a distributed computing network system. The method includes receiving, from a management service, authority information identifying peer computing devices of a distributed computing network system. A first message comprising a first instance of a dataset is received from a first peer computing device, and a second message comprising a second instance of the dataset is received from a second peer computing device. Where the first peer computing device and the second peer computing device have authority over the dataset, it is determined whether the first instance of the dataset matches the second instance of the dataset. Where the first instance of the dataset does not match the second instance of the dataset, a fault message is sent to the management service indicating that a fault has been detected at the first peer computing device.
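The comparison step can be sketched as follows. This is a minimal illustration, not the claimed method: the function `check_peers`, the tuple message format, and the shape of the fault message are assumptions.

```python
def check_peers(authority, first_message, second_message):
    """Compare two dataset instances reported by peers with authority.

    authority: set of peer ids that have authority over the dataset
        (as supplied by the management service).
    first_message, second_message: (peer_id, dataset_instance) tuples.
    Returns a fault message dict, or None when no fault is detected.
    """
    first_peer, first_instance = first_message
    second_peer, second_instance = second_message
    # Only peers with authority over the dataset are compared.
    if first_peer not in authority or second_peer not in authority:
        return None
    if first_instance != second_instance:
        # Per the abstract, the fault is attributed to the first peer.
        return {"type": "fault", "peer": first_peer}
    return None
```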

TIME SERIES CLUSTERING TO TROUBLESHOOT DEVICE PROBLEMS BASED ON MISSED AND DELAYED DATA
20230236915 · 2023-07-27

The described technology is generally directed towards processing time series (e.g., device telemetry) data, including identifying missing data (gaps in the time series data), and delayed data. The time series data are converted to ternary data, e.g., zero if timely, one if delayed or two if missing, and counts are obtained for each. If the missing data and/or delayed counts are significant, e.g., exceed a threshold percentage of the total data, the time series data indicates a problem that can be narrowed down to a more specific cause. For example, the time series data can be filtered by customer products/offers and customer locations, and if a filtered dataset's ternary data are similar to the problematic data, as determined via unsupervised clustering as similarity data (occurring at a similar time), the potential problem or problems can be narrowed to a potential cause based on that filtered dataset's similarity.
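The ternary conversion and threshold check can be sketched in a few lines. The encoding (0 timely, 1 delayed, 2 missing) comes from the abstract; the function names, the `max_delay` parameter, and the default threshold are hypothetical, and the clustering step is not shown.

```python
from collections import Counter


def to_ternary(expected_times, arrivals, max_delay):
    """Encode each expected sample as 0 (timely), 1 (delayed), or 2 (missing).

    arrivals maps an expected timestamp to the time the sample actually
    arrived; a timestamp with no entry is treated as missing.
    """
    encoded = []
    for t in expected_times:
        if t not in arrivals:
            encoded.append(2)
        elif arrivals[t] - t > max_delay:
            encoded.append(1)
        else:
            encoded.append(0)
    return encoded


def indicates_problem(ternary, threshold=0.1):
    """True if delayed or missing samples exceed the threshold fraction."""
    total = len(ternary)
    if total == 0:
        return False
    counts = Counter(ternary)
    return counts[1] / total > threshold or counts[2] / total > threshold
```

For example, with samples expected every 10 time units and a `max_delay` of 5, a sample that arrives 8 units late is encoded as 1 and an absent sample as 2.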

AREA-OPTIMIZED ROW HAMMER MITIGATION

Systems and methods for area-efficient mitigation of errors that are caused by row hammer attacks and the like in a memory media device are described. The counters for counting row accesses are maintained in a content addressable memory (CAM) that provides fast access times. The detection of errors is deterministically performed while maintaining a number of row access counters that is smaller than the total number of rows protected in the memory media device. The circuitry for the detection and mitigation may be in the memory media device or in a memory controller to which the memory media device attaches. The memory media device may be dynamic random access memory (DRAM).
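The abstract does not name a counting algorithm. One known way to track heavily accessed rows with fewer counters than rows is a Misra-Gries-style summary with a spillover counter, sketched below as an assumption. The class name, the dict standing in for the CAM, and the trigger rule are all hypothetical.

```python
class RowActivationTracker:
    """Track row activations with a bounded number of counters."""

    def __init__(self, num_counters, threshold):
        self.num_counters = num_counters
        self.threshold = threshold
        self.table = {}  # row -> estimated count; stands in for the CAM
        self.spill = 0   # upper bound on the count of any untracked row

    def activate(self, row):
        """Record one activation; return True if mitigation should run."""
        if row in self.table:
            self.table[row] += 1
        elif len(self.table) < self.num_counters:
            self.table[row] = self.spill + 1
        else:
            # Replace an entry whose count equals the spillover count.
            victim = next(
                (r for r, c in self.table.items() if c == self.spill), None
            )
            if victim is None:
                self.spill += 1
                return False
            del self.table[victim]
            self.table[row] = self.spill + 1
        # Estimates never fall below the true count, so triggering on the
        # estimate cannot miss a row that truly reached the threshold.
        return self.table[row] % self.threshold == 0
```

In this sketch a `True` return would prompt a refresh of the rows adjacent to `row`. Because estimates err on the high side, occasional extra mitigations are the cost of using fewer counters.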

Firmware-based SSD block failure prediction and avoidance scheme

A Solid State Drive (SSD) is disclosed. The SSD may comprise flash storage for data, the flash storage organized into a plurality of blocks. A controller may manage reading data from and writing data to the flash storage. Metadata storage may store device-based log data for errors in the SSD. Identification firmware may identify a suspect block responsive to the device-based log data. In some embodiments of the inventive concept, verification firmware may determine whether the suspect block is predicted to fail responsive to both precise block-based data and the device-based log data.