G06F11/0751

HANDLING MEMORY ERRORS IN MEMORY MODULES THAT INCLUDE VOLATILE AND NON-VOLATILE COMPONENTS
20180004591 · 2018-01-04

In one example in accordance with the present disclosure, a system for handling memory errors includes a memory module having volatile components and non-volatile components. The system includes a BIOS chip having BIOS code and a BIOS non-volatile (NV) memory. The BIOS NV memory stores error data associated with the memory module that was stored prior to a power-on or reset of the system. The system includes a processor to execute the BIOS code to, after the power-on or reset of the system and before an operating system is loaded: (1) read, from the BIOS NV memory, the error data; and (2) determine, based on the error data, whether to take a corrective action with respect to the memory module.
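
The read-then-decide step described above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the `ErrorRecord` fields, the `max_correctable` threshold, and the action names `"disable_module"` and `"log_warning"` are all hypothetical, since the abstract does not specify the error data layout or the corrective actions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorRecord:
    """Hypothetical shape of one error entry saved in BIOS NV memory."""
    module_id: int
    component: str       # "volatile" or "non-volatile" (assumed labels)
    uncorrectable: bool


def decide_action(records: List[ErrorRecord], module_id: int,
                  max_correctable: int = 3) -> str:
    """Decide, before the OS loads, whether a module needs corrective action.

    The policy here is an assumption made for illustration only.
    """
    relevant = [r for r in records if r.module_id == module_id]
    if any(r.uncorrectable for r in relevant):
        return "disable_module"      # hypothetical corrective action
    if len(relevant) > max_correctable:
        return "log_warning"         # hypothetical milder action
    return "none"
```

In this sketch the list of records stands in for whatever the BIOS code reads back from its NV memory after power-on or reset.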

BYTE LEVEL GRANULARITY BUFFER OVERFLOW DETECTION FOR MEMORY CORRUPTION DETECTION ARCHITECTURES
20180004588 · 2018-01-04

Memory corruption detection technologies are described. A processor can include a memory to store a memory corruption detection (MCD) table. A processor core of the processor can receive, from an application, an allocation request for an allocation of a memory object within a contiguous memory block in the memory. The processor core can allocate the contiguous memory block in view of a size of the memory object requested and write MCD meta-data into the MCD table, including an MCD identifier (ID) associated with the contiguous memory block and an MCD border value indicating a size of a memory region of the contiguous memory block.
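
A software model may help show how an MCD ID and a border value together give byte-level checks inside a coarser block. The sketch below is an assumption-laden simulation, not the hardware design: the 64-byte block size, the `McdTable` class, and its `allocate` and `check` methods are hypothetical names introduced for illustration.

```python
BLOCK_SIZE = 64  # assumed allocation granularity in bytes


class McdTable:
    """Toy model of an MCD table keyed by block base address."""

    def __init__(self):
        self._entries = {}   # base address -> (mcd_id, border)
        self._next_base = 0
        self._next_id = 1

    def allocate(self, size):
        """Reserve whole blocks for an object and record its metadata."""
        if size <= 0:
            raise ValueError("size must be positive")
        blocks = -(-size // BLOCK_SIZE)          # ceiling division
        base, mcd_id = self._next_base, self._next_id
        # The border value records the object's size in bytes, which can
        # be smaller than the block-aligned region actually reserved.
        self._entries[base] = (mcd_id, size)
        self._next_base += blocks * BLOCK_SIZE
        self._next_id += 1
        return base, mcd_id

    def check(self, base, mcd_id, offset, length=1):
        """Return True only if the ID matches and the access stays in bounds."""
        entry = self._entries.get(base)
        if entry is None:
            return False
        stored_id, border = entry
        return stored_id == mcd_id and offset >= 0 and offset + length <= border
```

With a 10-byte object, an access at offset 10 still falls inside the reserved 64-byte block, yet `check` rejects it because it crosses the border value.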

Systems and methods for developing digital experience applications

In one implementation, systems and methods are provided for developing a computer-implemented digital experience application having a first and a second micro-application. Each micro-application includes a front end interface configured to receive and display information. The first micro-application includes a first event manager configured to detect an application event belonging to a category, and a first state manager configured to detect an application state belonging to the category. The digital experience application further includes a driver application configured to host the first and second micro-applications, an event hub configured to receive the detected application event from the first micro-application, and a state store configured to store the detected application state received from the first micro-application. The second micro-application includes a second event manager configured to receive the detected application event from the event hub, and a second state manager configured to receive the detected application state from the state store.
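
The relationship between the micro-applications, the event hub, and the state store resembles a publish/subscribe arrangement with a shared key-value store. The following sketch shows one way that could look; the class and method names (`EventHub`, `StateStore`, `MicroApp`, `listen`, `emit`, and so on) are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict


class EventHub:
    """Routes events to subscribers by category."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, category, callback):
        self._subscribers[category].append(callback)

    def publish(self, category, event):
        for callback in self._subscribers[category]:
            callback(event)


class StateStore:
    """Holds the latest application state per category."""

    def __init__(self):
        self._state = {}

    def put(self, category, state):
        self._state[category] = state

    def get(self, category, default=None):
        return self._state.get(category, default)


class MicroApp:
    """Micro-application with a simple event manager and state manager."""

    def __init__(self, name, hub, store):
        self.name, self.hub, self.store = name, hub, store
        self.received = []   # events delivered to this micro-application

    def listen(self, category):
        self.hub.subscribe(category, self.received.append)

    def emit(self, category, event):
        self.hub.publish(category, event)

    def save_state(self, category, state):
        self.store.put(category, state)

    def load_state(self, category):
        return self.store.get(category)
```

Here the driver application would be whatever code constructs the hub, the store, and the two `MicroApp` instances, so neither micro-application needs a direct reference to the other.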

Managing error-handling flows in memory devices

Systems and methods are disclosed including a memory device and a processing device operatively coupled to the memory device. The processing device can perform operations including: detecting a read error with respect to data residing in a block of the memory device, wherein the block is associated with a voltage offset bin; determining an ordered set of error-handling operations to be performed on the data; determining a most recently performed error-handling operation associated with the voltage offset bin; adjusting an order of the set of error-handling operations by positioning the most recently performed error-handling operation within a predetermined position in the order of the set of error-handling operations; and performing one or more error-handling operations of the set of error-handling operations in the adjusted order until data associated with the read error is recovered.
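
The reordering step can be illustrated with a short sketch. It assumes the predetermined position is the front of the list and that each error-handling operation is a callable returning recovered data or `None`; the function names and the per-bin dictionary are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def reorder_operations(ops: List[str], most_recent: Optional[str],
                       position: int = 0) -> List[str]:
    """Move the most recently performed operation to a fixed position."""
    if most_recent is None or most_recent not in ops:
        return list(ops)
    reordered = [op for op in ops if op != most_recent]
    reordered.insert(position, most_recent)
    return reordered


def recover(ops: List[str],
            handlers: Dict[str, Callable[[], Optional[bytes]]],
            last_performed: Dict[int, str],
            bin_id: int) -> Optional[bytes]:
    """Run operations in the adjusted order until one recovers the data."""
    order = reorder_operations(ops, last_performed.get(bin_id))
    for name in order:
        last_performed[bin_id] = name   # remember it for this voltage offset bin
        data = handlers[name]()
        if data is not None:
            return data
    return None
```

The intuition is that blocks in the same voltage offset bin tend to fail in similar ways, so the operation that ran last for that bin is tried first the next time.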

SYSTEMS AND METHODS FOR POWER LOSS PROTECTION OF STORAGE RESOURCES

In accordance with embodiments of the present disclosure, a method for power loss protection of one or more storage resources may include receiving information from each of the one or more storage resources regarding power loss protection capabilities of such storage resource. The method may also include, based on the information, repurposing, for each power loss protection capable storage resource, a communications channel between a logic device and such power loss protection capable storage resource for transmission of a respective early power-off warning signal for such power loss protection capable storage resource. The method may further include, in response to a power event of a power supply unit for providing electrical energy to the one or more storage resources, asserting for each power loss protection capable storage resource its respective early power-off warning signal.
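
The three steps of the method (collect capabilities, repurpose a channel per capable resource, assert the warning on a power event) can be modeled roughly as below. The `StorageResource` and `LogicDevice` classes and their fields are hypothetical stand-ins; real hardware would use a physical sideband signal rather than a Python attribute.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StorageResource:
    name: str
    plp_capable: bool      # reported power loss protection capability
    warned: bool = False   # set when the early power-off warning is asserted


class LogicDevice:
    """Toy model of the logic device that owns the repurposed channels."""

    def __init__(self) -> None:
        self.warning_channels: Dict[str, StorageResource] = {}

    def configure(self, resources: List[StorageResource]) -> None:
        # Repurpose a channel only for resources that report the capability.
        for resource in resources:
            if resource.plp_capable:
                self.warning_channels[resource.name] = resource

    def on_power_event(self) -> List[str]:
        # Assert each capable resource's own early power-off warning.
        for resource in self.warning_channels.values():
            resource.warned = True
        return sorted(self.warning_channels)
```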

PROFILING AND DIAGNOSTICS FOR INTERNET OF THINGS

A computing device and method for profiling and diagnostics in an Internet of Things (IoT) system, including matching an observed solution characteristic of the IoT system to an anomaly in an anomaly database.

NODE HEALTH PREDICTION BASED ON FAILURE ISSUES EXPERIENCED PRIOR TO DEPLOYMENT IN A CLOUD COMPUTING SYSTEM

To improve the reliability of nodes that are utilized by a cloud computing provider, information about the entire lifecycle of nodes can be collected and used to predict when nodes are likely to experience failures based at least in part on early lifecycle errors. In one aspect, a plurality of failure issues experienced by a plurality of production nodes in a cloud computing system during a pre-production phase can be identified. A subset of the plurality of failure issues can be selected based at least in part on correlation with service outages for the plurality of production nodes during a production phase. A comparison can be performed between the subset of the plurality of failure issues and a set of failure issues experienced by a pre-production node during the pre-production phase. A risk score for the pre-production node can be calculated based at least in part on the comparison.
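
One simple way to picture the selection, comparison, and scoring steps is shown below. The abstract does not specify the correlation measure or the scoring formula, so this sketch assumes an outage-rate threshold for selection and an overlap fraction for the score; the function names and the issue labels in the usage example are hypothetical.

```python
from typing import Iterable, Set, Tuple


def select_issues(history: Iterable[Tuple[Set[str], bool]],
                  min_outage_rate: float = 0.5) -> Set[str]:
    """Keep pre-production issues that often preceded production outages.

    history holds one (issues seen pre-production, had outage in production)
    pair per production node.
    """
    counts = {}
    for issues, had_outage in history:
        for issue in issues:
            seen, outages = counts.get(issue, (0, 0))
            counts[issue] = (seen + 1, outages + int(had_outage))
    return {issue for issue, (seen, outages) in counts.items()
            if outages / seen >= min_outage_rate}


def risk_score(selected: Set[str], node_issues: Set[str]) -> float:
    """Fraction of outage-correlated issues a pre-production node has hit."""
    if not selected:
        return 0.0
    return len(selected & node_issues) / len(selected)
```

For example, if two issues are selected and a pre-production node has experienced one of them, its score under this assumed formula is 0.5.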

Addressing Storage Device Performance

Improving storage device performance including initiating, on a storage device, execution of a rehabilitative action from a set of rehabilitative actions that can be performed on the storage device; determining that the storage device is operating outside of a defined range of expected operating parameters after the rehabilitative action has been executed; and initiating execution of a higher level rehabilitative action responsive to determining that the higher level rehabilitative action exists.
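
The escalation logic can be sketched as a loop over actions ordered from lowest to highest level. This is an illustrative reading of the abstract, assuming a caller-supplied health check; the function name and the action names in the usage example are hypothetical.

```python
from typing import Callable, List, Tuple


def rehabilitate(actions: List[Tuple[str, Callable[[], None]]],
                 in_expected_range: Callable[[], bool]) -> Tuple[bool, List[str]]:
    """Run rehabilitative actions in ascending order until the device recovers.

    Returns whether the device ended up within its expected operating
    parameters, and the names of the actions that were performed.
    """
    performed = []
    for name, action in actions:
        action()
        performed.append(name)
        if in_expected_range():
            return True, performed
        # Still out of range: escalate if a higher level action exists.
    return False, performed
```

A usage example with a simulated device:

```python
device = {"latency_ms": 50}
actions = [
    ("trim", lambda: device.update(latency_ms=40)),
    ("power_cycle", lambda: device.update(latency_ms=8)),
    ("reformat", lambda: device.update(latency_ms=5)),
]
ok, performed = rehabilitate(actions, lambda: device["latency_ms"] <= 10)
# ok is True and performed is ["trim", "power_cycle"]
```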

SYSTEM AND METHOD FOR ANOMALY DETECTION AND ROOT CAUSE AUTOMATION USING SHRUNK DYNAMIC CALL GRAPHS
20230004487 · 2023-01-05

A system and method for real-time or near real-time anomaly detection and root cause automation in production environments or in other environments using shrunk dynamic call graphs are provided. The system includes an instrumentation agent that generates shrunk dynamic call graphs and exceptions/errors by injecting monitoring code, probes, or call-tags into a monitored application; a data agent that forwards collected data to an analysis engine over a network; and the analysis engine, which performs continuous clustering using machine learning as well as anomaly and root cause detection. The system also includes a reporting module to report the anomaly.

METHOD AND SYSTEM FOR IDENTIFYING ROOT CAUSE OF A HARDWARE COMPONENT FAILURE

In general, embodiments relate to a method for identifying hardware component failures, comprising: obtaining system logs that show a transition of device states for a device; using a normalization and filtering module to process the system logs and extract relevant data and important keywords for the device; creating a device state path for the device from a healthy device state to an unhealthy device state using the extracted relevant data; obtaining the device state path for the device from a storage and a current device state of the device; predicting a next device state of the device based on the current device state using an analysis module; generating a device state chain using the device state path, current device state, and next device state; and identifying a root cause of a hardware component failure using the device state chain.
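
The prediction and chain-building steps can be pictured with a frequency-based transition model. The abstract does not say how the analysis module predicts the next state, so the most-frequent-successor rule below is an assumption, and all function names and state labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def build_transitions(paths: List[List[str]]) -> Dict[str, Counter]:
    """Count observed state-to-state transitions across stored paths."""
    transitions: Dict[str, Counter] = defaultdict(Counter)
    for path in paths:
        for current, following in zip(path, path[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions: Dict[str, Counter],
                 current: str) -> Optional[str]:
    """Predict the most frequently observed successor of the current state."""
    if current not in transitions:
        return None
    return transitions[current].most_common(1)[0][0]


def state_chain(path: List[str], current: str,
                transitions: Dict[str, Counter]) -> List[str]:
    """Join the stored path, the current state, and the predicted next state."""
    chain = list(path)
    if not chain or chain[-1] != current:
        chain.append(current)
    predicted = predict_next(transitions, current)
    if predicted is not None:
        chain.append(predicted)
    return chain
```

Reading the resulting chain from its first unhealthy state backward is one plausible way to point at a root cause, although the disclosure may use a different analysis.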