Patent classifications
G06F11/008
Regression-based calibration and scanning of data units
A plurality of read operations can be performed to read data stored at a data block. Parameters reflective of a separation between a pair of programming distributions associated with the data block can be determined based on the plurality of read operations. A read request to read the data stored at the data block can be received. In response to receiving the read request, a read operation can be performed to read the data stored at the data block based on the parameters that are reflective of the separation between the pair of programming distributions associated with the data block.
Proactive cluster compute node migration at next checkpoint of cluster upon predicted node failure
While scheduled checkpoints are being taken of a cluster of active compute nodes distributively executing an application in parallel, a likelihood of failure of the active compute nodes is periodically and independently predicted. Responsive to the likelihood of failure of a given active compute node exceeding a threshold, the given active compute node is proactively migrated to a spare compute node of the cluster at a next scheduled checkpoint. Another spare compute node of the cluster can perform prediction and migration. Prediction can be based on both hardware events and software events regarding the active compute nodes.
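The flag-then-migrate flow described above can be sketched as follows. This is only an illustration under stated assumptions: the `Cluster` class, its method names, and the 0.8 threshold are hypothetical and do not come from the patent, and the prediction itself is taken as a given number rather than computed from hardware or software events.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster model: active nodes, spare nodes, and nodes flagged for migration."""
    active: list[str]
    spares: list[str]
    pending: set[str] = field(default_factory=set)

    def record_prediction(self, node: str, failure_likelihood: float,
                          threshold: float = 0.8) -> None:
        """Flag an active node when its predicted failure likelihood exceeds the threshold."""
        if node in self.active and failure_likelihood > threshold:
            self.pending.add(node)

    def on_checkpoint(self) -> dict[str, str]:
        """At a scheduled checkpoint, replace each flagged node with a spare while spares remain."""
        moves: dict[str, str] = {}
        for node in sorted(self.pending):
            if not self.spares:
                break  # no spare left; the node stays flagged for a later checkpoint
            spare = self.spares.pop(0)
            self.active[self.active.index(node)] = spare
            moves[node] = spare
        self.pending.difference_update(moves)
        return moves


cluster = Cluster(active=["n1", "n2", "n3"], spares=["s1"])
cluster.record_prediction("n2", 0.95)  # above threshold: flagged
cluster.record_prediction("n3", 0.10)  # below threshold: ignored
moves = cluster.on_checkpoint()
```

Migration is deferred to `on_checkpoint` rather than done as soon as a node is flagged, mirroring the abstract's point that the move happens at the next scheduled checkpoint.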
METHOD, ELECTRONIC DEVICE, AND COMPUTER PROGRAM PRODUCT FOR MEMORY FAULT PREDICTION
Embodiments of the present disclosure relate to a method, an electronic device, and a computer program product for memory fault prediction. In the method, an accuracy of fault prediction over a past period of time is obtained, where each fault prediction is made by comparing a prediction confidence with a confidence threshold, and the accuracy indicates an amount of work to reconstruct and diagnose predicted faulty memories after the fault prediction. The confidence threshold is adjusted in response to the accuracy being less than an accuracy threshold. A detection rate of the fault prediction over the past period of time is also obtained, and the confidence threshold is adjusted in the reverse direction in response to the detection rate being less than a detection rate threshold. In this way, the reliability of memories in nodes is guaranteed while unnecessary reconstructions and diagnoses are reduced.
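A minimal sketch of the two-way threshold adjustment might look like the following. The abstract does not say which direction each adjustment takes; this sketch assumes that low accuracy raises the confidence threshold (fewer, surer predictions) and low detection rate lowers it. The function name, the default limits, and the step size are all hypothetical.

```python
def adjust_confidence_threshold(threshold, accuracy, detection_rate,
                                min_accuracy=0.9, min_detection_rate=0.8,
                                step=0.05):
    """Return an updated confidence threshold, clamped to the range [0, 1].

    Assumed directions (not stated in the abstract):
    - accuracy below min_accuracy: raise the threshold
    - detection rate below min_detection_rate: lower the threshold
    """
    if accuracy < min_accuracy:
        threshold += step
    if detection_rate < min_detection_rate:
        threshold -= step
    return min(1.0, max(0.0, threshold))


# Too many false alarms over the past period: predict less eagerly.
raised = adjust_confidence_threshold(0.50, accuracy=0.70, detection_rate=0.95)

# Too many missed faults over the past period: predict more eagerly.
lowered = adjust_confidence_threshold(0.50, accuracy=0.95, detection_rate=0.60)
```

When both metrics fall below their limits in the same period, the two adjustments in this sketch cancel out. A real system would need an explicit policy for that case.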
RAS (RELIABILITY, AVAILABILITY, AND SERVICEABILITY)-BASED MEMORY DOMAINS
Reliability, availability, and serviceability (RAS)-based memory domains can enable applications to store data in memory domains having different degrees of reliability to reduce downtime and data corruption due to memory errors. In one example, memory resources are classified into different RAS-based memory domains based on their expected likelihood of encountering errors. The mapping of memory resources into RAS-based memory domains can be dynamically managed and updated when information indicative of reliability (such as the occurrence of errors or other information) suggests that a memory resource is becoming less reliable. The RAS-based memory domains can be exposed to applications to enable applications to allocate memory in high reliability memory for critical data.
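The dynamic mapping and allocation described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `RasDomains` class, the two domain names, and the rule that a region is demoted after a fixed number of reported errors are not taken from the patent.

```python
class RasDomains:
    """Hypothetical tracker that maps memory regions to reliability domains."""

    def __init__(self, regions, demote_after=3):
        self.errors = {region: 0 for region in regions}
        self.demote_after = demote_after  # assumed demotion rule: error-count limit

    def report_error(self, region):
        """Record one observed error; the region's domain is derived from the running count."""
        self.errors[region] += 1

    def domain_of(self, region):
        if self.errors[region] >= self.demote_after:
            return "standard"
        return "high_reliability"

    def allocate(self, critical):
        """Pick a region: critical data only from the high-reliability domain."""
        high = [r for r in self.errors if self.domain_of(r) == "high_reliability"]
        standard = [r for r in self.errors if self.domain_of(r) == "standard"]
        if critical:
            return high[0] if high else None
        # Non-critical data prefers standard memory, keeping reliable regions free.
        candidates = standard or high
        return candidates[0] if candidates else None


domains = RasDomains(["dimm0", "dimm1"])
for _ in range(3):
    domains.report_error("dimm0")  # dimm0 is demoted after its third error
region_for_critical = domains.allocate(critical=True)
```

Deriving the domain from the error count on each lookup, instead of storing it, means the mapping updates as soon as a new error is reported.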
EXTRAPOLATED USAGE DATA
In an example in accordance with the present disclosure, a system is described. The system includes a data collector to collect usage data for an electronic device over a first period of time. The system also includes a model generator. The model generator extrapolates usage data for the electronic device over a second period of time that is longer than the first period of time and predicts a state of the electronic device based on the extrapolated usage data for the electronic device over the second period of time.
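As a rough illustration of extrapolating from a short observation window to a longer horizon, the sketch below fits a straight line to the collected usage data and projects it forward. The abstract does not specify the model; ordinary least squares is a stand-in chosen for simplicity, and the function names, the wear limit, and the state labels are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_state(days, cumulative_usage, horizon_day, wear_limit):
    """Project usage to horizon_day and label the device against an assumed wear limit."""
    slope, intercept = fit_line(days, cumulative_usage)
    projected = slope * horizon_day + intercept
    state = "at_risk" if projected >= wear_limit else "healthy"
    return state, projected


# First period: four days of observed cumulative usage (hypothetical units).
days = [0, 1, 2, 3]
usage = [0, 10, 20, 30]

# Second, longer period: project out to day 100.
state, projected = predict_state(days, usage, horizon_day=100, wear_limit=500)
```

A linear trend is only reasonable when usage grows steadily. Seasonal or bursty usage would call for a different model.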
EDGE SYSTEM HEALTH MONITORING AND AUDITING
Various embodiments herein each include at least one of systems, methods, software, and devices for edge system health monitoring and auditing. One embodiment, in the form of a method, includes performing a system audit over a first network of devices deployed within a facility to determine a status of each respective device. This embodiment further includes determining an overall system status for the facility based on results of the system audit, including consideration of a status of each of the devices deployed within the facility, and storing data representative of the overall system status of the facility. This embodiment also transmits at least a portion of the data representative of the overall system status of the facility over a second network to a facility system status monitoring application, which may then present a single indicator of the overall system status or health.
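One simple way to reduce per-device audit results to a single facility indicator is a worst-status-wins rule, sketched below. The abstract does not state how the overall status is derived, so this rule, the `Status` levels, and the report layout are assumptions for illustration only. Transmission over the second network is represented by building a JSON payload, not by an actual network call.

```python
import json
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical device status levels, ordered from best to worst."""
    OK = 0
    DEGRADED = 1
    FAILED = 2


def overall_status(device_statuses):
    """Worst-status-wins aggregation; an empty audit is treated as OK."""
    return max(device_statuses.values(), default=Status.OK)


def build_report(facility_id, device_statuses):
    """Serialise the audit result for sending to a monitoring application."""
    return json.dumps({
        "facility": facility_id,
        "overall": overall_status(device_statuses).name,
        "devices": {name: status.name for name, status in device_statuses.items()},
    }, sort_keys=True)


audit = {"pos-1": Status.OK, "scanner-2": Status.DEGRADED, "printer-3": Status.OK}
report = build_report("store-42", audit)
```

Worst-status-wins is conservative: one failed device marks the whole facility as failed. A weighted rule could be substituted where some devices matter more than others.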
Adaptive fault prediction analysis of computing components
Systems and methods for adaptive fault prediction analysis are described. In one embodiment, the system includes one or more computing components, and one or more hardware controllers. In some embodiments, the storage system includes a storage drive. At least one of the one or more hardware controllers is configured to analyze one or more tolerance limits of a first computing component among the one or more computing components; calculate a failure metric of the first computing component based at least in part on the analysis of the one or more tolerance limits of the first computing component; analyze sensor data from the first computing component in real time; and update the failure metric based at least in part on the analyzing of the sensor data.
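The calculate-then-update loop above can be sketched as follows. The abstract does not define the failure metric, so this sketch assumes one: the fraction of the tolerance band a reading has used up, smoothed with an exponential moving average as new sensor readings arrive. The class name, the `alpha` smoothing factor, and the temperature-style numbers are hypothetical.

```python
def margin_used(reading, low, high):
    """Return 0.0 at the centre of the tolerance band and 1.0 at or beyond either limit."""
    centre = (low + high) / 2
    half_width = (high - low) / 2
    return min(1.0, abs(reading - centre) / half_width)


class FailureMetric:
    """Hypothetical failure metric seeded from tolerance limits and refined by sensor data."""

    def __init__(self, low, high, nominal, alpha=0.5):
        self.low = low
        self.high = high
        self.alpha = alpha  # weight given to the newest reading
        # Initial metric from the tolerance analysis alone.
        self.value = margin_used(nominal, low, high)

    def update(self, reading):
        """Blend the newest sensor reading into the running metric."""
        observed = margin_used(reading, self.low, self.high)
        self.value = (1 - self.alpha) * self.value + self.alpha * observed
        return self.value


metric = FailureMetric(low=0, high=100, nominal=50)
metric.update(100)  # a reading at the upper limit pulls the metric upward
metric.update(100)  # sustained readings at the limit keep raising it
```

The moving average keeps a single outlier from dominating the metric while still letting a sustained drift toward a limit show up quickly.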
Intelligent software agent to facilitate software development and operations
Some embodiments may facilitate software development and operations for an enterprise. A communication input port may receive information associated with a software continuous integration/deployment pipeline of the enterprise. An intelligent software agent platform, coupled to the communication input port, may listen for a trigger indication from the software continuous integration/deployment pipeline. Responsive to the trigger indication, the intelligent software agent platform may apply system configuration information and rule layer information to extract software log data and apply a machine learning model to the extracted software log data to generate a pipeline health check analysis report. The pipeline health check analysis report may include, for example, an automatically generated prediction associated with future operation of the software continuous integration/deployment pipeline. The intelligent software agent platform may then facilitate transmission of the pipeline health check analysis report via a communication output port and a distributed communication network.
Creating robustness scores for selected portions of a computing infrastructure
A system for generating a robustness score for hardware components, nodes, and clusters of nodes in a computing infrastructure is provided. The system includes a memory and at least one processing device coupled to the memory. The processing device is to obtain first telemetry data associated with a selected portion of a computing infrastructure, and the selected portion includes a first node and a first hardware component. The processing device is further to obtain first metadata associated with the selected portion, input one or more telemetry inputs corresponding to the first telemetry data into a machine learning model, input one or more metadata inputs corresponding to the first metadata into the machine learning model, and generate, from the machine learning model, a first robustness score for the first hardware component representing a health state of the first hardware component.
DATA TAPE MEDIA QUALITY VALIDATION AND ACTION RECOMMENDATION
Techniques for generating action recommendations for a data tape system are disclosed. A data tape system generates action recommendations for a data tape based on library-based metadata messages as well as a measured data quality value of the data tape. The system initiates an operation resulting in the data tape interacting with a media drive. A data tape library controller generates one or more metadata messages based on a result of a requested operation. The metadata message may include information regarding the type of error and a default recommended course of action. The system generates the recommended action for the data tape using a trained machine learning model.