Patent classifications
G06F11/0709
FAILURE DIAGNOSIS DEVICE, FAILURE DIAGNOSIS SYSTEM, HOUSEHOLD ELECTRICAL APPLIANCE, SENSOR UNIT, AND FAILURE DIAGNOSIS METHOD
A failure diagnosis device includes a communication unit, a data comparison unit, and an eligibility determination unit. The communication unit acquires first physical quantity data and second physical quantity data, both acquired by a sensor unit and used for performing failure diagnosis of a home appliance, the second physical quantity data being of a type different from that of the first physical quantity data. The communication unit also acquires first control information that is acquired by the home appliance and related to the second physical quantity data. The data comparison unit compares the second physical quantity data with the first control information. The eligibility determination unit determines, based on the comparison result obtained by the data comparison unit, whether or not the first physical quantity data is eligible as data used for the failure diagnosis.
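The eligibility check described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the choice of vibration as the first physical quantity, motor current as the second, a commanded on/off flag as the control information, and the 0.5 A threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SensorSample:
    vibration: float     # first physical quantity (assumed: vibration level)
    current_amps: float  # second physical quantity (assumed: motor current)


def is_eligible(sample: SensorSample, motor_commanded_on: bool,
                on_threshold_amps: float = 0.5) -> bool:
    """Return True when the measured current agrees with the appliance's
    own control information, so the vibration reading can be trusted."""
    measured_on = sample.current_amps >= on_threshold_amps
    return measured_on == motor_commanded_on


if __name__ == "__main__":
    sample = SensorSample(vibration=1.2, current_amps=2.0)
    print(is_eligible(sample, motor_commanded_on=True))
```

A sample whose second quantity contradicts the control information (for example, no current while the motor is commanded on) is rejected before it reaches the diagnosis step.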
Centralized error telemetry using segment routing header tunneling
A network device receives a data packet including a source address and a destination address. The network device drops the data packet before it reaches the destination address and generates an error message indicating that the data packet has been dropped. The network device encapsulates the error message with a segment routing header comprising a list of segments. The first segment of the list of segments in the segment routing header identifies a remote server, and at least one additional segment is an instruction for handling the error message. The network device sends the encapsulated error message to the remote server based on the first segment of the segment routing header.
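The encapsulation step can be modeled logically as shown below. This sketch does not reproduce the on-wire segment routing header layout; the class names, the `encapsulate_drop` helper, and the example addresses (taken from the IPv6 documentation prefix) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorMessage:
    src: str
    dst: str
    reason: str


@dataclass
class EncapsulatedError:
    segments: List[str]  # logical segment list, first segment first
    payload: ErrorMessage


def encapsulate_drop(src: str, dst: str, reason: str,
                     collector: str, instructions: List[str]) -> EncapsulatedError:
    """Wrap an error message so the first segment names the remote server
    and the remaining segments carry handling instructions."""
    message = ErrorMessage(src, dst, reason)
    return EncapsulatedError(segments=[collector, *instructions], payload=message)


def next_hop(packet: EncapsulatedError) -> str:
    # The device forwards based on the first segment of the list.
    return packet.segments[0]


if __name__ == "__main__":
    pkt = encapsulate_drop("2001:db8::1", "2001:db8::2", "ttl-expired",
                           collector="2001:db8:ffff::10",
                           instructions=["2001:db8:ffff::a1"])
    print(next_hop(pkt))
```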
AUTOMATED CROSS-SERVICE DIAGNOSTICS FOR LARGE SCALE INFRASTRUCTURE CLOUD SERVICE PROVIDERS
Example aspects include techniques for employing cross-service diagnostics for cloud service providers. These techniques may include dynamically generating a workflow of one or more diagnostic modules based on relationship information between an origin service experiencing an incident and one or more related services that the origin service depends on, and executing the workflow of one or more diagnostic modules to determine a root cause of the incident, each of the one or more diagnostic modules implemented by an individual service of the one or more related services in accordance with a schema. In addition, the techniques may include determining a diagnostic action based on the root cause, and transmitting, based on the diagnostic action, an engagement notification to a responsible entity.
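One way to picture the dynamically generated workflow is a traversal of the dependency graph that runs each service's diagnostic module. The sketch below is an assumption-laden illustration: the dependency map, the module schema (a callable returning a dict with a "healthy" key), and the heuristic of treating the deepest unhealthy dependency as the root cause are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Diagnostic = Callable[[], dict]  # assumed schema: returns {"healthy": bool}


def build_workflow(origin: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Order services so that dependencies are diagnosed before dependents."""
    order: List[str] = []
    seen = set()

    def visit(service: str) -> None:
        if service in seen:
            return
        seen.add(service)
        for dep in depends_on.get(service, []):
            visit(dep)
        order.append(service)

    visit(origin)
    return order


def find_root_cause(workflow: List[str],
                    modules: Dict[str, Diagnostic]) -> Optional[str]:
    """Run each diagnostic module in order; report the first unhealthy service."""
    for service in workflow:
        if not modules[service]()["healthy"]:
            return service
    return None


if __name__ == "__main__":
    deps = {"web": ["auth", "db"], "auth": ["db"]}
    mods = {"web": lambda: {"healthy": False},
            "auth": lambda: {"healthy": True},
            "db": lambda: {"healthy": False}}
    print(find_root_cause(build_workflow("web", deps), mods))
```

The returned service name would then drive the diagnostic action, such as notifying the team that owns that service.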
AUTOMATED ISSUE DETECTION AND REMEDIATION ACROSS MULTIPLE USER SYSTEMS USING HEALING-AS-A-SERVICE TECHNIQUES
Methods, apparatus, and processor-readable storage media for automated issue detection and remediation across multiple user systems using healing-as-a-service techniques are provided herein. An example computer-implemented method includes obtaining system configuration data from at least a portion of multiple user systems within a network; obtaining an alert pertaining to an issue attributed to a first of the user systems; training a machine learning model related to user system issue detection using at least a portion of the system configuration data and data related to the alert; determining user system configuration adjustments related to remedying at least a portion of the issue, by processing the data related to the alert using the trained machine learning model; automatically performing the user system configuration adjustments in connection with the first user system; and sharing, using at least one healing-as-a-service component, the trained machine learning model with the user systems in the network.
MULTI-CONTROLLER DECLARATIVE FAULT MANAGEMENT AND COORDINATION FOR MICROSERVICES
Methods, systems, and computer program products for multi-controller declarative fault management and coordination for microservices are provided herein. A computer-implemented method includes processing information pertaining to at least one fault impacting multiple resources within a given system, wherein respective portions of the multiple resources are managed by multiple independent controllers; determining, by each of at least a portion of the multiple independent controllers and based at least in part on the processing of the information, one or more desired resource states and one or more remediation actions; generating, based at least in part on one or more of the determined desired resource states and the determined remediation actions, a sequential ordering of the determined remediation actions to be carried out by the at least a portion of the multiple controllers; and automatically initiating execution of the determined remediation actions in accordance with the generated sequential ordering.
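The sequential-ordering step can be illustrated with a topological sort over the actions proposed by each controller. The abstract does not say how the ordering is generated, so this technique is swapped in for illustration; the proposal format and the controller and action names are hypothetical. The sketch uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def order_actions(proposals: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Merge per-controller proposals into one execution order.

    proposals maps controller -> {action: [actions that must run first]}.
    """
    graph: Dict[str, set] = {}
    for actions in proposals.values():
        for action, prerequisites in actions.items():
            graph.setdefault(action, set()).update(prerequisites)
    # static_order() yields every action after all of its prerequisites.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    proposals = {
        "storage": {"remount-volume": []},
        "compute": {"restart-pod": ["remount-volume"]},
        "network": {"reopen-route": ["restart-pod"]},
    }
    for step in order_actions(proposals):
        print(step)
```

If two controllers propose contradictory prerequisites, `TopologicalSorter` raises `CycleError`, which is a natural point to escalate to an operator.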
DETECTING METRICS INDICATIVE OF OPERATIONAL CHARACTERISTICS OF A NETWORK AND IDENTIFYING AND CONTROLLING BASED ON DETECTED ANOMALIES
A machine learning anomaly detection system receives a time series of metrics indicative of operational characteristics of a computing system architecture. A distribution of the metric values is identified, and a volume of metrics detected during a current evaluation period is identified. A dynamic anomaly detection threshold is generated based upon the distribution and the volume of detected metrics. Metric values from the current evaluation period are compared to the dynamic anomaly detection threshold to determine whether the metric values in the current evaluation period are anomalous. If so, an action signal is generated.
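A dynamic threshold of this kind might be sketched as a mean-plus-k-sigma rule whose multiplier widens when the current period contains few metrics. The abstract does not give a formula, so the specific rule below, the `base_k` parameter, and the function names are assumptions for illustration only.

```python
import math
import statistics
from typing import List


def dynamic_threshold(history: List[float], current_volume: int,
                      base_k: float = 3.0) -> float:
    """Threshold from the historical distribution, loosened at low volume."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    # Assumed rule: fewer samples in the period -> larger multiplier.
    k = base_k * (1 + 1 / math.sqrt(max(current_volume, 1)))
    return mean + k * std


def anomalous_values(current: List[float], history: List[float]) -> List[float]:
    """Return the values in the current period that exceed the threshold."""
    threshold = dynamic_threshold(history, len(current))
    return [v for v in current if v > threshold]


if __name__ == "__main__":
    history = [10, 12, 11, 9, 10, 11, 12, 10]
    print(anomalous_values([11, 30, 10], history))
```

A non-empty result would correspond to generating the action signal.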
METHODS AND SYSTEMS THAT AUTOMATICALLY PREDICT DISTRIBUTED-COMPUTER-SYSTEM PERFORMANCE DEGRADATION USING AUTOMATICALLY TRAINED MACHINE-LEARNING COMPONENTS
The current document is directed to methods and systems that automatically generate training data for machine-learning-based components used by a metric-data processing-and-analysis component of a distributed computer system, a subsystem within a distributed computer system, or a standalone metric-data processing-and-analysis system. The training data sets are labeled using categorical KPI values. The machine-learning-based components are applied to metric data both for predicting anomalous operational behaviors and problems within the distributed computer system and for determination of potential causes of anomalous operational behaviors and problems within the distributed computer system. Training of machine-learning-based components is carried out concurrently and asynchronously with respect to other metric-data collection, aggregation, processing, storage, and analysis tasks.
AUTOMATED METHODS AND SYSTEMS FOR IDENTIFYING PROBLEMS IN DATA CENTER OBJECTS
Automated methods and systems for identifying problems associated with objects of a data center are described. Automated methods and systems are performed by an operations management server. For each object, the server determines a baseline distribution from historical events that are associated with a normal operational state of the object. The server determines a runtime distribution of runtime events that are associated with the object and detected in a runtime window of the object. The management server monitors runtime performance of the object while the object is running in the data center. When a performance problem is detected, the management server determines a root cause of the performance problem based on the baseline distribution and the runtime distribution and displays an alert in a graphical user interface of a display.
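The comparison of baseline and runtime distributions can be illustrated with event-type frequencies. This sketch assumes events are plain strings and treats the event type whose relative frequency grew the most as the root-cause hint; that heuristic and the function names are hypothetical, and both event lists are assumed to be non-empty.

```python
from collections import Counter
from typing import Dict, List


def distribution(events: List[str]) -> Dict[str, float]:
    """Relative frequency of each event type (events must be non-empty)."""
    counts = Counter(events)
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}


def largest_shift(baseline_events: List[str], runtime_events: List[str]) -> str:
    """Event type whose share increased most from baseline to runtime."""
    base = distribution(baseline_events)
    run = distribution(runtime_events)
    shifts = {event: run.get(event, 0.0) - base.get(event, 0.0)
              for event in set(base) | set(run)}
    return max(shifts, key=shifts.get)


if __name__ == "__main__":
    baseline = ["login"] * 8 + ["disk_warn"] * 2
    runtime = ["login"] * 4 + ["disk_warn"] * 6
    print(largest_shift(baseline, runtime))
```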
Datacenter IoT-triggered preemptive measures using machine learning
One example method includes performing a machine learning process that involves performing an assessment of a state of a computing system. The assessment includes analyzing information generated by an IoT edge sensor in response to a sensed physical condition in the computing system, and identifying an entity in the computing system potentially impacted by an event associated with the physical condition. The example method further includes: identifying a preemptive recovery action and associating the preemptive recovery action with an entity, where the preemptive recovery action, when performed, reduces or eliminates an impact of the event on the entity; determining a cost associated with implementation of the preemptive recovery action; evaluating the cost associated with the preemptive recovery actions and identifying the preemptive recovery action with the lowest associated cost; implementing the preemptive recovery action with the lowest associated cost; and repeating part of the machine learning process.
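The cost-evaluation step can be sketched as selecting, for each potentially impacted entity, the candidate action with the lowest cost. The rack-based mapping from a sensor event to entities, the `Action` type, and the cost figures are assumptions for illustration; the machine learning assessment itself is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Action:
    name: str
    cost: float  # assumed: a single comparable cost figure per action


def plan_preemptive_actions(event_rack: str,
                            entities_by_rack: Dict[str, List[str]],
                            candidate_actions: Dict[str, List[Action]]
                            ) -> Dict[str, Action]:
    """For each entity in the affected rack, pick the cheapest candidate."""
    plans: Dict[str, Action] = {}
    for entity in entities_by_rack.get(event_rack, []):
        options = candidate_actions.get(entity, [])
        if options:
            plans[entity] = min(options, key=lambda a: a.cost)
    return plans


if __name__ == "__main__":
    racks = {"rack-7": ["vm-1", "db-2"]}
    candidates = {
        "vm-1": [Action("migrate", 5.0), Action("snapshot", 2.0)],
        "db-2": [Action("failover", 8.0)],
    }
    for entity, action in plan_preemptive_actions("rack-7", racks, candidates).items():
        print(entity, action.name)
```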
Database recovery time objective optimization with synthetic snapshots
Methods and systems for reducing the amount of time to restore a database or other application by dynamically generating and storing synthetic snapshots are described. When backing up a database, an integrated data management and storage system may acquire snapshots of the database at a snapshot frequency and acquire database transaction logs at a frequency that is greater than the snapshot frequency. In response to detecting that the database is unable to provide a database snapshot, the integrated data management and storage system may generate a synthetic snapshot of the database by instantiating a compatible version of the database locally, acquiring a previously stored snapshot of the database, applying data changes from one or more database transaction logs to the previously stored snapshot to generate the synthetic snapshot, and storing the synthetic snapshot of the database within the integrated data management and storage system.
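Generating a synthetic snapshot by replaying transaction logs onto a stored snapshot can be sketched as below. The sketch models the database as a key-value dict and each log entry as a `(sequence, operation, key, value)` tuple; that representation and the function name are assumptions, and instantiating a compatible database version locally is not modeled.

```python
from typing import Any, Dict, List, Tuple

LogEntry = Tuple[int, str, str, Any]  # (sequence number, "put"/"delete", key, value)


def synthesize_snapshot(snapshot: Dict[str, Any], snapshot_seq: int,
                        logs: List[LogEntry]) -> Tuple[Dict[str, Any], int]:
    """Apply log entries newer than the stored snapshot, in sequence order.

    Returns the synthetic snapshot and the last sequence number applied.
    """
    state = dict(snapshot)  # leave the stored snapshot untouched
    last_applied = snapshot_seq
    for seq, op, key, value in sorted(logs, key=lambda entry: entry[0]):
        if seq <= snapshot_seq:
            continue  # already reflected in the stored snapshot
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        last_applied = seq
    return state, last_applied


if __name__ == "__main__":
    stored = {"a": 1, "b": 2}
    logs = [(12, "delete", "b", None), (9, "put", "a", 0), (11, "put", "c", 3)]
    print(synthesize_snapshot(stored, 10, logs))
```

Because the logs are acquired more frequently than snapshots, the synthetic snapshot can sit closer to the desired recovery point than the last real snapshot, which shortens the replay needed at restore time.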