Patent classifications
G06F11/0787
PREDICTIVE BATCH JOB FAILURE DETECTION AND REMEDIATION
Systems, methods, and computer programming products for predicting, preventing, and remediating failures of batch jobs being executed and/or queued for processing at a future scheduled time. Batch job parameters, messages, and system logs are stored in knowledge bases and/or input into AI models for analysis. Using predictive analytics and/or machine learning, batch job failures are predicted before the failures occur. Mappings of the processes used by each batch job, historical data from previous batch jobs, and data identifying the success or failure thereof build an archive that can be refined over time through active-learning feedback and AI modeling to predictively recommend actions that have historically prevented or remediated failures. Recommended actions are reported to the system administrator or applied automatically. As job failures occur over time, mappings of the current system log to the logs of unsuccessful batch jobs make root cause analysis simpler and more automated.
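The historical-archive idea above can be sketched as a success-rate lookup: record which remediation actions succeeded for which failure signatures, then recommend the action with the best track record. All names here (`FailureArchive`, `record`, `recommend`, the sample signatures and actions) are illustrative assumptions, not terms from the abstract.

```python
from collections import defaultdict

class FailureArchive:
    """Toy archive mapping (failure signature, action) to success counts."""

    def __init__(self):
        # (failure_signature, action) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, signature, action, succeeded):
        # Active-learning feedback: log every attempt and its outcome.
        stats = self._stats[(signature, action)]
        stats[1] += 1
        if succeeded:
            stats[0] += 1

    def recommend(self, signature):
        """Return the action with the highest historical success rate."""
        candidates = [
            (successes / attempts, action)
            for (sig, action), (successes, attempts) in self._stats.items()
            if sig == signature and attempts > 0
        ]
        return max(candidates)[1] if candidates else None

archive = FailureArchive()
archive.record("OOM", "increase_heap", True)
archive.record("OOM", "increase_heap", True)
archive.record("OOM", "restart_job", False)
```

A production system would replace the success-rate heuristic with the trained AI model the abstract describes; the lookup structure stays the same.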
Machine-Learning Based Similarity Engine
An embodiment may involve storage containing incident logs and mappings between incident logs and vector representations generated by a machine learning (ML) model. The embodiment may further involve one or more processors configured to: receive, from a client device, a classification request corresponding to an additional incident log; transmit, to the ML model, additional values as appearing in the additional incident log, wherein reception of the additional values causes the ML model to generate an additional vector representation of the additional incident log; obtain confidence measurements respectively representing similarities between the additional vector representation and each of the vector representations corresponding to the incident logs; determine, based on the confidence measurements, a set of one or more incident logs that are semantically relevant to the additional incident log; and transmit, to the client device, representations of the one or more incident logs and their corresponding confidence measurements.
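The similarity step above can be illustrated with cosine similarity between an incident log's vector representation and stored vectors, ranking stored logs by score. The vectors, log IDs, and function names are hypothetical stand-ins; the patent's ML model would supply the actual embeddings.

```python
import math

def cosine_similarity(a, b):
    """Confidence measurement between two vector representations."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantically_relevant(query_vec, stored, top_k=2):
    """Rank stored incident logs by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), log_id)
              for log_id, vec in stored.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

stored = {
    "INC001": [0.9, 0.1, 0.0],
    "INC002": [0.1, 0.9, 0.2],
    "INC003": [0.8, 0.2, 0.1],
}
results = semantically_relevant([1.0, 0.0, 0.0], stored)
```

The `(score, log_id)` pairs returned correspond to the abstract's "representations of the one or more incident logs and their corresponding confidence measurements."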
Classifying errors in a failure log
Techniques are disclosed relating to a method that includes accessing, by a failure management program, a failure log that includes a plurality of character strings corresponding to errors that are associated with execution of one or more batch processes. The failure management program may compare a particular character string of the plurality of character strings to a set of character strings that are associated with respective ones of a plurality of failure categories. This comparing may include determining whether particular keywords that are included in respective ones of the set of character strings are included in the particular character string. In response to the comparing, the failure management program may assign a particular error corresponding to the particular character string to a particular failure category, or may determine a new failure category if the particular character string does not match an existing failure category.
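The comparison described above amounts to keyword matching against per-category string sets, with a fallback that creates a new category. A minimal sketch, with made-up category names and keywords:

```python
CATEGORIES = {
    "memory": {"OutOfMemory", "heap"},
    "network": {"timeout", "connection"},
}

def classify_error(error_string, categories):
    """Assign an error string to the first category whose keywords appear
    in it, or create a new category if nothing matches."""
    for name, keywords in categories.items():
        if any(kw in error_string for kw in keywords):
            return name
    # No existing category matched: determine a new one, seeded with the
    # string's first token as its keyword.
    new_name = f"uncategorized_{len(categories)}"
    categories[new_name] = {error_string.split()[0]}
    return new_name

cat1 = classify_error("java.lang.OutOfMemoryError: heap space", CATEGORIES)
cat2 = classify_error("disk quota exceeded on /var", CATEGORIES)
```

Substring matching is the simplest possible comparison; the claimed method may use more elaborate string similarity, but the category-or-new-category control flow is the same.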
SYSTEM AND METHOD FOR EFFICIENT REAL TIME LOG EXPORT AND VIEW IN A MICRO SERVICES SYSTEM
According to various embodiments, a method, medium, and system for exporting log messages related to a particular job running in a micro services system is described in this disclosure. The method uses a mapping table to narrow down the search scope for finding relevant log files. The mapping table maps a job attribute combination to one or more micro services, and thus directs the search for relevant log messages only to those log files related to the one or more micro services. The search scope can be further narrowed using a start time and end time of the job. Once the relevant log files are found, log messages containing an identifier of the job can be extracted from the relevant log files for display or for a user to download. The mapping table can be automatically generated by parsing through historical files during a system non-busy time.
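The narrowing described above can be sketched as two filters: the mapping table limits which services' log files are searched at all, and the job identifier then selects individual messages. The attribute tuples, service names, and sample log lines are invented for illustration.

```python
# Mapping table: job attribute combination -> micro services involved.
MAPPING = {
    ("etl", "daily"): ["extractor", "loader"],
    ("report", "weekly"): ["aggregator"],
}

# Log files keyed by micro service (stand-ins for real files on disk).
LOG_FILES = {
    "extractor": ["10:00 job=J42 start", "10:05 job=J99 start"],
    "loader": ["10:10 job=J42 load done"],
    "aggregator": ["10:02 job=J42 unrelated service"],
}

def export_job_logs(attrs, job_id, mapping, log_files):
    """Search only log files of services mapped to the job's attributes,
    then keep messages containing the job identifier."""
    services = mapping.get(attrs, [])
    return [line
            for svc in services
            for line in log_files.get(svc, [])
            if job_id in line]

lines = export_job_logs(("etl", "daily"), "J42", MAPPING, LOG_FILES)
```

Note that the `aggregator` line mentioning `J42` is never scanned, because that service is not mapped to the job's attribute combination; this is the search-scope reduction the abstract claims. A time-window filter on each line's timestamp would narrow the scope further.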
ERROR INFORMATION PROCESSING METHOD AND DEVICE, AND STORAGE MEDIUM
An error information processing method includes, in response to a memory error triggering an interrupt, collecting error information of the memory error that includes a first memory area where the memory error occurs, obtaining a second memory area for writing log information, determining whether the second memory area contains the first memory area, and, in response to determining that the second memory area contains the first memory area, skipping a process of writing the log information into the second memory area.
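The core check above is a range-containment test: if the area reserved for log writes contains the faulty memory area, writing the log is skipped (since the write could itself hit the bad memory). A minimal sketch with `[start, end)` address ranges; the function names are illustrative:

```python
def contains(outer, outer_inner):
    """True if the [start, end) range `outer` fully contains the range."""
    inner = outer_inner
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def should_write_log(log_area, error_area):
    # Skip the write when the second (log) memory area contains the first
    # (faulty) memory area, to avoid touching the failing memory.
    return not contains(log_area, error_area)

# Faulty area inside the log area: skip writing.
skip = not should_write_log((0x1000, 0x2000), (0x1800, 0x1900))
# Faulty area elsewhere: safe to write the log.
write = should_write_log((0x1000, 0x2000), (0x3000, 0x3100))
```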
INFORMATION RECORDING METHOD, APPARATUS, AND DEVICE, AND READABLE STORAGE MEDIUM
An information recording method, apparatus, and device, and a readable storage medium are provided. The method includes: when a server is started, determining a ring buffer in a Double Data Rate (DDR) of a Field-Programmable Gate Array (FPGA) acceleration card based on an OpenPower platform; determining a start address and an end address of the ring buffer and configuring the start address and the end address to the FPGA acceleration card; and during a running process of the server, recording preset debugging information to the ring buffer in real time, so as to perform fault location according to data in the ring buffer after a fault occurs in the server. According to the present application, during a running process of a server, preset debugging information is recorded using a DDR of an FPGA acceleration card; therefore, when a down fault causes a Central Processing Unit (CPU) error of a server, recording of debugging information can also be ensured, thereby facilitating fault location.
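The ring buffer between a configured start and end address behaves like a fixed-capacity buffer that overwrites its oldest entries, so the most recent debug records survive a fault. A software sketch of that behavior (the FPGA/DDR specifics are outside what code can show):

```python
class RingBuffer:
    """Fixed-capacity ring buffer; oldest entries are overwritten, mimicking
    debug recording between a configured start and end address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._buf = [None] * capacity
        self._next = 0      # next write slot, wraps at capacity
        self._count = 0     # number of valid entries

    def record(self, entry):
        self._buf[self._next] = entry
        self._next = (self._next + 1) % self.capacity
        self._count = min(self._count + 1, self.capacity)

    def dump(self):
        """Entries from oldest to newest, for post-fault location."""
        if self._count < self.capacity:
            return self._buf[:self._count]
        return self._buf[self._next:] + self._buf[:self._next]

rb = RingBuffer(3)
for i in range(5):
    rb.record(f"debug-{i}")
```

After five writes into a three-slot buffer, only the last three records remain, which is exactly the property that makes the ring useful for locating the fault that occurred last.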
TECHNIQUES TO PROVIDE SELF-HEALING DATA PIPELINES IN A CLOUD COMPUTING ENVIRONMENT
Embodiments may generally be directed to systems and techniques to detect failure events in data pipelines, determine one or more remedial actions to perform, and perform the one or more remedial actions.
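The detect-then-remediate flow can be sketched as a dispatch from failure-event types to remedial actions, with escalation when no action is known. Event types, payload fields, and actions here are all hypothetical:

```python
# Hypothetical mapping from detected pipeline failure events to remedial
# actions; a real system would attach executable remediation handlers.
REMEDIATIONS = {
    "schema_mismatch": lambda event: f"quarantine batch {event['batch']}",
    "worker_crash": lambda event: f"restart worker {event['worker']}",
}

def self_heal(event):
    """Determine and perform the remedial action mapped to the failure."""
    action = REMEDIATIONS.get(event["type"])
    if action is None:
        return "escalate to operator"
    return action(event)

result = self_heal({"type": "worker_crash", "worker": "w-7"})
fallback = self_heal({"type": "unknown_failure"})
```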
System, method, and computer program for determining a network situation in a communication network
A system, method, and computer program product are provided for determining a network situation in a communication network. In use, at least one threshold value of at least one operational parameter of a communication network is obtained, the at least one operational parameter representing at least one operational status of at least one of a computational device or a communication device. Additionally, log data of the communication network is obtained, the log data containing at least one value of the at least one operational parameter reported by at least one network entity of the communication network. The at least one value of the at least one operational parameter of the log data is compared with a corresponding threshold value of the at least one threshold value to form a detection of a network situation. Further, the detection of the network situation is reported if the at least one value of the at least one operational parameter of the log data traverses the corresponding threshold value of the at least one threshold value.
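Stripped of the claim language, the comparison above is: for each logged parameter value, check it against the configured threshold and report a detection when the value traverses it. A sketch assuming "traverses" means "exceeds" (the abstract leaves the direction open), with invented parameter and entity names:

```python
def detect_situations(thresholds, log_records):
    """Report a network situation whenever a logged parameter value
    exceeds its configured threshold."""
    detections = []
    for record in log_records:
        threshold = thresholds.get(record["param"])
        if threshold is not None and record["value"] > threshold:
            detections.append((record["entity"], record["param"], record["value"]))
    return detections

thresholds = {"cpu_load": 0.9, "packet_loss": 0.05}
logs = [
    {"entity": "router-1", "param": "cpu_load", "value": 0.95},
    {"entity": "switch-2", "param": "packet_loss", "value": 0.01},
]
alerts = detect_situations(thresholds, logs)
```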
Methods and systems for reducing volumes of log messages sent to a data center
Computer-implemented methods and systems described herein are directed to reducing volumes of log messages sent from edge systems to a data center. The computer-implemented methods performed at each edge system includes collecting a stream of log messages generated by one or more event sources of the edge system. Representative log messages of the stream of log messages are determined. The edge systems may discard non-representative log messages from data storage devices at the edge system. The representative log messages are sent from each of the edge systems to the data center where the representative log messages are received and stored in data storage devices of the data center, thereby reducing the volumes of log messages sent from the edge systems to the data center.
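One common way to pick representative log messages, shown here as an assumed illustration rather than the patent's exact method, is to reduce each message to a template (masking variable fields such as numbers) and keep only the first message per template, discarding the non-representative rest:

```python
import re

def template_of(message):
    # Mask numeric fields so messages differing only in variable values
    # collapse onto one template (a common log-reduction heuristic).
    return re.sub(r"\d+", "<NUM>", message)

def representatives(stream):
    """Keep the first message seen per template; discard the rest, so only
    representatives are sent from the edge system to the data center."""
    seen = set()
    kept = []
    for msg in stream:
        tpl = template_of(msg)
        if tpl not in seen:
            seen.add(tpl)
            kept.append(msg)
    return kept

stream = [
    "request 17 completed in 42 ms",
    "request 18 completed in 40 ms",
    "disk usage at 91 percent",
]
kept = representatives(stream)
```

Here the two `request ... completed` messages share one template, so only one is forwarded, cutting the volume sent to the data center.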
Apparatus configured to perform a repair operation
An apparatus includes a storage area signal generation circuit configured to generate a storage area signal when performing an internal information storage operation and an external information storage operation; and an information storage circuit configured to receive internal failure information, stored in the apparatus, based on the storage area signal and store the received internal failure information as failure information in a set storage capacity, and store external failure information, applied from outside the apparatus, as the failure information in a variable storage capacity.