G06F11/0766

Identifying patterns in event logs to predict and prevent cloud service outages

In non-limiting examples of the present disclosure, systems, methods and devices for predicting hardware failure events are presented. A time series comprising event log data for a plurality of events and a plurality of event types that occurred on a server computing device may be received. The time series may be filtered for a subset of the plurality of event types. The filtered time series may be processed with a recurrent neural network that has been trained to predict hardware failure events from time series data comprising the subset of the plurality of event types. A prediction may be made that a hardware failure event will occur on the server computing device within a threshold duration of time. A prophylactic follow-up action corresponding to the predicted hardware failure event may be performed.
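The abstract names a pipeline (filter the event time series, score it with a trained model, act if a failure is predicted) but does not specify features, step width, or model. The sketch below is a minimal illustration under stated assumptions. The filtered events are bucketed into per-type counts per fixed time step, and the trained recurrent network is replaced by a hypothetical `model` callable that returns a failure probability. All function names are illustrative, not from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Set


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds, on the same clock as `start` below
    event_type: str


def filter_events(series: Iterable[Event], keep_types: Set[str]) -> List[Event]:
    """Keep only events whose type is in the subset the model was trained on."""
    return [e for e in series if e.event_type in keep_types]


def to_count_steps(events: Iterable[Event], type_order: Sequence[str],
                   start: float, step: float, n_steps: int) -> List[List[int]]:
    """Bucket already-filtered events into fixed-width steps of per-type counts."""
    index = {t: i for i, t in enumerate(type_order)}
    steps = [[0] * len(type_order) for _ in range(n_steps)]
    for e in events:
        k = int((e.timestamp - start) // step)
        if 0 <= k < n_steps:  # drop events outside the observation window
            steps[k][index[e.event_type]] += 1
    return steps


def predict_and_act(series: Iterable[Event], type_order: Sequence[str],
                    model: Callable[[List[List[int]]], float], threshold: float,
                    action: Callable[[], None], start: float, step: float,
                    n_steps: int) -> bool:
    """Filter, featurize, score with the (hypothetical) model, act if needed."""
    filtered = filter_events(series, set(type_order))
    steps = to_count_steps(filtered, type_order, start, step, n_steps)
    p_failure = model(steps)  # stand-in for the trained RNN's failure probability
    if p_failure >= threshold:
        action()  # e.g. migrate workloads off the server (illustrative choice)
        return True
    return False
```

Keeping the model behind a plain callable keeps the filtering and thresholding logic independent of any particular neural-network library.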

ARCHITECTURE, METHOD AND SYSTEM FOR LIVE TESTING IN A PRODUCTION ENVIRONMENT

There is provided an architecture, methods and a system for live testing in a production environment. The architecture comprises a platform-independent Test Planner that generates a test package in response to receiving an event. Generating a test package comprises selecting test goals, generating a test suite and generating a test plan. The architecture also comprises a platform-dependent Test Execution Framework (TEF) that executes the test package in an environment serving live traffic. Executing the test package comprises initializing and starting the test plan and then either reporting successful completion of the test plan, reporting suspension of the test plan and waiting for further instructions, or reporting a failure of the test plan and executing a corresponding contingency plan.
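The three possible endings of test plan execution can be sketched as a small control loop. This is an illustrative sketch only: the `Outcome` enum, the per-step callables and the `report` hook are assumptions, and the contingency plan is modelled as an opaque callable.

```python
from enum import Enum
from typing import Callable, Sequence


class Outcome(Enum):
    COMPLETED = "completed"
    SUSPENDED = "suspended"
    FAILED = "failed"


def execute_test_plan(steps: Sequence[Callable[[], Outcome]],
                      contingency: Callable[[], None],
                      report: Callable[[str], None]) -> Outcome:
    """Run test steps in order; COMPLETED from a step means 'continue'."""
    report("initialized")
    report("started")
    for step in steps:
        outcome = step()
        if outcome is Outcome.SUSPENDED:
            report("suspended, waiting for further instructions")
            return Outcome.SUSPENDED
        if outcome is Outcome.FAILED:
            report("failed, executing contingency plan")
            contingency()  # e.g. roll back or reroute live traffic (illustrative)
            return Outcome.FAILED
    report("completed")
    return Outcome.COMPLETED
```

Stopping at the first failure before running the contingency plan reflects the live-traffic setting: once a test misbehaves, later steps should not keep touching production.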

Data processing device, monitoring method, and program

A data processing apparatus includes a first processing unit that executes real-time processing with respect to data, a second processing unit that executes batch processing with respect to data that is output from the first processing unit as a result of processing by the first processing unit, and a monitor that monitors a status of the processing by the first processing unit and a status of processing by the second processing unit. The first processing unit includes a plurality of subprocessing units and buffers, and the second processing unit also includes a plurality of subprocessing units and buffers. The second processing unit includes a storage. The monitor includes a first monitor that monitors, for each of the buffers included in the first processing unit, an amount of the data stored in the corresponding buffer and a second monitor that monitors a total amount of the data stored in the buffers included in the second processing unit and the data stored in the storage.
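The two monitors differ in granularity: one is per buffer, the other is a single total. A minimal sketch follows, assuming byte counts as the measure of "amount of data". The alert thresholds `limit_ratio` and `total_limit` are hypothetical, because the abstract describes only what is monitored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Buffer:
    name: str
    capacity: int  # bytes
    items: List[bytes] = field(default_factory=list)

    def amount(self) -> int:
        """Bytes currently held in this buffer."""
        return sum(len(x) for x in self.items)


def first_monitor(realtime_buffers: Sequence[Buffer],
                  limit_ratio: float = 0.8) -> Dict[str, bool]:
    """Per-buffer check for the real-time unit: True if above limit_ratio of capacity."""
    return {b.name: b.amount() > limit_ratio * b.capacity for b in realtime_buffers}


def second_monitor(batch_buffers: Sequence[Buffer], storage_bytes: int,
                   total_limit: int) -> Tuple[int, bool]:
    """Aggregate check for the batch unit: all buffers plus storage against one limit."""
    total = sum(b.amount() for b in batch_buffers) + storage_bytes
    return total, total > total_limit
```

A plausible reason for the aggregate check is that batch data moves between buffers and storage as it waits for the next batch run, so only the combined total reliably indicates a backlog.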

Media error reporting improvements for storage drives

A method of managing errors in a plurality of storage drives includes receiving, at a memory controller coupled to at least one storage medium in an SSD, a read command from a host interface. The method also includes retrieving, from the storage medium, read data corresponding to a plurality of data chunks to be retrieved in response to the read command, and determining that at least one data chunk of the plurality of data chunks cannot be read, that data chunk being a failed data chunk. In response to determining the failed data chunk, the read data is sent to the host interface either including or excluding the failed data chunk. In response to the read command, status information about all of the data chunks is also sent to the host interface.
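The point of the method is that the host receives per-chunk status even when some chunks fail. A sketch follows, under the assumption that an unreadable chunk surfaces as an `OSError` from a hypothetical `read_chunk` function. When a failed chunk is included, it is filled with zero bytes here. Whether a real drive returns raw uncorrected data or a fill pattern is not stated in the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Sequence


class ChunkStatus(Enum):
    OK = "ok"
    UNREADABLE = "unreadable"


@dataclass
class ReadResponse:
    data: List[bytes]               # payloads returned to the host
    status: Dict[int, ChunkStatus]  # status for every chunk in the command


def handle_read(chunk_ids: Sequence[int], read_chunk: Callable[[int], bytes],
                include_failed: bool, chunk_size: int) -> ReadResponse:
    data: List[bytes] = []
    status: Dict[int, ChunkStatus] = {}
    for idx in chunk_ids:
        try:
            payload = read_chunk(idx)
            status[idx] = ChunkStatus.OK
        except OSError:
            status[idx] = ChunkStatus.UNREADABLE
            if not include_failed:
                continue  # exclude the failed chunk; the status map still reports it
            payload = b"\x00" * chunk_size  # placeholder for the unreadable chunk
        data.append(payload)
    return ReadResponse(data, status)
```

Because the status map covers every requested chunk, the host can tell which chunks are missing or unreliable instead of receiving a single pass/fail for the whole command.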

Persistent health monitoring for volatile memory systems

Methods, systems, and devices for persistent health monitoring for volatile memory devices are described. A memory device may determine that an operating condition associated with an array of memory cells on the device, such as a temperature, current, voltage, or other metric of health status is outside of a range associated with a risk of device degradation. The memory device may monitor a duration over which the operating condition is outside of the range, and may determine whether the duration satisfies a threshold. In some cases, the memory device may store an indication of when (e.g., each time) the duration satisfied the threshold. The memory device may store the one or more indications in one or more non-volatile storage elements, such as fuses, which may enable the memory device to maintain a persistent indication of a cumulative duration over which the memory device is operated with operating conditions outside of the range.
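The monitoring logic can be illustrated with a software model. In this sketch, a list of booleans stands in for the write-once fuses, and one fuse is blown each time an excursion outside the range lasts for the threshold duration. The class name, the restart-after-recording policy and the lower-bound readout are all assumptions for illustration.

```python
from typing import List, Optional


class PersistentHealthMonitor:
    """Records out-of-range excursions of at least threshold_s in write-once 'fuses'."""

    def __init__(self, low: float, high: float, threshold_s: float, n_fuses: int = 16):
        self.low, self.high = low, high
        self.threshold_s = threshold_s
        self.fuses: List[bool] = [False] * n_fuses  # models one-time-programmable fuses
        self._out_since: Optional[float] = None    # start of the current excursion

    def sample(self, t: float, value: float) -> None:
        if self.low <= value <= self.high:
            self._out_since = None  # back in range: end the excursion
            return
        if self._out_since is None:
            self._out_since = t
        elif t - self._out_since >= self.threshold_s:
            self._blow_next_fuse()
            self._out_since = t  # restart timing so each further threshold period is recorded

    def _blow_next_fuse(self) -> None:
        for i, blown in enumerate(self.fuses):
            if not blown:
                self.fuses[i] = True  # one-way transition, never cleared
                return
        # all fuses blown: the counter saturates

    def min_cumulative_out_of_range_s(self) -> float:
        """Lower bound on total out-of-range time implied by the blown fuses."""
        return sum(self.fuses) * self.threshold_s
```

Counting in fixed threshold-sized units keeps the persistent record to a handful of one-way bits. The tradeoff is that the stored total is only a lower bound on the real cumulative duration.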

AUTOMATICALLY CONTROLLING RESOURCE PARTITIONS IN ADVANCE OF PREDICTED BOTTLENECKS FOR LOG STREAMING MESSAGES
20230083701 · 2023-03-16

Embodiments of the invention include a computer-implemented method for allocating computing resources. The method includes generating, using a processor, tracing data that results from data traffic processed through multiple data paths by the processor. The processor analyzes the tracing data to identify a predicted bottleneck path among the multiple data paths, where the predicted bottleneck path includes a data path on which a data bottleneck is predicted to occur. Computing resources are then allocated to the predicted bottleneck path before the predicted data bottleneck occurs.
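The abstract does not say how the bottleneck is predicted. The sketch below uses one-step linear extrapolation of per-path message rates as a stand-in for that analysis, which is an assumption. It then hands spare partitions round-robin to the paths predicted to exceed capacity, worst first. Path names, capacities and the partition unit are hypothetical.

```python
from typing import Dict, List


def predict_next(samples: List[float]) -> float:
    """One-step-ahead linear extrapolation from the last two samples."""
    if not samples:
        return 0.0
    if len(samples) == 1:
        return samples[-1]
    return samples[-1] + (samples[-1] - samples[-2])


def allocate_ahead(tracing: Dict[str, List[float]], capacity: Dict[str, float],
                   spare_partitions: int) -> Dict[str, int]:
    """Assign spare partitions to paths predicted to exceed capacity, worst first."""
    load_ratio = {p: predict_next(s) / capacity[p] for p, s in tracing.items()}
    bottlenecks = sorted((p for p, r in load_ratio.items() if r > 1.0),
                         key=lambda p: load_ratio[p], reverse=True)
    alloc = {p: 0 for p in tracing}
    if bottlenecks:
        for i in range(spare_partitions):
            alloc[bottlenecks[i % len(bottlenecks)]] += 1  # round-robin, worst path first
    return alloc
```

The allocation acts on the predicted load rather than the current load, which is what allows resources to be in place before the bottleneck occurs.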

FAILURE HANDLING APPARATUS AND SYSTEM, RULE LIST GENERATION METHOD, AND NON-TRANSITORY COMPUTER-READABLE MEDIUM
20230070080 · 2023-03-09

A failure handling apparatus (100) is provided with: an acceptance unit (15) that accepts specification of condition information in an execution condition; a code generation unit (16) that generates a program code of a conditional expression based on the specified condition information; a template generation unit (17) that generates an input template of a plurality of failure handling rules, including an input field of a determination criterion value for determining extracted information, based on the program code and an input field of a handling content; and a list generation unit (18) that sets input values, for the input template, in the input fields and stores the input values in a storage unit as a list.

STORAGE DEVICE AND OPERATING METHOD THEREOF
20230071289 · 2023-03-09

A storage device and an operating method thereof are provided. The storage device includes a memory configured to store parameter data used as an input in a neural network. The storage device also includes a storage controller configured to receive a request signal from a host. The storage controller is also configured to encode, based on the parameter data, log data in the neural network, the log data indicating contexts of a plurality of components, and to transmit the encoded log data to the host.

METHOD AND SYSTEM FOR DIFFERENTIATING BETWEEN APPLICATION AND INFRASTRUCTURE ISSUES
20230130886 · 2023-04-27

Example aspects include techniques for detecting that a value of a dependency call performance metric is outside of a threshold range, for one or more instances of a dependency call from a service to a dependency in a cloud computing platform, where the instances share a common set of dependency call inputs. Based on detecting that the value is outside of the threshold range, the common set of dependency call inputs for those instances is provided to a machine learning (ML) model, and an expected value for the dependency call performance metric is obtained from the ML model based on those inputs. Comparing the value to the expected value then determines the entity causing the value to be outside of the threshold range.
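One plausible reading of the final comparison step is sketched below, and it is an interpretation rather than the disclosed logic. If the model expects an out-of-range value for these inputs, the inputs the service sent explain the deviation, which points to the application. If the observed value departs from the expectation, the dependency behaved abnormally, which points to the infrastructure. The `expected_fn` stand-in for the ML model and the relative `tolerance` are hypothetical.

```python
from typing import Any, Callable, Mapping, Optional


def attribute_issue(observed: float, low: float, high: float,
                    expected_fn: Callable[[Mapping[str, Any]], float],
                    inputs: Mapping[str, Any], tolerance: float = 0.2) -> Optional[str]:
    """Return None if in range, else guess whether the application or infrastructure is at fault."""
    if low <= observed <= high:
        return None  # nothing to attribute
    expected = expected_fn(inputs)  # stand-in for the trained ML model
    if abs(observed - expected) <= tolerance * abs(expected):
        # The model expects this value for these inputs: the inputs themselves explain it.
        return "application"
    # The value deviates from what the inputs predict: the dependency behaved abnormally.
    return "infrastructure"
```

For example, with a toy model that predicts 2 ms of latency per KB of payload, a 500 ms call carrying 240 KB is attributed to the application, while the same latency on a 10 KB call is attributed to the infrastructure.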

METHOD AND DEVICE FOR AUTOMATICALLY DETECTING POTENTIAL FAILURES IN MOBILE APPLICATIONS
20230121281 · 2023-04-20

A method and a device are provided for automatically detecting potential failures in mobile applications implemented on an operating system for mobile devices, where a mobile application is executable on the operating system installed on a hosting device by executing code instructions stored in an associated executable file. Given an executable file associated with a mobile application, the device implements a module for decompiling the executable file to obtain at least one descriptive file of the mobile application containing descriptive code formatted with a markup language, a module for providing a plurality of predetermined string patterns related to potential failures, and a module for searching the at least one descriptive file for the presence of at least one of the string patterns and, if one is present, outputting an indication of a potential failure associated with the detected string pattern.
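The search module reduces to matching predetermined strings against the decompiled markup. The sketch below assumes decompilation has already produced the descriptive file's text, for example an AndroidManifest.xml decoded by a tool such as apktool. The pattern table is purely illustrative and is not taken from the disclosure.

```python
from typing import Dict, List, Tuple

# Illustrative patterns only; a real deployment would supply its own curated list.
EXAMPLE_PATTERNS: Dict[str, str] = {
    'android:debuggable="true"': "debug flag left enabled",
    'android:exported="true"': "component exposed to other applications",
}


def scan_descriptive_file(text: str, patterns: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (pattern, potential failure) pairs for every pattern found in the file text."""
    return [(p, meaning) for p, meaning in patterns.items() if p in text]
```

Plain substring matching mirrors the abstract's "string patterns". Matching on parsed XML attributes would be more robust to whitespace and quoting differences, at the cost of a parser dependency.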