G06F11/0709

TECHNIQUES FOR IMPLEMENTING ROLLBACK OF INFRASTRUCTURE CHANGES IN A CLOUD INFRASTRUCTURE ORCHESTRATION SERVICE

Techniques for implementing rollback of infrastructure changes in an infrastructure orchestration service are described. In certain examples, an infrastructure orchestration service is disclosed that manages both provisioning and deploying of infrastructure assets within a cloud environment. The service receives a plan comprising a set of instructions associated with a set of infrastructure assets of an execution target and identifies a first state of the set of infrastructure assets. The service executes the set of instructions in the plan to achieve a second state for the set of infrastructure assets. Based in part on the executing, the service receives a trigger for rolling back the plan to restore the set of infrastructure assets in the plan to the first state and executes a rollback plan for the plan. The service then transmits a result associated with the execution of the rollback plan.
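The snapshot-then-restore flow in this abstract can be illustrated with a minimal sketch. It assumes the infrastructure assets are modeled as a plain dictionary, each instruction is a callable that mutates it, and a raised exception serves as the rollback trigger; the name `execute_with_rollback` and the result fields are hypothetical and not taken from the disclosure.

```python
from copy import deepcopy


def execute_with_rollback(assets, instructions):
    """Apply instructions to assets; restore the first state if a step fails.

    Hypothetical model: `assets` is a dict, each instruction is a callable
    that mutates it, and any exception is treated as the rollback trigger.
    """
    first_state = deepcopy(assets)      # identify and record the first state
    try:
        for instruction in instructions:
            instruction(assets)         # move the assets toward the second state
    except Exception as exc:
        assets.clear()
        assets.update(first_state)      # rollback plan: restore the snapshot
        return {"status": "rolled_back", "reason": str(exc)}
    return {"status": "applied", "reason": None}
```

The returned dictionary stands in for the transmitted result of the rollback. A real service would likely restore assets in reverse dependency order rather than swapping in a whole snapshot.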

SYSTEMS AND METHODS FOR AUTOMATICALLY APPLYING CONFIGURATION CHANGES TO COMPUTING CLUSTERS
20230236911 · 2023-07-27

A system includes a memory and a processor. The processor is configured to access one or more configuration logs generated by a computing cluster. The processor is further configured to determine, by analyzing the log messages in the one or more configuration logs, a particular service running on the computing cluster that has generated a plurality of errors. The processor is further configured to determine whether a particular error of the plurality of errors has previously occurred. The processor is further configured to, in response to determining that the particular error has previously occurred, generate and send one or more commands to the computing cluster. The one or more commands are operable to change a current configuration value for the particular service running on the computing cluster to a new configuration value. The new configuration value is based on a historical value stored in a database of historical configuration errors.
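As a rough illustration of the log-to-command flow described here, the sketch below assumes log messages are dictionaries with `service`, `level`, and `code` fields and that the historical database is a dictionary keyed by `(service, code)`. These shapes, and the name `propose_config_change`, are hypothetical.

```python
from collections import Counter


def propose_config_change(log_messages, history):
    """Return a config-change command for the noisiest service, or None.

    Hypothetical shapes: each message is {"service", "level", "code"};
    `history` maps (service, code) to {"key", "value"} taken from a
    previously recorded fix.
    """
    errors = [m for m in log_messages if m["level"] == "ERROR"]
    if not errors:
        return None
    # The service that generated the most errors is the one to examine.
    service, _ = Counter(m["service"] for m in errors).most_common(1)[0]
    code = next(m["code"] for m in errors if m["service"] == service)
    known = history.get((service, code))    # has this error occurred before?
    if known is None:
        return None
    return {"service": service, "key": known["key"], "value": known["value"]}
```

When the error has no historical entry the function returns `None`, which mirrors the abstract's condition that commands are sent only for previously seen errors.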

Insider attack resistant system and method for cloud services integrity checking

An insider attack resistant system for providing cloud services integrity checking is disclosed. In particular, the system utilizes an automated integrity checking script and virtual machines to check the integrity of a service. The system may utilize the integrity checking script and virtual machines to execute a set of operations associated with the service so as to check the integrity of the service. When executing the set of operations, the system may be limited to the minimum level of access to peripherals that is required for each operation in the set of operations to be executed. After each operation is executed, the system may log the result of that operation and analyze it to determine whether a failure exists for any of the operations. If a failure exists, the system may determine that a change in an expected system behavior associated with the service has occurred.
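The execute, log, and analyze loop can be sketched as follows. This is only an illustration: operations are assumed to be zero-argument callables paired with an expected result, and the minimum-privilege and virtual-machine aspects of the abstract are not modeled. The name `run_integrity_check` is hypothetical.

```python
def run_integrity_check(operations):
    """Run each operation, log its result, and report whether any failed.

    Hypothetical shape: `operations` is a list of (name, callable, expected).
    Returns (log, behavior_changed).
    """
    log = []
    for name, operation, expected in operations:
        try:
            actual = operation()
            passed = actual == expected
        except Exception as exc:            # an exception counts as a failure
            actual, passed = repr(exc), False
        log.append({"operation": name, "result": actual, "passed": passed})
    # Any failed operation signals a change in expected system behavior.
    behavior_changed = any(not entry["passed"] for entry in log)
    return log, behavior_changed
```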

Method of detecting faults in a fault tolerant distributed computing network system

The present disclosure provides methods for detecting faults in a distributed computing network system. The method includes receiving, from a management service, authority information identifying peer computing devices of a distributed computing network system. For each respective peer computing device, a first message comprising a first instance of a dataset and a second message comprising a second instance of the dataset are received. Where the first peer computing device and the second peer computing device have authority over the dataset, it is determined whether the first instance of the dataset matches the second instance of the dataset. Where the first instance of the dataset does not match the second instance of the dataset, a fault message is sent to the management service indicating that a fault has been detected at the first peer computing device.
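A minimal sketch of the comparison step, assuming the authority information is a dictionary from dataset identifier to the set of peers with authority over it, and each message is a dictionary with `peer` and `data` fields. The function name and message shapes are hypothetical.

```python
def check_pair(authority, dataset_id, first, second):
    """Compare two instances of a dataset; return a fault message or None.

    Hypothetical shapes: `authority` maps dataset_id -> set of peer names;
    `first` and `second` are {"peer": ..., "data": ...} messages.
    """
    authorised = authority.get(dataset_id, set())
    # Only compare when both senders have authority over the dataset.
    if first["peer"] not in authorised or second["peer"] not in authorised:
        return None
    if first["data"] == second["data"]:
        return None                         # instances match: no fault
    return {"fault_at": first["peer"], "dataset_id": dataset_id}
```

The returned dictionary stands in for the fault message that would be sent to the management service.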

TIME SERIES CLUSTERING TO TROUBLESHOOT DEVICE PROBLEMS BASED ON MISSED AND DELAYED DATA
20230236915 · 2023-07-27

The described technology is generally directed towards processing time series (e.g., device telemetry) data, including identifying missing data (gaps in the time series data) and delayed data. The time series data are converted to ternary data, e.g., zero if timely, one if delayed, or two if missing, and counts are obtained for each. If the missing and/or delayed data counts are significant, e.g., exceed a threshold percentage of the total data, the time series data indicates a problem that can be narrowed down to a more specific cause. For example, the time series data can be filtered by customer products/offers and customer locations. If a filtered dataset's ternary data are similar to the problematic data (e.g., occurring at a similar time), as determined via unsupervised clustering, the potential problem or problems can be narrowed to a potential cause based on that filtered dataset's similarity.
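The ternary conversion and threshold test can be sketched as below. The sketch assumes expected report times and actual arrival times are plain numbers in seconds and combines the delayed and missing counts into one ratio, which is only one reading of "missing and/or delayed". The function names and the `max_delay` parameter are hypothetical, and the clustering step is not shown.

```python
def to_ternary(expected_times, arrivals, max_delay):
    """Encode each expected sample as 0 (timely), 1 (delayed), or 2 (missing).

    Hypothetical shapes: `expected_times` is a list of scheduled timestamps;
    `arrivals` maps a scheduled timestamp to its actual arrival time.
    """
    codes = []
    for scheduled in expected_times:
        arrival = arrivals.get(scheduled)
        if arrival is None:
            codes.append(2)                 # gap in the time series
        elif arrival - scheduled > max_delay:
            codes.append(1)                 # arrived, but late
        else:
            codes.append(0)
    return codes


def indicates_problem(codes, threshold):
    """True when delayed plus missing samples exceed `threshold` of the total."""
    if not codes:
        return False
    return sum(1 for c in codes if c != 0) / len(codes) > threshold
```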

MACHINE LEARNING ASSISTED REMEDIATION OF NETWORKED COMPUTING FAILURE PATTERNS

Disclosed are techniques for automatically determining whether a new disruption of service alert corresponds to a pattern of failures and automatically applying remedies based on the determined pattern. Datasets of historical disruption of service alerts on networked computing clusters are used to train a machine learning algorithm to identify patterns between alerts. When a new disruption of service alert is received, historical disruption of service alerts for the originating networked computing cluster are also received and provided as input to the machine learning model. The machine learning model then automatically determines whether the new alert fits a pattern with the historical alerts from the same cluster, and when a fit is found, remedial actions are sourced from the alerts that fit the pattern to be applied automatically to the originating networked computing cluster.

Transitive tensor analysis for detection of network activities
11567816 · 2023-01-31

Described is a system for detection of network activities using transitive tensor analysis. The system divides a tensor into multiple subtensors, where the tensor represents communications on a communications network of streaming network data. Each subtensor is decomposed, separately and independently, into subtensor mode factors. Using transitive mode factor matching, orderings of the subtensor mode factors are determined. A set of subtensor factor coefficients is determined for the subtensor mode factors, and the subtensor factor coefficients are used to determine the relative weighting of the subtensor mode factors, and activity patterns represented by the subtensor mode factors are detected. Based on the detection, an alert of an anomaly is generated, indicating a location in the communications network and a time of occurrence.

Intelligent network operation platform for network fault mitigation

Embodiments of the present disclosure provide systems, methods, and computer-readable storage media that leverage artificial intelligence and machine learning to identify, diagnose, and mitigate occurrences of network faults or incidents within a network. Historical network incidents may be used to generate a model that may be used to evaluate real-time occurring network incidents, such as to identify a cause of the network incident. Clustering algorithms may be used to identify portions of the model that share similarities with a network incident and then actions taken to resolve similar network incidents in the past may be identified and proposed as candidate actions that may be executed to resolve the cause of the network incident. Execution of the candidate actions may be performed under control of a user or automatically based on execution criteria and the configuration of the fault mitigation system.

Troubleshooting for a distributed storage system by cluster wide correlation analysis
11714701 · 2023-08-01

A troubleshooting technique provides faster and more efficient troubleshooting of issues in a distributed system, such as a distributed storage system provided by a virtualized computing environment. The distributed system includes a plurality of hosts arranged in a cluster. The troubleshooting technique uses cluster-wide correlation analysis to identify potential causes of a particular issue in the distributed system, and executes workflows to remedy the particular issue.

Failure Prediction In Distributed Environments
20230023646 · 2023-01-26

Embodiments of the invention are directed to systems, methods, and devices for detecting failures in distributed systems. A failure detection platform may identify anomalies in time series data, the time series data corresponding to historical network messages. The anomalies can be labeled and used to train a first predictive model. At least one other model may be trained using the time series data, the anomaly labels, and a supervised machine-learning algorithm. A third model can be trained to identify a system failure based at least in part on the outputs provided by the first and second models. The third model, once trained, can be utilized to predict a future system failure.