
SYSTEMS AND METHODS FOR PERFORMANCE-AWARE CONTROLLER NODE SELECTION IN HIGH AVAILABILITY CONTAINERIZED ENVIRONMENT

Embodiments described herein provide for an election procedure, in a high availability (“HA”) environment, for a backup controller to assume operations performed by a master controller in the event that the master controller becomes unreachable. The master controller may be associated with (e.g., provisioned on) the same set of hardware as one or more worker nodes, and may control operation of the one or more worker nodes. The election procedure may be performed based on performance metrics, location, or efficiency metrics associated with candidate backup controllers (e.g., cloud-based backup controllers), including performance of communications between particular backup controllers and the one or more worker nodes.
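As an illustration only (not part of the disclosure), the election criterion described above can be sketched as a toy scorer that prefers the candidate backup controller with the best worst-case communication performance to the worker nodes; `Candidate`, `elect_backup`, and all latency figures are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    worker_latencies_ms: dict  # measured RTT from this candidate to each worker node

def elect_backup(candidates):
    # Prefer the candidate whose worst-case latency to any worker node is
    # lowest, i.e. the one best placed to take over the master's operations.
    return min(candidates, key=lambda c: max(c.worker_latencies_ms.values()))

candidates = [
    Candidate("cloud-a", {"worker-1": 40.0, "worker-2": 55.0}),
    Candidate("cloud-b", {"worker-1": 30.0, "worker-2": 35.0}),
]
print(elect_backup(candidates).name)  # cloud-b
```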

Honoring resource scheduler constraints during maintenances

The present disclosure describes a technique for honoring virtual machine placement constraints established on a first host in a virtualized computing environment. Upon receiving a request to migrate one or more virtual machines from the first host to a second host, and without violating the placement constraints, the technique identifies an architecture of the first host, provisions a second host with an architecture compatible with that of the first host, adds the second host to the cluster of hosts, and migrates the one or more virtual machines from the first host to the second host.
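The identify/provision/add/migrate sequence can be illustrated with a minimal sketch (hypothetical data model, not the disclosed implementation; "provisioning" is reduced to picking a compatible host from a pool):

```python
def migrate_respecting_constraints(first_host, available_hosts, cluster, vms):
    # 1. Identify the first host's architecture.
    arch = first_host["arch"]
    # 2. Provision a second host with a compatible architecture
    #    (modeled here as selecting one from a pool).
    second_host = next(h for h in available_hosts if h["arch"] == arch)
    # 3. Add the second host to the cluster.
    cluster.append(second_host)
    # 4. Migrate the requested virtual machines.
    for vm in vms:
        first_host["vms"].remove(vm)
        second_host["vms"].append(vm)
    return second_host

cluster = [{"name": "host-1", "arch": "x86_64", "vms": ["vm-a", "vm-b"]}]
pool = [{"name": "host-2", "arch": "arm64", "vms": []},
        {"name": "host-3", "arch": "x86_64", "vms": []}]
dest = migrate_respecting_constraints(cluster[0], pool, cluster, ["vm-a"])
print(dest["name"], dest["vms"])  # host-3 ['vm-a']
```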

Control cluster for multi-cluster container environments

The disclosure herein describes managing multiple clusters within a container environment using a control cluster. The control cluster includes a single deployment model that manages deployment of cluster components to a plurality of clusters at the cluster level. Changes or updates made to one cluster are automatically propagated to the other clusters in the same environment, reducing system update time across clusters. The control cluster aggregates and/or stores monitoring data for the plurality of clusters, creating a centralized data store for metrics data, log data, and other system data. The monitoring data and/or alerts are displayed on a unified dashboard via a user interface. The unified dashboard presents the clusters and their monitoring data in a single location, providing system health data and unified alerts that notify a user of issues detected across multiple clusters.
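A toy sketch of the two core behaviors described above, update propagation and monitoring aggregation (all structures and names are hypothetical, for illustration only):

```python
def propagate_update(control_state, clusters, component, version):
    # Record the change once in the control cluster's single deployment model...
    control_state["components"][component] = version
    # ...then push it automatically to every managed cluster.
    for cluster in clusters:
        cluster["components"][component] = version

def aggregate_monitoring(clusters):
    # Centralize per-cluster metrics into one store for a unified dashboard.
    return {c["name"]: c["metrics"] for c in clusters}

control = {"components": {}}
clusters = [
    {"name": "c1", "components": {"ingress": "1.0"}, "metrics": {"cpu": 0.4}},
    {"name": "c2", "components": {"ingress": "1.0"}, "metrics": {"cpu": 0.9}},
]
propagate_update(control, clusters, "ingress", "1.1")
store = aggregate_monitoring(clusters)
```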

METHOD AND SYSTEM FOR OPTIMIZING PARAMETER CONFIGURATION OF DISTRIBUTED COMPUTING JOB
20230042890 · 2023-02-09

The present disclosure relates to a method and system for optimizing a parameter configuration of a distributed computing job. The method includes: obtaining job programs of different distributed computing jobs, and determining a key parameter configuration set; obtaining a cluster status during execution of the distributed computing job, randomly generating a sample data set based on the key parameter configuration set and the cluster status, and establishing a performance prediction model; correcting the performance prediction model by using a multi-objective genetic algorithm and an optimization module configured with an optimal configuration selection strategy; obtaining a job program of a to-be-optimized distributed computing job and a cluster status during execution of the to-be-optimized distributed computing job, and determining a to-be-optimized key parameter configuration item combination; and inputting, to the performance prediction model, the to-be-optimized key parameter configuration item combination and the cluster status during execution of the to-be-optimized distributed computing job, and outputting a key parameter configuration item combination with a shortest execution time. The present disclosure can rapidly and effectively optimize the key parameter configuration.
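As a toy illustration of the final step (selecting the key parameter configuration with the shortest predicted execution time), the sketch below substitutes a hand-written cost function for the trained performance prediction model and exhaustive search for the multi-objective genetic algorithm; every name and number is hypothetical:

```python
from itertools import product

def predicted_runtime(config, cluster_load):
    # Stand-in for the trained performance prediction model: runtime falls
    # with parallelism and carries a small memory-allocation overhead.
    return cluster_load / config["parallelism"] + 0.1 * config["memory_gb"]

def best_config(candidates, cluster_load):
    # Stand-in for the genetic-algorithm search: exhaustively pick the
    # candidate with the shortest predicted execution time.
    return min(candidates, key=lambda c: predicted_runtime(c, cluster_load))

candidates = [{"parallelism": p, "memory_gb": m}
              for p, m in product([2, 4, 8], [4, 8, 16])]
best = best_config(candidates, cluster_load=100.0)
print(best)  # {'parallelism': 8, 'memory_gb': 4}
```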

SYSTEM AND METHOD OF UTILIZING THERMAL PROFILES ASSOCIATED WITH WORKLOAD EXECUTING ON INFORMATION HANDLING SYSTEMS

In one or more embodiments, one or more systems, one or more methods, and/or one or more processes may determine first thermal attribute values associated with multiple information handling systems (IHSs) with respect to a period of time as the IHSs execute a first workload; determine multiple variance ranges respectively associated with the first thermal attribute values; periodically determine second thermal attribute values associated with the IHSs as the IHSs execute a second workload; determine that a thermal attribute value of the second thermal attribute values exceeds a respective variance range of the variance ranges as a first information handling system (IHS) of the IHSs executes the second workload; generate an alert based at least on the thermal attribute value exceeding the respective variance range; and in response to the alert, transfer at least a portion of the second workload from the first IHS to a second IHS of the IHSs.
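A minimal sketch of the baseline-range / alert / transfer flow, with a simple spread-based band standing in for whatever variance computation the embodiments use (all names and readings are hypothetical):

```python
def baseline_range(samples, k=1.5):
    # Variance band around the first-workload baseline: mean +/- k * half-spread.
    mean = sum(samples) / len(samples)
    half_width = k * (max(samples) - min(samples)) / 2
    return (mean - half_width, mean + half_width)

def check_reading(reading, rng, workloads, src, dst):
    low, high = rng
    if low <= reading <= high:
        return None
    # Out of range: raise an alert and shift part of src's workload to dst.
    moved, workloads[src] = workloads[src][:1], workloads[src][1:]
    workloads[dst].extend(moved)
    return f"ALERT: {reading} outside [{low}, {high}]"

rng = baseline_range([60.0, 62.0, 64.0])        # (59.0, 65.0)
workloads = {"ihs-1": ["task-a", "task-b"], "ihs-2": []}
alert = check_reading(71.0, rng, workloads, "ihs-1", "ihs-2")
```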

Connection tracking for container cluster

Some embodiments provide a method for a module executing on a Kubernetes node in a cluster. The method retrieves data regarding ongoing connections processed by a forwarding element executing on the node. The method maps the retrieved data to Kubernetes concepts implemented in the cluster. The method exports the retrieved data along with the Kubernetes concepts to an aggregator that receives data regarding ongoing connections from a plurality of nodes in the cluster.
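The mapping step can be illustrated by enriching raw connection tuples with pod names looked up by IP; the data shapes below are hypothetical stand-ins for real conntrack entries and cluster state:

```python
def enrich_connections(conntrack_entries, pod_by_ip):
    # Map raw connection data to Kubernetes concepts (here: pod names)
    # before export to the cluster-wide aggregator.
    return [
        {**conn,
         "src_pod": pod_by_ip.get(conn["src_ip"], "external"),
         "dst_pod": pod_by_ip.get(conn["dst_ip"], "external")}
        for conn in conntrack_entries
    ]

pod_by_ip = {"10.0.0.5": "frontend-abc", "10.0.0.9": "backend-xyz"}
conns = [{"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9",
          "dst_port": 8080, "proto": "TCP"}]
exported = enrich_connections(conns, pod_by_ip)
```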

INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND COMPUTER-READABLE RECORDING MEDIUM STORING INFORMATION PROCESSING PROGRAM
20230010895 · 2023-01-12

An information processing apparatus includes: a memory; and a processor coupled to the memory and configured to: divide a job in units of computing nodes for a plurality of computing nodes; determine whether to execute scale-out or scale-in on the basis of the load when each of the computing nodes executes its portion of the divided job; execute, when scale-out is determined, the scale-out according to the division of the job in units of computing nodes; and execute, when scale-in is determined, the scale-in according to the division of the job in units of computing nodes.
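A toy sketch of the two steps, dividing a job in units of computing nodes and deciding scale-out/scale-in from the resulting load (round-robin division and fixed thresholds are illustrative assumptions, not the disclosed method):

```python
def divide_job(work_units, nodes):
    # Divide the job in units of computing nodes (round-robin here).
    shares = {n: [] for n in nodes}
    for i, unit in enumerate(work_units):
        shares[nodes[i % len(nodes)]].append(unit)
    return shares

def scale_decision(node_loads, high=0.8, low=0.3):
    # Decide scale-out or scale-in from the load each node would carry.
    avg = sum(node_loads.values()) / len(node_loads)
    if avg > high:
        return "scale-out"
    if avg < low:
        return "scale-in"
    return "hold"

shares = divide_job(list(range(6)), ["n1", "n2"])
```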

CPU CLUSTER SHARED RESOURCE MANAGEMENT

Embodiments include an asymmetric multiprocessing (AMP) system having a first central processing unit (CPU) cluster comprising a first core type, and a second CPU cluster comprising a second core type, where the AMP system can update a thread metric for a first thread running on the first CPU cluster based at least on: a past shared resource overloaded metric of the first CPU cluster, and on-core metrics of the first thread. The on-core metrics of the first thread can indicate that the first thread contributes to contention of the same shared resource corresponding to the past shared resource overloaded metric of the first CPU cluster. The AMP system can assign the first thread to a different CPU cluster while other threads of the same thread group remain assigned to the first CPU cluster. The thread metric can include a Matrix Extension (MX) thread flag or a Bus Interface Unit (BIU) thread flag.
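A highly simplified sketch of the flagging-and-reassignment logic (boolean stand-ins for the overload metric and on-core counters; cluster names are hypothetical):

```python
def update_thread_metric(cluster_overloaded_recently, thread_contends):
    # Flag the thread only when the cluster's shared resource was overloaded
    # AND the thread's on-core metrics show it contributes to that contention.
    return cluster_overloaded_recently and thread_contends

def place_thread_group(group, flagged, home="cluster-0", alternate="cluster-1"):
    # Move only flagged threads; the rest of the thread group stays
    # assigned to the home cluster.
    return {t: (alternate if t in flagged else home) for t in group}

flag = update_thread_metric(True, True)
placement = place_thread_group(["t1", "t2", "t3"], flagged={"t2"})
```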

Leader election in a distributed system based on node weight and leadership priority based on network performance

Example implementations relate to consensus protocols in a stretched network. According to an example, a method in a distributed computer system includes continuously monitoring network performance and/or network latency among a cluster of a plurality of nodes. Leadership priority for each node is set based at least in part on the monitored network performance or network latency. Each node has a vote weight based at least in part on its leadership priority, and each node's vote is biased by that vote weight. The node whose biased vote count exceeds the maximum possible biased vote count received by any other node in the cluster is selected as the leader node.
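The weighting idea can be sketched as follows, with the simplifying assumptions that priority is the inverse of measured latency and that the leader is simply the candidate with the highest weighted tally (the disclosed scheme's winning condition is stricter):

```python
def elect_leader(latency_ms, votes):
    # Lower measured latency -> higher leadership priority -> larger vote weight.
    weight = {node: 1.0 / latency_ms[node] for node in latency_ms}
    tally = {}
    for voter, candidate in votes.items():
        # Each node's vote is biased by the voting node's weight.
        tally[candidate] = tally.get(candidate, 0.0) + weight[voter]
    return max(tally, key=tally.get)

latency_ms = {"n1": 10.0, "n2": 20.0, "n3": 40.0}
votes = {"n1": "n1", "n2": "n1", "n3": "n3"}
print(elect_leader(latency_ms, votes))  # n1
```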

CONFIGURING NODES FOR DISTRIBUTED COMPUTE TASKS
20230236895 · 2023-07-27

Systems and methods are provided for improving compute job distribution using federated computing nodes. A plurality of independently controlled computing nodes is identified, and each node receives a token identifying it as authorized to participate in a federated computing node cluster. Metrics associated with a particular node are then received, and based on the received metrics one or more compute jobs are assigned to that node by assembling a compute job data packet comprising the one or more compute jobs and transmitting the assembled packet to the node. Other features are also described in which assigned compute jobs and/or unrelated compute tasks can be dynamically modified to optimize compute job completion based on the received metrics.
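The token-then-assign flow can be sketched minimally as below; the token format, the free-CPU metric, and the greedy placement rule are all illustrative assumptions rather than the claimed mechanism:

```python
def authorize_nodes(node_names):
    # Issue each independently controlled node a token marking it as
    # authorized to join the federated cluster (fixed tokens for the demo).
    return {name: f"token-{i}" for i, name in enumerate(node_names)}

def assign_jobs(jobs, metrics, tokens):
    # Pick the authorized node with the most free capacity and assemble
    # a compute job data packet addressed to it.
    node = max(tokens, key=lambda n: metrics[n]["free_cpu"])
    return {"node": node, "token": tokens[node], "jobs": jobs}

tokens = authorize_nodes(["edge-1", "edge-2"])
metrics = {"edge-1": {"free_cpu": 0.2}, "edge-2": {"free_cpu": 0.7}}
packet = assign_jobs(["job-42"], metrics, tokens)
```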