Patent classifications
G06F2209/508
Automated performance tuning using workload profiling in a distributed computing environment
Workload profiling can be used in a distributed computing environment for automatic performance tuning. For example, a computing device can receive a performance profile for a workload in a distributed computing environment. The performance profile can indicate resource usage by the workload in the distributed computing environment. The computing device can determine a performance bottleneck associated with the workload based on the resource usage specified in the performance profile. A tuning profile can be selected to reduce the performance bottleneck associated with the workload. The computing device can output a command to adjust one or more properties of the workload in accordance with the tuning profile to reduce the performance bottleneck associated with the workload.
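The flow described above can be sketched as a simple mapping from a resource-usage profile to a tuning profile. Everything here (the `PROFILES` table, the 0.8 utilization threshold, the function names) is an illustrative assumption, not the patent's actual implementation:

```python
# Candidate tuning profiles keyed by the bottleneck they address (assumed names).
PROFILES = {
    "cpu": {"cpu_shares": "increase", "parallelism": "reduce"},
    "memory": {"heap_limit": "increase", "cache_size": "reduce"},
    "io": {"read_ahead": "increase", "batch_size": "increase"},
}

def find_bottleneck(profile):
    """Return the resource with the highest utilization above an assumed 0.8 threshold."""
    candidates = {res: util for res, util in profile.items() if util > 0.8}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def select_tuning_profile(profile):
    """Select the tuning profile that targets the detected bottleneck, if any."""
    bottleneck = find_bottleneck(profile)
    return PROFILES.get(bottleneck)

# A workload whose performance profile shows heavy memory pressure.
print(select_tuning_profile({"cpu": 0.45, "memory": 0.93, "io": 0.30}))
```

The adjustment "command" in the abstract would then be derived from the selected profile's key/value pairs.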
TASK SCHEDULING METHOD FOR AUTOMATED MACHINE LEARNING
A method for scheduling a task for AutoML (Automated Machine Learning) by a terminal includes: setting a ratio of 1) a first task requiring a plurality of arithmetic devices and 2) a second task requiring one arithmetic device, in a cluster connected with the terminal; allocating a third task for the AutoML on the basis of the set ratio; receiving a request for allocation of a session from a user; inspecting whether the session is allocable on the basis of the ratio of the second task; and allocating the session to the arithmetic device associated with the second task on the basis of the ratio of the second task when the session is allocable.
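The ratio-based session check can be illustrated with a minimal scheduler sketch. The class and field names are assumptions for demonstration, not drawn from the patent claims:

```python
class AutoMLScheduler:
    """Toy scheduler: a fixed ratio reserves devices for single-device sessions."""

    def __init__(self, total_devices, second_task_ratio):
        # Devices reserved for single-device ("second") tasks per the set ratio.
        self.session_capacity = int(total_devices * second_task_ratio)
        self.sessions = 0

    def session_allocable(self):
        # A session is allocable only while ratio-derived capacity remains.
        return self.sessions < self.session_capacity

    def allocate_session(self):
        if not self.session_allocable():
            return False
        self.sessions += 1
        return True

sched = AutoMLScheduler(total_devices=8, second_task_ratio=0.25)  # 2 devices reserved
print(sched.allocate_session())  # first session fits
print(sched.allocate_session())  # second session fits
print(sched.allocate_session())  # capacity set by the ratio is exhausted
```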
DETERMINING OPTIMAL DATA ACCESS FOR DEEP LEARNING APPLICATIONS ON A CLUSTER
A computer-implemented method, a computer program product, and a computer system for determining optimal data access for deep learning applications on a cluster. A server determines candidate cache locations for one or more compute nodes in the cluster. The server fetches a mini-batch of a dataset located at a remote storage service into the candidate cache locations. The server collects information about the time periods for completing a job on the one or more nodes, where the job is executed against the fetched mini-batch at the candidate cache locations and against the mini-batch at the remote storage location. The server selects, from the candidate cache locations and the remote storage location, a cache location. The server fetches the data of the dataset from the remote storage service to the cache location, and the one or more nodes execute the job against the fetched data of the dataset at the cache location.
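The selection step amounts to benchmarking the same job against a mini-batch at each candidate location and keeping the fastest. In this hedged sketch, the location names and the simulated per-item access cost are stand-ins for real measurements:

```python
def run_job(latency_per_item, mini_batch):
    # Simulated job whose completion time is dominated by data-access latency.
    return latency_per_item * len(mini_batch)

def pick_cache_location(candidates, mini_batch):
    """Time the job at every candidate location and return the fastest one."""
    timings = {loc: run_job(lat, mini_batch) for loc, lat in candidates.items()}
    return min(timings, key=timings.get)

# Assumed per-item access costs for each candidate location plus remote storage.
candidates = {
    "node-local-ssd": 0.1,
    "cluster-nfs": 0.5,
    "remote-object-store": 2.0,
}
print(pick_cache_location(candidates, mini_batch=range(64)))
```

Only after this mini-batch trial does the full dataset get fetched to the winning location, which keeps the probing cost small relative to the training job.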
LEADER ELECTION IN A DISTRIBUTED SYSTEM BASED ON NODE WEIGHT AND LEADERSHIP PRIORITY BASED ON NETWORK PERFORMANCE
Example implementations relate to consensus protocols in a stretched network. According to an example, a distributed system continuously monitors network performance and/or network latency among a cluster of a plurality of nodes. Leadership priority for each node is set based at least in part on the monitored network performance or network latency. Each node has a vote weight based at least in part on the leadership priority of the node, and each node's vote is biased by that weight. The node whose biased vote total exceeds the maximum possible biased vote total received by any other node in the cluster is selected as the leader node.
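A minimal sketch of latency-weighted voting follows. The weighting formula (lower observed latency yields higher leadership priority, here simply the reciprocal of latency) is an assumption for illustration:

```python
def vote_weight(latency_ms):
    # Assumed priority function: lower latency -> larger vote weight.
    return 1.0 / latency_ms

def elect_leader(cluster_latencies, votes):
    """Tally votes, biasing each vote by the weight of the node that cast it."""
    tally = {}
    for voter, candidate in votes.items():
        weight = vote_weight(cluster_latencies[voter])
        tally[candidate] = tally.get(candidate, 0.0) + weight
    return max(tally, key=tally.get)

latencies = {"a": 5.0, "b": 20.0, "c": 50.0}  # monitored network latency (ms)
votes = {"a": "a", "b": "a", "c": "c"}
print(elect_leader(latencies, votes))  # the low-latency node wins
```

Under this weighting, a vote from a well-connected node counts for more, so the elected leader tends to sit near the latency center of the stretched cluster.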
PERFORMANCE EVALUATION OF AN APPLICATION BASED ON DETECTING DEGRADATION CAUSED BY OTHER COMPUTING PROCESSES
Performance degradation of an application that is caused by another computing process that shares infrastructure with the application is detected. The application and the other computing process may execute via different virtual machines hosted on the same computing device. To detect the performance degradation that is attributable to the other computing process, certain storage segments of a data storage (e.g., a cache) shared by the virtual machines are written with data. A pattern of read operations is then performed on the segments to determine whether an increase in read access time has occurred. Such an increase indicates performance degradation attributable to another computing process. After detecting the degradation, a metric that quantifies the detected degradation attributable to the other computing process is provided to an ML model, which determines the actual performance of the application absent the degradation attributable to the other computing process.
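The probe-and-compare idea can be sketched as timing a fixed pattern of reads and computing a slowdown metric against a baseline. The function names and the fractional-slowdown metric are illustrative assumptions:

```python
import time

def probe_read_time(read_fn, segments, repeats=3):
    """Time a fixed pattern of reads over the written cache segments."""
    start = time.perf_counter()
    for _ in range(repeats):
        for segment in segments:
            read_fn(segment)
    return time.perf_counter() - start

def degradation_metric(baseline_time, observed_time):
    # Fractional slowdown attributable to the interfering process;
    # this is the value that would be fed to the ML model.
    return max(0.0, (observed_time - baseline_time) / baseline_time)

# Assumed timings: reads took 1.4s where an uncontended baseline took 1.0s.
print(round(degradation_metric(1.0, 1.4), 2))
```

A rise in the metric across probe runs suggests a co-located process is evicting the shared cache; the ML model then uses it to estimate the application's performance absent that interference.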
DATA LOCALITY FOR BIG DATA ON KUBERNETES
Controlling data locality in a Kubernetes computing environment by establishing a Kubernetes computing environment including a controller and at least one executor pod for running an application, and receiving a request for a task to be run in the Kubernetes computing environment. The controller dispatches a sidecar to collect resource data from the at least one executor pod as an input to a directed acyclic graph (DAG) feature analyzer. The DAG feature analyzer identifies, from the at least one executor pod, the best dynamic resource available to execute the task. The executor pod meeting the best dynamic resource that is available executes the task in the Kubernetes computing environment.
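The analyzer's pod-selection step can be sketched as scoring executor pods on data locality plus free resources. The scoring weights and pod fields below are assumptions, not the patent's actual analyzer:

```python
def score(pod, task):
    # Assumed scoring: strongly prefer pods already holding the task's data
    # (locality), then prefer more free CPU and memory.
    locality = 3.0 if task["data"] in pod["local_data"] else 0.0
    return locality + pod["free_cpu"] + pod["free_mem_gb"] / 10.0

def best_executor(pods, task):
    """Return the name of the executor pod with the best available resources."""
    return max(pods, key=lambda pod: score(pod, task))["name"]

pods = [
    {"name": "exec-1", "free_cpu": 1.0, "free_mem_gb": 4, "local_data": {"blockA"}},
    {"name": "exec-2", "free_cpu": 3.0, "free_mem_gb": 8, "local_data": set()},
]
print(best_executor(pods, {"data": "blockA"}))  # locality outweighs raw capacity
```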
CLOUD APPLICATION THRESHOLD BASED THROTTLING
Systems and methods are provided for intercepting computing requests and modifying the execution timing thereof based on thresholds and minimum performance criteria and/or adjusting hosted services plans in order to monitor and control costs of hosting software applications on hosted provider computing resources.
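An interception layer of this kind can be sketched as a throttler that always honors a minimum performance criterion and defers requests once a cost threshold is reached. The cost model and class interface are illustrative assumptions:

```python
class Throttler:
    """Toy request interceptor with a cost threshold and a minimum-service floor."""

    def __init__(self, cost_threshold, min_requests_per_window):
        self.cost_threshold = cost_threshold
        self.min_allowed = min_requests_per_window  # minimum performance criterion
        self.cost = 0.0
        self.served = 0

    def intercept(self, request_cost):
        # Serve up to the minimum unconditionally; beyond that, defer any
        # request that would push hosting cost past the threshold.
        within_minimum = self.served < self.min_allowed
        within_budget = self.cost + request_cost <= self.cost_threshold
        if within_minimum or within_budget:
            self.cost += request_cost
            self.served += 1
            return "execute"
        return "defer"

throttler = Throttler(cost_threshold=10.0, min_requests_per_window=2)
print([throttler.intercept(6.0) for _ in range(4)])
```

In practice "defer" would translate into delayed execution timing or a hosted-plan adjustment rather than an outright rejection.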
PERFORMANCE OVERHEAD OPTIMIZATION IN GPU SCOPING
The present disclosure relates to methods and devices for graphics processing including an apparatus, e.g., a GPU. The apparatus may process a first workload of a plurality of workloads at each of multiple clusters in a GPU pipeline. The apparatus may also increment a plurality of performance counters during the processing of the first workload at each of the multiple clusters. Further, the apparatus may determine, at each of the multiple clusters, whether the first workload is finished processing. The apparatus may also read, upon determining that the first workload is finished processing, a value of each of the multiple clusters for each of the plurality of performance counters. Additionally, the apparatus may transmit an indication of the read value of each of the multiple clusters for all of the plurality of performance counters.
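The counter lifecycle described above (increment during processing, read only once the workload is finished at each cluster) can be modeled in a short sketch; the cluster names and counter set are assumptions:

```python
class Cluster:
    """Toy GPU pipeline cluster with per-cluster performance counters."""

    def __init__(self, name):
        self.name = name
        self.counters = {"cycles": 0, "stalls": 0}
        self.busy = False

    def process(self, workload):
        self.busy = True
        # Counters are incremented while the workload is processed here.
        self.counters["cycles"] += workload["cycles"]
        self.counters["stalls"] += workload["stalls"]
        self.busy = False  # workload finished processing at this cluster

def read_counters(clusters):
    # Read counter values only from clusters that finished the workload.
    return {c.name: dict(c.counters) for c in clusters if not c.busy}

pipeline = [Cluster("front_end"), Cluster("shader"), Cluster("back_end")]
for cluster in pipeline:
    cluster.process({"cycles": 100, "stalls": 7})
print(read_counters(pipeline)["shader"])
```

The transmitted "indication" in the abstract would be the dictionary returned by `read_counters`, gathered for all clusters and counters.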
Autoscaling nodes of a stateful application based on role-based autoscaling policies
Example implementations relate to a role-based autoscaling approach for scaling of nodes of a stateful application in a large scale virtual data processing (LSVDP) environment. Information is received regarding a role performed by the nodes of a virtual cluster of an LSVDP environment on which a stateful application is or will be deployed. Role-based autoscaling policies are maintained defining conditions under which the roles are to be scaled. A policy for a first role upon which a second role is dependent specifies a condition for scaling out the first role by a first step and a second step by which the second role is to be scaled out in tandem. When load information for the first role meets the condition, nodes in the virtual cluster that perform the first role are increased by the first step and nodes that perform the second role are increased by the second step.
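The tandem scale-out rule can be sketched as a policy table keyed by role. The policy structure, role names, and the 0.75 load trigger below are assumptions for illustration:

```python
# Role-based autoscaling policies (assumed shape): a condition for scaling
# out the first role by a first step, and the step by which the dependent
# second role is scaled out in tandem.
policies = {
    "worker": {
        "condition": lambda load: load > 0.75,
        "step": 2,
        "tandem": {"role": "datanode", "step": 1},
    }
}

def autoscale(role, load, node_counts):
    """Scale out a role and its dependent role when its load condition is met."""
    policy = policies[role]
    if policy["condition"](load):
        node_counts[role] += policy["step"]
        tandem = policy["tandem"]
        node_counts[tandem["role"]] += tandem["step"]
    return node_counts

counts = {"worker": 4, "datanode": 3}
print(autoscale("worker", load=0.9, node_counts=counts))
```

Scaling the dependent role in the same step avoids the window where new workers exist without the storage-role capacity they rely on.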
METHOD AND SYSTEM FOR PERFORMING PREDICTIVE COMPOSITIONS FOR COMPOSED INFORMATION HANDLING SYSTEMS USING TELEMETRY DATA
Techniques described herein relate to a method for managing composed information handling systems. The method includes obtaining, by a system control processor manager, a composition request for a composed information handling system to perform a workflow; in response to obtaining the composition request: identifying a composed system blueprint associated with the workflow; making a first determination that there are first predictive analytics associated with the composed system blueprint; in response to the first determination: identifying a composed infrastructure associated with the composed system blueprint capable of performing the workflow based on telemetry data and the first predictive analytics; instantiating a composed information handling system using the composed infrastructure to service the composition request; and setting up telemetry services for the composed information handling system using at least one control resource set.
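The blueprint-matching and analytics-filtering steps can be sketched as follows. Every name here (the blueprint table, `forecast_util`, the 0.8 saturation cutoff) is a stand-in assumption, not the patent's actual interface:

```python
# Assumed blueprint catalog mapping workflows to resource requirements.
blueprints = {
    "video-transcode": {"min_cpus": 16, "min_gpus": 1},
}

def predicted_ok(infra, analytics):
    # Predictive analytics step: skip infrastructure whose telemetry-based
    # forecast says it will be saturated (assumed 0.8 utilization cutoff).
    return analytics.get(infra["name"], {}).get("forecast_util", 0.0) < 0.8

def compose(workflow, inventory, analytics):
    """Pick infrastructure matching the workflow's blueprint and its forecast."""
    blueprint = blueprints[workflow]
    for infra in inventory:
        meets_blueprint = (infra["cpus"] >= blueprint["min_cpus"]
                           and infra["gpus"] >= blueprint["min_gpus"])
        if meets_blueprint and predicted_ok(infra, analytics):
            return infra["name"]
    return None

inventory = [
    {"name": "rack-1", "cpus": 32, "gpus": 2},
    {"name": "rack-2", "cpus": 32, "gpus": 2},
]
analytics = {"rack-1": {"forecast_util": 0.95}}  # telemetry predicts saturation
print(compose("video-transcode", inventory, analytics))
```

Instantiation and telemetry setup would then proceed against the returned infrastructure.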