G06F9/5072

Budgeting open blocks based on power loss protection

A storage system has zones in solid-state storage memory, with power loss protection. The system identifies portions of data for processes that utilize power loss protection. The system determines to activate or deactivate power loss protection for the portions of data for the processes. The system tracks activation and deactivation of power loss protection in zones in the solid-state storage memory, in accordance with the portions of data having power loss protection activated or deactivated.

Automated orchestration of containers by assessing microservices

Performing container scaling and migration for container-based microservices is provided. A first set of features is extracted from each respective microservice of a plurality of different microservices. A number of containers required at a future point in time for each respective microservice of the plurality of different microservices is predicted using a trained forecasting model and the first set of features extracted from each respective microservice. A scaling label and a scaling value are assigned to each respective microservice of the plurality of different microservices based on a predicted change in a current number of containers corresponding to each respective microservice according to the number of containers required at the future point in time for each respective microservice. The current number of containers corresponding to each respective microservice of the plurality of different microservices is adjusted based on the scaling label and the scaling value assigned to each respective microservice.
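
The predict, label, and adjust steps described above can be sketched roughly as follows. This is an illustrative sketch only, not the claimed method: `naive_forecast` is a hypothetical stand-in for the trained forecasting model, a history of container counts stands in for the extracted features, and the label names `scale_up`, `scale_down`, and `hold` are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    label: str  # assumed labels: "scale_up", "scale_down", "hold"
    value: int  # number of containers to add or remove


def naive_forecast(history):
    """Hypothetical stand-in for the trained model: last count plus the mean step."""
    if len(history) < 2:
        return history[-1]
    steps = [b - a for a, b in zip(history, history[1:])]
    return max(0, math.ceil(history[-1] + sum(steps) / len(steps)))


def assign_scaling(current, predicted):
    """Derive a scaling label and value from the predicted change in container count."""
    delta = predicted - current
    if delta > 0:
        return ScalingDecision("scale_up", delta)
    if delta < 0:
        return ScalingDecision("scale_down", -delta)
    return ScalingDecision("hold", 0)


def adjust(current, decision):
    """Apply the label and value to the current container count."""
    if decision.label == "scale_up":
        return current + decision.value
    if decision.label == "scale_down":
        return current - decision.value
    return current
```

In use, each microservice would be run through `naive_forecast`, `assign_scaling`, and `adjust` in turn; a real system would replace the forecast with its own trained model.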

Determining optimal placements of workloads on multiple platforms as a service in response to a triggering event

A computer-implemented method, a computer program product, and a computer system for placements of workloads in a system of multiple platforms as a service. A computer detects a triggering event for modifying a matrix that pairs respective workloads on respective platforms and includes attributes of running respective workloads on respective platforms. The computer recalculates the attributes in the matrix, in response to the triggering event being detected. The computer determines optimal placements of the respective workloads on the respective platforms, based on information in the matrix. The computer places the respective workloads on the respective platforms, based on the optimal placements.
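
One way to picture the matrix and the placement step is the sketch below. It assumes, purely for illustration, that each (workload, platform) pair carries a single `cost` attribute, that each platform has a hypothetical capacity limit, and that "optimal" means lowest total cost; the abstract does not specify any of these. The search is exhaustive, so it only suits small examples.

```python
from itertools import product


def optimal_placements(matrix, capacity):
    """Return the lowest-cost {workload: platform} mapping, or None if none is feasible.

    matrix:   {(workload, platform): {"cost": float}}  (assumed attribute)
    capacity: {platform: max number of workloads}      (assumed constraint)
    """
    workloads = sorted({w for w, _ in matrix})
    platforms = sorted({p for _, p in matrix})
    best, best_cost = None, float("inf")
    for choice in product(platforms, repeat=len(workloads)):
        pairs = list(zip(workloads, choice))
        if any(pair not in matrix for pair in pairs):
            continue  # this workload cannot run on this platform
        if any(choice.count(p) > capacity[p] for p in platforms):
            continue  # platform over capacity
        cost = sum(matrix[pair]["cost"] for pair in pairs)
        if cost < best_cost:
            best, best_cost = dict(pairs), cost
    return best
```

A triggering event would then correspond to updating entries in `matrix` and calling `optimal_placements` again.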

Persistently available container services through resurrection of user jobs in new compute container instances designated as lead instances

A method makes container services persistently available. A computing device receives a request for implementation of a user job in a container environment, and assigns the user job to a compute runner agent of a plurality of compute runner agents to execute the user job. Each compute runner agent is associated with a compute container instance having a unique compute container identifier corresponding to the user job. A computing device assigns the user job to a balancer task to monitor progress of the user job, and assigns the user job to a storage agent to store artifacts associated with running the user job. A computing device receives a notification from the balancer task describing whether the runner agent is correctly running the user job. In response to the runner agent incorrectly running the user job, a computing device resurrects the user job in a new compute container instance.
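
The resurrection step can be illustrated with a small sketch. All names here (`ComputeContainerInstance`, `on_balancer_notification`) are hypothetical, the balancer notification is reduced to a boolean, and the storage agent and artifact handling are omitted.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ComputeContainerInstance:
    job_id: str
    # Unique compute container identifier tied to the user job.
    container_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def on_balancer_notification(instances, job_id, running_correctly):
    """React to a balancer notification for one job.

    instances maps job_id -> ComputeContainerInstance. If the job is not
    running correctly, it is resurrected in a fresh instance with a new
    container identifier, which replaces the old one in the mapping.
    """
    if running_correctly:
        return instances[job_id]
    replacement = ComputeContainerInstance(job_id=job_id)
    instances[job_id] = replacement
    return replacement
```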

MACHINE LEARNING WORKLOAD ORCHESTRATION IN HETEROGENEOUS CLUSTERS

Systems and methods are described herein to orchestrate the execution of an application, such as a machine learning or artificial intelligence application, using distributed compute clusters with heterogeneous compute resources. A discovery subsystem may identify the different compute resources of each compute cluster. The application is divided into a plurality of workloads with each workload associated with resource demands corresponding to the compute resources of one of the compute clusters. Adaptive modeling allows for hyperparameters to be defined for each workload based on the compute resources associated with the compute cluster to which each respective workload is assigned and the associated dataset.

AUTOMATED SERVICES EXCHANGE
20230015524 · 2023-01-19

Methods, apparatus, and processor-readable storage media for providing an automated services exchange are described herein. An example computer-implemented method includes obtaining provider requests from one or more service providers, wherein each of the provider requests comprises an indication of at least one type of service provided by the corresponding service provider and attributes associated with the at least one type of the service; processing the provider requests, wherein the processing for a respective one of the provider requests comprises generating a corresponding set of metrics associated with the at least one type of service and the attributes of the respective provider request; and matching a given one of the provider requests to at least one consumer request based at least in part on: the processing and constraints identified in the at least one consumer request with respect to at least a portion of the attributes of the given provider request.
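
The process-then-match flow can be sketched as below. The request layout, the `price_per_unit` metric, and the use of (low, high) ranges for constraints are all assumptions made for illustration; the abstract does not define them.

```python
def process(provider):
    """Generate metrics for a provider request (hypothetical metric: price per unit)."""
    attrs = provider["attributes"]
    return {"price_per_unit": attrs["price"] / attrs["capacity"]}


def satisfies(provider, consumer):
    """Check a consumer's constraints against the provider's attributes and metrics."""
    if provider["service_type"] != consumer["service_type"]:
        return False
    values = {**provider["attributes"], **process(provider)}
    return all(
        name in values and low <= values[name] <= high
        for name, (low, high) in consumer["constraints"].items()
    )


def match(provider, consumers):
    """Return the ids of the consumer requests this provider request can be matched to."""
    return [c["id"] for c in consumers if satisfies(provider, c)]
```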

DETERMINING OPTIMAL DATA ACCESS FOR DEEP LEARNING APPLICATIONS ON A CLUSTER

A computer-implemented method, a computer program product, and a computer system for determining optimal data access for deep learning applications on a cluster. A server determines candidate cache locations for one or more compute nodes in the cluster. The server fetches a mini-batch of a dataset located at a remote storage service into the candidate cache locations. The server collects information about time periods of completing a job on the one or more nodes, where the job is executed against the fetched mini-batch at the candidate cache locations and against the mini-batch at the remote storage location. The server selects, from the candidate cache locations and the remote storage location, a cache location. The server fetches the data of the dataset from the remote storage service to the cache location, and the one or more nodes execute the job against the fetched data of the dataset at the cache location.
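
The selection step amounts to timing the same job at each location and keeping the fastest one. The sketch below assumes that "select" means picking the shortest measured time, which the abstract does not state; `timed` and `select_location` are hypothetical helper names.

```python
import time


def timed(job):
    """Wrap a job so that calling it with a location returns elapsed seconds."""
    def measure(location):
        start = time.perf_counter()
        job(location)
        return time.perf_counter() - start
    return measure


def select_location(measure, candidate_caches, remote):
    """Time the mini-batch job at every candidate cache and at the remote
    location, then return the fastest location together with all timings."""
    timings = {loc: measure(loc) for loc in [*candidate_caches, remote]}
    return min(timings, key=timings.get), timings
```

Passing `measure` in as a parameter keeps the selection logic independent of how the job is actually run and timed.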

MULTI-REGION DEPLOYMENT OF JOBS IN A FEDERATED CLOUD INFRASTRUCTURE
20230014635 · 2023-01-19

A system and method for multi-region deployment of application jobs in a federated cloud computing infrastructure. A job is received for execution in two or more regions of the federated cloud computing infrastructure, each of the two or more regions comprising a collection of servers joined in a raft group for separate, regional execution of the job. A copy of the job is generated for each of the two or more regions. A workload orchestrator then deploys the job to the two or more regions according to a deployment plan. A state indication is received from each of the two or more regions, the state indication representing a state of completion of the job by each respective region of the multi-cloud computing infrastructure.

DYNAMIC CROSS-ARCHITECTURE APPLICATION ADAPTION
20230014741 · 2023-01-19

Embodiments described herein are generally directed to improving performance of high-performance computing (HPC) or artificial intelligence (AI) workloads on cluster computer systems. According to one embodiment, a section of an HPC or AI workload executing on a cluster computer system is identified as significant to a figure of merit (FOM) of the workload. An alternate placement among multiple heterogeneous compute resources of a node of the cluster computer system is determined for a portion of the section currently executing on a given compute resource of the multiple heterogeneous compute resources. After predicting an improvement to the FOM based on the alternate placement, the portion is relocated to the alternate placement.
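
The relocation decision can be sketched as a comparison of predicted FOM values. The sketch assumes that a higher FOM is better and that a hypothetical `predict_fom` callable supplies the prediction for each compute resource; neither detail comes from the abstract.

```python
def best_alternate(current, resources, predict_fom):
    """Return the resource with the highest predicted FOM if it beats the
    current placement, otherwise None (meaning: do not relocate)."""
    baseline = predict_fom(current)
    candidate = max(
        (r for r in resources if r != current),
        key=predict_fom,
        default=None,
    )
    if candidate is not None and predict_fom(candidate) > baseline:
        return candidate
    return None
```

For example, with predicted values of 1.0 for a CPU and 1.8 for a GPU, a portion currently on the CPU would be relocated to the GPU, while a portion already on the GPU would stay put.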

System and Method for a Workload Management and Scheduling Module to Manage Access to a Compute Environment According to Local and Non-Local User Identity Information
20230222003 · 2023-07-13

A system, method and computer-readable media for managing a compute environment are disclosed. The method includes importing identity information from an identity manager into a module that performs workload management and scheduling for a compute environment and, unless a conflict exists, modifying the behavior of the workload management and scheduling module to incorporate the imported identity information such that access to and use of the compute environment occurs according to the imported identity information. The compute environment may be a cluster or a grid wherein multiple compute environments communicate with multiple identity managers.
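
The "unless a conflict exists" rule can be illustrated with a short sketch. It assumes, for illustration only, that identity information is a mapping from user name to settings and that a conflict means the scheduler already holds a different entry for the same user; `import_identity` is a hypothetical name.

```python
def import_identity(policy, imported):
    """Merge imported identity information into the scheduler's policy.

    Returns (new_policy, conflicts). If any imported entry disagrees with an
    existing one, the policy is left unchanged and the conflicting keys are
    reported instead.
    """
    conflicts = {k for k, v in imported.items() if k in policy and policy[k] != v}
    if conflicts:
        return dict(policy), conflicts
    return {**policy, **imported}, conflicts
```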