G06F11/3017

Systems and methods for orchestrating seamless, distributed, and stateful high performance computing

An orchestration system may provide distributed and seamless stateful high performance computing for performance critical workflows and data across geographically distributed compute nodes. The system may receive a task with different jobs that operate on a particular dataset, may determine a set of policies that define execution priorities for the jobs, and may determine a current state of compute nodes that are distributed across different compute sites. The system may distribute the jobs across a selected set of the compute nodes in response to the current state of the set of compute nodes satisfying more of the execution priorities than the current state of other compute nodes. The system may produce task output based on modifications made to the particular dataset as each compute node of the set of compute nodes executes a different job of the plurality of jobs.
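The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the node names, the metrics (`latency_ms`, `free_gpus`, `cost`), and the representation of each execution priority as a boolean predicate over a node's state are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a node's current state is a dict of metrics,
# and each policy (execution priority) is a predicate over that state.
NodeState = Dict[str, float]
Policy = Callable[[NodeState], bool]


def score(state: NodeState, policies: List[Policy]) -> int:
    """Count how many execution priorities a node's current state satisfies."""
    return sum(1 for policy in policies if policy(state))


def select_nodes(states: Dict[str, NodeState], policies: List[Policy], count: int) -> List[str]:
    """Pick the `count` nodes that satisfy the most priorities (ties broken by name)."""
    ranked = sorted(states, key=lambda name: (-score(states[name], policies), name))
    return ranked[:count]


def distribute(jobs: List[str], nodes: List[str]) -> Dict[str, str]:
    """Assign each job to a different selected node."""
    if len(jobs) > len(nodes):
        raise ValueError("not enough selected nodes for the jobs")
    return dict(zip(jobs, nodes))


# Illustrative data only.
states = {
    "site-a/node-1": {"latency_ms": 5, "free_gpus": 4, "cost": 3.0},
    "site-a/node-2": {"latency_ms": 40, "free_gpus": 0, "cost": 1.0},
    "site-b/node-1": {"latency_ms": 12, "free_gpus": 2, "cost": 2.0},
}
policies = [
    lambda s: s["latency_ms"] < 20,  # prefer low latency
    lambda s: s["free_gpus"] >= 1,   # prefer nodes with a free accelerator
    lambda s: s["cost"] <= 2.0,      # prefer cheaper nodes
]
assignment = distribute(["render", "encode"], select_nodes(states, policies, 2))
```

Here `site-b/node-1` satisfies all three priorities and `site-a/node-1` satisfies two, so those two nodes receive the jobs.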

SYSTEM AND METHOD FOR ENHANCING THE EFFICIENCY OF MAINFRAME OPERATIONS
20210141704 · 2021-05-13

A method includes monitoring a job being executed at a source mainframe, where the job comprises multiple tasks. The method includes monitoring a particular task of the multiple tasks being executed at the source mainframe and determining an application required to execute the particular task. In response to determining that the particular task requires an application to execute, the method includes determining a target mainframe where the application is installed. The method further includes validating the environment of the target mainframe to confirm that the particular task can be executed using the target mainframe and, upon validating the target mainframe, redirecting the particular task to the target mainframe for execution. The method also includes monitoring the particular task being executed at the target mainframe and returning the results of the particular task from the target mainframe to the source mainframe.

MONITORING LONG RUNNING WORKFLOWS FOR ROBOTIC PROCESS AUTOMATION
20210133078 · 2021-05-06

Systems and methods for monitoring a robotic process automation (RPA) system are provided. Job execution data for one or more jobs in the RPA system is determined based on logs of the RPA system. The job execution data for the one or more jobs in the RPA system is caused to be displayed in substantially real time.

SYSTEMS AND METHODS FOR SIMULATING WORST-CASE CONTENTION TO DETERMINE WORST-CASE EXECUTION TIME OF APPLICATIONS EXECUTED ON A PROCESSOR

Techniques for determining worst-case execution time for at least one application under test are disclosed using memory thrashing. Memory thrashing simulates shared resource interference. Memory that is thrashed includes mapped memory, and optionally shared cache memory.

DIAGNOSING SLOW TASKS IN DISTRIBUTED COMPUTING
20210049047 · 2021-02-18

Machine learning is utilized to analyze respective execution times of a plurality of tasks in a job performed in a distributed computing system to determine that a subset of the plurality of tasks are straggler tasks in the job, where the distributed computing system includes a plurality of computing devices. A supervised machine-learning algorithm is performed using a set of inputs including performance attributes of the plurality of tasks, where the supervised machine learning algorithm uses labels generated from determination of the set of straggler tasks, the performance attributes include respective attributes of the plurality of tasks observed during performance of the job, and applying the supervised learning algorithm results in identification of a set of rules defining conditions, based on the performance attributes of the plurality of tasks, indicative of which tasks will be straggler tasks in a job. Rule data is generated to describe the set of rules.
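The two stages above (labeling stragglers, then learning rules from performance attributes) can be illustrated with a small sketch. The abstract does not name the algorithms used, so both choices here are stand-ins: stragglers are labeled with a simple median-based cutoff (the factor 1.5 is an assumption), and the supervised learner is a one-level decision stump. The attribute names `input_mb` and `gc_ms` are hypothetical.

```python
from statistics import median


def label_stragglers(times, factor=1.5):
    """Label a task as a straggler if its time exceeds factor * median (assumed cutoff)."""
    cutoff = factor * median(times)
    return [t > cutoff for t in times]


def learn_stump(rows, labels):
    """Find the single rule `attribute > threshold => straggler` with the best accuracy.

    Returns (attribute, threshold, accuracy). A stand-in for whatever
    supervised algorithm an actual system would use.
    """
    best = None
    for attr in rows[0]:
        for threshold in sorted({row[attr] for row in rows}):
            predictions = [row[attr] > threshold for row in rows]
            accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
            if best is None or accuracy > best[2]:
                best = (attr, threshold, accuracy)
    return best


# Illustrative data only: execution times (seconds) and per-task attributes.
times = [10, 11, 9, 30, 12, 28]
rows = [
    {"input_mb": 100, "gc_ms": 5},
    {"input_mb": 120, "gc_ms": 7},
    {"input_mb": 90, "gc_ms": 4},
    {"input_mb": 110, "gc_ms": 60},
    {"input_mb": 130, "gc_ms": 6},
    {"input_mb": 95, "gc_ms": 55},
]
labels = label_stragglers(times)
rule = learn_stump(rows, labels)
print(f"rule: {rule[0]} > {rule[1]} => straggler (accuracy {rule[2]:.2f})")
```

On this toy data the learned rule is `gc_ms > 7`, which separates the two slow tasks from the rest. The rule tuple plays the role of the "rule data" mentioned in the abstract.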

Automated performance debugging of production applications
10915425 · 2021-02-09

Performance anomalies in production applications can be analyzed to determine the dynamic behavior over time of hosting processes on the same or different computers. Problematic call sites (call sites that are performance bottlenecks or that are causing hangs) can be identified. Instead of relying on static code analysis and development phase load testing to identify a performance bottleneck or application hang, a lightweight sampling strategy collects predicates representing key performance data in production scenarios. Performance predicates provide information about the subject (e.g., what the performance issue is, what caused the performance issue, etc.). The data can be fed into a model based on a decision tree to identify critical threads running the problematic call sites. The results along with the key performance data can be used to build a call graph prefix binary tree for analyzing call stack patterns. Data collection, analysis and visualization of results can be performed.

Dynamic thread mapping

In one example, a central processing unit (CPU) with dynamic thread mapping includes a set of multiple cores, each with a set of multiple threads. For each of the multiple threads, a set of registers monitors, for in-flight memory requests, the number of loads from and stores to at least a first memory interface and a second memory interface by the respective thread. The second memory interface has a greater latency than the first memory interface. The CPU further has logic to map and migrate each thread to respective CPU cores such that the number of cores accessing only one of the at least first and second memory interfaces is maximized.
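The mapping idea can be illustrated in software with a greedy grouping heuristic. This is only a sketch under assumptions: the abstract describes hardware registers and logic, whereas here the per-thread counters are a plain dictionary, each thread is classified by its dominant interface, and the packing is a simple heuristic that is not claimed to be optimal. All names are hypothetical.

```python
from math import ceil


def dominant_interface(counts):
    """Classify a thread by which interface it accessed more: returns 1 or 2."""
    first, second = counts
    return 1 if first >= second else 2


def map_threads(counters, num_cores, threads_per_core):
    """Group threads with the same dominant interface onto the same cores.

    counters maps thread id -> (accesses to interface 1, accesses to interface 2).
    Returns a dict of core index -> list of thread ids.
    """
    if len(counters) > num_cores * threads_per_core:
        raise ValueError("more threads than hardware thread slots")
    groups = {1: [], 2: []}
    for tid, counts in sorted(counters.items()):
        groups[dominant_interface(counts)].append(tid)

    k = threads_per_core
    cores_needed = sum(ceil(len(g) / k) for g in groups.values())
    if cores_needed <= num_cores:
        # Enough cores to keep the groups apart: no core mixes interfaces.
        chunks = [g[i:i + k] for g in groups.values() for i in range(0, len(g), k)]
    else:
        # Fall back to sequential packing: at most one core mixes interfaces.
        ordered = groups[1] + groups[2]
        chunks = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    return dict(enumerate(chunks))


# Illustrative counters only: (interface 1 accesses, interface 2 accesses).
counters = {"t0": (90, 3), "t1": (2, 70), "t2": (50, 0), "t3": (1, 40)}
mapping = map_threads(counters, num_cores=2, threads_per_core=2)
```

With these counters, `t0` and `t2` share core 0 (interface 1) and `t1` and `t3` share core 1 (interface 2), so both cores touch a single interface.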

Automated SLA non-compliance detection and prevention system for batch jobs

A method and system are disclosed herein for detecting one or more violations in managing service level agreements (SLAs) in information technology service management (ITSM). A batch job system is characterized by a set of jobs and the dependencies between jobs. Each job is in turn characterized by run-time, from-time and SLA definitions. SLAs can be of two kinds: Start-time and End-time. A Start-time SLA requires that the job execution starts before the specified time, while an End-time SLA requires that the job finishes its execution before the specified time. To optimize the processing time required for executing one or more batch jobs, the disclosure identifies SLA violations and solves them to produce a set of actionable levers.
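The detection half of this description can be sketched as a small scheduler over the job dependency graph. This is an illustration under assumptions, not the disclosed method: times are minutes on an arbitrary clock, a job starts at the later of its from-time and the finish of its last dependency, and starting or finishing exactly at the SLA time is treated as compliant. The job names and field names are hypothetical, and the step that resolves violations into "actionable levers" is not covered.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule(jobs):
    """Compute (start, end) per job, processing jobs in dependency order."""
    order = TopologicalSorter({name: job["deps"] for name, job in jobs.items()}).static_order()
    times = {}
    for name in order:
        job = jobs[name]
        # Earliest start: own from-time, or when the slowest dependency ends.
        start = max([job["from_time"]] + [times[dep][1] for dep in job["deps"]])
        times[name] = (start, start + job["run_time"])
    return times


def find_violations(jobs, times):
    """Return (job, sla kind, actual time, sla time) for every missed SLA."""
    violations = []
    for name, job in jobs.items():
        kind, deadline = job["sla"]
        actual = times[name][0] if kind == "start" else times[name][1]
        if actual > deadline:
            violations.append((name, kind, actual, deadline))
    return violations


# Illustrative batch only.
jobs = {
    "extract": {"run_time": 30, "from_time": 0, "deps": [], "sla": ("end", 40)},
    "transform": {"run_time": 45, "from_time": 20, "deps": ["extract"], "sla": ("start", 25)},
    "load": {"run_time": 10, "from_time": 0, "deps": ["transform"], "sla": ("end", 90)},
}
times = schedule(jobs)
violations = find_violations(jobs, times)
```

In this example `transform` cannot start until `extract` ends at minute 30, so it misses its Start-time SLA of 25, while the two End-time SLAs are met.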

MULTIPROCESSOR DEVICE
20210089310 · 2021-03-25

A multiprocessor device includes a first processor and a second processor, wherein the multiprocessor device is configured to cause, when debugging of the first processor is to be performed by using the second processor, the second processor to refer to a value of a program counter of the first processor and fetch an instruction from a memory by using the value obtained from the program counter.

INDEXING AND REPLAYING TIME-TRAVEL TRACES USING DIFFGRAMS
20210081299 · 2021-03-18

Utilizing diffgrams for trace indexing and replay. A subset of instructions of a trace, beginning with a first instruction and ending with a second instruction, are replayed to obtain state of one or more named resources. Based on replaying the subset of instructions, a diffgram is generated, which is structured such that addition of the diffgram at the first instruction brings the one or more named resources to the second state, and subtraction of the diffgram at the second instruction brings the one or more named resources to the first state. As part of reaching a target instruction, the diffgram is later added at the first instruction to restore the second state at the second instruction, or subtracted at the second instruction to restore the first state at the first instruction.
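The add/subtract property described above can be illustrated with a toy model. This is a sketch under assumptions, not the actual trace format: a trace is modeled as a list of `(resource, value)` writes, state is a dictionary of named resources, and a diffgram stores a `(before, after)` pair for each resource that changed. The resource names are hypothetical, and both states are assumed to contain the same resources.

```python
def replay(state, writes):
    """Replay a list of (resource, value) writes on a copy of the state."""
    result = dict(state)
    for resource, value in writes:
        result[resource] = value
    return result


def make_diffgram(first, second):
    """Record (before, after) for every named resource whose value changed."""
    return {k: (first[k], second[k]) for k in first if first[k] != second[k]}


def add(state, diffgram):
    """Apply at the first instruction to jump forward to the second state."""
    result = dict(state)
    for resource, (_, after) in diffgram.items():
        result[resource] = after
    return result


def subtract(state, diffgram):
    """Apply at the second instruction to jump back to the first state."""
    result = dict(state)
    for resource, (before, _) in diffgram.items():
        result[resource] = before
    return result


# Illustrative trace segment only.
first_state = {"rax": 1, "rbx": 7, "mem[0x10]": 0}
segment = [("rax", 3), ("mem[0x10]", 42), ("rax", 5)]
second_state = replay(first_state, segment)
diffgram = make_diffgram(first_state, second_state)

forward = add(first_state, diffgram)        # skips replaying the segment
backward = subtract(second_state, diffgram)  # rewinds without replaying
```

Once the diffgram exists, moving between the two instructions costs one pass over the changed resources instead of a replay of every instruction in the segment, which is what makes it useful as an index.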