DEVICE AND METHOD FOR SCHEDULING A SET OF JOBS FOR A PLURALITY OF MACHINES
20210312280 · 2021-10-07
Inventors
- Ayal Taitler (Haifa, IL)
- Christian Daniel (Leonberg, DE)
- Dotan Di Castro (Haifa, IL)
- Felix Milo Richter (Heidenheim, DE)
- Joel Oren (Tel Aviv, IL)
- Maksym Lefarov (Stuttgart, DE)
- Nima Manafzadeh Dizbin (Istanbul, TR)
- Zohar Feldman (Haifa, IL)
CPC classification
Y02P90/02
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06F18/217
PHYSICS
G06N5/01
PHYSICS
International classification
G05B19/418
PHYSICS
Abstract
A method for scheduling a set of jobs for a plurality of machines. Each job is defined by at least one feature which characterizes a processing time of the job. If any of the machines is free, a job from the set of jobs is selected to be carried out by said machine and scheduled for said machine. The job is selected as follows: a Graph Neural Network receives as input the set of jobs and a current state of at least the machine which is free, the Graph Neural Network outputs a reward for the set of jobs if launched on the machines whose states are inputted into the Graph Neural Network, and the job for the free machine is selected depending on the Graph Neural Network output.
Claims
1. A method for scheduling a set of jobs on a plurality of machines, each job being defined by at least one feature which characterizes a processing time of the job, the method comprising the following steps: when any of the machines is free: selecting a job from the set of jobs to be carried out by the machine, and scheduling the selected job on the free machine, wherein the job is selected by: receiving as input, by a Graph Neural Network, the set of jobs and a current state of each machine of at least one of the machines, the at least one of the machines including the free machine, outputting, by the Graph Neural Network, rewards for each job launched on the at least one of the machines, which states are inputted to the Graph Neural Network, and selecting the job for the free machine depending on the Graph Neural Network outputted rewards.
2. The method according to claim 1, wherein the job is defined by a further feature that characterizes a class of a plurality of classes, wherein each of the classes characterizes a corresponding setup of the machines required to carry out a corresponding job, wherein the current state of each machine also characterizes its current setup, and wherein the Graph Neural Network receives a further input, which are time penalties, incurred when a machine has to be reconfigured by a change of its setup due to two immediately successive jobs of different classes on the same machine.
3. The method according to claim 1, wherein additional jobs arrive at certain points of time and the additional jobs are added to the set of jobs.
4. The method according to claim 1, wherein a parametrization θ of the Graph Neural Network is optimized by reinforcement learning.
5. The method according to claim 1, wherein a parametrization θ of the Graph Neural Network is optimized by deep Q-Network learning.
6. The method according to claim 1, wherein for selecting a subsequent job, a Monte-Carlo Tree Search is applied, wherein the Monte-Carlo Tree Search builds iteratively a search tree starting from the current state of each of the machines and the set of jobs, wherein for expanding the search tree, the outputted reward of the Graph Neural Network is used as a search heuristic, and depending on the search tree the subsequent job is selected.
7. The method according to claim 6, wherein for expanding the search tree, actions are selected based on an Upper Confidence-Bound.
8. The method according to claim 6, wherein the search tree is pruned by considering only a predetermined number of jobs with the highest Q-values within at least one rollout.
9. The method according to claim 1, wherein the machines are production machines.
10. The method according to claim 1, wherein the machines are logistic robots used to supply production machines with production material required for a current job of a respective machine.
11. The method according to claim 1, wherein the state of each of the machines is determined depending on received sensor signals, wherein the state of each of the machines also includes a current load status and/or a parameter characterizing maintenance needs of the respective machine.
12. A non-transitory machine-readable storage medium on which is stored a computer program for scheduling a set of jobs on a plurality of machines, each job being defined by at least one feature which characterizes a processing time of the job, the computer program, when executed by a processor, causing the processor to perform the following steps: when any of the machines is free: selecting a job from the set of jobs to be carried out by the machine, and scheduling the selected job on the free machine, wherein the job is selected by: receiving as input, by a Graph Neural Network, the set of jobs and a current state of each machine of at least one of the machines, the at least one of the machines including the free machine, outputting, by the Graph Neural Network, rewards for each job launched on the at least one of the machines, which states are inputted to the Graph Neural Network, and selecting the job for the free machine depending on the Graph Neural Network outputted rewards.
13. A system configured to schedule a set of jobs on a plurality of machines, each job being defined by at least one feature which characterizes a processing time of the job, the system configured to: when any of the machines is free: select a job from the set of jobs to be carried out by the machine, and schedule the selected job on the free machine, wherein to select the job, the system is configured to: receive as input, by a Graph Neural Network, the set of jobs and a current state of each machine of at least one of the machines, the at least one of the machines including the free machine, output, by the Graph Neural Network, rewards for each job launched on the at least one of the machines, which states are inputted to the Graph Neural Network, and select the job for the free machine depending on the Graph Neural Network outputted rewards.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0032]
[0033]
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0034] In the following, a job scheduler in accordance with an example embodiment of the present invention is disclosed, which is built upon both reinforcement learning and global search, more specifically, by combining graph convolutional DQN with MCTS, while allowing one to benefit from the other. Said job scheduler is able to compute a solution to a scheduling problem represented as a graph directly, without the aid of external solvers or simulation environments.
[0035] At first, the setting of a problem which can be solved by the job scheduler shall be defined.
[0036] In general, the job scheduler shall be able to solve online scheduling, a particular online combinatorial problem. Note that said job scheduler can be readily extended to other online combinatorial problems such as the travelling salesperson problem and vehicle routing problems, e.g. where the destinations are added online. Further online combinatorial optimization problems include online logistics problems, e.g., planning the route of a courier through a city, where additional points to pick up/drop off items are added while the courier is on the way.
[0037] In the following, the embodiment considers scheduling problems where a number of waiting jobs need to be processed by a set of similar machines, in particular scheduling in production factories (determining the processing order of goods at different processing steps), as well as scheduling in other industrial areas, e.g., packaging. The machines can be any kind of machine, e.g. a production machine, a transportation machine, and so on.
[0038] The jobs can belong to different classes. Whenever two jobs of different classes are processed consecutively on the same machine, the setup of the machine needs to be changed, which incurs a time penalty due to the reconfiguration of the machine. Two consecutive jobs of the same class do not incur such a penalty. In addition to their classes, jobs are characterized by their processing times, preferably by their weights (i.e., importance), and preferably by the time when they arrived at the queue.
[0039] The total weighted completion time (TWCT) can be defined as:
TWCT=Σ.sub.j∈J w.sub.jC.sub.j
[0040] where J is the set of jobs, w.sub.j is the weight of job j, and C.sub.j is its completion time, that is, the difference between the time when j has finished processing and when it arrived in the queue.
[0041] The job scheduler is configured to minimize the TWCT.
[0042] A state can be characterized by the set of waiting jobs and their characteristics as defined above (e.g. completion time, weight of the jobs, . . . ), and the current state of the machines. Concretely, each machine is represented by the class that it is currently set up for, whether it is currently occupied by some job, and if so, the remaining processing time. The matrix of setup change times is also observable. Note that a state is treated as a decision point if there is at least one free machine and one waiting job in that state.
[0043] Additionally, received sensor signals, e.g. from built-in monitoring sensors of the factory processing machines (e.g., current load, maintenance needs) and from sensors that monitor the position and state of jobs within the factory (e.g., silicon wafers), can be used to characterize the state. Optionally, a status of the current situation of a material supply chain can be used to characterize the state.
[0044] Actions can be represented as tuples (j;m) and correspond to starting job j on machine m. When the action is executed, an event, representing an update to the state, is registered for the time when the job terminates. Alternatively, a no-operation (noop) action can be executed that does not start a job on any of the free machines. In both cases, the simulation is iteratively fast-forwarded to the next event until the next decision point is reached.
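The event-driven mechanics described above, where executing an action registers a job-completion event and the simulation fast-forwards to the next decision point, can be sketched as follows. This is an illustrative Python sketch only; all function and variable names are hypothetical, not from the original disclosure:

```python
import heapq

def fast_forward(now, events, free_machines, waiting_jobs):
    """Advance simulated time event by event until a decision point is
    reached, i.e., at least one machine is free and one job is waiting.
    `events` is a heap of (finish_time, machine_id) job-completion events."""
    while events and not (free_machines and waiting_jobs):
        now, machine = heapq.heappop(events)   # next job-completion event
        free_machines.add(machine)             # the machine becomes free
    return now

def start_job(now, events, free_machines, waiting_jobs, job, machine, proc_time):
    """Execute action (job, machine): remove the job from the queue, occupy
    the machine, and register the completion event for its finish time."""
    waiting_jobs.remove(job)
    free_machines.remove(machine)
    heapq.heappush(events, (now + proc_time, machine))
```

A noop action would simply skip `start_job` and call `fast_forward` directly, deferring the decision to the next event.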
[0045] There can be static and dynamic settings. In the static setting, all relevant jobs are visible in the initial state, e.g. at the beginning of the method or before starting the job scheduler. In the dynamic setting, additional jobs can arrive at certain points in time (i.e., in arrival intervals), where the number of arriving jobs at the start of each interval can be estimated by a Poisson distribution for all classes, each with its own rate. The frequency of job classes can follow Zipf's law. The episode terminates when all the jobs have arrived and finished processing.
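The dynamic arrival process above (Poisson-distributed arrival counts per class, with per-class rates following Zipf's law) can be sketched in Python. The helper names and the use of Knuth's inverse-CDF Poisson sampler are illustrative assumptions, not part of the original disclosure:

```python
import math
import random

def zipf_rates(num_classes, base_rate, s=1.0):
    """Per-class Poisson rates following Zipf's law: class k gets a share
    proportional to 1/k**s of the total arrival rate `base_rate`."""
    z = sum(1.0 / k ** s for k in range(1, num_classes + 1))
    return {k: base_rate * (1.0 / k ** s) / z for k in range(1, num_classes + 1)}

def sample_arrivals(rates, rng):
    """Sample the number of newly arriving jobs per class at the start of
    an arrival interval; `rates` maps class id -> Poisson rate."""
    def poisson(lam):
        # Knuth's algorithm: count uniform draws until the product
        # falls below exp(-lam); suitable for small rates.
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1
    return {c: poisson(lam) for c, lam in rates.items()}
```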
[0046] The immediate rewards can be given as r.sub.t=TWCT(s.sub.t+1)−TWCT(s.sub.t), replacing the completion time with the current time for uncompleted jobs. Thus, the total reward of an episode is:
Σ.sub.t=0.sup.T−1r.sub.t=TWCT(s.sub.T)−TWCT(s.sub.0)
[0047] Note that the current time is 0 in s.sub.0 so TWCT(s.sub.0)=0, and that all jobs are completed in s.sub.T, so TWCT(s.sub.T) is just the TWCT objective value.
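The reward structure above, where immediate rewards are differences of TWCT values and their sum telescopes to the final TWCT objective, can be illustrated with a small Python sketch. The job-record layout is a hypothetical choice for illustration:

```python
def twct(jobs, current_time):
    """Total weighted completion time: sum over jobs of w_j * C_j, where
    for uncompleted jobs the completion time is replaced by the time
    elapsed so far (current_time - arrival_time)."""
    total = 0.0
    for j in jobs:
        if j["completed"] is not None:
            c = j["completed"] - j["arrival"]   # finished: true completion time
        else:
            c = current_time - j["arrival"]     # unfinished: elapsed time so far
        total += j["weight"] * c
    return total
```

Because each immediate reward is TWCT(s.sub.t+1)−TWCT(s.sub.t), summing the rewards over an episode cancels the intermediate terms and leaves exactly the TWCT objective value of the terminal state.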
[0048] Explanation of the choice of the deep Q-network (DQN): a Q Graph Neural Network.
[0049] The above described problem of online scheduling can be seen as a Markov Decision Process (MDP), where S and A are the state and action spaces respectively, and P(s′|s,a) denotes the probability of transitioning from state s∈S to state s′∈S when applying action a∈A. To solve the MDP formulation, a reinforcement learning algorithm is applied.
[0050] Preferably, an off-policy DQN algorithm can be used that learns an approximation to Q(s,a) function, which represents the cumulative discounted reward for applying a at s and then following some policy of actions which could be greedy or explorative. For more information about the Q function, the reader is referred to the document: Bertsekas, D. “Dynamic programming and optimal control”, Athena scientific Belmont, Mass., 2005.
[0051] Due to both the large state space S and the Q function complexity, a neural network parametrized by θ is a common choice for Q function approximation.
[0052] This approximation can be trained to minimize the squared temporal difference denoted by:
δ.sup.2=(r(s, a)+γ max.sub.a′Q(s′, a′; θ)−Q(s, a; θ)).sup.2
[0053] with the discount factor γ and an immediate reward r(s, a).
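The one-step temporal-difference error, δ = r(s, a) + γ·max over a′ of Q(s′, a′) − Q(s, a), whose square the parameters θ are trained to minimize, can be sketched generically in Python. The `q` callable stands in for any Q-function approximator (here including the GNN); the signature is an illustrative assumption:

```python
def td_error(q, s, a, r, s_next, actions_next, gamma, terminal):
    """One-step temporal-difference error for Q-learning:
    delta = r + gamma * max_a' Q(s', a') - Q(s, a).
    For terminal transitions the bootstrap term is dropped. Training
    minimizes delta**2 with respect to the parameters of `q`."""
    if terminal:
        target = r
    else:
        target = r + gamma * max(q(s_next, a2) for a2 in actions_next)
    return target - q(s, a)
```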
[0054] A GNN is utilized as the Q function approximation in the DQN algorithm implementation. GNN is a common name for a framework where a neural network or a set of neural networks is repetitively applied to the graph representation of the input data. GNNs have several properties that are important in the context of combinatorial optimization: They are permutation equivariant and are able to process observations of varying sizes (i.e., different numbers of nodes and edges) as long as the feature dimensionality remains constant. In addition, GNNs are able to leverage the relational bias in observations via multi-step propagation of information within the graph representation, i.e., message-passing.
[0055] For the scheduling problem, each observation is represented with a bipartite, undirected graph represented by the tuple G=(w;U;V;E).
[0056] In the proposed encoding, U and V correspond to two disjoint independent sets of nodes, where u.sub.i is the feature vector of machine i and v.sub.j the feature vector of job j. The machine and job feature vectors consist of the characteristics described above. The actual feature vector of every node u.sub.i∈U and v.sub.j∈V is the concatenation of machine and job features where the irrelevant dimensions can be masked by zeros. The sets of machine and job nodes are interconnected by the set of edges E=(e.sub.k, m.sub.k, j.sub.k), where e.sub.k is an edge feature vector, m.sub.k is an index of a machine node, and j.sub.k an index of a job node connected by the edge k. Edge features encode the setup time required to process the job j.sub.k on the corresponding machine m.sub.k. Additionally, the graph representation contains a parameter w that is the global information vector. This parameter w is referred to as the global feature, is initialized to 0 for every observation, and is used for the information propagation during the message-passing.
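The bipartite encoding G = (w, U, V, E) above can be sketched as a small Python construction. The concrete field layout, padding scheme, and function names are illustrative assumptions; the real feature contents are the machine and job characteristics described in the text:

```python
def build_graph(machines, jobs, setup_time):
    """Encode an observation as the bipartite graph G = (w, U, V, E).
    Machine and job feature vectors are concatenated into one common
    layout, with the irrelevant half masked by zeros; each edge
    (machine i, job j) carries the setup time as its feature."""
    dm, dj = len(machines[0]), len(jobs[0])
    U = [list(m) + [0.0] * dj for m in machines]   # machine nodes, job part zeroed
    V = [[0.0] * dm + list(j) for j in jobs]       # job nodes, machine part zeroed
    E = [([setup_time(i, j)], i, j)                # edges (e_k, m_k, j_k)
         for i in range(len(machines)) for j in range(len(jobs))]
    w = [0.0]                                      # global feature, initialized to 0
    return w, U, V, E
```

After message-passing, the updated edge features are read out as Q-values for the corresponding (machine, job) actions, and the updated global feature as the Q-value of the noop action.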
[0057] Preferably, the encoding message-passing decoding GNN architecture is used as described by Battaglia, P. W et al. “Relational inductive biases, deep learning, and graph networks”, arXiv preprint arXiv:1806.01261, 2018.
[0058] The output of the GNN is a graph of the same structure but with updated node u.sub.i and v.sub.j, edge e.sub.k, and global w feature vectors. Feature e.sub.k of edge k is interpreted as the Q-value of scheduling job j.sub.k on machine m.sub.k, and the global feature w as the Q-value of the noop action.
[0059] For learning the Q function, a simulator of the machines is applied that models the problem as a discrete event simulation. Preferably, a simulated factory, a.k.a. “digital twin”, is applied that reflects the dynamics of a real factory, i.e., how the state of the factory changes when a job is started on a machine and how additional tasks arrive over time. In an alternative embodiment, recorded data from the machines or from the factory can be used as training data for learning the Q function, or the recorded data can be used to build the simulator. As already stated above, DQN learning is preferably utilized to learn the Q function and train the GNN.
[0060] After the GNN is trained, it can be applied straightforwardly to schedule jobs on a plurality of machines. For example, when a decision point occurs, the above-described inputs of the GNN are constructed from the current situation at the point in time at which the decision point occurs. Depending on the output of the GNN, the job with the highest Q-value is selected from the set of waiting jobs. This procedure can be applied each time a decision point occurs. Interestingly, this procedure is limited neither to static nor to dynamic settings.
[0061] For the case that a decision point occurs, or for the case that all waiting jobs shall be scheduled into one complete schedule or queue, a Monte-Carlo Tree Search can be used. Monte-Carlo Tree Search (MCTS) is an online planning method for sequential and (possibly) stochastic decision problems. MCTS operates by searching for the optimal decision to make at each state via iteratively sampling rollouts: sequences of steps that begin in the current state and end either in terminal states or in states that were reached by the end of the planning horizon. These rollouts are used to build a search tree, which stores information about the visited states. Such information includes Q-value estimations Q.sup.T(s; a) that are used to guide the search in subsequent rollouts and eventually, when the timeout is reached, to output a decision for the current state.
[0062] While MCTS is capable of finding high-quality decisions without incorporating any domain-specific knowledge, the availability of good domain-specific heuristics can significantly facilitate the search process, yielding better decisions in shorter deliberation times. Preferably, a variation of the MCTS algorithm called UCT (for more information: Kocsis, L. and Szepesvari, C. “Bandit based monte-carlo planning”, In European conference on machine learning, pp. 282-293. Springer, 2006.) is applied in conjunction with the trained GNN. In particular, rollouts are sampled by choosing actions according to the Upper Confidence-Bound (UCB, for more details: see also Kocsis et al. or Liu et al. “Adapting improved upper confidence bounds for Monte-Carlo tree search”, In: Advances in Computer Games. Springer, Cham, 2015. S. 53-64.) policy for states that are inside the search tree (i.e., added to the tree in preceding rollouts), and otherwise choosing the action with the highest Q-value estimate Q.sup.N(s; a)=Q(s; a; θ), obtained from the trained GNN. UCB balances exploration and exploitation by first trying all actions once, and afterwards choosing the action that maximizes the value:
Q.sup.T(s, a)+ß√(ln n.sub.s/n.sub.s,a)
[0063] where n.sub.s,a denotes the number of times action a was sampled in state s, n.sub.s=Σ.sub.an.sub.s,a, and ß is the exploration factor.
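The UCB action-selection rule (first try every action once, then maximize the tree estimate Q^T(s, a) plus the exploration bonus ß·sqrt(ln n_s / n_{s,a})) can be sketched as follows. The data-structure choices (plain dicts keyed by action) are illustrative:

```python
import math

def ucb_select(actions, q_tree, n_s_a, beta):
    """UCB policy inside the search tree: try every action once, then
    pick the action maximizing Q^T(s,a) + beta * sqrt(ln(n_s) / n_{s,a}).
    `q_tree` maps action -> tree Q-estimate, `n_s_a` action -> visit count."""
    untried = [a for a in actions if n_s_a.get(a, 0) == 0]
    if untried:
        return untried[0]                      # exploration: unvisited action first
    n_s = sum(n_s_a[a] for a in actions)       # total visits of this state
    return max(actions,
               key=lambda a: q_tree[a] + beta * math.sqrt(math.log(n_s) / n_s_a[a]))
```

Rarely tried actions receive a large bonus and are revisited, while well-estimated high-value actions dominate as visit counts grow.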
[0064] The Q-values in each state are estimated as:
Q.sup.T(s, a)=R.sub.s,a/n.sub.s,a
[0065] where R.sub.s,a is the cumulative return accumulated over the rollouts in which action a was applied in state s.
[0066] Since the number of actions in some cases can be quite large, often leading to poor performance, a pruning mechanism can be applied that makes use of the GNN in an additional manner. Namely, during the search, only the k actions with the highest Q-value estimates Q.sup.N(s; a) are considered.
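The GNN-based pruning step above can be sketched in one short function; the `q_net` callable stands in for the trained GNN's Q-value estimate Q^N(s, a), and the names are illustrative:

```python
def prune_actions(actions, q_net, state, k):
    """Keep only the k actions with the highest GNN Q-value estimates
    Q^N(s, a), bounding the branching factor of the search tree."""
    return sorted(actions, key=lambda a: q_net(state, a), reverse=True)[:k]
```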
[0067] Given the current state, the MCTS can be applied to generate the search tree, wherein depending on the search tree, the schedule of jobs is derived, e.g. by choosing the branches of the search tree which result in the shortest TWCT.
[0068] MCTS can be run for as much time as is available for optimization (e.g. seconds to minutes) or until an optimal job queue is found.
[0069]
[0070] The method starts with step 11. In this step, the simulator of the factory is initialized.
[0071] In the subsequent step 12, the GNN is trained, wherein a reinforcement learning algorithm, in particular DQN, is applied. The reinforcement learning algorithm is carried out on the simulator to explore the simulator and its behavior for given actions at given states.
[0072] When the training of the GNN has terminated, step 13 follows. For a given state of the machines and a set of jobs, a queue of jobs is determined. The queue is determined by the MCTS as described above.
[0073] It is noted that determining the queue of jobs via the MCTS can be started when all or some of the machines are free, or when the decision point is reached for one machine.
[0074] After the queue is obtained, the jobs are subsequently processed by the machines in step 14. For example, after a job queue has been determined by the MCTS, or a next job has been determined depending on the Q-value, the next job is assigned to the corresponding determined machine. The machine is then controlled to carry out its newly assigned job.
[0075] While the queue of jobs is subsequently processed, whenever a decision point is reached, either the MCTS can be applied again, in which case the next job is not selected from the original order of the queue but via the updated queue determined by the MCTS search tree, or the next job is selected depending only on the Q-value of the GNN, or the original order of the queue is further utilized.
[0076] In a further embodiment, depending on the next determined job, a control signal is determined. Said control signal is for controlling the machine and/or is for controlling a physical system e.g., a computer-controlled robot that loads jobs into available machines and/or reconfigures the machine for the required setup for its next job.
[0077]
[0078] A computer-readable memory (24) contains instructions which, when carried out on a computer (25), cause the computer to carry out the training of the GNN.
[0079] The term “computer” covers any device for the processing of pre-defined calculation instructions. These calculation instructions can be in the form of software, or in the form of hardware, or also in a mixed form of software and hardware.
[0080] It is further understood that the procedures can not only be completely implemented in software as described; they can also be implemented in hardware, or in a mixed form of software and hardware.