DEVICE, COMPUTER PROGRAM AND COMPUTER-IMPLEMENTED METHOD FOR MACHINE LEARNING
20230037759 · 2023-02-09
Inventors
- Hamish Flynn (Stuttgart, DE)
- David Reeb (Renningen, DE)
- Jan Peters (Seeheim-Jugenheim, DE)
- Melih Kandemir (Agedrup, DK)
CPC classification
G06F18/217
PHYSICS
International classification
Abstract
A device, computer program and computer-implemented method for machine learning. The method comprises providing a task comprising an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions, providing a hyperprior, wherein the hyperprior is a distribution over the action space, determining, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks has as large a value as possible, when using priors sampled from the hyperposterior, and wherein the hyperposterior is a distribution over the action space.
Claims
1. A computer-implemented method for machine learning, the method comprising the following steps: providing a task including an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions; providing a hyperprior, wherein the hyperprior is a distribution over the action space; and determining, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks has as large a value as possible, when using priors sampled from the hyperposterior, wherein the hyperposterior is a distribution over the action space.
2. The method according to claim 1, wherein the determining of the hyperposterior includes determining the hyperposterior that maximises the lower bound for the expected reward.
3. The method according to claim 1, further comprising: processing sensor data including digital image data or audio data, depending on a prior sampled from the hyperposterior, for classifying the sensor data; and (i) detecting presence of objects in the sensor data, or (ii) performing a semantic segmentation on the sensor data, or (iii) determining a measure for robustness of the machine learning including a probability that an expected error on a next task is not above a predetermined value, when sampling priors from the hyperposterior, or (iv) detecting an anomaly in sensor data depending on a prior sampled from the hyperposterior, or (v) learning a policy for controlling a physical system and determining a control signal for controlling the physical system depending on a prior sampled from the hyperposterior.
4. The method according to claim 1, further comprising: determining the hyperposterior in a number of iterations; and sampling the prior of an iteration from the hyperposterior of a previous iteration.
5. The method according to claim 4, further comprising: sampling the task of the iteration from a distribution over tasks.
6. The method according to claim 1, further comprising: initializing the hyperposterior with the hyperprior.
7. The method according to claim 1, further comprising: initializing a posterior with the prior; determining from a set of behaviour policies of the task a behaviour policy that is associated with the task posterior, wherein the behaviour policy includes a distribution over actions with a probability mass; sampling or selecting randomly from the probability mass an action; sampling a reward depending on the action from the distribution over rewards; determining a task dataset that includes the action and the reward, and updating the posterior to include the determined task dataset.
8. The method according to claim 7, wherein the providing the task includes providing the task including a state space and a distribution over initial states, and wherein the method further comprises sampling or selecting randomly an initial state from the distribution over initial states, and wherein the distribution over rewards is conditioned on the actions and states of the state space.
9. The method according to claim 7, the method further comprising: initializing the dataset with an empty set and then updating the posterior in a predetermined number of rounds.
10. The method according to claim 1, wherein the determining of the hyperposterior includes determining an approximation of an expected reward depending on a Kullback-Leibler divergence of the hyperposterior and the hyperprior.
11. A device for machine learning, the device configured to: provide a task including an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions; provide a hyperprior, wherein the hyperprior is a distribution over the action space; and determine, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks has as large a value as possible, when using priors sampled from the hyperposterior, wherein the hyperposterior is a distribution over the action space.
12. A non-transitory computer-readable medium on which is stored a computer program including computer-readable instructions for machine learning, the instructions, when executed by a computer, causing the computer to perform the following steps: providing a task including an action space of a multi-armed bandit problem or a contextual bandit problem and a distribution over rewards that is conditioned on actions; providing a hyperprior, wherein the hyperprior is a distribution over the action space; and determining, depending on the hyperprior, a hyperposterior so that a lower bound for an expected reward on future bandit tasks has as large a value as possible, when using priors sampled from the hyperposterior, wherein the hyperposterior is a distribution over the action space.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0024] The method of machine learning in the example uses a base learning algorithm Q=Q(D,P) and starts with a step 200. The base learning algorithm returns a posterior distribution Q. Since the base learning algorithm uses a dataset D and a prior P, the posterior is written in the description as Q(D, P) to make the dependence on D and P clear.
[0025] In the step 200 the method comprises providing a task Ti.
[0026] The task T.sub.i may be a task of a multi-armed bandit problem that is represented as a pair
T.sub.i=(A,ρ.sub.i)
comprising an action space A, i.e. a set of actions a, of the multi-armed bandit problem, and a distribution ρ.sub.i(r|a) over rewards r that is conditioned on actions a.
[0027] The task T.sub.i may be a task of a contextual bandit problem that is represented as a quadruple
T.sub.i=(S,A,μ.sub.i,ρ.sub.i)
comprising an action space A, i.e. a set of actions a, and a state space S, and a distribution ρ.sub.i(r|s,a) over rewards r that is conditioned on actions a and states s, and a distribution over initial states μ(s) of the contextual bandit problem.
[0028] Rewards r are assumed to be between 0 and 1 and tasks T.sub.i are assumed to be sampled i.i.d. from an environment T.
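The task representations above can be sketched as follows. This is an illustrative sketch only, with hypothetical names; the Bernoulli reward distribution is one simple choice satisfying the assumption that rewards lie in [0, 1], not the form prescribed by the specification.

```python
import numpy as np

# Sketch of a multi-armed bandit task T_i = (A, rho_i): an action space A
# and a reward distribution rho_i(r | a) conditioned on actions.
class BanditTask:
    def __init__(self, reward_means, rng=None):
        self.actions = list(range(len(reward_means)))  # action space A
        self.reward_means = np.asarray(reward_means)   # parameters of rho_i
        self.rng = rng if rng is not None else np.random.default_rng()

    def sample_reward(self, action):
        # Sample r ~ rho_i(r | a); here rho_i is Bernoulli with mean
        # reward_means[a], so rewards stay in [0, 1] as assumed.
        return float(self.rng.random() < self.reward_means[action])

# Tasks T_1, ..., T_n are drawn i.i.d. from an environment T, modelled
# here as a sampler over reward-mean vectors.
def sample_task(num_actions, rng):
    return BanditTask(rng.random(num_actions), rng)
```

A contextual bandit task would additionally carry a state space S and an initial-state distribution, with the reward sampler conditioned on both state and action.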
[0029] Afterwards, a step 202 is executed.
[0030] In the step 202, the method comprises providing a hyperprior P.
[0031] The hyperprior P is a distribution over the set of possible prior distributions. Each prior distribution is a distribution over the action space A.
[0032] Afterwards, a step 204 is executed.
[0033] In the step 204, the method comprises initializing a hyperposterior Q with the hyperprior P. The hyperposterior Q is another distribution over prior distributions.
[0034] Afterwards a step 206 is executed.
[0035] In the step 206, the method comprises sampling the task T.sub.i of an iteration i from a distribution T over tasks.
[0036] Associated with the task T.sub.i is a set of behaviour policies B.sub.i=(b.sub.i1, . . . , b.sub.im.sub.i).
[0037] A behaviour policy b.sub.ij comprises for the multi-armed bandit problem a distribution over actions a with a probability mass b.sub.ij(a). In the example, the behaviour policy b.sub.ij is a distribution over actions a that depends in the iteration i on previously observed training data
D.sub.1, . . . ,D.sub.i−1,D.sub.i.sup.:j−1
[0038] where, in the multi-armed bandit task,
D.sub.i.sup.:j−1=((a.sub.i1,r.sub.i1), . . . ,(a.sub.ij−1,r.sub.ij−1))
[0039] The elements of the training dataset are in the multi-armed bandit task distributed as follows
a.sub.ij˜b.sub.ij(a|D.sub.1, . . . ,D.sub.i−1,D.sub.i.sup.:j−1),r.sub.ij˜ρ.sub.i(r|a.sub.ij)
[0040] A behaviour policy b.sub.ij comprises for the contextual bandit problem a distribution over actions a conditioned on states s with a probability mass b.sub.ij(a|s.sub.ij). In the example, the behaviour policy b.sub.ij is a distribution over actions a that depends in the iteration i on previously observed training data
D.sub.i.sup.:j−1=((s.sub.i1,a.sub.i1,r.sub.i1), . . . ,(s.sub.ij−1,a.sub.ij−1,r.sub.ij−1))
[0041] The elements of the training dataset are in the contextual bandit task distributed as follows
a.sub.ij˜b.sub.ij(a|s.sub.ij,D.sub.1, . . . ,D.sub.i−1,D.sub.i.sup.:j−1),r.sub.ij˜ρ.sub.i(r|s.sub.ij,a.sub.ij)
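One round of data collection under a behaviour policy, as described in paragraphs [0037] to [0041], can be sketched as below. The names are hypothetical; the behaviour policy is represented simply as a probability mass over actions, with any dependence on previously observed data assumed to be folded into those probabilities.

```python
import numpy as np

# One round of the multi-armed bandit data-generating process:
# a_ij ~ b_ij(a | D_1, ..., D_{i-1}, D_i^{:j-1}) and r_ij ~ rho_i(r | a_ij).
def play_round(behaviour_probs, reward_sampler, rng):
    actions = np.arange(len(behaviour_probs))
    a_ij = rng.choice(actions, p=behaviour_probs)  # a_ij ~ b_ij
    r_ij = reward_sampler(a_ij)                    # r_ij ~ rho_i(r | a_ij)
    return int(a_ij), r_ij

# Example: a uniform behaviour policy over 4 actions with Bernoulli rewards.
rng = np.random.default_rng(1)
probs = np.full(4, 0.25)
a, r = play_round(probs, lambda a: float(rng.random() < 0.5), rng)
```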
[0042] Afterwards a step 208 is executed.
[0043] In step 208, the method comprises sampling the prior P of the iteration i from the hyperposterior Q of a previous iteration i−1.
[0044] Afterwards, a step 210 is executed.
[0045] In the step 210, the method comprises initializing a posterior Q with the prior P.
[0046] Afterwards, a step 212 is executed.
[0047] In the step 212, the method comprises initializing the dataset D.sub.i with an empty set.
[0048] Afterwards, a step 214 is executed.
[0049] In the step 214, the method comprises determining from the set of behaviour policies B.sub.i of the task T.sub.i a behaviour policy b.sub.ij that is associated with the task posterior Q.
[0050] Afterwards, a step 216 is executed.
[0051] In the step 216, the method comprises sampling an action a.sub.ij from the probability mass of the behaviour policy b.sub.ij, or selecting it randomly.
[0052] Afterwards, a step 218 is executed.
[0053] In the step 218, the method comprises for the multi-armed bandit problem sampling a reward r.sub.ij depending on the action a.sub.ij from the distribution ρ.sub.i(r|a) over rewards r.
[0054] In the step 218, the method comprises for the contextual bandit problem sampling or selecting randomly an initial state s.sub.ij from the distribution over initial states μ.sub.i(s) and sampling the reward r.sub.ij depending on the action a.sub.ij from the distribution ρ.sub.i(r|s,a) over rewards r that is conditioned on the actions a and states s of the state space S.
[0055] Afterwards, a step 220 is executed.
[0056] In the step 220, the method comprises determining a dataset D.sub.i that includes the action a.sub.ij and the reward r.sub.ij.
[0057] The dataset D.sub.i is a training set that for the multi-armed bandit problem comprises m.sub.i observable training data pairs:
D.sub.i=((a.sub.i1,r.sub.i1), . . . ,(a.sub.im.sub.i,r.sub.im.sub.i))
[0058] The dataset D.sub.i is a training set that for the contextual bandit problem comprises m.sub.i observable training data triples:
D.sub.i=((s.sub.i1,a.sub.i1,r.sub.i1), . . . ,(s.sub.im.sub.i,a.sub.im.sub.i,r.sub.im.sub.i))
[0059] Afterwards, a step 222 is executed.
[0060] In the step 222, the method comprises updating the posterior Q to include the dataset D.sub.i.
[0061] To this end, the deterministic base learning algorithm takes the dataset D.sub.i and the prior P as inputs and produces the posterior Q=Q(D.sub.i; P).
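A deterministic base learning algorithm Q=Q(D.sub.i, P) of this kind can be sketched as below. The exponential-weights update is one plausible instantiation chosen for illustration; the specification leaves the base learning algorithm abstract, so all names and the update rule here are assumptions.

```python
import numpy as np

# Hedged sketch of a deterministic base learning algorithm Q = Q(D_i, P):
# it maps a dataset D_i of (action, reward) pairs and a prior P over
# actions to a posterior over actions.
def base_learner(dataset, prior, num_actions, eta=1.0):
    totals = np.zeros(num_actions)
    for a, r in dataset:            # accumulate observed rewards per action
        totals[a] += r
    weights = prior * np.exp(eta * totals)  # reweight the prior
    return weights / weights.sum()          # normalised posterior Q(D_i, P)

# Usage: the uniform prior shifts towards the action with observed reward.
P = np.full(3, 1.0 / 3.0)
Q = base_learner([(0, 1.0), (0, 1.0), (2, 0.0)], P, 3)
```

The update is deterministic in D.sub.i and P, matching the requirement that the posterior be written as a function Q(D.sub.i, P) of its inputs.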
[0062] The goal of the task T.sub.i is to find a posterior Q that maximises an expected reward. The expected reward for the multi-armed bandit problem is:
[0063] wherein E is the expected value.
[0064] Using the dataset D.sub.i, the expected reward is estimated in the example for the multi-armed bandit problem as:
[0065] I{a.sub.ij=a} is an indicator function that returns 1 if a.sub.ij=a and 0 otherwise. The expected reward for the contextual bandit problem is:
[0066] wherein E is the respective expected value, and wherein the prior P and the posterior Q are distributions over a space Π of policies, wherein π∈Π is a distribution π(a|s) over actions a conditioned on states s.
[0067] Using the dataset D.sub.i, the expected reward is estimated in the example for the contextual bandit problem as:
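The estimate involving the indicator function I{a.sub.ij=a} of paragraph [0065] can be sketched for the multi-armed bandit case as below. The exact estimator of the specification is not reproduced in the text, so this is the standard inverse-propensity form it most plausibly refers to; treat it as an assumption.

```python
import numpy as np

# Hedged sketch of an importance-weighted reward estimate. For each
# round j, E_{a~Q}[ I{a_ij = a} ] = Q(a_ij), so the round contributes
# Q(a_ij) * r_ij / b_ij(a_ij), correcting for the behaviour policy.
def estimated_reward(posterior, dataset, behaviour_probs):
    # dataset: list of (a_ij, r_ij); behaviour_probs[j] = b_ij(a_ij)
    m = len(dataset)
    total = 0.0
    for (a_ij, r_ij), b in zip(dataset, behaviour_probs):
        total += posterior[a_ij] * r_ij / b
    return total / m

# Example: uniform posterior over two actions, uniform behaviour policy.
Q = np.array([0.5, 0.5])
est = estimated_reward(Q, [(0, 1.0), (1, 0.0)], [0.5, 0.5])
```

For the contextual bandit case, the same structure applies with the posterior evaluated at (s.sub.ij, a.sub.ij) and the behaviour probability conditioned on the state.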
[0068] In the example, the method comprises updating the posterior Q in a predetermined number of rounds m. This means that the method continues with step 214 for m rounds.
[0069] Afterwards a step 224 is executed.
[0070] In the step 224, the method comprises determining, depending on the hyperprior P, the hyperposterior Q so that a lower bound for an expected reward L on future bandit tasks has as large a value as possible, when using priors P sampled from the hyperposterior Q.
[0071] Determining the hyperposterior Q in the example comprises determining the hyperposterior Q that maximises the lower bound for the expected reward L.
Q←argmax.sub.Q′ lower bound(L(Q′))
[0072] In the example, the function lower bound (L(Q′)) for the multi-armed bandit problem is one of the following objective functions:
[0073] wherein β.sub.1 and β.sub.2 are constants with 0≤β.sub.1, β.sub.2≤1,
[0074] wherein
[0075] and wherein 0<δ<1 is a confidence parameter, so that for the multi-armed bandit problem the following inequalities each hold with probability 1−δ over the training data D.sub.1, . . . , D.sub.n and the posteriors Q.sub.1, . . . , Q.sub.n:
[0076] wherein Q.sup.n denotes the joint distribution over actions a.sub.1, . . . , a.sub.n where a.sub.i˜Q.sub.i and wherein P.sup.n denotes the joint distribution over actions a.sub.1, . . . , a.sub.n where a.sub.i˜P.sub.i and wherein
and wherein
[0077] wherein m.sub.i is the number of training set samples for task i.
[0078] For the contextual bandit problem, these objective functions may be applied by replacing L.sub.i with L.sub.i.sup.(cb) and {circumflex over (L)}.sub.i with {circumflex over (L)}.sub.i.sup.(cb), wherein Q.sub.i and P.sub.i are distributions over policies π.sub.i rather than actions a.sub.i and let
where π.sub.max is the maximum possible probability mass for any policy π and any action and state.
[0079] The performance of a hyperposterior Q is for example measured by a marginal transfer reward, which is the expected reward on a new task T.sub.n+1, when using the base learning algorithm Q(D.sub.n+1; P) equipped with a prior P˜Q.
[0080] For the multi-armed bandit problem the marginal transfer reward is:
[0081] For the contextual bandit problem the marginal transfer reward is:
[0082] With the above functions, instead of maximising the marginal transfer reward, the method maximises PAC-Bayesian lower bounds on the marginal transfer reward. Some of the bounds contain the inverse of the binary Kullback-Leibler, KL, divergence.
[0083] This means that determining the hyperposterior Q comprises determining an approximation of the expected reward L depending on a Kullback-Leibler divergence D.sub.KL(Q∥P) of the hyperposterior Q and the hyperprior P.
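The KL term entering the approximation can be sketched as below for hyperposterior and hyperprior represented as discrete distributions; this representation is an assumption, since the specification allows general distributions over priors, and the penalised surrogate only echoes the structure of a PAC-Bayes bound with constants omitted.

```python
import numpy as np

# D_KL(Q || P) for discrete distributions Q and P over the same support.
def kl_divergence(q, p):
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0  # terms with q = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# A KL-regularised surrogate: estimated reward minus a KL penalty,
# mirroring the reward-minus-complexity shape of a PAC-Bayes lower bound.
def penalised_objective(est_reward, q, p, beta=1.0):
    return est_reward - beta * kl_divergence(q, p)

same = kl_divergence([0.5, 0.5], [0.5, 0.5])  # zero when Q equals P
```

Maximising such a surrogate over Q trades off estimated reward against deviation from the hyperprior, which is the role D.sub.KL(Q∥P) plays in the bounds above.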
[0084] The hyperposterior Q is determined in the example in a number n of iterations i.
[0085] This means, the method continues with step 206 for n iterations. Afterwards, optionally, a step 226 is executed.
[0086] In the step 226, the method comprises in the example, processing sensor data, in particular digital image data or audio data, depending on a prior sampled from the hyperposterior Q. The sensor may be a microphone or a video, radar, LiDAR, ultrasonic, motion, thermal imaging, or sonar sensor.
[0087] The sensor data is for example processed for classifying the sensor data, detecting the presence of objects in the sensor data, or performing a semantic segmentation on the sensor data. The sensor data classification is for example framed as a contextual bandit problem. The images may be classified, for example, regarding the presence, absence, or type of traffic signs, road surfaces, pedestrians, or vehicles.
[0088] Instead, or additionally, the step 226 may comprise learning a policy for controlling a physical system and determining a control signal for controlling the physical system depending on a prior sampled from the hyperposterior Q.
[0089] By way of example, a probability that an expected error on a next task is not above a predetermined value may be determined, when sampling priors from the hyperposterior Q.
[0090] Instead, or additionally, the step 226 may comprise determining a measure for robustness of the machine learning. In the example, the measure for robustness is an expected error, for priors that are sampled from the hyperposterior Q.
[0091] A prior may be used for processing the sensor data e.g. when the expected error is less than a threshold and otherwise, another prior may be sampled for processing the sensor data or determining the control signal.
[0092] Instead, or additionally, the step 226 may comprise detecting an anomaly in sensor data depending on a prior sampled from the hyperposterior Q. In this aspect, an anomaly detection problem is framed as a multi-armed bandit problem.
[0093] Afterwards, the method ends in the example.
[0094] An example for the steps taken by the computer in the method for machine learning of multi-armed bandit problems is provided below:
TABLE-US-00001
Algorithm 1 PAC-Bayesian Lifelong MABs
 1: Input: Task distribution T, base learning algorithm Q = Q(D, P), hyperprior P
 2: Initialise hyperposterior Q ← P
 3: for i = 1 to n do
 4:   Sample task T.sub.i ~ T
 5:   Sample prior P ~ Q
 6:   Initialise task posterior Q ← P
 7:   Initialise task dataset D.sub.i ← ( )
 8:   for j = 1 to m do
 9:     Get behaviour policy b.sub.ij ← B(Q)
10:     Sample a.sub.ij ~ b.sub.ij(a) and r.sub.ij ~ ρ.sub.i(r|a.sub.ij)
11:     Append (a.sub.ij, r.sub.ij) to D.sub.i
12:     Update task posterior Q ← Q(D.sub.i, P)
13:   end for
14:   Q ← argmax.sub.Q′(lower bound(L(Q′)))
15: end for
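An end-to-end sketch of the loop structure of Algorithm 1 is given below, under simplifying assumptions: a finite action space, Bernoulli rewards, a discrete prior used directly in place of sampling P ~ Q, an exponential-weights base learner, and a smoothed hyperposterior update standing in for the argmax of the lower bound, which the text leaves abstract. All names are hypothetical.

```python
import numpy as np

def lifelong_mab(n_tasks, n_rounds, n_actions, seed=0):
    rng = np.random.default_rng(seed)
    hyper = np.full(n_actions, 1.0 / n_actions)   # step 2: hyperposterior <- hyperprior
    for _ in range(n_tasks):                      # step 3: for i = 1 to n
        means = rng.random(n_actions)             # step 4: sample task T_i ~ T
        prior = hyper.copy()                      # step 5: prior (simplified: mean of Q)
        posterior = prior.copy()                  # step 6: task posterior <- P
        dataset = []                              # step 7: D_i <- ( )
        totals = np.zeros(n_actions)
        for _ in range(n_rounds):                 # step 8: for j = 1 to m
            b = posterior                         # step 9: behaviour policy from posterior
            a = int(rng.choice(n_actions, p=b))   # step 10: a_ij ~ b_ij(a)
            r = float(rng.random() < means[a])    #          r_ij ~ rho_i(r | a_ij)
            dataset.append((a, r))                # step 11: append (a_ij, r_ij) to D_i
            totals[a] += r
            w = prior * np.exp(totals)            # step 12: posterior <- Q(D_i, P)
            posterior = w / w.sum()
        # step 14: hyperposterior update; a smoothed move towards the task
        # posterior stands in for maximising the lower bound over Q'.
        hyper = 0.9 * hyper + 0.1 * posterior
    return hyper
```

The two nested loops mirror the iteration over tasks (lines 3 to 15) and the rounds within a task (lines 8 to 13); only the lower-bound maximisation of line 14 is replaced by a placeholder update.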