DEVICE AND METHOD FOR TD-LAMBDA TEMPORAL DIFFERENCE LEARNING WITH A VALUE FUNCTION NEURAL NETWORK

20220374697 · 2022-11-24

    Abstract

    The present disclosure relates to a synapse circuit of a neural network for performing TD-lambda temporal difference learning, the neural network approximating a value function, the synapse circuit comprising: a first resistive memory device (506); a second resistive memory device (516); and a synapse control circuit (528) configured to update a synaptic weight (g.sub.θ) of the synapse circuit by programming a resistive state of the first resistive memory device (506) based on a programmed conductance of the second resistive memory device (516).

    Claims

    1. A synapse circuit of a neural network for performing TD-lambda temporal difference learning, the neural network approximating a value function, the synapse circuit comprising: a first resistive memory device; a second resistive memory device; and a synapse control circuit configured to update a synaptic weight (g.sub.θ, g.sub.θ+, g.sub.θ−) of the synapse circuit by programming a resistive state of the first resistive memory device based on a programmed conductance of the second resistive memory device.

    2. The synapse circuit of claim 1, wherein the second resistive memory device is configured to have a conductance γλ that decays over time.

    3. The synapse circuit of claim 2, wherein the second resistive memory device is a phase-change memory device or a conductive bridging RAM element.

    4. The synapse circuit of claim 1, wherein the synapse control circuit is further configured to update an eligibility trace of the synapse circuit by programming a resistive state of the second resistive memory device based on a back-propagated derivative ∂V.sub.t/∂θ.sub.t of an output value V.sub.t of the neural network.

    5. The synapse circuit of claim 1, wherein the synapse control circuit is configured to update the synaptic weight (g.sub.θ, g.sub.θ+, g.sub.θ−) by applying a voltage or current level generated based on a temporal difference error δ to an electrode of the second resistive memory device to generate an output current or voltage level.

    6. The synapse circuit of claim 5, wherein the synapse control circuit is further configured to compare the output current or voltage level with one or more thresholds, and to program the resistive state of the first resistive memory device based on the comparison.

    7. An agent device of a TD-lambda temporal difference learning system, the agent device comprising a neural network comprising an input layer of neurons, one or more hidden layers of neurons, and an output layer of neurons, wherein: each neuron of the input layer is coupled to one or more neurons of a first hidden layer of the one or more hidden layers via a corresponding synapse circuit implemented by the circuit of claim 5.

    8. The agent device of claim 7, further comprising a control circuit configured to generate the temporal difference error δ based on a reward signal R.sub.t received from the environment, and to provide the temporal difference error δ to the neural network.

    9. The agent device of claim 8, wherein the control device provides to the neural network a signal representative of the product of the temporal difference error δ and a learning rate α.

    10. A system for TD-lambda temporal difference learning comprising: the agent device of claim 7 configured to generate an output signal indicating an action A.sub.t to be applied to an environment based on an output of the neural network; one or more actuators configured to apply the action A.sub.t to the environment; and one or more sensors configured to detect a state S.sub.t+1 of the environment and a reward R.sub.t+1 resulting from the action A.sub.t.

    11. A method of TD-lambda temporal difference learning, the method comprising: updating a synaptic weight (g.sub.θ, g.sub.θ+, g.sub.θ−) of a synapse circuit of a neural network, the neural network approximating a value function, the synapse circuit comprising: a first resistive memory device; a second resistive memory device; and a synapse control circuit, wherein updating the synaptic weight comprises programming, by the synapse control circuit, a resistive state of the first resistive memory device based on a programmed conductance of the second resistive memory device.

    12. The method of claim 11, wherein the second resistive memory device is configured to have a conductance γλ that decays over time.

    13. The method of claim 11, further comprising updating, by the synapse control circuit, an eligibility trace of the synapse circuit by programming a resistive state of the second resistive memory device based on a back-propagated derivative ∂V.sub.t/∂θ.sub.t of an output value V.sub.t of the neural network.

    14. The method of claim 11, wherein updating the synaptic weight (g.sub.θ, g.sub.θ+, g.sub.θ−) comprises applying a voltage or current level generated based on a temporal difference error δ to an electrode of the second resistive memory device in order to generate an output current or voltage level.

    15. The method of claim 14, further comprising comparing, by the synapse control circuit, the output current or voltage level with one or more thresholds, and programming the resistive state of the first resistive memory device based on the comparison.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0027] The foregoing features and advantages, as well as others, will be described in detail in the following description of specific embodiments given by way of illustration and not limitation with reference to the accompanying drawings, in which:

    [0028] FIG. 1 schematically illustrates a system for reinforcement learning according to an example embodiment of the present disclosure;

    [0029] FIG. 2 schematically illustrates the system of FIG. 1 in more detail according to an example embodiment;

    [0030] FIG. 3 is a flow diagram illustrating an example of operations in a method of TD-lambda temporal difference learning according to an example embodiment of the present disclosure;

    [0031] FIG. 4 schematically illustrates a deep neural network according to an example embodiment of the present disclosure;

    [0032] FIG. 5 illustrates an array of synapse circuits interconnecting layers of a deep neural network according to an example embodiment of the present disclosure;

    [0033] FIG. 6 is a graph illustrating an example of conductance drift of a phase change memory (PCM) device over time;

    [0034] FIG. 7 is a graph illustrating, on a logarithmic scale, an example of resistance drift of a phase-change memory device over time;

    [0035] FIG. 8 schematically illustrates an agent of FIGS. 1 and 2 in more detail according to an example embodiment of the present disclosure;

    [0036] FIG. 9 schematically illustrates a synapse circuit in more detail according to an example embodiment;

    [0037] FIG. 10A is a flow diagram illustrating operations in a method of storing an eligibility trace according to an example embodiment of the present disclosure;

    [0038] FIG. 10B is a timing diagram representing variation of a conductance of a resistive memory device storing an eligibility trace according to an example embodiment of the present disclosure;

    [0039] FIG. 10C is a flow diagram illustrating operations in a method of storing a synaptic weight according to an example embodiment of the present disclosure;

    [0040] FIG. 10D is a timing diagram representing stored values of a synaptic weight according to an example embodiment of the present disclosure; and

    [0041] FIG. 11 is a cross-section view illustrating a transistor layer and metal stack forming part of a deep neural network according to an example embodiment of the present disclosure.

    DETAILED DESCRIPTION OF THE PRESENT EMBODIMENTS

    [0042] Like features have been designated by like references in the various figures. In particular, the structural and/or functional features that are common among the various embodiments may have the same references and may have identical structural, dimensional and material properties.

    [0043] Unless indicated otherwise, when reference is made to two elements connected together, this signifies a direct connection without any intermediate elements other than conductors, and when reference is made to two elements coupled together, this signifies that these two elements can be connected or they can be coupled via one or more other elements.

    [0044] In the following disclosure, unless indicated otherwise, when reference is made to absolute positional qualifiers, such as the terms “front”, “back”, “top”, “bottom”, “left”, “right”, etc., or to relative positional qualifiers, such as the terms “above”, “below”, “higher”, “lower”, etc., or to qualifiers of orientation, such as “horizontal”, “vertical”, etc., reference is made to the orientation shown in the figures.

    [0045] Unless specified otherwise, the expressions “around”, “approximately”, “substantially” and “in the order of” signify within 10%, and preferably within 5%.

    [0046] FIG. 1 schematically illustrates a system 100 for reinforcement learning according to an example embodiment of the present disclosure. The system 100 comprises an agent (AGENT) 102, implemented for example by a data processing device, and an environment (ENVIRONMENT) 104, implemented for example by one or more actuators and one or more sensors. The agent 102 is for example configured to generate actions A.sub.t (ACTION A.sub.t), and to apply these actions to the environment, and in particular to the one or more actuators of the environment. The one or more sensors for example generate signals representing a state S.sub.t+1 (STATE S.sub.t+1) and a reward R.sub.t+1 (REWARD R.sub.t+1) resulting from each action A.sub.t. These state and reward signals are processed by the agent 102 in order to generate the next action A.sub.t+1 to be applied to the environment.

    [0047] During a learning phase, reinforcement learning is used in order for the agent to learn a policy for selecting actions based on the rewards received from the actions applied to the environment. The agent updates its policy as a function of the actions and the rewards in order to improve its future expected discounted reward. While there are many manners in which the policy implemented by the agent 102 can be described and updated, there is a recent trend towards the use of a deep neural network that acts as a policy approximation. Such solutions are known as deep reinforcement learning.

    [0048] In some embodiments, the agent applies TD-lambda temporal difference learning. In such a case, the neural network maintains an internal representation of a value function V(s), which gives the value of being in each state in view of the current state. The neural network is configured to learn the value function V(s) based on the state information and on the rewards. For example, the policy is updated by iteratively differentiating the difference between the predicted and received value with respect to the synaptic weights of the current policy. This difference is known as the temporal difference (TD) error.

    [0049] In other embodiments, the agent uses a function Q(s,a). In such a case, the neural network is configured to learn, based on the state information and on the rewards, a function Q that gives the value of each action that may be taken while in the current state. The training involves, for example, minimizing the difference (TD error) between the predicted Q-value, i.e. the one that resulted in a given action being taken, and the received reward plus the maximum Q value that is selected next as a function of the resulting state S.sub.t+1.

    [0050] FIG. 2 schematically illustrates the system 100 of FIG. 1 in more detail according to an example embodiment in which the agent 102 is implemented by an artificial neural network, such as a deep neural network (DNN) 200. The DNN 200 comprises a plurality of layers of neurons 202 interconnected by synapses 204. An input layer of the network for example receives the state S.sub.t. Where the neural network approximates a state-value function, the output of the output layer is a scalar number corresponding to the predicted value of that state. Where the neural network approximates a state-action function, the output of the output layer is a vector of values, one for each possible action A.sub.t (ACTION A.sub.t). From this output vector, the corresponding action taken by the network can be deduced using the maximum argument. This action A.sub.t is then taken, which updates the environment.

    [0051] For example, in one embodiment, the neural network implements a value function V(s), and the outputs indicate the value of being in a given state. A state-value network for example has one or more output neurons.

    [0052] In state-action value functions Q(s,a), a neural network for example has multiple output neurons each of which corresponds to a different action that can be taken in that state. The highest output for example indicates the action that should be taken. A corresponding action A.sub.t is for example selected and applied to the environment in order to move to this next state.

    [0053] The environment 104 provides the next state S.sub.t+1 to the input of the DNN 200, and also supplies the reward R.sub.t+1 to the agent 102, as will be described in more detail below.

    [0054] FIG. 3 is a flow diagram illustrating an example of operations in a method of TD-lambda temporal difference learning according to an example embodiment of the present disclosure. This method is for example applied by the agent 102 of FIGS. 1 and 2.

    [0055] In an operation 301 (INITIALISE θ and e), matrices θ and e stored by the agent 102 are initialized. For example, the matrix θ corresponds to a parameter matrix of the DNN 200, defining the synaptic weights of the synapses of the DNN 200. The matrix e corresponds for example to an eligibility matrix of the DNN 200, and defines for example, for each synapse, an eligibility trace of the synapse for use in updating the corresponding synaptic weight.

    [0056] After the initialization operation 301, an iterative learning phase is for example entered, each iteration involving operations 302 to 310.

    [0057] In the operation 302 (RECEIVE STATE S.sub.t AND ANY REWARD R.sub.t), the agent 102 for example receives from the environment, at a timestep t, the state S.sub.t of the environment, and any reward R.sub.t occurring during the timestep t. Indeed, given that rewards may occur after a certain time delay with respect to actions, there may be no rewards received during some timesteps.

    [0058] In the operation 303 (FORWARD PROPAGATE STATE S.sub.t), a current state S.sub.t of the environment is forward propagated through the DNN 200. The state is thus modified by the parameter matrix θ of the DNN 200, and values V.sub.t at the output layer of the DNN 200 are thus generated.

    [0059] In the operation 304 (DETERMINE+APPLY ACTION A.sub.t), the action to be applied to the environment 104, based on the output values V.sub.t resulting from the state S.sub.t, is determined and applied to the environment 104, for example via one or more actuators of the environment 104. For example, the action A.sub.t is one that is associated with a neuron of the output layer of the DNN 200 having the highest value.
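
    The selection of the action associated with the highest output value can be sketched as follows. This is a minimal illustration only; the function name select_action and the example output values are assumptions and do not appear in the disclosure.

```python
def select_action(output_values):
    """Return the index of the output neuron having the highest value."""
    return max(range(len(output_values)), key=lambda i: output_values[i])


# Hypothetical output-layer values for three candidate actions.
action = select_action([0.1, 0.7, 0.3])
print(action)
```

    Here the second output neuron (index 1) has the highest value, so the action associated with it is selected.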

    [0060] In the operations 305 and 306, the eligibility matrix e is for example updated based on the output values V.sub.t resulting from the forward propagation of the state S.sub.t in the operation 303.

    [0061] In the operation 305 (BACK PROPAGATE DERIVATIVE ∂V.sub.t/∂θ.sub.t), the derivatives ∂V.sub.t/∂θ.sub.t of the output values V.sub.t with respect to the model defined by the synaptic weights θ.sub.t are backpropagated through the neural network. For each synapse, the derivative ∂V.sub.t/∂θ.sub.t represents in particular how each synaptic weight θ impacts the calculation of the value function V.sub.t. This is a different approach from a standard learning technique in a neural network, in which it is the derivative of the cost with respect to the model, or the loss with respect to the labelled output, that is back propagated through the network.

    [0062] In the operation 306 (UPDATE ELIGIBILITY e), the derivative ∂V.sub.t/∂θ.sub.t of each synapse is used to update the eligibility trace e of the synapse. For example, the new eligibility value e.sub.t for timestep t is generated based on the following equation:

    e.sub.t=γλe.sub.t−1+∂V.sub.t/∂θ.sub.t   [Math 1]

    where e.sub.t−1 is the previous value of the eligibility trace at the timestep t−1, γ is a discounting rate, and λ is a decay rate defining how quickly the eligibility trace decays. The discounting rate γ and the decay rate λ are for example each equal to between 0 and 1, and in some cases either or both is for example equal to between 0.8 and 0.99.
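
    As a minimal software sketch of the eligibility update of [Math 1], assuming that the traces and the back-propagated derivatives are held as plain lists with one entry per synapse, the update could be written as follows. The function name update_eligibility, the default values of γ and λ, and the derivative values are illustrative assumptions.

```python
def update_eligibility(e_prev, grad_v, gamma=0.9, lam=0.9):
    """Apply [Math 1] per synapse: e_t = gamma * lam * e_{t-1} + dV_t/dtheta_t."""
    return [gamma * lam * e + g for e, g in zip(e_prev, grad_v)]


# Three synapses, two timesteps (derivative values are made up for illustration).
e = [0.0, 0.0, 0.0]
for grad_v in ([1.0, 0.0, 0.5], [0.0, 0.0, 0.5]):
    e = update_eligibility(e, grad_v)
print(e)
```

    With γλ = 0.81, a synapse that contributed only at the first timestep keeps 81% of its trace at the second, while a synapse that contributes at both timesteps accumulates its trace.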

    [0063] In the operations 307 and 308, the parameter matrix θ is updated based on the output values V.sub.t resulting from the forward propagation of the state S.sub.t in the operation 303, and also based on the output values V.sub.t−1 resulting from the forward propagation of the state S.sub.t−1 during the operation 303 of the previous iteration, in other words at the timestep t−1.

    [0064] In operation 307 (CALCULATE TD ERROR δ.sub.t), a temporal difference error value δ.sub.t is calculated based on any reward R.sub.t received from the environment during the timestep t. For example, in one embodiment, the TD error value δ.sub.t is calculated based on the following equation:


    δ.sub.t=R.sub.t+γV.sub.t−V.sub.t−1   [Math 2]

    where γ is the discounting rate, V.sub.t represents the output of the value function during the timestep t, and V.sub.t−1 represents the outputs of the value function during the previous iteration, i.e. the timestep t−1. For example, in the case of a value function V(s), the output value V.sub.t is a scalar value indicating the value of the state. After simulating multiple potential states, an action is selected that leads to the best next state, in line with the NN predictions. Thus, the subtraction γV.sub.t−V.sub.t−1 is a subtraction of scalars. The TD error is thus based on a difference between the predicted value V.sub.t−1 of the neural network outputs at the previous iteration, and the discounted observed output γV.sub.t during the current iteration, plus the observed reward. In case of no reward, the TD error is only based on the difference, and the weights of the neural network are still updated. In the case of Q(s,a) value functions, the output is a vector corresponding to the actions. In this case, γQ.sub.t−Q.sub.t−1 is also a subtraction of scalars, for example only taking the value that corresponded to the predicted Q of the action that was actually taken.
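
    The calculation of [Math 2] can be sketched as follows, for both the scalar value function V(s) and the Q(s,a) case in which only the Q-values of the actions actually taken enter the subtraction. The function names, the default discounting rate and the numerical values are assumptions chosen for illustration.

```python
def td_error(reward, v_t, v_prev, gamma=0.9):
    """[Math 2]: delta_t = R_t + gamma * V_t - V_{t-1}."""
    return reward + gamma * v_t - v_prev


def td_error_q(reward, q_t, action_t, q_prev, action_prev, gamma=0.9):
    """Same form for Q(s,a), using only the Q-values of the actions taken."""
    return reward + gamma * q_t[action_t] - q_prev[action_prev]


# No reward during the timestep: the error is only the discounted difference.
delta_no_reward = td_error(0.0, v_t=0.5, v_prev=0.4)
# A reward of 1 is received during the timestep.
delta_with_reward = td_error(1.0, v_t=0.0, v_prev=0.7)
# Q case: action 1 is taken at timestep t, action 0 was taken at timestep t-1.
delta_q = td_error_q(0.0, [0.2, 0.6], 1, [0.5, 0.1], 0)
print(delta_no_reward, delta_with_reward, delta_q)
```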

    [0065] In an operation 308 (UPDATE SYNAPTIC WEIGHTS θ), the parameter matrix θ of the DNN is for example updated based on the eligibility matrix e updated in the operation 306, and based on the temporal difference error value δ.sub.t calculated in operation 307. For example, each weight of the parameter matrix θ is updated based on the following equation:


    θ.sub.t=θ.sub.t−1+αδ.sub.te.sub.t   [Math 3]

    where θ.sub.t is the updated synaptic weight, θ.sub.t−1 is the previous synaptic weight, and α is a learning rate, for example equal to between 1e-6 and 1e-4, and for example equal to or less than 1e-5. In some embodiments, the value of α is chosen such that the term αδ.sub.te.sub.t modifies the synaptic weight θ.sub.t−1 by a desired quantity, corresponding for example to a few percent, for example by between 0.1 and 3 percent. The factor αδ.sub.t is for example a scalar value that is the same for all the synapses of the network.
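
    A minimal sketch of the weight update of [Math 3], assuming the weights and eligibility traces are held as lists with one entry per synapse, is given below. The function name update_weights and the numerical values are illustrative assumptions; the scalar factor αδ.sub.t is computed once and shared by all synapses, as described above.

```python
def update_weights(theta_prev, e_t, delta_t, alpha=1e-5):
    """[Math 3]: theta_t = theta_{t-1} + alpha * delta_t * e_t, per synapse."""
    step = alpha * delta_t  # scalar factor, the same for every synapse
    return [theta + step * e for theta, e in zip(theta_prev, e_t)]


theta = update_weights([0.5, -0.2], e_t=[2.0, 0.0], delta_t=0.1)
print(theta)
```

    A synapse with a zero eligibility trace, such as the second one here, is left unchanged regardless of the TD error.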

    [0066] In an operation 309 (END LEARNING PHASE?), it is determined whether a stop condition has been met in order to stop the learning phase. For example, the stop condition may be met after a certain number of iterations of the algorithm, or once the TD error δ.sub.t, for example after application of a low-pass filter, falls below a given threshold. If the stop condition is not met (branch N), a new iteration is started, involving an operation 310 (t=t+1) in which t is incremented, and thus the next timestep is considered. The method then returns to the operation 302, and the operations 302 to 309 are for example repeated. Once the stop condition of operation 309 is met (branch Y), the next operation 311 (FUNCTIONAL PHASE) for example involves switching from the learning phase to a functional phase in which the parameter matrix θ for example becomes fixed, and the eligibility matrix e is no longer used.

    [0067] While FIG. 3 illustrates a method based on discrete learning and functional phases, in alternative embodiments the method of FIG. 3 could be adapted to a continuous learning approach in which the agent continues to learn throughout its lifetime.

    [0068] While in the example of FIG. 3, the eligibility matrix e is updated in each iteration before the parameter matrix θ is updated, in alternative embodiments the parameter matrix θ could be updated before the eligibility matrix e, for example before the forward propagation step 303.
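
    The iterative learning phase of FIG. 3 can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the disclosed circuit: the environment is a toy that cycles through four states and gives a reward of 1 each time state 0 is re-entered; the value function is linear with one-hot inputs, so that ∂V.sub.t/∂θ.sub.t is simply the one-hot state vector; the learning rate is far larger than the example values given above, so that the toy settles quickly; and the weight update is applied before the eligibility trace is refreshed, which is the alternative ordering mentioned above. The function name td_lambda_cycle and all parameter defaults are hypothetical.

```python
def td_lambda_cycle(n_states=4, gamma=0.9, lam=0.9, alpha=0.05,
                    max_steps=20000, threshold=1e-4, smoothing=0.99):
    theta = [0.0] * n_states   # operation 301: initialise the weights ...
    e = [0.0] * n_states       # ... and the eligibility traces
    filtered = 1.0             # low-pass filtered |TD error|, started high
    state, v_prev, steps = 0, None, 0
    for t in range(max_steps):
        steps = t + 1
        # Operation 302: receive the state and any reward for this timestep.
        reward = 1.0 if (t > 0 and state == 0) else 0.0
        # Operation 303: forward propagation; with one-hot inputs V_t = theta[state].
        v_t = theta[state]
        if v_prev is not None:
            # Operation 307: TD error [Math 2].
            delta = reward + gamma * v_t - v_prev
            # Operation 308 [Math 3], applied before the trace is refreshed.
            theta = [th + alpha * delta * el for th, el in zip(theta, e)]
            # Operation 309: stop once the filtered TD error is small enough.
            filtered = smoothing * filtered + (1.0 - smoothing) * abs(delta)
            if filtered < threshold:
                break
        # Operations 305 and 306 [Math 1]: decay every trace, then add the
        # derivative, which is 1 for the active state and 0 elsewhere.
        e = [gamma * lam * el for el in e]
        e[state] += 1.0
        v_prev = v_t
        state = (state + 1) % n_states   # operation 310: next timestep
    return theta, steps


theta, steps = td_lambda_cycle()
print(steps, [round(th, 3) for th in theta])
```

    In this toy, the state visited just before the reward ends up with the highest learned value, and the values of the earlier states are successively discounted by γ.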

    [0069] Furthermore, while in the example of FIG. 3 the neural network implements a value function indicating the value V of being in each state, in alternative embodiments the neural network could implement a function indicating, at the outputs of the network, the value Q corresponding to an estimation of future expected discounted reward associated with each action. In such a case, the values V.sub.t and V.sub.t−1 are for example replaced by Q.sub.t and Q.sub.t−1. The scalar values of Q used in the equation correspond to the predicted Q-values of the action that was taken.

    [0070] FIG. 4 illustrates the DNN 200 of FIG. 2 in more detail according to an example in which it is implemented by a multi-layer perceptron DNN architecture, and in which the network implements a value function V.

    [0071] The DNN architecture 200 according to the example of FIG. 4 comprises three layers, in particular an input layer (INPUT LAYER), a hidden layer (HIDDEN LAYER), and an output layer (OUTPUT LAYER). In alternative embodiments, there could be more than one hidden layer. Each layer for example comprises a number of neurons. For example, the DNN architecture 200 defines a model in a 2-dimensional space, and there are thus two visible neurons in the input layer receiving the corresponding values S1 and S2 representing the input state S.sub.t. The model has a hidden layer with seven hidden neurons, and thus corresponds to a matrix of dimensions ℝ.sup.2×7. The DNN architecture 200 of FIG. 4 corresponds to a value network, and the number of neurons in the output layer thus corresponds to the number of states. In the example of FIG. 4, there are three neurons in the output layer. In an alternative example, the DNN 200 could implement the action value function Q, and the number of output states would then correspond to the number of actions.

    [0072] The policy V=Π.sub.θ(S) applied by the DNN architecture 200 is an aggregation of functions, comprising an associative function g.sub.n within each layer, these functions being connected in a chain to map V=Π.sub.θ(S)=g.sub.n( . . . (g.sub.2(g.sub.1(S)) . . . )). There are just two such functions in the simple example of FIG. 4, corresponding to those of the hidden layer and the output layer.

    [0073] Each neuron of the hidden layer receives the signal from each input neuron, a corresponding synaptic weight θ.sub.j.sup.i being applied to each neuron j of the hidden layer from each input neuron i of the input layer. FIG. 4 illustrates the synaptic weights θ.sub.1.sup.1 to θ.sub.7.sup.1 applied to the outputs of a first of the input neurons to each of the seven hidden neurons.

    [0074] Similarly, each neuron of the output layer receives the signal from each neuron of the hidden layer, a corresponding synaptic weight θ.sub.j.sup.k being applied to each neuron k of the output layer from each neuron j of the hidden layer. FIG. 4 illustrates the synaptic weights θ.sub.1.sup.1 to θ.sub.1.sup.3 applied between the output of a top neuron of the hidden layer and each of the three neurons of the output layer.

    [0075] FIG. 5 illustrates an array 500 of synapse circuits 502, 504 interconnecting layers N (LAYER N) and N+1 (LAYER N+1) of a deep neural network, such as the network 200 of FIG. 2 or FIG. 4. For example, the layer N is the input layer of the network, and the layer N+1 is a first hidden layer of the network. In another example, the layers N and N+1 are both hidden layers, or the layer N is a last hidden layer of the network, and the layer N+1 is the output layer of the network.

    [0076] In the example of FIG. 5, the layers N and N+1 each comprise four neurons, although in alternative embodiments there could be a different number of neurons in either or both layers. The array 500 comprises a sub-array of synapse circuits 502, which each connect a corresponding neuron of the layer N to a corresponding neuron of the layer N+1, and a sub-array of synapse circuits 504, which each connect a corresponding neuron of the layer N to a corresponding neuron of the layer N+1. The synapse circuits 502 store the synaptic weights of the parameter matrix θ, while the synapse circuits 504 store the eligibility traces of the eligibility matrix e.

    [0077] Each of the synapse circuits 502 for example comprises a non-volatile memory device storing, in the form of a conductance, a synapse weight g.sub.θ associated with the synapse circuit. The memory device of each synapse circuit 502 is for example implemented by a PCM device, or other type of resistive random-access memory (ReRAM) device, such as an oxide RAM (OxRAM) device, which is based on so-called “filamentary switching”. The device for example has low or negligible drift of its programmed level of conductance over time. In the case of a PCM device, the device is for example programmed with relatively high conductance/low resistance states, which are less affected by drift than the low conductance/high resistance states. The synapse circuits 502 are for example coupled at each intersection between a pre-synaptic neuron of the layer N and a post-synaptic neuron of the layer N+1 in a cross-bar fashion, as known by those skilled in the art. For example, a blow-up view in FIG. 5 illustrates an example of this intersection for the synapse circuits 502, a resistive memory device 506 being coupled in series with a transistor 508 between a line 510 coupled to a corresponding pre-synaptic neuron, and a line 512 coupled to a corresponding post-synaptic neuron. The transistor 508 is for example controlled by a selection signal SEL_θ generated by a control circuit (not illustrated in FIG. 5).

    [0078] During the forward propagation of the state S.sub.t through the DNN 200, each neuron n of the layer N+1 for example receives an activation vector equal to S.sub.in·W, where S.sub.in is the input vector from the previous layer, and W are the weights of the parameter matrix θ associated with the synapses leading to the neuron n. A voltage is for example applied to each of the lines 512, which is for example coupled to the top electrode of each resistive device 506 of a column and to the neuron n. The selection transistors 508 are then for example activated, such that a current will flow through each device 506 equal to V×g.sub.θ, where V is the top electrode voltage, and g.sub.θ is the conductance of the device 506. The current flowing through the line 512 will thus be the addition of the current flowing through each device 506 of the column, and the result is a weighted sum operation. A similar operation for example occurs at each neuron of each layer of the network, except in the input layer.
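
    The weighted sum performed by a column of resistive devices can be sketched numerically as follows. This is a simplified model assuming ideal devices obeying Ohm's law, with the voltage across each device standing in for the signal from the corresponding pre-synaptic neuron; the function name column_currents, the data layout and the numerical values are assumptions.

```python
def column_currents(voltages, conductances):
    """Sum, for each column n, the device currents V_i * g[i][n].

    voltages[i] is the voltage across the devices fed by pre-synaptic neuron i;
    conductances[i][n] is the conductance of the device coupling pre-synaptic
    neuron i to post-synaptic neuron n.
    """
    n_columns = len(conductances[0])
    return [sum(v * row[n] for v, row in zip(voltages, conductances))
            for n in range(n_columns)]


# Two pre-synaptic neurons and two post-synaptic neurons; voltages in volts,
# conductances in microsiemens, so the resulting currents are in microamperes.
currents = column_currents([0.1, 0.2], [[10.0, 20.0], [30.0, 40.0]])
print(currents)
```

    Each entry of the result is the total current on one column line, that is, the weighted sum received by one post-synaptic neuron.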

    [0079] Each of the synapse circuits 504 for example comprises a volatile memory device storing, in the form of a conductance, a synapse eligibility value g.sub.e associated with the synapse circuit. The memory device of each synapse circuit 504 is for example implemented by a PCM device with pronounced drift behavior, or another type of resistive memory having a conductance decay over time, such as a silver-oxide based conductive bridge RAM element. In the case of a PCM device, the device is for example programmed with relatively low conductance/high resistance states, which have a more pronounced drift than the high conductance/low resistance states. The synapse circuits 504 are for example coupled at each intersection between a pre-synaptic neuron of the layer N and a post-synaptic neuron of the layer N+1 in a cross-bar fashion. For example, a blow-up view in FIG. 5 illustrates an example of this intersection for the synapse circuits 504, a resistive memory device 516 being coupled in series with a transistor 518 between a line 520 coupled to a corresponding pre-synaptic neuron, and a line 522 coupled to a corresponding post-synaptic neuron. The transistor 518 is for example controlled by a selection signal SEL_e generated by the control circuit.

    [0080] The conductances of the resistive memory elements of the pair of synapse circuits 502, 504 coupling a same pair of neurons are for example used in a complementary fashion during the updating of the synapse weight g.sub.θ, as represented by a dashed arrow 524 in FIG. 5. Indeed, the conductance g.sub.e is used to update the synaptic weight θ in the operation 308 of FIG. 3. This exchange of information between the memory devices of the synapse circuits 502, 504 is for example controlled by a synapse control circuit (SYNAPSE CTRL) 528, described in more detail below with reference to FIG. 9. The conductance g.sub.θ is also used indirectly during the updating of the conductance g.sub.e. Indeed, the conductance g.sub.θ is used during forward propagation of the state S.sub.t through the DNN 200 to generate the outputs V of the network, and the derivatives of these outputs V are then back propagated and used during the operation 306 of FIG. 3 to update the eligibility value g.sub.e.

    [0081] In some embodiments, the sub-arrays of synapse circuits 502, 504 are overlaid such that the corresponding synapse circuits 502, 504 are relatively close, permitting a local updating of synaptic weight g.sub.θ of the corresponding synapse circuits. For example, the sub-arrays are integrated in a same wafer or structure, as will be described in more detail below with reference to FIG. 11.

    [0082] The type of resistive memory used to implement the memory devices 506, 516 of the synapse circuits 502 and 504 is for example chosen such that while programmed conductance levels of the memory devices storing the conductances g.sub.θ decay relatively little over time, the conductance levels of the memory devices storing the conductances g.sub.e have a relatively high rate of decay. For example, the two memory devices 506, 516 of the synapse circuits 502, 504 are implemented by different technologies of resistive memory device, one providing non-volatile storage, and the other providing volatile storage with a relatively high decay rate. Alternatively, the two memory devices 506, 516 of the synapse circuits 502, 504 are implemented by the same technology of resistive memory device, such as PCM technology, and the decay rates are varied between the devices by other means, such as by using different conductance ranges.

    [0083] The use of a relatively high conductance decay rate for the memory device 516 storing the conductance g.sub.e provides a simple and effective implementation of the decay rate λ, without the need of further circuitry such as timers, etc. Furthermore, it for example allows the multiplication of the eligibility value e with the learning rate α and the TD error δ.sub.t in an analog manner, leading to a simple and low-power solution.
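
    The analog multiplication can be illustrated with a simple numerical model. This sketch assumes an ideal device obeying Ohm's law, so that applying a voltage representing the product of the learning rate α and the TD error δ.sub.t across the device storing g.sub.e yields a current proportional to αδ.sub.te. It further assumes, for illustration only, that this current is compared with a single symmetric threshold and that the weight conductance g.sub.θ is changed by a fixed step; the function names, threshold, step size and device values are hypothetical.

```python
def read_eligibility(drive_voltage, g_e):
    """Ohm's law read-out: output current = applied voltage x stored conductance."""
    return drive_voltage * g_e


def weight_step(current, threshold, step):
    """Compare the read-out current with +/- threshold; return the change to g_theta."""
    if current > threshold:
        return step
    if current < -threshold:
        return -step
    return 0.0


g_theta, g_e = 50e-6, 0.5e-6   # siemens; illustrative values only
drive = 0.2                    # volts, standing in for alpha * delta_t
current = read_eligibility(drive, g_e)
g_theta += weight_step(current, threshold=50e-9, step=1e-6)
print(current, g_theta)
```

    Because g.sub.e decays on its own between read-outs, a synapse that has not contributed recently produces a smaller current for the same applied voltage and is therefore less likely to have its weight reprogrammed.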

    [0084] While in FIG. 5 the sub-array of synapse circuits 504 has been illustrated arranged in a similar configuration to the synapse circuits 502, it will be apparent to those skilled in the art that any arrangement that permits the memory cells of the circuit to be accessed and selectively programmed could be implemented. For example, rather than having orthogonal source and bit lines, the source and bit lines could be parallel to each other, an orthogonal word line for example being used to select the gate of transistors.

    [0085] The drift of a PCM device will now be described in more detail with reference to FIGS. 6 and 7.

    [0086] FIG. 6 is a graph illustrating an example of conductance drift of a phase change memory device over time. In particular, for a PCM device that has its resistance state reset to a high resistive state (HRS) at a time t0 and is left drifting for 30 seconds, it can be observed that the conductivity presents a power law decay, the time-constant of which depends on the reset conditions. In the example embodiment, the conductance is at around 0.35 μS after 2 s, and has fallen to around 0.27 μS after 7 s, and to around 0.255 μS after 12 s. Thus, the conductance drift substantially follows a relation of 1/t.

    [0087] The phase-change memory devices are for example chalcogenide-based devices, in which the resistive switching layer is formed of polycrystalline chalcogenide, placed in contact with a heater.

    [0088] As known by those skilled in the art, a reset operation of a PCM device involves applying a relatively high current through the device for a relatively short duration. For example, the duration of the current pulse is of less than 10 ns. This causes a melting of a region of a resistive switching layer of the device, which then changes from a crystalline phase to an amorphous phase, and then cools without recrystallizing. This amorphous phase has a relatively high electrical resistance. Furthermore, this resistance increases with time following the reset operation, corresponding to a decrease in the conductance of the device. Such a drift is for example particularly apparent when the device is reset using a relatively high current, leading to a relatively high initial resistance, and a higher subsequent drift. Those skilled in the art will understand how to measure the drift that occurs based on different reset states, i.e. different programming currents, and will then be capable of choosing a suitable programming current that results in an amount of drift that can be exploited as described herein.

    [0089] As also known by those skilled in the art, a set operation of a PCM device involves applying a current that is lower than the current applied during the reset operation, for a longer duration. For example, the duration of the current pulse is of more than 100 ns. This for example causes the amorphous region of the resistive switching layer of the device to change from the amorphous phase back to the crystalline phase as the current reduces. The resistance of the device is thus relatively low.

    [0090] FIG. 7 is a graph illustrating, on a logarithmic scale, an example of a drift in a resistance of a phase-change memory device over time in the set (SET) and reset (RESET) states. It can be seen that, whereas the resistance varies relatively little in the set state, there is a relatively high increase over time in the reset state. For example, the resistance R in both the set and reset states substantially follows the model R=R.sub.0(t/t.sub.0).sup.v, where R.sub.0 is the initial resistance at time t.sub.0. In the case of the set state, the parameter v is for example of less than 0.01, whereas for the reset state, the parameter v is for example over 0.1, and for example equal to around 0.11.
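
    As an illustrative sketch only, the drift model R=R.sub.0(t/t.sub.0).sup.v described above can be evaluated numerically as follows. The function names, the default t.sub.0 of 1 s and the unit initial values are assumptions made for the example; only the exponents of around 0.01 and 0.11 are taken from the description.

```python
def drifted_resistance(r0: float, t: float, t0: float = 1.0, v: float = 0.11) -> float:
    """Resistance at time t under the power-law model R = R0 * (t / t0) ** v.

    r0 is the resistance at the reference time t0; v is the drift exponent.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    if t0 <= 0 or t < t0:
        raise ValueError("the model is only evaluated for t >= t0 > 0")
    return r0 * (t / t0) ** v


def drifted_conductance(g0: float, t: float, t0: float = 1.0, v: float = 0.11) -> float:
    """Conductance is the reciprocal of resistance, so it decays as (t / t0) ** -v."""
    if t0 <= 0 or t < t0:
        raise ValueError("the model is only evaluated for t >= t0 > 0")
    return g0 * (t / t0) ** (-v)


if __name__ == "__main__":
    # Compare one decade of drift in the reset state (v ~ 0.11)
    # and in the set state (v ~ 0.01), starting from a unit resistance.
    print(drifted_resistance(1.0, 10.0, v=0.11))
    print(drifted_resistance(1.0, 10.0, v=0.01))
```

    With these exponents, a device left to drift from t.sub.0 to ten times t.sub.0 would see its resistance rise by a factor of about 1.29 in the reset state, against about 1.02 in the set state.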

    [0091] FIG. 8 schematically illustrates the agent 102 of FIGS. 1 and 2 in more detail according to an example embodiment of the present disclosure. For example, in addition to the DNN 200, the agent 102 comprises a control circuit (CTRL) 802 that receives the state S.sub.t+1 and the reward R.sub.t+1 from the environment 104, and provides to the DNN 200 the state S.sub.t and a scalar value equal to αδ.sub.t. The control circuit 802 also for example provides the control signals SEL_θ and SEL_e to the DNN 200 to control the different phases.

    [0092] FIG. 9 schematically illustrates part of a synapse circuit in more detail according to an example embodiment, and illustrates in particular the memory devices 506, 516 of the synapse circuits 502, 504 respectively, which respectively store the conductances g.sub.θ and g.sub.e, and the synapse control circuit 528.

    [0093] During the operations 305 and 306 of FIG. 3, the derivative ∂V.sub.t/∂θ.sub.t associated with the neuron and resulting from the backpropagation through the network is for example provided to a programming circuit (PROG) 908, which generates a control signal Δg.sub.e for modifying the conductance of the memory device 516. In view of the drift over time of the conductance of the memory device 516, the new conductance thus becomes g.sub.e=γλg.sub.e.sub.t−1+Δg.sub.e, where γλ is represented by the decay rate of the memory device 516. Alternatively, in the case that the memory device 516 is capable of only being reset, a decision is for example made by the programming circuit 908 of whether or not to reset the resistive state of the device 516 based on the value of the derivative ∂V.sub.t/∂θ.sub.t. For example, this involves comparing the value of the derivative ∂V.sub.t/∂θ.sub.t with a threshold, and if the threshold is exceeded, the device 516 is reset, whereas otherwise no action is taken. It would also be possible to read a current value of the conductance γλg.sub.e.sub.t−1. In this case, γλg.sub.e.sub.t−1+Δg.sub.e can be evaluated and compared with a threshold in order to decide whether or not to reset the conductance of the memory device.
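
    The accumulating update g.sub.e=γλg.sub.e.sub.t−1+Δg.sub.e can be sketched in software as follows. This is a simplified model under stated assumptions: the device drift between two timesteps is replaced by a constant factor gamma_lambda, and the programmed increment Δg.sub.e is taken to be equal to the back-propagated derivative. The function and parameter names are hypothetical.

```python
from typing import Iterable, List


def eligibility_trace(derivatives: Iterable[float],
                      gamma_lambda: float,
                      g_e0: float = 0.0) -> List[float]:
    """Return the value of g_e after each timestep.

    Assumptions (not from the source): the decay per timestep is a constant
    factor gamma_lambda, and the increment delta_g_e equals the derivative.
    """
    if not 0.0 <= gamma_lambda <= 1.0:
        raise ValueError("gamma_lambda is expected to lie in [0, 1]")
    g_e = g_e0
    history = []
    for derivative in derivatives:
        # Decay of the previous trace, then addition of the new increment.
        g_e = gamma_lambda * g_e + derivative
        history.append(g_e)
    return history


if __name__ == "__main__":
    # A single non-zero derivative followed by pure decay.
    print(eligibility_trace([1.0, 0.0, 0.0], gamma_lambda=0.5))
```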

    [0094] During the operation 308 of FIG. 3, the memory device 516 for example receives the value αδ.sub.t, which is for example in the form of an analog voltage level generated by a digital to analog converter (DAC—not illustrated in FIG. 9). Applying this signal to the memory device 516, for example to its top electrode, causes a current to be generated that is a function of this voltage and of the conductance g.sub.e of the device 516. Thus, the current represents αδe.sub.t. The value αδe.sub.t is for example provided to a programming circuit (PROG) 910, which generates a control signal Δg.sub.θ for modifying the conductance of the corresponding memory device 506 based on the value αδe.sub.t. For example, the new conductance thus becomes g.sub.θ.sub.t=g.sub.θ.sub.t−1+Δg.sub.θ. While the above example is based on the use of an analog voltage level to represent αδ.sub.t, in alternative embodiments, it would also be possible to represent this as an analog current level, the voltage across the memory device then representing the output αδe.sub.t.
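
    The analog multiplication described above relies on Ohm's law: a voltage representing αδ.sub.t applied across a device of conductance g.sub.e yields a current proportional to αδe.sub.t. A minimal numerical sketch is given below. The `scale` parameter, which converts the read current into a conductance change Δg.sub.θ, stands in for the behavior of the programming circuit 910 and is a hypothetical calibration constant, as are the function names.

```python
def read_current(voltage: float, conductance: float) -> float:
    """Ohm's law, I = V * g: the device itself performs the multiplication."""
    return voltage * conductance


def updated_weight_conductance(g_theta_prev: float,
                               alpha_delta_voltage: float,
                               g_e: float,
                               scale: float = 1.0) -> float:
    """Return g_theta_t = g_theta_(t-1) + delta_g_theta.

    delta_g_theta is assumed to be proportional (factor `scale`) to the
    current read from the eligibility device; `scale` is hypothetical.
    """
    delta_g_theta = scale * read_current(alpha_delta_voltage, g_e)
    return g_theta_prev + delta_g_theta


if __name__ == "__main__":
    # Illustrative values in arbitrary units.
    print(updated_weight_conductance(1.0, 0.5, 2.0))
```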

    [0095] FIG. 10A is a flow diagram illustrating operations in a method of storing an eligibility trace to the memory device 516 of FIG. 9, according to an example in which a resistive state of the memory device is selectively reset.

    [0096] In an operation 1002, the value of the derivative ∂V.sub.t/∂θ.sub.t is compared to a threshold Th. If the threshold is exceeded (branch Y), the conductance g.sub.e of the memory device is reset in an operation 1004 (RESET g.sub.e). Otherwise (branch N), the conductance of the memory device 516 is not modified, as shown by an operation 1006 (DO NOTHING).

    [0097] FIG. 10B is a timing diagram representing variation of the conductance g.sub.e of the memory device 516 storing an eligibility trace as a function of time (TIME) according to an example embodiment, over three iterations corresponding to timesteps t1, t2 and t3. The conductance g.sub.e for example starts at an initial value INITIAL, and decays until the timestep t1. A value of the derivative ∂V.sub.t/∂θ.sub.t is then compared to the threshold Th, which is exceeded, and thus the conductance is reset to a reset level g.sub.e_rst. The conductance g.sub.e then for example decays until the timestep t2. This time the value of the derivative ∂V.sub.t/∂θ.sub.t does not exceed the threshold Th, and thus no action is taken, and the conductance g.sub.e continues to decay until the timestep t3. A value of the derivative ∂V.sub.t/∂θ.sub.t is then compared to the threshold Th, which is exceeded, and thus the conductance is reset again to the reset level g.sub.e_rst.
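
    The selective-reset behavior of FIGS. 10A and 10B can be sketched as follows. The sketch assumes a constant decay factor between timesteps in place of the actual device drift, and a direct comparison of the derivative (rather than its magnitude) with the threshold Th. All names and numerical values are hypothetical.

```python
from typing import Iterable, List


def step_trace(g_e: float, derivative: float, threshold: float,
               g_e_rst: float, decay: float) -> float:
    """One timestep: let g_e decay, then reset it if the threshold is exceeded."""
    g_e = g_e * decay              # drift since the previous timestep (assumed constant factor)
    if derivative > threshold:     # operation 1002, branch Y
        return g_e_rst             # operation 1004: RESET g_e
    return g_e                     # operation 1006: DO NOTHING


def simulate(initial: float, derivatives: Iterable[float], threshold: float,
             g_e_rst: float, decay: float) -> List[float]:
    """Return g_e just after each timestep has been processed."""
    g_e = initial
    history = []
    for derivative in derivatives:
        g_e = step_trace(g_e, derivative, threshold, g_e_rst, decay)
        history.append(g_e)
    return history


if __name__ == "__main__":
    # Threshold exceeded at t1 and t3 but not at t2, as in the example of FIG. 10B.
    print(simulate(initial=1.0, derivatives=[0.5, 0.05, 0.3],
                   threshold=0.1, g_e_rst=0.8, decay=0.5))
```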

    [0098] FIG. 10C is a flow diagram illustrating operations in a method of storing a synaptic weight to the memory device 506 of FIG. 9, according to an example in which the memory device 506 storing the synaptic weight θ is formed by two devices respectively having conductances g.sub.θ+ and g.sub.θ−. Each of these devices is for example of a technology permitting its conductance to be increased gradually using programming pulses, for example during a set operation. However, decreasing the conductance is for example performed by an abrupt reset operation. For example, the memory device is a PCM device or an OxRAM device. The method of FIG. 10C is for example implemented by the programming circuit 910 of FIG. 9.

    [0099] In an operation 1012, it is determined whether the output αδe.sub.t from the memory device 516 is positive or negative, indicating whether the synaptic weight θ should be increased or reduced. Indeed, in some embodiments, the parameters e.sub.t and/or δ may have positive or negative values. For example, this comparison is performed in an analog manner using a comparator. If the output αδe.sub.t is positive (branch Y), in an operation 1014 (NUMBER OF SET PULSES TO g.sub.θ+ PROPORTIONAL TO αδ.sub.te.sub.t), a number of SET pulses is applied to the memory device of conductance g.sub.θ+ in order to increase the conductance of this device. Alternatively, if the output αδe.sub.t is negative (branch N), in an operation 1016 (NUMBER OF SET PULSES TO g.sub.θ− PROPORTIONAL TO αδ.sub.te.sub.t), a number of SET pulses is applied to the memory device of conductance g.sub.θ− in order to increase the conductance of this device. The overall conductance g.sub.θ for example results from the combined conductances of the two memory devices, as will now be described with reference to FIG. 10D.

    [0100] FIG. 10D is a timing diagram representing examples of the conductances g.sub.θ− and g.sub.θ+ and of the corresponding value of the synaptic weight θ, equal for example to the difference between the conductances g.sub.θ+ and g.sub.θ−, plus an offset.

    [0101] Initially, it is assumed that both memory devices have a low conductance of g.sub.L, and that this corresponds to an intermediate value Vint of the synaptic weight θ.

    [0102] At a timestep t1, it is for example found that the output value αδe.sub.t1 is positive, and thus the conductance g.sub.θ+ is increased by an amount Δg.sub.θ1, for example by applying three consecutive current or voltage pulses to the corresponding memory device based on the magnitude of αδe.sub.t1, and the synaptic weight thus increases by a corresponding amount Δθ1.

    [0103] At a timestep t2, it is for example found that the output value αδe.sub.t2 is negative, and thus the conductance g.sub.θ− is increased by an amount Δg.sub.θ2, for example by applying two consecutive current or voltage pulses to the corresponding memory device based on the magnitude of αδe.sub.t2, and the synaptic weight thus decreases by a corresponding amount Δθ2.

    [0104] At a timestep t3, it is for example found that the output value αδe.sub.t3 is positive, and thus the conductance g.sub.θ+ is increased by an amount Δg.sub.θ3, for example by applying a single current or voltage pulse to the corresponding memory device based on the magnitude of αδe.sub.t3, and the synaptic weight thus increases by a corresponding amount Δθ3.
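
    The method of FIG. 10C and the sequence of FIG. 10D can be sketched as follows. This is an illustration under stated assumptions: each SET pulse is taken to increase the conductance by a constant amount `step`, and the number of pulses is obtained by rounding the magnitude of αδe.sub.t multiplied by a calibration constant `pulses_per_unit`. The class, its fields and the numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DifferentialWeight:
    """Synaptic weight stored as two conductances, g_plus and g_minus."""
    g_plus: float
    g_minus: float
    step: float          # conductance gained per SET pulse (assumed constant)
    offset: float = 0.0

    def weight(self) -> float:
        # The weight is the difference of the two conductances, plus an offset.
        return self.g_plus - self.g_minus + self.offset

    def apply(self, alpha_delta_e: float, pulses_per_unit: float) -> int:
        """Apply SET pulses according to the sign and magnitude of alpha*delta*e.

        Returns the number of pulses applied.
        """
        n_pulses = round(abs(alpha_delta_e) * pulses_per_unit)
        if alpha_delta_e > 0:        # operation 1014: pulses to g_plus
            self.g_plus += n_pulses * self.step
        elif alpha_delta_e < 0:      # operation 1016: pulses to g_minus
            self.g_minus += n_pulses * self.step
        return n_pulses


if __name__ == "__main__":
    # Both devices start at the same low conductance g_L (here 1.0, arbitrary units).
    w = DifferentialWeight(g_plus=1.0, g_minus=1.0, step=1.0)
    for output in (0.3, -0.2, 0.1):  # timesteps t1, t2, t3
        pulses = w.apply(output, pulses_per_unit=10.0)
        print(pulses, w.weight())
```

    Because both conductances can only be increased in this scheme, a practical implementation would presumably need to reset the pair from time to time before either device saturates; that step is not modeled here.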

    [0105] FIG. 11 is a cross-section view illustrating a transistor layer 1101 and a metal stack 1102 forming a portion 1100 of a deep neural network, and illustrates an example of the co-integration of two types of resistive memory devices. For example, such a structure is used to form the array 500 of FIG. 5 comprising the devices 506 and 516 of FIG. 9. The device 506 stores the synaptic weight θ and has relatively low conductance decay, for example corresponding to a non-volatile behavior, and the device 516 stores the eligibility trace e and for example has a relatively high conductance decay, for example corresponding to a volatile behavior.

    [0106] The transistor layer 1101 is formed of a surface region 1103 of a silicon substrate in which transistor sources and drains S, D, are formed, and a transistor gate layer 1104 in which gate stacks 1106 of the transistors are formed. Two transistors 1108, 1110 are illustrated in the example of FIG. 11.

    [0107] The metal stack 1102 comprises four interconnection levels 1112, 1113, 1114 and 1115 in the example of FIG. 11, each interconnection level for example comprising a patterned metal layer 1118 and metal vias 1116 coupling metal layers, surrounded by a dielectric material. Furthermore, metal vias 1116 for example extend from the source, drain and gate contacts of the transistors 1108, 1110 to the metal layer 1118 of the interconnection level 1112.

    [0108] In the example of FIG. 11, a resistive memory device 1120 of a first type is formed in the interconnection level 1113, and for example extends between the metal layers 1118 of the interconnection levels 1113 and 1114. This device 1120 for example corresponds to the device 516 of FIG. 9. A resistive memory device 1122 of a second type is formed in the interconnection level 1114, and for example extends between the metal layers 1118 of the interconnection levels 1114 and 1115. This device 1122 for example corresponds to the device 506 of FIG. 9.

    [0109] An advantage of the embodiments described herein is that TD-lambda temporal difference learning using a neural network to approximate a value function can be implemented by a DNN with relatively low complexity, using relatively compact and low-cost circuitry. In particular, the values of the synaptic weights θ can be updated locally at the synapses based on the corresponding eligibility trace e, leading to gains in terms of complexity, surface area, cost, and also power consumption.

    [0110] Various embodiments and variants have been described. Those skilled in the art will understand that certain features of these embodiments can be combined and other variants will readily occur to those skilled in the art. In particular, it will be apparent to those skilled in the art that, while certain examples of resistive memory types have been provided, other technologies could also be used to implement the memory devices of the DNN. Furthermore, while the example of a DNN has been described, the implementation of the agent is not limited to a DNN, and other types of neural networks could equally be used.

    [0111] Finally, the practical implementation of the embodiments and variants described herein is within the capabilities of those skilled in the art based on the functional description provided hereinabove.