DEVICE AND METHOD FOR TD-LAMBDA TEMPORAL DIFFERENCE LEARNING WITH A VALUE FUNCTION NEURAL NETWORK
20220374697 · 2022-11-24
Cpc classification
G06N7/01
PHYSICS
G06N3/006
PHYSICS
G06N3/049
PHYSICS
Abstract
The present disclosure relates to a synapse circuit of a neural network for performing TD-lambda temporal difference learning, the neural network approximating a value function, the synapse circuit comprising: a first resistive memory device (506); a second resistive memory device (516); and a synapse control circuit (528) configured to update a synaptic weight (g.sub.θ) of the synapse circuit by programming a resistive state of the first resistive memory device (506) based on a programmed conductance of the second resistive memory device (516).
Claims
1. A synapse circuit of a neural network for performing TD-lambda temporal difference learning, the neural network approximating a value function, the synapse circuit comprising: a first resistive memory device; a second resistive memory device; and a synapse control circuit configured to update a synaptic weight (g.sub.θ, g.sub.θ+, g.sub.θ−) of the synapse circuit by programming a resistive state of the first resistive memory device based on a programmed conductance of the second resistive memory device.
2. The synapse circuit of claim 1, wherein the second resistive memory device is configured to have a conductance γλ that decays over time.
3. The synapse circuit of claim 2, wherein the second resistive memory device is a phase-change memory device or a conductive bridging RAM element.
4. The synapse circuit of claim 1, wherein the synapse control circuit is further configured to update an eligibility trace of the synapse circuit by programming a resistive state of the second resistive memory device based on a back-propagated derivative ∂V.sub.t/∂θ.sub.t of an output value V.sub.t of the neural network.
5. The synapse circuit of claim 1, wherein the synapse control circuit is configured to update the synaptic weight (g.sub.θ, g.sub.θ+, g.sub.θ−) by applying a voltage or current level generated based on a temporal difference error δ to an electrode of the second resistive memory device to generate an output current or voltage level.
6. The synapse circuit of claim 5, wherein the synapse control circuit is further configured to compare the output current or voltage level with one or more thresholds, and to program the resistive state of the first resistive memory device based on the comparison.
7. An agent device of a TD-lambda temporal difference learning system, the agent device comprising a neural network comprising an input layer of neurons, one or more hidden layers of neurons, and an output layer of neurons, wherein: each neuron of the input layer is coupled to one or more neurons of a first hidden layer of the one or more hidden layers via a corresponding synapse circuit implemented by the circuit of claim 5.
8. The agent device of claim 7, further comprising a control circuit configured to generate the temporal difference error δ based on a reward signal R.sub.t received from the environment, and to provide the temporal difference error δ to the neural network.
9. The agent device of claim 8, wherein the control circuit provides to the neural network a signal representative of the product of the temporal difference error δ and a learning rate α.
10. A system for TD-lambda temporal difference learning comprising: the agent device of claim 7 configured to generate an output signal indicating an action A.sub.t to be applied to an environment based on an output of the neural network; one or more actuators configured to apply the action A.sub.t to the environment; and one or more sensors configured to detect a state S.sub.t+1 of the environment and a reward R.sub.t+1 resulting from the action A.sub.t.
11. A method of TD-lambda temporal difference learning, the method comprising: updating a synaptic weight (g.sub.θ, g.sub.θ+, g.sub.θ−) of a synapse circuit of a neural network, the neural network approximating a value function, the synapse circuit comprising: a first resistive memory device; a second resistive memory device; and a synapse control circuit, wherein updating the synaptic weight comprises programming, by the synapse control circuit, a resistive state of the first resistive memory device based on a programmed conductance of the second resistive memory device.
12. The method of claim 11, wherein the second resistive memory device is configured to have a conductance γλ that decays over time.
13. The method of claim 11, further comprising updating, by the synapse control circuit, an eligibility trace of the synapse circuit by programming a resistive state of the second resistive memory device based on a back-propagated derivative ∂V.sub.t/∂θ.sub.t of an output value V.sub.t of the neural network.
14. The method of claim 11, wherein updating the synaptic weight (g.sub.θ, g.sub.θ+, g.sub.θ−) comprises applying a voltage or current level generated based on a temporal difference error δ to an electrode of the second resistive memory device in order to generate an output current or voltage level.
15. The method of claim 14, further comprising comparing, by the synapse control circuit, the output current or voltage level with one or more thresholds, and programming the resistive state of the first resistive memory device based on the comparison.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] The foregoing features and advantages, as well as others, will be described in detail in the following description of specific embodiments given by way of illustration and not limitation with reference to the accompanying drawings, in which:
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
DETAILED DESCRIPTION OF THE PRESENT EMBODIMENTS
[0042] Like features have been designated by like references in the various figures. In particular, the structural and/or functional features that are common among the various embodiments may have the same references and may have identical structural, dimensional and material properties.
[0043] Unless indicated otherwise, when reference is made to two elements connected together, this signifies a direct connection without any intermediate elements other than conductors, and when reference is made to two elements coupled together, this signifies that these two elements can be connected or they can be coupled via one or more other elements.
[0044] In the following disclosure, unless indicated otherwise, when reference is made to absolute positional qualifiers, such as the terms “front”, “back”, “top”, “bottom”, “left”, “right”, etc., or to relative positional qualifiers, such as the terms “above”, “below”, “higher”, “lower”, etc., or to qualifiers of orientation, such as “horizontal”, “vertical”, etc., reference is made to the orientation shown in the figures.
[0045] Unless specified otherwise, the expressions “around”, “approximately”, “substantially” and “in the order of” signify within 10%, and preferably within 5%.
[0046]
[0047] During a learning phase, reinforcement learning is used in order for the agent to learn a policy for selecting actions based on the rewards received from the actions applied to the environment. The agent updates its policy as a function of the actions and the rewards in order to improve its future expected discounted reward. While there are many ways in which the policy implemented by the agent 102 can be described and updated, there is a recent trend towards the use of a deep neural network that acts as a policy approximation. Such solutions are known as deep reinforcement learning.
[0048] In some embodiments, the agent applies TD-lambda temporal difference learning. In such a case, the neural network maintains an internal representation of a value function V(s), which gives the value of being in each state in view of the current state. The neural network is configured to learn the value function V(s) based on the state information and on the rewards. For example, the policy is updated by iteratively differentiating the difference between the predicted and received value with respect to the synaptic weights of the current policy. This difference is known as the temporal difference (TD) error.
[0049] In other embodiments, the agent uses a function Q(s,a). In such a case, the neural network is configured to learn, based on the state information and on the rewards, a function Q that gives the value of each action that may be taken while in the current state. The training involves, for example, minimizing the difference (TD error) between the predicted Q-value, i.e. the one that resulted in a given action being taken, and the received reward plus the maximum Q value that is selected next as a function of the resulting state S.sub.t+1.
[0050]
[0051] For example, in one embodiment, the neural network implements a value function V(s), and the outputs indicate the value of being in a given state. A state-value network for example has one or more output neurons.
[0052] In state-action value functions Q(s,a), a neural network for example has multiple output neurons each of which corresponds to a different action that can be taken in that state. The highest output for example indicates the action that should be taken. A corresponding action A.sub.t is for example selected and applied to the environment in order to move to this next state.
[0053] The environment 104 provides the next state S.sub.t+1 to the input of the DNN 200, and also supplies the reward R.sub.t+1 to the agent 102, as will be described in more detail below.
[0054]
[0055] In an operation 301 (INITIALISE θ and e), matrices θ and e stored by the agent 102 are initialized. For example, the matrix θ corresponds to a parameter matrix of the DNN 200, defining the synaptic weights of the synapses of the DNN 200. The matrix e corresponds for example to an eligibility matrix of the DNN 200, and defines for example, for each synapse, an eligibility trace of the synapse for use in updating the corresponding synaptic weight.
[0056] After the initialization operation 301, an iterative learning phase is for example entered, each iteration involving operations 302 to 310.
[0057] In the operation 302 (RECEIVE STATE S.sub.t AND ANY REWARD R.sub.t), the agent 102 for example receives from the environment, at a timestep t, the state S.sub.t of the environment, and any reward R.sub.t occurring during the timestep t. Indeed, given that rewards may occur after a certain time delay with respect to actions, there may be no rewards received during some timesteps.
[0058] In the operation 303 (FORWARD PROPAGATE STATE S.sub.t), a current state S.sub.t of the environment is forward propagated through the DNN 200. The state is thus modified by the parameter matrix θ of the DNN 200, and values V.sub.t at the output layer of the DNN 200 are thus generated.
[0059] In the operation 304 (DETERMINE+APPLY ACTION A.sub.t), the action to be applied to the environment 104, based on the output values V.sub.t resulting from the state S.sub.t, is determined and applied to the environment 104, for example via one or more actuators of the environment 104. For example, the action A.sub.t is one that is associated with a neuron of the output layer of the DNN 200 having the highest value.
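By way of illustration only, the selection of the output neuron having the highest value in operation 304 can be sketched in Python as follows. The function name select_action and the tie-breaking rule (the lowest index wins) are assumptions made for this sketch and are not taken from the present disclosure.

```python
from typing import Sequence


def select_action(output_values: Sequence[float]) -> int:
    """Return the index of the output neuron with the highest value."""
    if not output_values:
        raise ValueError("the output layer must have at least one neuron")
    # max() returns the first maximal item, so ties go to the lowest index.
    return max(range(len(output_values)), key=lambda k: output_values[k])


# Hypothetical output values of a three-neuron output layer.
action = select_action([0.1, 0.7, 0.3])
```

The returned index would then be mapped to the action A.sub.t applied by the actuators; that mapping is application specific and is not modelled here.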
[0060] In the operations 305 and 306, the eligibility matrix e is for example updated based on the output values V.sub.t resulting from the forward propagation of the state S.sub.t in the operation 303.
[0061] In the operation 305 (BACK PROPAGATE DERIVATIVE ∂V.sub.t/∂θ.sub.t), the derivatives ∂V.sub.t/∂θ.sub.t of the output values V.sub.t with respect to the model defined by the synaptic weights θ.sub.t are backpropagated through the neural network. For each synapse, the derivative ∂V.sub.t/∂θ.sub.t represents in particular how each synaptic weight θ impacts the calculation of the value function V.sub.t. This is a different approach from a standard learning technique in a neural network, in which it is the derivative of the cost with respect to the model, or the loss with respect to the labelled output, that is back propagated through the network.
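As a purely illustrative sketch of operation 305, the following Python code computes the derivative of the output value V with respect to every synaptic weight of a network having one hidden layer and a single linear output neuron. The tanh activation, the absence of bias terms, and the names forward and value_gradients are assumptions made here; the present disclosure does not specify them.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def forward(w1: Matrix, w2: List[float], s: List[float]) -> Tuple[float, List[float]]:
    """Forward propagate state s; return the scalar value V and the hidden activations."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, s))) for row in w1]
    value = sum(w * h for w, h in zip(w2, hidden))
    return value, hidden


def value_gradients(w1: Matrix, w2: List[float], s: List[float]) -> Tuple[Matrix, List[float]]:
    """Back propagate dV/dtheta (the value itself, not a loss) to every weight."""
    _, hidden = forward(w1, w2, s)
    # Output layer: V = sum_j w2[j] * h[j], so dV/dw2[j] = h[j].
    grad_w2 = list(hidden)
    # Hidden layer: chain rule through tanh, whose derivative is 1 - h^2.
    grad_w1 = [[w2[j] * (1.0 - hidden[j] ** 2) * x for x in s]
               for j in range(len(w1))]
    return grad_w1, grad_w2


w1 = [[0.2, -0.3], [0.4, 0.1]]  # hypothetical input-to-hidden weights
w2 = [0.5, -0.6]                # hypothetical hidden-to-output weights
state = [1.0, 2.0]              # hypothetical state S_t
grad_w1, grad_w2 = value_gradients(w1, w2, state)
```

Each entry of grad_w1 and grad_w2 plays the role of ∂V.sub.t/∂θ.sub.t for one synapse, and no target or label enters the computation.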
[0062] In the operation 306 (UPDATE ELIGIBILITY e), the derivative ∂V.sub.t/∂θ.sub.t of each synapse is used to update the eligibility trace e of the synapse. For example, the new eligibility value e.sub.t for timestep t is generated based on the following equation:
e.sub.t=γλe.sub.t−1+∂V.sub.t/∂θ.sub.t [Math 1]
where e.sub.t−1 is the previous value of the eligibility trace at the timestep t−1, γ is a discounting rate, and λ is a decay rate defining how quickly the eligibility trace decays. The discounting rate γ and the decay rate λ are for example each between 0 and 1, and in some cases either or both are for example between 0.8 and 0.99.
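For illustration, the eligibility update of operation 306 can be sketched in Python as follows, assuming the accumulating-trace form in which the previous trace is scaled by the product γλ and the new derivative is added. The function name update_eligibility and the default values of 0.9 are assumptions chosen within the 0.8 to 0.99 range mentioned above.

```python
from typing import List


def update_eligibility(e_prev: List[float], grads: List[float],
                       gamma: float = 0.9, lam: float = 0.9) -> List[float]:
    """Return e_t = gamma * lam * e_(t-1) + dV/dtheta, one entry per synapse."""
    if len(e_prev) != len(grads):
        raise ValueError("one derivative is needed per eligibility trace")
    decay = gamma * lam
    return [decay * e + g for e, g in zip(e_prev, grads)]


trace = [0.0, 0.0]
trace = update_eligibility(trace, [1.0, 0.5])  # a fresh derivative arrives
trace = update_eligibility(trace, [0.0, 0.0])  # no new contribution: pure decay
```

In this software sketch the decay is an explicit multiplication at each timestep, which is the step that a memory device with a decaying conductance is intended to perform without additional circuitry.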
[0063] In the operations 307 and 308, the parameter matrix θ is updated based on the output values V.sub.t resulting from the forward propagation of the state S.sub.t in the operation 303, and also based on the output values V.sub.t−1 resulting from the forward propagation of the state S.sub.t−1 during the operation 303 of the previous iteration, in other words at the timestep t−1.
[0064] In operation 307 (CALCULATE TD ERROR δ.sub.t), a temporal difference error value δ.sub.t is calculated based on any reward R.sub.t received from the environment during the timestep t. For example, in one embodiment, the TD error value δ.sub.t is calculated based on the following equation:
δ.sub.t=R.sub.t+γV.sub.t−V.sub.t−1 [Math 2]
where γ is the discounting rate, V.sub.t represents the output of the value function during the timestep t, and V.sub.t−1 represents the output of the value function during the previous iteration, i.e. the timestep t−1. For example, in the case of a value function V(s), the output value V.sub.t is a scalar value indicating the value of the state. After simulating multiple potential states, an action is selected that leads to the best next state, in line with the NN predictions. Thus, the subtraction γV.sub.t−V.sub.t−1 is a subtraction of scalars. The TD error is thus based on a difference between the predicted value V.sub.t−1 of the neural network outputs at the previous iteration, and the discounted observed output γV.sub.t during the current iteration, plus the observed reward. In case of no reward, the TD error is only based on the difference, and the weights of the neural network are still updated. In the case of Q(s,a) value functions, the output is a vector corresponding to the actions. In this case, γQ.sub.t−Q.sub.t−1 is also a subtraction of scalars, for example only taking the value that corresponded to the predicted Q of the action that was actually taken.
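The TD error of [Math 2] and its Q(s,a) counterpart can be sketched in Python as follows, by way of example only. The function names are hypothetical. For the Q(s,a) case, the use of the maximum of the current output vector as the scalar Q.sub.t is an assumption based on the description of paragraph [0049].

```python
from typing import Sequence


def td_error(reward: float, value_now: float, value_prev: float,
             gamma: float = 0.9) -> float:
    """delta_t = R_t + gamma * V_t - V_(t-1), all terms being scalars."""
    return reward + gamma * value_now - value_prev


def td_error_q(reward: float, q_now: Sequence[float], q_prev: Sequence[float],
               action_prev: int, gamma: float = 0.9) -> float:
    """Scalar TD error for an action-value network.

    q_prev[action_prev] is the prediction for the action actually taken;
    max(q_now) is assumed to stand for the value of the resulting state.
    """
    return reward + gamma * max(q_now) - q_prev[action_prev]


# No reward during this timestep: the error depends only on the value difference.
delta_no_reward = td_error(0.0, value_now=1.0, value_prev=0.5)
```

A timestep without reward therefore still yields a non-zero error whenever γV.sub.t differs from V.sub.t−1, which is why the weights continue to be updated.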
[0065] In an operation 308 (UPDATE SYNAPTIC WEIGHTS θ), the parameter matrix θ of the DNN is for example updated based on the eligibility matrix e updated in the operation 306, and based on the temporal difference error value δ.sub.t calculated in operation 307. For example, each weight of the parameter matrix θ is updated based on the following equation:
θ.sub.t=θ.sub.t−1+αδ.sub.te.sub.t [Math 3]
where θ.sub.t is the updated synaptic weight, θ.sub.t−1 is the previous synaptic weight, and α is a learning rate, for example equal to between 1e-6 and 1e-4, and for example equal to or less than 1e-5. In some embodiments, the value of α is chosen such that the term αδ.sub.te.sub.t modifies the synaptic weight θ.sub.t−1 by a desired quantity, corresponding for example to a few percent, for example by between 0.1 and 3 percent. The factor αδ.sub.t is for example a scalar value that is the same for all the synapses of the network.
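As an illustrative sketch of [Math 3], the following Python code applies the update to a single synaptic weight and reports the resulting relative change, which can be compared with the range of 0.1 to 3 percent mentioned above. The function names and all numerical values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def update_weight(theta_prev: float, alpha: float, delta: float,
                  eligibility: float) -> float:
    """theta_t = theta_(t-1) + alpha * delta_t * e_t for one synapse."""
    return theta_prev + alpha * delta * eligibility


def relative_change_percent(theta_prev: float, alpha: float, delta: float,
                            eligibility: float) -> float:
    """Size of the update as a percentage of the previous weight."""
    if theta_prev == 0.0:
        raise ValueError("relative change is undefined for a zero weight")
    return 100.0 * abs(alpha * delta * eligibility) / abs(theta_prev)


# Hypothetical values: alpha * delta is the scalar shared by all synapses,
# while the eligibility differs from synapse to synapse.
theta_new = update_weight(0.5, alpha=1e-5, delta=2.0, eligibility=250.0)
change = relative_change_percent(0.5, alpha=1e-5, delta=2.0, eligibility=250.0)
```

With these values the step is 0.005, i.e. a change of about 1 percent of the previous weight.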
[0066] In an operation 309 (END LEARNING PHASE?), it is determined whether a stop condition has been met in order to stop the learning phase. For example, the stop condition may be met after a certain number of iterations of the algorithm, or once the TD error δ.sub.t, for example after application of a low-pass filter, falls below a given threshold. If the stop condition is not met (branch N), a new iteration is started, involving an operation 310 (t=t+1) in which t is incremented, and thus the next timestep is considered. The method then returns to the operation 302, and the operations 302 to 309 are for example repeated. Once the stop condition of operation 309 is met (branch Y), the next operation 311 (FUNCTIONAL PHASE) for example involves switching from the learning phase to a functional phase in which the parameter matrix θ for example becomes fixed, and the eligibility matrix e is no longer used.
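One possible form of the stop condition of operation 309 is sketched below in Python, for illustration only. The class name StopCondition, the use of a first-order exponential moving average as the low-pass filter, and the filtering of the magnitude of the TD error rather than its signed value are all assumptions made for this sketch.

```python
from typing import Optional


class StopCondition:
    """Stop after max_iterations, or once the filtered |delta| is below threshold."""

    def __init__(self, threshold: float, max_iterations: int,
                 smoothing: float = 0.1) -> None:
        self.threshold = threshold
        self.max_iterations = max_iterations
        self.smoothing = smoothing          # filter coefficient, between 0 and 1
        self.filtered: Optional[float] = None
        self.iterations = 0

    def update(self, delta: float) -> bool:
        """Feed the TD error of one timestep; return True when learning should end."""
        self.iterations += 1
        magnitude = abs(delta)
        if self.filtered is None:
            self.filtered = magnitude       # initialise the filter state
        else:
            self.filtered += self.smoothing * (magnitude - self.filtered)
        return (self.iterations >= self.max_iterations
                or self.filtered < self.threshold)


stop = StopCondition(threshold=0.1, max_iterations=1000, smoothing=0.5)
decisions = [stop.update(d) for d in (1.0, 0.0, 0.0, 0.0, 0.0)]
```

The filter prevents a single small TD error from ending the learning phase prematurely: in the usage above the filtered value halves at each step and only the fifth sample brings it below the threshold.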
[0067] While
[0068] While in the example of
[0069] Furthermore, while in the example of
[0070]
[0071] The DNN architecture 200 according to the example of .sup.2*7. The DNN architecture 200 of
[0072] The policy V=Π.sub.θ(S) applied by the DNN architecture 200 is an aggregation of functions, comprising an associative function g.sub.n within each layer, these functions being connected in a chain to map V=Π.sub.θ(S)=g.sub.n( . . . (g.sub.2(g.sub.1(S)) . . . )). There are just two such functions in the simple example of
[0073] Each neuron of the hidden layer receives the signal from each input neuron, a corresponding synaptic weight θ.sub.j.sup.i being applied to each neuron j of the hidden layer from each input neuron i of the input layer.
[0074] Similarly, each neuron of the output layer receives the signal from each neuron of the hidden layer, a corresponding synaptic weight θ.sub.j.sup.k being applied to each neuron k of the output layer from each neuron j of the hidden layer.
[0075]
[0076] In the example of
[0077] Each of the synapse circuits 502 for example comprises a non-volatile memory device storing, in the form of a conductance, a synapse weight g.sub.θ associated with the synapse circuit. The memory device of each synapse circuit 502 is for example implemented by a PCM device, or other type of resistive random-access memory (ReRAM) device, such as an oxide RAM (OxRAM) device, which is based on so-called “filamentary switching”. The device for example has low or negligible drift of its programmed conductance level over time. In the case of a PCM device, the device is for example programmed with relatively high conductance/low resistance states, which are less affected by drift than the low conductance/high resistance states. The synapse circuits 502 are for example coupled at each intersection between a pre-synaptic neuron of the layer N and a post-synaptic neuron of the layer N+1 in a cross-bar fashion, as known by those skilled in the art. For example, a blow-up view in
[0078] During the forward propagation of the state S.sub.t through the DNN 200, each neuron n of the layer N+1 for example receives an activation vector equal to S.sub.in·W, where S.sub.in is the input vector from the previous layer, and W are the weights of the parameter matrix θ associated with the synapses leading to the neuron n. A voltage is for example applied to each of the lines 512, which is for example coupled to the top electrode of each resistive device 506 of a column and to the neuron n. The selection transistors 508 are then for example activated, such that a current will flow through each device 506 equal to V×g.sub.θ, where V is the top electrode voltage, and g.sub.θ is the conductance of the device 506. The current flowing through the line 512 will thus be the addition of the current flowing through each device 506 of the column, and the result is a weighted sum operation. A similar operation for example occurs at each neuron of each layer of the network, except in the input layer.
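The weighted sum described above can be sketched numerically in Python as follows, for illustration only. The sketch assumes an idealised array in which each device of a column sees its own input voltage and contributes a current V×g, with no line resistance or transistor drop. The function name column_currents, the row and column indexing, and the numerical values are hypothetical.

```python
from typing import List


def column_currents(voltages: List[float],
                    conductances: List[List[float]]) -> List[float]:
    """Sum the device currents V_i * g[i][n] along each column n.

    voltages[i] is the voltage (in volts) seen by the devices of row i, and
    conductances[i][n] is the conductance (in siemens) of the device coupling
    pre-synaptic neuron i to post-synaptic neuron n.
    """
    if len(voltages) != len(conductances):
        raise ValueError("one voltage is needed per row of devices")
    n_columns = len(conductances[0])
    return [sum(v * row[n] for v, row in zip(voltages, conductances))
            for n in range(n_columns)]


# Two pre-synaptic neurons driving two post-synaptic neurons.
currents = column_currents([0.1, 0.2],
                           [[1e-5, 2e-5],
                            [3e-5, 4e-5]])
```

Each entry of currents is the current, in amperes, collected on one column line, i.e. the dot product of the input vector with one column of weights.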
[0079] Each of the synapse circuits 504 for example comprises a volatile memory device storing, in the form of a conductance, a synapse eligibility value g.sub.e associated with the synapse circuit. The memory device of each synapse circuit 504 is for example implemented by a PCM device with pronounced drift behavior, or another type of resistive memory having a conductance decay over time, such as a silver-oxide based conductive bridge RAM element. In the case of a PCM device, the device is for example programmed with relatively low conductance/high resistance states, which have a more pronounced drift than the high conductance/low resistance states. The synapse circuits 504 are for example coupled at each intersection between a pre-synaptic neuron of the layer N and a post-synaptic neuron of the layer N+1 in a cross-bar fashion. For example, a blow-up view in
[0080] The conductances of the resistive memory elements of the pair of synapse circuits 502, 504 coupling a same pair of neurons are for example used in a complementary fashion during the updating of the synapse weight g.sub.θ, as represented by a dashed arrow 524 in
[0081] In some embodiments, the sub-arrays of synapse circuits 502, 504 are overlaid such that the corresponding synapse circuits 502, 504 are relatively close, permitting a local updating of synaptic weight g.sub.θ of the corresponding synapse circuits. For example, the sub-arrays are integrated in a same wafer or structure, as will be described in more detail below with reference to
[0082] The type of resistive memory used to implement the memory devices 506, 516 of the synapse circuits 502 and 504 is for example chosen such that while programmed conductance levels of the memory devices storing the conductances g.sub.θ decay relatively little over time, the conductance levels of the memory devices storing the conductances g.sub.e have a relatively high rate of decay. For example, the two memory devices 506, 516 of the synapse circuits 502, 504 are implemented by different technologies of resistive memory device, one providing non-volatile storage, and the other providing volatile storage with a relatively high decay rate. Alternatively, the two memory devices 506, 516 of the synapse circuits 502, 504 are implemented by the same technology of resistive memory device, such as PCM technology, and the decay rates are varied between the devices by other means, such as by using different conductance ranges.
[0083] The use of a relatively high conductance decay rate for the memory device 516 storing the conductance g.sub.e provides a simple and effective implementation of the decay rate λ, without the need of further circuitry such as timers, etc. Furthermore, it for example allows the multiplication of the eligibility value e with the learning rate α and the TD error δ.sub.t in an analog manner, leading to a simple and low-power solution.
[0084] While in
[0085] The drift of a PCM device will now be described in more detail with reference to
[0086]
[0087] The phase-change memory devices are for example chalcogenide-based devices, in which the resistive switching layer is formed of polycrystalline chalcogenide, placed in contact with a heater.
[0088] As known by those skilled in the art, a reset operation of a PCM device involves applying a relatively high current through the device for a relatively short duration. For example, the duration of the current pulse is less than 10 ns. This causes a melting of a region of a resistive switching layer of the device, which then changes from a crystalline phase to an amorphous phase, and then cools without recrystallizing. This amorphous phase has a relatively high electrical resistance. Furthermore, this resistance increases with time following the reset operation, corresponding to a decrease in the conductance of the device. Such a drift is for example particularly apparent when the device is reset using a relatively high current, leading to a relatively high initial resistance, and a higher subsequent drift. Those skilled in the art will understand how to measure the drift that occurs based on different reset states, i.e. different programming currents, and will then be capable of choosing a suitable programming current that results in an amount of drift that can be exploited as described herein.
[0089] As also known by those skilled in the art, a set operation of a PCM device involves applying a current that is lower than the current applied during the reset operation, for a longer duration. For example, the duration of the current pulse is more than 100 ns. This for example causes the amorphous region of the resistive switching layer of the device to change from the amorphous phase back to the crystalline phase as the current reduces. The resistance of the device is thus relatively low.
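The decrease in conductance following a reset operation can be sketched in Python as follows, for illustration only. The sketch assumes the empirical power-law model commonly used to describe PCM resistance drift, in which the conductance scales as (t/t0) raised to the power −ν. This model is not stated in the present disclosure, and the function name drifted_conductance, the reference time t0 and the drift exponent ν are hypothetical values.

```python
def drifted_conductance(g0: float, t: float, t0: float = 1.0,
                        nu: float = 0.1) -> float:
    """Conductance at time t for a device that measured g0 at time t0.

    Assumed power-law model: G(t) = g0 * (t / t0) ** (-nu), valid for t >= t0.
    """
    if t0 <= 0.0 or t < t0:
        raise ValueError("the model is only defined for t >= t0 > 0")
    return g0 * (t / t0) ** (-nu)


g0 = 1.0e-6  # hypothetical conductance, in siemens, measured at t0 = 1 s
samples = [drifted_conductance(g0, t) for t in (1.0, 10.0, 100.0, 1000.0)]
# Ratio between samples taken one decade of elapsed time apart.
ratios = [later / earlier for earlier, later in zip(samples, samples[1:])]
```

Under this assumed model each decade of elapsed time scales the conductance by the same factor, so a larger exponent ν corresponds to the more pronounced drift obtained with a higher reset current.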
[0090]
[0091]
[0092]
[0093] During the operations 305 and 306 of
[0094] During the operation 308 of
[0095]
[0096] In an operation 1002, the value of the derivative ∂V.sub.t/∂θ.sub.t is compared to a threshold Th. If the threshold is exceeded (branch Y), the conductance g.sub.e of the memory device is reset in an operation 1004 (RESET g.sub.e). Otherwise (branch N), the conductance of the memory device 516 is not modified, as shown by an operation 1006 (DO NOTHING).
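The threshold decision of operations 1002 to 1006 can be sketched in Python as follows, by way of example. The function name and the parameter g_reset, which stands for the conductance obtained immediately after the reset operation, are hypothetical. The decay of the conductance between timesteps is left to the device and is not modelled in this sketch.

```python
def update_eligibility_device(g_e: float, derivative: float,
                              threshold: float, g_reset: float) -> float:
    """Return the conductance g_e after comparing the derivative with Th."""
    if derivative > threshold:
        # Operation 1004: reprogram the device, restarting its decay.
        return g_reset
    # Operation 1006: leave the device untouched.
    return g_e


# Hypothetical values: the first derivative exceeds Th, the second does not.
g_after_large = update_eligibility_device(2.0e-7, derivative=0.8,
                                          threshold=0.5, g_reset=1.0e-6)
g_after_small = update_eligibility_device(2.0e-7, derivative=0.1,
                                          threshold=0.5, g_reset=1.0e-6)
```

The device is thus written only when the synapse has contributed significantly to the output value, which limits the number of programming operations.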
[0097]
[0098]
[0099] In an operation 1012, it is determined whether the output αδe.sub.t from the memory device 516 is positive or negative, indicating whether the synaptic weight θ should be increased or reduced. Indeed, in some embodiments, the parameters e.sub.t and/or δ may have positive or negative values. For example, this comparison is performed in an analog manner using a comparator. If the output αδe.sub.t is positive (branch Y), in an operation 1014 (NUMBER OF SET PULSES TO g.sub.θ+ PROPORTIONAL TO αδ.sub.te.sub.t), a number of SET pulses is applied to the memory device of conductance g.sub.θ+ in order to increase the conductance of this device. Alternatively, if the output αδe.sub.t is negative (branch N), in an operation 1016 (NUMBER OF SET PULSES TO g.sub.θ− PROPORTIONAL TO αδ.sub.te.sub.t), a number of SET pulses is applied to the memory device of conductance g.sub.θ− in order to increase the conductance of this device. The overall conductance g.sub.θ for example results from the combined conductances of the two memory devices, as will now be described with reference to
[0100]
[0101] Initially, it is assumed that both memory devices have a low conductance of g.sub.L, and that this corresponds to an intermediate value Vint of the synaptic weight θ.
[0102] At a timestep t1, it is for example found that the output value αδe.sub.t1 is positive, and thus the conductance g.sub.θ+ is increased by an amount Δg.sub.θ1, for example by applying three consecutive current or voltage pulses to the corresponding memory device based on the magnitude of αδe.sub.t1, and the synaptic weight thus increases by a corresponding amount Δθ1.
[0103] At a timestep t2, it is for example found that the output value αδe.sub.t2 is negative, and thus the conductance g.sub.θ− is increased by an amount Δg.sub.θ2, for example by applying two consecutive current or voltage pulses to the corresponding memory device based on the magnitude of αδe.sub.t2, and the synaptic weight thus decreases by a corresponding amount Δθ2.
[0104] At a timestep t3, it is for example found that the output value αδe.sub.t3 is positive, and thus the conductance g.sub.θ+ is increased by an amount Δg.sub.θ3, for example by applying a single current or voltage pulse to the corresponding memory device based on the magnitude of αδe.sub.t3, and the synaptic weight thus increases by a corresponding amount Δθ3.
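The sequence of the timesteps t1 to t3 can be reproduced with the following Python sketch, given for illustration only. The sketch assumes that the synaptic weight is proportional to the difference between the two conductances and that each SET pulse adds a fixed conductance increment; real devices are generally non-linear and saturate. The class name DifferentialSynapse, the parameter pulses_per_unit and all numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DifferentialSynapse:
    g_plus: float   # conductance of the device that raises the weight
    g_minus: float  # conductance of the device that lowers the weight
    g_step: float   # assumed conductance increment per SET pulse

    def apply_update(self, update: float, pulses_per_unit: float) -> int:
        """Apply a number of SET pulses proportional to |update|."""
        n_pulses = round(abs(update) * pulses_per_unit)
        if update > 0:
            self.g_plus += n_pulses * self.g_step    # operation 1014
        elif update < 0:
            self.g_minus += n_pulses * self.g_step   # operation 1016
        return n_pulses

    @property
    def weight(self) -> float:
        """Assumed combination of the pair: g_plus minus g_minus."""
        return self.g_plus - self.g_minus


# Both devices start at the same low conductance g_L (intermediate weight).
synapse = DifferentialSynapse(g_plus=1e-6, g_minus=1e-6, g_step=1e-6)
pulse_counts = [synapse.apply_update(u, pulses_per_unit=1.0)
                for u in (3.0, -2.0, 1.0)]  # timesteps t1, t2 and t3
```

Only SET pulses are used in this scheme, so both conductances increase monotonically while their difference moves in either direction.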
[0105]
[0106] The transistor layer 1101 is formed of a surface region 1103 of a silicon substrate in which transistor sources and drains S, D, are formed, and a transistor gate layer 1104 in which gate stacks 1106 of the transistors are formed. Two transistors 1108, 1110 are illustrated in the example of
[0107] The metal stack 1102 comprises four interconnection levels 1112, 1113, 1114 and 1115 in the example of
[0108] In the example of
[0109] An advantage of the embodiments described herein is that TD-lambda temporal difference learning using a neural network to approximate a value function can be implemented by a DNN with relatively low complexity, using relatively compact and low-cost circuitry. In particular, the values of the synaptic weights θ can be updated locally at the synapses based on the corresponding eligibility trace e, leading to gains in terms of complexity, surface area, cost, and also power consumption.
[0110] Various embodiments and variants have been described. Those skilled in the art will understand that certain features of these embodiments can be combined and other variants will readily occur to those skilled in the art. In particular, it will be apparent to those skilled in the art that, while certain examples of resistive memory types have been provided, other technologies could also be used to implement the memory devices of the DNN. Furthermore, while the example of a DNN has been described, the implementation of the agent is not limited to a DNN, and other types of neural networks could equally be used.
[0111] Finally, the practical implementation of the embodiments and variants described herein is within the capabilities of those skilled in the art based on the functional description provided hereinabove.