Machine Learning Computer

20220051095 · 2022-02-17

    Inventors

    Cpc classification

    International classification

    Abstract

    A computer comprising a plurality of processing units, each processing unit having an execution unit and access to computer memory which stores code executable by the execution unit and input values of an input vector to be processed by the code, the code, when executed, configured to access the computer memory to obtain multiple pairs of input values of the input vector, determine a maximum or corrected maximum input value of each pair as a maximum result element, determine and store in a computer memory a maximum or corrected maximum result of each pair of maximum result elements as an approximation to the natural log of the sum of the exponents of the input values and access the computer memory to obtain each input value and apply it to the maximum or corrected maximum result to generate each output value of a Softmax output vector.

    Claims

    1. A computer-implemented method of processing an input vector comprising a plurality of input values, the method comprising: accessing computer memory to obtain the input values from the input vector; determining exponentials for the input values; and computing, using a processor, an activation function of the input vector to generate an output vector, the output vector comprising a plurality of output values, wherein the computing an activation function of the input vector to generate an output vector comprises: determining a natural logarithm of a sum of the exponentials of the input values; and determining each output value for the output vector based on the natural logarithm of the sum of the exponentials subtracted from a respective input value.

    2. The computer-implemented method as claimed in claim 1, wherein determining each output value comprises: subtracting the natural logarithm of the sum of the exponentials of the input values from the respective input value to generate a value based on a natural logarithm of the activation function; and exponentiating the value based on the natural logarithm of the activation function.

    3. The computer-implemented method as claimed in claim 1, wherein determining each output value comprises: subtracting the natural logarithm of the sum of the exponentials of the input values from the respective input to generate a value based on a natural logarithm of the activation function.

    4. The computer-implemented method as claimed in claim 1, wherein determining the natural logarithm of the sum of the exponentials comprises: determining an approximation of the natural logarithm of the sum of the exponentials, including: iteratively, until there is only one input value remaining, pairing input values and then selecting from each pair of input values a largest value, wherein the remaining one input value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.

    5. The computer-implemented method as claimed in claim 1, wherein determining the natural logarithm of the sum of the exponentials comprises: determining an approximation of the natural logarithm of the sum of the exponentials, including: iteratively, selecting from each pair of input values a largest value, applying a correction factor to the largest value based on the pair of input values to generate a max* result value until there remains one max* result value which is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.

    6. The computer-implemented method of claim 5 comprising padding the values with a null value where there is an odd number of values to create an even number of input values to create pairs.

    7. The computer-implemented method as claimed in claim 1, wherein determining the natural logarithm of the sum of the exponentials comprises determining an approximation of the natural logarithm of the sum of the exponential, including: iteratively until there is only one input value remaining, pairing input values and padding the values with a null value where there is an odd number of values then selecting from each pair of input values a largest value, wherein the remaining one input value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.

    8. The computer-implemented method as claimed in claim 1, wherein determining the natural logarithm of the sum of the exponentials comprises: determining an approximation of the natural logarithm of the sum of the exponentials, including: selecting from the input values a largest value, wherein the largest value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.

    9. The computer-implemented method as claimed in claim 1, wherein obtaining the input values comprises: receiving a first input array comprising a plurality of first input values; and applying adjustable weightings to respective ones of the first input values to generate the input vector.

    10. A computer program embodied on computer-readable storage, the program comprising code configured so as when run on one or more processors to perform a method of processing an input vector comprising a plurality of input values by performing operations including: accessing computer memory to obtain the input values from the input vector; determining exponentials for the input values; and computing an activation function of the input vector to generate an output vector, the output vector comprising one or more output values, wherein the computing an activation function of the input vector to generate an output vector comprises: determining a natural logarithm of a sum of the exponentials of the input values; and determining each output value for the output vector based on the natural logarithm of the sum of the exponentials subtracted from a respective input value.

    11. A computer system comprising: storage comprising one or more memory units, and processing apparatus comprising one or more processing units; wherein the storage stores code arranged to run on the processing apparatus, the code being configured so as when thus run to perform a method of processing an input vector comprising a plurality of input values by performing operations including: accessing computer memory to obtain the input values from the input vector; determining exponentials for the input values; and computing, using the processing apparatus, an activation function of the input vector to generate an output vector, the output vector comprising one or more output values, wherein the computing an activation function of the input vector to generate an output vector comprises: determining a natural logarithm of a sum of the exponentials of the input values; and determining each output value for the output vector based on the natural logarithm of the sum of the exponentials subtracted from a respective input value.

    12. A computer comprising: a plurality of processing units, each processing unit having an execution unit and access to computer memory which stores code executable by the execution unit and input values of an input vector to be processed by the code, and the code, when executed, is configured to cause each processing unit to: access the computer memory to obtain multiple pairs of the input values of the input vector; determine a maximum or a corrected maximum input value of each pair as a maximum result element; store in the computer memory a maximum or a corrected maximum result of each pair of maximum result elements as an approximation to a natural log of a sum of exponents of the input values; and access the computer memory to obtain each input value and apply each input value to the maximum or the corrected maximum result to generate each output value of a Softmax output vector.

    13. The computer as claimed in claim 12 wherein the corrected maximum value is determined as
    max(x,y)+log(1+e.sup.−|x-y|) where x, y are the input values.

    14. The computer as claimed in claim 12 wherein each processing unit is associated with and has access to its own processor memory which is not shared by others of the processing units.

    15. The computer as claimed in claim 14 wherein each processor memory stores a set of input values of the input vector constituting a fragment of that input vector.

    16. The computer as claimed in claim 12 wherein each processing unit has its own memory that stores a set of input values of the input vector constituting a fragment of that input vector, wherein each processing unit is configured to process the fragment of that input vector in its associated memory and to generate a maximum result for that fragment, each processing unit configured to share its maximum result with other processing units which are configured to process other fragments of that input vector.

    17. The computer as claimed in claim 12, wherein each processing unit has its own memory that stores a set of input values of the input vector constituting a fragment of that input vector, wherein each processing unit is configured to process the fragment of that input vector in its associated memory and to generate a corrected maximum result for that fragment, each processing unit configured to share its corrected maximum result with other processing units which are configured to process other fragments of the input vector.

    18. An apparatus comprising a processor unit and a memory, the memory including computer program code, wherein processor unit is configured to execute the computer program code, causing the processor unit to: obtain an input vector, the input vector comprising a plurality of input values; determining exponentials for the input values; and apply an activation function to the input vector to generate an output vector, the output vector comprising a plurality of output values, wherein application of the activation function to the input vector to generate an output vector causes the apparatus to: determine a natural logarithm of a sum of the exponentials of the input values; and determine the output values for the output vector based on the natural logarithm of the sum of the exponentials subtracted from a respective input value.

    19. The apparatus of claim 18, comprising a plurality of additional processor units, each with an associated memory.

    20. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the output values: generate a value based on a natural logarithm of the activation function; and exponentiate the value based on the natural logarithm of the activation function.

    21. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the output values: generate a value based on the natural logarithm of the activation function.

    22. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the output values: determine an approximation of the natural logarithm of the sum of the exponentials, including: iteratively, until there is only one input value remaining, pairing input values and then selecting from each pair of input values a largest value, wherein the remaining one input value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.

    23. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the natural logarithm of the sum of the exponentials: determine an approximation of the natural logarithm of the sum of the exponentials, including: iteratively, selecting from each pair of input values a largest value, applying a correction factor to the largest value based on the pair of input values to generate a max* result value until there remains one max* result value which is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.

    24. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the natural logarithm of the sum of the exponentials: determine an approximation of the natural logarithm of the sum of the exponential, including: iteratively until there is only one input value remaining, pairing input values and padding the values with a null value where there is an odd number of values then selecting from each pair of input values a largest value, wherein the remaining one input value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.

    25. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the natural logarithm of the sum of the exponentials: determine an approximation of the natural logarithm of the sum of the exponentials, including: selecting from the input values a largest value, wherein the largest value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.

    26. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when obtaining the input values: receive a first input array comprising a plurality of first input values; and apply adjustable weightings to respective ones of the first input values to generate the input vector.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0082] To assist understanding of embodiments of the present disclosure and to illustrate how such embodiments may be put into effect, reference is made, my way of example only, to the accompanying drawings in which:

    [0083] FIG. 1(a) is a schematic illustration of a neural network;

    [0084] FIG. 1(b) is a schematic illustration of a node within the neural network;

    [0085] FIG. 2 is a schematic illustration of a computing apparatus for implementing a neural network;

    [0086] FIG. 3 is a schematic illustration of an example of multi-tile processor suitable for implementing a neural network and for generating an activation function according to some embodiments;

    [0087] FIG. 4 is a schematic illustration of a single tile from the multi-tile processor shown in FIG. 3 illustrating the input and output signals;

    [0088] FIG. 5 schematically illustrates a flow diagram of the operations for generating an activation function according to some embodiments;

    [0089] FIG. 6 schematically illustrates the decomposition of the indices for the determination of the activation function according to some embodiments; and

    [0090] FIG. 7 schematically illustrates a flow diagram of the operations for schematically illustrates a flow diagram of the operations for generating an activation function according to some embodiments.

    DETAILED DESCRIPTION OF EMBODIMENTS

    [0091] The following will present an example activation function implementation suitable for neural networks.

    [0092] First however there are described example systems in which the presently disclosed techniques may be implemented. There is also provided an overview of the principles behind neural networks, based upon which embodiments may be built or expanded.

    [0093] FIG. 2 illustrates an example computing apparatus 200 for implementing an artificial intelligence (AI) algorithm including a machine-learning model in accordance with embodiments described herein. The computing apparatus 200 may take the form of a user terminal such as a desktop computer, laptop computer, tablet, smartphone, wearable smart device such as a smart watch, or an on-board computer of a vehicle such as car, etc. Additionally or alternatively, the computing apparatus 200 may comprise a server. A server herein refers to a logical entity which may comprises one or more physical server units located at one or more geographic sites. Where required, distributed or “cloud” computing techniques are in themselves known in the art. The one or more user terminals and/or the one or more server units of the server may be connected to one another via a packet-switched network, which may comprise for example a wide-area internetwork such as the Internet, a mobile cellular network such as a 3GPP network, a wired local area network (LAN) such as an Ethernet network, or a wireless LAN such as a Wi-Fi, Thread or 6LoWPAN network.

    [0094] The computing apparatus 200 comprises at least a controller 202, an interface (e.g. a user interface) 204, and an artificial intelligence (AI) algorithm 206. The controller 202 is operatively coupled to each of the interface 204 and the AI algorithm 206.

    [0095] Each of the controller 202, interface 204 and AI algorithm 206 may be implemented in the form of software code embodied on computer readable storage and run on processing apparatus comprising one or more processors such as CPUs, work accelerator co-processors such as GPUs or IPUs, and/or other application specific processors, implemented on one or more computer terminals or units at one or more geographic sites. The storage on which the code is stored may comprise one or more memory devices employing one or more memory media (e.g. electronic or magnetic media), again implemented on one or more computer terminals or units at one or more geographic sites. In embodiments, one, some or all the controller 202, interface 204 and AI algorithm 206 may be implemented on the server. Alternatively, a respective instance of one, some or all of these components may be implemented in part or even wholly on each of one, some or all of the one or more user terminals. In further examples, the functionality of the above-mentioned components may be split between any combination of the user terminals and the server. Again it is noted that, where required, distributed computing techniques are in themselves known in the art. It is also not excluded that one or more of these components may be implemented in dedicated hardware.

    [0096] The controller 202 comprises a control function for coordinating the functionality of the interface 204 and the AI algorithm 206. The interface refers to the functionality for receiving and/or outputting data. It may comprise a user interface (UI) for receiving and/or outputting data to and/or from one or more users, respectively. Alternatively the interface may be arranged to collect and/or output data to and/or from an automated function. The interface 204 may additionally or alternatively receive and output data to a different component of the computing apparatus and/or to a different device. The interface 204 may comprise a wired or wireless interface for communicating, via a wired or wireless connection respectively, with an external device. The interface 204 may comprise one or more constituent types of interface, such as voice interface, and/or a graphical user interface. The interface 204 may present a UI front end to the user(s) through one or more I/O modules on their respective user device(s), e.g. speaker and microphone, touch screen, etc., depending on the type of user interface. The logic of the interface may be implemented on a server and output to the user through the I/O module(s) on his/her user device(s). Alternatively some or all of the logic of the interface 204 may also be implemented on the user device(s) 102 its/themselves.

    [0097] The controller 202 is configured to control the AI algorithm 206 to perform operations in accordance with the embodiments described herein. It will be understood that any of the operations disclosed herein may be performed by the AI algorithm 206, under control of the controller 202 to collect experience data from the user and/or an automated process via the interface 204, pass it to the AI algorithm 206, receive predictions back from the AI algorithm and output the predictions to the user and/or automated process through the interface 204.

    [0098] The AI algorithm 206 comprises a machine-learning model 208, comprising one or more constituent statistical models such as one or more neural networks.

    [0099] As discussed above each of the controller 202, interface 204 and AI algorithm 206 may be implemented in the form of software code embodied on computer readable storage and run on a data processing system comprising one or more processors. In some embodiments the processor comprises a plurality of processor tiles. The data processing system may be a so called intelligence processing unit (IPU) or any class of accelerator (XPU). The techniques described herein can be used with the IPUs described in our earlier U.S. application Ser. No. 15/885,925, the contents of which are herein incorporated by reference.

    [0100] FIG. 3 illustrates schematically an example architecture of a single chip processor 302. The processor is referred to herein as an IPU (Intelligence Processing Unit) to denote its adaptivity to machine intelligence applications. In a computer, the single chip processors can be connected together as discussed later, using links on the chip, to form a computer. The processor 302 comprises multiple processing units referred to as tiles arranged on a single chip. In one embodiment, there are 1216 tiles organised in arrays 306a, 306b, 306c, and 306d. The processor 302 has two chip-to-host links 308a, 308b and 4 chip-to-chip links 330a, 330b arranged on an edge of the chip. The processor 302 receives work from a host (not shown) which is connected to the chip via one of the card-to-host links in the form of input data to be processed by the chip 302. The chips can be connected together into cards by a further chip-to-chip links 330a, 330b. The host may access a computer which is architected as a single chip processor 302 as described herein or a group of multiple interconnected single chip processors 302 depending on the workload from the host application.

    [0101] When the processor is executing a machine learning or other complex or graph based application, vectors to be processed are provided to the processor from the host as workloads to be processed. Where vectors are small enough, a single vector may be processed by a single respective tile. It is more common for vectors in ML applications to be extremely large. In that case they may be broken up into fragments, and each fragment processed by an individual tile. Results of the processing are provided by each tile and may be combined at the same tile or other tiles, or supplied to the host for combining.

    [0102] The chip 302 has a clock generator 303 which generates a clock signal from an on or off chip clock to control the timing of chip activity. The clock generator is connected to all of the chip's circuits and components. The chip 302 comprises a switching fabric 334 to which all tiles and links are connected by sets of connection wires to enable communication between tiles on the processor. Each tile has its own local memory (described later). The tiles do not share memory.

    [0103] FIG. 4 illustrates an example tile 404 in accordance with embodiments of the present disclosure. In the tile, multiple threads of execution are interleaved through a single execution pipeline. In some embodiments, each thread may process a vector or vector fragment in accordance with precompiled instructions stored on the tile in an instruction memory 412. The tile 404 comprises: a plurality of contexts 426 each arranged to represent the state of a different respective one of a plurality of threads; the shared instruction memory 412, which is common to the plurality of threads on that tile, but not shared by other tiles; a shared data memory 422 that is also common to the plurality of threads; a shared execution pipeline 414, 416, 418 that is again common to the plurality of threads; and a thread scheduler 424 for scheduling the plurality of threads for execution through the shared pipeline in an interleaved manner.

    [0104] The data memory holds data supplied to the tile for processing, for example vector values of a whole vector or fragment of a vector, and results of that processing.

    [0105] The thread scheduler 424 is schematically represented in the diagram by a sequence of time slots S.sub.0 . . . S.sub.5, but in practice is a hardware mechanism managing program counters of the threads in relation to their time slots. The execution pipeline comprises a fetch stage 414, a decode stage 416, and an execution stage 418 comprising an execution unit (EXU) and a load/store unit (LSU). Each of the contexts 426 comprises a respective set of registers R.sub.0, R.sub.1 . . . for representing the program state of the respective thread.

    [0106] The fetch stage 414 is connected to fetch instructions to be executed from the instruction memory 412, under control of the thread scheduler 424. The thread scheduler 424 is configured to control the fetch stage 414 to fetch instructions from the local program for execution in each time slot.

    [0107] Note that in normal operation the program loaded into each tile is determined by a processor or compiler to allocate work based on the graph of the machine intelligence model being supported. This graph defines what code (executable instructions) is stored and executed on each tile. Data (inputs and outputs) may be exchanged between tiles and or the host.

    [0108] Each thread may in some embodiments be a codelet intended to represent a vertex in the graph and to execute atomically. That is all the data it consumes is available at launch and all the data it produces is not visible to other threads until it exits. It runs to completion (excepting error conditions).

    [0109] As briefly mentioned above, data is exchanged between tiles in the chip. In general, there may exist dependencies between the portions of a program running on different tiles. A technique is therefore required to prevent a piece of code on one tile running ahead of data upon which it is dependent being made available by another piece of code on another tile. There are a number of possible schemes for achieving this. One scheme is known as “bulk synchronous parallel” (BSP). According to BSP, each tile performs a compute phase and an exchange phase in an alternating cycle. During the compute phase each tile performs one or more computation tasks locally on tile, but does not communicate any results of its computations with any others of the tiles. In the exchange phase, each tile is allowed to exchange one or more results of the computations from the preceding compute phase to and/or from one or more others of the tiles in the group, but does not yet proceed to the next compute phase. Further, according to the BSP principle, a barrier synchronization is placed at the juncture transitioning from the compute phase into the exchange phase, or transitioning from the exchange phase into the compute phase, or both. That is to say, either. (a) all tiles are required to complete their respective compute phases before any in the group is allowed to proceed to the next exchange phase, or (b) all tiles in the group are required to complete their respective exchange phases before any tile in the group is allowed to proceed to the next compute phase, or (c) both. In some scenarios a tile performing computation may be allowed to communicate with other system resources such as a network card or storage disk, as long as no communication with other tiles in the group is involved. As described further herein, results which are exchanged may be from respective fragments of a vector.

    [0110] As discussed earlier FIG. 1(a) illustrates the principle behind a neural network. A neural network 60 comprises a graph of interconnected nodes 102 and edges connecting between nodes, all implemented in software. Each node 102 has one or more input edges and one or more output edges, with at least some of the nodes 102 having multiple input edges per node, and at least some of the nodes 102 having multiple output edges per node. The input edges of one or more of the nodes 102 form the input to the graph (typically an input vector, i.e. there are multiple input edges). The output edges of one or more of the nodes 102 form the overall output of the graph (which may be an output vector in the case where there are multiple output edges). Further, the output edges of at least some of the nodes 102 form the input edges of at least some others of the nodes 102.

    [0111] Each node 102 represents a function of the input value(s) received on its input edges(s), the outputs of the function being output on the output edge(s) of the respective node 102, such that the value(s) output on the output edge(s) of the node 102 depend on the respective input value(s) according to the respective function. The function of each node 102 is also parametrized by one or more respective parameters w, sometimes also referred to as weights (not necessarily weights in the sense of multiplicative weights, though that is certainly one possibility). Thus the relation between the values of the input(s) and output(s) of each node 102 depends on the respective function of the node and its respective weight(s).

    [0112] As shown in the example of FIG. 1(a), the nodes 102 of the graph 60 may be arranged into a plurality of layers, each layer comprising one or more nodes 102.

    [0113] A simple example may be a machine-learning model which comprises a single graph, arranged to take a feature vector X as its input and to output a classification Y as its output. The input feature vector X comprises a plurality of elements x.sub.i, each representing a different feature i=0, 1, 2, . . . etc. E.g. in the example of image recognition, each element of the feature vector X may represent a respective pixel value. For instance one element represents the value of the red channel for pixel (0,0); another element represents the value of the green channel for pixel (0,0); another element represents the blue channel of pixel (0,0); another element represents the red channel of pixel (0,1); and so forth. As another example, where the neural network is used to make a medical diagnosis, each of the elements of the feature vector may represent a value of a different symptom or physical feature of the subject, e.g. body temperature, blood pressure, etc. Other example implementations include natural language processing. It would be understood that the concept as discussed herein is not limited to these applications but could be any suitable application.

    [0114] As discussed above an example activation function which can be used in the network is the Softmax function. In order to prevent memory overflow when implementing the Softmax function it is known to first determine the maximum value of the input vector before calculating the Softmax value and to subtract the maximum from each value before taking the exponent, such that the maximum value becomes 0 and the exponent of the maximum value is 1 and there is no risk of overflow.

    [0115] Furthermore as also discussed this implementation requires the computer to make a series of memory accesses to retrieve the values of the input vector and a series of comparisons between the input vector values to determine the maximum value of the vector. Having determined the maximum value, one needs to compute the denominator. Finally, the Softmax values are then calculated. This calculation requires for each value an access to get the exponent of the numerator, and a second access to get the previously computed denominator.

    [0116] The present systems and methods, as discussed in more detail in the embodiments hereafter, are such that the computation of the Softmax function is implemented in such a manner that there is reduction in memory access. Additionally when implementing the embodiments as described herein there may generally be a reduction of dependencies within a massive parallel compute.

    [0117] With respect to FIG. 5 is shown an example method for determining or computing the Softmax value which eliminates the risk of overflow, while reducing memory accesses.

    [0118] The activation function implementation may first be configured to receive or obtain at step 501 input vectors as a workload from a host and the input vectors stored into tile memory. The input vectors may comprise the input values (or the weighted input values) as shown in FIG. 5.

    [0119] Note that the technique described herein may be carried out on complete vectors, or on fragments of vectors. The word ‘vector’ in the following encompasses matrices which represent whole vectors or tensors as well as matrices which represent vector or tensor fragments resulting from ‘chopping up’ a whole vector or tensor. The word ‘vector’ may be considered to denote a set of input values.

    [0120] In the following examples the input values are shown as unweighted input values x.sub.i (but it would be understood that weighted input values can be used in some examples and embodiments).

    [0121] Having obtained the input values x.sub.i for one vector the activation function implementation is configured to determine or compute the natural log of the Softmax value for that vector.

    [0122] The operation of determining the natural log of the Softmax value is shown in FIG. 5 by step 503.

    [0123] The calculation of the natural log of the Softmax value eliminates the risk of overflow.

    [00003] log ( p i ) = log ( e x i .Math. k = 1 N e x k ) = log ( e x i ) - log ( .Math. k = 1 N e x k ) = x i - log ( .Math. k = 1 N e x k ) .

    [0124] As can be seen in the expression above the calculation of the log of the Softmax value can itself be implemented as two separate terms, a first term x.sub.i and a second term log(Σ.sub.k=1.sup.Ne.sup.x.sup.k).

    [0125] The second term of the above expression can in some embodiments be computed using a method called max* reduction.

    [0126] The log of a sum of two exponentials can be rewritten as follows:


    log(e.sup.x+e.sup.y)=max(x,y)+log(1+e.sup.−|x-y|)

    [0127] This can be approximated to max(x, y) with the addition of a correction term and will be referred to herein as max*(x, y). The correction term is


    log(1+e.sup.−|x-y|)

    [0128] The max* value is referred to herein as the corrected maximum value. In some cases, the max value may be used without the addition of the correction term. In other words, the reference to max* encompasses it approximation, which is the maximum value of each pair of values, referred to as max.

    [0129] In some embodiments the expression of max* can be applied recursively to neighbouring pairs of values in a sum in order to calculate the log of a larger sum of exponentials. For example, if there are four terms in the sum:

    [00004] log ( e w + e x + e y + e z ) = log ( e log ( e w + e x ) + e log ( e y + e z ) ) max * ( max * ( w , x ) , max * ( y , z ) ) .

    [0130] Thus, it is possible to calculate the log of a sum of exponentials while accessing the input vector in memory only once. In the case that at a given level, the number of values is not divisible by two, any single values are carried to the next level of recursion.

    [0131] In some embodiments, access can be performed to read successive pairs of values into registers from the memory to be processed by the execution pipeline. Each register could be configured to hold a data chunk representing one or more pairs of values. A chunk is accessed from memory. The execution pipeline performs a max* operation and stores the max* result back to memory, or to a register for a next computation. The next chunk may then be accessed from memory to be processed by the execution pipeline, and the next max* value stored. Where the vector is divided into fragments for processing by respective processing units, such as tiles, each tile may compute the max* value of its fragment. The max* values of the fragments may then be exchanged with other processing units and combined. For example, the max* values of respective fragments may be computed in respective processing units or respective processing threads within a processing unit, in a compute phase of a BSP computing protocol. In an exchange phase of the BSP computing protocol, the max * values may be transmitted to one of the processing units/threads or to an additional processing unit/thread. In a next compute phase, the max* values of the respective fragments may be processed to generate a final max* value for that vector.

    [0132] FIG. 6 for example shows how the process of calculating the expression max* can be applied recursively to a set of 6 input values, including an odd number of values at a given level.

    [0133] For example a first level 601 is the level containing the input values x.sub.i for the function.

    [0134] A second level 603 is a level wherein the input values x.sub.i have been paired and the max* value of each is generated. The second level is shown with an odd number of elements.

    [0135] A third level 605 is a level where the elements of the second level 603 are paired and the max* value of these pair combinations are calculated. It can be seen that as the second level has an odd number of elements one of these is passed to the next level.

    [0136] A fourth level 607 is a level where the elements of the third level 605 are paired and the max* value of these pair combination are calculated. In this example this provides a final value which can be used in the Softmax function implementation.

    [0137] This pairing iterative method is shown in an example with respect to FIG. 7.

    [0138] A first operation is receiving/obtaining the input values x.sub.i as shown in FIG. 7 by step 701.

    [0139] The next operation is pairing the values as shown in FIG. 7 by step 703.

    [0140] The following operation is from each pair to select or pass the max* value of each pair or its approximation given by the max value of each pair. Where the number of elements was odd, then the operation further passes or selects the unpaired value. The selection or passing operation is shown in FIG. 7 by step 705.

    [0141] A further check operation may then determine whether the number of selected elements is one or more as shown in FIG. 7 by step 707.

    [0142] Where there is more than one element remaining then the operation can loop back or iterate to pairing the remaining values as shown by the arrow back to step 703.

    [0143] Where there is only one element or value remaining then the operation is configured to use the value as an approximation of the log sum of exponential values of the input terms as shown in FIG. 7 by step 709.

    [0144] To then determine or calculate the Softmax function, the calculated log sum should be subtracted from the given value x.sub.i of the input vector x.

    [0145] Having determined the natural log of the Softmax value this may be exponentiated to evaluate the Softmax function value. In this case, there is no risk of overflow as the final result is bound to be between 0 and 1 by the definition of the Softmax function. Additionally the combination or sum of the exponential terms is furthermore avoided and any issue of swamping of an input value reduced. The determination of the exponentiate of the natural log of the Softmax value is shown in FIG. 5 by step 505.

    [0146] There are certain applications in which the Softmax function itself is not needed but the natural logarithm of Softmax is. In such applications the step of exponentiating the result is not carried out, but the log sum is calculated and subtracted from each value in the input vector.

    [0147] The result of using this method in favour of calculating the max of the input vector and using it to scale the Softmax calculation is that the Softmax of the input may be calculated while making fewer memory accesses to the input vector, which reduces the computational resources required, particularly in the case of very large input vectors, or many Softmax calculations. In addition, the number of operations required to compute the Softmax function is reduced. Each value of the input vector is accessed when calculating the log of the sum of exponentials by recursively applying the max* reduction, and once more when the calculated log sum is subtracted from the vector and the result is exponentiated.

    [0148] This, for example, may be more efficient than implementing the following five steps: pre-calculating the max of the vector, subtracting the max from each value of the vector, taking the exponential of each value of the vector, summing the exponentials and dividing each exponentiated value of the vector by the sum.

    [0149] In some embodiments the max* reduction as shown in the examples of FIGS. 6 and 7 could be implemented in hardware, by defining a ‘max*’ instruction of the processor's instruction set, or it may be alternatively implemented in software. In both cases, the operation of max* requires two inputs a and b and produces an output which approximates to log(e.sup.a+e.sup.b), which may then be passed as input to another max* operation or, as the next step of the Softmax calculation, subtracted from an input vector and the resulting vector exponentiated.

    [0150] There may be some additional advantages that when processing is carried out over multiple threads, fewer steps may further improve performance. Thus in a simple example, the max needs to be calculated first, and all threads would be synchronised at that point to then subtract the max from the vector values in parallel, and synchronised again after calculating the sum of exponentials in order for each thread to divide by the sum.

    [0151] In the embodiments described above, synchronisation can occur when calculating max* for the vector so that each thread could subtract this value from the vector elements and then exponentiate the result. The subtraction and exponentiation can be implemented in parallel for different vector elements once the max* result is known.

    [0152] This may be particularly advantageous in implementation architectures such as discussed with respect to FIGS. 3 and 4 as once the max* result is determined the tiles within the IPU can operate independently and this results in a significant reduction in data dependency.

    [0153] It will be appreciated that the above embodiments have been described by way of example only.

    [0154] For example although the algorithm has been discussed with a massively parallel compute wherein each thread may implement a vertex it would be appreciated that embodiments may be applicable to a pool of threads/processes whether they are implemented in software on a processing unit comprising one or more processing node or a IPU such as described above.

    [0155] Other variants and applications of the disclosed techniques may become apparent to a skilled person once given the disclosure herein. The scope of the present disclosure is not limited by the described embodiments but only by the accompanying claims.