Machine Learning Computer
20220051095 · 2022-02-17
Inventors
Cpc classification
G06F7/22
PHYSICS
International classification
G06F7/22
PHYSICS
Abstract
A computer comprising a plurality of processing units, each processing unit having an execution unit and access to computer memory which stores code executable by the execution unit and input values of an input vector to be processed by the code, the code, when executed, configured to access the computer memory to obtain multiple pairs of input values of the input vector, determine a maximum or corrected maximum input value of each pair as a maximum result element, determine and store in a computer memory a maximum or corrected maximum result of each pair of maximum result elements as an approximation to the natural log of the sum of the exponents of the input values and access the computer memory to obtain each input value and apply it to the maximum or corrected maximum result to generate each output value of a Softmax output vector.
Claims
1. A computer-implemented method of processing an input vector comprising a plurality of input values, the method comprising: accessing computer memory to obtain the input values from the input vector; determining exponentials for the input values; and computing, using a processor, an activation function of the input vector to generate an output vector, the output vector comprising a plurality of output values, wherein the computing an activation function of the input vector to generate an output vector comprises: determining a natural logarithm of a sum of the exponentials of the input values; and determining each output value for the output vector based on the natural logarithm of the sum of the exponentials subtracted from a respective input value.
2. The computer-implemented method as claimed in claim 1, wherein determining each output value comprises: subtracting the natural logarithm of the sum of the exponentials of the input values from the respective input value to generate a value based on a natural logarithm of the activation function; and exponentiating the value based on the natural logarithm of the activation function.
3. The computer-implemented method as claimed in claim 1, wherein determining each output value comprises: subtracting the natural logarithm of the sum of the exponentials of the input values from the respective input to generate a value based on a natural logarithm of the activation function.
4. The computer-implemented method as claimed in claim 1, wherein determining the natural logarithm of the sum of the exponentials comprises: determining an approximation of the natural logarithm of the sum of the exponentials, including: iteratively, until there is only one input value remaining, pairing input values and then selecting from each pair of input values a largest value, wherein the remaining one input value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.
5. The computer-implemented method as claimed in claim 1, wherein determining the natural logarithm of the sum of the exponentials comprises: determining an approximation of the natural logarithm of the sum of the exponentials, including: iteratively, selecting from each pair of input values a largest value, applying a correction factor to the largest value based on the pair of input values to generate a max* result value until there remains one max* result value which is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.
6. The computer-implemented method of claim 5 comprising padding the values with a null value where there is an odd number of values to create an even number of input values to create pairs.
7. The computer-implemented method as claimed in claim 1, wherein determining the natural logarithm of the sum of the exponentials comprises determining an approximation of the natural logarithm of the sum of the exponential, including: iteratively until there is only one input value remaining, pairing input values and padding the values with a null value where there is an odd number of values then selecting from each pair of input values a largest value, wherein the remaining one input value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.
8. The computer-implemented method as claimed in claim 1, wherein determining the natural logarithm of the sum of the exponentials comprises: determining an approximation of the natural logarithm of the sum of the exponentials, including: selecting from the input values a largest value, wherein the largest value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.
9. The computer-implemented method as claimed in claim 1, wherein obtaining the input values comprises: receiving a first input array comprising a plurality of first input values; and applying adjustable weightings to respective ones of the first input values to generate the input vector.
10. A computer program embodied on computer-readable storage, the program comprising code configured so as when run on one or more processors to perform a method of processing an input vector comprising a plurality of input values by performing operations including: accessing computer memory to obtain the input values from the input vector; determining exponentials for the input values; and computing an activation function of the input vector to generate an output vector, the output vector comprising one or more output values, wherein the computing an activation function of the input vector to generate an output vector comprises: determining a natural logarithm of a sum of the exponentials of the input values; and determining each output value for the output vector based on the natural logarithm of the sum of the exponentials subtracted from a respective input value.
11. A computer system comprising: storage comprising one or more memory units, and processing apparatus comprising one or more processing units; wherein the storage stores code arranged to run on the processing apparatus, the code being configured so as when thus run to perform a method of processing an input vector comprising a plurality of input values by performing operations including: accessing computer memory to obtain the input values from the input vector; determining exponentials for the input values; and computing, using the processing apparatus, an activation function of the input vector to generate an output vector, the output vector comprising one or more output values, wherein the computing an activation function of the input vector to generate an output vector comprises: determining a natural logarithm of a sum of the exponentials of the input values; and determining each output value for the output vector based on the natural logarithm of the sum of the exponentials subtracted from a respective input value.
12. A computer comprising: a plurality of processing units, each processing unit having an execution unit and access to computer memory which stores code executable by the execution unit and input values of an input vector to be processed by the code, and the code, when executed, is configured to cause each processing unit to: access the computer memory to obtain multiple pairs of the input values of the input vector; determine a maximum or a corrected maximum input value of each pair as a maximum result element; store in the computer memory a maximum or a corrected maximum result of each pair of maximum result elements as an approximation to a natural log of a sum of exponents of the input values; and access the computer memory to obtain each input value and apply each input value to the maximum or the corrected maximum result to generate each output value of a Softmax output vector.
13. The computer as claimed in claim 12 wherein the corrected maximum value is determined as
max(x,y)+log(1+e.sup.−|x-y|) where x, y are the input values.
14. The computer as claimed in claim 12 wherein each processing unit is associated with and has access to its own processor memory which is not shared by others of the processing units.
15. The computer as claimed in claim 14 wherein each processor memory stores a set of input values of the input vector constituting a fragment of that input vector.
16. The computer as claimed in claim 12 wherein each processing unit has its own memory that stores a set of input values of the input vector constituting a fragment of that input vector, wherein each processing unit is configured to process the fragment of that input vector in its associated memory and to generate a maximum result for that fragment, each processing unit configured to share its maximum result with other processing units which are configured to process other fragments of that input vector.
17. The computer as claimed in claim 12, wherein each processing unit has its own memory that stores a set of input values of the input vector constituting a fragment of that input vector, wherein each processing unit is configured to process the fragment of that input vector in its associated memory and to generate a corrected maximum result for that fragment, each processing unit configured to share its corrected maximum result with other processing units which are configured to process other fragments of the input vector.
18. An apparatus comprising a processor unit and a memory, the memory including computer program code, wherein processor unit is configured to execute the computer program code, causing the processor unit to: obtain an input vector, the input vector comprising a plurality of input values; determining exponentials for the input values; and apply an activation function to the input vector to generate an output vector, the output vector comprising a plurality of output values, wherein application of the activation function to the input vector to generate an output vector causes the apparatus to: determine a natural logarithm of a sum of the exponentials of the input values; and determine the output values for the output vector based on the natural logarithm of the sum of the exponentials subtracted from a respective input value.
19. The apparatus of claim 18, comprising a plurality of additional processor units, each with an associated memory.
20. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the output values: generate a value based on a natural logarithm of the activation function; and exponentiate the value based on the natural logarithm of the activation function.
21. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the output values: generate a value based on the natural logarithm of the activation function.
22. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the output values: determine an approximation of the natural logarithm of the sum of the exponentials, including: iteratively, until there is only one input value remaining, pairing input values and then selecting from each pair of input values a largest value, wherein the remaining one input value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.
23. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the natural logarithm of the sum of the exponentials: determine an approximation of the natural logarithm of the sum of the exponentials, including: iteratively, selecting from each pair of input values a largest value, applying a correction factor to the largest value based on the pair of input values to generate a max* result value until there remains one max* result value which is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.
24. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the natural logarithm of the sum of the exponentials: determine an approximation of the natural logarithm of the sum of the exponential, including: iteratively until there is only one input value remaining, pairing input values and padding the values with a null value where there is an odd number of values then selecting from each pair of input values a largest value, wherein the remaining one input value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.
25. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when determining the natural logarithm of the sum of the exponentials: determine an approximation of the natural logarithm of the sum of the exponentials, including: selecting from the input values a largest value, wherein the largest value is determined to be the approximation of the natural logarithm of the sum of the exponential of each of the input values.
26. The apparatus of claim 18, wherein the processor unit is configured to perform the following actions when obtaining the input values: receive a first input array comprising a plurality of first input values; and apply adjustable weightings to respective ones of the first input values to generate the input vector.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0082] To assist understanding of embodiments of the present disclosure and to illustrate how such embodiments may be put into effect, reference is made, my way of example only, to the accompanying drawings in which:
[0083]
[0084]
[0085]
[0086]
[0087]
[0088]
[0089]
[0090]
DETAILED DESCRIPTION OF EMBODIMENTS
[0091] The following will present an example activation function implementation suitable for neural networks.
[0092] First however there are described example systems in which the presently disclosed techniques may be implemented. There is also provided an overview of the principles behind neural networks, based upon which embodiments may be built or expanded.
[0093]
[0094] The computing apparatus 200 comprises at least a controller 202, an interface (e.g. a user interface) 204, and an artificial intelligence (AI) algorithm 206. The controller 202 is operatively coupled to each of the interface 204 and the AI algorithm 206.
[0095] Each of the controller 202, interface 204 and AI algorithm 206 may be implemented in the form of software code embodied on computer readable storage and run on processing apparatus comprising one or more processors such as CPUs, work accelerator co-processors such as GPUs or IPUs, and/or other application specific processors, implemented on one or more computer terminals or units at one or more geographic sites. The storage on which the code is stored may comprise one or more memory devices employing one or more memory media (e.g. electronic or magnetic media), again implemented on one or more computer terminals or units at one or more geographic sites. In embodiments, one, some or all the controller 202, interface 204 and AI algorithm 206 may be implemented on the server. Alternatively, a respective instance of one, some or all of these components may be implemented in part or even wholly on each of one, some or all of the one or more user terminals. In further examples, the functionality of the above-mentioned components may be split between any combination of the user terminals and the server. Again it is noted that, where required, distributed computing techniques are in themselves known in the art. It is also not excluded that one or more of these components may be implemented in dedicated hardware.
[0096] The controller 202 comprises a control function for coordinating the functionality of the interface 204 and the AI algorithm 206. The interface refers to the functionality for receiving and/or outputting data. It may comprise a user interface (UI) for receiving and/or outputting data to and/or from one or more users, respectively. Alternatively the interface may be arranged to collect and/or output data to and/or from an automated function. The interface 204 may additionally or alternatively receive and output data to a different component of the computing apparatus and/or to a different device. The interface 204 may comprise a wired or wireless interface for communicating, via a wired or wireless connection respectively, with an external device. The interface 204 may comprise one or more constituent types of interface, such as voice interface, and/or a graphical user interface. The interface 204 may present a UI front end to the user(s) through one or more I/O modules on their respective user device(s), e.g. speaker and microphone, touch screen, etc., depending on the type of user interface. The logic of the interface may be implemented on a server and output to the user through the I/O module(s) on his/her user device(s). Alternatively some or all of the logic of the interface 204 may also be implemented on the user device(s) 102 its/themselves.
[0097] The controller 202 is configured to control the AI algorithm 206 to perform operations in accordance with the embodiments described herein. It will be understood that any of the operations disclosed herein may be performed by the AI algorithm 206, under control of the controller 202 to collect experience data from the user and/or an automated process via the interface 204, pass it to the AI algorithm 206, receive predictions back from the AI algorithm and output the predictions to the user and/or automated process through the interface 204.
[0098] The AI algorithm 206 comprises a machine-learning model 208, comprising one or more constituent statistical models such as one or more neural networks.
[0099] As discussed above each of the controller 202, interface 204 and AI algorithm 206 may be implemented in the form of software code embodied on computer readable storage and run on a data processing system comprising one or more processors. In some embodiments the processor comprises a plurality of processor tiles. The data processing system may be a so called intelligence processing unit (IPU) or any class of accelerator (XPU). The techniques described herein can be used with the IPUs described in our earlier U.S. application Ser. No. 15/885,925, the contents of which are herein incorporated by reference.
[0100]
[0101] When the processor is executing a machine learning or other complex or graph based application, vectors to be processed are provided to the processor from the host as workloads to be processed. Where vectors are small enough, a single vector may be processed by a single respective tile. It is more common for vectors in ML applications to be extremely large. In that case they may be broken up into fragments, and each fragment processed by an individual tile. Results of the processing are provided by each tile and may be combined at the same tile or other tiles, or supplied to the host for combining.
[0102] The chip 302 has a clock generator 303 which generates a clock signal from an on or off chip clock to control the timing of chip activity. The clock generator is connected to all of the chip's circuits and components. The chip 302 comprises a switching fabric 334 to which all tiles and links are connected by sets of connection wires to enable communication between tiles on the processor. Each tile has its own local memory (described later). The tiles do not share memory.
[0103]
[0104] The data memory holds data supplied to the tile for processing, for example vector values of a whole vector or fragment of a vector, and results of that processing.
[0105] The thread scheduler 424 is schematically represented in the diagram by a sequence of time slots S.sub.0 . . . S.sub.5, but in practice is a hardware mechanism managing program counters of the threads in relation to their time slots. The execution pipeline comprises a fetch stage 414, a decode stage 416, and an execution stage 418 comprising an execution unit (EXU) and a load/store unit (LSU). Each of the contexts 426 comprises a respective set of registers R.sub.0, R.sub.1 . . . for representing the program state of the respective thread.
[0106] The fetch stage 414 is connected to fetch instructions to be executed from the instruction memory 412, under control of the thread scheduler 424. The thread scheduler 424 is configured to control the fetch stage 414 to fetch instructions from the local program for execution in each time slot.
[0107] Note that in normal operation the program loaded into each tile is determined by a processor or compiler to allocate work based on the graph of the machine intelligence model being supported. This graph defines what code (executable instructions) is stored and executed on each tile. Data (inputs and outputs) may be exchanged between tiles and or the host.
[0108] Each thread may in some embodiments be a codelet intended to represent a vertex in the graph and to execute atomically. That is all the data it consumes is available at launch and all the data it produces is not visible to other threads until it exits. It runs to completion (excepting error conditions).
[0109] As briefly mentioned above, data is exchanged between tiles in the chip. In general, there may exist dependencies between the portions of a program running on different tiles. A technique is therefore required to prevent a piece of code on one tile running ahead of data upon which it is dependent being made available by another piece of code on another tile. There are a number of possible schemes for achieving this. One scheme is known as “bulk synchronous parallel” (BSP). According to BSP, each tile performs a compute phase and an exchange phase in an alternating cycle. During the compute phase each tile performs one or more computation tasks locally on tile, but does not communicate any results of its computations with any others of the tiles. In the exchange phase, each tile is allowed to exchange one or more results of the computations from the preceding compute phase to and/or from one or more others of the tiles in the group, but does not yet proceed to the next compute phase. Further, according to the BSP principle, a barrier synchronization is placed at the juncture transitioning from the compute phase into the exchange phase, or transitioning from the exchange phase into the compute phase, or both. That is to say, either. (a) all tiles are required to complete their respective compute phases before any in the group is allowed to proceed to the next exchange phase, or (b) all tiles in the group are required to complete their respective exchange phases before any tile in the group is allowed to proceed to the next compute phase, or (c) both. In some scenarios a tile performing computation may be allowed to communicate with other system resources such as a network card or storage disk, as long as no communication with other tiles in the group is involved. As described further herein, results which are exchanged may be from respective fragments of a vector.
[0110] As discussed earlier
[0111] Each node 102 represents a function of the input value(s) received on its input edges(s), the outputs of the function being output on the output edge(s) of the respective node 102, such that the value(s) output on the output edge(s) of the node 102 depend on the respective input value(s) according to the respective function. The function of each node 102 is also parametrized by one or more respective parameters w, sometimes also referred to as weights (not necessarily weights in the sense of multiplicative weights, though that is certainly one possibility). Thus the relation between the values of the input(s) and output(s) of each node 102 depends on the respective function of the node and its respective weight(s).
[0112] As shown in the example of
[0113] A simple example may be a machine-learning model which comprises a single graph, arranged to take a feature vector X as its input and to output a classification Y as its output. The input feature vector X comprises a plurality of elements x.sub.i, each representing a different feature i=0, 1, 2, . . . etc. E.g. in the example of image recognition, each element of the feature vector X may represent a respective pixel value. For instance one element represents the value of the red channel for pixel (0,0); another element represents the value of the green channel for pixel (0,0); another element represents the blue channel of pixel (0,0); another element represents the red channel of pixel (0,1); and so forth. As another example, where the neural network is used to make a medical diagnosis, each of the elements of the feature vector may represent a value of a different symptom or physical feature of the subject, e.g. body temperature, blood pressure, etc. Other example implementations include natural language processing. It would be understood that the concept as discussed herein is not limited to these applications but could be any suitable application.
[0114] As discussed above an example activation function which can be used in the network is the Softmax function. In order to prevent memory overflow when implementing the Softmax function it is known to first determine the maximum value of the input vector before calculating the Softmax value and to subtract the maximum from each value before taking the exponent, such that the maximum value becomes 0 and the exponent of the maximum value is 1 and there is no risk of overflow.
[0115] Furthermore as also discussed this implementation requires the computer to make a series of memory accesses to retrieve the values of the input vector and a series of comparisons between the input vector values to determine the maximum value of the vector. Having determined the maximum value, one needs to compute the denominator. Finally, the Softmax values are then calculated. This calculation requires for each value an access to get the exponent of the numerator, and a second access to get the previously computed denominator.
[0116] The present systems and methods, as discussed in more detail in the embodiments hereafter, are such that the computation of the Softmax function is implemented in such a manner that there is reduction in memory access. Additionally when implementing the embodiments as described herein there may generally be a reduction of dependencies within a massive parallel compute.
[0117] With respect to
[0118] The activation function implementation may first be configured to receive or obtain at step 501 input vectors as a workload from a host and the input vectors stored into tile memory. The input vectors may comprise the input values (or the weighted input values) as shown in
[0119] Note that the technique described herein may be carried out on complete vectors, or on fragments of vectors. The word ‘vector’ in the following encompasses matrices which represent whole vectors or tensors as well as matrices which represent vector or tensor fragments resulting from ‘chopping up’ a whole vector or tensor. The word ‘vector’ may be considered to denote a set of input values.
[0120] In the following examples the input values are shown as unweighted input values x.sub.i (but it would be understood that weighted input values can be used in some examples and embodiments).
[0121] Having obtained the input values x.sub.i for one vector the activation function implementation is configured to determine or compute the natural log of the Softmax value for that vector.
[0122] The operation of determining the natural log of the Softmax value is shown in
[0123] The calculation of the natural log of the Softmax value eliminates the risk of overflow.
[0124] As can be seen in the expression above the calculation of the log of the Softmax value can itself be implemented as two separate terms, a first term x.sub.i and a second term log(Σ.sub.k=1.sup.Ne.sup.x.sup.
[0125] The second term of the above expression can in some embodiments be computed using a method called max* reduction.
[0126] The log of a sum of two exponentials can be rewritten as follows:
log(e.sup.x+e.sup.y)=max(x,y)+log(1+e.sup.−|x-y|)
[0127] This can be approximated to max(x, y) with the addition of a correction term and will be referred to herein as max*(x, y). The correction term is
log(1+e.sup.−|x-y|)
[0128] The max* value is referred to herein as the corrected maximum value. In some cases, the max value may be used without the addition of the correction term. In other words, the reference to max* encompasses it approximation, which is the maximum value of each pair of values, referred to as max.
[0129] In some embodiments the expression of max* can be applied recursively to neighbouring pairs of values in a sum in order to calculate the log of a larger sum of exponentials. For example, if there are four terms in the sum:
[0130] Thus, it is possible to calculate the log of a sum of exponentials while accessing the input vector in memory only once. In the case that at a given level, the number of values is not divisible by two, any single values are carried to the next level of recursion.
[0131] In some embodiments, access can be performed to read successive pairs of values into registers from the memory to be processed by the execution pipeline. Each register could be configured to hold a data chunk representing one or more pairs of values. A chunk is accessed from memory. The execution pipeline performs a max* operation and stores the max* result back to memory, or to a register for a next computation. The next chunk may then be accessed from memory to be processed by the execution pipeline, and the next max* value stored. Where the vector is divided into fragments for processing by respective processing units, such as tiles, each tile may compute the max* value of its fragment. The max* values of the fragments may then be exchanged with other processing units and combined. For example, the max* values of respective fragments may be computed in respective processing units or respective processing threads within a processing unit, in a compute phase of a BSP computing protocol. In an exchange phase of the BSP computing protocol, the max * values may be transmitted to one of the processing units/threads or to an additional processing unit/thread. In a next compute phase, the max* values of the respective fragments may be processed to generate a final max* value for that vector.
[0132]
[0133] For example a first level 601 is the level containing the input values x.sub.i for the function.
[0134] A second level 603 is a level wherein the input values x.sub.i have been paired and the max* value of each is generated. The second level is shown with an odd number of elements.
[0135] A third level 605 is a level where the elements of the second level 603 are paired and the max* value of these pair combinations are calculated. It can be seen that as the second level has an odd number of elements one of these is passed to the next level.
[0136] A fourth level 607 is a level where the elements of the third level 605 are paired and the max* value of these pair combination are calculated. In this example this provides a final value which can be used in the Softmax function implementation.
[0137] This pairing iterative method is shown in an example with respect to
[0138] A first operation is receiving/obtaining the input values x.sub.i as shown in
[0139] The next operation is pairing the values as shown in
[0140] The following operation is from each pair to select or pass the max* value of each pair or its approximation given by the max value of each pair. Where the number of elements was odd, then the operation further passes or selects the unpaired value. The selection or passing operation is shown in
[0141] A further check operation may then determine whether the number of selected elements is one or more as shown in
[0142] Where there is more than one element remaining then the operation can loop back or iterate to pairing the remaining values as shown by the arrow back to step 703.
[0143] Where there is only one element or value remaining then the operation is configured to use the value as an approximation of the log sum of exponential values of the input terms as shown in
[0144] To then determine or calculate the Softmax function, the calculated log sum should be subtracted from the given value x.sub.i of the input vector x.
[0145] Having determined the natural log of the Softmax value this may be exponentiated to evaluate the Softmax function value. In this case, there is no risk of overflow as the final result is bound to be between 0 and 1 by the definition of the Softmax function. Additionally the combination or sum of the exponential terms is furthermore avoided and any issue of swamping of an input value reduced. The determination of the exponentiate of the natural log of the Softmax value is shown in
[0146] There are certain applications in which the Softmax function itself is not needed but the natural logarithm of Softmax is. In such applications the step of exponentiating the result is not carried out, but the log sum is calculated and subtracted from each value in the input vector.
[0147] The result of using this method in favour of calculating the max of the input vector and using it to scale the Softmax calculation is that the Softmax of the input may be calculated while making fewer memory accesses to the input vector, which reduces the computational resources required, particularly in the case of very large input vectors, or many Softmax calculations. In addition, the number of operations required to compute the Softmax function is reduced. Each value of the input vector is accessed when calculating the log of the sum of exponentials by recursively applying the max* reduction, and once more when the calculated log sum is subtracted from the vector and the result is exponentiated.
[0148] This, for example, may be more efficient than implementing the following five steps: pre-calculating the max of the vector, subtracting the max from each value of the vector, taking the exponential of each value of the vector, summing the exponentials and dividing each exponentiated value of the vector by the sum.
[0149] In some embodiments the max* reduction as shown in the examples of
[0150] There may be some additional advantages that when processing is carried out over multiple threads, fewer steps may further improve performance. Thus in a simple example, the max needs to be calculated first, and all threads would be synchronised at that point to then subtract the max from the vector values in parallel, and synchronised again after calculating the sum of exponentials in order for each thread to divide by the sum.
[0151] In the embodiments described above, synchronisation can occur when calculating max* for the vector so that each thread could subtract this value from the vector elements and then exponentiate the result. The subtraction and exponentiation can be implemented in parallel for different vector elements once the max* result is known.
[0152] This may be particularly advantageous in implementation architectures such as discussed with respect to
[0153] It will be appreciated that the above embodiments have been described by way of example only.
[0154] For example although the algorithm has been discussed with a massively parallel compute wherein each thread may implement a vertex it would be appreciated that embodiments may be applicable to a pool of threads/processes whether they are implemented in software on a processing unit comprising one or more processing node or a IPU such as described above.
[0155] Other variants and applications of the disclosed techniques may become apparent to a skilled person once given the disclosure herein. The scope of the present disclosure is not limited by the described embodiments but only by the accompanying claims.