Communication in a Computer Having Multiple Processors
20230185577 · 2023-06-15
Inventors
CPC classification
G06F9/3885
PHYSICS
G06F15/80
PHYSICS
H04L45/00
ELECTRICITY
G06F15/17318
PHYSICS
G06F9/4881
PHYSICS
International classification
G06F9/38
PHYSICS
G06F9/30
PHYSICS
G06F9/52
PHYSICS
G06F15/80
PHYSICS
Abstract
A computer comprising a plurality of processors, each of which is configured to perform operations on data during a compute phase for the computer and, following a pre-compiled synchronisation barrier, to exchange data with at least one other of the processors during an exchange phase for the computer, wherein each of the processors in the computer is indexed and the data exchange operations carried out by each processor in the exchange phase depend upon its index value.
Claims
1. A system comprising: a plurality of chips, the plurality of chips including a first chip and a second chip, wherein the first chip includes: at least one storage comprising a program having a set of executable instructions and an index value identifying the first chip among the plurality of chips; at least one data memory configured to store data for use in calculations specified by the set of executable instructions; and at least one execution unit configured to execute the set of executable instructions, wherein the at least one execution unit is configured to execute the set of executable instructions to: generate a data packet having first data from the at least one data memory, including selecting the second chip as a destination to which the data packet is to be sent in dependence upon the index value identifying the first chip, and providing an address of the second chip in the data packet; and dispatch the data packet via routing hardware configured to route the data packet to the second chip.
2. The system of claim 1, wherein the plurality of chips are configured to participate in a collective operation, the collective operation including dispatching a plurality of data packets, the plurality of data packets including the data packet.
3. The system of claim 1, wherein the at least one execution unit is configured to, for the data packet: calculate the address of the second chip by performing arithmetic operations on the index value identifying the first chip.
4. The system of claim 1, wherein the at least one execution unit is configured to, for the data packet: determine the address for the data packet by selecting an instruction from the at least one storage based on the index value identifying the first chip.
5. The system of claim 1, wherein generating the data packet comprises: in dependence upon the index value of the first chip, determining an address in data memory of the second chip; and providing the address in the data memory of the second chip in a header of the data packet.
6. The system of claim 1, wherein the at least one execution unit is further configured to: select, in dependence upon the index value identifying the first chip, the first data held in the at least one data memory for transmission in the data packet.
7. The system of claim 1, wherein the plurality of chips includes a third chip comprising a further at least one execution unit configured to execute a further set of executable instructions to: generate a second data packet that includes second data from a further at least one data memory of the third chip; and dispatch the second data packet via the routing hardware to the first chip, wherein generating the second data packet comprises: in dependence upon a further index value identifying the third chip, selecting the first chip as a destination to which the second data packet is to be sent; and providing a second address of the first chip in the second data packet.
8. The system of claim 1, wherein the routing hardware comprises a routing table including fixed routing information for routing a plurality of data packets from the first chip.
9. The system of claim 1, wherein the first chip comprises a first system on chip, and wherein the second chip comprises a second system on chip.
10. The system of claim 1, wherein the at least one execution unit is configured to execute the set of executable instructions to: generate the first data for inclusion in the data packet by combining data received from at least one other chip of the plurality of chips with further data stored in the at least one data memory of the first chip.
11. The system of claim 10, wherein the at least one execution unit is configured to select the further data from the at least one data memory in dependence upon the index value identifying the first chip.
12. The system of claim 1, wherein the at least one execution unit is further configured to: perform operations on input data held in the at least one data memory to generate results, wherein the first data included in the data packet comprises the results.
13. The system of claim 12, wherein the operations comprise operations to derive updates to weights of a neural network.
14. The system of claim 13, wherein the results comprise delta weights.
15. The system of claim 12, wherein the at least one execution unit is configured to execute the set of executable instructions to: perform the operations on the input data during a compute phase, the compute phase being separated from an exchange phase by a barrier synchronisation between the plurality of chips; and dispatch the data packet during the exchange phase.
16. A computer implemented method of generating multiple programs, each of which is suitable for execution by a processor of one of a plurality of chips, the method comprising: compiling a single set of executable instructions; determining a first index value for a first one of the chips and a second index value for a second one of the chips; and generating, for the first one of the chips, a first local program comprising the single set of executable instructions and the first index value, and generating, for the second one of the chips, a second local program comprising the single set of executable instructions and the second index value, wherein the single set of executable instructions, when allocated to the first one of the chips is scheduled to execute on a first processor of the first one of the chips to cause: in dependence upon the first index value, selecting a further one of the chips as a destination and providing an address of the further one of the chips in a data packet; and dispatching the data packet via routing hardware to the further one of the chips.
17. The computer implemented method of claim 16, wherein the single set of executable instructions, when allocated to the second one of the chips is scheduled to execute on a second processor of the second one of the chips to cause: in dependence upon the second index value, selecting the first one of the chips as a destination and providing an address of the first one of the chips in a further data packet; and dispatching the further data packet via routing hardware to the first one of the chips.
18. A non-transitory computer readable medium storing a computer program comprising multiple local programs each of which is suitable for execution by a processor of one of a plurality of chips, each of the local programs comprising: a set of executable instructions; and an index value, associated with one of the chips on which the local program is scheduled to run, wherein a first of the local programs, when allocated to a first one of the chips is scheduled to execute on a first processor of the first one of the chips to cause: in dependence upon a first of the index values that is associated with the first one of the chips, selecting a further one of the chips as a destination and providing an address of the selected further one of the chips in a data packet; and dispatching the data packet via first routing hardware to the further one of the chips.
19. The non-transitory computer readable medium of claim 18, wherein a second of the local programs, when allocated to a second one of the chips is scheduled to execute on a second processor of the second one of the chips to cause: in dependence upon a second of the index values that is associated with the second one of the chips, selecting the first one of the chips as a destination and providing an address of the first one of the chips in a further data packet; and dispatching the further data packet via second routing hardware to the first one of the chips.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0038] For a better understanding of the present disclosure, and to show how the same may be carried into effect, reference will now be made by way of example to the accompanying drawings.
DETAILED DESCRIPTION
[0055] Aspects of the present disclosure have been developed in the context of a computer comprising multi-tile processors, which are designed to act as accelerators for machine learning workloads. However, the disclosure is not limited to the machine learning context. The accelerator comprises a plurality of interconnected processors. In some embodiments, each processor may take the form of a multi-tile processor. The multi-tile processors, which may be used to implement embodiments of the disclosure, are described in U.S. Pat. Application No. 15/886315, which is incorporated herein by reference. Alternatively, each processor may simply take the form of a single monolithic processor.
[0056] Example embodiments of the disclosure will now be described in more detail with reference to the accompanying Figures.
[0057] Reference is made to
[0058] Each of the plurality of processors 2 comprises at least one instruction memory 410 storing instructions executable by at least one execution unit 420 of the respective processor 2. Each processor 2 also comprises at least one storage 415 storing an index value. Although each of the units of storage 405, 410, 415 is shown as a separate storage element in
[0059] Together, the set of instructions in the instruction memory 410 and the index in the index storage 415 constitute a program for the respective processor 2. Each processor 2 comprises at least one execution unit configured to execute instructions held in instruction memory 410 to modify data values held in the data memory 405 and to perform operations to transfer data to and from other processors 2. The instructions in instruction memory 410 are the same in each of the processors 2. However, the index value held in index storage 415 differs between processors 2. Each processor 2 of computer 700 stores a different index value in its respective index storage 415. The index value held in storage 415 by each processor 2 uniquely identifies that processor 2 among the processors 2 of the computer 700. Each of the processors 2 is configured to perform a different set of data transfer operations in dependence upon its stored index value so as to appropriately transfer data between the processors 2 of the computer 700. The index value controls which data from the memory 405 is transferred, and controls the operations that are performed with respect to received data.
[0060] Routing hardware (not shown in
[0061] The computer 700 alternates between compute phases and exchange phases. During a compute phase, each of the processors 2 in the computer performs computations until reaching a pre-compiled synchronisation barrier. After the barrier, the computer 700 moves into the exchange phase, in which the processors 2 exchange data with one another.
[0062] Some of the instructions in the instruction memory 410, when executed by the execution unit 420, cause the execution unit 420 to perform calculations using the data held in data memory 405. Some of the instructions in the instruction memory 410, when executed by the execution unit 420, cause transfer of data held in data memory 405 to another one of the processors 2. Some of the instructions in the instruction memory 410, when executed by the execution unit 420, cause storage of the data received from a further one of the processors 2 in data memory 405 of the processor.
[0063] Therefore, instructions are executed by the execution unit 420 to control data transfer between the processor 2 comprising the execution unit 420 and a further processor of the computer 700. The index stored in index storage 415 is used by the execution unit 420 to control the data transfer. This may be implemented in different ways. In some embodiments, an instruction from the instruction memory 410 takes the index value as an input and, when executed, causes the execution unit 420 to determine how the data transfer is to be performed in dependence upon calculations performed on the index value. In some embodiments, the index value is used to select a particular instruction from the instruction memory 410 to perform a certain data transfer operation.
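For illustration only, the two index-dependent mechanisms described in this paragraph can be sketched in Python. The function names and the ring topology are hypothetical assumptions, not taken from the disclosure; a real implementation runs as compiled instructions on the execution unit.

```python
# Sketch of the two index-dependent transfer mechanisms described above.
# All names are illustrative.

def dest_by_arithmetic(index, num_procs):
    # Mechanism 1: the destination is calculated from the index value,
    # here taken (as an assumption) to be the next processor in a ring.
    return (index + 1) % num_procs

def dest_by_selection(index, transfer_table):
    # Mechanism 2: the index value selects a particular pre-compiled
    # transfer instruction, modelled here as a lookup table.
    return transfer_table[index]

# The same program is loaded on every processor; only the index differs.
NUM_PROCS = 6
TABLE = {i: (i + 1) % NUM_PROCS for i in range(NUM_PROCS)}
destinations = [dest_by_arithmetic(i, NUM_PROCS) for i in range(NUM_PROCS)]
```

Both mechanisms yield the same destinations in this toy ring; the second mechanism additionally allows arbitrary, irregular transfer patterns to be fixed at compile time.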
[0064] The index value may be used by the execution unit 420 to select the particular processor of the plurality of processors 2 to which data is to be transferred from the data memory 405. In one example, processor 2i selects, on the basis of the index value stored in index storage 415, the processor 2ii from the plurality of processors. The execution unit 420 then causes the data to be transferred to the selected processor 2ii. In order to do so, the execution unit 420 processes the data to produce data packets containing an address of the processor 2ii, and then dispatches the packets, which are conveyed by routing hardware of the computer 700 to the processor 2ii.
[0065] The index value may be used by the execution unit 420 to select the address on the destination processor at which the data is to be stored. For example, the execution unit 420 of processor 2i determines, on the basis of the index value of processor 2i, an address in data memory 405 of processor 2ii. The execution unit 420 then processes the data to be transferred to processor 2ii to produce one or more data packets comprising one or more headers containing the address in memory 405 of processor 2ii at which the data is to be stored. The execution unit 420 then causes these data packets to be sent to processor 2ii, where they are stored in the memory 405 at the address indicated in the headers of the data packets.
[0066] The index value may be used by the execution unit 420 to select the particular data from data memory 405 that is to be transferred. The execution unit 420 determines an address in data memory 405 of the data to be transferred. The execution unit 420 then causes the selected data to be transferred to one of the processors, e.g. processor 2ii. The execution unit may determine the address in data memory 405 of the data to be transferred by performing a calculation taking the index value as an input or by selecting an instruction from the instruction memory 410 for causing the particular data to be transferred.
[0067] The index value may be used by the execution unit 420 to control the handling of received data at a processor 2. For example, when processor 2ii receives the data from processor 2i, the execution unit 420 of the processor 2ii may use the respective index value to control where in data memory 405 the data is stored. Additionally or alternatively, when the processor 2ii receives the data from the processor 2i, the execution unit 420 of the processor 2ii may use the respective index value to select data from data memory 405 and to perform an operation (e.g. combining the data) involving both the received data and the data selected from memory 405. The processor 2ii may then store the result of the operation in memory 405.
[0068] Although in
[0069] Embodiments of the disclosure may be applied for the exchange of data between processors of a computer when training a machine learning model. In order to explain such an example application of the disclosed techniques, reference is made to
[0070] The aim with the architecture of
[0071] One way in which the exchange of data may be efficiently implemented between processors is through the use of collectives, which are routines commonly used when processing data in a computer. They are routines which enable data to be shared and processed across multiple different processes, which may be running on the same processor or on different processors. For example, if one process reads data from a data store, it can use a “broadcast” process to share that data with other processes. Another example is when the result of a particular function is needed on multiple processes. A “reduction” is a result obtained by applying a compute function to a data value from each of multiple processes. “Gather” and “scatter” collectives handle more than one data item. Certain collectives have become increasingly important in processing machine learning applications.
[0072] MPI (Message Passing Interface) is a message passing standard which can be applied to many parallel computing architectures. MPI defines a number of collectives applicable to machine learning. One such collective is termed “all-reduce”. An all-reduce operation enables a result of a compute function acting on multiple data values from different source processes to be provided at a receiving process. Note that a receiving process may be one of the source processes, and that there may be multiple receiving processes. The all-reduce collective reduces the data values from multiple source processes and distributes the results to all the source processes, (which are acting as receiving processes for the reduced result). According to the MPI Standard, the all-reduce collective is implemented by reducing the data values from all source processes in a reduce collective (e.g. at one of the processes) and then broadcasting the result to each source process.
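The MPI-style decomposition of all-reduce into a reduce followed by a broadcast can be modelled as a toy single-process simulation. This is plain Python, not an MPI program; the function name is illustrative and summation is assumed as the compute function.

```python
def all_reduce_reduce_broadcast(partials):
    # Reduce step: combine corresponding elements of every source
    # process's partial vector (summation used as the compute function).
    reduced = [sum(col) for col in zip(*partials)]
    # Broadcast step: every source process, acting as a receiving
    # process, obtains the full reduction.
    return [list(reduced) for _ in partials]

partials = [[1, 2], [3, 4], [5, 6]]   # one partial vector per process
results = all_reduce_reduce_broadcast(partials)
```

Every process ends up holding the same reduced vector, which is the defining property of the all-reduce collective.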
[0073]
[0074] The all-reduce collective, which may be used to exchange the delta weights between processors, is illustrated in
The reduction function f, which could include independent operators (e.g. max) or associative operators, gives f = P.sub.1(N.sub.0) * P.sub.1(N.sub.1) * P.sub.1(N.sub.2) * P.sub.1(N.sub.3). Then, in an all-gather pass, each reduction is provided to all processors to activate a state S3 wherein each processor now holds all four reductions. Note that in S1, the “corresponding” partials, e.g. P.sub.0(0), P.sub.0(1), P.sub.0(2) and P.sub.0(3), may all differ whereas, in state S3, each reduction, e.g. r.sub.0, is the same at all processors, where r.sub.i = f(P.sub.i(0), P.sub.i(1), P.sub.i(2), P.sub.i(3)). In machine learning, the set of partial values P.sub.0, P.sub.1, P.sub.2, P.sub.3 is a vector. A vector of partials (e.g. updated weights) is produced on each pass of the model during training. The reduction r.sub.0, r.sub.1, r.sub.2, r.sub.3 on each processor in state S3 is the full reduction vector. In the context of machine learning, each partial could be a set of updating deltas for a parameter in the model. Alternatively (in an arrangement not described further herein) it could be an updated parameter.
[0075] Therefore, as noted, an all-reduce operation consists of a reduce-scatter operation, followed by an all-gather operation. During the reduce-scatter operation, each node exchanges different elements of the partial. When the reduce-scatter is complete, each node has one nth (1/n) of the final all-reduce. During the all-gather, each node receives an additional 1/n of the final all-reduce until, after n-1 steps, all of the nodes have the complete data set.
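The reduce-scatter/all-gather decomposition on a ring can be simulated in a few lines of Python. This is a single-process sketch with hypothetical helper names and summation as the reduction; real hardware performs these steps concurrently over physical links.

```python
# Toy single-process simulation of ring all-reduce: a reduce-scatter
# pass followed by an all-gather pass over n nodes.

def ring_all_reduce(partials):
    n = len(partials)
    frag = len(partials[0]) // n          # fragment length (assumes length divisible by n)
    bufs = [list(p) for p in partials]    # per-node working buffers

    def sl(f):
        return slice(f * frag, (f + 1) * frag)

    # Reduce-scatter: at step s, node i sends fragment (i - s) mod n to its
    # ring neighbour, which adds it into its own copy. After n - 1 steps,
    # node i holds the full reduction of fragment (i + 1) mod n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, bufs[i][sl((i - s) % n)]) for i in range(n)]
        for i, f, data in sends:
            dst = (i + 1) % n
            bufs[dst][sl(f)] = [a + b for a, b in zip(bufs[dst][sl(f)], data)]

    # All-gather: each node forwards its fully reduced fragment around the
    # ring until every node holds every fragment.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, bufs[i][sl((i + 1 - s) % n)]) for i in range(n)]
        for i, f, data in sends:
            bufs[(i + 1) % n][sl(f)] = list(data)
    return bufs
```

After the reduce-scatter loop each node holds exactly 1/n of the result; after the all-gather loop every node holds the complete reduced vector, matching the n-1 step counts stated above.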
[0076]
Similarly for the Y, G, B, P and L fragments.
[0077]
[0078] The notation in
[0079] In step one, the first fragment (the A.sub.0) in each virtual ring is transferred from its processor to the next adjacent processor where it is reduced with the corresponding fragment at that processor. That is, RA.sub.0 moves from N.sub.0 to N.sub.1 where it is reduced with RA.sub.1 to form RA.sub.0,1. The notation 0, 1 indicates that the fragment is formed by the reduction of the first and second fragments in the virtual ring. Note that, in the same step, the A.sub.0 fragments of each virtual ring are simultaneously transmitted. That is, the link between N.sub.1 and N.sub.2 is used to transmit YA.sub.0, the link between N.sub.2 and N.sub.3 is used to transmit GA.sub.0, et cetera. In the next step, the corresponding reduced fragments are transmitted over the forward links to their next adjacent processor. For example, RA.sub.0,1 is transmitted from N.sub.1 to N.sub.2, and YA.sub.0,1 is transmitted from N.sub.2 to N.sub.3. Note that for reasons of clarity, not all fragments in
[0080] The beginning of the all-gather phase starts by a transmission from the last to the first processor in each virtual ring. Thus, the final reduction for the R fragments ends on processor N.sub.5 ready for the first step of the all-gather phase. The final reduction of the Y fragments correspondingly ends up on the processor N.sub.0. In the next step of the all-gather phase, the reduced fragments are transmitted again to their next adjacent processor. Thus the fully reduced R fragment is now also at N.sub.2, the fully reduced Y fragment is now also at N.sub.3 and so on. In this way, each processor ends up at the end of the all-gather phase with all fully reduced fragments R, Y, G, B, P, L of the partial.
[0081] Example embodiments of the disclosure can be applied to control the exchange of data in a machine learning context. Specifically, example embodiments can be applied to control the exchange of data during a reduce-scatter operation described above with respect to
[0082] Reference is made to
[0083] As shown in
[0084] In addition to each processor 2 using its index value to select data to be transferred, upon receiving data packets from another processor 2, the execution unit 420 of each processor 2 determines where in memory 405 data fragments derived from the received data packets are to be stored, in dependence upon its index value stored in the index storage 415. The execution unit 420 of each processor 2 also selects any other data fragments with which to combine the received data. For example, the execution unit 420 of processor N.sub.0 receives the data labelled LA.sub.0 from processor N.sub.5 and, in dependence upon its index, executes instructions to reduce this with the data LA.sub.1 held at location L in memory 405. The execution unit 420 of processor N.sub.1 receives the data labelled RA.sub.0 from processor N.sub.0 and, in dependence upon its index, executes instructions to reduce this with the data RA.sub.1 held at location R in memory 405. The execution unit 420 of processor N.sub.2 receives the data labelled YA.sub.0 from processor N.sub.1 and, in dependence upon its index, executes instructions to reduce this with the data YA.sub.1 held at location Y in memory 405. The execution unit 420 of processor N.sub.3 receives the data labelled GA.sub.0 from processor N.sub.2 and, in dependence upon its index, executes instructions to reduce this with the data GA.sub.1 held at location G in memory 405. The execution unit 420 of processor N.sub.4 receives the data labelled BA.sub.0 from processor N.sub.3 and, in dependence upon its index, executes instructions to reduce this with the data BA.sub.1 held at location B in memory 405. The execution unit 420 of processor N.sub.5 receives the data labelled PA.sub.0 from processor N.sub.4 and, in dependence upon its index, executes instructions to reduce this with the data PA.sub.1 held at location P in memory 405.
[0085] Each of the transferred data fragments may correspond to the data fragments shown in
[0086] Reference is made to
[0087] The at least one execution unit 420 of each processor 2 is configured to select and pass a reduced fragment in dependence upon the index value it stores. The execution unit 420 of processor N.sub.0 selects and transfers the data labelled RA.sub.i to processor N.sub.1 in dependence upon the index value held by processor N.sub.0. The execution unit 420 of processor N.sub.1 selects and transfers the data labelled YA.sub.i to processor N.sub.2 in dependence upon the index value held by processor N.sub.1. The execution unit 420 of processor N.sub.2 selects and transfers the data labelled GA.sub.i to processor N.sub.3 in dependence upon the index value held by processor N.sub.2. The execution unit 420 of processor N.sub.3 selects and transfers the data labelled BA.sub.i to processor N.sub.4 in dependence upon the index value held by processor N.sub.3. The execution unit 420 of processor N.sub.4 selects and transfers the data labelled PA.sub.i to processor N.sub.5 in dependence upon the index value held by processor N.sub.4. The execution unit 420 of processor N.sub.5 selects and transfers the data labelled LA.sub.i to processor N.sub.0 in dependence upon the index value held by processor N.sub.5. Each of these data transfers completes the first step of the all-gather operation. By performing the subsequent steps, each processor 2 is provided with each reduced fragment.
[0088] In addition to each processor 2 using its index value to select data to be transferred, upon receiving data from another processor 2, the recipient processor determines where in memory the data is to be stored in dependence upon the index value stored in the index storage 415. For example, the execution unit 420 of processor N.sub.0 receives the data labelled LA.sub.i from processor N.sub.5 and, in dependence upon its index, executes instructions to store this data at location L in memory 405. The execution unit 420 of processor N.sub.1 receives the data labelled RA.sub.i from processor N.sub.0 and, in dependence upon its index, executes instructions to store this data at location R in memory 405. The execution unit 420 of processor N.sub.2 receives the data labelled YA.sub.i from processor N.sub.1 and, in dependence upon its index, executes instructions to store this data at location Y in memory 405. The execution unit 420 of processor N.sub.3 receives the data labelled GA.sub.i from processor N.sub.2 and, in dependence upon its index, executes instructions to store this data at location G in memory 405. The execution unit 420 of processor N.sub.4 receives the data labelled BA.sub.i from processor N.sub.3 and, in dependence upon its index, executes instructions to store this data at location B in memory 405. The execution unit 420 of processor N.sub.5 receives the data labelled PA.sub.i from processor N.sub.4 and, in dependence upon its index, executes instructions to store this data at location P in memory 405.
[0089] Each processor comprises the full set of instructions for transferring data, such as that shown in memory in
[0090] In some embodiments, the index value held by each processor determines the instructions in the set of instructions that are executed by each execution unit 420 to perform the transfer of the appropriate data fragment from memory 405 to another processor 2. Each processor 2 also comprises the full set of instructions for receiving and storing fragments at the appropriate location in memory 405. The index value held by each processor 2 determines the instructions in the set of instructions that are executed by the at least one execution unit 420 of the processor 2 to store a received fragment at the appropriate location. The execution unit 420, in this case, performs a branch operation that depends upon the index value to select a particular set of instructions that are executed for performing the relevant data transfer.
[0091] In some embodiments, the at least one execution unit 420 of each processor 2 performs arithmetic operations using the index value as an input to determine the address in data memory 405 from which data is to be read or to which data is to be written.
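For illustration, such address arithmetic might look as follows. The layout constants and the fragment schedule are assumptions introduced for the sketch, not taken from the disclosure.

```python
# Hypothetical sketch: deriving a data-memory address arithmetically
# from the processor's index value. Constants are illustrative only.

FRAGMENT_SIZE = 256          # bytes per fragment (assumed)
BASE_ADDRESS = 0x1000        # start of the fragment region (assumed)

def send_fragment_address(index, step, num_procs):
    # Address of the fragment a processor sends at a given reduce-scatter
    # step, computed purely from its index value and the step number.
    fragment = (index - step) % num_procs
    return BASE_ADDRESS + fragment * FRAGMENT_SIZE
```

Because the address is a pure function of the index value, the same compiled instruction sequence yields a different memory access pattern on each processor.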
[0092] In
[0093] Reference is made to
[0094] In embodiments, each processor 2 also comprises one or more external links 8, enabling the processor 2 to be connected to one or more other processors (e.g. one or more other instances of the same processor 2). These external links 8 may comprise any one or more of: one or more processor-to-host links for connecting the processor 2 to a host processor, and/or one or more processor-to-processor links for connecting together with one or more other instances of the processor 2 on the same IC package or card, or on different cards. In one example arrangement, the processor 2 receives work from a host processor (not shown) which is connected to the processor via one of the processor-to-host links in the form of input data to be processed by the processor 2. Multiple instances of the processor 2 can be connected together into cards by processor-to-processor links. Thus a host accesses a computer having multiple processors 2, each of which is architected as a multi-tile system on a chip, depending on the workload required for the host application.
[0095] The interconnect 34 is configured to enable the different tiles 4 in the array 6 to communicate with one another. However, as well as there potentially being dependencies between threads on the same tile 4, there may also be dependencies between the portions of the program running on different tiles 4 in the array 6. A technique is therefore used to prevent a piece of code on one tile 4 running ahead of data upon which it is dependent being made available by another piece of code on another tile 4.
[0096] Each tile 4 is itself a processor capable of executing instructions (code) from a local instruction memory and handling data in local data memory. A tile 4 may comprise a respective instance of a barrel-threaded processor and a memory. For instance, by way of illustration, the processor 2 may comprise on the order of hundreds of tiles 4, or even over a thousand. For completeness, note also that an “array” as referred to herein does not necessarily imply any particular number of dimensions or physical layout of the tiles 4.
[0097] Communication between tiles 4 on the processor 2 occurs in a time deterministic fashion. However, other forms of inter tile exchange are possible. There may be dependencies between the portions of the program running on different tiles 4 in the array 6. That is, processing data on one tile may depend on results from another tile, e.g. may provide results on which another tile depends. A technique is, therefore, used to prevent a piece of code on one tile 4 running ahead of data upon which it is dependent being made available by another piece of code on another tile 4.
[0098] Parallel programming models for AI and Data Science usually follow a 3-phase iterative execution model: Compute, Barrier, and Exchange. The implication is that data transfer to and from a processor is usually barrier-dependent to provide data consistency between the processors and between each processor and a host. Typically used data consistency models are Bulk Synchronous Parallel (BSP), Stale Synchronous Parallel (SSP) and Asynchronous. Embodiments described herein use a BSP model, but it will be apparent that the other sync models could be utilised as an alternative.
[0099] Reference is made to
[0100] According to the BSP principle, a barrier synchronization 30 is placed at the juncture transitioning from the compute phase 33 into the exchange phase 32, or the juncture transitioning from the exchange phase 32 into the compute phase 33, or both. That is to say, either: (a) all tiles 4 are required to complete their respective compute phases 33 before any in the group is allowed to proceed to the next exchange phase 32, or (b) all tiles 4 in the group are required to complete their respective exchange phases 32 before any tile in the group is allowed to proceed to the next compute phase 33, or (c) both of these conditions are enforced. In all three variants, it is the individual tiles which alternate between phases, and the whole assembly which synchronizes. The sequence of exchange and compute phases may then repeat over multiple repetitions. In BSP terminology, each repetition of exchange phase and compute phase is sometimes referred to as a “superstep” (though note that in the literature the terminology is not always used consistently: sometimes each individual exchange phase and compute phase individually is called a superstep, whereas elsewhere, as in the terminology adopted herein, the exchange and compute phases together are referred to as a superstep).
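Variant (a) of the BSP principle described in this paragraph can be sketched with a barrier primitive. This Python threading model is illustrative only; the worker count, trace structure, and names are assumptions, and real tiles synchronise in hardware rather than via threads.

```python
import threading

# Toy model of BSP variant (a): every worker must finish its compute
# phase before any worker starts the exchange phase, and vice versa.

NUM_WORKERS = 4
NUM_SUPERSTEPS = 2
barrier = threading.Barrier(NUM_WORKERS)
lock = threading.Lock()
trace = []  # (phase, superstep, worker) records in global order

def worker(tid):
    for s in range(NUM_SUPERSTEPS):
        with lock:
            trace.append(("compute", s, tid))
        barrier.wait()   # barrier: no worker exchanges until all have computed
        with lock:
            trace.append(("exchange", s, tid))
        barrier.wait()   # barrier: no worker computes until all have exchanged

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Inspecting `trace` shows that, within each superstep, every compute record precedes every exchange record, which is exactly the ordering the barrier synchronization 30 enforces.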
[0101] Note also, it is not excluded that multiple different independent groups of tiles 4 on the same processor 2 or different processors could each form a separate respective BSP group operating asynchronously with respect to one another, with the BSP cycle of compute, synchronize and exchange being imposed only within each given group, but each group doing so independently of the other groups. I.e. a multi-tile array 6 might include multiple internally synchronous groups each operating independently and asynchronously to the other such groups (discussed in more detail later). In some embodiments there is a hierarchical grouping of sync and exchange, as will be discussed in more detail later.
[0102]
[0103] The communication between tiles 4 on a processor 2 occurs in a time-deterministic fashion in which data packets are transmitted without headers. This is explained in our earlier application U.S. Pat. Application No: 15/886315.
[0104] In embodiments, multiple instances of the processor 2 are connected together to form an even larger array of tiles 4 spanning multiple processors 2. This is illustrated in
[0105]
[0106] At the physical layer, the interconnect mechanism is lossy, but at the transaction layer, the mechanism is not lossy due to the architecture of the link layer: if a packet is not acknowledged, it will be resent automatically by the hardware in the interconnect 72. The possibility of loss and resending at the data link layer, however, means that the delivery of data packets over the external interconnect 72 is not time-deterministic. Further, all the packets of a given exchange may arrive together or separated in time, and in any order, so the external interconnect employs flow control and queuing. Further, the interconnect may use clock-data-recovery (CDR) technology to infer a clock from a received data stream having sufficient data signal transitions to maintain bit-lock. This inferred clock will be of unknown phase relationship to the sending clock and hence represents an additional source of non-determinism.
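A non-limiting sketch of why link-layer retransmission makes delivery reliable but not time-deterministic is given below. The loss rate, function names and retry limit are all hypothetical; the point is only that the number of attempts, and hence the delivery time, varies per packet.

```python
import random

random.seed(0)  # fixed seed so the simulated losses are repeatable

def lossy_physical_send(packet) -> bool:
    """Simulate the lossy physical layer: roughly 30% of sends are dropped
    (i.e. never acknowledged)."""
    return random.random() > 0.3  # True means the packet was acknowledged

def link_layer_send(packet, max_retries: int = 100) -> int:
    """Transaction-layer reliability: resend automatically until acknowledged.
    Returns the number of attempts that were needed."""
    for attempt in range(1, max_retries + 1):
        if lossy_physical_send(packet):
            return attempt  # delivered; the sender's program never sees the loss
    raise RuntimeError("link failure: retry limit exceeded")

# Every packet is eventually delivered, but the attempt counts differ,
# so arrival times (and ordering) over the external interconnect vary.
attempts = [link_layer_send(p) for p in range(5)]
print(attempts)
```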
[0107] As illustrated, the external interconnect 72 comprises an external exchange block (XB) 78. The compiler nominates one of the tiles 4 to send an external exchange request (XREQ) to the exchange block 78 (operation S1). The XREQ is a message comprising one or more control packets, indicating which of the tiles 4 have data packets (content) to send to another tile or tiles 4 on another processor 2. This is illustrated schematically in
[0108] Reference is made to
[0109] The instructions held in the instruction memory 910 may be the same in corresponding tiles 4 of each processor 2. Reference is made to
[0110] Referring back to
[0111] The tile 4 comprises an index value held in index value store 920. The operations performed by the execution unit 905 during the compute phase to manipulate the data held in data memory 915 are independent of the index value held in the index value store 920. However, the operations performed by the execution unit 905 during the exchange phase with other processors 2 depend upon the index value. Although the index storage 920 is shown as being separate to the instruction memory 910, in some embodiments the index storage 920 and instruction memory 910 may form part of a single memory array.
[0112] As shown, the data held in data memory 915 is divided into different portions/fragments (shown as slices). Prior to the data being exchanged with other tiles, the at least one execution unit 905 is configured to execute instructions to transfer data from one or more of the portions to a send buffer 925. The execution unit 905 selects the data to transfer in dependence upon the index value held in index storage 920. The execution unit 905 then passes the selected data to the send buffer 925. During an exchange phase, the execution unit 905 executes instructions to send the data via interface 8. Sending the data via interface 8 comprises appending headers to the data packets with destination addresses for the data in another processor 2. The data packets are sent to that processor 2 in accordance with the scheme discussed above with respect to
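The send-side behaviour described above may be sketched, in a non-limiting illustrative form, as follows. The slice size, the ring-ordered choice of destination, and the dictionary packet format are all hypothetical; what is representative is that both the portion of data memory selected and the destination address written into the header depend upon the index value.

```python
NUM_PROCESSORS = 4  # illustrative size of the computer
SLICE_SIZE = 2      # illustrative size of each data portion ("slice")

def select_and_send(index: int, data_memory: list, send) -> None:
    # Select which slice of data memory to transfer to the send buffer,
    # in dependence upon the index value.
    start = index * SLICE_SIZE
    payload = data_memory[start:start + SLICE_SIZE]  # copy into send buffer
    # Select the destination processor in dependence upon the index value
    # (here, hypothetically, the next processor in a ring) and append a
    # header carrying that destination's address.
    destination = (index + 1) % NUM_PROCESSORS
    packet = {"dest": destination, "payload": payload}
    send(packet)  # dispatch via the interface / routing hardware

sent = []
select_and_send(1, list(range(8)), sent.append)
print(sent)  # [{'dest': 2, 'payload': [2, 3]}]
```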
[0113] During an exchange phase, the tile 4 is configured to receive one or more data packets. These data packets are received from tiles on other processors 2. Upon receiving the one or more data packets, the data packets are stored in the receive buffer 930. The at least one execution unit 905 executes instructions to handle the received data in dependence upon the index value held in storage 920. The at least one execution unit 905 is configured to store data derived from the data packets at locations in data memory 915 in dependence upon the index value held in storage 920. The at least one execution unit 905 may also perform operations, such as a reduction operation, with the data from the received data packets and the data stored in memory 915 prior to storing the result of the operation in memory 915.
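The receive-side handling may likewise be sketched in a non-limiting illustrative form. The slice size and the addition used as the reduction operation are hypothetical; what is representative is that the received data is reduced with data already in memory and the result is written back at an address that depends upon the index value.

```python
SLICE_SIZE = 2  # illustrative size of each data portion ("slice")

def handle_received(index: int, packet: dict, data_memory: list) -> None:
    # The location written to depends upon the index value.
    base = index * SLICE_SIZE
    for offset, value in enumerate(packet["payload"]):
        # Reduction operation: combine received data with the data already
        # stored in memory, then store the result back in memory.
        data_memory[base + offset] += value

memory = [10, 20, 30, 40]
handle_received(1, {"payload": [1, 2]}, memory)
print(memory)  # [10, 20, 31, 42]
```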
[0114] Therefore, the index value held in storage 920 is used by the at least one execution unit 905 to at least one of: select data at certain addresses in memory 915 for sending, select data for performing operations (e.g. reduction operations) on received data, and store results derived from received data at certain addresses in memory 915 that depend on the index value.
[0115] There are different ways in which the index value may be used by the execution unit 905 to select an address in memory 915 for storing the data.
[0116] In some embodiments, the different data portions are arranged contiguously in memory 915. The execution unit 905 is configured to calculate the address in memory 915 at which data is to be read from or written to in dependence upon the index value. The execution unit 905 calculates the address by performing operations defined in the instructions in instruction memory 910. The operations are arithmetic operations.
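The arithmetic address calculation of this first scheme may be sketched as follows; the base address and portion size are purely illustrative values, since the data portions are contiguous and only the form of the calculation matters.

```python
BASE_ADDRESS = 0x1000   # hypothetical start of the contiguous portions
PORTION_BYTES = 256     # hypothetical size of each portion

def portion_address(index: int) -> int:
    # The address to read from or write to is computed arithmetically
    # from the index value: base + index * portion size.
    return BASE_ADDRESS + index * PORTION_BYTES

print(hex(portion_address(3)))  # 0x1300
```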
[0117] In some embodiments, the instruction memory 910 stores a plurality of portions of code, each configured to control data transfer differently. For example, one portion of code may cause data at a certain memory location in memory 915 to be transmitted to another processor 2, whilst another portion of code may cause the execution unit 905 to cause data at a different memory location in memory 915 to be transmitted to another processor 2. Another portion of code may cause received data to be stored at a certain location in memory 915 or cause a certain operation to be carried out with respect to the received data. The execution unit 905 executes code from the instruction memory 910 and, at a point in the execution sequence at which data is to be read from or written to the memory 915, the execution unit 905 performs a branch operation to select a portion of code for performing read or write operations for the data. The portion of code is selected in dependence upon the index value.
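This second, branch-based scheme may be sketched as follows. The three code portions and the dispatch table are hypothetical stand-ins for the compiled portions of code; what is representative is the branch operation selecting one of several transfer routines in dependence upon the index value.

```python
# Hypothetical portions of code, each controlling data transfer differently.
def send_portion_a(memory):
    return ("send", memory[0:2])   # transmit data from one memory location

def send_portion_b(memory):
    return ("send", memory[2:4])   # transmit data from a different location

def store_received(memory):
    return ("store", memory[4:6])  # handle received data instead

CODE_PORTIONS = [send_portion_a, send_portion_b, store_received]

def transfer(index: int, memory: list):
    # Branch operation: select the portion of code to execute for the
    # read/write in dependence upon the index value.
    return CODE_PORTIONS[index % len(CODE_PORTIONS)](memory)

print(transfer(1, list(range(6))))  # ('send', [2, 3])
```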
[0118] According to an exemplary application of the techniques disclosed herein, each processor 2 is provided with a different set of training data for producing delta weights so as to train a machine learning model. In this case, each tile 4 is provided with a different set of training data for producing one or more of the delta weights. Together, all of the tiles 4 of each processor 2 produce a full set of delta weights, which are averaged with the delta weights produced on other processors 2.
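The averaging step of this exemplary application may be sketched as follows, with hypothetical delta-weight values; the exchange mechanism by which each processor obtains the other processors' delta weights is the index-dependent scheme described above.

```python
def average_deltas(per_processor_deltas):
    # Each inner list is one processor's full set of delta weights;
    # average element-wise across the processors.
    n = len(per_processor_deltas)
    return [sum(column) / n for column in zip(*per_processor_deltas)]

# Two processors, each having produced a full set of two delta weights
# from its own distinct set of training data (values purely illustrative).
deltas = [[1.0, 2.0], [3.0, 4.0]]
print(average_deltas(deltas))  # [2.0, 3.0]
```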
[0119] In some embodiments, the execution unit 905 is configured to switch between processing different worker threads. The execution unit 905, in this case, is part of a barrel-threaded processor as described in U.S. Pat. Application No: 15/886315. In this case, each worker thread is programmed to perform the computations associated with a respective individual one of the nodes in a machine intelligence graph. In this case, at least some of the edges between nodes correspond to the exchanges of data between threads. The threads between which data is exchanged may be threads running on the same execution unit 905 or may be threads running on execution units of different tiles 4. Some may involve exchanges between different tiles of the processor 2. The slices shown in memory 915 may each correspond to a delta value associated with a particular edge between nodes, with the delta values being calculated by the execution unit 905 during training. The memory 915 is also shown as including further data. This further data may include data for producing the delta values, such as the training data, the current values of the weights and any further data defining the machine learning model, such as activation functions, number of nodes in each layer, etc.
[0120] Reference is made to
[0121] The method 1100 is performed by a compiler, which may execute on any suitable computing apparatus comprising at least one execution unit and at least one memory holding computer code for execution by the at least one execution unit.
[0122] At step S1110, the compiler compiles a single set of executable instructions for providing to each processor 2. The single set of executable instructions is in the form of an executable image. The set of executable instructions may comprise a plurality of subsets of instructions, with each subset being for execution by a different tile 4 of the processor 2.
[0123] At step S1120, the compiler determines, for each processor 2 in the computer 700, an index value associated with the processor 2. Each index value that is determined uniquely identifies a different processor 2 within the computer 700.
[0124] At step S1130, the compiler generates, for each processor 2, a local program comprising the single set of instructions and the index value associated with the processor 2. The compiler does so by, for each processor 2, taking the compiled set of instructions produced in S1110 and patching this set of instructions with the index value for the processor determined in S1120.
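Steps S1110 to S1130 may be sketched, in a non-limiting illustrative form, as follows. The byte-string image and the dictionary representation of a local program are hypothetical; what is representative is that a single compiled image is shared by all processors, with only the patched-in index value differing between the local programs.

```python
def compile_single_image() -> bytes:
    # S1110: compile one set of executable instructions for all processors.
    # (Stand-in bytes; a real compiler would emit an executable image.)
    return b"\x90\x90\x90"

def build_local_programs(num_processors: int):
    image = compile_single_image()
    programs = []
    for index in range(num_processors):       # S1120: a unique index per processor
        local = {"image": image, "index": index}  # S1130: patch the index in
        programs.append(local)
    return programs

programs = build_local_programs(4)
# Every local program shares the identical instructions...
assert all(p["image"] == programs[0]["image"] for p in programs)
# ...and differs only in its uniquely identifying index value.
assert sorted(p["index"] for p in programs) == [0, 1, 2, 3]
```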
[0125] It will be appreciated that the above embodiments have been described by way of example only. While particular embodiments have been described, other applications and variants of the disclosed techniques may become apparent to a person skilled in the art once given the disclosure herein.