Methods and Apparatus for Sharing Nodes in a Network with Connections Based on 1 to K+1 Adjacency Used in an Execution Array Memory Array (XarMa) Processor
20200250131 · 2020-08-06
Inventors
Cpc classification
G06F9/3885
PHYSICS
G06F9/3836
PHYSICS
G06F15/17393
PHYSICS
G06F9/30145
PHYSICS
International classification
G06F15/80
PHYSICS
G06F15/173
PHYSICS
Abstract
An Execution Array Memory Array (XarMa) processor (pronounced sharma, which means happiness in Sanskrit) is described for signal processing and internet of things (IoT) applications. The XarMa processor uses a 1 to K+1 adjacency network in an array of execution units. The 1 to K+1 adjacency refers to connections separately made in rows and in columns of execution unit and local file nodes, where the number of R.sub.ows≥K>1 and of C.sub.olumns≥K>1 and K is an odd integer. Instead of a large central multi-ported register file, a distributed set of storage files local to each execution unit is used. The instruction set architecture uses instructions that specify forwarding of execution results to execution units associated with destination instructions. This execution array is scalable to support cost-effective, low-power, high-performance application-specific processing focused on target product requirements.
Claims
1. A method of executing a sequence of instructions in an execution unit (EU) node in an array of EUnits, the method comprising: receiving a first instruction and a destination instruction having a dependency on the first instruction, wherein the first instruction identifies the destination instruction in a sequence of instructions from a program and specifies that a result generated by execution of the first instruction by a first EU node is to be forwarded to a destination EU node that is to execute the destination instruction; executing the first instruction on the first EU.sub.r,c node to generate the result for delivery through an EU network to the destination EU node associated with the identified destination instruction, wherein according to a Row by Column (R×C) matrix, an R×C array of EU.sub.row(r),column(c) nodes are interconnected by the EU network, the EU network comprising (K+1) by (K+1) array of EU.sub.r,c nodes, a first stage (K+1)×(K+1) array of R.sub.r,c nodes for a first direction of communication, a second stage (K+1)×(K+1) array of S.sub.r,c nodes for a second direction of communication, and in each stage having wiring configured according to a 1 to K+1 adjacency of connections between nodes which includes wrapping around data paths at the edges of the (K+1)×(K+1) arrays, K is an odd integer, K>1, R≥(K+1), C≥(K+1), r∈{0, 1, . . . , K}, and c∈{0, 1, . . . , K}, and wherein connections exist between each EU.sub.r,c node and R.sub.r,c nodes with the same row number in the first direction of communication, the first EU.sub.r,c node generates the result for a selectable first data path that connects to an R.sub.r,c+1 node and for a selectable second data path that connects to an R.sub.r,c−1 node for single step adjacency and for a selectable third data path that connects to an R.sub.r,c+2 node for two step adjacency, and for a selectable fourth data path that connects to an R.sub.r,c node in the same r,c position in the R×C matrix as the connecting EU.sub.r,c node, and wherein connections exist between each R.sub.r,c node and S.sub.r,c nodes with the same column number in the second direction of communication, wherein an R.sub.r,c node, associated with a selected path in the first direction of communication, produces the result for a selectable first data path that connects to an S.sub.r+1,c node and for a second data path that connects to an S.sub.r−1,c node for single step adjacency and for a third data path that connects to an S.sub.r+2,c node for two step adjacency, and for a fourth data path that connects to an S.sub.r,c node in the same r,c position in the R×C matrix as the connecting R.sub.r,c node, wherein an S.sub.r,c node, associated with the selected data path in the second direction of communication, produces the result on a destination data path that connects to the destination EU node to be received at the destination EU node; and executing the destination instruction in the destination EU node based on the received result to produce a destination result for use by the program.
2. The method of claim 1, wherein the R.sub.r,c nodes are 4×4 crossbars having four inputs and four outputs and the S.sub.r,c nodes are 4×1 multiplexers having four inputs and one output.
3. The method of claim 1 further comprising: wrapping around when R.sub.r,c+1=R.sub.r,K+1 in the first direction of communication to R.sub.r,0 for single step adjacency; wrapping around when R.sub.r,c−1=R.sub.r,−1 in the first direction of communication to R.sub.r,K for single step adjacency; and wrapping around when R.sub.r,c+2=R.sub.r,K+2 in the first direction of communication to R.sub.r,1 for two step adjacency.
4. The method of claim 1 further comprising: wrapping around when S.sub.r+1,c=S.sub.K+1,c in the second direction of communication to S.sub.0,c for single step adjacency; wrapping around when S.sub.r−1,c=S.sub.−1,c in the second direction of communication to S.sub.K,c for single step adjacency; and wrapping around when S.sub.r+2,c=S.sub.K+2,c in the second direction of communication to R.sub.1,c for two step adjacency.
5. The method of claim 1 further comprising: executing a second instruction on a second EU.sub.r,c node to generate a second result for a selectable fifth data path that connects to the R.sub.r,c node, associated with the selected path in the first direction of communication; producing the second result on the R.sub.r,c node, associated with the selected path in the first direction of communication, for a selectable fifth data path that connects to the S.sub.r,c node, in the same r,c position in the R×C matrix as the connecting R.sub.r,c node; and producing the second result, by the S.sub.r,c node associated with the selected data path in the second direction of communication, on a second destination data path that connects to the destination EU node to be received at the destination EU node.
6. The method of claim 5, wherein the R.sub.r,c nodes are 4×5 crossbars having four inputs and five outputs and the S.sub.r,c nodes are 5×2 multiplexers having five inputs and two outputs.
7. The method of claim 1 further comprising: setting a program counter mode control to master mode; and controlling the instruction sequence from the program for operation of the K+1 rows of the R×C array of EU.sub.row(r),column(c) nodes by using the program counter for row 0 and making program counters for rows 1 to row K to be in a not used state.
8. The method of claim 1 further comprising: setting a program counter mode control to not master mode; and controlling the instruction sequence from the program for each row of the R×C array of EU.sub.row(r),column(c) nodes using K+1 program counters for separate control of rows 0 to row K to be in an active state.
9. A network organized according to a 1 by Column (1×C) matrix, the network comprising: a 1×C array of EU.sub.1,column(c) nodes interconnected by an EU network, the EU network comprising 1 by (K+1) array of EU.sub.1,c nodes connected to a 1×(K+1) array of R.sub.1,c nodes for a first direction of communication, and having wiring configured according to a 1 to K+1 adjacency of connections between the EU.sub.1,c nodes and the R.sub.1,c nodes which includes wrapping around data paths at the edges of the 1×(K+1) arrays, K is an odd integer, K>1, C≥(K+1), and c∈{0, 1, . . . , K} and wherein connections exist between each EU.sub.1,c node and R.sub.1,c nodes in the first direction of communication, a first EU.sub.1,c node is connected by a first data path to an R.sub.1,c+1 node and by a second data path to an R.sub.1,c−1 node for single step adjacency and by a third data path to an R.sub.1,c+2 node for two step adjacency, and by a fourth data path to an R.sub.1,c node in the same 1,c position in the 1×C matrix as the first EU.sub.1,c node, wherein the R.sub.1,c−1 node is connected by a first outputA path to its associated EU.sub.1,c−1 node, the R.sub.1,c node is connected by a second outputA path to its associated EU.sub.1,c node, the R.sub.1,c+1 node is connected by a third outputA path to its associated EU.sub.1,c+1 node, and the R.sub.1,c+2 node is connected by a fourth outputA path to its associated EU.sub.1,c+2 node.
10. The network of claim 9, wherein the R.sub.r,c nodes comprise: 4×1 multiplexers, in the R.sub.r,c nodes, having four inputs and one output.
11. The network of claim 9, wherein the R.sub.1,c−1 node is connected by a first outputB path to its associated EU.sub.1,c−1 node, the R.sub.1,c node is connected by a second outputB path to its associated EU.sub.1,c node, the R.sub.1,c+1 node is connected by a third outputB path to its associated EU.sub.1,c+1 node, and the R.sub.1,c+2 node is connected by a fourth outputB path to its associated EU.sub.1,c+2 node.
12. The network of claim 11, wherein the R.sub.r,c nodes comprise: 4×2 crossbars, in the R.sub.r,c nodes, having four inputs and two outputs.
13. The network of claim 9 further comprising: the first data path is wrapped around when R.sub.1,c+1=R.sub.1,K+1 in the first direction of communication to R.sub.r,0 for single step adjacency; the second data path is wrapped around when R.sub.1,c−1=R.sub.1,−1 in the first direction of communication to R.sub.r,K for single step adjacency; and the third data path is wrapped around when R.sub.1,c+2=R.sub.1,K+2 in the first direction of communication to R.sub.r,1 for two step adjacency.
14. The network of claim 9 further comprising: connecting two 1×C arrays of EU.sub.1,column(c) nodes by a second stage (K+1)×(K+1) array of S.sub.r,c nodes for a second direction of communication, wherein each R.sub.r,c node is connected by a selectable first data path to an S.sub.r+1,c node and by a second data path to an S.sub.r−1,c node for single step adjacency and by a third data path to an S.sub.r+2,c node for two step adjacency, and by a fourth data path to an S.sub.r,c node in the same r,c position in the R×C matrix as the connecting R.sub.r,c node, wherein each S.sub.r,c node is connected by a destination data path to a corresponding destination EU.sub.r,c node.
15. The network of claim 14, wherein the R.sub.r,c nodes and S.sub.r,c nodes comprise: 4×2 multiplexers in the R.sub.r,c nodes having four inputs and one output; and 2×1 multiplexers in the S.sub.r,c nodes having two inputs and one output.
16. A system apparatus comprising: a load unit having a source of data values external to an array of execution unit (EU) nodes that are interconnected by an EU network; a first multiplexing element in the load unit to connect externally received data values to an EU located in the EU network for processing by one or more program instructions; a store unit having a source of data values internal to the array of EU nodes; a second multiplexing element in the store unit to connect to the EU network to receive data values from an EU source and connect the internally received data values to a destination node located external to the EU network for processing by the destination node, wherein the load unit is combined with the store unit as a single node of the array of EU nodes.
17. The system apparatus of claim 16, wherein the source of data values comprises: a memory unit having a read port providing the source of data values.
18. The system apparatus of claim 16 wherein the destination node comprises: a memory unit having a write port to receive the data values from the EU source and store the received data values in the memory.
19. The system apparatus of claim 16 wherein the EU network comprises: a Row×Column (R×C) array of EU.sub.row(r),column(c) nodes interconnected by the EU network, the EU network comprising (K+1) by (K+1) array of EU.sub.r,c nodes, a first stage of (K+1)×(K+1) array of R.sub.r,c nodes for a first direction of communication, a second stage of (K+1)×(K+1) array of S.sub.r,c nodes for a second direction of communication, and in each stage having wiring configured according to a 1 to K+1 adjacency of connections between nodes which includes wrapping around data paths at the edges of the (K+1)×(K+1) arrays, K is an odd integer, K>1, R≥(K+1), C≥(K+1), r∈{0, 1, . . . , K}, and c∈{0, 1, . . . , K}, and wherein connections exist between each EU.sub.r,c node and R.sub.r,c nodes with the same row number in the first direction of communication, and wherein connections exist between each R.sub.r,c node and S.sub.r,c nodes with the same column number in the second direction of communication, and wherein each S.sub.r,c node is connected to corresponding EU.sub.r,c nodes.
20. The apparatus of claim 19, wherein the load unit connects to the EU network as an EU.sub.r,c node connects in the first direction of communication to an R.sub.r,c node to send a load supplied value to the EU network and the store unit connects to the EU network as an EU.sub.r,c node to receive an EU network provided value from an S.sub.r,c node output.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0013]
[0014]
[0015]
[0016]
[0017]
DETAILED DESCRIPTION
[0018] While the present invention is disclosed in a presently preferred context, it will be recognized that the teachings of the present invention may be variously embodied consistent with the disclosure and claims. It will be recognized that the present teachings may be adapted to other present and future architectures to which they may be beneficial.
[0019] In order to amortize development costs for such devices across multiple products targeted for different applications, a scalable architecture with multiple design points using the same instruction set architecture is proposed. To address low power, high performance, and scalability, a new architecture is presented that reduces storage of temporary variables, thereby lowering power usage, provides efficient processor and shared memory transfers, and is scalable.
[0020]
[0021] To illustrate an exemplary data path, the node Nb11 102 is designed to be an execution unit, so is referenced here in this description as Xb11. The execution unit Xb11 102 generates a result upon executing an instruction which is programmatically directed to use one or more selectable data buses 135-138, such as the data bus 135. The data buses 135-138 comprise data buses 135 and 137 having connections between the Xb11 node 102 and the R.sub.1,0 node 130 and the R.sub.1,2 node 132 with the same row number in the first direction of communication of single step adjacency between next door adjacent neighbors. The first direction of communication of single step adjacency for the Xb11 node is communication in the east and west horizontal direction. The single step adjacency for Xb11 is to R nodes having an integer column number of the starting node, in this case column 1 for the Xb11 node 102, increased by a value of 1 for single step adjacency in the east direction to R.sub.1,2 node 132 and decreased by the value 1 for single step adjacency in the west direction to R.sub.1,0 node 130. Wraparound is also in effect, in this case, after the increase of a starting column number 3 by 1 for a value of K+1=4, the starting column number 3 wraps around to column 0 and after the decrease of a starting column number 0 by 1 for a value of −1, the starting column number 0 wraps around to column 3.
[0022] The data bus 136 has a connection between Xb11 node and R.sub.1,1 node 131 having the same position in the R×C matrix. The data bus 138 has a connection between Xb11 node 102 and the R.sub.1,3 node 133 representing one additional connection in the first direction of communication of two step adjacency. The one additional connection in the first direction of communication of two step adjacency for the Xb11 node 102 may be communication in either the east direction or communication in the west horizontal direction. The east direction of communication of two step adjacency for the Xb11 node 102 is to an R node having an integer column number of the starting node, in this case column 1 for the Xb11 node 102, increased by a value of 2 in the east direction to R.sub.1,3 node 133. With wrap around, an increased column number of 4 wraps around to column 0 and an increased column number of 5 wraps around to column 1. The west direction of communication of two step adjacency for the Xb11 node 102 is to an R node having an integer column number of 1 for the starting node Xb11 node 102, is decreased by a value of 2 in the west direction to a −1 value and is directed to R.sub.1,3 node 133 due to wraparound. With wrap around, a decreased column number of −2 wraps around to column 2.
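The column arithmetic described above for the 4-column case (K=3) can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions and is not part of the disclosed apparatus: the function name adjacent_columns and the offset labels are hypothetical, and only the K=3 offsets given in the text (one step west, same column, one step east, two steps east) are modeled.

```python
N_COLUMNS = 4  # K + 1 with K = 3, as in the example above

# Column offsets named for illustration only (hypothetical labels).
OFFSETS = {"west": -1, "same": 0, "east": 1, "two_step": 2}


def adjacent_columns(c, n=N_COLUMNS):
    """Return the R-node column reached on each data path from column c.

    Python's % operator returns a non-negative result for a positive
    modulus, so -1 wraps to n - 1 and n wraps to 0, as the text describes.
    """
    return {name: (c + offset) % n for name, offset in OFFSETS.items()}


if __name__ == "__main__":
    for column in range(N_COLUMNS):
        print(column, adjacent_columns(column))
```

For the Xb11 node in column 1, the sketch yields columns 0, 1, 2, and 3, matching the R.sub.1,0, R.sub.1,1, R.sub.1,2, and R.sub.1,3 connections of data buses 135-138. A two step move west from column 1 gives (1−2) mod 4 = 3, which is the same R.sub.1,3 node reached by a two step move east.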
[0023] The data travels across the data bus 135 and reaches node R10 130 which is configured with four 4-to-1 multiplexers, such as shown R.sub.r,c 4×4 crossbar node 177. Each of the four 4-to-1 multiplexers receives control signals that cause each multiplexer to select none or one of that multiplexer's four input signals to pass to its associated output of the R10 130 4×4 crossbar. There are three types of R.sub.r,c node to S.sub.r,c node connection paths. The first type of connection path is for data buses 160 and 168 having connections between the R.sub.1,0 node 130 and the S.sub.0,0 node 140 and the S.sub.2,0 node 148 with the same column number in a vertical second direction of communication of single step adjacency between next door adjacent neighbors. The second type of connection path is for data bus 164 which has a connection between R.sub.1,0 node 130 and S.sub.1,0 node 144 having the same position in the R×C matrix. The third type of connection path is for data bus 172 which has a connection between the R.sub.1,0 node 130 and the S.sub.3,0 node 152 representing one additional connection in the second direction of communication of two step adjacency. The first direction of communication and the second direction of communication can be reversed, with the first direction of communication being in a vertical North/South direction and the second direction of communication being in a horizontal East/West direction.
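The two stage path just described, from an execution unit through an R node and an S node to a destination execution unit, can be sketched in Python as follows. This is a minimal model under stated assumptions, not the disclosed implementation: the function name route is hypothetical, the array is fixed at 4 by 4 (K=3), the first stage is taken to move along the row and the second stage along the column, and each S.sub.r,c node is assumed to deliver to the EU node at the same r,c position, as recited in the claims.

```python
N = 4                    # K + 1 with K = 3
OFFSETS = (-1, 0, 1, 2)  # one step back, same position, one step forward, two steps forward


def route(src, dst, n=N):
    """Return (column_offset, row_offset) for a path EU -> R -> S -> EU.

    The first stage moves within the source row to R(r, c + column_offset);
    the second stage moves within that column to S(r + row_offset, c'),
    which delivers to the EU at the same position. Returns None if no
    combination of offsets reaches dst.
    """
    src_r, src_c = src
    for column_offset in OFFSETS:
        r_col = (src_c + column_offset) % n      # R node column, with wraparound
        for row_offset in OFFSETS:
            s_row = (src_r + row_offset) % n     # S node row, with wraparound
            if (s_row, r_col) == dst:
                return column_offset, row_offset
    return None


if __name__ == "__main__":
    # Xb11 at (1, 1) sending to the EU at (0, 0): via R(1,0), then S(0,0).
    print(route((1, 1), (0, 0)))
```

In this 4 by 4 model the four offsets taken modulo 4 cover every column and every row, so every source and destination pair, including a node sending to itself, has a route in a single pass through the two stages.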
[0024]
[0025] In
[0026]
[0027] To illustrate an exemplary data path, the execution unit Mq11 306 generates a result upon executing an instruction which is programmatically directed to use one or more data buses 356-359, such as the data bus 356. The data buses 356-359 comprise data buses 356 and 358 having connections between the Mq11 306 and the R.sub.1,0 node 335 and the R.sub.1,2 node 337 with the same row number in the first direction of communication of single step adjacency between next door adjacent neighbors. The first direction of communication of single step adjacency for the Mq11 306 node is communication in the east and west horizontal direction. The single step adjacency for Mq11 306 is to R nodes having an integer column number of the starting node, in this case column 1 for the Mq11 306, increased by a value of 1 for single step adjacency in the east direction to R.sub.1,2 node 337 and decreased by the value 1 for single step adjacency in the west direction to R.sub.1,0 node 335. Wraparound is also in effect, in this case, after the increase of a starting column number 3 by 1 for a value of K+1=4, the starting column number 3 wraps around to column 0 and after the decrease of a starting column number 0 by 1 for a value of −1, the starting column number 0 wraps around to column 3.
[0028] The data bus 357 has a connection between Mq11 306 and R.sub.1,1 node 336 having the same position in the R×C matrix. The data bus 359 has a connection between Mq11 306 and the R.sub.1,3 node 338 representing one additional connection in the first direction of communication of two step adjacency. The one additional connection in the first direction of communication of two step adjacency for the Mq11 306 node is communication in either the east direction or communication in the west horizontal direction. The east direction of communication of two step adjacency for Mq11 306 is to an R node having an integer column number of the starting node, in this case column 1 for the Mq11 306, increased by a value of 2 in the east direction to R.sub.1,3 node 338. With wrap around, an increased column number of 4 wraps around to column 0 and an increased column number of 5 wraps around to column 1. The west direction of communication of two step adjacency for Mq11 306 is to an R node having an integer column number of 1 for the starting node Mq11 306, is decreased by a value of 2 in the west direction to a −1 value and is directed to R.sub.1,3 node 338 due to wraparound. With wrap around, a decreased column number of −2 wraps around to column 2.
[0029] The data travels across the data bus 356 and reaches node R10 335 which is configured with five 4-to-1 multiplexers, such as shown R.sub.r,c 4×5 crossbar node 391. Each of the five 4-to-1 multiplexers receives control signals that cause each multiplexer to select none or one of that multiplexer's four input signals to pass to its associated output of the R10 335 4×5 crossbar. There are three types of R.sub.r,c node to S.sub.r,c node connection paths. The first type of connection path is for data buses 360 and 368 having connections between the R.sub.1,0 node 335 and the S.sub.0,0 node 340 and the S.sub.2,0 node 348 with the same column number in a second vertical direction of communication of single step adjacency between next door adjacent neighbors. The second type of connection path is for data buses 364 and 376 which have a connection between R.sub.1,0 node 335 and S.sub.1,0 node 344 having the same position in the R×C matrix. The third type of connection path is for data bus 372 which has a connection between the R.sub.1,0 node 335 and the S.sub.3,0 node 352 representing one additional connection in the second direction of communication of two step adjacency. The first direction of communication and the second direction of communication can be reversed, with the first direction of communication being in a vertical North/South direction and the second direction of communication being in a horizontal East/West direction.
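The crossbar built from 4-to-1 multiplexers, as described above, can be modeled behaviorally with the following Python sketch. It is an illustration only, under stated assumptions: the names mux4 and crossbar are hypothetical, a select value of None stands for the "select none" case, and control signal encoding and timing are not modeled.

```python
from typing import List, Optional, Sequence


def mux4(inputs: Sequence, select: Optional[int]):
    """4-to-1 multiplexer: pass inputs[select], or None when nothing is selected."""
    if len(inputs) != 4:
        raise ValueError("a 4-to-1 multiplexer needs exactly four inputs")
    if select is None:
        return None
    if not 0 <= select < 4:
        raise ValueError("select must be None or in the range 0..3")
    return inputs[select]


def crossbar(inputs: Sequence, selects: Sequence[Optional[int]]) -> List:
    """Model a 4xM crossbar as M independent 4-to-1 multiplexers.

    All multiplexers see the same four inputs; each has its own select,
    so one input may be driven onto several outputs at once.
    """
    return [mux4(inputs, select) for select in selects]


if __name__ == "__main__":
    # Five selects model a 4x5 crossbar; four selects would model a 4x4 one.
    print(crossbar(["a", "b", "c", "d"], [0, None, 3, 3, 1]))
```

With four entries in the selects list the same function models the 4×4 crossbar, and with five entries it models the 4×5 crossbar, since the only difference between the two is the number of output multiplexers.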
[0030]
[0031]
[0032] To minimize the storage of temporary variables, an instruction is formatted to specify that a result is to be forwarded to one or more destination instructions in a chain of execution instructions instead of a destination register in a central register file. The forwarding of the result to the destination instruction is decoded by internal logic to be an operand input port register (OIPR) of an associated execution unit thereby eliminating the storage of the temporary result variable in a central register file. For the 1×4 XarMa processor 502 of