MULTIFUNCTIONAL DATA REORGANIZATION NETWORK
20220209975 · 2022-06-30
Inventors
- Tian XIA (Xi'an, CN)
- Lingfeng CHEN (Xi'an, CN)
- Wenzhe ZHAO (Xi'an, CN)
- Pengchen ZONG (Xi'an, CN)
- Pengju REN (Xi'an, CN)
- Nanning ZHENG (Xi'an, CN)
Cpc classification
International classification
Abstract
A multifunctional data reorganization network includes a binary switching unit and a recursive shuffle network (RSN), wherein both the binary switching unit and the recursive shuffle network can enable bidirectional transmission of data, and the data reorganization network completes data reorganization by controlling the transmission direction of a signal in the network. The network may serve as a data transfer path between a storage unit and a computation unit to perform multiple data reorganization functions while transferring data, thereby enabling flexible data structure adjustment of non-regular data, and thus improving data transfer efficiency and computational efficiency of non-regular computation.
Claims
1. A multifunctional data reorganization network, comprising a binary switching unit and a recursive shuffle network (RSN), wherein both the binary switching unit and the RSN enable bidirectional transmission of data, and the data reorganization network completes data reorganization by controlling the transmission direction of a signal in the network.
2. The data reorganization network according to claim 1, wherein the binary switching unit comprises a basic switching unit and a reduction switching unit.
3. The data reorganization network according to claim 1, wherein an input signal of the binary switching unit comprises a tag and a data payload, and the data payload is the data content actually needed to be transmitted and the tag is the corresponding routing information.
4. The data reorganization network according to claim 1, wherein the binary switching unit has two modes of operation: a self-routing mode and routing fulfilled by following routing information input externally, wherein the self-routing mode is that the routing method is determined according to the value of the tag or the data payload of the input signal.
5. The data reorganization network according to claim 1, the RSN is obtained by successively superimposing a smaller scale bidirectional perfect shuffle network.
6. The data reorganization network according to claim 1, a sort recorder multicast (SOM) transport network is recursively built based on the RSN, all the RSNs in each recursive level being treated as a whole functional Block.
7. The data reorganization network according to claim 1, wherein on the basis of the SOM transport network, for each Block, a SOM reduction network is constructed by replacing the first binary switching unit of the first level of each RSN network comprised therein with a reduction switching unit.
8. The data reorganization network according to claim 6, wherein a SOM transport network or a SOM reduction network respectively provides an independent configuration signal for each Block.
9. The data reorganization network according to claim 6, wherein the SOM transport network or the SOM reduction network can further configure the data stream direction between Blocks by configuring a selector at the input port of each Block.
10. The data reorganization network according to claim 7, wherein the SOM reduction network is formed by embedding a complete k-input adder tree in the SOM network, wherein the adder tree is provided with k−1 adders, and k is a positive integer power of 2.
11. The data reorganization network according to claim 7, wherein a SOM transport network or a SOM reduction network respectively provides an independent configuration signal for each Block.
12. The data reorganization network according to claim 7, wherein the SOM transport network or the SOM reduction network can further configure the data stream direction between Blocks by configuring a selector at the input port of each Block.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0006]
[0007]
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0037] The present disclosure is described in further detail below with reference to
[0038] In one embodiment, a multifunctional data reorganization network is disclosed, the network includes a binary switching unit and a recursive shuffle network (RSN), wherein both the binary switching unit and recursive shuffle network can enable bidirectional transmission of data, and the data reorganization network completes data reorganization by controlling the transmission direction of a signal in the network.
[0039] In terms of this embodiment, referring to
[0040] In another embodiment, the binary switching unit includes a basic switching unit and a reduction switching unit.
[0041] In terms of this embodiment, the SOM network is constructed based on binary non-blocking switching network technology, whose basic functional modules are multifunctional binary switching units. Each switching unit has two input ports and two output ports. The switching unit may route the two input signals onto the two output ports in different ways. As shown in
[0042] In another embodiment, the input signal of the binary switching unit includes a tag and a data payload, wherein the data payload is the data content actually needed to be transferred, and the tag is the corresponding routing information.
[0043] In another embodiment, the binary switching unit has two modes of operation: a self-routing mode and routing fulfilled by following routing information input externally, wherein the self-routing mode is that the routing method is determined according to the value of the tag or the data payload of the input signal.
[0044] In terms of this embodiment,
[0045] The specific meaning of the individual signals of the binary switching unit is listed in Table 1. Both the input signal and the output signal contain two parts: the tag and the data payload. The bit width of both can be adjusted according to the actual application scenario. The bit width of the tag may generally be set to log.sub.2 k+1, k being the number of whole network input signals. The bit width of the data payload may be adjusted depending on the type of data being transferred, and typical data types include a 8-bit fixed-point number, a 32-bit fixed-point number, a 32-bit single precision floating point number, and the like. When the CW_en signal is 1, the binary switching unit selects the routing method using the signal of the CW_in input. When the CW_en signal is 0, the binary switching unit operates in a self-routing mode, performs routing calculations according to the configuration of the Mod signal, and selects a routing method according to the calculation results. For example, when the Mod signal is set to 010, the binary switching unit compares the tag values of the two input signals and sends the input signal with the larger tag value to the first output port and the input signal with the smaller tag value to the second output port. Thus, if the tag value of the first input port is greater than the tag value of the second input port, the binary switching unit selects a pass-through route; conversely, a “cross-over” route is selected.
TABLE-US-00001 TABLE 1 Bit width Signal Direction (bit) Function Xi/Yi/Mi/Ni Input Tag Input signals in two directions, including input Payload tags and input data payloads Xo/Yo/Mo/No Output Tag Output signals in two directions, including Payload output tags and output data payloads Direct Input 1 Indicating transmission direction: 0: from a left input port (Xi, Yi) to a right output port (Xo, Yo) 1: from a right input port (Mi, Ni) to a left output port (Mo, No) CW_en Input 1 Indicating whether an external routing signal is used: 0: No external routing signal is used, the switching unit routes according to the routing method configured by the Mod signal 1: The external routing signal is used, the switching unit ignores the configuration of the Mod signal, and directly uses a CW_in signal for routing CW_in Input 2 Indicating the route mode: 01: pass-through; 10: cross-over; 00: up-casting; 11: down-casting CW_out Output 2 Output the routing mode: 01: pass-through; 10: cross-over; 00: up-casting; 11: down-casting Mod Input 3 Self-routing mode: 10: Compare the tag values of the input signals and send the larger tag and its corresponding data to the first output port and the smaller to the second output port 001: Compare the tag values of the input signals and send the larger tag and its corresponding data to the second output port and the smaller to the first output port 11: Compare the tag values of the input signals and send the larger tag and its corresponding data simultaneously to the first and second output ports 000: Send an input signal with the Tag highest bit being 1 and Tag [s_id] being 0 to the first output port, and an input signal with the Tag highest bit being 1 and Tag [s_id] being 1 to the second output port, wherein s_id is a number inherent to each switching unit to indicate at which level of the overall network it is. 110: Compare the data payload values of the input signals and send the larger data payload and its corresponding tag to the first output port and the smaller to the second output port 101: Compare the data payload values of the input signals and send the larger data payload and its corresponding tag to the second output port and the smaller to the first output port Reduce Input 1 Indicating whether reduced routing is used: 0: Reduced routing is not used; 1: Reduced routing is used If reduced routing is used, the CW_in signal needs to be configured simultaneously to complete up-reduction and down-reduction: {Reduce, CW_in} = 001: up-reduction, the sum of the data payloads of the two input signals is sent to the first output port {Reduce, CW_in} = 010: down-reduction, the sum of the data payload of the two input signals is sent to the second output port
[0046]
[0047] In another embodiment, the recursive shuffle network RSN is obtained by successively superimposing a smaller-scale bidirectional perfect shuffle network.
[0048] In terms of this embodiment, a plurality of binary switching units constitute a Recursive Shuffle Network (RSN) by way of a hierarchical recursive topology. The topology of the RSN is recursive, with its basic topology in the form of “Perfect Shuffle”. As shown in
[0049] An RSN of k=2n scale may be constructed by cascading a bidirectional perfect shuffle network of k=2n scale and two parallel RSNs of k=n scale as shown in
[0050] In another embodiment, a SOM transport network is built up recursively based on RSNs, with all RSN networks in each recursion level being treated as a whole functional Block.
[0051] With this embodiment, the SOM transport network may be built up recursively based on RSNs. As shown in
[0052] In this solution, all the RSN networks in each recursive level are treated as a whole functional block, called Block. As shown in
[0053] The internal structure of each Block is shown in
[0054] It is further noted that the SOM transport network does not involve reduction function, so the binary switching unit components used in the base network are all basic switching units.
[0055] In another embodiment, the SOM transport network or the SOM reduction network respectively provides independent configuration signals for each Block.
[0056] For this embodiment, a SOM transport network of scale k=2.sup.r contains r Blocks (Block 0, Block 1, Block r−1). For any Block-i, it contains 2.sup.r-i−1 parallel 2.sup.i+1-scale RSN networks which are independent of each other, and each RSN network contains i+1 stages of switching units (S.sub.0, S.sub.1, . . . , S.sub.i), each stage of switching units containing 2.sup.i switching units for a total of (i+1) 2.sup.i switching units. The SOM network provides independent configuration signals for each Block separately. For the Block-i, the required control signals are shown in Table 2.
TABLE-US-00002 TABLE 2 Signal Direction Bit width (bit) Configuration Method Direct Input 1 Direct signal is shared by all the RSN networks within the Block Mod Input If i > 0: 2.sup.r − i − 1 * Two sets of Mod signals are configured 2 * 3 = 2.sup.r − i * 3 for each RSN network in the Block: If i = 0: 2.sup.r − 1 * 3 The first set of signals is used for configuring a first stage of switching units (S0) of the RSN network, the Mod signals being shared by all the switching units in the stage; and the second set of signals is used to configure the remaining stage switching units (S1-Si) of the RSN network, and all the switching units in these stages share the Mod signals. Note that Block 0 contains only one stage of switching units, so only one set of Mod signals is needed CW_en Input If i > 0: 2.sup.r − i − 1 * Two sets of CW_en signals are 2 * 1 = 2.sup.r − i configured for each RSN network in the If i = 0: 2.sup.r − 1 Block: The first set of signals is used for configuring first stage switching units (S0) of the RSN network, the CW_en signals being shared by all the switching units in the stage; The second set of signals is used to configure the switching units (S1-Si) of the remaining stages of the RSN network, and all the switching units in these stages share the CW_en signals. Note that the Block 0 contains only one stage of switching units, so only one set of CW_en signals is needed CW_in Input (i + I) 2.sup.i * 2 = Configure a CW_in signal for each (i + I) 2.sup.i + 1 switching unit in the Block CW_out Output (i + I) 2.sup.i * 2 = Output the routing signal for each (i + I) 2.sup.i + 1 switching unit in the Block
[0057] In another embodiment, the SOM transport network or the SOM reduction network further configures the data stream direction between Blocks by configuring a selector at the input port of each Block.
[0058] With this embodiment, in addition to the configuration signal of each Block, the OM network also needs to configure the data stream direction between Blocks. As can be seen from
TABLE-US-00003 TABLE 3 Bit width Signal Direction (bit) Quantity Function Flow_Cfg_Bin Input 4 r-2 Configure input signal selectors for Block 1-Block r-2: 0001: Select an input signal for the SOM network 0010: Select an output signal of a Local Buffer 0100: Select an output signal of a forward Block 1000: Select an output signal of a backward Block Flow_Cfg_Bin_Edge Input 3 2 Select input signal selectors for Block 0 and Block r-1: 001: Select an input signal of a SOM network 010: Select an output signal of a Local Buffer 100: Select the output signal of the forward (backward) Block Flow_Cfg_Out Input r 2 Select the signal source of the input port of the Local Buffer and the output port of the SOM network, use one-hot encoding, selects the output signal of the Block corresponding to the position encoded as 1 by the selector.
[0059]
[0060] In another embodiment, a specific implementation of an 8-input SOM transport network is shown.
[0061]
[0062] By flexibly configuring the direction of transmission of each Block, as well as the direction of data stream between Blocks, a number of different data reorganization functions can be implemented on the input signal.
[0063] In another embodiment, numerical sorting is shown.
[0064]
[0065] In another embodiment, numerical resorting is shown.
[0066]
[0067] In another embodiment, numerical multicasting is shown.
[0068]
[0069] In another embodiment, compression and decompression of non-zero numerical values is displayed.
[0070]
[0071] In an embodiment of non-zero numeric compression, the upper bit of the tag of each non-zero input data is set to 0 and the lower bit is set to the position of that element in the vector. The tag upper bits of the remaining input data are set to 1. The SOM network sorts the tags in an ascending order, thereby rearranging the non-zero elements to the front of the output port, and the relative order of the respective non-zero elements to each other remains unchanged. The data stream goes through the SOM network in order of Block 0-1-2. The propagation direction of each Block is forward. All binary switching units of the SOM network are set in a self-routing mode of operation, the tags of the input signals are compared, and routing is based on the comparison results. The externally input routing signal CW_in is not used.
[0072] In embodiments of non-zero numerical value decompression, the tag of each non-zero numerical value is set to its position in the original vector.
[0073] In another embodiment, post-multicast packet resorting is shown.
[0074]
[0075] In another embodiment, synchronization of multiple SOM networks is shown.
[0076] By sharing the routing signals, multiple SOM networks may simultaneously complete the same data reorganization.
[0077] In such a synchronization relationship, the SOM network that provides the routing information is referred to as the Actor Network, and the SOM network that receives the routing information is referred to as the Tracker Network. Since the transmission of Actor Network's routing signals to the Tracker Network requires a one clock cycle delay, the Actor Network's data stream is always one clock cycle earlier than the Tracker Network's data stream. It is further noted that one Actor Network may correspond to multiple Tracker Networks simultaneously. An example of a matrix change is given in
[0078] In another embodiment, on the basis of the SOM transport network, for each Block, the SOM reduction network may be constructed by replacing the first switching unit of the first stage of each RSN network contained therein with a reduction switching unit.
[0079] With respect to this embodiment, based on the SOM transport network, a SOM reduction network may be further constructed. It is constructed by replacing, for each Block on the basis of the SOM transport network, the first switching unit of the first stage of each RSN network contained therein with a reduction switching unit. This is shown in
[0080] In another embodiment, the SOM reduction network is a SOM network with an input size of k having k−1 adders embedded therein, and the position of each adder is shown in
[0081] For this embodiment, as shown in
[0082] The SOM reduction network may enable packet reduction at different scales by adjusting the data stream order between Blocks. Specifically, for a randomly distributed set of input signals, the SOM network may aggregate elements that are divided into different groups and distributed at random locations by group, then perform reduction calculations for each group separately, and output the calculation results for each group at a specified location. In this “aggregation-reduction” computing mode, the “aggregation” function is achieved by reverse use of RSN and the “reduction” is achieved by forward use of adder trees embedded in the SOM network. By adjusting the RSN scale used for “aggregation” and “reduction”, group reduction at different scales can be achieved.
[0083] In another embodiment, a concrete implementation of an 8-input SOM reduction network is shown.
[0084]
[0085] In another embodiment, a group reduction of scale 4 is shown.
[0086]
[0087] In another embodiment, a group reduction of scale 2 is shown.
[0088]
[0089] In another embodiment, the solution is extremely scalable and is mainly embodied in the following points:
[0090] (1) The SOM network is recursive setup and its scale can be arbitrarily extended to 2.sup.r size.
[0091] (2) The configuration signals and routing methods of each binary switching unit in a SOM network may be extended to enable more complex routing methods. For example, the bit width of the tag can be expanded and compressed according to requirements. It is also possible to provide that each switching unit decides its routing method based on the value of a bit in the input signal label, thereby supporting each input data to save its complete routing path in the label bit without requiring each binary switching unit to route based on the comparator result.
[0092] (3) The data stream of the SOM network may be further flexibilized. By implementing a flexible data path configuration between stages of switch units similar to that between Blocks, the data stream may be allowed to pass through the Block instead of in a fixed direction, both “forward” and “backward”, but rather the order of passing through the various binary switch unit stages in the Block may be selected more flexibly. With this extension, more topology types can be implemented, thus supporting more complex data rearrangement function.
[0093] (4) Reduction function of the SOM network may be further extended. The reduction function of the reduction switching unit can be extended to other operations than addition, such as multiplication, shift, max/min, or AND, OR, NOT, etc. logical operations.
[0094] In another embodiment, the present solution is applied broadly, mainly in terms of the following:
[0095] (1) The SOM network can be used to transfer data between a Cache and registers in a general SIMD computing architecture. The data during transfer from the Cache to the registers, through the data reorganization function provided by the SOM network, can better adapt to the data format required by SIMD instructions, thus reducing the number of SIMD instructions required for computation. Moreover, the results of the computations are written back from the registers to the Cache, and through the data reassembly and flexible reduction functions provided by the SOM network, flexible post-processing is allowed, such as group summation or non-zero element compression of the results of the computations. The post-processing is done in data transfer, so the SIMD instruction number can be further reduced.
[0096] (2) The SOM network may be used for data pre- and post-processing modules of a dedicated Domain-Specific Architecture (DSA). Depending on the data stream needed for specialized computations, a SOM network may be specially adapted and adapted to remove certain unwanted functions and simplify its circuit complexity while better adapting to certain types of specialized data structures.
[0097] (3) The SOM network may be used for visit pre- and post-processing of bulk data storage media such as DDR. Due to the high scalability of the SOM, its scale can be scaled to comply with data transfer processing in large blocks. For example, the dynamic compression and decompression functions of the SOM may effectively reduce storage access bandwidth and improve access efficiency.
[0098] Although embodiments of the present disclosure have been described above with reference to the accompanying drawings, the disclosure is not limited to the specific embodiments and fields of application described above, which are merely illustrative, instructive, and not restrictive. Those of ordinary skill in the art, in light of the present description and without departing from the scope of the present disclosure as claimed, can take many forms, all of which fall within the scope of the present disclosure.