High performance merge sort with scalable parallelization and full-throughput reduction
11249720 · 2022-02-15
Assignee
Inventors
- Fazle Sadi (Pittsburgh, PA, US)
- Larry Pileggi (Pittsburgh, PA, US)
- Franz Franchetti (Pittsburgh, PA, US)
CPC classification
G06F7/49
PHYSICS
International classification
G06F7/499
PHYSICS
G06F9/30
PHYSICS
Abstract
Disclosed herein is a novel multi-way merge network, referred to herein as a Hybrid Comparison Look Ahead Merge (HCLAM), which incurs significantly less resource consumption as it is scaled to handle larger problems. In addition, a parallelization scheme is disclosed, referred to herein as Parallelization by Radix Pre-sorter (PRaP), which enables an increase in the streaming throughput of the merge network. Furthermore, a high-performance reduction scheme is disclosed to achieve full throughput.
Claims
1. A parallel multi-way merge network comprising: a radix-based pre-sorter for sorting records based on a radix of a key of each record; and a plurality of parallel multi-way merge networks, each multi-way merge network only operating on records with keys having a certain radix, wherein the radix-based pre-sorter sorts each input record for input to a multi-way merge network operating on records having key with a radix matching the input record; wherein each record comprises a value-key pair and the radix of the key of each record is a predetermined number of least significant bits of the key portion of each record; and wherein each multi-way merge network is a hybrid network comprising: an independent register FIFO-based merge tree; one or more multi-stage comparison look-ahead merge networks; and an asynchronous FIFO coupled to the output of each multi-stage comparison look-ahead merge network, the asynchronous FIFO using a clock of the multi-stage comparison look-ahead merge network for writes to the FIFO and a clock of the independent register FIFO-based merge tree for reads from the FIFO, each asynchronous FIFO being an input to the independent register FIFO-based merge tree.
2. The network of claim 1 further comprising: a plurality of pre-fetch buffers, each pre-fetch buffer associated with a multi-way merge network, each pre-fetch buffer accepting inputs for the associated multi-way merge network from the radix-based pre-sorter.
3. The network of claim 1 wherein the radix is a q number of least significant bits of the key of each record and further wherein the number of multi-way merge networks is 2.sup.q.
4. The network of claim 1 wherein each hybrid network further comprises missing key check logic that, when no input record with a key having a radix matching an operating radix of the hybrid network exists, injects an artificial record having a key with a value of 0, and delays one or more following records.
5. The parallel multi-way merge network of claim 1, wherein each of the plurality of multi-way merge networks further comprises a full-throughput reducer comprising: a multiplier; a series of log.sub.2 F adders; and a final adder; wherein F is a number of internal pipelines in each adder; wherein each adder in the series of adders: takes an input record and determines if a collision exists between the input record and a previous input record, a collision being indicated by a match of row indices of the input record and the previous record; if a collision is indicated, outputting a new record having as a key a row index and having as a value a sum of the values of the input record and the previous record; if no collision is indicated, outputting the previous record and holding the input record for comparison with a next record; and wherein the final adder: accumulates a sum with an incoming record via a feedback path, where the sum represents an accumulated result for all records having matching indices; and outputs the sum when no further matches are found between the input record and the sum.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(19) Many important kernels, such as Sparse Matrix dense Vector multiplication (SpMV) and Sparse General Matrix-Matrix multiplication (SpGEMM), require a large, parallel, and scalable multi-way merge network. Disclosed herein is a novel multi-way merge network, referred to herein as “Hybrid Comparison Look Ahead Merge” (HCLAM), which incurs significantly fewer resources as it is scaled to handle larger problems. Also disclosed is a parallelization scheme, referred to herein as “Parallelization by Radix Pre-sorter” (PRaP), which enables an increased streaming throughput of the merge network without a prohibitive demand for on-chip memory.
(20) Scalability
(21) From the perspective of hardware implementation and performance, scalability can be viewed from two distinct aspects, which are (1) problem scaling and (2) technology scaling.
(22) With respect to technology scaling, the design of any algorithm or hardware is expected to take full advantage of new technologies with extended capabilities. For example, 3D-stacked high-bandwidth memory (HBM) technology enables extreme off-chip bandwidth. As many sparse matrix kernels should ideally be memory-bandwidth bound, accelerators in this domain are expected to properly utilize the extreme bandwidth offered by this new technology. For single multi-way merge tree hardware, where the maximum output rate is one element per cycle, delivering enough throughput to saturate such high bandwidth is another major challenge. Additionally, maintaining balanced throughput across multiple DRAM channels poses further challenges.
(23) A multi-way merge solution that can practically address both of these scalability issues is not known in the prior art. The multi-way merge hardware design disclosed herein addresses both problem and technology scaling, while being practically feasible for custom hardware platforms, such as Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs).
(25) The basic building blocks of a multi-way merge binary tree are sorter cells (comparators) and FIFOs. Each sorter cell compares the keys from the two connected FIFOs and dequeues the record (key-value pair) with the smaller key. To improve clock frequency, the merge tree is further divided into pipelined stages by storing the output of the sorter cells in pipeline registers. The total number of stages can be calculated as S=log.sub.2 K+1, and the stage numbering starts from 0 at the leaf level. The most straightforward hardware implementation uses register-based FIFOs and a total of K−1 sorter cells. For any particular pipeline stage and at any given clock cycle, only one sorter cell remains active in steady state.
(26) A record is dequeued from a FIFO only if the connected sorter in the next stage is active and has the smaller key. Similarly, a record is queued to a FIFO only if it is not entirely full. This implementation is referred to as Independent Register FIFO based Merge (IRFM).
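As an aid to understanding, the IRFM behavior described above can be modeled in software. The sketch below is behavioral only (it ignores clocks, pipeline registers, and FIFO depths); the function name and the (key, value) record representation are illustrative, not from the patent:

```python
from collections import deque

def irfm_merge(sorted_lists):
    """Behavioral model of an IRFM-style binary merge tree.

    Each internal node acts as a sorter cell: it repeatedly compares the
    frontier (head) records of its two input FIFOs and dequeues the one
    with the smaller key. Records are (key, value) pairs; each input
    list is assumed pre-sorted by ascending key.
    """
    level = [deque(lst) for lst in sorted_lists]  # leaf-level FIFOs
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 == len(level):          # odd count: pass through
                next_level.append(level[i])
                continue
            a, b, out = level[i], level[i + 1], deque()
            while a and b:                   # sorter cell operation
                out.append(a.popleft() if a[0][0] <= b[0][0] else b.popleft())
            out.extend(a or b)               # drain whichever FIFO remains
            next_level.append(out)
        level = next_level
    return list(level[0])
```

In the hardware described above, all of these comparisons happen concurrently, one active sorter cell per pipeline stage per cycle; the software model only reproduces the merged output order.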
(27) For IRFM, D.sub.FIFO is defined as the FIFO depth, and L.sub.d is defined as the total number of clock cycles from the issuance of read requests to a pair of FIFOs until the output record is ready to be queued in the destination FIFO of the next pipeline stage. Additionally, the number of clock cycles to generate the read address raddr.sub.i of stage i is defined as L.sub.a. As raddr.sub.i is dependent on the output record of stage i+1, a minimum of L.sub.d+L.sub.a clock cycles is required for a record to travel from one pipeline stage to the next. Hence, the maximum throughput of this multi-way merge implementation, R.sup.MAX, is 1 element per L.sub.d+L.sub.a cycles, i.e., R.sup.MAX=1/(L.sub.d+L.sub.a), as shown in Eq. (1). Henceforth, a duration of L.sub.d+L.sub.a clock cycles will be referred to as a working cycle with period T.sub.W.
(28) To maintain R.sup.MAX in a steady state, D.sub.FIFO has to be a minimum of 2. The reason is that it takes T.sub.W time to dequeue a record from any FIFO and it takes another T.sub.W time to replenish that FIFO. As a result, for consecutive accesses to any particular FIFO without introducing a bubble, a minimum of two records have to be queued in all FIFOs during initialization. For purposes herein, it is assumed that each FIFO has one read port and one write port, independent from each other.
(29) In IRFM, read address generation is trivial, as a “not-full” status of any FIFO in stage i+1 can be independently propagated to stage i to generate raddr.sub.i. For many practical implementations using register FIFOs, L.sub.d+L.sub.a=1, and a throughput of 1 element per cycle is achievable using D.sub.FIFO=2. This, however, assumes that the data stream is not interrupted at the leaf level of the binary tree (i.e., the input stage of the pipeline). Due to various technical reasons, in practical implementations it is common to have occasional interruptions in the data stream at the leaf level of the tree. In such cases, it is beneficial to have D.sub.FIFO>2 to maintain R.sup.MAX.
(31) Block Memory-Based Multi-Way Merge
(32) As k grows, the logic required for the sorter cells and FIFOs grows exponentially. As hardware resources are limited, this becomes one of the key prohibiting factors in implementing a large multi-way merge network. As a solution to this, a block memory-based implementation design, referred to herein as Scheme-1, can be used.
(33) In any given working cycle t.sub.W, two records, namely d.sub.i,tw.sup.out0 and d.sub.i,tw.sup.out1 are read from the same address of both memory blocks in stage i. However, only the record with the smaller key, d.sub.i,tw.sup.min=min{d.sub.i,tw.sup.out0, d.sub.i,tw.sup.out1}, is stored in the pipeline register. This record works as the input data d.sub.i+1,tw+1.sup.in to stage i+1 in the next working cycle t.sub.w+1. Therefore, for every pipeline stage, Scheme-1 requires 2 reads and 1 write in the pertinent memory blocks. It is important to note that the FIFO that dequeued the record with smaller key in stage i at working cycle t.sub.w needs to be replenished at the working cycle t.sub.w+1 to maintain steady throughput.
(34) Here, it is assumed for purposes of depiction that every block memory has one read port and one write port, which is common for Static Random Access Memory (SRAM). However, this scheme can be generalized, and the fundamental principles do not depend on the number of ports of the memory blocks.
(35) One advantage of Scheme-1 is that only log.sub.2 k sorter cells are now required instead of K−1. Another is that a single SRAM block can be used instead of multiple separate register-based FIFOs. This significantly reduces the silicon real estate required for the buffer storage needed in a multi-way merge network, as SRAM cells (8 transistors) are much smaller than registers (19 transistors). However, as described below, there are a number of potential issues that can render Scheme-1 inefficient.
(36) Performance: In Scheme-1, the read address for the block memory in each stage can be generated only after the records read in the previous working cycle have been compared. As a result, the working cycle duration is L.sub.d+L.sub.a, which limits the maximum throughput to one record per L.sub.d+L.sub.a clock cycles.
(37) Scalability: With increasing k, the FIFO buffer requirement grows exponentially for all multi-way merge binary trees. To make matters worse, the depth of each logical FIFO in Scheme-1 must be increased to partially compensate for the latency and additional control-logic delay described above. Therefore, efficient on-chip memory management is imperative to scale a multi-way merge network.
(38) Latency: Due to the monolithic decoder and SRAM technology, the read and write latency of on-chip block memory is generally much higher than that of register-based FIFOs. This latency significantly reduces the performance of Scheme-1 based multi-way merge implementations.
(39) Comparison Look Ahead Merge (CLAM)
(40) Disclosed herein is a first embodiment of a multi-way merge implementation scheme, referred to as Comparison Look Ahead Merge (CLAM). CLAM provides better performance (records per cycle) through an efficient address generation scheme and is more scalable due to less demand for buffer storage. Further disclosed herein is a second embodiment of a method, referred to as Hybrid Comparison Look Ahead Merge (HCLAM), to hide the block SRAM latency by pragmatically using both SRAM and registers as merge tree buffers. Both CLAM and HCLAM implementations are novel and no similar methods are found in the prior art.
(41) Before describing the embodiments, the following terminology is defined for clarity of explanation.
(42) In the following, the j.sup.th sorted list at pipeline stage i is denoted as l(i,j).
(43) The frontier record of list l(i,j) is represented as r(i,j). The frontier record of a list is the top-most record that has not yet been dequeued from the list.
(44) All sorted lists pertaining to any pipeline stage are numbered starting with 0. A comparison between the frontier records of two consecutive lists l(i,j) and l(i,j+1) will always imply that j is an even number. The notation min{r(i,j),r(i,j+1)} indicates the record with the smaller key between the frontier records of two consecutive lists starting with j at stage i. The notation max{r(i,j),r(i,j+1)} indicates the record having the larger key, with the rest of the notation unchanged.
(45) The main idea of CLAM comes from the observation that, in any stage of Scheme-1, records are read from the FIFOs and the read address in the next cycle is generated from the comparison results of these records. This sequential dependence of the address on data read in the previous cycle is inevitable, as this is the fundamental operation of multi-way merge. However, without violating this data dependency, the following operations can be conducted. First, instead of comparing the keys of records when they are dequeued from the FIFOs, the keys can be compared while being queued to the FIFOs. As consecutive lists are compared before the results are actually needed, this operation is referred to as “comparison look ahead”. Second, the comparison result can be stored using a single bit, namely the ‘tag’ (g), and used later to generate an address when the pertinent record is actually dequeued.
(46) These two concepts represent the core of CLAM. The main benefit of CLAM is that it is not necessary to wait for the reading of records from block memory and the comparison to complete before beginning to generate the address. As the comparison result is already available (i.e., the tag g is pre-computed), generation of the next cycle's read address can begin in parallel with the initiation of the current cycle's read. Therefore, the working cycle duration T.sub.W in CLAM is max(L.sub.a,L.sub.d), unlike the case of Scheme-1 (i.e., L.sub.a+L.sub.d).
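The look-ahead idea can be illustrated with a small software model of a single pair of logical FIFOs. This is a behavioral sketch under simplifying assumptions (one FIFO pair, no pipelining or clocking); the class and names are hypothetical, not the patented RTL:

```python
from collections import deque

class ClamPair:
    """One pair of logical FIFOs with CLAM-style comparison look ahead.

    The comparison between the two frontier records is performed when
    records are queued, and its one-bit result (the tag) is stored.
    At dequeue time the tag directly selects which FIFO to read, so no
    comparison sits on the address-generation critical path.
    """

    def __init__(self, fifo0, fifo1):
        self.fifos = [deque(fifo0), deque(fifo1)]
        self._update_tag()

    def _update_tag(self):
        # Look-ahead comparison: tag = index of the FIFO whose frontier
        # record has the smaller key (computed ahead of need).
        a, b = self.fifos
        if not a:
            self.tag = 1
        elif not b:
            self.tag = 0
        else:
            self.tag = 0 if a[0][0] <= b[0][0] else 1

    def dequeue_min(self):
        # The pre-computed tag acts as the read-address LSB.
        rec = self.fifos[self.tag].popleft()
        self._update_tag()  # refresh the tag for the next access
        return rec
```

In hardware, the tag refresh happens while the next record is being queued, which is exactly what lets the read address be formed in parallel with the data read.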
(49) Data and Address Storage
(50) As depicted in the figures, each pipeline stage of CLAM contains two memory blocks: B.sub.i.sup.min, which comprises the logical FIFOs holding the min{r(i,j),r(i,j+1)} records, and B.sub.i.sup.max, which holds the max{r(i,j),r(i,j+1)} records for future advanced comparison.
(51) Another important difference between CLAM and Scheme-1 is that, in every work cycle, there are two writes in a stage instead of one. In both CLAM and Scheme-1, one record moves from one pipeline stage to the next in a work cycle. In Scheme-1, the incoming record is queued to one of the memory blocks to replenish one logical FIFO. Similarly, in CLAM, after the advanced (look ahead) comparison with the incoming record, min{r(i,j),r(i,j+1)} must be queued in B.sub.i.sup.min to replenish a FIFO. Additionally, max{r(i,j),r(i,j+1)} must be written to B.sub.i.sup.max to conduct the advanced comparison in the future. Thus, in CLAM there are two reads and two writes of records in every stage at each work cycle.
(52) CLAM also has a read queue to store read addresses. This read queue serves the purpose of storing read request addresses from the following stage and helps to avoid bubbles. The read queue in stage i−1 provides the read address raddr.sub.i−1 for B.sub.i−1.sup.min and B.sub.i.sup.max. Because B.sub.i.sup.max handles half as many lists as B.sub.i−1.sup.min does, the LSB of raddr.sub.i−1 is excluded before using it as the read address of B.sub.i.sup.max. Therefore, the read address of B.sub.i.sup.max is defined as rBX.sub.i=raddr.sub.i−1, excluding the LSB.
(53) Tag Array
(54) The tag (g) is a single bit that stores the result of the advanced (look ahead) comparison. An additional buffer, referred to as the “tag array”, is required in CLAM to store the tags. Only a single tag bit is required per FIFO of B.sup.min; hence, the tag array in stage i has only k.sub.i/2 bits. As the tag array memory requirement is trivial, it can be implemented using registers instead of SRAM for fast access. In every work cycle, one tag bit per stage is updated and utilized to generate an address using the following rules.
(55) Tag update rule: Assume r(i,j) and r(i,j+1) are the inputs to the comparator of stage i. If r(i,j)=max{r(i,j), r(i,j+1)}, the tag bit g.sub.i.sup.z is updated to ‘1’. Otherwise, if r(i,j+1)=max{r(i,j), r(i,j+1)}, the tag bit g.sub.i.sup.z is updated to ‘0’. Here z=j/2 and j is always an even number in this context. The value of z ranges from 0 to (k.sub.i/2−1).
(56) Tag usage rule: Tag g.sub.i.sup.z is used to generate the LSB of the read request address whenever a record r(i,j) is dequeued from the z.sup.th FIFO of B.sup.min. This request is stored in the read queue of stage (i−1). The entire request address is formed as {z, g.sub.i.sup.z}, where z=raddr.sub.i is the read address for B.sub.i.sup.min at the current working cycle.
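The two tag rules can be expressed directly in code. The following sketch assumes a per-stage tag array stored as a Python list; the function names are illustrative:

```python
def update_tag(tag_array, j, key_j, key_j1):
    """Tag update rule: for the even list index j, z = j // 2, and the
    tag bit g[z] is set to 1 if r(i, j) holds the larger key (i.e. it is
    the max of the pair), else 0."""
    z = j // 2
    tag_array[z] = 1 if key_j > key_j1 else 0
    return z

def request_address(tag_array, z):
    """Tag usage rule: the read-request address sent to the previous
    stage is {z, g[z]}, i.e. z with the tag bit appended as its LSB."""
    return (z << 1) | tag_array[z]
```

For example, with frontier keys 9 and 4 for the pair starting at j=2, the tag for z=1 becomes 1, and the resulting request address {1, 1} is binary 11, i.e. 3.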
(57) Initializing Operation
(59) Steady State Operation
(61) In steady state, a data read takes 2 clock cycles, i.e., L.sub.d=2.
(62) During clock cycle 1, raddr.sub.i,tw is also used to read the previously computed comparison result g.sub.i,tw.sup.out from the tag array. At the rising edge of clock cycle 1, f4 also latches the advanced comparison result g.sub.i,tw.sup.in, which is latched into the tag array at the rising edge of clock cycle 2 at address waddr.sub.i,tw=rBX.sub.i,tw−1. During clock cycle 1, if raddr.sub.i,tw happens to be the same as waddr.sub.i,tw, then g.sub.i,tw.sup.in is used as g.sub.i,tw.sup.out instead of what is actually read from the tag array. This is because the tag value pertaining to the latest record queued to the FIFOs in B.sub.i.sup.min must be used for read request address generation. In any case, at the rising edge of clock cycle 2, the request address {raddr.sub.i,tw, g.sub.i,tw.sup.out} is stored in the read queue of stage (i−1) through register f5. Hence, the overall address generation process also takes 2 clock cycles, i.e., L.sub.a=2.
(63) For an efficient implementation of CLAM, the data write must be overlapped with the read or address generation and finished within min{L.sub.d,L.sub.a} clock cycles so that no extra time is spent on writes.
(64) CLAM Performance: Because the data read and address generation are conducted in parallel in CLAM, the duration of a work cycle T.sub.w is max{L.sub.d,L.sub.a}. As both L.sub.d and L.sub.a are 2, the duration of T.sub.w is also 2 clock cycles. Unlike Eq(1), the maximum throughput of CLAM can be calculated as Eq(2). Due to the overlap of data read and address generation, R.sub.CLAM.sup.max is two times R.sub.Scheme−1.sup.max, and is the highest throughput possible for block memory-based multi-way merge implementations.
(65) R.sub.CLAM.sup.MAX=1/max{L.sub.d,L.sub.a}  (2)
(66) CLAM Scalability: Scarcity of fast on-chip memory is one of the main reasons that multi-way merge implementations cannot scale. In Scheme-1, both memory blocks in a pipeline stage comprise logical FIFOs. In CLAM, however, only one of the memory blocks comprises logical FIFOs. Hence, if the FIFO depth increases, only 50% of the memories in CLAM grow in size, whereas in Scheme-1 100% of the memories grow in size. Furthermore, as the CLAM work cycle is half of the work cycle of Scheme-1, relatively less FIFO depth is required in CLAM to avoid bubbles. The additional resource that CLAM needs is storage for the tag array; however, only a single bit per logical FIFO is required, which is trivial.
(67) Hybrid CLAM (HCLAM)
(68) One of the main drawbacks of using SRAM blocks instead of registers for a multi-way merge tree is that the SRAM read latency is significantly larger. CLAM utilizes a deeply pipelined design to improve clock frequency; however, the clock period is bounded by the SRAM block read latency. Furthermore, the throughput of a deeply pipelined CLAM is one element per two cycles, whereas a register-FIFO-based merge can provide a throughput of one element per cycle. Hybrid CLAM is a pragmatic way of utilizing both SRAM-based and register-based implementations.
(70) The two different multi-way merge schemes in HCLAM are implemented using two independent clocks as shown in
(72) For an ASIC implementation, clk.sub.IRFM is almost twice as fast as clk.sub.CLAM. In that case, Ratio.sub.HCLAM is 4, as computed by Eq(3). Furthermore, the number of stages required for IRFM can be calculated as log.sub.2(Ratio.sub.HCLAM)+1=3.
(73) It is also important to have enough depth in the Ratio.sub.HCLAM number of asynchronous FIFOs that interface the two different networks. If the data set is not heavily skewed towards any particular set of input lists, a FIFO depth of 8 to 32 works reasonably well. Even if data is heavily skewed, the depth can be increased without considerably affecting overall scalability, as Ratio.sub.HCLAM is expected to be small (on the order of 2 to 8).
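The ratio arithmetic in the preceding paragraphs can be checked in a few lines. Eq(3) itself is not reproduced in this text, so the formula below is inferred from the stated facts (CLAM delivers one record per two clk_CLAM cycles, while each IRFM input consumes up to one record per clk_IRFM cycle) and should be read as an assumption, not the patent's exact equation:

```python
import math

def ratio_hclam(f_irfm, f_clam):
    """Inferred throughput-matching ratio: the number of CLAM sub-trees
    needed to feed one IRFM tree at full rate. CLAM outputs 1 record per
    2 clk_CLAM cycles; IRFM consumes 1 record per clk_IRFM cycle, so
    ratio = 2 * f_irfm / f_clam."""
    return 2 * f_irfm / f_clam

# ASIC case from the text: clk_IRFM is about twice as fast as clk_CLAM.
ratio = ratio_hclam(2.0, 1.0)        # ratio of 4, matching the text
stages = int(math.log2(ratio)) + 1   # log2(4) + 1 = 3 IRFM stages
```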
(74) Parallel Multi-Way Merge
(75) Even though HCLAM itself increases the performance of a single multi-way merge network by almost 3 times, it is still not enough to properly saturate the system streaming bandwidth. Therefore, a parallel implementation of the merge network that can output multiple records to match the available streaming bandwidth is needed. A scalable parallelization method that can effectively address technology scaling is disclosed and referred to herein as Parallelization by Radix Pre-sorter (PRaP).
(76) Parallelization by Radix Pre-Sorter (PRaP)
(77) From the discussion above, it is apparent that a parallelization scheme is needed that does not require increasing the prefetch buffers as more parallel multi-way merge networks are added. PRaP, which is depicted in the figures, is a solution to this problem.
(78) Radix Pre-Sorter Implementation
(79) Without any loss of generality, assume that the DRAM interface width is also p records. Whenever the i.sup.th list l(i) is streamed from DRAM, records r(i,j) to r(i,j+p−1) are transferred in a single clock cycle as part of the prefetched data. These p records are then passed through a pipelined radix-based pre-sorter as shown in the figures.
(80) During the pre-sort, it is mandatory to maintain the original sequence of the records that possess the same radix.
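The pre-sorter's routing and its stability requirement can be sketched as follows. This is a behavioral model (one batch of records at a time, no pipelining), with illustrative names:

```python
def radix_presort(records, q):
    """Behavioral sketch of the PRaP radix pre-sorter: route each
    (key, value) record to the merge core whose radix equals the q least
    significant bits of the key. The routing is stable: records with the
    same radix keep their original relative order, as required."""
    bins = [[] for _ in range(1 << q)]   # one bin per 2**q merge cores
    mask = (1 << q) - 1                  # selects the q LSBs of the key
    for key, value in records:
        bins[key & mask].append((key, value))
    return bins
```

With q=1, keys 4, 7, 6, 5 are split into an even-radix stream (4, 6) and an odd-radix stream (7, 5), each preserving arrival order.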
(81) Load Balancing and Synchronization
(82) It is possible for the incoming lists to have keys that are imbalanced in terms of their radices. In such a case, the data are unevenly distributed among the MCs and a potential load imbalance will occur. More importantly, as the independent MCs work only on a particular radix, further sorting and synchronization among the outputs of the cores would otherwise be required to generate a single sorted final output. Both of these issues can be effectively resolved from the observation that the final output list is a dense vector. It is guaranteed that each MC will sequentially deliver records with monotonically increasing keys (assuming a sort in ascending order). Additionally, it is also mandatory that each possible key, which is the row index of the sparse intermediate vector in Two-Step SpMV, is present in the resultant dense vector.
(83) The insertion of missing keys, necessitated by the dense output vector, solves both the load imbalance and the synchronization problem. First, even though data are unevenly distributed among the cores, each MC produces the same number of records at its output at a similar rate; the effect of load imbalance is practically hidden even if it occurs. Second, the outputs of the p cores, y(cp+0) to y(cp+p−1), can be independently queued in a store queue and synchronously streamed out (dequeued) to DRAM. The records y(cp+0) to y(cp+p−1) are consecutive elements of the dense output vector. Furthermore, records dequeued at cycles c and (c+1) are also consecutive segments of the dense vector. Thus, no additional sorting logic is required to synchronize the outputs of the p independent multi-way merge cores. Therefore, with the PRaP parallelization method, the design can be scaled to multiple cores without increasing on-chip buffer requirements and will achieve the required throughput to match the streaming main-memory bandwidth. A q=4 bit radix pre-sorter (i.e., 2.sup.4=16 cores) is enough to saturate the extreme HBM bandwidth, which is on the order of hundreds of GB/s.
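The missing-key insertion can be modeled in a few lines. The sketch below assumes each merge core's output is a sorted list of (key, value) pairs and that missing keys take the value 0, consistent with claim 4 and the dense-vector discussion above; the names are illustrative:

```python
def fill_missing_keys(mc_output, radix, q, max_key):
    """Behavioral sketch of missing-key insertion for one merge core.

    A core handling `radix` must emit one record for every key k with
    0 <= k <= max_key and k mod 2**q == radix. Keys absent from its
    input stream are emitted with an artificial value of 0, so every
    core produces the same number of records at a similar rate.
    """
    present = dict(mc_output)  # mc_output: sorted (key, value) pairs
    return [(k, present.get(k, 0)) for k in range(radix, max_key + 1, 1 << q)]
```

For example, the odd-radix core (radix 1, q=1) holding only keys 1 and 5 still emits records for keys 1, 3, 5, and 7, padding 3 and 7 with zeros.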
(84) Up to this point, a scalable multi-way merge scheme that can effectively and practically handle large problem sizes (thousands of lists) at extremely high throughput (hundreds of GB/s to several TB/s) has been disclosed. The entire method can be termed HCLAM with PRaP parallelization. HCLAM mainly handles problem scaling, and PRaP parallelization handles technology scaling. The method is scalable because it does not prohibitively require more on-chip memory or logic as the problem size grows. Because of PRaP, the throughput can also be increased by incrementing the number of merge cores without increasing on-chip memory, which is the most critical resource for scaling.
(85) Reductions for High Performance Merge Operation
(86) Reductions (i.e., accumulating the values of multiple records whose indices match) are a very common and integral operation in many multi-way merge implementations. Reductions also need to be performed at full throughput to match the performance of the high-throughput merge. However, the latency of floating-point adders creates a challenge in maintaining full throughput during reductions. There are various scenarios where full-throughput reductions are necessary.
(87) For example, for the SpMV operation, P sets of floating-point (FP) multiplier and adder chains work in parallel on separate sub-stripes of A.sup.K and sub-segments of x.sup.k, as shown in the figures.
(88) A P-way IMN is capable of delivering P records (i.e. key-value pairs) per cycle. As shown in
(89) IMN-based schemes are preferred for the PSU over shared scratchpad-based schemes, as they provide more computational efficiency at the same level of hardware complexity.
(90) Full-Throughput Reducer
(91) One of the important aspects of the above computation is the reducer. Accumulation, or reduction, is required on collisions (i.e., when the row indices of two consecutive records match). However, a pipelined adder cannot handle more than one collision without introducing stalls into the entire computational pipeline. These stalls are due to the internal pipeline stages of the adder; F denotes the number of internal pipeline stages of an adder. An entire addition, which takes F cycles, has to complete before resolution of the next addition can start if another collision is found for the same row index. To overcome this, a chain of log.sub.2 F adders can be used, along with a final adder, as shown in the figures.
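Functionally, the reducer collapses each run of colliding records into a single record. The following is a behavioral (not cycle-accurate) sketch of that end result; the chain of log.sub.2 F adders plus the final accumulating adder exists to sustain one record per cycle in hardware, which a sequential software model does not capture:

```python
def reduce_collisions(records):
    """Behavioral sketch of the full-throughput reducer's end result:
    scan a stream of (row_index, value) records that is sorted by row
    index and sum the values of consecutive records whose indices
    collide, emitting exactly one record per distinct row index."""
    out = []
    for key, value in records:
        if out and out[-1][0] == key:        # collision: accumulate
            out[-1] = (key, out[-1][1] + value)
        else:                                # no collision: pass through
            out.append((key, value))
    return out
```

For instance, a stream with two records for row 0 and two for row 1 reduces to one summed record per row, which is what the adder chain guarantees at one input record per cycle.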
(92) Full Throughput Shift-Reduction Chain
(93) While the reducer resolves all collisions within a sub-stripe of the matrix, there can still be collisions among the P sub-stripes. Therefore, the sorted outputs must be checked for collisions, which are resolved when needed. To achieve this in a pipelined manner, a P-way shift-reduction chain may be used, as depicted in the figures.