HARDWARE ACCELERATOR FOR COMPUTING A SCALAR DOT PRODUCT
20240080197 · 2024-03-07
Assignee
Inventors
- Daniel SHTERMAN (Petach Tikva, IL)
- Omer SHLOMOVITS (Petach Tikva, IL)
- Michael ASA (Kfar Sava, IL)
- Yuval DOMB (Raanana, IL)
CPC classification
G06F17/16
PHYSICS
H04L9/3218
ELECTRICITY
International classification
H04L9/32
ELECTRICITY
G06F17/16
PHYSICS
Abstract
A hardware accelerator computes a scalar dot product given by R = Σ_{i=0}^{N−1} d_i·P_i, where d_i is a scalar of length b bits and P_i is an element in a group. The hardware accelerator includes a plurality A of accumulators addressed by corresponding contiguous partitions of the scalar d_i, each partition being of length c such that A·c ≥ b, and each accumulator containing a plurality B of buckets where B = 2^c. The value of P_i is entered into each empty accumulator bucket whose address corresponds to the value of the respective partition associated with the corresponding accumulator, or is added to a non-zero value that is already in the bucket, the sum replacing the previous value. A final accumulator sums the values in the respective buckets of each accumulator so as to derive A sums, and sums the A computed sums to derive the scalar dot product.
Claims
1. A hardware accelerator for computing a scalar dot product given by: R = Σ_{i=0}^{N−1} d_i·P_i
2. The hardware accelerator according to claim 1, further including: a respective bucket status memory associated with each of the buckets, wherein each bucket status memory is set to a first value indicating that the corresponding bucket is available and to a second value indicating that the corresponding bucket contains valid data.
3. The hardware accelerator according to claim 1, wherein the final accumulator comprises: a second adder coupled to the output of the first adder for summing the values in the respective buckets of each accumulator so as to derive A sums, and a third adder coupled to the output of the second adder for summing the A sums computed by the second adder.
4. The hardware accelerator according to claim 1, wherein P_i is a point on an elliptic curve and each first adder is an elliptic curve adder.
5. The hardware accelerator according to claim 4, when used to compute a scalar dot product during implementation of proof generation or proof verification for the zk-SNARK protocol.
6. The hardware accelerator according to claim 1, wherein the at least one first adder comprises at least two adders configured for operating simultaneously in a pipelined manner.
7. A system configured for computing a scalar dot product given by: R = Σ_{i=0}^{N−1} d_i·P_i
8. The system according to claim 7, further including: a respective bucket status memory associated with each of the buckets, wherein each bucket status memory is set to a first value indicating that the corresponding bucket is available and to a second value indicating that the corresponding bucket contains valid data.
9. The system according to claim 7, wherein the final accumulator comprises: a second adder coupled to the output of the first adder for summing the values in the respective buckets of each accumulator so as to derive A sums, and a third adder coupled to the output of the second adder for summing the A sums computed by the second adder.
10. The system according to claim 7, wherein P_i is a point on an elliptic curve and each of the at least one first adder is an elliptic curve adder.
11. The system according to claim 10, when used to compute a scalar dot product during implementation of proof generation or proof verification for the zk-SNARK protocol.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] In order to understand the invention and to see how it may be carried out in practice, embodiments will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
DETAILED DESCRIPTION OF EMBODIMENTS
[0024]
[0025] The host CPU 120 runs a user application program, which builds the data needed for the computation of the MSM and stores the data in the host memory 130, usually implemented as DDR (double data rate) memory. zk-SNARK protocol proofs typically require the MSM to be executed multiple times; for some of these executions the P_i's (coordinates) of a particular MSM are known ahead of time, and in other cases these coordinates are functions of a previously calculated MSM. If the coordinates are known ahead of time, they can be preloaded into the low latency memory 150 of the MSM subsystem. All communication between the host system and the MSM subsystem is done using PCIe 140, 170, which enables high-bandwidth data transfer between the host memory 130 and the MSM subsystem 110.
[0026] The manner in which the MSM subsystem 110 is used in the zk-SNARK protocol is well described in US20210266168, where it is implemented using a combination of hardware and software. The present invention achieves improved performance using a novel hardware configuration that employs logic elements to direct the flow of data in a pipelined manner, thus allowing parallel processing without the overhead of conventional software.
[0027] In the MSM subsystem 110, use of the non-blocking bridge 160 and low latency memory 150 is preferable because of the high bandwidth typically needed in MSM applications, particularly computationally demanding ones such as zk-SNARK. Specifically, the memory stores data that must be accessed by the MSM subsystem, so the speed of operation depends to some extent on the speed of the memory. However, the invention may also be used for less computationally demanding applications where high-speed memory access is less critical. The MSM module 180 receives all the data needed for the MSM calculation from the PCIe/DMA 170 and/or the low latency memory 150. Queues on the MSM module inputs absorb burst accesses of data.
[0028]
[0029] After the input with the last indication arrives from the bridge, the buckets accumulators perform their last aggregation calculation and start the partial-sums calculation phase. In this phase the calculation takes into account the weight of each bucket in the buckets accumulator.
[0030] Upon completion of the partial-sums phase, the results from the buckets accumulators are transferred to the final accumulator module for the final calculation over all the buckets accumulators, in which the weights of the buckets accumulators themselves (i.e., hundreds, tens, units, etc.) are taken into account.
[0031] When the number of buckets accumulators is greater than the number of elliptic curve adders 250, the scheduler 210 and the muxer 220 are needed in order to select the next buckets accumulator(s) to be served. The elliptic curve adder receives two operands G_a and G_b as its inputs and, after some number of clock cycles, generates the output G_c. Along with the operands G_a and G_b, the elliptic curve adder receives the index of the buckets accumulator that generated the transaction and the bucket address within that buckets accumulator. The output G_c is provided either to one of the buckets accumulators (based on the index provided at the elliptic curve adder input) or to the final accumulator. When provided to a particular buckets accumulator, the entry addressed by the pointer provided to the elliptic curve adder is checked to determine whether it holds valid coordinate data. If the entry is not valid, the data is simply written into the entry. Otherwise, the entry data together with the result data are fed back to the elliptic curve adder together with the index and address of the buckets accumulator, and the cycle is repeated recursively. The index specifies which buckets accumulator is selected and the address specifies which bucket in the selected buckets accumulator is being accessed. The final accumulator 240 generally receives its inputs directly from the elliptic curve adder(s) 250; alternatively (as a backup), it may access the data in the buckets accumulators by using the optional multiplexer 230.
[0032]
[0033] The host configures the DMA module (3), which will be used later for fast movement of data between the host memory, the MSM subsystem's low latency memory, and the MSM module. Based on preconfigured block descriptions, the DMA transfers the data from the host memory to the desired destination (either the low latency memory or the MSM module) (4), (5), (6). If needed, the MSM module loads additional data from the low latency memory that was previously preloaded using the DMA (7). The MSM module performs all the required calculations (8); these calculations may be performed in affine, Jacobian, projective, or any other suitable type of coordinates. Processes (7) and (8) are repeated as needed until the end of the calculations for the particular MSM. Upon completion of the calculations, the MSM module informs the host by an interrupt that the MSM calculation result is available to be read (9). To make better use of the MSM subsystem resources, more than a single interrupt may be used, an earlier interrupt informing the host that a new MSM calculation cycle can start even before the final accumulator has completed its task. The host then reads the MSM calculation result from the MSM subsystem (10).
[0034]
[0035] It is seen that the host runs application software, which in one application of the invention may be proof generation and proof verification for the zk-SNARK protocol. The host loads values of d_i and P_i into the host memory 130 and configures the PCIe/DMA module 170. The PCIe/DMA module 170 reads the values of d_i and P_i from the host memory 130 and copies them to the low latency memory 150 and, when necessary, also to the MSM module 180. The MSM module 180 reads the values of d_i and P_i as needed from the low latency memory 150 and performs intermediate and final calculations as described above, repeating as necessary until all values of d_i and P_i have been completely processed. When complete, the MSM module 180 informs the host, which reads the result from the MSM module 180.
[0036]
[0037]
Algorithm Detailed Description by Example
[0038] Thus, in order to clarify the generality of the MSM module 180, we will describe an implementation for computing the scalar dot product of two vectors, each having four elements, using base-10 arithmetic. Suppose that we have the two vectors: [0039] d = 132, 125, 75, 30 [0040] P = 11, 23, 31, 67
[0041] Thus, [0042] d_0 = 132, d_1 = 125, d_2 = 75 and d_3 = 30; P_0 = 11, P_1 = 23, P_2 = 31 and P_3 = 67
[0043] The scalar dot product is given by: R = Σ_{i=0}^{3} d_i·P_i
[0044] Therefore:
R=132*11+125*23+75*31+30*67=8,662
[0045] This is easily implemented in software using a nested for loop, but is very time-consuming when N is large. Nevertheless, it will facilitate a clearer understanding of the invention to demonstrate how the computation is performed using the hardware module according to the invention. Note that there is no multiplication operation in elliptic curve arithmetic; only one operation is defined: the elliptic curve add. Therefore, in order to implement multiplication by N, N−1 elliptic curve additions are needed.
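By way of illustration (not part of the original disclosure), multiplication by repeated addition can be sketched as follows in Python, with ordinary integer addition standing in for the elliptic curve add:

```python
def scalar_mul_by_repeated_add(n, p, add=lambda a, b: a + b):
    """Multiply group element p by scalar n >= 1 using only the group's
    add operation: computing n*p takes exactly n-1 additions."""
    result = p
    for _ in range(n - 1):  # n-1 additions in total
        result = add(result, p)
    return result

# 132 * 11 computed with 131 integer additions
print(scalar_mul_by_repeated_add(132, 11))  # -> 1452
```

This is what makes large scalars expensive to handle naively, and motivates the bucket decomposition that follows.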
[0046] We can rewrite the above equation as:
R = 1*11*100 + 3*11*10 + 2*11*1 + 1*23*100 + 2*23*10 + 5*23*1 + 0*31*100 + 7*31*10 + 5*31*1 + 0*67*100 + 3*67*10 + 0*67*1
[0047] Since all elements d_i of the vector d are no greater than 999, we can represent the i-th multiplication d_i·P_i in the scalar dot product as the sum of three partial products relating to the hundreds, tens and units, respectively, for each value of d_i. These values are stored in respective accumulators, each having ten separate memories, known as buckets, into which the cumulative partial products are deposited on each iteration. Once this is done, the values in each of the ten buckets of each accumulator are summed while taking into account the weight of each bucket. For example, in the tens-accumulator the weight of the bucket at address 9 is 9, while the weight of the bucket at address 5 is 5. After each accumulator has summed up all its buckets, the three accumulator sums are themselves summed to yield the scalar dot product. This summation takes into account the weights of the accumulators: in our example, the weight of the hundreds-accumulator is 100 while the weight of the tens-accumulator is 10.
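The bucket scheme just described can be sketched end-to-end for this decimal example (a Python sketch, illustrative only and not part of the original disclosure; plain integer addition stands in for the elliptic curve adder):

```python
# Decimal bucket method for R = sum(d_i * P_i), as described above.
d = [132, 125, 75, 30]
P = [11, 23, 31, 67]

NUM_ACCUMULATORS = 3   # units, tens and hundreds partitions
NUM_BUCKETS = 10       # one bucket per decimal digit value

# buckets[a][v] accumulates every P_i whose digit at weight 10**a is v
buckets = [[0] * NUM_BUCKETS for _ in range(NUM_ACCUMULATORS)]
for di, pi in zip(d, P):
    for a in range(NUM_ACCUMULATORS):
        digit = (di // 10**a) % 10
        buckets[a][digit] += pi  # insert-or-add into the bucket

# Weight each bucket by its address, then weight each accumulator by 10**a
R = sum(10**a * sum(v * buckets[a][v] for v in range(NUM_BUCKETS))
        for a in range(NUM_ACCUMULATORS))
print(R)  # -> 8662
```

Bucket 0 of each accumulator carries weight 0, so entries deposited there (e.g., for the zero hundreds digits of 75 and 30) contribute nothing to the result.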
[0048] Since this example represents a very specific application which, while useful for the purpose of explanation, is far removed from a practical application of the invention, we should explain why three accumulators, each having ten buckets, suffice here. We need three accumulators because we have elected to group the scalars d_i into three separate partitions or segments, corresponding in this case to hundreds, tens and units. And we need ten buckets in each accumulator because there are ten different values (i.e., digits) associated with each partition.
[0049] However, this is specific to this example. If d_i were a decimal number not exceeding 9,999, then we could represent the i-th multiplication d_i·P_i in the scalar dot product as the sum of four partial products relating to the thousands, hundreds, tens and units, respectively, for each value of d_i. This could be done using four accumulators, each having ten buckets. Alternatively, we could group the thousands and hundreds as a first partition and the tens and units as a second partition, represented by only two accumulators each having 10^2 = 100 buckets to accommodate all possible combinations.
[0050] In a practical implementation of the invention used to implement the zk-SNARK protocol, the scalar d_i is a 253-bit binary value, which is partitioned into 29 segments of 9 bits each; anything less would not suffice, since 29*8 = 232 bits is too small to represent the complete 253-bit value. The 29 segments require 29 accumulators, each having 2^9 = 512 buckets. But it will be understood that this could also be realized with fewer accumulators, each having more buckets, to represent fewer, larger partitions of more than 9 bits. Alternatively, we could employ more accumulators, each with fewer buckets, to represent a larger number of smaller partitions of fewer than 9 bits.
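The 29-segment, 9-bit partitioning can be checked with a short sketch (Python, assumed for illustration only):

```python
NUM_PARTITIONS = 29
BITS_PER_PARTITION = 9   # 29 * 9 = 261 >= 253, so the scalar fits

def partition_scalar(d):
    """Split a 253-bit scalar into 29 contiguous 9-bit segments,
    least significant segment first. Each segment value addresses
    one of 2**9 = 512 buckets in its accumulator."""
    mask = (1 << BITS_PER_PARTITION) - 1
    return [(d >> (BITS_PER_PARTITION * k)) & mask
            for k in range(NUM_PARTITIONS)]

def reassemble(segments):
    """Inverse operation, confirming the partition loses nothing."""
    return sum(s << (BITS_PER_PARTITION * k) for k, s in enumerate(segments))

d = (1 << 253) - 12345          # an arbitrary 253-bit scalar
segments = partition_scalar(d)
assert all(0 <= s < 512 for s in segments)
assert reassemble(segments) == d
```

Choosing 8-bit segments instead would cover only 29*8 = 232 bits, which is why 9 bits per segment is the minimum for 29 partitions.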
[0051] The decision whether to partition the scalar into fewer, larger partitions (each accumulator then having more buckets) or into more, smaller partitions (each accumulator having fewer buckets) is basically a tradeoff between performance and the accumulators' memory size. The larger the number of accumulators, the more partial products can be computed in parallel, since all the accumulators can be addressed together in a single clock cycle.
[0052] Reverting to the above decimal example, where d=132, 125, 75, 30 and P=11, 23, 31, 67, we have three accumulators, which are shown in
[0053] Computation of the scalar dot product requires iteratively populating the buckets, as will now be explained, it being first noted that because separate, mutually independent accumulators are used to keep tally of the hundreds, tens and units, they can be addressed in parallel during the same clock cycle. Thus, for the first term:
d_0·P_0 = 132*11 = 100*11 + 30*11 + 2*11
we start by placing 11, corresponding to P_0, in the first bucket of the hundreds-accumulator, the third bucket of the tens-accumulator and the second bucket of the units-accumulator. This is done by a direct write access to the corresponding buckets, after which their validity flags are set to 1 to indicate that these buckets now contain valid data. For the second term, we need to add 23, corresponding to P_1, to the first bucket of the hundreds-accumulator, the second bucket of the tens-accumulator and the fifth bucket of the units-accumulator. In the case of the tens and units, the corresponding buckets are both empty, so 23 can be placed directly into the second bucket of the tens-accumulator and the fifth bucket of the units-accumulator, after which their validity flags are set to 1. But we cannot directly place 23 into the first bucket of the hundreds-accumulator because its validity flag is set to 1, indicating that it contains valid data, i.e., 11 from the previous iteration.
[0054] When we encounter such a situation, where we need to enter data into a bucket that already contains data, we must add the new value to the existing value. To do this, we set the buffer to the current value, in this case 11, and then initialize the bucket, either by emptying it or simply by setting its validity flag to zero. The new value of 23, corresponding to P_1, is now conveyed together with the data in the buffer, currently equal to P_0, to the elliptic curve adder, which adds the two values 11 + 23 and feeds the sum back to the hundreds-accumulator. Referring to the schematic implementation in
[0055] At the end of this process, it can be seen that the hundreds-accumulator has a single entry, 34, in bucket 1. The sum of all values in this accumulator is therefore equal to 100*34. The tens-accumulator has three entries, 23, 78 and 31, in buckets 2, 3 and 7, respectively. The sum of these values is therefore equal to 20*23 + 30*78 + 70*31. The units-accumulator has two entries, 11 and 54, in buckets 2 and 5, respectively. The sum of these values is therefore equal to 2*11 + 5*54. Note that when dealing with a large number of inputs, the probability of an empty bucket is small.
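The insert-or-add flow described above, with validity flags, a buffer, and the adder feedback path, can be simulated as follows (a Python sketch; plain integer addition stands in for the elliptic curve adder, and all names are illustrative only):

```python
# Simulation of the insert-or-add flow: each bucket carries a validity
# flag; on a collision the occupied bucket is drained into a buffer,
# invalidated, and the old and new values are sent to the adder, whose
# result is written back through the same path.
NUM_BUCKETS = 10
buckets = [0] * NUM_BUCKETS
valid = [False] * NUM_BUCKETS

def ec_add(a, b):
    """Placeholder for the elliptic curve adder."""
    return a + b

def insert(addr, value):
    if not valid[addr]:            # bucket empty: direct write, set flag
        buckets[addr] = value
        valid[addr] = True
    else:                          # bucket occupied: buffer, clear, add
        buffered = buckets[addr]
        valid[addr] = False
        insert(addr, ec_add(buffered, value))  # adder output fed back

# Hundreds digits of d_0 = 132 and d_1 = 125 both address bucket 1
insert(1, 11)      # P_0 written directly, validity flag set
insert(1, 23)      # collision: 11 buffered, 11 + 23 = 34 written back
print(buckets[1])  # -> 34
```

In hardware the feedback recursion is pipelined through the adder rather than implemented as a software call, but the resulting bucket contents are the same.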
[0056] The last operation requires accumulating all the buckets while accounting for their respective weights, and is performed in two phases. The first phase accumulates the ten buckets within each accumulator, and the second phase accumulates the three results into the solution of the MSM problem.
[0057] The algorithm to sum up the buckets implements the following pseudocode: [0058] buckets_sum = 0 [0059] weighted_buckets_sum = 0 [0060] for idx 9 -> 1
buckets_sum = buckets_sum + bucket[idx]
weighted_buckets_sum = weighted_buckets_sum + buckets_sum
[0061] At the end of this process, the final result is in weighted_buckets_sum. Note that the number of times each bucket is added to weighted_buckets_sum is exactly the weight of that bucket. Running the algorithm on our example gives, for the hundreds-accumulator: 1*34 = 34. For the tens-accumulator: 2*23 + 3*78 + 7*31 = 497, and for the units-accumulator: 2*11 + 5*54 = 292.
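A runnable version of the bucket-summation pseudocode, applied to the worked example (a Python sketch, for illustration; the loop runs from bucket 9 down to bucket 1 so that each bucket ends up added exactly as many times as its weight, using additions only):

```python
def weighted_buckets_sum(bucket):
    """Running-sum aggregation: bucket[idx], once added to buckets_sum,
    is included in every subsequent addition to the weighted sum, so it
    is counted exactly idx times, i.e. with weight idx."""
    buckets_sum = 0
    weighted = 0
    for idx in range(9, 0, -1):   # idx = 9 down to 1
        buckets_sum += bucket[idx]
        weighted += buckets_sum
    return weighted

hundreds = [0] * 10; hundreds[1] = 34
tens     = [0] * 10; tens[2], tens[3], tens[7] = 23, 78, 31
units    = [0] * 10; units[2], units[5] = 11, 54

print(weighted_buckets_sum(hundreds))  # -> 34
print(weighted_buckets_sum(tens))      # -> 497
print(weighted_buckets_sum(units))     # -> 292
```

Bucket 0 is excluded from the loop because its weight is zero; including it would count every bucket once too often.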
[0062] The algorithm may be executed in firmware, which constitutes a second adder that may be part of the MSM module 180. The number of iterations required for each accumulator is equal to the number of buckets in that accumulator less 1, so even in the case where there are 512 buckets in each accumulator, as proposed for use in proof generation and proof verification for the zk-SNARK protocol, the cost overhead of implementing this phase in software is not critical.
[0063] The algorithm to sum up the accumulator values implements the following pseudocode: [0064] final_msm_result = accumulator[2] [0065] for idx 1 -> 0
final_msm_result = 10 * final_msm_result
final_msm_result = final_msm_result + accumulator[idx]
[0066] The algorithm may be executed in firmware, which constitutes a third adder that may be part of the MSM module 180. At this stage the number of iterations required to sum up the accumulator values is equal to the number of accumulators less 1, so even in the case where there are 29 accumulators, as proposed for use in proof generation and proof verification for the zk-SNARK protocol, the cost overhead of implementing this phase in software is not critical. At the end of this process the result is in final_msm_result. At the beginning, final_msm_result is initialized to 34. At the end of the first iteration it equals 10*34 + 497 = 837, and at the end of the second iteration it equals 10*837 + 292 = 8,662, which is the result of the MSM calculation.
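The accumulator-summation pseudocode amounts to a Horner-style evaluation; a runnable version for the worked example (Python, illustrative only):

```python
def final_accumulation(accumulators, radix=10):
    """Horner-style combination of the per-accumulator sums, starting
    from the most significant: multiply the running result by the radix
    and add the next accumulator's sum."""
    result = accumulators[-1]                       # accumulator[2]
    for idx in range(len(accumulators) - 2, -1, -1):
        result = radix * result + accumulators[idx]
    return result

# Per-accumulator sums from the worked example: units, tens, hundreds
print(final_accumulation([292, 497, 34]))  # -> 8662
```

In the 253-bit zk-SNARK case the same loop runs over 29 accumulator sums with radix 2^9 = 512, and the multiplication by the radix is itself realized by repeated group additions (doublings), since the group defines no multiplication.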
[0067] When the accumulators send their data to the final accumulator, a new MSM calculation can commence whereby data is fed to the buckets from the memory in parallel with the final accumulator summing the results of the previous computation, thereby saving time since two different computer resources are used simultaneously.
[0068] As noted above, one practical application of the invention is proof generation and proof verification for the zk-SNARK protocol, wherein the scalar d_i is of length 253 bits and the MSM is performed using 29 accumulators, each having 512 buckets. The hardware implementation according to the invention allows the respective buckets in each accumulator to be written in parallel, thus allowing up to 29 operations to be performed simultaneously. Multiplication is performed by repeated addition to those buckets in each accumulator for which the coefficient of the multiplicand for the respective bucket is non-zero. It is reiterated that the division between accumulators and buckets is a tradeoff between the number of operations that can be performed in parallel and the memory size of the accumulators.
[0069] It will also be understood that while the invention has been described with reference to proof generation and proof verification for the zk-SNARK protocol, the invention has more general application. Specifically, while the zk-SNARK protocol utilizes the ECP algorithm based on the multiplication of a scalar d_i with a point P_i on an elliptic curve, the invention is not restricted to elliptic curve addition, and the point P_i can more generally be an element in a group. In mathematics, a group is a set together with an operation that combines any two elements of the set to produce a third element of the set, such that the operation is associative, an identity element exists, and every element has an inverse.
[0070] It will also be appreciated that the invention relates principally to the construction and operation of the MSM module 180 shown in