IMPLEMENTING FUNCTIONS IN HARDWARE
20230028953 · 2023-01-26
CPC classification
G06F1/0356
PHYSICS
Abstract
Methods for implementing or synthesizing functions in hardware and fixed-function hardware include generating a look-up table, LUT, representing the function and then applying a transform to the LUT to transform the LUT into a plurality of derived LUTs. The transform may be applied recursively. A hardware design implementing each of the derived LUTs in fixed-function hardware logic, along with a logic unit that performs the inverse transform, is then created.
Claims
1. A method of implementing a function in fixed-function hardware, the method comprising: generating a look-up table, LUT, representing the function; applying a transform to the LUT to transform the LUT into a plurality of derived LUTs; and generating a hardware design implementing each of the derived LUTs in fixed-function hardware logic and further implementing an inverse transform to generate an output value by selectively combining one or more values output from the fixed-function hardware logic dependent on the input value.
2. The method according to claim 1, wherein the transform is applied recursively to the LUT.
3. The method according to claim 1, further comprising: after performing one or more recursions, evaluating hardware metrics for the plurality of derived LUTs; and selectively performing a further recursion dependent upon the evaluation.
4. The method according to claim 1, wherein the plurality of derived LUTs comprises at least three LUTs and the method further comprising: combining two or more of the derived LUTs to form a new derived LUT.
5. The method according to claim 1, further comprising: subtracting a fixed or linear term from a derived LUT to form a new derived LUT.
6. The method according to claim 1, wherein generating a look-up table, LUT, representing the function comprises: approximating the function with a different, prediction function; and generating a LUT representing the prediction function.
7. The method according to claim 6, wherein generating a look-up table, LUT, representing the function further comprises: generating a LUT of differences between the function and the prediction function.
8. The method according to claim 7, further comprising: implementing the LUT of differences in fixed-function hardware logic.
9. The method according to claim 7, further comprising: applying a transform to the LUT of differences to transform the LUT of differences into a plurality of derived difference LUTs; and implementing each of the derived difference LUTs in fixed-function hardware logic.
10. The method according to claim 1, wherein the LUT representing the function comprises a LUT that approximates the function.
11. The method according to claim 1, wherein generating a look-up table, LUT, representing the function comprises: generating a LUT based on pre-defined characteristics of the function.
12. The method according to claim 1, wherein the transform is a Haar transform.
13. The method according to claim 1, further comprising: creating the fixed-function hardware logic from the generated design.
14. A hardware logic arrangement configured to implement a function in fixed-function hardware, the hardware logic comprising: an input for receiving an input value for input to the function; a plurality of hardware logic blocks, each hardware logic block implementing one of a plurality of look-up tables, LUTs, in fixed-function hardware and at least two of the plurality of LUTs being derived from a LUT representing the function by applying a transform; and a logic unit arranged to perform an inverse transform to generate an output value by selectively combining one or more values output from the plurality of hardware logic blocks dependent on the input value.
15. The hardware logic arrangement according to claim 14, wherein the logic unit is further arranged, as part of performing the inverse transform to: add a fixed or linear term to the selective combination of one or more values output from the plurality of hardware logic blocks dependent on the input value.
16. The hardware logic arrangement according to claim 14, wherein the at least two of the plurality of LUTs derived from a LUT representing the function are derived from a LUT that represents a prediction function that approximates the function.
17. The hardware logic arrangement according to claim 16, wherein the plurality of LUTs further comprise a difference LUT, the difference LUT storing differences between the function and the prediction function.
18. The hardware logic arrangement according to claim 16, wherein the plurality of LUTs further comprise two or more LUTs derived from a difference LUT, the difference LUT storing differences between the function and the prediction function.
19. The hardware logic arrangement according to claim 16, wherein the logic unit comprises: a first logic unit arranged to selectively combine one or more values output from the plurality of hardware logic blocks dependent on the input value; a second logic unit arranged to selectively combine one or more different values output from the plurality of hardware logic blocks dependent on the input value; and a third logic unit arranged to perform an inverse transform based on values output from the first and second logic units and the input value.
20. A non-transitory computer readable storage medium having stored thereon computer readable code that when executed at a computer system, causes the computer system to perform a method of implementing a function in fixed-function hardware, the method comprising: generating a look-up table, LUT, representing the function; applying a transform to the LUT to transform the LUT into a plurality of derived LUTs; and generating a hardware design implementing each of the derived LUTs in fixed-function hardware logic and further implementing an inverse transform to generate an output value by selectively combining one or more values output from the fixed-function hardware logic dependent on the input value.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] Examples will now be described in detail with reference to the accompanying drawings in which:
[0032] The accompanying drawings illustrate various examples. The skilled person will appreciate that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the drawings represent one example of the boundaries. It may be that in some examples, one element may be designed as multiple elements or that multiple elements may be designed as one element. Common reference numerals are used throughout the figures, where appropriate, to indicate similar features.
DETAILED DESCRIPTION
[0033] The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The present invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.
[0034] Embodiments will now be described by way of example only.
[0035] As described above, many processors (e.g. CPUs, GPUs, etc.) include fixed-function hardware logic which implements commonly used functions, such as reciprocals, logarithms and other transcendental functions. One way to implement a function in fixed-function hardware logic is to directly hard-code the function (or some approximation thereof) as a look-up table (LUT) in hardware. Such an implementation is simple, has low latency and the accuracy is pre-defined (i.e. is hard coded into the LUT), which may result in significantly reduced design effort, especially in the case of more exotic functions. Also, by using a LUT, since entries are specified directly, other attributes of a function may be guaranteed, e.g. such that the function is monotonic, has optimal error (for a given output precision) or possesses some pattern of values that lend themselves to efficient logical representation. However, as the size of the look-up table increases (e.g. as the number of entries and/or the number of bits in each entry increases), then the resulting hardware implementation can quickly grow to a size which is infeasible, i.e. it is too large given the size constraints of processors, especially where these processors are for use in devices which are space-constrained (e.g. wearables, mobile and/or portable devices such as smart phones, tablet computing devices, laptops, gaming devices, etc.). For example, whilst it may be possible to implement a LUT in hardware logic for a function to 8-bit precision, if the precision that is required increases to 10-bits, then the resulting hardware logic will be significantly larger. If the precision increases further (e.g. to 14 or 16-bits), then the resulting hardware logic may be infeasibly large and/or the synthesis tool may simply fail because a limit in the complexity/size of the resultant hardware has been reached.
[0036] Described herein is improved fixed-function hardware that implements a function (e.g. a mathematical function). Also described herein are improved methods of implementing a function in fixed-function hardware logic. The methods described herein may also be used as part of a synthesis tool to provide an improved method of synthesizing a function and hence the methods described herein provide an improved method of generating a hardware layout for a function. As described in more detail below, the fixed-function hardware that implements the function (which may be referred to as the ‘target function’) comprises fixed-function hardware logic that implements at least two (and more generally a plurality of) LUTs and one or more arithmetic logic blocks that combine outputs from the two (or more) LUTs. These two or more LUTs are derived from the single, raw LUT that represents the target function and are generated by recursively applying a transform (e.g. a wavelet transform) to the raw LUT. By transforming a single LUT that implements (or approximates) the target function into a plurality of LUTs in this way and then implementing each of the plurality of LUTs in fixed-function hardware, the overall area of hardware required to implement the target function (in fixed-function hardware) can be significantly reduced.
[0037] By increasing the number of LUTs used from a single LUT to more than one LUT, the latency does increase; however, given the speed of LUTs when implemented in fixed-function hardware, the increase in latency is small and likely to still be well within any latency limits. Furthermore, depending upon the target function and the transforms used to generate the plurality of LUTs, there may be no sequential dependency between some or all of the LUTs and, as a result, some or all of the LUTs may be read substantially, or even entirely, in parallel, thereby reducing the increase in latency that results from using more than one LUT.
[0038] One or more pre- and/or post-processing optimisations may also be applied to further reduce the overall size of the hardware and/or increase the range of functions to which the methods described herein can be applied. In an example of a pre-processing operation, the target function may be represented by a prediction function. The raw LUT for the prediction function is then input to the methods described herein and transformed into a plurality of LUTs that can be more efficiently implemented in fixed-function hardware logic.
[0040] A first method of generating fixed-function hardware that implements a function can be described with reference to the flow diagram shown in
[0041] The single, raw LUT for the target function (generated in block 202) is generated at the required precision for the output of the fixed-function hardware (i.e. at the required precision for the result when calculating the function).
[0042] The transform that is applied (in block 204) may be a wavelet transform, such as the Haar wavelet or a linear wavelet, or other type of transform (e.g. involving integration or extrapolation, as described below), and the transform may then be applied recursively (e.g. one or more recursive steps in block 206). The generation of two LUTs (in block 204) involves applying the transform to the raw LUT (denoted L0) and generates two separate LUTs: a base LUT (denoted L1) and a first LUT comprising transform coefficients (denoted D0). As demonstrated by the examples described below, the nature of the coefficients in the coefficient LUT, D0, will depend upon the particular transform that is applied. Where one or more recursive steps are then applied (in block 206), the transform is applied to the base LUT generated in the immediately previous step (i.e. the first recursive step in block 206 is implemented on the base LUT generated in block 204, the second recursive step in block 206 is implemented on the base LUT generated in the first recursive step in block 206, etc.).
[0043] Where the Haar wavelet is used (in block 204), the base LUT (L1) comprises alternate entries from the raw LUT (e.g. only the even entries from the raw LUT, where the first entry is denoted entry 0 and hence is considered even) and the first coefficient LUT (D0) comprises entries where each entry comprises a value which is the difference between an odd entry from the raw LUT and the preceding even entry from the raw LUT, as shown in the example below:
TABLE-US-00001 Base LUT, L1 A C E G
TABLE-US-00002 Raw LUT, L0 A B C D E F G H
TABLE-US-00003 Coefficient LUT, D0 B − A D − C F − E H − G
[0044] It can be seen that where the Haar wavelet is used the two LUTs that are generated (in block 204) each comprise half the number of entries as the original raw LUT, although as described below, in some implementations, after the final recursion, the derived LUTs may be expanded so that they comprise the same number of entries as the raw LUT:
TABLE-US-00004 Base LUT, L1 (expanded) A A C C E E G G
TABLE-US-00005 Raw LUT, L0 A B C D E F G H
TABLE-US-00006 Coefficient LUT, D0 (expanded) 0 B − A 0 D − C 0 F − E 0 H − G
[0045] Where the Haar wavelet is used and the method of
y=L1(x[M−1:1])+(x[0]×D0(x[M−1:1]))
where:
x is an M-bit input, where the M bits are denoted x[M−1:0] with x[0] being the LSB;
D0(x[M−1:1]) is shorthand notation for the entry of table D0 indexed by the M−1 most significant bits (MSBs) of x;
(x[0]×D0(x[M−1:1])) is equal to D0(x[M−1:1]) when the least significant bit (LSB) of x (x[0]) is ‘1’ and is otherwise zero;
L1(x[M−1:1]) is shorthand notation for the entry of table L1 indexed by the M−1 MSBs of x.
It will be appreciated that where the LUTs are expanded so they comprise the same number of entries as the raw LUT, the inverse transform stays the same but entries in the LUT are indexed using all the bits of x and no logical operation is required to determine whether to add the coefficient from D0 or not (because where the LSB of x is 0, the value in the entry of D0 is 0).
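The Haar-style split and its inverse can be sketched in Python as follows (an illustrative sketch with invented function names, not taken from the patent text); the base and coefficient LUTs are derived from a raw LUT and every raw entry is then reconstructed exactly:

```python
# Illustrative sketch (names invented) of the Haar-style split and its
# inverse, y = L1(x[M-1:1]) + x[0] * D0(x[M-1:1]).

def haar_split(raw):
    """Split a LUT into a base LUT (the even entries) and a coefficient LUT
    (each odd entry minus the even entry that precedes it)."""
    base = raw[0::2]
    coeffs = [odd - even for even, odd in zip(raw[0::2], raw[1::2])]
    return base, coeffs

def haar_lookup(base, coeffs, x):
    """Evaluate the original LUT at index x using only the two derived LUTs."""
    idx = x >> 1          # the M-1 most significant bits of x
    lsb = x & 1           # x[0]: selects whether the coefficient is added
    return base[idx] + lsb * coeffs[idx]

raw = [ord(c) for c in "ABCDEFGH"]   # stand-in values for entries A..H
base, coeffs = haar_split(raw)
# every entry of the raw LUT is recovered exactly
assert all(haar_lookup(base, coeffs, x) == raw[x] for x in range(len(raw)))
```

The multiply by the LSB corresponds to the selective addition performed by the arithmetic unit in hardware.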
[0050] Where a linear wavelet is used, the base LUT (L1) comprises alternate entries from the raw LUT (i.e. only the even entries from the raw LUT) and the first coefficient LUT (D0) comprises entries which are the difference between an odd entry from the raw LUT (e.g. B, D, . . . ) and the average of the values either side in the raw LUT, as shown in the example below:
TABLE-US-00007 Raw LUT, L0 A B C D E F G H
TABLE-US-00008 Base LUT, L1 A C E G
TABLE-US-00009 Coefficient LUT, D0 B − ((A + C)/2) D − ((C + E)/2) F − ((E + G)/2) H − ((G + A)/2) or H − G
Where the linear wavelet is used it can be seen that the inverse transform is significantly more complicated than for the Haar wavelet because there is a more complex dependency between entries in the different LUTs (e.g. multiple reads from the base LUT, L1, are needed to invert the transform). To reduce the complexity of the inverse transform, the base LUT, L1, may be split into two separate tables, one comprising the odd entries of the base LUT, L1, and one comprising the even entries of the base LUT, L1, and this results in a single look-up from each of the newly formed tables, as shown below.
TABLE-US-00010 Raw LUT, L0 A B C D E F G H
TABLE-US-00011 L1 (even) A E
TABLE-US-00012 L1 (odd) C G
TABLE-US-00013 Coefficient LUT, D0 B − ((A + C)/2) D − ((C + E)/2) F − ((E + G)/2) H − ((G + A)/2) or H − G
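A corresponding sketch for the linear wavelet (again hypothetical Python, using the wrap-around (G + A)/2 variant for the final coefficient) shows why reconstructing an odd entry needs two base-LUT reads, and why those reads always fall one in the even half and one in the odd half of the base LUT:

```python
def linear_split(raw):
    """Base LUT = the even entries; coefficient LUT = each odd entry minus the
    average of its two even neighbours (wrapping at the end of the table)."""
    base = raw[0::2]
    n = len(base)
    coeffs = [raw[2*i + 1] - (base[i] + base[(i + 1) % n]) / 2 for i in range(n)]
    return base, coeffs

def linear_lookup(base, coeffs, x):
    """Invert the transform: even entries are read directly; odd entries need
    base[i] and base[i+1], which have opposite parity, so a base LUT split
    into even/odd halves needs only one read per half."""
    if x % 2 == 0:
        return base[x // 2]
    i, n = x // 2, len(base)
    return coeffs[i] + (base[i] + base[(i + 1) % n]) / 2

raw = [4.0, 5.0, 7.0, 10.0, 14.0, 19.0, 25.0, 32.0]   # illustrative values
base, coeffs = linear_split(raw)
assert all(linear_lookup(base, coeffs, x) == raw[x] for x in range(len(raw)))
```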
[0051] Having generated the two LUTs (in block 204), these may be implemented in fixed-function hardware logic (e.g. as two separate blocks 102, 104, one for the base LUT, L1 and one for the first coefficient LUT, D0) along with an arithmetic unit 106 that performs the inverse transform, as shown in
[0052] The method of
[0053] The raw LUT comprises 16 4-bit entries (i.e. such that M=N=4 where, as detailed above, M is the width of the input in bits and N is the width of the output in bits) as shown below:
TABLE-US-00014 Raw LUT, L0 4 4 4 4 5 5 6 6 7 7 8 9 10 11 12 14
[0054] Using the Haar wavelet for the function above, the expanded transform coefficients that are generated in each recursion (i.e. for each successive value of m, which is the level of recursion and starts at m=0), including redundant zero entries for each alternate entry (block), are given by:
[0055] By applying the Haar wavelet, the single, raw LUT above is initially divided into two LUTs (in block 204): the base LUT, L1, and the coefficient LUT, D0, as shown below (in non-expanded form, i.e. without the redundant zero entries). The coefficient LUT, D0, contains the transform coefficient terms for m=0.
TABLE-US-00015 L1 4 4 5 6 7 8 10 12
TABLE-US-00016 D0 0 0 0 0 0 1 1 2
[0056] As shown above, each of the LUTs now comprises 8 entries, rather than the 16 entries of the first table and whilst the base LUT, L1, still requires 4-bits per entry, the coefficient LUT, D0, only requires 2-bits per entry. As described above, the inverse transform comprises, for even values of x, the respective value from L1 and for odd values of x, the sum of the respective values from L1 and D0, where in either case, the three MSBs of x can be used as the index to select the correct value from each LUT. If the four input bits of x are denoted x[3:0] with x[0] being the LSB, then the inverse transform can be implemented as:
y=L1(x[3:1])+(x[0]×D0(x[3:1]))
[0057] As noted above, whilst calculating an inverse transform (e.g. in arithmetic unit 106) relies upon looking up values in multiple LUTs, these look-ups may be performed substantially, or even entirely, in parallel. Furthermore, depending upon the nature of the transform used, the reconstruction operation may involve summation of terms (e.g. where Haar is used) or a more complex polynomial expression of the coefficients.
[0058] As shown in
[0063] The first optimization listed above comprises performing additional recursive steps (block 206) which apply a transform (e.g. the same transform as in block 204 or a different transform) to one or both of the LUTs that have been previously generated (in block 204). By performing additional recursive steps (in block 206), additional LUTs are generated such that the method of
[0064] Referring back to the example above, the Haar wavelet may be applied to the larger of the two LUTs that were previously generated (i.e. to the base LUT, L1), thereby generating two new LUTs: a new base LUT, L2, and a second coefficients LUT, D1 (which comprises the transform coefficients for m=1). This increases the total number of LUTs to three—the new base LUT, L2 and two coefficient LUTs D0, D1:
TABLE-US-00017 L2 4 5 7 10
TABLE-US-00018 D0 0 0 0 0 0 1 1 2
TABLE-US-00019 D1 0 1 1 2
[0065] As shown above, each of the two newly generated LUTs, L2, D1 only comprises 4 entries, rather than the 16 entries of the raw LUT (or 8 entries of L1, from which they were formed) and whilst the first new LUT, L2, still requires 4-bits per entry, the second LUT, D1, only requires 2-bits per entry. In this case, the inverse transform comprises (using the same notation as above):
y=L2(x[3:2])+(x[1]×D1(x[3:2]))+(x[0]×D0(x[3:1]))
[0066] This may then be repeated (by repeating block 206), to apply a further recursive step to the base LUT, L2, resulting in:
TABLE-US-00020 L3 4 7
TABLE-US-00021 D0 0 0 0 0 0 1 1 2
TABLE-US-00022 D1 0 1 1 2
TABLE-US-00023 D2 1 3
[0067] As shown above, each of the newly generated LUTs, L3, D2, only comprises 2 entries, rather than the 16 entries of the raw LUT (or the 4 entries of L2, from which they were formed). The new base LUT, L3, now only requires 3-bits per entry, whilst the second LUT, D2 (the third coefficient LUT), requires 2-bits per entry. In this case, the inverse transform comprises (using the same notation as above):
y=L3(x[3])+(x[2]×D2(x[3]))+(x[1]×D1(x[3:2]))+(x[0]×D0(x[3:1]))
[0068] This may then be repeated (by repeating block 206), to apply yet a further recursive step to the base LUT, L3, resulting in:
TABLE-US-00024 L4 4
TABLE-US-00025 D0 0 0 0 0 0 1 1 2
TABLE-US-00026 D1 0 1 1 2
TABLE-US-00027 D2 1 3
TABLE-US-00028 D3 3
In this case, the inverse transform comprises (using the same notation as above):
y=L4+(x[3]×D3)+(x[2]×D2(x[3]))+(x[1]×D1(x[3:2]))+(x[0]×D0(x[3:1]))
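The full recursion on this worked example can be checked with a short script (a hypothetical Python sketch; the table and function names are invented), which reproduces the derived tables D0..D3 and L4 shown above and then evaluates the inverse transform for every input:

```python
# raw 4-bit LUT from the worked example (M = N = 4)
raw = [4, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 14]

def haar_split(lut):
    """Base LUT = even entries; coefficients = odd entry minus preceding even."""
    return lut[0::2], [o - e for e, o in zip(lut[0::2], lut[1::2])]

base, d = raw, []
for _ in range(4):                 # four levels of recursion: L4 and D0..D3
    base, c = haar_split(base)
    d.append(c)

assert d[0] == [0, 0, 0, 0, 0, 1, 1, 2]     # D0
assert d[1] == [0, 1, 1, 2]                 # D1
assert d[2] == [1, 3]                       # D2
assert d[3] == [3]                          # D3
assert base == [4]                          # L4

def inverse(x):
    # y = L4 + x[3]*D3 + x[2]*D2(x[3]) + x[1]*D1(x[3:2]) + x[0]*D0(x[3:1])
    y = base[0]
    for m in range(3, -1, -1):
        y += ((x >> m) & 1) * d[m][x >> (m + 1)]
    return y

assert [inverse(x) for x in range(16)] == raw
```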
[0069] Referring to the examples above and in particular the expressions for the inverse transforms, it can be seen how multiplexers may be used to select terms to include in the addition, with the multiplexers controlled by bits from the input value.
[0070] Looking back at the examples, it is clear that the method of
[0071] One of many different evaluations may be performed based on the determined hardware metrics to select a final (or base) level of recursion, for example:
Recursion may stop when the hardware metrics (as determined in block 208) satisfy a predefined target (e.g. the hardware metric is below a predefined threshold).
Recursion may be performed until the final base LUT comprises only one entry (or another pre-defined minimum number of entries), and then the hardware metrics for each level of recursion compared to determine the level of recursion that provides the best hardware metrics (e.g. smallest hardware area).
Recursion may be performed to identify a level of recursion with at least a local optimum hardware metric (e.g. recursion may continue whilst the hardware metrics improve, until they reach a point where the metric gets worse, and then the final level of recursion may be the penultimate level performed).
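In practice the hardware metrics come from a synthesis tool; as a hedged stand-in, the sketch below uses total storage bits (entries multiplied by entry width) as a naive cost proxy and tracks it per recursion level so a stopping level can be selected:

```python
def width(entries):
    """Bits needed to store the largest (unsigned) entry of a table."""
    return max(1, max(e.bit_length() for e in entries))

def storage_bits(tables):
    """Naive cost proxy: total stored bits across all tables."""
    return sum(len(t) * width(t) for t in tables)

def haar_split(lut):
    return lut[0::2], [o - e for e, o in zip(lut[0::2], lut[1::2])]

raw = [4, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 14]
base, coeff_tables = raw, []
costs = {0: storage_bits([raw])}
level = 0
while len(base) > 1:               # recurse until the base LUT has one entry
    base, c = haar_split(base)
    coeff_tables.append(c)
    level += 1
    costs[level] = storage_bits([base] + coeff_tables)

best = min(costs, key=costs.get)   # level with the smallest storage proxy
```

For this particular example the proxy keeps shrinking, so the deepest level wins; a real synthesis-derived metric may instead exhibit a local optimum, matching the third strategy listed above.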
[0075] Although in the examples above, the number of entries in the generated LUTs halve at each recursion, in some examples, the LUTs may be expanded after termination of the recursion and prior to implementing them in fixed-function hardware so that they all comprise the same number of inputs, as shown below:
TABLE-US-00029 L4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
TABLE-US-00030 D0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 2
TABLE-US-00031 D1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 2 2
TABLE-US-00032 D2 0 0 0 0 1 1 1 1 0 0 0 0 3 3 3 3
TABLE-US-00033 D3 0 0 0 0 0 0 0 0 3 3 3 3 3 3 3 3
The expansion of the number of terms in the LUTs simplifies the indexing of the LUTs since all are now indexed in the same way. In this case, the inverse transform comprises:
y=L4(x)+D3(x)+D2(x)+D1(x)+D0(x)
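The expansion rule implied by the tables above can be sketched as follows (hypothetical Python; `expand` is an invented helper): entry x of the expanded Dm holds Dm(x[3:m+1]) when bit m of x is set, and zero otherwise, so the inverse transform becomes a plain sum of one value from each table:

```python
raw = [4, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 14]
L4 = [4]
D = [
    [0, 0, 0, 0, 0, 1, 1, 2],   # D0
    [0, 1, 1, 2],               # D1
    [1, 3],                     # D2
    [3],                        # D3
]

def expand(table, m, size=16):
    """Expand coefficient LUT Dm so it is indexed directly by x: the stored
    coefficient where bit m of x is set, zero elsewhere."""
    return [table[x >> (m + 1)] if (x >> m) & 1 else 0 for x in range(size)]

expanded = [expand(D[m], m) for m in range(4)]
assert expanded[0] == [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2]
assert expanded[3] == [0] * 8 + [3] * 8
# y = L4(x) + D3(x) + D2(x) + D1(x) + D0(x), with all tables indexed by x
assert all(L4[0] + sum(e[x] for e in expanded) == raw[x] for x in range(16))
```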
[0076] Using this simple example, it can be seen how each of the individual LUTs may be implemented efficiently in fixed-function hardware (although for more realistic, larger LUTs, the implementation is determined using a logic synthesis tool):
TABLE-US-00034
D0  Unsigned  x[3] AND x[2] AND x[1] AND x[0] & x[3] AND (x[2] XOR x[1]) AND x[0]
D1  Unsigned  x[3] AND x[2] AND x[1] & (x[3] XOR x[2]) AND x[1]
D2  Unsigned  x[3] AND x[2] & x[2]
D3  Unsigned  x[3] & x[3]
L4  4
y   L4(x) + D3(x) + D2(x) + D1(x) + D0(x)
In the table above ‘&’ indicates that bits are being concatenated and ‘AND’ is the Boolean operator.
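These logic expressions can be sanity-checked exhaustively (a hypothetical Python check, not part of the patent): each concatenation ‘&’ becomes a shift-and-OR of single-bit terms, and the summed table outputs must reproduce the raw LUT:

```python
def bit(x, i):
    """Extract bit i of x (0 or 1)."""
    return (x >> i) & 1

RAW = [4, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11, 12, 14]

for x in range(16):
    # each expression: (MSB of entry) << 1 | (LSB of entry)
    d0 = (bit(x, 3) & bit(x, 2) & bit(x, 1) & bit(x, 0)) << 1 \
       | (bit(x, 3) & (bit(x, 2) ^ bit(x, 1)) & bit(x, 0))
    d1 = (bit(x, 3) & bit(x, 2) & bit(x, 1)) << 1 \
       | ((bit(x, 3) ^ bit(x, 2)) & bit(x, 1))
    d2 = (bit(x, 3) & bit(x, 2)) << 1 | bit(x, 2)
    d3 = (bit(x, 3) << 1) | bit(x, 3)
    assert d0 == [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2][x]
    assert d1 == [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 2, 2][x]
    assert d2 == [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 3, 3, 3, 3][x]
    assert d3 == [0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3][x]
    assert 4 + d3 + d2 + d1 + d0 == RAW[x]   # y = L4 + D3 + D2 + D1 + D0
```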
[0077] In the example described above where, after recursion has stopped, all resultant LUTs are expanded to comprise the same number of entries, irrespective of which level of recursion the method stops at, the arithmetic unit 106 implements the inverse transform by summing one selected value from each of the LUTs.
[0078] Another optimization shown in
[0079] Referring back to the example above, the D2 and D3 LUTs may be combined and the result is:
TABLE-US-00035 D0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 2
TABLE-US-00036 D1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 2 2
TABLE-US-00037 D2′ = D2 + D3 0 0 0 0 1 1 1 1 3 3 3 3 6 6 6 6
TABLE-US-00038 L4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
In this case, the inverse transform comprises:
y=L4(x)+D2′(x)+D1(x)+D0(x)
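The combined table’s three-bit encoding can be verified the same way (hypothetical Python check; the expression is read as a concatenation of three single-bit terms):

```python
def bit(x, i):
    return (x >> i) & 1

# D2' = D2 + D3, elementwise over the expanded tables
D2e = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 3, 3, 3, 3]
D3e = [0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3]
D2p = [a + b for a, b in zip(D2e, D3e)]
assert D2p == [0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3, 6, 6, 6, 6]

# encoding: x[3] AND x[2]  &  x[3]  &  x[3] XOR x[2]   (MSB first)
for x in range(16):
    v = ((bit(x, 3) & bit(x, 2)) << 2) | (bit(x, 3) << 1) | (bit(x, 3) ^ bit(x, 2))
    assert v == D2p[x]
```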
[0080] Using this simple example, it can be seen how each of the individual LUTs may be implemented efficiently in fixed-function hardware:
TABLE-US-00039
D0   Unsigned  x[3] AND x[2] AND x[1] AND x[0] & x[3] AND (x[2] XOR x[1]) AND x[0]
D1   Unsigned  x[3] AND x[2] AND x[1] & (x[3] XOR x[2]) AND x[1]
D2′  Unsigned  x[3] AND x[2] & x[3] & x[3] XOR x[2]
L4   4
y    L4(x) + D2′(x) + D1(x) + D0(x)
[0081] To determine which of the coefficient LUTs to combine, various combinations may be calculated (in block 210) and the hardware metric of each combination evaluated (in block 208) and compared. In an example, where the recursion stops at the third level, i.e. L3, such that there are three coefficient LUTs, D0, D1, D2, then the following combinations may be evaluated and the option which results in the best hardware metrics selected:
TABLE-US-00040
LUT 1         LUT 2    LUT 3  LUT 4
D0            D1       D2     L3
D0 + D1       D2       L3     —
D0 + D1 + D2  L3       —      —
D0            D1 + D2  L3     —
Although the table above shows only adjacent levels (e.g. Dn and Dn+1) being combined, non-adjacent levels (e.g. D0+D3) may be combined. Furthermore, combinations may include combining one or more coefficient LUTs back with the base LUT (e.g. D0+L3).
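The combination search (in block 210) can be sketched as follows (hypothetical Python; the bit-count cost is a naive stand-in for real synthesis metrics, which is why it favours maximal merging in this example, whereas a synthesis tool would also account for the logic-minimisation potential of each table):

```python
def width(t):
    return max(1, max(e.bit_length() for e in t))

def cost(tables):
    """Naive proxy: total stored bits across the tables."""
    return sum(len(t) * width(t) for t in tables)

def merge(a, b):
    """Combine two expanded coefficient LUTs by elementwise addition."""
    return [x + y for x, y in zip(a, b)]

# expanded tables for the three-level example (base LUT L3 expanded to 16 entries)
D0 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2]
D1 = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 2, 2]
D2 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 3, 3, 3, 3]
L3 = [4] * 8 + [7] * 8

options = {
    "D0,D1,D2,L3": [D0, D1, D2, L3],
    "D0+D1,D2,L3": [merge(D0, D1), D2, L3],
    "D0+D1+D2,L3": [merge(merge(D0, D1), D2), L3],
    "D0,D1+D2,L3": [D0, merge(D1, D2), L3],
}
costs = {name: cost(ts) for name, ts in options.items()}
best = min(costs, key=costs.get)
```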
[0082] Another optimization shown in
[0083] In this example, the input, x, is an F1.8.12 floating point value (1 sign bit, 8 exponent bits with a bias of 127, 12 mantissa bits) and the output, y, is an S8.12 signed fixed point value (8 integer bits, 12 fractional bits), where the function being computed (i.e. the target function) is:
y = └log2|└x┐_F1.8.12,FTO|┐_S8.12,RNP
where └⋅┐_F1.8.12,FTO defines a custom rounding operation, “force to odd” (FTO), that replaces the mantissa of the input (interpreted as an integer) with the nearest odd number no less than the original value, and └⋅┐_S8.12,RNP rounds the exact result to the S8.12 signed fixed point format using round to nearest, with ties resolved in the positive direction. Since the floating-point value can be written as:
x = (−1)^s · 2^e · (1 + a)
where s is the sign, e is the exponent (without bias), a is the mantissa in the half-open interval [0,1), and
log2|x| = log2(2^e · (1 + a)) = e + log2(1 + a)
Then the target function, y, comprises a portion that depends upon 11 bits of mantissa input (11 bits, rather than 12, as a consequence of the FTO rounding described above) and gives 12 bits of fractional output, which can be encoded in a LUT, and an exponent integer portion that can be mapped directly to the output. This LUT, in its raw form, is relatively large and comprises 2048 12-bit entries (i.e. with M=11 and N=12).
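The raw mantissa LUT described above may be sketched as follows (hypothetical Python; `rnp` approximates the round-to-nearest, ties-positive rule and is an assumption): because FTO forces the mantissa LSB to 1, the table is indexed by the remaining 11 mantissa bits and entry i holds log2(1 + a) for a = (2i + 1)/2^12:

```python
import math

M, N = 11, 12          # 11 index bits in, 12 fractional output bits

def rnp(v):
    """Round to nearest, with ties resolved in the positive direction."""
    return math.floor(v + 0.5)

# entry i: log2(1 + a) scaled to N fractional bits, with a = (2i + 1)/2**12
lut = [rnp(math.log2(1 + (2 * i + 1) / 2**12) * 2**N) for i in range(2**M)]

assert len(lut) == 2048
assert all(0 <= v < 2**N for v in lut)   # every entry fits in 12 unsigned bits
```

This illustrates only the raw table; the recursion and linear-term extraction described below then operate on it.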
[0084] Using the method of
The base LUT (L4) also has 128 12-bit entries that are simply the original raw LUT sampled at intervals of 2^(7−11) = 1/16. As shown in the graph of
[0089] The resulting hardware therefore comprises fixed-function hardware blocks implementing five LUTs: D0, D1, D2, D3, (L4-LT). The last of these, L4-LT represents the base level of the Haar wavelet transformation, after four levels of recursion, but with a linear term extracted from the result (as explained above with reference to
y=(L4−LT)+LT+D3+D2+D1+D0
where the linear term LT that is added back comprises the integer portion of the result (i.e. the exponent portion) concatenated with the 7 most significant bits of the mantissa (as subtracted from L4 for the reasons described above), with zeros in the remaining 5 bit positions (since the output has 12 bits of fractional precision). Again, the MSBs of the input, x, are used to index into the various LUTs; however, as described above, a consequence of the FTO rounding is that the least significant mantissa bit is always set to 1, and therefore does not affect the value of y. This bit may therefore be discarded to give a reduced-size input word, x′, which can be used in place of x:
TABLE-US-00041
D0       D0(x′[10:1]) when x′[0] = 1, else 0
D1       D1(x′[10:2]) when x′[1] = 1, else 0
D2       D2(x′[10:3]) when x′[2] = 1, else 0
D3       D3(x′[10:4]) when x′[3] = 1, else 0
(L4-LT)  (L4-LT)(x′[10:4])
LT       Unsigned (NOT x′[18] & x′[17:4] & 00000) + 2**12 (where 2**12 = 2^12)
Note that the linear term given in the above table includes the exponent (subtracting the bias, i.e. negating the exponent’s most significant bit and adding 1 to its least significant bit (2^12 when concatenated with the 12 fractional bits, interpreted as an integer)) as featured in the above definition of the logarithm (with a total of 8 integral bits and 12 fractional bits).
[0090] As shown in this example, a fixed or linear term may not only be subtracted from one of the generated LUTs, e.g. from the base LUT, but may in addition, or instead, be subtracted from the original, single LUT for the target function before the transform is applied (e.g. between blocks 202 and 204).
[0091] In a variation of the methods described above with reference to
[0092] As with the method of
[0093] In some examples where a prediction function is used as an approximation to the target function, the prediction function may be selected such that the differences between the prediction function and the target function are within a pre-defined, acceptable, error-bound (e.g. the error does not exceed 1 Unit of Least Precision (ULP) in the result) and so need not be considered.
[0094] In other examples, a second LUT may be generated as part of the pre-processing step which comprises difference terms corresponding to each entry in the single LUT for the prediction function (block 503), i.e. each difference term is the difference between the predicted value, using the prediction function, and the actual value of the target function. This LUT of difference terms may be implemented in a single LUT hardware logic block 608 and then the relevant term added to the result of the inverse transform in the arithmetic unit 606A, as shown in
[0095] The use of the pre-processing step (block 502 and optionally block 503) can be described with reference to a third example. In this example, the target function, which is to be evaluated at relatively high precision (i.e. such that it is prohibitive to tabulate the function directly using a LUT) is:
This function may be approximated by linearly interpolating pairs of adjacent points (x_n, f(x_n)) on the curve by considering the following equation:
where t is an interpolation parameter in the open interval (0,1). Since the first two terms in the last line are the expression for the linear interpolation of the function at points x_n and x_(n+1), the last term provides a bound on the linear approximation error:
since each of the products in the denominator of the ‘error’ term is bounded below by one and the product t(1−t) has its maximum at t=0.5. By selecting an appropriate spacing of adjacent points on the x-axis (x_n and x_(n+1)) based on the target precision of the output (which may be defined in terms of a number, N, of fractional bits), the error term may be kept below 1 ULP and a LUT of difference terms need not be created (i.e. block 503 may be omitted).
[0097] Given a target precision of N fractional (or mantissa) bits, then a spacing of the points may be selected to be:
where the ceiling function is used to provide enough headroom for further rounding error when N is odd. Given this:
and hence there are 2.sup.┌N/2┐+1 sample points and 2.sup.┌N/2┐+1 entries in the single, raw LUT that is generated for the prediction function. If the input x is encoded with M bits of fractional precision, then x.sub.n is identified from the ┌N/2┐ most significant bits of x and the interpolation parameter, t, is identified from the remaining M−┌N/2┐ least significant bits of x.
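By way of illustration only, this identification of the LUT index and the interpolation parameter from the bits of x may be sketched in software (the function name is hypothetical; in fixed-function hardware this split is simply a wiring of the relevant bit ranges rather than arithmetic):

```python
import math

def split_input(x_bits: int, M: int, N: int):
    """Split the M fractional bits of the input x into a LUT index
    (the most significant ceil(N/2) bits, identifying x_n) and an
    interpolation parameter t (the remaining least significant bits).

    x_bits is the input encoded as an M-bit unsigned integer.
    """
    idx_bits = math.ceil(N / 2)          # bits selecting the sample point x_n
    t_bits = M - idx_bits                # bits encoding t in [0, 1)
    n = x_bits >> t_bits                 # LUT index
    t = (x_bits & ((1 << t_bits) - 1)) / (1 << t_bits)  # t as a fraction
    return n, t

# e.g. M = 8 fractional bits, target precision N = 8 -> 4 index bits
n, t = split_input(0b10110110, M=8, N=8)
```

Here the top four bits (1011) give n=11 and the bottom four bits (0110) give t=6/16=0.375.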
[0098] Given this, the prediction function that is used to approximate the target function may be, for example, given by:
y=1+└y.sub.n−(y.sub.n−y.sub.n+1)·t┐.sub.U1.N
where y.sub.n=└f(x.sub.n)−1┐.sub.U1.(N+1) (the −1 being an encoding convenience because the function is in the interval (1,2]) and └⋅┐.sub.Ua.b indicates that the output of an exact operation has been rounded to an unsigned fixed point format with a integer bits of precision and b fractional bits of precision. In this example, a precision of N+1 bits has been chosen for the LUT so that additional error in the linear interpolation rounding can be accommodated without exceeding the error target for the overall result. A single, raw LUT for y.sub.n may be generated (in block 202) and fed into the subsequent steps in the methods described above to generate a plurality of LUTs which can be implemented more efficiently in hardware logic. In such an example, the fixed-function hardware 700 that implements the prediction function and is shown in
least significant bits of x from which the interpolation parameter, t, is identified.
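By way of illustration only, the prediction function above may be sketched in software as follows. The names are hypothetical, and Python's built-in round (which rounds ties to even) stands in for the hardware rounding operator └⋅┐; the sketch assumes the LUT already holds y.sub.n=f(x.sub.n)−1:

```python
def round_fixed(v: float, frac_bits: int) -> float:
    # Round to an unsigned fixed-point format with frac_bits fractional
    # bits: the |_ . _| operator of the text (Python rounds ties to even).
    scale = 1 << frac_bits
    return round(v * scale) / scale

def predict(y_lut, n: int, t: float, N: int) -> float:
    """Evaluate y = 1 + round(y_n - (y_n - y_{n+1}) * t) to N fractional
    bits, where y_lut[n] holds y_n = round(f(x_n) - 1)."""
    y_n, y_n1 = y_lut[n], y_lut[n + 1]   # two look-ups in the same table
    return 1 + round_fixed(y_n - (y_n - y_n1) * t, N)
```

For instance, with a hypothetical table holding y.sub.0=0.0 and y.sub.1=0.5, an input with n=0 and t=0.5 interpolates to 1.25.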
[0099] The prediction function above requires two look-ups in the LUTs 702, one for y.sub.n and one for y.sub.n+1, and the second of these look-ups requires the index to be offset by one (i.e. to obtain the value of y.sub.n+1). Consequently, in a variation on the methods described above, the pre-processing step (block 502) may result in the subsequent steps in the method being performed separately in relation to two raw LUTs. These two raw LUTs may comprise one for y.sub.n and a rotated version of the same LUT for y.sub.n+1; however, as this still requires calculation of y.sub.n−y.sub.n+1, the two raw LUTs may instead comprise one for y.sub.n and one for d.sub.n, where d.sub.n=y.sub.n−y.sub.n+1. This results in hardware logic 900 that implements the prediction function, y, as shown in
least significant bits of x from which the interpolation parameter, t, is identified.
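By way of illustration, the variation using a raw LUT of pre-computed difference terms d.sub.n=y.sub.n−y.sub.n+1 may be sketched as follows (names are hypothetical; rounding is omitted for clarity). Each table is then read exactly once per evaluation:

```python
def make_difference_lut(y_lut):
    """Derive d_n = y_n - y_{n+1} from the raw LUT so that the
    interpolation y_n - d_n * t needs only one look-up per table."""
    return [y_lut[n] - y_lut[n + 1] for n in range(len(y_lut) - 1)]

def predict_two_luts(y_lut, d_lut, n: int, t: float) -> float:
    # One look-up in each table, no index offset needed
    return y_lut[n] - d_lut[n] * t
```

With a hypothetical table y = [1.0, 0.75, 0.5], the derived table is d = [0.25, 0.25], and interpolating at n=0, t=0.5 gives 0.875, matching direct linear interpolation between y.sub.0 and y.sub.1.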
[0100] In an alternative implementation, as described in relation to the simple linear transform example above, to avoid two look-ups in the same LUT, the LUT may be split into two (one comprising even entries and the other comprising odd entries) and then only a single look-up is required from any one LUT.
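A non-limiting sketch of this even/odd split follows (illustrative names only; in hardware the equivalent selection is performed with wiring and multiplexing rather than arithmetic). Whatever the parity of n, the pair (y.sub.n, y.sub.n+1) always draws one value from each half-table:

```python
def split_even_odd(lut):
    """Split a LUT into even-indexed and odd-indexed halves so that
    adjacent entries y_n and y_{n+1} always lie in different tables."""
    return lut[0::2], lut[1::2]

def lookup_pair(even, odd, n: int):
    """Return (y_n, y_{n+1}) with a single look-up into each half."""
    if n % 2 == 0:
        return even[n // 2], odd[n // 2]
    return odd[n // 2], even[n // 2 + 1]
```

For example, splitting [0, 1, 2, 3, 4] gives halves [0, 2, 4] and [1, 3]; the pair at n=1 is (1, 2), with one value read from each half.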
[0101] As described above, one of the transforms that may be used (in blocks 204, 206) may be the Haar wavelet transform, which may be written:
where m refers to the level of recursion and hence the m.sup.th table (with the base level being m=0) and n refers to the entry in the table, the n.sup.th entry (with the first entry being n=0) and the coefficient terms, d.sub.m,n, including the zero terms (i.e. every even term). As described above, multiple adjacent levels may be combined:
where w(m) counts the number of combined levels in the m.sup.th table.
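By way of illustration, one level of a Haar-style decomposition and its exact inverse may be sketched as follows. This sketch uses an unnormalised even-entry/difference convention for simplicity and does not reproduce the exact coefficient definitions of the equations above; it shows only the general structure of deriving a coarse table and a detail table from which the original LUT is exactly recoverable:

```python
def haar_forward(lut):
    """One level of an (unnormalised) Haar-style split: a coarse table
    of the even-indexed entries and a detail table of differences to
    the intervening odd-indexed entries."""
    coarse = lut[0::2]
    detail = [lut[2 * n + 1] - lut[2 * n] for n in range(len(lut) // 2)]
    return coarse, detail

def haar_inverse(coarse, detail):
    """Exactly reconstruct the original LUT from the two derived tables."""
    out = []
    for c, d in zip(coarse, detail):
        out.extend([c, c + d])
    if len(coarse) > len(detail):   # odd-length input keeps its last entry
        out.append(coarse[-1])
    return out
```

Applying the forward step recursively to the coarse table yields the hierarchy of derived tables indexed by the recursion level m; the detail tables often contain smaller magnitudes (or more zeros) than the raw LUT, which is what makes the derived tables cheaper to implement in fixed-function logic.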
[0102] In some examples, the transform below may be used to provide K differentiated tables (indicated by the superscript k):
l.sub.0,n.sup.k+1=l.sub.0,n+1.sup.k−l.sub.0,n.sup.k, wherein l.sub.0,n.sup.0=l.sub.0,n is the n.sup.th entry of the original LUT. As implied above, the encoding of one or more differentiated tables allows some form of polynomial interpolation to be applied between LUT entries, thereby effectively increasing the resolution (i.e. the input precision), without having to dramatically increase the size of the LUT (albeit potentially sacrificing accuracy/control over intermediate values). Transforms, such as above, may then be applied to one or more of the differentiated tables so that:
where again m is used to refer to the level of recursion and n is used to refer to the entry number for the k.sup.th LUT l.sub.m,n.sup.k and derived (e.g. Haar) coefficients table d.sub.m,n.sup.k. Note that here w(m) is not a function of k, such that consistent level combinations are applied across transformed differentiated tables, but in some examples, it may be useful to combine levels independently.
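The construction of the K differentiated tables by repeated forward differencing may be sketched, for illustration only, as follows (the recursion l.sup.k+1.sub.0,n = l.sup.k.sub.0,n+1 − l.sup.k.sub.0,n, starting from the original LUT):

```python
def differentiate(lut, K: int):
    """Build the original LUT plus K differentiated tables by repeated
    forward differencing: l^{k+1}_n = l^k_{n+1} - l^k_n, with l^0 the
    original LUT. Each successive table is one entry shorter."""
    tables = [list(lut)]
    for _ in range(K):
        prev = tables[-1]
        tables.append([prev[n + 1] - prev[n] for n in range(len(prev) - 1)])
    return tables
```

For the quadratic-like table [1, 4, 9, 16], the first differences are [3, 5, 7] and the second differences collapse to the constant table [2, 2], illustrating how differentiation can leave tables with few distinct values that are cheap to implement.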
In some examples, the transform coefficients of the table l.sub.m,n.sup.k+1 may be used to assist in the encoding of table l.sub.m,n.sup.k, using a form of integration technique, as given by the following equation:
In the above, the first term is the k.sup.th differentiated LUT l.sub.m,n.sup.k sampled at intervals of 2.sup.w(m), the second term is the next differentiated LUT l.sub.m,n.sup.k+1 sampled at intervals of 2.sup.w(m), extrapolated linearly over this interval, and the final term is the integrated residual (e.g. Haar when w(m)=1) coefficients d.sub.m,n.sup.k+1 of the differentiated LUT (which may be pre-computed in a LUT of its own). As is implied from the above expressions, the individual terms can be further decomposed into subsequent wavelet/differential recursion levels and/or further differentiated tables (with potentially higher order extrapolation terms).
[0103] In one example, which can be described with reference to the graph in
where two adjacent levels of Haar transform of the first differentiated table have been combined (w(0)=2) and no further recursion has been applied. Note that the base level of the transformed differentiated LUT is used in both the extrapolation prediction of the original LUT and the interpolation between adjacent LUT entries. The effect of adding the different terms in the final expression is shown in the graph of
[0104] In an alternative example, shown in
[0105] In examples in which the discrete wavelet transform has been extended into the continuous domain, the inverse transform is given by:
where the linear wavelet coefficients, d.sub.m,n, are given in terms of the m.sup.th base LUT coefficients, l.sub.m,n (defined as in the Haar example, except that an additional boundary value l.sub.m,2.sub.
and the terms L.sub.0,n.sup.0 and L.sub.0,n.sup.1 represent the coarse tables (i.e. after μ stages of recursions, where μ≤M), given by:
L.sub.0,n.sup.0=l.sub.μ,n and L.sub.0,n.sup.1=l.sub.μ,n+1−l.sub.μ,n
which, it may be noted, as in other examples, can be recursively encoded using the previous method for optimisation where both a table and its derivatives are present.
[0106] In many examples described above, the number of entries (or number of bits) in each of the plurality of derived LUTs may be smaller than the single, original raw LUT for the target or prediction function. In other examples, the LUTs may be of similar size but in a form that can be implemented in hardware in a less complex manner, e.g. the derived LUTs may comprise fewer discrete/different values, fewer non-zero bits, etc.
[0107] By using the methods described herein, LUTs may be used to implement many more functions in fixed-function hardware than would otherwise be feasible/practical (e.g. as a result of constraints of hardware size or power consumption). Use of LUTs provides many benefits including low latency and a fully specified accuracy.
[0108] The arrangement of hardware that implements the target function using the techniques described herein may provide additional benefits when it is integrated with other hardware in a processor. For example, the final hardware logic unit within the hardware implementing the function (e.g. arithmetic unit 106, 606A, 606B, logic unit 710, 910) may generate the bits of the output result sequentially (e.g. starting with the LSB) and these may be fed into subsequent hardware logic blocks (e.g. into a subsequent arithmetic block) in advance of determination of the full result (e.g. all the bits of the value of f(x) for an input x), thereby further reducing the latency of the overall hardware arrangement.
[0109] In the description above, it is assumed that the function to be implemented in the LUT is known at the outset and that the actual values of the function, in the form of the LUT, are used to transform the LUT and to evaluate the resultant hardware metrics (in block 208). In other examples, the techniques described above may also be applied to a LUT which is not known in advance but where characteristics of the LUT are known, e.g. the number of bits in each entry in the original LUT, the target accuracy for the output values, a bound on the differences between entries, etc. In particular, it may be possible to impose some bounds on the number of bits N of output precision for each of the plurality of tables generated by the above methods such that a fixed transform may be determined a priori and then the best approximating entries selected for a target function based on the given constraints.
[0110]
[0111] The hardware logic of
[0112] The fixed-function hardware described herein may be embodied in hardware on an integrated circuit. The fixed-function hardware described herein may be configured to perform any of the methods described herein. Generally, any of the functions, methods, techniques, or components described above can be implemented in software, firmware, hardware (e.g., fixed logic circuitry), or any combination thereof. The terms “module,” “functionality,” “component”, “element”, “unit”, “block” and “logic” may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block, or logic represents program code that performs the specified tasks when executed on a processor. The algorithms and methods described herein could be performed by one or more processors executing code that causes the processor(s) to perform the algorithms/methods. Examples of a computer-readable storage medium include a random-access memory (RAM), read-only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions or other data and that can be accessed by a machine.
[0113] The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for processors, including code expressed in a machine language, an interpreted language, or a scripting language. Executable code includes binary code, machine code, bytecode, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in a programming language such as C, Java, or OpenCL. Executable code may be, for example, any kind of software, firmware, script, module or library which, when suitably executed, processed, interpreted, or compiled at a virtual machine or other software environment, causes a processor of the computer system at which the executable code is supported to perform the tasks specified by the code.
[0114] A processor, computer, or computer system may be any kind of device, machine or dedicated circuit, or collection or portion thereof, with processing capability such that it can execute instructions. A processor may be any kind of general purpose or dedicated processor, such as a CPU, GPU, System-on-chip, state machine, media processor, application-specific integrated circuit (ASIC), programmable logic array, field-programmable gate array (FPGA), physics processing unit (PPU), radio processing unit (RPU), digital signal processor (DSP), general purpose processor (e.g. a general purpose GPU), microprocessor, or any processing unit which is designed to accelerate tasks outside of a CPU, etc. A computer or computer system may comprise one or more processors. Those skilled in the art will realize that such processing capabilities are incorporated into many different devices and therefore the term ‘computer’ includes set top boxes, media players, digital radios, PCs, servers, mobile telephones, personal digital assistants and many other devices.
[0115] It is also intended to encompass software which defines a configuration of hardware as described herein, such as HDL (hardware description language) software, as is used for designing integrated circuits, or for configuring programmable chips, to carry out desired functions. That is, there may be provided a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition dataset that when processed (i.e. run) in an integrated circuit manufacturing system configures the system to manufacture a fixed-function hardware configured to perform any of the methods described herein, or to manufacture a fixed-function hardware comprising any apparatus described herein. An integrated circuit definition dataset may be, for example, an integrated circuit description.
[0116] Therefore, there may be provided a method of manufacturing, at an integrated circuit manufacturing system, a fixed-function hardware as described herein. Furthermore, there may be provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, causes the method of manufacturing a fixed-function hardware to be performed.
[0117] An integrated circuit definition dataset may be in the form of computer code, for example as a netlist, code for configuring a programmable chip, as a hardware description language defining an integrated circuit at any level, including as register transfer level (RTL) code, as high-level circuit representations such as Verilog or VHDL, and as low-level circuit representations such as OASIS® and GDSII. Higher level representations which logically define an integrated circuit (such as RTL) may be processed at a computer system configured for generating a manufacturing definition of an integrated circuit in the context of a software environment comprising definitions of circuit elements and rules for combining those elements in order to generate the manufacturing definition of an integrated circuit so defined by the representation. As is typically the case with software executing at a computer system so as to define a machine, one or more intermediate user steps (e.g. providing commands, variables etc.) may be required in order for a computer system configured for generating a manufacturing definition of an integrated circuit to execute code defining an integrated circuit so as to generate the manufacturing definition of that integrated circuit.
[0118] An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system so as to configure the system to manufacture a fixed-function hardware will now be described with respect to
[0119]
[0120] The layout processing system 1304 is configured to receive and process the IC definition dataset to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art, and for example may involve synthesising RTL code to determine a gate level representation of a circuit to be generated, e.g. in terms of logical components (e.g. NAND, NOR, AND, OR, MUX and FLIP-FLOP components). A circuit layout can be determined from the gate level representation of the circuit by determining positional information for the logical components. This may be done automatically or with user involvement in order to optimise the circuit layout. When the layout processing system 1304 has determined the circuit layout it may output a circuit layout definition to the IC generation system 1306. A circuit layout definition may be, for example, a circuit layout description.
[0121] The IC generation system 1306 generates an IC according to the circuit layout definition, as is known in the art. For example, the IC generation system 1306 may implement a semiconductor device fabrication process to generate the IC, which may involve a multiple-step sequence of photolithographic and chemical processing steps during which electronic circuits are gradually created on a wafer made of semiconducting material. The circuit layout definition may be in the form of a mask which can be used in a lithographic process for generating an IC according to the circuit definition. Alternatively, the circuit layout definition provided to the IC generation system 1306 may be in the form of computer-readable code which the IC generation system 1306 can use to form a suitable mask for use in generating an IC.
[0122] The different processes performed by the IC manufacturing system 1302 may be implemented all in one location, e.g. by one party. Alternatively, the IC manufacturing system 1302 may be a distributed system such that some of the processes may be performed at different locations and may be performed by different parties. For example, some of the stages of: (i) synthesising RTL code representing the IC definition dataset to form a gate level representation of a circuit to be generated, (ii) generating a circuit layout based on the gate level representation, (iii) forming a mask in accordance with the circuit layout, and (iv) fabricating an integrated circuit using the mask, may be performed in different locations and/or by different parties.
[0123] In other examples, processing of the integrated circuit definition dataset at an integrated circuit manufacturing system may configure the system to manufacture fixed-function hardware without the IC definition dataset being processed so as to determine a circuit layout. For instance, an integrated circuit definition dataset may define the configuration of a reconfigurable processor, such as an FPGA, and the processing of that dataset may configure an IC manufacturing system to generate a reconfigurable processor having that defined configuration (e.g. by loading configuration data to the FPGA).
[0124] In some embodiments, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, may cause an integrated circuit manufacturing system to generate a device as described herein. For example, the configuration of an integrated circuit manufacturing system in the manner described above with respect to
[0125] In some examples, an integrated circuit definition dataset could include software which runs on hardware defined at the dataset or in combination with hardware defined at the dataset. In the example shown in
[0126] Those skilled in the art will realize that storage devices utilized to store program instructions can be distributed across a network. For example, a remote computer may store an example of the process described as software. A local or terminal computer may access the remote computer and download a part or all of the software to run the program. Alternatively, the local computer may download pieces of the software as needed, or execute some software instructions at the local terminal and some at the remote computer (or computer network). Those skilled in the art will also realize that, by utilizing conventional techniques known to those skilled in the art, all or a portion of the software instructions may be carried out by a dedicated circuit, such as a DSP, programmable logic array, or the like.
[0127] The methods described herein may be performed by a computer configured with software in machine readable form stored on a tangible storage medium e.g. in the form of a computer program comprising computer readable program code for configuring a computer to perform the constituent portions of described methods or in the form of a computer program comprising computer program code means adapted to perform all the steps of any of the methods described herein when the program is run on a computer and where the computer program may be embodied on a computer readable storage medium. Examples of tangible (or non-transitory) storage media include disks, thumb drives, memory cards etc. and do not include propagated signals. The software can be suitable for execution on a parallel processor or a serial processor such that the method steps may be carried out in any suitable order, or simultaneously.
[0128] The hardware components described herein may be generated by a non-transitory computer readable storage medium having encoded thereon computer readable program code.
[0129] Memories storing machine executable data for use in implementing disclosed aspects can be non-transitory media. Non-transitory media can be volatile or non-volatile. Examples of volatile non-transitory media include semiconductor-based memory, such as SRAM or DRAM. Examples of technologies that can be used to implement non-volatile memory include optical and magnetic memory technologies, flash memory, phase change memory, resistive RAM.
[0130] A particular reference to “logic” refers to structure that performs a function or functions. An example of logic includes circuitry that is arranged to perform those function(s). For example, such circuitry may include transistors and/or other hardware elements available in a manufacturing process. Such transistors and/or other elements may be used to form circuitry or structures that implement and/or contain memory, such as registers, flip flops, or latches, logical operators, such as Boolean operations, mathematical operators, such as adders, multipliers, or shifters, and interconnect, by way of example. Such elements may be provided as custom circuits or standard cell libraries, macros, or at other levels of abstraction. Such elements may be interconnected in a specific arrangement. Logic may include circuitry that is fixed function and circuitry that can be programmed to perform a function or functions; such programming may be provided from a firmware or software update or control mechanism. Logic identified to perform one function may also include logic that implements a constituent function or sub-process. In an example, hardware logic has circuitry that implements a fixed function operation, or operations, state machine or process.
[0131] The implementation of concepts set forth in this application in devices, apparatus, modules, and/or systems (as well as in methods implemented herein) may give rise to performance improvements when compared with known implementations. The performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During manufacture of such devices, apparatus, modules, and systems (e.g. in integrated circuits) performance improvements can be traded-off against the physical implementation, thereby improving the method of manufacture. For example, a performance improvement may be traded against layout area, thereby matching the performance of a known implementation but using less silicon. This may be done, for example, by reusing functional blocks in a serialised fashion or sharing functional blocks between elements of the devices, apparatus, modules and/or systems. Conversely, concepts set forth in this application that give rise to improvements in the physical implementation of the devices, apparatus, modules, and systems (such as reduced silicon area) may be traded for improved performance. This may be done, for example, by manufacturing multiple instances of a module within a predefined area budget.
[0132] Any range or device value given herein may be extended or altered without losing the effect sought, as will be apparent to the skilled person.
[0133] It will be understood that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. The embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages.
[0134] Any reference to ‘an’ item refers to one or more of those items. The term ‘comprising’ is used herein to mean including the method blocks or elements identified, but such blocks or elements do not comprise an exclusive list and an apparatus may contain additional blocks or elements and a method may contain additional operations or elements. Furthermore, the blocks, elements and operations are themselves not impliedly closed.
[0135] The steps of the methods described herein may be carried out in any suitable order, or simultaneously where appropriate. The arrows between boxes in the figures show one example sequence of method steps but are not intended to exclude other sequences or the performance of multiple steps in parallel. Additionally, individual blocks may be deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought. Where elements of the figures are shown connected by arrows, it will be appreciated that these arrows show just one example flow of communications (including data and control messages) between elements. The flow between elements may be in either direction or in both directions.
[0136] The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.