SOFTMAX AND LOG SOFTMAX METHOD AND SYSTEM

20240061903 · 2024-02-22

Abstract

Circuits and methods for determining a maximum bias for computing softmax on a tensor include a processor circuit configured to transform in parallel, elements of each group of a plurality of groups of elements of a tensor X into respective power-of-two elements. The respective power-of-two element from element x.sub.t of the tensor is p.sub.t, p.sub.t=(x.sub.t*log.sub.2e), and p.sub.t has an integer part and a fraction part. A first comparison circuit (204) is configured to determine respective group-level biases for the groups. The group-level bias of group.sub.m is d.sub.m, and d.sub.m is an integer part of a maximum of the power-of-two elements of group.sub.m. A second comparison circuit is configured to determine a greatest one of the respective group-level biases to be a tensor-level bias, d.sub.max.

Claims

1. A method comprising: transforming in parallel, elements of each group of a plurality of groups of elements of a tensor X into respective power-of-two elements by a processor circuit, wherein the respective power-of-two element from element x.sub.t of the tensor is p.sub.t, p.sub.t=(x.sub.t*log.sub.2e), and p.sub.t has an integer part and a fraction part; determining respective group-level biases for the groups by a comparison circuit (204), wherein the group-level bias of group.sub.m is d.sub.m, and d.sub.m is an integer part of a maximum of the power-of-two elements of group.sub.m; and determining a greatest one of the respective group-level biases by the comparison circuit to be a tensor-level bias, d.sub.max.

2. The method of claim 1, further comprising: adjusting in parallel by a plurality of adder circuits, the respective power-of-two elements of each group into respective group-biased elements based on the respective group-level biases to prevent underflow and overflow; summing the respective group-biased elements by an accumulator circuit for each group into a group-level sum; and summing the group-level sums into a tensor-level sum by an update circuit.

3. The method of claim 2, further comprising: adjusting the respective group-biased elements into respective tensor-biased elements corresponding to the elements of the tensor based on d.sub.max, the respective group-level biases, and exponents of the group-biased elements; and determining softmax values in parallel for elements of each group by a processor circuit, wherein the softmax value of x.sub.t=(tensor-biased element corresponding to x.sub.t)/(tensor-level sum).

4. The method of claim 3, further comprising: determining a log-tensor-sum=log.sub.e(tensor-level sum) by a processor circuit; and determining log-softmax values in parallel for the elements x.sub.t of each group by a processor circuit, wherein log(softmax(x.sub.t))=x.sub.t-d.sub.max-(log-tensor-sum).

5. The method of claim 4, wherein determining softmax values is performed by a first processor circuit, and determining log-softmax values is performed by a second processor circuit, and the first processor circuit and the second processor circuit operate in parallel.

6. The method of claim 4, wherein determining softmax values is performed by a first processor circuit, and determining log-softmax values is performed by a second processor circuit, and the method further comprising: activating the first processor circuit and deactivating the second processor circuit in response to a first state of mode control signals; and deactivating the first processor circuit and activating the second processor circuit in response to a second state of the mode control signals.

7. The method of claim 3, wherein the group-biased element of x.sub.t of group m is equal to e.sup.x_t*2.sup.-d_m, and the adjusting the respective power-of-two elements into respective group-biased elements includes: determining in parallel by a plurality of subtractor circuits, differences between integer portions, x.sub.t_k, of the respective power-of-two elements of a group m and the respective group-level bias, d.sub.m; determining in parallel by a processor circuit from fraction portions, x.sub.t_j, of the respective power-of-two elements of the group m, floating point values of 2.sup.(x_t)_j; and determining in parallel by a plurality of adder circuits, exponents of the respective group-biased elements as sums of the differences from the plurality of subtractor circuits and exponents of the floating point values 2.sup.(x_t)_j.

8. The method of claim 3, wherein transforming the elements of each group into respective power-of-two elements includes transforming the elements of group m+1 of the plurality of groups concurrent with the comparison circuit determining the respective group-level bias for group m.

9. The method of claim 3, wherein: determining the respective group-level biases includes determining the respective group-level biases of the groups in successive time intervals such that the group-level bias of group m is determined in a first time interval, and the group-level bias of group m+1 is determined in a second time interval that follows the first time interval in succession; determining the tensor-level bias includes determining and registering a current tensor-level bias by a comparison circuit that compares the current tensor-level bias to the respective group-level bias as each group-level bias is determined; and wherein summing the group-level sums into the tensor-level sum includes aligning each group-level sum with a current tensor-level sum based on the current tensor-level bias, adding the group-level sum to the current tensor level sum after aligning the group-level sum, and registering an update of the current tensor-level sum.

10. The method of claim 3, wherein adjusting the respective group-biased elements into respective tensor-biased elements includes: determining a difference between each group-level bias d.sub.m and the tensor-level bias, d.sub.max, by a subtractor circuit as (d.sub.m-d.sub.max); and determining exponents of the respective group-biased elements in parallel by a plurality of adder circuits, wherein each exponent is a sum of (d.sub.m-d.sub.max)+(the exponent of the respective group-biased element).

11. A circuit arrangement, comprising: a first processor circuit configured to transform in parallel, elements of each group of a plurality of groups of elements of a tensor X into respective power-of-two elements, wherein the respective power-of-two element from element x.sub.t of the tensor is p.sub.t, p.sub.t=(x.sub.t*log.sub.2e), and p.sub.t has an integer part and a fraction part; a first comparison circuit (204) configured to determine respective group-level biases for the groups, wherein the group-level bias of group.sub.m is d.sub.m, and d.sub.m is an integer part of a maximum of the power-of-two elements of group.sub.m; and a second comparison circuit configured to determine a greatest one of the respective group-level biases to be a tensor-level bias, d.sub.max.

12. The circuit arrangement of claim 11, further comprising: a plurality of adder circuits configured to adjust the respective power-of-two elements of each group in parallel into respective group-biased elements based on the respective group-level biases to prevent underflow and overflow; an accumulator circuit configured to sum the respective group-biased elements for each group into a group-level sum; and an update circuit configured to sum the group-level sums into a tensor-level sum.

13. The circuit arrangement of claim 12, further comprising: an adjustment circuit configured to adjust the respective group-biased elements into respective tensor-biased elements corresponding to the elements of the tensor based on d.sub.max, the respective group-level biases, and exponents of the group-biased elements; and a second processor circuit configured to determine softmax values in parallel for elements of each group, wherein the softmax value of x.sub.t=(tensor-biased element corresponding to x.sub.t)/(tensor-level sum).

14. The circuit arrangement of claim 13, further comprising: a third processor circuit configured to determine a log-tensor-sum=log.sub.e(tensor-level sum); and a fourth processor circuit configured to determine log-softmax values in parallel for the elements x.sub.t of each group, wherein log(softmax(x.sub.t))=x.sub.t-d.sub.max-(log-tensor-sum).

15. The circuit arrangement of claim 14, wherein the first, second, third, and fourth processor circuits are configured to operate in parallel.

16. The circuit arrangement of claim 14, further comprising a control circuit configured to: activate the first and second processor circuits and deactivate the third and fourth processor circuits in response to a first state of mode control signals; and deactivate the first and second processor circuits and activate the third and fourth processor circuits in response to a second state of the mode control signals.

17. The circuit arrangement of claim 13, wherein the group-biased element of x.sub.t of group m is equal to e.sup.x_t*2.sup.-d_m, and further comprising: a plurality of subtractor circuits configured to determine in parallel, differences between integer portions, x.sub.t_k, of the respective power-of-two elements of a group m and the respective group-level bias, d.sub.m; a third processor circuit configured to determine floating point values of 2.sup.(x_t)_j in parallel from fraction portions, x.sub.t_j, of the respective power-of-two elements of the group m; and a plurality of adder circuits configured to determine in parallel, exponents of the respective group-biased elements as sums of the differences from the plurality of subtractor circuits and exponents of the floating point values 2.sup.(x_t)_j.

18. The circuit arrangement of claim 13, wherein the first processor circuit is configured to transform the elements of group m+1 of the plurality of groups concurrent with the comparison circuit determining the respective group-level bias for group m.

19. The circuit arrangement of claim 13, wherein: the first comparison circuit is configured to determine the respective group-level biases of the groups in successive time intervals such that the group-level bias of group m is determined in a first time interval, and the group-level bias of group m+1 is determined in a second time interval that follows the first time interval in succession; the second comparison circuit is configured to determine and register a current tensor-level bias by a comparison circuit that compares the current tensor-level bias to the respective group-level bias as each group-level bias is determined; and the update circuit is configured to align each group-level sum with a current tensor-level sum based on the current tensor-level bias, add the group-level sum to the current tensor level sum after aligning the group-level sum, and register an update of the current tensor-level sum.

20. The circuit arrangement of claim 13, wherein the adjustment circuit includes: a subtractor circuit configured to determine a difference between each group-level bias d.sub.m and the tensor-level bias, d.sub.max, as (d.sub.m-d.sub.max); and a plurality of adder circuits configured to determine exponents of the respective group-biased elements in parallel, wherein each exponent is a sum of (d.sub.m-d.sub.max)+(the exponent of the respective group-biased element).

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] Various aspects and features of the methods and circuits will become apparent upon review of the following detailed description and upon reference to the drawings in which:

[0010] FIG. 1 shows a dataflow diagram of data generated in performing operations according to the disclosed approaches for computing a softmax function on a tensor X;

[0011] FIG. 2 shows an exemplary circuit arrangement for computing softmax and log(softmax) functions on a tensor X;

[0012] FIG. 3 shows a timing diagram of operations performed in computing the softmax function by the circuit arrangement of FIG. 2;

[0013] FIG. 4 shows a timing diagram of operations performed by the circuit arrangement of FIG. 2 in computing the log(softmax) function;

[0014] FIG. 5 shows a timing diagram of operations of the softmax function performed in parallel with operations of the log(softmax) function by the circuit arrangement of FIG. 2; and

[0015] FIG. 6 is a block diagram depicting a System-on-Chip (SoC) that can host circuitry that implements the softmax and log(softmax) functions according to the methods and circuits disclosed herein.

DETAILED DESCRIPTION

[0016] In the following description, numerous specific details are set forth to describe specific examples presented herein. It should be apparent, however, to one skilled in the art, that one or more other examples and/or variations of these examples may be practiced without all the specific details given below. In other instances, well known features have not been described in detail so as not to obscure the description of the examples herein. For ease of illustration, the same reference numerals may be used in different diagrams to refer to the same elements or additional instances of the same element.

[0017] The disclosed approaches provide methods and circuitry that address the aforementioned issues. These methods and circuits are useful in neural network inference and training. According to the disclosed approaches, the exponential functions of softmax are transformed into 2.sup.x form. The transformation is explained as follows.


a^b = e^(ln(a^b)) = e^(b*ln a)

For a=2 and b*ln(a)=x,

[00001] e^x = 2^(x/ln 2) = 2^(x*log_2 e)

e^(x-x_max) = 2^((x-x_max)/ln 2) = 2^(x*log_2 e) * 2^(-x_max*log_2 e)

A bias, d.sub.max, is computed to prevent overflow and underflow and to align terms for summing as:


d.sub.max=[x.sub.max*log.sub.2e]

where [] is a floor operation. The softmax function can be restated as:

[00002] softmax(x_t) = e^(x_t-bias) / Σ e^(x_t-bias) = (2^(-d_max)*e^(x_t)) / (2^(-d_max)*Σ e^(x_t)) = (2^(-d_max)*2^(x_t*log_2 e)) / (2^(-d_max)*Σ 2^(x_t*log_2 e))

where x.sub.t is element t of tensor X, and the summations are of all x.sub.t in X.
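As a plausibility check of the restated form (a plain software sketch, not the claimed circuitry; the function names are illustrative), the power-of-two formulation can be compared against the conventional max-subtracted softmax:

```python
import math

def softmax_ref(xs):
    # Conventional max-subtracted softmax, for comparison.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def softmax_pow2(xs):
    # Restated form: each term is 2^(-d_max) * 2^(x_t*log_2 e),
    # with d_max = floor(x_max * log_2 e).
    log2e = math.log2(math.e)
    d_max = math.floor(max(xs) * log2e)
    terms = [2.0 ** (x * log2e - d_max) for x in xs]
    s = sum(terms)
    return [t / s for t in terms]
```

Because the 2^(-d_max) factor appears in both numerator and denominator, the two forms agree up to floating-point rounding; the bias only keeps the intermediate terms in range.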

[0018] The term 2^(x_t*log_2 e) can be written as:


2^(x_t*log_2 e) = 2^(x_t_k) * 2^(x_t_j)

where x.sub.t_k is the integer part and x.sub.t_j is the fractional part of (x.sub.t*log.sub.2e), with x.sub.t_j in the interval [0, 1). To calculate the softmax function according to the disclosed approaches, three components are calculated: x.sub.t_k, 2.sup.(x_t)_j, and d.sub.max. Because x.sub.t_j is in the interval [0, 1), 2.sup.(x_t)_j can be approximated by polynomial fitting with acceptable precision and degree. The value 2.sup.(x_t)_j is a floating-point number and can have an 8-bit exponent, for example. After polynomial fitting, the exponent is modified by x.sub.t_k-d.sub.max.
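The split into x.sub.t_k and x.sub.t_j and the polynomial approximation of 2.sup.(x_t)_j can be sketched as follows. This is a software illustration, not the disclosed circuit; the patent does not specify the fitted polynomial, so a degree-3 Taylor expansion of 2^j stands in for it, and the helper names are hypothetical:

```python
import math

LOG2E = math.log2(math.e)

def pow2_parts(x):
    # p_t = x_t * log_2 e; split into integer part k and fraction j in [0, 1),
    # so that e^x = 2^p = 2^k * 2^j.
    p = x * LOG2E
    k = math.floor(p)
    return k, p - k

def two_pow_frac(j):
    # Stand-in for the polynomial fit of 2^j on [0, 1):
    # a degree-3 Taylor expansion of e^(j*ln 2).
    t = j * math.log(2.0)
    return 1.0 + t + t * t / 2.0 + t * t * t / 6.0

def exp_via_pow2(x):
    # Reassemble e^x from the integer shift and the fitted fraction.
    k, j = pow2_parts(x)
    return (2.0 ** k) * two_pow_frac(j)
```

In hardware the 2^k factor is a pure exponent adjustment, which is why only 2^j needs arithmetic approximation.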

[0019] The disclosed methods and circuitry significantly reduce the time expended in computing d.sub.max by dividing an input tensor into several groups, converting tensor elements into power-of-two values, determining group-level biases, adjusting the power-of-two values according to the group-level biases, and summing the adjusted values of the groups. The number of tensor elements in each group is set according to the desired level of computational parallelism. In the exemplary methods and circuitry, each group has 8 tensor elements, though different implementations can have more or fewer tensor elements depending on hardware capabilities.
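The group-level and tensor-level bias computation can be sketched in software as follows (an arithmetic illustration of the grouping, not the claimed parallel hardware; the function name is hypothetical):

```python
import math

def group_level_biases(xs, group_size=8):
    # Split the tensor into groups and take, per group, the integer part of
    # the group's maximum power-of-two element as the group-level bias d_m.
    log2e = math.log2(math.e)
    groups = [xs[i:i + group_size] for i in range(0, len(xs), group_size)]
    d = [math.floor(max(x * log2e for x in g)) for g in groups]
    # The tensor-level bias d_max is the greatest group-level bias.
    return d, max(d)
```

Because floor and max commute here, the greatest group-level bias equals the bias that would be computed over the whole tensor at once.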

[0020] FIG. 1 shows a dataflow diagram of data generated in performing operations according to the disclosed approaches for computing a softmax function on a tensor X. The input tensor has n+1 elements (x.sub.0 . . . x.sub.n) and i+1 groups of elements (group.sub.0 . . . group.sub.i). Each group has p+1 elements (e.g., p=7). The input tensor is provided in buffer1 as shown by dashed block 102. Buffer1 can be an on-chip or off-chip RAM (relative to computational circuitry), and the tensor elements can be input by a streaming or direct memory access (DMA) interface.

[0021] The groups of elements can be input one group at a time, and a processor circuit is configured to multiply the elements of the group by log.sub.2e in parallel (x.sub.t*log.sub.2e for t=0 . . . 7). The purpose of multiplying x.sub.t by log.sub.2e is to transform e.sup.x_t to the form 2.sup.y (power-of-two form), where y=x.sub.t*log.sub.2e, per the derivations above. The products produced from each group m are used to determine the group-level bias, d.sub.m. The d.sub.m of group m is the integer part of the greatest one of the products of the group ([max(x.sub.t*log.sub.2e for t=0 . . . 7)]). A tensor-level bias, d.sub.max, is determined by finding the greatest of the d.sub.m values as the groups are successively processed.

[0022] The d.sub.m along with the x.sub.t_k and 2.sup.(x_t)_j values are used to adjust the computed products and prevent overflow and underflow relative to the group. The 2.sup.(x_t)_j values for a group are determined by polynomial fitting, and the power-of-two values are adjusted by x.sub.t_k-d.sub.m+(the exponent bits of 2.sup.(x_t)_j). The group-biased power-of-two values, e.sup.x_t*2.sup.-d_m, are stored in buffer2, shown by dashed block 104, in association with the group-level d.sub.m as each group is computed. Each e.sup.x_t*2.sup.-d_m is a floating point value having an exponent equal to x.sub.t_k-d.sub.m+(the exponent bits of 2.sup.(x_t)_j), and a mantissa equal to the mantissa of 2.sup.(x_t)_j. Buffer2 can be an on-chip or off-chip RAM (relative to computational circuitry), and the group-level d.sub.m values and associated group-biased power-of-two values can be input by a streaming or direct memory access (DMA) interface.

[0023] The adjusted power-of-two values are accumulated into a group-level sum (2.sup.-d_m sum.sub.m=SUM.sub.group_m=sum(e.sup.x_t*2.sup.-d_m) for all t in group.sub.m) as the adjusted power-of-two values are computed. The group-level sums are accumulated into a tensor-level sum as each group is accumulated. The group-level d.sub.m is compared to the current tensor-level d.sub.max once d.sub.m is determined, and the group-level sum 2.sup.-d_m sum.sub.m is aligned with the current sum (2.sup.-d_max sum) according to the current value of d.sub.max. Once aligned, the aligned group-level sum is added to the current sum (2.sup.-d_max sum) to produce a new current sum.
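The running accumulation with alignment can be sketched as below. Plain floating-point scaling stands in for the exponent-shift alignment the update circuit would perform, and the function name is illustrative:

```python
def accumulate_group_sums(group_sums):
    # Each entry is (sum_m, d_m), where sum_m = 2^(-d_m) * sum(e^x_t) over
    # the group. Maintain a running (SUM, d_max); rescale whichever side has
    # the smaller bias before adding, mirroring the alignment step.
    SUM, d_max = 0.0, None
    for sum_m, d_m in group_sums:
        if d_max is None or d_m > d_max:
            if d_max is not None:
                SUM *= 2.0 ** (d_max - d_m)  # re-align running sum to new bias
            d_max = d_m
        SUM += sum_m * 2.0 ** (d_m - d_max)  # align group sum, then add
    return SUM, d_max
```

The invariant after each step is SUM = 2^(-d_max) * (sum of e^x_t over all groups processed so far), which is exactly the tensor-level sum the next stage expects.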

[0024] The tensor-level d.sub.max and tensor-level sum, 2.sup.-d_max sum (which is 2.sup.-d_max*sum(e.sup.x_t)), are available for the next stage of softmax computation once the tensor elements of group.sub.i have been processed.

[0025] Once d.sub.max has been determined, the group-biased power-of-two values (e.sup.x_t*2.sup.-d_m) are tensor-wise adjusted based on the tensor-level d.sub.max value. The tensor-wise adjustments for elements in group.sub.m are made by retrieving from buffer2 the bias d.sub.m and the exponents of the associated group-wise-adjusted power-of-two values e.sup.x_t*2.sup.-d_m. The exponents of e.sup.x_t*2.sup.-d_m in group.sub.m are added to (d.sub.m-d.sub.max) to generate the exponents of the e.sup.x_t*2.sup.-d_max values, which are illustrated in the column 106 of blocks. The mantissas of the e.sup.x_t*2.sup.-d_max values are the same as the mantissas of the corresponding values from buffer2. Though not shown in FIG. 1, it will be recognized that each of the e.sup.x_t*2.sup.-d_max values in column 106 is divided by the tensor-level sum, 2.sup.-d_max sum, to generate softmax(x.sub.t).
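The full two-stage flow of FIG. 1 can be sketched end to end as follows (a software model, assuming plain floats in place of the exponent/mantissa manipulation; the function name is hypothetical):

```python
import math

def grouped_softmax(xs, group_size=8):
    # Stage 1: per group, store e^x_t * 2^(-d_m) and the group bias d_m.
    # Stage 2: rebias each stored value by 2^(d_m - d_max) and divide by the
    # tensor-level sum.
    log2e = math.log2(math.e)
    groups = [xs[i:i + group_size] for i in range(0, len(xs), group_size)]
    d, biased = [], []
    for g in groups:
        d_m = math.floor(max(x * log2e for x in g))
        d.append(d_m)
        biased.append([2.0 ** (x * log2e - d_m) for x in g])  # e^x * 2^(-d_m)
    d_max = max(d)
    # Tensor-level sum, aligned to d_max: 2^(-d_max) * sum(e^x_t).
    total = sum(2.0 ** (d_m - d_max) * sum(b) for d_m, b in zip(d, biased))
    out = []
    for d_m, b in zip(d, biased):
        scale = 2.0 ** (d_m - d_max)          # tensor-wise rebias
        out.extend(v * scale / total for v in b)
    return out
```

The 2^(d_m - d_max) rebias corresponds to the exponent addition in the description; in the circuit it touches only exponent bits, while the mantissas pass through unchanged.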

[0026] FIG. 2 shows an exemplary circuit arrangement 200 for computing softmax and log(softmax) functions on a tensor X. The circuit arrangement generally includes one or more processor circuits configured to perform parallel multiply-and-accumulate operations, registers for storing temporary result values, and various addition and subtraction circuits.

[0027] A group of p+1 tensor elements (x.sub.t, t=0 . . . p) is read from buffer1 102 and input in parallel to processor circuitry 202. Processor circuitry 202 computes products (power-of-two elements) of x.sub.t*log.sub.2e for t=0 . . . p in parallel. The p+1 power-of-two elements are provided on parallel signal lines to circuit 204, which compares values of the p+1 power-of-two elements and extracts and provides the integer portion of the greatest one of the values as d.sub.m. The compare-and-select circuit 206 compares the d.sub.m value from circuit 204 to the current d.sub.max value in register 208 and selects the greater of the two values to update the contents of the register.

[0028] The power-of-two elements computed by processor circuits 202 are floating point values, and the integer portions (group.sub.m x.sub.t_k) and fraction portions (group.sub.m x.sub.t_j) of the values are determined from the mantissas and exponents. The integer portions are provided to the subtraction circuits 210, and the fraction portions are provided to the processor circuitry 212, which can be a vector processor that performs multiply-and-accumulate (MAC) operations in parallel.

[0029] The subtraction circuits 210 compute in parallel the differences between the integer portions and the group-level bias, d.sub.m (x.sub.t_k-d.sub.m for t=0 . . . p). The processor circuitry 212 computes in parallel 2.sup.(x_t)_j for t=0 . . . p by polynomial fitting of the fraction portions, x.sub.t_j. The tensor elements of the next group (group.sub.m+1) can be input to the processor circuitry 202 for computing the power-of-two elements while circuit 204 determines the group-level bias d.sub.m, the subtraction circuits 210 compute the differences (x.sub.t_k-d.sub.m for t=0 . . . p), and the processor circuitry 212 computes 2.sup.(x_t)_j for t=0 . . . p for group.sub.m.

[0030] The differences and exponents of the 2.sup.(x_t)_j values are input to adder circuits 214 that compute in parallel the exponents of the group-biased power-of-two elements. Each e.sup.x_t*2.sup.-d_m is a floating point value having an exponent equal to x.sub.t_k-d.sub.m+(the exponent bits of 2.sup.(x_t)_j), and a mantissa equal to the mantissa of 2.sup.(x_t)_j. The group-biased power-of-two values for the group are stored in buffer2 104 in association with the group-level bias d.sub.m.

[0031] The group-biased power-of-two values for the group are input to summing circuit 216, which sums the group-biased power-of-two values into a group-level sum (SUM.sub.m=sum(e.sup.x_t*2.sup.-d_m) for all x.sub.t in group m).

[0032] The update circuit 218 accumulates the group-level sums as each group-level sum is provided by summing circuit 216. The update circuit 218 inputs the group-level sum from summing circuit 216, the current greatest bias value, d.sub.max from register 208, and the current accumulated SUM from register 220. The update circuit aligns the group-level sum and the current accumulated SUM according to d.sub.max and produces a new SUM that is stored in register 220.

[0033] Once all groups of tensor elements of a tensor (e.g., group.sub.0 . . . group.sub.i of a tensor having i+1 groups) have been processed and a final tensor-level sum has been computed, control circuit 222 can activate the final softmax circuitry 224. The final softmax circuitry generates final softmax values group-by-group, with the p+1 softmax values generated in parallel. The final softmax circuitry inputs the tensor-level bias, d.sub.max, from register 208, the final tensor-level SUM from register 220 (SUM=2.sup.-d_max*sum(2.sup.x_t*log.sub.2e)), and reads the group-biased power-of-two elements of group.sub.m and the associated group-level bias d.sub.m from buffer2 104.

[0034] The subtractor circuit 226 of the final softmax circuitry determines the difference between the group-level bias, d.sub.m, and the tensor-level bias, d.sub.max (d.sub.m-d.sub.max). The difference from subtractor circuit 226 and the exponents of the group-biased power-of-two elements (exp(e.sup.x_t*2.sup.-d_m)) are input to adder circuits 228. The adder circuits 228 compute in parallel sums of the difference and the exponents of the e.sup.x_t*2.sup.-d_m terms from buffer2. The sums from exponent adders 228 are exponents that are paired with the corresponding mantissas of the e.sup.x_t*2.sup.-d_m terms from buffer2 to provide the tensor-biased terms, x.sub.t_dmax, as dividends to the divider circuit 230. The exponent of x.sub.t_dmax is exp(x.sub.t_dmax)=(d.sub.m-d.sub.max)+exp(e.sup.x_t*2.sup.-d_m), and the mantissa of x.sub.t_dmax is man(x.sub.t_dmax)=man(e.sup.x_t*2.sup.-d_m).

[0035] Divider circuitry 230, which can be a vector division circuit, computes in parallel the final softmax values (x.sub.t_dmax/SUM for t=0 . . . p).

[0036] The control circuit 222 controls activation of the final softmax circuit 224 and final log_softmax circuit 232. The final softmax circuit and the final log_softmax circuit can be operated alone or in parallel with one another. For example, in response to a state of mode control signals, the control circuit 222 can activate the final softmax circuit 224 and deactivate final log_softmax circuit 232, deactivate the final softmax circuit 224 and activate final log_softmax circuit 232, or activate both the final softmax circuit 224 and the final log_softmax circuit 232 to operate in parallel. The control circuit 222 can gate clock signals to the final softmax circuit 224 and final log_softmax circuit 232 to reduce power consumption when only one of the circuits is activated.

[0037] The formula of log(softmax):

[00003] log(softmax(x)) = log(e^(x_i-x_max) / Σ e^(x_i-x_max)) = (x_i-x_max) - log(Σ e^(x_i-x_max))

The term 2^(-d_max)*2^(x_i*log_2 e) is used to replace e^(x_i-x_max), and d.sub.max is used to replace x.sub.max as explained above. Thus, the log(softmax) can be restated as:


log(softmax(x_i)) = (x_i-d_max) - log(Σ 2^(-d_max)*2^(x_i*log_2 e))

The input variable of the natural log function is a floating point number, which is represented in the form:


(-1)^s * (1+M) * 2^(E-E_0)

where s is the sign bit, M is the mantissa, E is the exponent, which is shifted by a constant bias E.sub.0. The log function can be written as:

[00004] log(y) = log_2(y)/log_2(e) = (1/log_2 e)*log_2((1+M_y)*2^(E_y-E_0)) = (1/log_2 e)*((E_y-E_0) + log_2(1+M_y))

where M.sub.y is in the interval [0, 1), and log.sub.2(1+M.sub.y) can be calculated by polynomial fitting.
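The exponent/mantissa decomposition of the log can be sketched in software as below. Here `math.frexp` stands in for extracting the floating-point fields, and `math.log2` stands in for the fitted polynomial over [0, 1); the function name is illustrative:

```python
import math

def log_from_float_parts(y):
    # log(y) = (1/log_2 e) * ((E - E0) + log_2(1 + M)) for y = (1+M)*2^(E-E0).
    # math.frexp returns y = m * 2**e with m in [0.5, 1); shifting once gives
    # a mantissa in [1, 2), matching the (1+M) form used above.
    m, e = math.frexp(y)
    M = 2.0 * m - 1.0      # fraction in [0, 1)
    E = e - 1              # unbiased exponent for the (1+M)*2^E form
    return (E + math.log2(1.0 + M)) / math.log2(math.e)
```

Only log_2(1+M) over [0, 1) requires approximation; the exponent contributes through a simple add, which is what makes the hardware split attractive.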

[0038] The final log_softmax circuit 232 is activated once the tensor-level d.sub.max is available. A group of p+1 tensor elements (x.sub.t, t=0 . . . p) is read from buffer1 102 and input in parallel to the subtraction circuits 234. The subtraction circuits compute in parallel the differences (x.sub.t-d.sub.max for t=0 . . . p).

[0039] The mantissa (M.sub.SUM) of the SUM is input to the processor circuitry 212, which is configured to compute log.sub.2(1+M.sub.SUM). The exponent of the SUM (E.sub.SUM) is input to circuit 236, which converts E.sub.SUM to a floating point value. Adder 238 sums the values output from circuits 212 and 236 (float(E.sub.SUM)+log.sub.2(1+M.sub.SUM)), and the sum is input to processor circuitry 240. Processor circuitry 240 is configured to compute:


SUM.sub.log=(float(E.sub.SUM)+log.sub.2(1+M.sub.SUM))/log.sub.2e

The processor circuitry 240 can be processor circuitry dedicated to computing SUM.sub.log, or can be implemented by circuitry 212.

[0040] The SUM.sub.log and p+1 differences (x.sub.t-d.sub.max for t=0 . . . p) from subtraction circuits 234 are input to subtraction circuits 242. The subtraction circuits 242 compute in parallel (x.sub.t-d.sub.max)-SUM.sub.log for t=0 . . . p, and the p+1 output terms are log(softmax(x.sub.t)).

[0041] FIG. 3 shows a timing diagram of operations performed in computing the softmax function by the circuit arrangement of FIG. 2. Each block summarizes an operation(s) performed in computing the softmax function, and the horizontal alignment of the blocks indicates relative time slots in which the operations are performed. Vertically aligned blocks indicate operations performed in the same time slot (in parallel).

[0042] The example shows the relative timing of operations involved in processing groups 0, 1, m, m+1, and i of i+1 groups of tensor elements (1<m<i). In time slot t0, p+1 tensor elements of a group are input to the circuit arrangement. In time slot t1, the tensor elements of group.sub.0 are multiplied by log.sub.2e, and in parallel therewith, the tensor elements of group.sub.1 are input. In time slot t2, the group-level bias, d.sub.0, is determined, along with polynomial fitting of the x.sub.j terms and differences between the x.sub.k terms and d.sub.0. Also in time slot t2, the tensor elements of group.sub.1 are multiplied by log.sub.2e. Though not shown, the tensor elements of group.sub.2 would be input in time slot t2.

[0043] In time slot t3, the differences and exponents of the 2.sup.(x_t)_j values computed from group.sub.0 are summed into the group-biased power-of-two elements and stored in association with the group-level bias d.sub.0. The group-biased power-of-two values for group.sub.0 are summed into a group-level sum (SUM.sub.group0). Also, in time slot t3, the group-level bias, d.sub.1, is determined, along with polynomial fitting of the x.sub.j terms and differences between the x.sub.k terms and d.sub.1.

[0044] In time slot t4, the group-level bias, d.sub.0, is compared to the current tensor-level bias, d.sub.max, and the current tensor-level d.sub.max is updated to the value of d.sub.0, since d.sub.0 is the first maximum computed. Also during time slot t4, the differences and exponents of the 2.sup.(x_t)_j values computed from group.sub.1 are summed into the group-biased power-of-two elements and stored in association with the group-level bias d.sub.1. The group-biased power-of-two values for group.sub.1 are summed into a group-level sum (SUM.sub.group1).

[0045] In time slot t5, the group-level sum is accumulated with the current SUM. The group-level sum SUM.sub.group_0 is aligned with the current accumulated SUM according to the current d.sub.max, and the aligned values are added to produce a new SUM. Also in time slot t5, the group-level bias, d.sub.1, is compared to the current tensor-level bias, d.sub.max. If d.sub.1>d.sub.max, then the current tensor-level bias, d.sub.max, is updated to the value of d.sub.1. Otherwise, d.sub.max remains unchanged.

[0046] In time slot t6, the group-level sum is accumulated with the current SUM. The group-level sum SUM.sub.group_1 is aligned with the current accumulated SUM according to the current d.sub.max, and the aligned values are added to produce a new SUM.

[0047] FIG. 3 shows similar processing of group.sub.m tensor elements beginning in time slot t0+m, and of group.sub.m+1 tensor elements beginning in time slot t0+m+1.

[0048] The final group.sub.i of tensor elements commences processing in time slot t0+i, and the processing is similar to that described above for time slots t0+i through t0+i+5. In time slot t0+i+6, the final operations of softmax processing begin.

[0049] In time slot t0+i+6, the group-biased power-of-two elements of group.sub.0 and the associated group-level bias d.sub.0 are input, and e.sup.x_t*2.sup.-d_max values, x.sub.t_dmax, are computed for group.sub.0 as described above. Each e.sup.x_t*2.sup.-d_max is a floating point value having an exponent equal to x.sub.t_k-d.sub.max+(the exponent bits of 2.sup.(x_t)_j), and a mantissa equal to the mantissa of 2.sup.(x_t)_j. In time slot t0+i+7, the p+1 softmax values of group.sub.0 are computed as (x.sub.t_dmax/SUM for t=0 . . . p) and then output. Though not shown, the operations in time slots t0+i+6 and t0+i+7 would be performed for group.sub.1 . . . group.sub.i in ensuing time slots. For example, in time slot t0+i+7, the group-biased power-of-two elements of group.sub.1 and the associated group-level bias d.sub.1 are input, and e.sup.x_t*2.sup.-d_max values are computed for group.sub.1. In time slot t0+i+8, the p+1 softmax values of group.sub.1 are computed as (x.sub.t_dmax/SUM for t=0 . . . p) and then output.

[0050] FIG. 4 shows a timing diagram of operations performed by the circuit arrangement of FIG. 2 in computing the log(softmax) function. The operations are the same as those described in FIG. 3 for softmax through time slot t0+i+4.

[0051] In time slot t0+i+5, the p+1 tensor elements of group.sub.0 are input, and parallel subtraction circuits compute the differences (x.sub.t−d.sub.max for t=0 . . . p).

[0052] In timeslot t0+i+6, log.sub.2(1+M.sub.SUM) is computed from the mantissa (M.sub.SUM) of the SUM, and the exponent of the SUM (E.sub.SUM) is converted to a floating point value.

[0053] In timeslot t0+i+7, the log.sub.2(1+M.sub.SUM) and float(E.sub.SUM) values are summed.

[0054] In timeslot t0+i+8, the SUM.sub.log term is computed from the log.sub.2(1+M.sub.SUM) and float(E.sub.SUM) values as:


(float(E.sub.SUM)+log.sub.2(1+M.sub.SUM))/log.sub.2e

[0055] In timeslot t0+i+9, the p+1 log(softmax) values of group.sub.0 are computed in parallel as ((x.sub.t−d.sub.max)−SUM.sub.log for t=0 . . . p), and then output.
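The SUM.sub.log computation of paragraphs [0052]-[0054] can be checked with a short numerical sketch. The model below is an assumption: it recovers E.sub.SUM and M.sub.SUM from an ordinary Python float via `math.frexp` (converting to the IEEE-754-style 1.M form), whereas the circuit reads them directly from the accumulated SUM register. The function name `log_sum_from_float` is illustrative.

```python
import math

def log_sum_from_float(sum_value):
    """Recover ln(SUM) from the exponent/mantissa decomposition of SUM,
    mirroring SUM_log = (float(E_SUM) + log2(1 + M_SUM)) / log2(e)."""
    mantissa, exp = math.frexp(sum_value)  # sum_value = mantissa * 2**exp, mantissa in [0.5, 1)
    e_sum = exp - 1                        # convert to 1.M form: sum_value = (1 + M) * 2**E
    m_sum = mantissa * 2.0 - 1.0
    return (float(e_sum) + math.log2(1.0 + m_sum)) / math.log2(math.e)
```

Dividing the base-2 logarithm by log.sub.2e converts it to the natural logarithm, so SUM.sub.log equals ln(SUM), the term subtracted from each difference in paragraph [0055].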

[0056] FIG. 5 shows a timing diagram of operations of the softmax function performed in parallel with operations of the log(softmax) function by the circuit arrangement of FIG. 2.

[0057] The operations are the same as those described in FIGS. 3 and 4 through time slot t0+i+5. In time slot t0+i+6, the e.sup.x_t*2.sup.d_max values are computed in a softmax operation for group.sub.0, and the log.sub.2(1+M.sub.SUM) and float(E.sub.SUM) values are computed in log(softmax) operations.

[0058] In time slot t0+i+7, the p+1 softmax values of group.sub.0 are computed and then output. In parallel with the final softmax operation in time slot t0+i+7, the log.sub.2(1+M.sub.SUM) and float(E.sub.SUM) values are summed for log(softmax). The log(softmax) operations in time slots t0+i+8 and t0+i+9 are as described in FIG. 4.
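The parallel softmax/log(softmax) flow of FIG. 5 can be summarized as a single pass over shared intermediates. The sketch below is a behavioral model under the same assumptions as the earlier sketches (exact floating-point arithmetic, no pipelining); the identity used for the log(softmax) branch is that 2.sup.d_max*SUM equals the sum of e.sup.x_j, so log(softmax(x.sub.t))=x.sub.t−d.sub.max*ln 2−ln(SUM). The function name `softmax_and_log_softmax` is illustrative.

```python
import math

def softmax_and_log_softmax(x):
    """One pass producing both softmax and log(softmax) from the shared
    intermediates (p_t, d_max, SUM), as in the parallel timing of FIG. 5."""
    log2e = math.log2(math.e)
    p = [xt * log2e for xt in x]                # power-of-two elements
    d_max = math.floor(max(p))                  # tensor-level bias
    biased = [2.0 ** (pt - d_max) for pt in p]  # group-biased elements
    total = sum(biased)                         # tensor-level SUM
    sum_log = math.log(total)                   # SUM_log = ln(SUM)
    sm = [b / total for b in biased]            # softmax branch
    # log(softmax) branch: x_t - d_max*ln2 - ln(SUM)
    lsm = [xt - d_max * math.log(2.0) - sum_log for xt in x]
    return sm, lsm
```

The two output branches reuse the same p.sub.t, d.sub.max, and SUM values, which is what allows the circuit arrangement of FIG. 2 to produce both results with only two extra time slots of latency.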

[0059] FIG. 6 is a block diagram depicting a System-on-Chip (SoC) 601 that can host circuitry that implements the softmax and log(softmax) functions according to the methods and circuits disclosed herein. In the example, the SoC includes the processing subsystem (PS) 602 and the programmable logic subsystem 603. The processing subsystem 602 includes various processing units, such as a real-time processing unit (RPU) 604, an application processing unit (APU) 605, a graphics processing unit (GPU) 606, a configuration and security unit (CSU) 612, and a platform management unit (PMU) 611. The PS 602 also includes various support circuits, such as on-chip memory (OCM) 614, transceivers 607, peripherals 608, interconnect 616, DMA circuit 609, memory controller 610, peripherals 615, and multiplexed (MIO) circuit 613. The processing units and the support circuits are interconnected by the interconnect 616. The PL subsystem 603 is also coupled to the interconnect 616. The transceivers 607 are coupled to external pins 624. The PL 603 is coupled to external pins 623. The memory controller 610 is coupled to external pins 622. The MIO 613 is coupled to external pins 620. The PS 602 is generally coupled to external pins 621. The APU 605 can include a CPU 617, memory 618, and support circuits 619. The APU 605 can include other circuitry, including L1 and L2 caches and the like. The RPU 604 can include additional circuitry, such as L1 caches and the like. The interconnect 616 can include cache-coherent interconnect or the like.

[0060] Referring to the PS 602, each of the processing units includes one or more central processing units (CPUs) and associated circuits, such as memories, interrupt controllers, direct memory access (DMA) controllers, memory management units (MMUs), floating point units (FPUs), and the like. The interconnect 616 includes various switches, busses, communication links, and the like configured to interconnect the processing units, as well as interconnect the other components in the PS 602 to the processing units.

[0061] The OCM 614 includes one or more RAM modules, which can be distributed throughout the PS 602. For example, the OCM 614 can include battery backed RAM (BBRAM), tightly coupled memory (TCM), and the like. The memory controller 610 can include a DRAM interface for accessing external DRAM. The peripherals 608, 615 can include one or more components that provide an interface to the PS 602. For example, the peripherals can include a graphics processing unit (GPU), a display interface (e.g., DisplayPort, high-definition multimedia interface (HDMI) port, etc.), universal serial bus (USB) ports, Ethernet ports, universal asynchronous transceiver (UART) ports, serial peripheral interface (SPI) ports, general purpose (GPIO) ports, serial advanced technology attachment (SATA) ports, PCIe ports, and the like. The peripherals 615 can be coupled to the MIO 613. The peripherals 608 can be coupled to the transceivers 607. The transceivers 607 can include serializer/deserializer (SERDES) circuits, MGTs, and the like.

[0062] Various logic may be implemented as circuitry to carry out one or more of the operations and activities described herein and/or shown in the figures. In these contexts, a circuit or circuitry may be referred to as logic, module, engine, or block. It should be understood that logic, modules, engines and blocks are all circuits that carry out one or more of the operations/activities. In certain implementations, a programmable circuit is one or more computer circuits programmed to execute a set (or sets) of instructions stored in a ROM or RAM and/or operate according to configuration data stored in a configuration memory.

[0063] Though aspects and features may in some cases be described in individual figures, it will be appreciated that features from one figure can be combined with features of another figure even though the combination is not explicitly shown or explicitly described as a combination.

[0064] The methods and circuits are thought to be applicable to a variety of systems that compute softmax and log(softmax) functions. Other aspects and features will be apparent to those skilled in the art from consideration of the specification. The methods and circuits may be implemented as one or more processors configured to execute software, as an application specific integrated circuit (ASIC), or as logic on a programmable logic device. It is intended that the specification and drawings be considered as examples only, with a true scope of the invention being indicated by the following claims.