Method and Apparatus for Decimation in Frequency FFT Butterfly
20170011005 ยท 2017-01-12
Assignee
Inventors
Cpc classification
G06F17/142
PHYSICS
G06F1/025
PHYSICS
H03K17/165
ELECTRICITY
H03K7/10
ELECTRICITY
International classification
Abstract
A pipelined decimation in frequency FFT butterfly method, and an apparatus to perform this method comprising: a data memory with at least one read port and one write port; an add/subtract unit receiving data from the memory; a multiply/accumulate unit receiving data from the add/subtract unit; a source of coefficients, from logic gates or a coefficient memory, to supply FFT twiddle factors to the multiply/accumulate unit; a shifter receiving data from at least one of the add/subtract unit and the multiply/accumulate unit, the shifter supplying data to the write port of the data memory; wherein the apparatus performs these calculations in four cycles of the add/subtract unit and in four cycles of the multiply/accumulate unit, using complex arithmetic.
Claims
1. A decimation in frequency FFT butterfly processor comprising: a data memory, having at least one read port and at least one write port, adapted to store a write data; a first register, operably coupled to the read port, adapted to store first data received from the read port; a second register, operably coupled to the read port, adapted to store second data received from the read port; a third register, operably coupled to the first register, adapted to store a fourth data comprising a selected one of the first data and inverted first data received from the first register; an adder, operably coupled to the second and third registers, adapted to add the second data and the fourth data to produce a fifth data comprising a selected one of a sum and a difference; a fourth register, operably coupled to the adder, adapted to store the fifth data received from the adder; a fifth register, operably coupled to a coefficient source, adapted to store a coefficient received from a coefficient source; a first accumulator register; a second accumulator register; a multiplier/accumulator unit, operably coupled to the fourth and fifth registers and to the first and second accumulator registers, adapted to: produce a product of the fifth data received from the fourth register and the coefficient received from the fifth register; add the product and a sum stored in a selected one of the first and the second accumulator register; and store the sum in a selected one of the first and second accumulator registers; a sixth register, operably coupled to the adder, adapted to store a sixth data comprising a selected one of the sum and difference received from the adder; and a multiplexer, operably coupled to the sixth register, the first and second accumulator registers and to the write port of the data memory, adapted to select as the write data a selected one of the sixth data stored in the sixth register and the sum stored in a selected one of the first or second accumulator registers, and to provide the selected write data to the write port of the data memory.
2. The decimation in frequency FFT butterfly processor of claim 1, wherein the fourth register is operably coupled to receive the inversion of the fourth register in anticipation of the multiplier/accumulator unit calculating a negative product of the fourth and fifth registers.
3. The decimation in frequency FFT butterfly processor of claim 1: wherein the multiplier/accumulator unit is further characterized as producing the product and the sum in a carry-save format, and the first and second accumulator registers are further characterized as storing the sum received from the adder in the carry-save format, wherein at least one bit position is saved as a sum bit and a carry bit; and wherein the multiplexer is further characterized as comprising a carry propagate adder adapted to produce as the write data a propagated sum from the sum and carry bits stored in a selected one of the first and second accumulator registers.
4. The decimation in frequency FFT butterfly processor of claim 1, further comprising: a shifter, operably coupled between the multiplexer and the write port of the data memory, adapted to select as the write data at least a subset of the multiplexer bits to be written to the data memory.
5. The decimation in frequency FFT butterfly processor of claim 4, wherein the shifter is further characterized as providing to the write port of the data memory the write data selected by the multiplexer multiplied by a selected one of and 1.
6. A method for use in an audio CODEC, the method adapted to calculate a decimation in frequency butterfly operation comprising the steps of: [6.1] during a first phase of operation, reading from a data memory: [6.1.1] a first imaginary data value; [6.1.2] a second imaginary data value; [6.1.3] a first real data value; and [6.1.4] a second real data value; [6.2] during a second phase of operation, calculating, in a selected order: [6.2.1] a first difference of the second imaginary data value minus the first imaginary data value; [6.2.2] a first sum of the first and second imaginary data values; [6.2.3] a second difference of the second real data value minus the first real data value; and [6.2.4] a second sum of the first and second real data values; [6.3] during a third phase of operation, calculating, in a selected order: [6.3.1] a first product of the first difference and a real twiddle factor; [6.3.2] a second product of the negative of the first difference and an imaginary twiddle factor; [6.3.3] a third product of the second difference and the imaginary twiddle factor; [6.3.4] a fourth product of the second difference and the real twiddle factor; [6.3.5] a first accumulation of the first product with the third product; and [6.3.6] a second accumulation of the second product with the fourth product; [6.4] during a fourth phase of operation, transforming, in a selected order: [6.4.1] the first sum by a first predetermined factor; [6.4.2] the first accumulation by a second predetermined factor; [6.4.3] the second sum by a third predetermined factor; and [6.4.4] the second accumulation by a fourth predetermined factor; and [6.5] during a fifth phase of operation, writing in the data memory: [6.5.1] the transformed first sum; [6.5.2] the transformed first accumulation; [6.5.2] the transformed second sum; and [6.5.4] the transformed second accumulation.
7. The method of claim 6 further adapted to be performed in a pipelined fashion wherein a single pipeline cycle of operation comprises, in parallel: [7.1] a selected one of steps [6.1.1] through [6.1.4]; [7.2] a selected one of steps [6.2.1] through [6.2.4]; [7.3] a selected one of steps [6.3.1] through [6.3.4]; [7.4] a selected one of steps [6.4.1] through [6.4.4]; and [7.5] a selected one of steps [6.5.1] through [6.5.4].
8. The method of claim 7 wherein: [8.1] a first pipeline cycle of operation comprises steps [6.1.3], [6.2.1], [6.3.4], [6.3.6], [6.4.2] and [6.5.1]; [8.2] a second pipeline cycle of operation comprises steps [6.1.4], [6.2.2], [6.3.1], [6.4.3] and [6.5.2]; [8.3] a third pipeline cycle of operation comprises steps [6.1.1], [6.2.3], [6.3.2], [6.4.4] and [6.5.3] and [8.4] a fourth pipeline cycle of operation comprises steps [6.1.2], [6.2.4], [6.3.3], [6.3.5], [6.4.1] and [6.5.4].
9. The method of claim 7 wherein the reads from memory during the first phase and the writes to memory during the sixth phase are ordered with respect to each other such that writes to real data are performed in parallel with reads from imaginary data, and writes to imaginary data are performed in parallel with reads from real data.
10. The method of claim 6 wherein: step [6.3.5] is performed in parallel with a selected one of steps [6.3.1] and [6.3.3]; and step [6.3.6] is performed in parallel with a selected one of steps [6.3.2] and [6.3.4].
11. A method for use in an audio CODEC, the method adapted to calculate a decimation in frequency butterfly operation comprising the steps of: [11.1] during a first phase of operation, reading from a data memory, in a selected order: [11.1.1] a first real data value; [11.1.2] a second real data value; [11.1.3] a first imaginary data value; and [11.1.4] a second imaginary data value; [11.2] during a second phase of operation, calculating, in a selected order: [11.2.1] a first difference of the second real data value minus the first real data value; [11.2.2] a first sum of the first and second real data values; [11.2.3] a second difference of the second imaginary data value minus the first imaginary value; and [11.2.4] a second sum of the first and second imaginary data values; [11.3] during a third phase of operation, calculating in a carry-save form, in a selected order: [11.3.1] a first carry-save product of the first difference and an imaginary twiddle factor; [11.3.2] a second carry-save product of the first difference and a real twiddle factor; [11.3.3] a third carry-save product of the second difference and the real twiddle factor; [11.3.4] a fourth carry-save product of the negative of the second difference and the imaginary twiddle factor; [11.3.5] a first carry-save accumulation of the first carry-save product with the third carry-save product; and [11.3.6] a second carry-save accumulation of the second carry-save product with the fourth carry-save product; [11.4] during a fourth phase of operation, calculating, in a selected order: [11.4.1] nothing; [11.4.2] nothing; [11.4.3] a first carry-propagate accumulation from the first carry-save accumulation; [11.4.4] a second carry-propagate accumulation from the second carry-save accumulation; [11.5] during a fifth phase of operation, transforming, in a selected order: [11.5.1] the first sum by a first predetermined factor; [11.5.2] the second sum by a second predetermined factor; [11.5.3] the first carry-propagate accumulation by a third predetermined factor; and [11.5.4] the second carry-propagate accumulation by a fourth predetermined factor; and [11.6] during a sixth phase of operation, writing in the data memory, in a selected order: [11.6.1] the transformed first sum; [11.6.2] the transformed second sum; [11.6.3] the transformed first carry-propagate accumulation; and [11.6.4] the transformed second carry-propagate accumulation.
12. The method of claim 11 further adapted to be performed in a pipelined fashion wherein a single pipeline cycle of operation comprises, in parallel: [12.1] a selected one of steps [11.1.1] through [11.1.4]; [12.2] a selected one of steps [11.2.1] through [11.2.4]; [12.3] a selected one of steps [11.3.1] through [11.3.4]; [12.4] a selected one of steps [11.4.1] through [11.4.4]; [12.5] a selected one of steps [11.5.1] through [11.5.4]; and [12.6] a selected one of steps [11.6.1] through [11.6.4].
13. The method of claim 12 wherein: [13.1] a first pipeline cycle of operation comprises steps [11.1.4], [11.2.2], [11.3.1], [11.4.3], [11.5.2] and [11.6.1]; [13.2] a second pipeline cycle of operation comprises steps [11.1.1], [11.2.3], [11.3.2], [11.4.4], [11.5.3] and [11.6.2]; [13.3] a third pipeline cycle of operation comprises steps [11.1.2], [11.2.4], [11.3.3], [11.3.5], [11.4.1], [11.5.4] and [11.6.3] and [13.4] a fourth pipeline cycle of operation comprises steps [11.1.3], [11.2.1], [11.3.4], [11.3.6], [11.4.2], [11.5.1] and [11.6.4].
14. The method of claim 12 wherein the reads from memory during the first phase and the writes to memory during the sixth phase are ordered with respect to each other such that writes to real data are performed in parallel with reads from imaginary data, and writes to imaginary data are performed in parallel with reads from real data.
15. The method of claim 11 wherein: step [11.3.5] is performed in parallel with a selected one of steps [11.3.1] and [11.3.3]; and step [11.3.6] is performed in parallel with a selected one of steps [11.3.2] and [11.3.4].
16. An audio CODEC configured to perform the method of any preceding claim.
17. An electronic system comprising an audio CODEC according to claim 16.
18. A computer readable medium including executable instructions which, when executed in a processing system, causes the processing system to perform the steps of a method according to any one of claims 1 to 15.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0024] My invention may be more fully understood by a description of certain preferred embodiments in conjunction with the attached drawings in which:
[0025]
[0026]
[0027]
[0028] In the drawings, similar elements will be similarly numbered whenever possible. However, this practice is simply for convenience of reference and to avoid unnecessary proliferation of numbers, and is not intended to imply or suggest that my invention requires identity in either function or structure in the several embodiments.
DETAILED DESCRIPTION OF THE INVENTION
[0029]
[0030] In a read phase of operation, data read from the Data Memory may be loaded into register T or register X. The contents of register T may be loaded inverted or non-inverted into register Y.
[0031] In an add/subtract phase of operation, the contents of X and Y may be added together with a logic_1 or logic_0 carry-in, and the resulting sum or difference loaded into the M or S registers. If Y has been inverted, a difference is calculated. A carry-in of logic_1 may be used for calculating a 2's-complement difference, or for calculating a sum with rounding.
[0032] In a MAC phase of operation, a MAC facility multiplies the contents of the M register by a twiddle factor that has been selectively loaded into the C register from a Coefficient Source. As will be familiar to those skilled in this art, the Coefficient Source may include one or more of ROM, RAM, and logic gates. The product of the multiplier is added to zero, or to a value from one of two accumulator registers, A0 and A1. The sum is then loaded into A0 or A1. The M register may be loaded with an inverted copy of it's contents. This allows the negative product of C*M to be calculated. This is useful because the product of two imaginary numbers results in the negative of the product, so this avoids the need for storing a negative twiddle factor in the Coefficient Source. Preferably, the multiply and add operations may be combined to comprise a single MAC operation, completed in a single cycle.
[0033] In general, the two basic calculations performed in the MAC facility are:
A0=(Wr*Mi)+(Wi*Mr)[Eq. 5]
A1=(Wr*Mr)(Wi*Mi)[Eq. 6]
[0034] As is known, the MAC facility can perform either a multiply or a multiply/accumulate. The multiply performs:
A=W*M[Eq. 7]
and the multiply/accumulate performs:
A=A+(W*M)[Eq. 8]
So one of the two multiplies in each equation is performed first, then the other multiply is performed and the result added to the A register. So for the A0 calculation of [Eq. 5], you can sequentially do either of these combinations:
A0=Wr*Mi;[Eq. 9]
A0=A0+(Wi*Mr);[Eq. 10]
or:
A0=Wi*Mr;[Eq. 11]
A0=A0+(Wr*Mi);[Eq. 12]
Whether you do [Eq. 9] followed by [Eq. 10], or [Eq. 11] followed by [Eq. 12], the result, A0, will be the samebut [Eq. 10] must be performed after [Eq. 9], and [Eq. 12] must be performed after [Eq. 11]. Likewise, for the A1 calculation, you can independently select either of these combinations:
A1=(Wi*Mi);[Eq. 13]
A1=A1+(Wr*Mr);[Eq. 14]
or:
A1=Wr*Mr;[Eq. 15]
A1=A1(Wi*Mi);[Eq. 16]
If you select [Eq. 9] and [Eq. 10] for performing the A0 calculation, you can still perform the A1 calculation using either [Eq. 13] and [Eq. 14], or [Eq. 15] and [Eq. 16]. Accordingly, all the possible combinations (not orderings) of these operations that will give the correct result are:
[0035] (1) [Eq. 9], [Eq. 10], [Eq. 13], [Eq. 14]
[0036] (2) [Eq. 9], [Eq. 10], [Eq. 15], [Eq. 16]
[0037] (3) [Eq. 11], [Eq. 12], [Eq. 13], [Eq. 14]
[0038] (4) [Eq. 11], [Eq. 12], [Eq. 15], [Eq. 16]
Further, within each of these four distinct combinations, the order of operations can be varied as long as each odd numbered step is performed sometime before the +1 even numbered step. For example, the combination (1), above, may validly be reordered as:
[0039] (1) [Eq. 13], [Eq. 14], [Eq. 9], [Eq. 10]
or as:
[0040] (1) [Eq. 13], [Eq. 9], [Eq. 14], [Eq. 10].
[0041] Taking into consideration these constraints, the following 24 sequences of these operations all deliver valid results in accordance with [Eq. 5] and [Eq. 6]:
[0042] (1.1) [Eq. 9], [Eq. 10], [Eq. 13], [Eq. 14];
[0043] (1.2) [Eq. 9], [Eq. 13], [Eq. 10], [Eq. 14];
[0044] (1.3) [Eq. 9], [Eq. 13], [Eq. 14], [Eq. 10];
[0045] (1.4) [Eq. 13], [Eq. 14], [Eq. 9], [Eq. 10];
[0046] (1.5) [Eq. 13], [Eq. 9], [Eq. 14], [Eq. 10];
[0047] (1.6) [Eq. 13], [Eq. 9], [Eq. 10], [Eq. 14];
[0048] (2.1) [Eq. 9], [Eq. 10], [Eq. 15], [Eq. 16];
[0049] (2.2) [Eq. 9], [Eq. 15], [Eq. 10], [Eq. 16];
[0050] (2.3) [Eq. 9], [Eq. 15], [Eq. 16], [Eq. 10];
[0051] (2.4) [Eq. 15], [Eq. 16], [Eq. 9], [Eq. 10];
[0052] (2.5) [Eq. 15], [Eq. 9], [Eq. 16], [Eq. 10];
[0053] (2.6) [Eq. 15], [Eq. 9], [Eq. 10], [Eq. 16];
[0054] (3.1) [Eq. 11], [Eq. 12], [Eq. 13], [Eq. 14];
[0055] (3.2) [Eq. 11], [Eq. 13], [Eq. 12], [Eq. 14];
[0056] (3.3) [Eq. 11], [Eq. 13], [Eq. 14], [Eq. 12];
[0057] (3.4) [Eq. 13], [Eq. 14], [Eq. 11], [Eq. 12];
[0058] (3.5) [Eq. 13], [Eq. 11], [Eq. 14], [Eq. 12];
[0059] (3.6) [Eq. 13], [Eq. 11], [Eq. 12], [Eq. 14];
[0060] (4.1) [Eq. 11], [Eq. 12], [Eq. 15], [Eq. 16];
[0061] (4.2) [Eq. 11], [Eq. 15], [Eq. 12], [Eq. 16];
[0062] (4.3) [Eq. 11], [Eq. 15], [Eq. 16], [Eq. 12];
[0063] (4.4) [Eq. 15], [Eq. 16], [Eq. 11], [Eq. 12];
[0064] (4.5) [Eq. 15], [Eq. 11], [Eq. 16], [Eq. 12]; and
[0065] (4.6) [Eq. 15], [Eq. 11], [Eq. 12], [Eq. 16].
[0066] In one embodiment, illustrated in
[0067] In one other embodiment, also illustrated in
[0068] In a shift phase of operation, data from a selected one of the A0, A1 or S registers may be multiplied by powers of 2, such as , 1, or 2, via a shifter. In one embodiment, the shifter may comprise a bit selector adapted to select a predetermined range of the bits received from the selected source register.
[0069] In a write phase of operation, the transformed data developed by the shifter is coupled to the write port of the Data Memory and written in the Data Memory. In one embodiment, the write data and address are sent during the cycle before the write operation. In one other embodiment, the data/address may be delivered even earlier if a write buffer is provided.
[0070] Other datapath connections or shift factors may be implemented in the preferred embodiment to support other functions, but only those used to realize the DIF FFT butterfly are shown in
[0071] Scaling of the transformed data may be realized by selecting a scaling factor of or 1 for the shifter. Optimally, the data should be scaled by only when it is essential to prevent overflow in the transformed data. This allows the maximum number of significant bits to be used in each data sample, thus maintaining the maximum precision in the result. One way of determining the selection of the or 1 shift factor is to record whether any transformed sample in a particular butterfly level has an absolute value at or above a predetermined threshold, and if not, to scale by 1 in the next butterfly level, otherwise scaling by . The predetermined threshold may be a power of two. Other logic may limit the total number of butterfly levels that the data is scaled by 1 during the entire FFT calculation, in order to achieve a predetermined total scaling for the FFT.
[0072] Another preferred embodiment of the invention performs the several phases of operation in pipelined fashion, wherein each line represents a single pipeline cycle and each column represents the actions of a particular phase of operation of the apparatus illustrated in
TABLE-US-00001 Read Add/Subtract MAC Shift Write Y1i Y0i Y1r Mi = Y0i Y1i Y0r Si = Y0i + Y1i A0 = Wr * Mi Mr = Y0r Y1r A1 = Wi * Mi Sr = Y0r + Y1r A0 = A0 + (Wi * Mr) X0i = * Si A1 = A1 + (Wr * Mr) X1i = * A0 X0i X0r = * Sr X1i X1r = * A1 X0r X1r
[0073] Because the operations are pipelined, the memory will be reading data for the next butterfly operation while the other phases are continuing with the current butterfly. In accordance with my method, each stage of the pipeline only needs four pipeline cycles to complete its actions for a particular butterfly operation. Accordingly, after all stages of the pipe are filled with the initial set of values, all pipe stages operate in parallel until the data stream has been processed, at which point the pipe stages will be drained until all remaining values have been written back into the Data Memory. The alignment of the write operations with respect to the read operations will allow them to avoid conflicting with each other if the real and imaginary parts of each data sample are in separate memory banks, so that data can be read from the real bank while data from a previous butterfly operation is being written to the imaginary bank, and vice versa. Alternatively, the data memory could be arranged in a single bank that is clocked at twice the rate of the other units, so that one memory cycle is used for reading and the other for writing. The transformed output data samples X0i, X1i, X0r, and X1r may then be written to the same memory addresses that the input data samples Y0i, Y1i, Y0r, and Y1r were read from, respectively.
[0074] In yet another embodiment, an additional pipeline stage may be added to the MAC phase to accommodate the optional carry propagate operation. This may, for example, allow the first stage to calculate a carry-save result while the second stage completes the calculation with a carry-propagate adder. In this embodiment, the order of the read and write operations is different so that the write operations will not conflict with the read operations if the real and imaginary parts are stored in different memory banks:
TABLE-US-00002 Read Add/Subtract MAC Prop Shift Write Y1r Y0r Y1i Mr = Y0r Y1r Y0i Sr = Y0r + Y1r A0 = Wi * Mr Mi = Y0i Y1i A1 = Wr * Mr Si = Y0i + Y1i A0 = A0 + (Wr * Mi) Psr = Sr A1 = A1 (Wi * Mi) Psi = Si X0r = * Psr Pa0 = A0 X0i = * Psi X0r Pa1 = A1 X1i = * Pa0 X0i X1r = * Pa1 X1i X1r
[0075] In the Prop stage, data from the S register is loaded into the P register, or carry-save data from the MAC facility is summed and the result stored in the P register. During the following cycle the contents of the P register are shifted and sent to the memory to be written.
[0076] Although I have described my invention in the context of particular embodiments, one of ordinary skill in this art will readily realize that many modifications may be made in such embodiments to adapt either to specific implementations. Thus it is apparent that I have provided a method for performing a DIF butterfly that is both effective and efficient. Further, I submit that my method and apparatus provide performance generally superior to the best prior art techniques.