Method and apparatus for decimation in frequency FFT butterfly

09940303 ยท 2018-04-10

Assignee

Inventors

Cpc classification

International classification

Abstract

A pipelined decimation in frequency FFT butterfly method, and an apparatus to perform this method comprising: a data memory with at least one read port and one write port; an add/subtract unit receiving data from the memory; a multiply/accumulate unit receiving data from the add/subtract unit; a source of coefficients, from logic gates or a coefficient memory, to supply FFT twiddle factors to the multiply/accumulate unit; a shifter receiving data from at least one of the add/subtract unit and the multiply/accumulate unit, the shifter supplying data to the write port of the data memory; wherein the apparatus performs these calculations in four cycles of the add/subtract unit and in four cycles of the multiply/accumulate unit, using complex arithmetic.

Claims

1. A decimation in frequency FFT butterfly processor comprising: a data memory, having at least one read port and at least one write port, adapted to store a write data; a first register, operably coupled to the read port, adapted to store first data received from the read port; a second register, operably coupled to the read port, adapted to store second data received from the read port; a third register, operably coupled to the first register, adapted to store a fourth data comprising a selected one of the first data and inverted first data received from the first register; an adder, operably coupled to the second and third registers, adapted to add the second data and the fourth data to produce a fifth data comprising a selected one of a sum and a difference; a fourth register, operably coupled to the adder, adapted to store the fifth data received from the adder; a fifth register, operably coupled to a coefficient source, adapted to store a coefficient received from a coefficient source; a first accumulator register; a second accumulator register; a multiplier/accumulator unit, operably coupled to the fourth and fifth registers and to the first and second accumulator registers, adapted to: produce a product of the fifth data received from the fourth register and the coefficient received from the fifth register; add the product and a sum stored in a selected one of the first and the second accumulator register; and store the sum in a selected one of the first and second accumulator registers; a sixth register, operably coupled to the adder, adapted to store a sixth data comprising a selected one of the sum and difference received from the adder; and a multiplexer, operably coupled to the sixth register, the first and second accumulator registers and to the write port of the data memory, adapted to select as the write data a selected one of the sixth data stored in the sixth register and the sum stored in a selected one of the first or second accumulator registers, and to provide the selected write data to the write port of the data memory.

2. The decimation in frequency FFT butterfly processor of claim 1, wherein the fourth register is operably coupled to receive the inversion of the fourth register in anticipation of the multiplier/accumulator unit calculating a negative product of the fourth and fifth registers.

3. The decimation in frequency FFT butterfly processor of claim 1: wherein the multiplier/accumulator unit is further characterized as producing the product and the sum in a carry-save format, and the first and second accumulator registers are further characterized as storing the sum received from the adder in the carry-save format, wherein at least one bit position is saved as a sum bit and a carry bit; and wherein the multiplexer is further characterized as comprising a carry propagate adder adapted to produce as the write data a propagated sum from the sum and carry bits stored in a selected one of the first and second accumulator registers.

4. The decimation in frequency FFT butterfly processor of claim 1, further comprising: a shifter, operably coupled between the multiplexer and the write port of the data memory, adapted to select as the write data at least a subset of the multiplexer bits to be written to the data memory.

5. The decimation in frequency FFT butterfly processor of claim 4, wherein the shifter is further characterized as providing to the write port of the data memory the write data selected by the multiplexer multiplied by a selected one of and 1.

Description

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

(1) My invention may be more fully understood by a description of certain preferred embodiments in conjunction with the attached drawings in which:

(2) FIG. 1 illustrates, in block diagram form, a general purpose computer system adapted to practice my invention;

(3) FIG. 2 illustrates, in block diagram form, a typical integrated system adapted to practice my invention; and

(4) FIG. 3 illustrates, in block diagram form, one embodiment of an arithmetic unit adapted to performing a DIF butterfly in accordance with my invention.

(5) In the drawings, similar elements will be similarly numbered whenever possible. However, this practice is simply for convenience of reference and to avoid unnecessary proliferation of numbers, and is not intended to imply or suggest that my invention requires identity in either function or structure in the several embodiments.

DETAILED DESCRIPTION OF THE INVENTION

(6) FIG. 3 illustrates a preferred embodiment of an arithmetic unit (AU) adapted to practice my invention, requiring only a single data memory read port, a single data memory write port, and which is capable of completing a DIF butterfly operation every 4 cycles. In general, my AU operates in response to signals developed in a predetermined sequence by a Control. As will be familiar to those skilled in this art, Control may comprise a general- or special-function controller, programmable using either software or firmware, or a hard-wired sequencer. Although the flow of control will be described hereinafter in terms of the functions performed and the results achieved, the specific control signals developed by Control will not be described in detail; further, the several control signals are not illustrated in FIG. 3 so as not to obscure the data flow paths.

(7) In a read phase of operation, data read from the Data Memory may be loaded into register T or register X. The contents of register T may be loaded inverted or non-inverted into register Y.

(8) In an add/subtract phase of operation, the contents of X and Y may be added together with a logic_1 or logic_0 carry-in, and the resulting sum or difference loaded into the M or S registers. If Y has been inverted, a difference is calculated. A carry-in of logic_1 may be used for calculating a 2's-complement difference, or for calculating a sum with rounding.

(9) In a MAC phase of operation, a MAC facility multiplies the contents of the M register by a twiddle factor that has been selectively loaded into the C register from a Coefficient Source. As will be familiar to those skilled in this art, the Coefficient Source may include one or more of ROM, RAM, and logic gates. The product of the multiplier is added to zero, or to a value from one of two accumulator registers, A0 and A1. The sum is then loaded into A0 or A1. The M register may be loaded with an inverted copy of it's contents. This allows the negative product of C*M to be calculated. This is useful because the product of two imaginary numbers results in the negative of the product, so this avoids the need for storing a negative twiddle factor in the Coefficient Source. Preferably, the multiply and add operations may be combined to comprise a single MAC operation, completed in a single cycle.

(10) In general, the two basic calculations performed in the MAC facility are:
A0=(Wr*Mi)+(Wi*Mr)[Eq. 5]
A1=(Wr*Mr)(Wi*Mi)[Eq. 6]

(11) As is known, the MAC facility can perform either a multiply or a multiply/accumulate. The multiply performs:
A=W*M[Eq. 7]
and the multiply/accumulate performs:
A=A+(W*M)[Eq. 8]
So one of the two multiplies in each equation is performed first, then the other multiply is performed and the result added to the A register. So for the A0 calculation of [Eq. 5], you can sequentially do either of these combinations:
A0=Wr*Mi;[Eq. 9]
A0=A0+(Wi*Mr);[Eq. 10]
or:
A0=Wi*Mr;[Eq. 11]
A0=A0+(Wr*Mi);[Eq. 12]
Whether you do [Eq. 9] followed by [Eq. 10], or [Eq. 11] followed by [Eq. 12], the result, A0, will be the samebut [Eq. 10] must be performed after [Eq. 9], and [Eq. 12] must be performed after [Eq. 11]. Likewise, for the A1 calculation, you can independently select either of these combinations:
A1=(Wi*Mi);[Eq. 13]
A1=A1+(Wr*Mr);[Eq. 14]
or:
A1=Wr*Mr;[Eq. 15]
A1=A1(Wi*Mi);[Eq. 16]
If you select [Eq. 9] and [Eq. 10] for performing the A0 calculation, you can still perform the A1 calculation using either [Eq. 13] and [Eq. 14], or [Eq. 15] and [Eq. 16]. Accordingly, all the possible combinations (not orderings) of these operations that will give the correct result are:

(12) (1) [Eq. 9], [Eq. 10], [Eq. 13], [Eq. 14]

(13) (2) [Eq. 9], [Eq. 10], [Eq. 15], [Eq. 16]

(14) (3) [Eq. 11], [Eq. 12], [Eq. 13], [Eq. 14]

(15) (4) [Eq. 11], [Eq. 12], [Eq. 15], [Eq. 16]

(16) Further, within each of these four distinct combinations, the order of operations can be varied as long as each odd numbered step is performed sometime before the +1 even numbered step. For example, the combination (1), above, may validly be reordered as:

(17) (1) [Eq. 13], [Eq. 14], [Eq. 9], [Eq. 10]

(18) or as:

(19) (1) [Eq. 13], [Eq. 9], [Eq. 14], [Eq. 10].

(20) Taking into consideration these constraints, the following 24 sequences of these operations all deliver valid results in accordance with [Eq. 5] and [Eq. 6]:

(21) (1.1) [Eq. 9], [Eq. 10], [Eq. 13], [Eq. 14];

(22) (1.2) [Eq. 9], [Eq. 13], [Eq. 10], [Eq. 14];

(23) (1.3) [Eq. 9], [Eq. 13], [Eq. 14], [Eq. 10];

(24) (1.4) [Eq. 13], [Eq. 14], [Eq. 9], [Eq. 10];

(25) (1.5) [Eq. 13], [Eq. 9], [Eq. 14], [Eq. 10];

(26) (1.6) [Eq. 13], [Eq. 9], [Eq. 10], [Eq. 14];

(27) (2.1) [Eq. 9], [Eq. 10], [Eq. 15], [Eq. 16];

(28) (2.2) [Eq. 9], [Eq. 15], [Eq. 10], [Eq. 16];

(29) (2.3) [Eq. 9], [Eq. 15], [Eq. 16], [Eq. 10];

(30) (2.4) [Eq. 15], [Eq. 16], [Eq. 9], [Eq. 10];

(31) (2.5) [Eq. 15], [Eq. 9], [Eq. 16], [Eq. 10];

(32) (2.6) [Eq. 15], [Eq. 9], [Eq. 10], [Eq. 16];

(33) (3.1) [Eq. 11], [Eq. 12], [Eq. 13], [Eq. 14];

(34) (3.2) [Eq. 11], [Eq. 13], [Eq. 12], [Eq. 14];

(35) (3.3) [Eq. 11], [Eq. 13], [Eq. 14], [Eq. 12];

(36) (3.4) [Eq. 13], [Eq. 14], [Eq. 11], [Eq. 12];

(37) (3.5) [Eq. 13], [Eq. 11], [Eq. 14], [Eq. 12];

(38) (3.6) [Eq. 13], [Eq. 11], [Eq. 12], [Eq. 14];

(39) (4.1) [Eq. 11], [Eq. 12], [Eq. 15], [Eq. 16];

(40) (4.2) [Eq. 11], [Eq. 15], [Eq. 12], [Eq. 16];

(41) (4.3) [Eq. 11], [Eq. 15], [Eq. 16], [Eq. 12];

(42) (4.4) [Eq. 15], [Eq. 16], [Eq. 11], [Eq. 12];

(43) (4.5) [Eq. 15], [Eq. 11], [Eq. 16], [Eq. 12]; and

(44) (4.6) [Eq. 15], [Eq. 11], [Eq. 12], [Eq. 16].

(45) In one embodiment, illustrated in FIG. 3 to the left of the main flow path, the MAC facility may be adapted to produce results directly in carry-propagated format. In this embodiment, a mux can be configured to select the results stored in the A0 or A1 registers, or, in some cases, the contents of the S register.

(46) In one other embodiment, also illustrated in FIG. 3 to the left of the main flow path, the MAC facility may be adapted to produce data in carry-save format. In this embodiment, the carry-save bits and the sum bits may be stored in a selected one of the A0 and A1 registers. Each of the A0 and A1 registers may store both the carry-save bits and the sum bits, with two register bits available for at least some bit positions. In one embodiment, some lower-order bits are in carry-propagated format and the higher-order bits are in carry-save format. The carry-save bits may then be added to the sum bits in an optional Carry Propagate unit, and the carry-propagated result saved in a pipeline register P. In some cases, the S register may be selectively loaded into the pipeline register P.

(47) In a shift phase of operation, data from a selected one of the A0, A1 or S registers may be multiplied by powers of 2, such as , 1, or 2, via a shifter. In one embodiment, the shifter may comprise a bit selector adapted to select a predetermined range of the bits received from the selected source register.

(48) In a write phase of operation, the transformed data developed by the shifter is coupled to the write port of the Data Memory and written in the Data Memory. In one embodiment, the write data and address are sent during the cycle before the write operation. In one other embodiment, the data/address may be delivered even earlier if a write buffer is provided.

(49) Other datapath connections or shift factors may be implemented in the preferred embodiment to support other functions, but only those used to realize the DIF FFT butterfly are shown in FIG. 3.

(50) Scaling of the transformed data may be realized by selecting a scaling factor of or 1 for the shifter. Optimally, the data should be scaled by only when it is essential to prevent overflow in the transformed data. This allows the maximum number of significant bits to be used in each data sample, thus maintaining the maximum precision in the result. One way of determining the selection of the or 1 shift factor is to record whether any transformed sample in a particular butterfly level has an absolute value at or above a predetermined threshold, and if not, to scale by 1 in the next butterfly level, otherwise scaling by . The predetermined threshold may be a power of two. Other logic may limit the total number of butterfly levels that the data is scaled by 1 during the entire FFT calculation, in order to achieve a predetermined total scaling for the FFT.

(51) Another preferred embodiment of the invention performs the several phases of operation in pipelined fashion, wherein each line represents a single pipeline cycle and each column represents the actions of a particular phase of operation of the apparatus illustrated in FIG. 3. As will be familiar to those skilled in this art, all actions on the same line happen simultaneously; so for example, the value of M used on the same line that has an assignment to M will be the old value assigned to M in an earlier cycle. A variable with an r represents the real part of a complex number and a variable with an i represents the imaginary part:

(52) TABLE-US-00001 Read Add/Subtract MAC Shift Write Y1i Y0i Y1r Mi = Y0i Y1i Y0r Si = Y0i + Y1i A0 = Wr * Mi Mr = Y0r Y1r A1 = Wi * Mi Sr = Y0r + Y1r A0 = A0 + (Wi * Mr) X0i = * Si A1 = A1 + (Wr * Mr) X1i = * A0 X0i X0r = * Sr X1i X1r = * A1 X0r X1r

(53) Because the operations are pipelined, the memory will be reading data for the next butterfly operation while the other phases are continuing with the current butterfly. In accordance with my method, each stage of the pipeline only needs four pipeline cycles to complete its actions for a particular butterfly operation. Accordingly, after all stages of the pipe are filled with the initial set of values, all pipe stages operate in parallel until the data stream has been processed, at which point the pipe stages will be drained until all remaining values have been written back into the Data Memory. The alignment of the write operations with respect to the read operations will allow them to avoid conflicting with each other if the real and imaginary parts of each data sample are in separate memory banks, so that data can be read from the real bank while data from a previous butterfly operation is being written to the imaginary bank, and vice versa. Alternatively, the data memory could be arranged in a single bank that is clocked at twice the rate of the other units, so that one memory cycle is used for reading and the other for writing. The transformed output data samples X0i, X1i, X0r, and X1r may then be written to the same memory addresses that the input data samples Y0i, Y1i, Y0r, and Y1r were read from, respectively.

(54) In yet another embodiment, an additional pipeline stage may be added to the MAC phase to accommodate the optional carry propagate operation. This may, for example, allow the first stage to calculate a carry-save result while the second stage completes the calculation with a carry-propagate adder. In this embodiment, the order of the read and write operations is different so that the write operations will not conflict with the read operations if the real and imaginary parts are stored in different memory banks:

(55) TABLE-US-00002 Read Add/Subtract MAC Prop Shift Write Y1r Y0r Y1i Mr = Y0r Y1r Y0i Sr = Y0r + Y1r A0 = Wi * Mr Mi = Y0i Y1i A1 = Wr * Mr Si = Y0i + Y1i A0 = A0 + (Wr * Mi) Psr = Sr A1 = A1 (Wi * Mi) Psi = Si X0r = * Psr Pa0 = A0 X0i = * Psi X0r Pa1 = A1 X1i = * Pa0 X0i X1r = * Pa1 X1i X1r

(56) In the Prop stage, data from the S register is loaded into the P register, or carry-save data from the MAC facility is summed and the result stored in the P register. During the following cycle the contents of the P register are shifted and sent to the memory to be written.

(57) Although I have described my invention in the context of particular embodiments, one of ordinary skill in this art will readily realize that many modifications may be made in such embodiments to adapt either to specific implementations. Thus it is apparent that I have provided a method for performing a DIF butterfly that is both effective and efficient. Further, I submit that my method and apparatus provide performance generally superior to the best prior art techniques.