Serial FFT-based low-power MFCC speech feature extraction circuit

11715456 · 2023-08-01

Abstract

The invention discloses a serial FFT-based low-power MFCC speech feature extraction circuit and belongs to the technical field of calculation, reckoning or counting. The circuit is oriented toward the field of intelligence and adapts the MFCC algorithm to a hardware circuit design: it makes full use of a serial FFT algorithm and an approximate multiplication operation, thereby greatly reducing circuit area and power. The entire circuit includes a preprocessing module, a framing and windowing module, an FFT module, a Mel filtration module, and a logarithm and DCT module. The improved FFT algorithm processes data in a serial pipeline manner and makes effective use of the duration of an audio frame, thereby reducing the storage area and operating frequency of the circuit while still meeting the output requirement.

Claims

1. A serial FFT-based low-power MFCC speech feature extraction circuit, comprising: a pre-emphasis module that is a software component for preprocessing an input speech sequence by subtracting previous data of an adjacent time from input data and accumulating a value obtained by shifting previous data rightwards by 4 bits; a framing and windowing module that is a software component for performing framing and windowing operations on the pre-processed speech sequence by cutting a long audio into several frames; an FFT module that is a software component for converting a time-domain signal of one frame into a frequency-domain signal by performing Fourier transform layer by layer and packet by packet on the sequence data subjected to framing and windowing operations and then outputting complex data subjected to bit permutation, wherein each layer of the Fourier transform performs two times of serial packeting and then butterfly operation on the input data, to output a product of the last butterfly operation output data and a twiddle factor to a next layer of the Fourier transform, an operational equation of the FFT module is as follows:

$$X(k_1+2k_2+4k_3)=\sum_{n_3=0}^{\frac{N}{4}-1}\left\{\left[x(n_3)+(-1)^{k_1}x\left(\frac{N}{2}+n_3\right)\right]+(-j)^{(k_1+2k_2)}\left[x\left(\frac{N}{4}+n_3\right)+(-1)^{k_1}x\left(\frac{3N}{4}+n_3\right)\right]\right\}W_N^{n_3(k_1+2k_2)}W_{\frac{N}{4}}^{n_3k_3}$$

wherein $X(k_1+2k_2+4k_3)$ is a sequence of output signals, $k_1$ is 0, 1, $k_2$ is 0, 1, and $k_3$ is an integer number ranging from 0 to 63; a Mel filtration module that is a software component for performing a Mel filtration operation on a frequency-domain signal of each frame by extracting an energy value of a complex output by the FFT module and performing multi-stage Mel filtration on the energy value to obtain a Mel value; a logarithm taking module that is a software component for performing compressed representation on data of filter sets by taking a logarithm value on the Mel value with 2 as a base through a lookup table; and a DCT module that is a software component for performing DCT on the logarithm value of the Mel value with 2 as the base by multiplying the input data and a cosine coefficient, the DCT module as butterfly operations are specific to the FFT module; wherein performing butterfly operation on the input data packet by packet by each layer of the Fourier transform comprises setting a high-bit part of the input data as a first data set, setting a low-bit part of the input data as a second data set, performing the first butterfly operation on the first data set and the second data set, then updating the first data set to be a low-bit data of a first butterfly operation result, packeting the first butterfly operation result, then performing the second butterfly operation, and outputting a second butterfly operation result; the FFT module comprises N/2 radix-$2^2$ single delay feedback units sequentially connected in series, $N=\log_2 T$, T being the number of data contained in each frame of the speech sequence, and each radix-$2^2$ single delay feedback unit comprising: a first butterfly operation unit and a storage unit thereof, wherein an input end of the first butterfly operation unit is connected to the speech sequence subjected to the framing and windowing operation or output data of the previous radix-$2^2$ single delay feedback unit, the high-bit part of the input data is cached in the storage unit of the butterfly operation unit, the high-bit part and the low-bit part of the input data are subjected to the first butterfly operation, then the data in the storage unit of the butterfly operation unit is updated to be the low-bit part of the first butterfly operation result, and a high-bit part of the first butterfly operation result is output to a second butterfly operation unit; the second butterfly operation unit and a storage unit thereof, wherein the high-bit part of the input data is cached in the storage unit of the butterfly operation unit, the high-bit part and the low-bit part of the input data are subjected to the second butterfly operation, then the data in the storage unit of the butterfly operation unit is updated to be a low-bit part of the second butterfly operation result, and the second butterfly operation result is output to a multiplying unit; and the multiplying unit for performing a multiplication on the result of the second butterfly operation and the twiddle factor.

2. The serial FFT-based low-power MFCC speech feature extraction circuit according to claim 1, wherein preprocessing the input speech sequence by the pre-emphasis module specifically comprises subtracting a previous time data from a current time data of the input speech sequence and then accumulating a value acquired by shifting the previous time data rightwards by 4 bits so as to acquire a preprocessed speech signal.

3. The serial FFT-based low-power MFCC speech feature extraction circuit according to claim 1, wherein performing multi-stage Mel filtration on the energy value to obtain a Mel value specifically comprises performing multiplication-accumulation on the energy value and a function value of a multi-stage Mel filter.

4. The serial FFT-based low-power MFCC speech feature extraction circuit according to claim 1, wherein taking the logarithm value on the Mel value with 2 as the base through the lookup table specifically comprises taking the digit position of the Mel value at which the highest bit “1” appears as the logarithm value with 2 as the base.

5. The serial FFT-based low-power MFCC speech feature extraction circuit according to claim 1, wherein performing DCT on the logarithm value of the Mel value with 2 as the base specifically comprises multiplying the logarithm value of the Mel value with 2 as the base and a cosine coefficient and then performing accumulation.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic diagram of the present invention.

(2) FIG. 2 is a system architecture diagram of the present invention.

(3) FIG. 3 is a structure diagram of a framing and windowing module circuit of the present invention.

(4) FIG. 4 is a structure diagram of an FFT module circuit of the present invention.

(5) FIG. 5 is a structure diagram of a Mel filtration module circuit of the present invention.

(6) FIG. 6 is a structure diagram of circuits of a logarithm and DCT module of the present invention.

DESCRIPTION OF THE EMBODIMENTS

(7) The technical solution of the present invention will be described in detail below with reference to the accompanying drawings. An example with a frame size of 256 points, a step size of 128 points, a 20-stage Mel value and a 10-stage DCT value (T=256, S=128, M=20, L=10) is taken to explain the specific implementation of the present invention, but it does not limit the scope of the present invention. FIG. 1 is a schematic diagram of the present invention.

(8) As shown in FIG. 2, a serial FFT-based low-power MFCC speech feature extraction circuit designed by the present invention is mainly divided into four modules: a framing and windowing module, an FFT module, a Mel filtration module, and a logarithm and DCT module. The circuit takes a clock signal and a speech analog-to-digital conversion (ADC) sampling data signal as inputs and outputs a speech feature value. The work of the circuit includes the following steps.

(9) Step 1: as shown in FIG. 3, once the circuit is switched on, a speech ADC sampling end samples the audio at a sampling rate of 8 kHz. The system first uses an asynchronous first-in first-out (FIFO) memory to cache the data; the data output of the asynchronous FIFO serves as the input of a pre-emphasis module, which performs the pre-emphasis operation on the incoming data by means of a register and shift and add operations. The equation of the pre-emphasis operation is as follows:
$$\mathrm{data}_{out}[k]=\mathrm{data}_{in}[k]-\mathrm{data}_{in}[k-1]+\left(\mathrm{data}_{in}[k-1]\gg 4\right),$$

(10) wherein k starts from 1 and represents a position of the data.
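
By way of non-limiting illustration, the pre-emphasis equation above can be modeled in software as follows. This Python sketch is not part of the disclosed circuit; the function name `pre_emphasis` and the choice to reset the delay register to 0 before the first sample are assumptions made for illustration. Because a right shift by 4 bits divides by 16, the operation approximates data_in[k] − (15/16)·data_in[k−1] without a multiplier.

```python
def pre_emphasis(samples):
    """Shift-and-add pre-emphasis: y[k] = x[k] - x[k-1] + (x[k-1] >> 4).

    `samples` is a sequence of integer ADC samples.
    """
    out = []
    prev = 0  # assumption: the register holding x[k-1] starts at 0
    for x in samples:
        # Python's >> on ints is an arithmetic shift, so negative
        # samples are handled like a signed hardware shifter would.
        out.append(x - prev + (prev >> 4))
        prev = x
    return out


if __name__ == "__main__":
    print(pre_emphasis([16, 32, 48]))
```

In this model only one register, one subtractor, one adder and a wired shift are needed, which mirrors the register and shift and add operations described in Step 1.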

(11) Step 2: the framing operation in the framing and windowing module performs no arithmetic on the data; it only cuts, recombines and outputs the data, so that a single memory and a single multiplexer suffice to frame the data, with the second half of the previous frame of data stored every time. The framed data are then multiplied by the stored Hamming window coefficients by means of a multiplying unit.
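
By way of non-limiting illustration, the framing and windowing of Step 2 can be sketched in Python as follows. The function names `hamming` and `windowed_frames` are hypothetical, floating-point coefficients are used in place of the stored fixed-point values, and the standard Hamming definition 0.54 − 0.46·cos(2πn/(T−1)) is assumed because the text does not give the stored coefficients.

```python
import math


def hamming(T):
    """Standard Hamming window of length T (assumed coefficient formula)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (T - 1)) for n in range(T)]


def windowed_frames(samples, T=256, S=128):
    """Cut `samples` into frames of T points that advance by S points,
    then multiply each frame by the window coefficients."""
    window = hamming(T)
    buf = []      # models the single frame memory
    frames = []
    for x in samples:
        buf.append(x)
        if len(buf) == T:
            frames.append([s * w for s, w in zip(buf, window)])
            # Keep the tail of this frame; with S = T/2 this is the
            # second half of the previous frame, as in Step 2.
            buf = buf[S:]
    return frames


if __name__ == "__main__":
    frames = windowed_frames(list(range(512)))
    print(len(frames), len(frames[0]))
```

With T=256 and S=128, every new frame reuses 128 stored points and needs only 128 new samples, which is why a single memory suffices in this model.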

(12) Step 3: as shown in FIG. 4, the data subjected to framing and windowing enter the FFT module in a pipeline manner. Because the Fourier transform has 256 points, 4 stages of Radix-$2^2$ SDF units are needed. Each Radix-$2^2$ SDF unit includes a BF1 operation unit, a BF2 operation unit and a twiddle factor multiplication unit, and the operational equation of the serial FFT module is as follows:

(13) $$X(k_1+2k_2+4k_3)=\sum_{n_3=0}^{\frac{N}{4}-1}\left\{\left[x(n_3)+(-1)^{k_1}x\left(\frac{N}{2}+n_3\right)\right]+(-j)^{(k_1+2k_2)}\left[x\left(\frac{N}{4}+n_3\right)+(-1)^{k_1}x\left(\frac{3N}{4}+n_3\right)\right]\right\}W_N^{n_3(k_1+2k_2)}W_{\frac{N}{4}}^{n_3k_3},$$
in the above equation, $X(k_1+2k_2+4k_3)$ is the sequence of the output signals, $k_1$ is 0 or 1, $k_2$ is 0 or 1, and $k_3$ is an integer ranging from 0 to 63. On the right side of the equal sign, the expression inside the summation sign is the mathematical form of the butterfly operations: $x(n_3)+(-1)^{k_1}x(N/2+n_3)$ serves as the BF1 butterfly operation, $\{[x(n_3)+(-1)^{k_1}x(N/2+n_3)]+(-j)^{(k_1+2k_2)}[x(N/4+n_3)+(-1)^{k_1}x(3N/4+n_3)]\}$ serves as the BF2 butterfly operation, and $W_N^{n_3(k_1+2k_2)}$ serves as the twiddle factor. The data pass through 4 rounds of operations in the Radix-$2^2$ SDF units, and finally the FFT results are output in bit-permuted order.
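
By way of non-limiting illustration, the operational equation above can be checked numerically against a direct discrete Fourier transform. The Python sketch below evaluates one output bin from the BF1, BF2 and twiddle factor terms; it is a mathematical check only and does not model the pipelined single delay feedback hardware. The function names are hypothetical, and the usual convention W_N = e^(−j2π/N) is assumed because the text does not define it.

```python
import cmath


def twiddle(M, e):
    """W_M^e with the assumed convention W_M = exp(-j*2*pi/M)."""
    return cmath.exp(-2j * cmath.pi * e / M)


def dft(x):
    """Direct O(N^2) DFT used as the reference."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]


def radix22_bin(x, k1, k2, k3):
    """Evaluate X(k1 + 2*k2 + 4*k3) from the radix-2^2 equation."""
    N = len(x)
    total = 0
    for n3 in range(N // 4):
        # BF1: sum or difference of points that are N/2 apart.
        bf1_a = x[n3] + (-1) ** k1 * x[N // 2 + n3]
        bf1_b = x[N // 4 + n3] + (-1) ** k1 * x[3 * N // 4 + n3]
        # BF2: combine the two BF1 results with a trivial (-j) rotation.
        bf2 = bf1_a + (-1j) ** (k1 + 2 * k2) * bf1_b
        # Twiddle factor, then the remaining N/4-point transform kernel.
        total += bf2 * twiddle(N, n3 * (k1 + 2 * k2)) * twiddle(N // 4, n3 * k3)
    return total


if __name__ == "__main__":
    x = [complex(i * i % 7, i % 3) for i in range(16)]
    ref = dft(x)
    err = max(abs(radix22_bin(x, k1, k2, k3) - ref[k1 + 2 * k2 + 4 * k3])
              for k1 in (0, 1) for k2 in (0, 1) for k3 in range(4))
    print(err < 1e-9)
```

The multiplications by (−1) and (−j) inside BF1 and BF2 reduce to sign changes and real/imaginary swaps, so in this decomposition only the twiddle factor requires a full complex multiplier per stage.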

(14) Step 4: as shown in FIG. 5, each complex value output by the FFT is first subjected to a sum-of-squares operation on its real part and imaginary part; the resulting energy (squared-modulus) value and the Mel filter function values stored in the memory are then multiplied and accumulated, and finally the 20-stage Mel value of one frame is output.
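
By way of non-limiting illustration, the energy extraction and multiply-accumulate of Step 4 can be sketched in Python as follows. The function name `mel_values` and the tiny filter weights in the usage example are hypothetical placeholders; the actual stored Mel filter function values are not given in the text.

```python
def mel_values(fft_bins, filters):
    """Return one Mel value per filter.

    fft_bins: list of (real, imag) pairs output by the FFT.
    filters:  list of weight lists, one per Mel stage (assumed layout).
    """
    # Energy of each bin: real^2 + imag^2 (no square root is taken).
    energy = [re * re + im * im for re, im in fft_bins]
    # Multiply-accumulate the energies with each filter's weights.
    return [sum(w * e for w, e in zip(weights, energy)) for weights in filters]


if __name__ == "__main__":
    bins = [(3, 4), (1, 0), (0, 2)]
    toy_filters = [[1, 0, 0], [0, 1, 1]]  # placeholder weights, 2 stages
    print(mel_values(bins, toy_filters))
```

In the example configuration of the description, `filters` would hold 20 weight lists so that 20 Mel values are produced per frame.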

(15) Step 5: as shown in FIG. 6, after the Mel value is output, its logarithm needs to be taken. This function is achieved through a lookup table by searching for the position at which the highest bit ‘1’ of the data appears. For example, in the eight-bit binary number 10001111, the highest bit ‘1’ appears at the 7th bit (counting from bit 0 at the least significant end), such that the corresponding logarithm value is 7. After the logarithm is taken, the Mel value needs to be subjected to DCT, and the equation of the DCT is as follows:

(16) $$C(x)=\sum_{m=0}^{M-1}s(m)\cos\left(\frac{\pi x(m-0.5)}{M}\right),\quad x=1,2,\ldots,L,$$
wherein s(m) is the logarithm value of the Mel value with 2 as the base, L is the stage number of the DCT, and M is the stage number of Mel. On the hardware, the equation is evaluated by multiplying the data by the corresponding cosine values and accumulating the products, which finally outputs the 10-stage DCT value; the 10-stage DCT value serves as the feature value of the frame.
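
By way of non-limiting illustration, the logarithm approximation and the DCT of Step 5 can be sketched in Python as follows. The function names are hypothetical, mapping a zero input to 0 is an assumption not stated in the text, and `math.cos` stands in for the stored cosine coefficients. The DCT is written with the indexing exactly as it appears in the equation above.

```python
import math


def log2_leading_one(value):
    """Approximate log2 by the position of the highest '1' bit
    (bit 0 is the least significant bit)."""
    if value <= 0:
        return 0  # assumption: zero input maps to 0
    return value.bit_length() - 1


def dct(s, L=10):
    """C(x) = sum_{m=0}^{M-1} s(m) * cos(pi * x * (m - 0.5) / M), x = 1..L."""
    M = len(s)
    return [sum(s[m] * math.cos(math.pi * x * (m - 0.5) / M) for m in range(M))
            for x in range(1, L + 1)]


if __name__ == "__main__":
    print(log2_leading_one(0b10001111))   # the example value from Step 5
    mel = [143, 64, 9, 1] * 5             # 20 made-up Mel values
    features = dct([log2_leading_one(v) for v in mel])
    print(len(features))
```

Because the logarithm is reduced to locating the leading one, its result is a small integer, which keeps the operands of the subsequent multiply-accumulate narrow.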