Efficient sphere detector algorithm for large antenna communication systems using graphic processor unit (GPU) hardware accelerators
11075781 · 2021-07-27
Assignee
Inventors
- Hatem Ltaief (Thuwal, SA)
- Zouheir Rezki (Thuwal, SA)
- Mohamed Amine Arfaoui (Thuwal, SA)
- Mohamed-Slim ALOUINI (Thuwal, SA)
- David E. Keyes (Thuwal, SA)
Cpc classification
H04L2025/03426
ELECTRICITY
International classification
H04L5/12
ELECTRICITY
Abstract
A method of detecting a symbol transmitted over a communication channel in a multiple input-multiple output communication system. The method includes receiving a plurality of symbols transmitted over a communication channel of a multiple input-multiple output communication system. A sphere radius is initialized based on attributes of the communication channel. A first matrix of possible transmitted signals is defined as well as a second matrix of received symbols. The matrix of possible transmitted signals is searched using a breadth-first search (BFS). Each level of the search tree is analyzed utilizing matrix multiplication to determine selected symbols satisfying the initialized sphere radius. A maximum likelihood solution is of the transmitted symbols is derived based on the selected symbols.
Claims
1. A method of decoding a signal comprised of a plurality of symbols communicated via a multiple input-multiple output (MIMO) communication system, the method comprising: receiving a signal at a receiver end of a communication channel of the multiple input-multiple output communication system, wherein the received signal is a function of a transmitted signal and the communication channel; initializing a sphere radius r based on attributes of the communication channel and not on the received signal; and defining a set of possible transmit signal symbols based on the initialized sphere radius r, wherein defining a set of possible transmit signal symbols based on the initialized sphere radius r includes utilizing breadth-first tree traversal to define the set of possible transmit signal symbols based on the sphere radius r; and selecting a maximum likelihood solution of the transmitted signal based on the selected set of possible transmit signal symbols.
2. The method of claim 1, wherein attributes of the communication channel include number of transmitters utilized in the MIMO communication system, wherein the sphere radius r is initialized based, at least in part, on the number of transmitters.
3. The method of claim 2, wherein the sphere radius r is selected as a function of the number of transmitters and a noise variance estimation.
4. The method of claim 3, wherein the sphere radius r is selected according to an equation r=Mσ.sup.2, wherein r is the sphere radius, M is the number of the transmitters, and σ is the noise variance estimate.
5. The method of claim 1, wherein the sphere radius r is increased if no transmit signal symbols are located within the initialized sphere radius.
6. The method of claim 5, wherein the sphere radius r is increased according to an equation: r.sup.2=r.sup.2+Mσ.sup.4, wherein r is the sphere radius, M is the number of the transmitters, and σ is the noise variance estimate.
7. The method of claim 1, wherein defining the set of possible transmit signal symbols based on the initialized sphere radius r includes defining a first matrix V.sub.k having size (k, M.sub.cG.sub.k-1), for k=1 . . . M, wherein k represents a search tree level being evaluated, M represents a number of transmitters, M.sub.c represents a number of symbols in a constellation set, and G.sub.k-1 represents a number of vectors at level k−1 determined to be located within the sphere radius r, wherein first matrix V.sub.k is comprised of vectors s.sub.k-1 located in the set L.sub.k-1 representing the transmitted signal vectors that satisfy the sphere radius r, and further includes defining a second matrix y.sub.k based on the received signal y.sub.k duplicated M.sub.cG.sub.k-1 times.
8. The method of claim 7, wherein matrix multiplication is utilized to calculate a third matrix P.sub.k based on y.sub.k−R.sub.kV.sub.k, wherein R.sub.k is related to a QR decomposition of channel estimation H, wherein the third matrix P.sub.k is evaluated with respect to the sphere radius r.
9. A decoder circuit for decoding a multiple-input, multiple output (MIMO) signal, the decoder circuit comprising one or more processors configured to: initialize a sphere radius r based on attributes of a communication channel; initialize a parallel breadth-first tree search algorithm based on matrix-matrix multiplication of each level of the search tree based on values located within the sphere radius r, wherein the search tree includes a number of levels corresponding with a number of transmitters; and select a maximum likelihood solution of a transmitted signal based on the matrix-matrix multiplication.
10. The decoder circuit of claim 9, wherein initializing the parallel breadth-first tree search algorithm includes: define a first matrix V.sub.k having size (k, M.sub.cG.sub.k-1), where in k is search tree level being evaluated, M.sub.c defines a number of symbols in a constellation set, and G.sub.k-1 defines a number of vectors at level k−1 determined to be located within the sphere radius r, wherein the first matrix V.sub.k is comprised of vectors s.sub.k-1 located in a set L.sub.k-1 representing a transmitted signal vectors that satisfy the sphere radius r; define a second matrix y.sub.k based on a received signal y.sub.k duplicated M.sub.cG.sub.k-1 times; utilize matrix-matrix multiplication to calculate a third matrix P.sub.k based on an equation P.sub.k=y.sub.k−R.sub.kV.sub.k, wherein R.sub.k is related to a QR decomposition of channel estimation H, wherein the third matrix P.sub.k is evaluated with respect to the sphere radius r to derive a set of vectors L.sub.k satisfying the sphere radius r for k=1 . . . M, wherein M is the number of transmitter; and select a maximum likelihood solution of the transmitted signal based on a matrix P.sub.M, wherein P.sub.M is the calculated third matrix P.sub.k when k=M.
11. The decoder circuit of claim 9, wherein attributes of the communication channel include number of transmitters utilized in the MIMO communication system, wherein the sphere radius r is initialized based, at least in part, on the number of transmitters.
12. The decoder circuit of claim 11, wherein the sphere radius r is selected as a function of the number of transmitters and a noise variance estimation.
13. The decoder circuit of claim 11, wherein the sphere radius r is increased according to an equation: r.sup.2=r.sup.2+Mσ.sup.4 if no transmit signal symbols are located within an initialized sphere radius, wherein r is the sphere radius, M is the number of the transmitters, and σ is a noise variance estimate.
14. The decoder circuit of claim 9, wherein the matrix-matrix multiplication is a compute-bound computation performed by the processor.
15. The decoder circuit of claim 14, wherein the one or more processors are graphics processing units (GPUs).
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
DETAILED DESCRIPTION
(10) The present invention provides a system and method of decoding massive multiple-input, multiple output (MIMO) wireless transmissions. In particular, the present invention utilizes a sphere decoding scheme wherein the radius of the sphere is fine-tuned to reduce the pool of possible candidates. In addition, one embodiment of the present invention utilizes a breadth-first tree traversal—as opposed to a depth first tree traversal. A benefit of this approach is that breadth-first tree traversal utilizes matrix-matrix multiplication operations that may be processed in a graphics processing unit (GPU) environment that is compute-bound (as opposed to memory-bound). This is in contrast with depth-first traversal which requires vector-matrix multiplication and as a result requires memory-bound computations. By utilizing a GPU processing environment in combination with the parallelism gained via a breadth-first search tree, typically memory-bound computations are cast into compute-bound operations, the overall complexity of the MIMO decoder is reduced while maintaining real-time or near real-time processing.
(11)
(12) In general, MIMO refers to a communication system using multiple transmit and receive antennas to exploit multi-path propagation. In the embodiment shown in
(13) Transmitter 104 transmits the modulated stream of data via channel 106 to receiver 110, which is a comprised of a plurality of antennas (e.g., large antenna array). Channel 106 describes the propagation paths the signal takes and may be known or unknown. In one embodiment, the large transmitter array include M transmitters, while the large receiver array include N receivers. For an M×N MIMO configuration, the transmitter sends different signal streams on M antennas and the receiver receives N different signal streams, one per receiver antenna. With respect to channel 106, in some embodiments information may be known regarding the channel at the receiver side and how the channel interacts with the transmitted signal. Known channel properties are described as channel state information (CSI). In addition to channel properties, at least some noise is added to the transmitted signal. Receiver 110 decodes the received signal and provides the decoded signal to demodulator 112. Decoding of the received signal may rely on a plurality of well-known decoders, including linear and non-linear decoders. Embodiments of the present invention propose a unique non-linear decoder for providing decoding operations at receiver 110. Performance of the decoder algorithm is measured in the bit-error rate (BER) of the decoder algorithm as well as computational complexity (e.g., time required to perform the algorithm). It is beneficial to provide good performance (e.g., low BER) at low complexity (e.g., fast computation time). Based on the modulation scheme utilized to modulate the binary stream of data originally, demodulator 112 demodulates the decoded signal to generate a binary stream output.
(14) These operations can be described mathematically. The input-output relationship of an M×N MIMO system is described by the following equation:
y=Hs+w (1)
where y=[y.sub.1, y.sub.2 . . . , y.sub.N,].sup.T is the received vector, H is N×M channel matrix, where each element H.sub.i,j is an independent zero mean circularly symmetric complex Gaussian random variable with variance σ.sup.2.sub.c, and w=[w.sub.1, w.sub.2, . . . w.sub.N].sup.T, where w.sub.i is an independent zero mean circularly symmetric complex Gaussian random variable with variance σ.sup.2. The transmitted vector (provided by transmitter array 104) is s=[s.sub.1, s.sub.2, . . . s.sub.M], where s.sub.i is drawn from a finite complex constellation alphabet set of cardinal symbols M.sub.c, (s∈S=Ω.sup.M). In general, two types of detection algorithms are utilized to recover transmitted signals, linear decoders and non-linear decoders. Linear decoders operate by separating the received signal into streams by multiplying the received signal y by H.sub.inv.sup.H to get the decoded signal ŝ=Q(H.sub.inh.sup.H.Math.y), wherein Q(.Math.) designates a mapping function to the original constellation. Conventional linear detectors include the maximum ratio combining (MRC), the zero forcing (ZF), and the minimum mean square error (MMSE). Non-linear decoders—such as the MLD calculates the a posteriori probability in terms of likelihood ratio for each possible transmitted vector s by browsing all the set S. The ML estimate of the transmitted vector s, ŝ.sub.ML is given by the following equation:
(15)
The vector s that yields the smallest distance between the received vector y and the hypothesized message Hs is selected as the most likely. Because the optimization problem is performed over the set S=Ω.sup.M, the algorithmic complexity of the ML decoder is O(M.sub.c.sup.M), which becomes very high for a large number of transmitted antennas and/or for a large constellation size.
(16) The sphere decoder is a variant of the ML decoder that reduces the complexity of the ML decoder. In general, the sphere decoder solves equation (2), provided above, by solving for all points which belong to a hypersphere of radius r around the received point y. This can be expressed as follows:
∥y−Hs∥.sup.2≤r.sup.2 (3)
That is, instead of searching each possible transmitted vector s for the one that provides the minimum probability of error, only those points residing within the radius r of the sphere are searched. The search process using the sphere decoder algorithm is a combinatorial optimization problem which can be solved using the ascendant-tree-search algorithm. The two known tree-search algorithms are depth-first-search trees (DFS) and breadth-first-search trees (BFS). In general, the algorithmic complexity of the BFS and DFS algorithms are identical, because the same number of points are ultimately searched. However, as discussed in more detail below, in one embodiment the massive MIMO system 100 utilizes the BFS algorithm to take advantage of additional parallelism as compared with the DFS algorithm. For a fixed radius r, the algorithmic complexity of the sphere decoder is equal to O(Mc.sup.γM), where γ=γ.sub.H,w,s(r) is a real random variable between 0 and 1 and whom statistic is induced by those of the channel matrix H, the signal noise at the receiver w and the transmitted signal s.
(17) In one embodiment, receiver 110 utilizes an efficient sphere detector (SD) algorithm (hereinafter “efficient SD algorithm”) to reduce the complexity of the traditional SD algorithm. Modifications to the SD algorithm may be utilized alone or in conjunction with one another, and include modifications to selection of the sphere radius as well as implementation of a parallel sphere decoder algorithm that combines the efficient breadth-first-search tree (BFS) algorithm. In some embodiments, receiver 110 is comprised of one or more processors that execute the efficient sphere detector algorithm to implement a decoder circuit. In some embodiments, at least one or more of the processors is a graphics processing unit (GPU). As discussed in more detail below, one of the benefits of the claimed invention is that the efficient SD algorithm implemented on one or more GPUs utilizes matrix-matrix multiplication that results in operations being compute-bound (i.e., does not require storage of results to memory as is the case with vector-matrix multiplication, which requires storage of a result to memory and subsequent access of the stored result to continue the calculation). Implementation of the efficient SD algorithm in a processor environment that allows for compute-bound operations is a benefit over prior art sphere decoders.
(18) As illustrated above, the sphere radius utilized by the SD algorithm impacts the performance (in terms of BER) and the complexity (in terms of elapsed time) of the efficient SD detector. As the sphere radius increases, the sphere detector essentially becomes equivalent to the ML detector in which each possible transmitted vector s is searched, which provides high performance (high BER) and high complexity (slow computation time). An ideal choice of the criterion for selecting r is a radius such that the sphere contains the ML salutation as well as the minimum possible number of other candidates from S, so that the algorithm provides high performance (low BER) and low complexity (fast computation time).
(19) Efficient Sphere Radius Selection
(20) In one embodiment, the efficient SD algorithm converts the radius selection equation provided in Equation 3, above, into an equivalent problem, wherein channel matrix H=QR, wherein QR is the decomposition of the matrix H, and wherein Q∈C.sup.N×N is an orthogonal matrix and R∈C.sup.N×M is an upper triangular matrix. In addition,
(21)
(22) Equation (4) can be solved by considering the set L.sub.k for 1≤K≤M defined by L.sub.k={s.sub.k∈Ω.sup.k, ∥
(23) In particular, an embodiment of the present invention evaluates the expectation of ∥
E∥
with the variance of the probability distribution can be expressed as:
var(∥
As a result, the radius r is selected to satisfy the equation:
r.sup.2=Nσ.sup.2 (7)
This selection of the radius r ensures that most of the maximum likelihood (ML) predominant solution candidates will be encapsulated within the sphere defined by radius r. In cases where the selected radius r results in an empty sphere, the radius of the sphere r is expanded. In one embodiment, the expansion is provided as follows:
r.sup.2=r.sup.2+var(∥
(24) In this way, an embodiment of the present invention provides for efficient selection of a sphere radius r to be utilized in the efficient sphere detection (SD) algorithm, and for selectively increasing the size of the sphere in the event the initial radius r results in an empty set.
(25) Parallel Breadth-First Search Sphere Decoder Algorithm
(26)
(27) A benefit of this approach is that this criterion can be expressed for all signals s.sub.k simultaneously using a matrix formulation based on a matrix-matrix multiplication computational kernel. In one embodiment, the matrix P.sub.k is defined as follows:
P.sub.k=
where y.sub.k is the matrix which contains the vector y.sub.k duplicated M.sub.cG.sub.k-1 times. The benefit of this approach stands in contrast with a depth first search algorithm, in which the criterion would be based on a matrix-vector multiplication kernel, which requires memory-bound operations.
(28) The matrix 1×M.sub.cG.sub.k-1 contains the square Euclidean norm of each vector P.sub.K,i in matrix P.sub.k. The previous criterion is reduced now to verifying if the weights in the matrix P.sub.k are lower than r.sup.2 or not (i.e., included under the sphere). For each evaluation, the matrix M.sub.k is generated and the set L.sub.k is derived. If the set L.sub.k is empty, the radius r is increased and the algorithm is started again. For the case of k=1, the process is the same as considering L.sub.0, which corresponds to the root node which is an empty set. Upon reaching the last level, l.sub.M, the minimum weight in matrix P.sub.M is searched and from the matrix M.sub.M the solution s.sub.ML is derived.
(29) According to one embodiment of the present invention, the following algorithm is employed for the case where M≤N. The algorithm is reproduced below:
(30) TABLE-US-00001 Algorithm 1 Parallel Standard BFS Tree Algorithm Inputs: Received signal: y Constellation order: Ω; Channel estimation: H Noise variance estimation: σ.sup.2 QR Decomposition: H = QR Preliminary:
(31) In one embodiment, the sample algorithm illustrated above is implemented by a graphical processing unit (GPU) due to the high memory throughput in terms of bandwidth (byte/s) and high computation rate in terms of floating point operations per second (flop/s). However, most GPUs utilize memory located off-chip, which is accessed through a (relatively) slow PCIe link that operates at a bandwidth substantially lower than the internal GPU bandwidth. As a result, for parallel performance it is important to reduce the data off-loading between the CPU/device memory by reusing the freshly moved data on the GPU board as much as possible. A benefit of utilizing the breadth-first tree search algorithm outlined above is that operations involving multiple subsequent levels of the tree are cast into a single matrix-matrix multiplication kernel.
(32) In addition, one of the challenges of implementing the parallel breadth first search tree resides in the programmability and/or productivity. Scientific codes can be accelerated on GPUs through complier directives, CUDA programming model, or accelerated libraries (e.g., NVIDIA cBLAS). In one embodiment, the breadth-first search algorithm relies on the NVIDIA implementation of the matrix-matrix multiplication kernel from cuBLAS to deliver high performance computing.
(33)
(34) At step 302, a matrix V.sub.k is defined that has a size (k, M.sub.cG.sub.k-1). The matrix V.sub.k is comprised of all the vectors s.sub.k-1 that are included within the set L.sub.k-1. That is, the matrix V.sub.k is comprised of all the vectors of symbols from levels 1 through k−1 that have been determined to be located within the selected radius r of the sphere. The number of rows in the matrix V.sub.k is set to k instead of k−1 because the vectors s.sub.k-1 further includes the next possible symbol selected from the constellation of symbols selected from the set Ω. The number of columns is defined as M.sub.cG.sub.k-1, wherein M.sub.c defines the number of symbols in the constellation set Ω and G.sub.k-1 represents the number of vectors at level k−1 determined to be located within the radius r of the sphere. Because each vector located at level k−1 may be followed by any one of the constellation symbols defined within the set Ω, the number of rows within the matrix is defined as M.sub.cG.sub.k-1. For example, if G.sub.k-1 equals four and the number of symbols in the constellation set Ω equals three, then the number of columns is equal to the twelve. In this way, matrix V.sub.k represents the symbols selected from levels one to k−1 that are located within the radius of the sphere, as well as all possible symbols located at level k that need to be evaluated.
(35) At step 304, the matrix y.sub.k is generated which contains the vector y.sub.k duplicated M.sub.cG.sub.k-1 times. The vector y.sub.k represents the received signal, and this vector is duplicated M.sub.cG.sub.k-1 to provide a matrix of the same size as V.sub.k defined at step 302.
(36) At step 306, the matrix P.sub.k is defined as equal to y.sub.k−R.sub.kV.sub.k. If the depth first algorithm was utilized to determine this criterion, then a matrix-vector multiplication vector would be utilized, which would require memory bound operations. In contrast, utilizing a breadth first algorithm allows this criterion to be determined using a matrix-matrix multiplication kernel, which is compute bound and therefore runs close to the theoretical peak performance of the system. The 1×M.sub.cG.sub.k-1 matrix contains the square Euclidean norm of each vector P.sub.K,i+ in P.sub.k. The previous criterion is reduced now to just verifying if the weights in the matrix P.sub.k are lower than r.sup.2 or not (i.e., located within the sphere).
(37) At step 308, for each evaluation we generate the matrix M.sub.k and we derive the set L.sub.k that indicates those values that satisfy the radius criterion (i.e., fit under the sphere).
(38) At step 310 the set L.sub.k is evaluated to determine if the set is empty. If the set L.sub.k is empty, then the radius r is increased at step 312 and the decoding process is restarted at step 302. If the set is not empty, then at step 314 the minimum weight is searched in matrix P.sub.M and the maximum likelihood solution s.sub.ML is derived from the matrix M.sub.M. The ML solution s.sub.ML represents the signal transmitted from the transmitters to the receiver as determined by the efficient SD decoder algorithm.
(39)
(40)
(41) In the embodiment shown in
(42) In the embodiment shown in
(43)
(44)
(45) In addition,
(46) In this way, the present invention provides a breadth-first-search (BFS) tree in the context of an efficient non-linear sphere decoder algorithm. In particular, the present invention utilizes the parallelism presented by breadth first searching to cast operations involving multiple subsequent levels of the tree into a single matrix-matrix multiplication kernel. Implementation of these operations on a GPU allows for compute-bound processing of the efficient non-linear sphere decoder, as opposed to more costly memory-bound processing.
(47) While the invention has been described with reference to an exemplary embodiment(s), it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment(s) disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.