A review on principles, performance and complexity of linear estimation and detection techniques for MIMO systems

The advent of the fifth generation (5G) of mobile networks has introduced several new use cases that are pushing mobile networks in environments that are typically supported by wired technologies. The initial discussions around the sixth generation (6G) of mobile networks signalizes that different approaches are needed to address all contrasting requirements, where multiple-input multiple-output (MIMO) technique stands as a key technology for most future wireless systems. In this review, we present an introduction on classical linear estimators and coherent detectors along with an innovative and accurate complexity formulation within a common framework, allowing a fair comparison and providing an initial guideline for researchers that are looking for a general view of the main techniques available for spatial multiplexing (SM)-MIMO detection and estimation.


Introduction
Since the first generation (1G) of mobile communication in the 1980s until today, the field of digital communication has tremendously evolved both in capacity and reliability (Giordani et al., 2020). The emerging fifth generation (5G) is driving mobile communication systems towards an unprecedented evolution in terms of flexibility, data rate and latency, enabling wireless networks to support applications that are typically backed by wired technologies. The scenarios for the sixth generation (6G) are even harder to achieve considering the foreseen increase in flexibility, while supporting conflicting requirements for several applications in different verticals, besides higher data rates, higher coverage, higher frequency bands and extreme low latency. It is clear that future mobile networks cannot rely on a single radio access network to fulfill all these requirements. Different approaches are needed to address all requirements, but multiple-input multiple-output (MIMO) schemes represent a key technology for most future wireless systems. For example, in the agribusiness scenario, high data rates are necessary to transmit multi-spectral videos in infrared, ultraviolet and visible light in real time from drone to the cloud. In industry 4.0, very low latency is necessary for controlling robots and synchronizing autonomous actions with humans on the plant floor. MIMO can provide the necessary bandwidth, reducing the frame duration and increasing the robustness for data with a very short life span. Communication systems employing MIMO techniques present significant advantages when compared with traditional SISO architectures, i.e., mitigating fading effects by creating spatial diversity or exploiting channel scattering to achieve higher data rates by spatial multiplexing means. MIMO systems with detection schemes that can harvest diversity and multiplexing gains are able to improve the throughput, increase coverage and reduce the outage probability at the same time. Although the mentioned advantages are appealing features in the future mobile communication context, they are accompanied by demanding drawbacks such as uncorrelated transmission channels requirement, in order to avoid weak conditioned channel matrices, high signaling coordination on MIMO channel estimation, considering each individual transmitting antenna and higher complexity for the network nodes. The challenges imposed by the mobile communication channels require complex processes on the receiver side to recover the information with a desired quality of service (QoS). Among these processes, the estimation and detection algorithms tailored for mobile communications deserve special attention, since the interest on these techniques is growing constantly with the adoption of new schemes, such as MIMO, reconfigurable intelligent surfaces (RIS) (Tapio et al., 2021) and new waveforms (Zhang et al., 2017).
In the digital communications context, the estimation process is associated with choosing a hypothesis from an uncountable infinite set of hypotheses, according to some predefined criteria. One interesting example in communication systems is the channel estimation, where an estimate of the channel impulse response or the channel frequency response must be obtained on the receiver side with minimal mean-squared error (MSE). On the other hand, the detection procedure relates to making a decision from a countably finite set of hypotheses following, again, some established criterion. Another interesting example in the digital communication system is the discrete data detection. In this case, the detector must decide in favor of one of the M possible data symbols from a discrete sample space (or constellation) using the maximum likelihood or the minimum distance criteria (Gallager, 2008). The detectors and estimators analyzed in this paper are used to recover data in the downlink of mobile networks, where the base station (BS) and the user equipment (UE) employ orthogonal frequency division multiplexing (OFDM) with multiple transmit and receive antennas for space multiplexing. The channel is assumed to be frequency-selective and time-variant, with coherence time larger than the symbol block length and coherence bandwidth larger than one subcarrier.
Since these topics are widely studied, the number of different techniques available in the literature can be overwhelming. Hence, this paper aims for presenting an introduction on two critical tasks of mobile communication physical layer (PHY), which are: a) the linear estimation algorithms, suitable for channel state information (CSI) estimation and equalization and; b) detection and non-linear estimators schemes, in order to surpass the limitations of linear estimators in certain applications, i.e., spatial multiplexing (SM)-MIMO (Foschini, 1996). The analysis of beamforming MIMO and non-coherent detectors are beyond the scope of this intro work and can be addressed in (Ahmed et al., 2018;Mahmoud and El-Mahdy, 2021) for the interested reader. Other studies are also available in the literature as follows.
An extensive survey on detection algorithms related to massive MIMO can be found in (Albreem et al., 2019), where well-known linear detectors, including linear equalizers and suitable iterative methods as alternatives to avoid matrix inversion, are characterized according to its performance and complexity profiles. This reference also chronologically lists pertinent publications on MIMO subject. Similarly, in (Albreem et al., 2021a), low complexity linear detectors employing different numerical solutions for the large matrix inversion problem are evaluated, comparing its respective estimated computational cost and resource utilization in a system level deployment.
Still related to this topic, a recent overview on precoding techniques for massive MIMO is conducted (Albreem et al., 2021b). The idea behind precoding is to simplify the detection task on the receiver terminal, transferring its complexity to the transmission side, at the BS, which is in charge of precoding the transmitted information, employing suitable and eventually heavy digital pre-processing. This work also lists features, advantages, challenges and research opportunities related to massive MIMO. Moreover, it discusses and summarizes different precoding algorithms such as linear and non-linear precoders, RIS based precoder and machine learning deep neural network precoding. It concludes that, although linear precoders suffer from performance deterioration under certain scenarios, they still play a crucial role in the transmitter design due to their relative simplicity. This conclusion reinforces the importance of linear estimation concepts discussed here.
In (Jang et al., 2021;Pereira de Figueiredo, 2022), machine learning is pointed out as an important enabling technology for applications such as MIMO, massive MIMO and beamspace MIMO in the millimeter wave. It is notable that machine learning is gathering special attention from the academic community thanks to its potential to replace statistical driven solvers by generalist and adaptive learning-based techniques. Machine learning based MIMO detectors can even outperform the classic approaches when the channel statistical behavior does not match the model considered in the design of the model-driven detectors. Crucial tasks as channel estimation and detection (Ye et al., 2018), resource management (Hussain et al., 2020) and other optimization problems (Dahrouj et al., 2021) are being revisited under the machine learning perspective and, undoubtedly, will play an important role in 6G (Kaur et al., 2021;Sarkar and Debnath, 2021).
An accessible overview encompassing the state-of-the-art solutions to the detection problem is available in (Larsson, 2009). Prominent linear equalizers and detectors are investigated in (Kobayashi et al., 2014), presenting bit error rate (BER) performance analysis and computational complexity comparison under the assumption of different channel correlation scenarios.
In (Jalden and Ottersten, 2005), the sphere detector (SD) is examined, presenting the respective complexity in terms of the number of visited nodes, culminating in the definition of lower and upper bounds for the computational cost given the channel matrix dimension, constellation size and signal-to-noise ratio (SNR).
In this review article, regarding estimators and equalizers, we concentrate in the linear minimum mean squared error (LMMSE) estimator and its low cost iterative variances (Sayed, 2003;Proakis, 2007) as feasible alternatives to address the large matrix inversion problem. Moreover, the component-wise conditionally unbiased (CWCU)-LMMSE (Lang and Huemer, 2015) is considered as an appropriate equalization method featured by an additional weighting process leading to a constrained unbiased estimation. We also present detailed derivations for these estimators and its respective MSE, allowing the reader to investigate the estimation error performance along the Bayesian Cramer-Rao bound (BCRB) as a an absolute reference. Besides this analysis, computational complexity is formulated and further evaluated, seeking to identify which techniques present reasonable performance and affordable computational cost.
Regarding the detector, we focus on the renowned and relevant detection methods for the spatial multiplexing multiple-input multipleoutput (SM-MIMO) applications, encompassing the Maximum Likelihood Detector (MLD), the minimum mean squared error (MMSE)-ordering successive interference cancellation (OSIC) (Hampton, 2013), the SD (Hassibi and Vikalo, 2005) and the iterative MMSE-Parallel Interference Cancellation (PIC) detector (Bensaad et al., 2013), presenting a compiled reviewing about these techniques together with each algorithm description. Furthermore, respective computational cost is derived, allowing one to conduct a complexity comparison alongside a performance analysis through a Monte Carlo BER simulation.
One of the main contributions of this tutorial work relies on.
• gathering and compiling the principles of classical estimators and detectors, in a uniform and accessible way, supported by detailed description of the procedures of each analyzed technique; • performing a complexity comparison among the techniques in terms of float-point operations (FLOPs) under a common framework, allowing a fairly analysis and providing an interesting reference for researchers starting in this field.
In order to achieve these goals, the remaining of this manuscript is structured as follows: Section 2 presents a simplified model to linearly describe a generic and orthogonal digital communication PHY that will be used as a reference along this paper. Section 3 brings the background on complexity analysis of algorithms employed in estimation and detection problems. In Sections 4 and Sections 5, in this order, we present the main techniques for linear estimation and detection, their derivations, and estimated complexity. Section 6 presents numerical examples and performance analysis of the techniques described in Sections 4 and Sections 5, but tailored to specific applications on low order SM-MIMO, such as equalization and detection. Finally, Section 7 presents the conclusions by highlighting the main findings and results achieved in this paper.
Notation: Matrices and vectors are written in boldface uppercase and lowercase as X and x, respectively. A random variable observation is represented as a sub-indexed lowercase x i . The notation E{·} is the expected value of a given random entity. The exponent (·) H is the transpose and conjugate (Hermitian) operator. The mean and the covariance of a given random vector, e.g. x, are defined as μ x E{x} and C xx E{xx H } − E{x}E{x} H , respectively. The operator diag (·) gives the diagonal of a square matrix or, when the argument is a vector, it returns a square matrix whose diagonal is populated with the vector elements. The symbol ∇ α,β f(x) represents the differentiation of the function f(x) with respect to α and β. The Frobenius norm of a generic matrix A is defined by A 2 ≜ i,j |A ij | 2 , with A ij being the elements of matrix A. The set of real and complex numbers are denoted by R m×n and C m×n , respectively, where m and n are the dimension size of the numerical structure.

System model for the digital communication PHY
This section introduces the concept of an orthogonal multicarrier digital communication PHY as a linear model, which is employed in following sections of this paper.
In a broad sense, a digital communication PHY is responsible for: a) adapting the digital information to a waveform that is transmitted to one or more receivers through a communication channel, and; b) for retrieving the information on the receiver side from the distorted and noisy version of the transmitted signal. Both transmitter and receiver are designed based on the communication channel characteristics as noise and fading statistics, average scattering pattern, coherence time, coherence bandwidth and the impairments introduced by the transmitter's and receiver's RF front-end, among others. Specifically for a modern mobile communication system, the PHY must deal with double-dispersive MIMO channels, where each path between one transmitting and one receiving antenna is modeled as a time-variant and frequency-selective impulse response. We consider a scheme employing n transmitting antennas and m receiving antennas as a generalization of the mobile communication system, as it embraces more simplified arrangements, e.g., the usual soft-input soft-output (SISO) when m = n = 1. It is worth to mention that, assuming a SM-MIMO case, when m = n ≥ 2, inter-antenna interference (IAI) takes place once each receiving antenna might collect signals from more than one transmitting antennas. In this case, the detection method should be carefully chosen to take the trade-off between performance and complexity into account. Figure 1 illustrates a simplified wireless communication system assuming this scenario.At the transmitter side, the Bit Encoding block receives the data bit sequences and protects them by applying randomization, forward error correction (FEC) (Ryan and Lin, 2009) and interleaving, aiming to increase the system robustness against the adverse effects of the mobile channel. The resulting coded bits are then fed to the Waveform Modulator block, where different techniques may be used, e.g., symbol modulation and multicarrier techniques, leading to specific waveforms tailored for mobile MIMO channels. The channel block introduces time and frequency fading and it combines the transmitted signals at each receiving antenna, besides adding the additive white Gaussian noise (AWGN). On the receiver side, the Waveform Demodulation block is responsible for performing the time and frequency synchronization, waveform demodulation, antenna decoupling and data symbol estimation, while the Bit Decoding block is responsible for correcting the bit errors that might be introduced by the channel. In SISO iterative detection, the Bit Decoding performs soft demapping in order to obtain soft coded bit sequences, in the form of Log-Likelihood Ratios (LLRs), carrying a sort of confidence measurement about the bit sequence. The LLRs are fed backwards to Waveform Demodulation block as a-priori information, subscriptedb in Figure 1, leading to a more refined symbol estimation at each iteration. After an appropriate number of iterations, the bit information is decided from decoded LLRs. In non-iterative detection, the Bit Decoding block directly decodes the bit sequences once. The recovered binary sequence is finally delivered for the data sink (user application).
It is worth to clarify that, despite of its importance in the communication research field, the availability of studies involving channel coding techniques are wide and easily found in literature, thus, beyond the scope of this work. Instead, this work relies on available algorithms in (Glavieux and Vaton, 2007;Ryan and Lin, 2009), such as Convolutional Codes (CCs) and A-Posteriori Probability (APP) decoding.
Next, we describe the matricial notation for the orthogonal multicarrier MIMO, since this scheme is one of the most popular solution for the air interface in mobile communication systems.

Orthogonal multicarrier MIMO PHY
Consider the system from Figure 1, where n × m antennas may be employed, resulting in n × m paths between the transmitter and receiver. The channel impulse response (CIR) between the jth transmitting antenna and the ith receiving antenna with L taps is represented byh i,j ∈ C L×1 .
Assuming that an uncoded and non-precoded OFDM is employed as multi-carrier modulation scheme to transmit n parallel streams of N s Quadrature Amplitude Modulation (QAM) symbols, mapped into K on active subcarriers from a total of K. Let x j ∈ C K×1 be the data vector in frequency domain transmitted by the jth transmitting antenna with j = 1, 2, . . ., n and K on ≤ K. The transmitted samples is obtained performing an inverse discrete Fourier transform (DFT), leading to ( 1 ) where F K ∈ C K×K is the unitary Fourier matrix. In order to protect the transmit samples χ j ∈ C K×1 against the intersymbol interference (ISI) introduced by a dispersing channel, a cyclic prefix (CP), consisting in the last N cp ≥ L samples of (1), is appended to its  Frontiers in Communications and Networks frontiersin.org 04 beginning, leading toχ j ∈ C Nt×1 where N t = K + N cp . The transmission is performed concatenating the sample blocks into a sample based sequenceχ j (t).
At the receiver side, the received signal from the ith antenna at instant t can be represented as where i = 1. . .m and ω i (t)~CN (0, σ 2 ω ) is the AWGN at the ith receiver antenna.
Assuming perfect time and frequency synchronization, the demodulated signal from the ith receive antenna is y i ∈ C K×1 given by where γ i ∈ C K×1 is the received symbol without the CP and given by In (4), H i,j ∈ C K×K is a circular convolution matrix obtained from the channel impulse response between the jth transmit and ith receive antennas and ω i ∈ C K×1 is the AWGN vector.
Introducing 1) and 4) into 3) yields to where H i,j ∈ C K×K is a diagonal matrix whose elements are the channel frequency response (CFR) between the jth transmit and ith receive antennas and w i ∈ C K×1 is the noise in the frequency domain at the ith receive antenna.
From (5), the received signal for the kth subcarrier at the ith receive antenna is where k = 1. . .K on is the subcarrier index and H i,j [k]~CN (0, 1) is the kth diagonal element from matrix H i,j , representing a Rayleigh fading channel applied over the kth QAM symbol x j [k]. Equation 6 allows us to define the received signal at subcarrier k for all m receiving antennas through a matrix structure considering all m × n antennas as where y[k] ∈ C m×1 is the received signal for all m receiving antennas at the kth subcarrier, H ∈ C m×n is a flat fading channel matrix between each transmitting and receiving antennas for the kth subcarrier, x ∈ C n×1 represents the data symbols transmitted by the kth subcarrier and w[k] ∈ C m×1 is the noise in the kth subcarrier. Indeed, 7) can be seen as a system factorization, where the orthogonal multicarrier detection problem splits into K on subsystems with dimension m × n, m ≥ n. In general, for linear estimators employing matrix inversion, the expected complexity order is K on m 3 . It is easy to note that, for high order MIMO applications, where dozens or even hundreds of antennas are used, not only solving the entire system requires prohibitive computational cost but also other challenging aspects arise, e.g., high signaling coordination on MIMO channel estimation, considering each transmitting antenna. It is worth to mention that, for systems exploring diversity gain and employing a sufficiently high number of receiving antennas, an especial situation named channel hardening takes place, where the resulting fading channel behaves as deterministic (Gunnarsson et al., 2020). In this sense, a simple linear detector, e.g., the zeroforcing (ZF) detector, approaches the MLD performance while granting a significant simplification in the system implementation.
From now on, suppressing the subcarrier index in (7), for notation simplicity, yields to a linear model representation of the orthogonal multicarrier MIMO PHY, for a given subcarrier, that is widely used in next sections.

Linear model
Taking into account that the coding and decoding blocks can be seen as independent and separated functions, the linear model that represents the communication chain of the orthogonal system is given by where y ∈ C m×1 is an observable random vector, H ∈ C m×n is a linear transformation matrix and x ∈ C n×1 is the non-observable vector, which an estimate is desired amid the presence of the noise vector w ∈ C m×1 . In (8), we omitted the corresponding resource and antenna indexes for simplicity of notation. According to Section 2.1, the proposed model consists in a CP protected orthogonal multicarrier scheme resulting in a equivalent block-fading channel from subcarrier perspective, where each element of H is a flat fading scaling factor associated with each m × n path between all transmit and receiving antennas. The probability density function (PDF) of the random elements of H and w are assumed to be complex Gaussian.

Description Notation FLOPs
Matrix-Vector Prod.
Frontiers in Communications and Networks frontiersin.org For the linear model in (8), and also assuming that x and y are independent and identically distributed (iid), the cross covariance and the covariance matrices can be described in terms of the independent random variable (RV) as and C yy HC xx H H + C ww .
These matrices are, in general, part of the linear estimation solution and are also present in the associated error covariance matrix of the estimator, which is a common parameter for performance analysis.

Definitions on complexity evaluation
Before describing the linear estimation and detection techniques, we present a comprehensive review on the algorithms complexity analysis in order to allow a comparison among techniques. Table 1 summarizes the amount of FLOPs demanded on common matrix algebra (Hunger, 2007) and the implicit Householder QR factorization (HQR) (Trefethen and Bau, 1997). In this context, ϕ ∈ C m×1 , Θ ∈ C n×m and Φ ∈ C m×p are arbitrary matrices, while Ω ∈ C m×m is a diagonal matrix and Λ ∈ C m×m is a positive definite matrix. For the HQR, we consider Γ ∈ C m×n with m ≥ n. The FLOP account for the complex float-point operations (CFLOPs) in (Hunger, 2007) considers that a complex summation consists of only 2 FLOPs (2 real summations), a complex multiplication requires 6 FLOPs (4 real multiplications and 2 real summations), a complex square takes 3 FLOPs (2 real multiplications and 1 real summation), a complex square root (Mopuri and Acharyya, 2017) demands 10 FLOPs (2 real multiplications, 2 real divisions, 3 real summations and 3 real square roots) and a complex division takes 11 FLOPs (6 real multiplications, 2 real divisions and 3 real summations).
Moreover, Table 2 presents some useful finite sum identities (Rosen, 2010), which were widespread applied on the complexity formulation for the investigated techniques.

Linear estimation techniques
In this subsection, we describe the most relevant techniques on linear estimation that are commonly used in digital communications systems, such as synchronization, channel estimation, and equalization.

Linear minimum mean squared error estimator
The LMMSE is a simple implementation of the classic MMSE estimator that presents low complexity and sub-optimal performance once it is constrained to be linear. The LMMSE estimator (Huemer et al., 2017;Albreem et al., 2021a) for a random vectorx isx where A l and b l are, respectively, a coefficient matrix and an offset vector that minimizes the MSE for the linear estimation of x. Subindex l is used as an identification for the LMMSE estimator. The MSE matrix stores the covariance among the desired random vector and its estimation as where E l ∈ C n×n . The parameters A l and b l , s.t. A l ∈ C n×m and b l ∈ C n×1 , can be obtained introducing (11) into the optimization problem which is detailed in Supplementary Appendix SA, yielding to and Notice that for a zero mean parameter μ x , (15) reduces to a null vector. Introducing (14) and (15) in (11) results in Applying the expectation operator on (16), allows us to verify that the LMMSE estimator is globally unbiased, as E{x l } E{x}, where the attribute global indicates that the unbiasedness condition is made on the whole parameter vector x.
The associated computational complexity of (16), in terms of FLOPs, is It is worth to mention that, for the specific case when n = m and H is diagonal, the operations in (16) involves only diagonal matrices, reducing the overall complexity to O l (47n). The error covariance matrix of the LMMSE estimator is given by whose diagonal represents the estimation error variance onx.

Steepest-descent estimator
The steepest-descent (STPD) is an iterative procedure that can be applied to LMMSE and other estimators non-subject only to MMSE performance criteria. In particular, applying the STPD procedure to the LMMSE allows to reduce its overall complexity,

Index
Identity expression Frontiers in Communications and Networks frontiersin.org 06 once no matrix inversion is necessary. As a result, the STPD scheme can achieve the same MSE performance as the LMMSE. Consider the STPD linear estimator for (8) aŝ where A s and b s are, respectively, the final coefficient matrix and an offset vector that recursively converges to the MMSE criteria for the linear estimation of x. The sub-index s denotes the STPD linear estimator. The parameter A s ∈ C n×m can be approximated in a interactive fashion by where t is an iteration index and δ is the step-size parameter which defines the behavior of the recursive algorithm. This is a crucial parameter and should be chosen carefully to guarantee a reasonable trade-off between convergence speed and stability (Sayed, 2003).
which allows to express (19) at iteration t aŝ and finally, the instantaneous MSE E st is given by As A st converges to the optimal solution given by (14), (23) approaches the final MSE E s , which is equivalent to (18). Supplementary Appendix SB describes the procedure to obtain these results. It is worth to notice that the complexity to compute (19) is reduced once (20) requires no matrix inversion, although a certain amount of iterations must be considered in order to achieve convergence. Furthermore, as in the LMMSE, the prior knowledge of the signal statistic information is necessary to compute (20) and (21). The complexity of computing (22), in terms of FLOPs, is Notice that O s refers to the computational effort per iteration. For the diagonal case, when n = m and H is diagonal, the complexity per iteration reduces to O s (42n). In the next subsection, a popular algorithm based on STPD is presented as an option to further reduce the complexity and does not require previous knowledge on the exact signal statistics.

Least mean squares estimator
The least mean squares (LMS) is a widely employed adaptive algorithm solution involving unconstrained optimization problems in linear estimation. It is based on stochasticgradient method that is obtained from steepest-descent implementation with a suitable approximation (Sayed, 2003). This approach dismiss the need to know the exact signal statistics and also performs its role through an iterative learning and tracking mechanism.
Retrieving from (9), (10) and (20), the LMS recursion can be rewritten, replacing those statistic parameters by its instantaneous approximations, leading to Typically, the mean of the observable RVs are not known a priori and, in order to calculate b lms , it is necessary to employ some estimation method, as the local-mean estimator (LME) (Lin and Unbehauen, 1991), then estimate the mean information. It is also worth to highlight that, although it is not necessary to know exact values of C xy , C yy , μ x and μ y , it is mandatory to grant access to x t and y t while processing the recursion. In practical applications, such as channel estimation and equalization, this requirement can be addressed employing a training sequence, also known as pilot symbol, whenever the estimation procedure demands an update (Sayed, 2003;Guimarães, 2010). Thus, the LMS parameters can be defined as whereμ y is locally estimated using a recursive moving average structure as the leaky integrator (Prandoni and Vetterli, 2008) whilẽ μ x can also be estimated in the same way or, in case of employing a known sequence, it can be previously defined. Considering the localmean estimation approach, these parameters can be obtained through the following recursions.
where ζ = (M w − 1)/M w is the real valued parameter that depends on the moving average window of length M w . Finally, the LMS estimator is given bŷ The instantaneous estimation error is obtained directly from x t and x lmst samples, which can be evaluated as and, as the algorithm converges, its magnitude decays. Analyzing (28), while approaching the convergence state, the update term also decays and, under some conditions, i.e., the choice of a stable-convergent update coefficient, the final coefficient matrix is approximately given by By replacing (34) into (32) and taking the expectation of (33), we can obtain the expected MSE matrix of the LMS algorithm as which already includes the estimator update task, incorporating the computational cost from (28) and (29). As these equations depend only on the instantaneous observations of the vectors x t and y t , even when the transformation matrix H is diagonal, this condition does not reduces the complexity, once the instantaneous approximations C xty t andC y t y t are evaluated through the outer products in (26) and (27).

Component-wise conditionally unbiased LMMSE estimator
The CWCU-LMMSE is a constrained linear and conditionally unbiased version of MMSE estimator, where the conditional expectation of each estimated componentx ci is individually forced to be equivalent to its associated indirect observable RV x ci (Huemer et al., 2017). The CWCU-LMMSE performs unbiased estimation although it can not outperform the LMMSE estimator in a MSE criteria, since it seeks to minimize the MSE under additional constraints. We assume the already defined vectors x and y as proper and jointly complex Gaussian random variables, where proper means that the real and imaginary parts are uncorrelated with equal variances. The sub-index c is used to designate the CWCU-LMMSE estimator, which can be defined aŝ where A c and b c are, respectively, the coefficient matrix and the offset vector that minimizes the MSE with the additional constraint E{x ci | x ci } x ci . The solution for this unbiased linear estimator, for complex proper Gaussian x c , with mutually independent elements, is described in detail at Supplementary Appendix SC and it is given by where A l is the coefficient matrix of the LMMSE estimator and are the real valued elements of the weighting diagonal matrix D ∈ R n×n , which, for x with mutually independent elements, can also be expressed as The diagonal matrix D can also be obtained performing D C xx diag(diag(A l HC xx )) −1 . Considering the reduced complexity on computing the diagonal matrix inversion and also that A l and the resultant matrix multiplication HC xx are already available, the required computational effort on obtaining D is Applying (38), (39) and (41) in (37) allows us to represent the CWCU-LMMSE estimator aŝ whose computational complexity in terms of FLOPs can be obtained from (17), taking into account the weighting diagonal matrix D, yielding to O c 4m 3 + 16m 2 n + 8n 2 m + 8m 2 +4n 2 + 26nm + 10m + 12n.
For the case when n = m and H is diagonal, the complexity decreases to O c (82n). The error co-variance matrix of the CWCU-LMMSE estimator is described in Supplementary Appendix SC and it is given by

Maximum a-posteriori estimator
The maximum a-posteriori (MAP) estimator is an useful approach to incorporate prior information into the data estimation problem. The MAP estimator is based on posterior probability maximization and it is close related to MAP hypothesis testing (Yates and Goodman, 2005). Although the MAP estimator could, in some cases, result in optimum estimates, like the MSE, it can be cumbersome (or even prohibited) to determine the exactly posterior probability function. In general, the MAP estimate of the RV X given the observation Y = y isx where f X|Y (x|y) is the conditional distribution function of X given Y = y.
Retrieving the definition of the complex Gaussian random vectors x and y from Section 2.2, which are connected through the linear model in (8), its conditional PDF is given by the bi-variate complex Gaussian distribution (Andersen et al., 1995) as where det (·) is the determinant operator, andμ μ x + C xy C −1 yy y − μ y .
Analyzing (47), it is clear that its maximum probability occurs when its exponent is null, hence, when x μ. This condition yields tô Finally, replacing 9) and (10) in (50) leads to Frontiers in Communications and Networks frontiersin.org 08 which, in this specific case, resolves to the LMMSE estimator in (16). Hence, its computational complexity is equivalent to (17), while the error covariance matrix corresponds to (18) and is given by

Bayesian Cramér-Rao bound on estimators
The BCRB defines a physical lower bound for MSE performance, which is helpful to classify whether a given estimator attains the BCRB criteria.
Since all our information is embodied in the observed data and, eventually, in our prior knowledge about the unknown parameter, the estimation accuracy depends directly on its PDFs (Kay, 1993).
The BCRB is defined as a lower bound for the MSE matrix and it is related to the inverse of the Bayesian Fisher information matrix (BFIM) I B (Trees and Bell, 2007), which means that where the matrix inequality means that E BCRB − I −1 B is a non-negative definite matrix. The elements of I B , whose row and column indexes are given by the subscripts p and q, respectively, are obtained by From the linear model in (8), replacing the corresponding PDFs and in (55), allows to define the BCRB as

Detection techniques
In this section we present the MLD as a reference for the performance and complexity analysis once, considering the use of multiple antennas on transmitting and receiving simultaneous streams, is able to achieve optimal performance, minimizing the BER although showing an unfeasible complexity depending on system parameters. As a countermeasure for the prohibitive computational cost, we also review two known detection techniques, one sub-optimal and one close optimal, both with affordable complexity.

Maximum likelihood detector
As presented in Section 2, the detection task at the receiver is basically a decision process applied on the received signal in order to recover the transmitted message.
Retrieving the system model presented in (8), with m ≥ n, we admit now that x holds a sequence of discrete symbols from an alphabet D ∈ C M×1 , with M distinct members. Under the common assumption that all code words from D are equiprobable and, in the case where w is AWGN, the maximum likelihood (ML) detector criterion yields to the minimization of the Euclidean distance (ED) (Bai and Choi, 2014), given byx wherex k ∈ C n×1 is the kth column vector from the set D n ∈ C n×M n , which contains all possible cross combinations of the elements in D, taken n at a time. In general, finding the optimal solution requires an exhaustive search over all M n hypothesis (Albreem et al., 2020), yielding to a n-exponential complexity order problem. Each hypothesis contains n symbols and the MLD can be seen as a lattice structure consisting of nM n nodes. The Algorithm 1 summarizes the MLD procedure whose complexity is mainly dictated by the squared ED computation of each hypothesis at line 3 and the comparison at line 5. In the Algorithm 1, H i is the ith row of the matrix H and d 2 k is the squared ED of the kth hypothesis. The required computational effort on MLD, in terms of FLOPs, is Supposing an alphabet with 16 distinct elements and n = 8, the MLD needs a total of 4.29 × 10 9 hypothesis tests in order to detect x ml . Although optimal, in certain applications the MLD requires a prohibited computational effort (Albreem et al., 2020). In following sub-sections, we present reduced complexity alternatives for the MLD at the expense of sub-optimal performance.

MMSE with ordering successive interference cancellation detector
The MMSE-OSIC is a non-linear estimator that attempts to improve interference cancellation in applications such as SM-MIMO, i.e., Bell Laboratories layer space-time (BLAST) variances (Hampton, 2013).

Frontiers in Communications and Networks frontiersin.org
This method basically starts organizing the rows of the received sequence y and the full-rank channel matrix H in ascending order of SNR. Afterwards, starting from the last row, which holds the symbol with higher SNR, decides by the most likely transmitted information at layer ℓ through ML detection. The estimated symbol from layer ℓ = n is then used, together with CSI, to remove its interference in the foregone layer ℓ = n − 1, prior to ML detection. This procedure is repeated until all layers ℓ = n − 1, . . ., 1 have been processed. Finally, original ordering is reestablished undoing the initial sort operation. The idea to organize the received sequence according to SNR is to avoid error propagation, possibility that might occur in case we start the successive interference cancellation (SIC) algorithm with an arbitrary low SNR signal (Golden et al., 1999).
The ordained received vector and CSI can be obtained performing and where P is a m × m binary permutation matrix, built based on SNR estimation for each y element. It has exactly one entry of 1 in each row and each column, with 0s elsewhere. For example, sorting a vector in ascending order according to a given estimated SNR sequence, e.g., (Zhang et al., 2017;Giordani et al., 2020;Tapio et al., 2021), would require a left multiplication by P 0 1 0 0 0 1 1 0 0 Underneath the common assumption that the constellation alphabet D is zero mean with normalized energy and w is AWGN, the estimated SNR at the ith receiver antenna after equalization iŝ where A i is the ith row of a linear equalization matrix, e.g., the LMMSE equalizer from (14). Hereafter, the SIC at layer ℓ is accomplished by interference cancellation and equalization through followed by ML detection x osicℓ argmiñ In (64), x o is the ordained received vector y o right after interference cancellation from upper layers, followed by equalization considering H oj , which is the jth columns of H o for j = [1, . . ., ℓ]. The interference removal occurs on lower layers only, when ℓ < n, employing already detected symbolsx osicℓ+1 and H oℓ+1 , which is the ℓ + 1 column of H o . In (65), the sub-indexed termx osicℓ is the ML detection result for layer ℓ, wherex k is the kth symbol constellation from the set D with dimension M × 1. It is worth to mention thatx osic is already correctly ordered. Algorithm 2 is proposed for MMSE-OSIC implementation.

Sphere detector
The SD (Fincke and Pohst, 1985) is an algorithm to address the non-deterministic polynomial-time (NP) hard integer least squares (ILS) problem and achieve optimal MLD performance with an average polynomial complexity (Hassibi and Vikalo, 2005).
The principle of SD is to reduce the exhaustive search procedure over all possible code words carried out by the MLD. This is accomplished restricting the search only on hypothesis where the distance from one possible code word are within a predefined radius of a high-dimensional sphere, where each hypothesis can be seen as a path with sequentially interconnected points in a finite tree structure. Whenever a path segment reaches a cumulative distance that exceeds the sphere radius, this segment and all subsequent points are discarded, yielding to a variable complexity reduction (Larsson, 2009).
In this way, depending only on the radius parameter ρ, a trade off between performance and complexity can be trimmed. If ρ is chosen sufficiently high and kept constant, all paths might be checked and the SD behaves like the MLD. If ρ is too small, this can result in non-eligible paths. In this situation, the procedure can be repeated with an increased radius. A practical approach is to initialize ρ = ∞ or based on a code word given by a low complexity technique (Dehghani Soltani et al., 2014), e.g., the ZF or MMSE detectors (Proakis, 2007) and update the radius whenever a better hypothesis is found during the search procedure.
We start introducing the HQR factorization (Koudougnon et al., 2011) expressed as QR = HQR(H) for the full-rank channel matrix H and admitting m ≥ n, where Q ∈ C m×m is an orthonormal matrix s.t. Q H Q = I m and R ∈ C m×n is an upper- Since Q is orthonormal, the noise distribution ofw is still AWGN and the detection problem is equivalent to (59), includingx k definition. Thus, the SD problem can be represented by

FIGURE 3
Intensity chart for the complex random variables estimation example employing LMMSE, STPD or MAP.

FIGURE 4
Intensity chart for the complex random variables estimation example employing LMS.
Frontiers in Communications and Networks frontiersin.org 11

FIGURE 5
Intensity chart for the complex random variables estimation example employing CWCU LMMSE. STPD and LMS convergence behavior for the proposed example.
which can be addressed through a point-sequence search algorithm (Agrell et al., 2002). This structure resembles a spanning tree, with a root node located at the top layer ℓ = n, spanning to M nodes in the immediately next layer ℓ = n − 1. In this way, each node from an upper layer connects to M nodes in subsequent beneath layer. Each layer connection or path section is defined here as a segment. A series of segments connecting a root node to one of the M n nodes at the final layer ℓ = 1 forms one distinct path among M n possibilities. The total amount of nodes, η SD , in a given spanning tree structure has a close relation with the SD complexity and can be obtained as The SD executes a top-down search along the tree while compute the squared node distance d 2 ℓ at layer ℓ under analysis given by where ℓ = n, n − 1, . . ., 1 is the layer index andx i is one of the M distinct constellation symbols chosen from the set D. The scalar r ℓ,i is an element of matrix R andỹ ℓ is the ℓth element of vectorỹ. When d (ℓ,s) 2 > ρ 2 , subsequent layers are pruned, leading the algorithm to move on next segment direction or beginning a new path search. It is worth to mention that, in order to avoid the square root in (70), this comparison can be evaluated on the squared distances. Algorithm 3 describes the SD mechanism to findx sd employing a recursive function SD(ℓ) along the tree search task. For simplicity, the algorithm initializes the parameter ρ 2 = ∞. The variable s = 1, . . ., M is the segment or constellation symbol index for each node hypothesis test. It is also assumed that, when necessary, some parameters are global accessible inside the algorithm environment. The task of finding an exact expression for the complexity of the SD is not trivial once it depends not only on the transmission channel matrix dimension but also on the sphere radius, which is, in turn, a function of the SNR. Indeed, the SD FLOPs account is a   Frontiers in Communications and Networks frontiersin.org random variable with expected polynomial complexity although, in the worst case scenario, it can reach exponential complexity (Hassibi and Vikalo, 2005). Considering the worst case, where all nodes of every segment are visited, the resulting complexity is given by where O HQR is the complexity of the HQR factorization at line 1 of the Algorithm 3, given by Table 1, and the term Oỹ is the matrixvector product in line 2 of the Algorithm 3. The last term in (71) represents the complexity of the recursive function SD(ℓ) with M n representing the total amount of segments in the spanning tree search that multiplies the summation over all layers, which represents a fully segment containing n nodes, connecting the uppermost layer down to layer one. Finally, O node is the complexity of one node inspection at layer ℓ, given by With the aid of the identities given in Table 2, an upper bound for the SD algorithm complexity is obtained as where the last term on (73) is the predominant computational cost associated with the recursive SD(ℓ) function. In a first glance, the worst case scenario for the SD algorithm exhibits higher complexity when compared with the MLD. This happens due the fact that every visited node requires the computation of the squared partial distances d 2 ℓ . However, as the SD is able to reduce its complexity to a polynomial degree, thanks to its segment pruning behavior based on partial distances, in practice, the SD complexity is smaller than the MLD complexity, specially at high SNR. On the other hand, the MLD will always exhibits an exponential complexity, since all hypotheses are always evaluated.
There are also some slight variants of the SD algorithm that seeks to achieve a reduced (Arfaoui et al., 2016) or even fixed complexity (Larsson, 2009) at the cost of sub-optimal performance. In (Burg et al., 2005), some suitable approximations and simplifications are admitted, leading to implementations with affordable complexities.

Iterative MMSE-PIC detector
This method relies on two concepts, parallel interference cancellation and SISO channel decoding (Studer et al., 2011), exchanging refined information between these two domains in a iterative form. Differently from SIC, where data symbols are individually detected removing the interference caused by already decided symbols, the PIC estimates all data elements sharing the same radio resource jointly. The PIC performs a detection on the jth element x j of x assuming that all its other elements, denoted by x \ j , are interfering terms. Rewriting 8) as where H j is the jth column of H while H \ j is H removing the jth column. Then, the PIC yields the signalỹ j given bỹ with μ x\ j denoting μ x removing the jth element,w j models the noiseplus-interference, with variance HC xx H H + σ 2 I. Based on (75), all j = 1. . .n systems can be solved in parallel, employing independent linear estimation processes, whose, in general, require m × m matrix inversions for each x j , normally yielding to cubic complexity order.
In (Matthe et al., 2016), the authors demonstrate that this process is equivalent to a single linear estimation exploring a-priori knowledge on x. Figure 2 illustrates the block diagram for the proposed iterative SISO MMSE-PIC detector. Its entry point considers, under the assumption of perfect synchronization and channel state information at the receiver (CSIR), the demodulated signal vector y, the equivalent MIMO CFR and the corresponding AWGN variance σ 2 w . The expectation and variance of x are estimated with the help of a linear estimation function, in this case, we consider the CWCU-LMMSE estimator, and the available a-priori information μ a x and C a xx . Initially, when no a-priori information is available, these parameters are initialized as a null vector and an identity matrix, respectively. Next, the resulting posterior estimates, μ p x and C p xx , respectively obtained employing (43) and (45), are used by the soft demapping function M −1 soft to obtain individual bit probability estimation for each data symbol, constrained by the constellation set D. Assuming uncorrelated noise and dismissing the necessity of previous knowledge about the bit sequence, the approximated LLRs, for a given subcarrier, are efficient obtained with negligible impact on the overall detection performance (Studer et al., 2011;Matthé et al., 2018) by where λ p i ∈ R μn×1 is the intrinsic LLR vector, j = 1. . .n, b = 0. . .μ − 1, with μ being the number of bits per symbol, D (0) b and D (1) b being the subsets of constellation symbols whose bth bit is 0 or 1, respectively. It is worth to mention that diverse LLRs sequences may be required when considering SM-MIMO systems employing frame structures that carry out integer multiples of a codeword, simultaneously transmitted by n antennas in multiple block symbols per frame. Thus, after properly gathering and organizing each codeword, soft decoding is performed in every iteration of PIC procedure. In case of random bit interleaving had been considered in the transmitter side, a de-interleaving function must be done prior to channel decoding. Afterwards, the same random bit interleaver operation is performed over extrinsic LLRs before soft mapping. The interleaving is an efficient option in order to improve the correction capacity. If the noise coming into channel decoding is highly correlated, then the convolutional decoder, commonly designed with a short constraint length, is more likely to make a decoding error than if the noise was independent. Since channel decoders are affected by burst errors, the interleaver spreads these errors out, allowing the decoders to operate with relative independent noise from bit to bit (Gallager, 2008).
The soft channel decoding is responsible for recovering the transmitted bit information from demapper estimates, obeying the code constraints. Usually, soft decoding applies algorithms able to exactly compute or approximate the APP of the information bits or, more generally, a reliability measure about each information bit. Hence, soft decoding results decoded LLRs, required in order to output re-encoded LLRs λ a i for a next iteration or by taking a final hard bit decisionb PIC . To improve the correctness of its decisions, both demapper and decoding have to be fed with information which does not originate from itself (Colavolpe et al., 2001), corresponding to the so called extrinsic LLRs λ e . Without loss of generality, we assume convolutional codes as channel coding technique, which is

Frontiers in Communications and Networks
frontiersin.org sufficient to analyze the PIC concept. Therefore, Soft output Viterbi Algorithm (SOVA) (Hagenauer and Hoeher, 1989) is also considered hereinafter. The extrinsic information gained during codeword domain processing, denoted by λ a e , is soft modulated according to (Studer et al., 2011). Let define the bth bit of the sth constellation symbol by b s,b ∈ {0, 1} with probabilities P[b s,b 1] [1 + exp(λ a e )] −1 and P[b s,b 0] 1 − P[b s,b 1]. Then, mean and variance of the a-priory constellation symbols are updated performing and diag C a This encloses the iterative MMSE-PIC loop, allowing to start a new iteration considering the a-priory information gained from soft decoding. Along successive iterations, both demapper and decoder procedures self benefit from exchanged refined information between each other. Algorithm 4 summarizes the required tasks for each iteration. With respect to computational effort comparison, it is reasonable to consider the estimation processes at line 2, detaining the majority complexity executing Algorithm 4. Since demapping and decoding are, in general, common tasks in every digital communication system, it is sufficient to express the MMSE-PIC detector complexity per iteration by The estimation stage employs up to K parallel instances of the CWCU-LMMSE estimator to solve each m × n linear systems, corresponding to the number of active subcarriers. Retrieving (44), the complexity order of Algorithm 4 is mainly dictated by a cubic behavior on the employed MIMO dimension. In order to further reduce the computational cost involved in MMSE-PIC detection, approximated estimators that avoid costly matrix inversion can be considered (Matthé et al., 2018;Zhang and Kim, 2019;Park, 2022).

Genie-aided detector
The MLD energy efficiency is commonly taken as reference for performance analysis involving alternative detectors. According to Section 5.1, the computational effort required to evaluate the ML detection is, in general, unfeasible, depending on system order, even for simulation purpose. In order to overpass this demanding condition, we can consider employing an hypothetical Genie-Aided Detector (GAD).
The GAD access additional transmission side information carried to the detector through a non-dispersive and unitary gain parallel channel, representing the genie (Eriksson et al., 1995). In the receiver side, upon decision of the transmitted data, the ED ϱ R between the detected informationx and the received signal y is compared with the ED ϱ T of the transmitted data x and y. If ϱ R < ϱ T , an optimal MLD would also detect the wrong data sequence, leading to a frame error. On the contrary, when ϱ R ≥ ϱ T , it is assumed that the MLD would have found the correct data sequence and the detector under analysis not, accumulating a frame error. This procedure overestimates the MLD performance yielding to a lower bound on ML detection (Matthé et al., 2018). The validity of this bound is a direct consequence of the fact that any composite hypothesis test cannot perform better than the corresponding perfect measurement test (Anastasopoulos, 2003). Algorithm 5 is used to evaluate the MLD frame error rate (FER) lower bound employing a GAD. The GAD description encloses this section and finally allows to evaluate the aforementioned concepts, involving the estimation and detection techniques.

Simulation results
In order to demonstrate the properties of the investigated estimators and detectors, we propose three numerical examples. The first one seeks to evaluate and compare the resulting MSE on the estimation of four different RVs, each one considering a specific scenario, chosen to allow an easy visualization of its characteristic biasedness. The second example focus on the uncoded and noniterative detection task considering a hypothetical SM-MIMO system, where a Monte Carlo simulation is conducted for the BER performance evaluation. The complexity comparison is performed, in both examples, analyzing the involved computational effort as a function of its respective cost sensitive Frontiers in Communications and Networks frontiersin.org parameters. The third example evaluates the iterative MMSE-PIC and compare its FER performance w.r.t. an overestimated and hypothetical SISO MLD with the help of a GAD.

Numerical example of linear estimation
Consider a linear system as in (8) with m = 4 and n = 4, where each element of the observable random vector y is y i n j 1 H i,j x j + w i for i = 1, . . ., m.
Here, x j was defined in order to position each RV on different quadrants of a complex plane to facilitate its visual identification. Moreover, the linear estimation applied to a finite set of discrete RV allows to analyze the biasedness aspect of the different estimators. The co-variance of the random vectors x and w are C xx = I n and C ww diag(σ 2 w ), respectively. The proposed transformation matrix H = diag ([1, 0.5, 1, 0.5]) was built in order to allow the investigation of the MSE in some specific cases, varying from unitary transform coefficient and lower noise energy to low scaling factor with high noise energy. Table 3 outlines the proposed cases of study. Figures 3-5 shown the intensity charts of the relative frequencies for each estimates in the complex plane. High incidence values are marked in red, while low occurrence values are plotted in blue. The × markers indicate possible values assumed by the discrete RVs x j .In the top right quadrant, we have the intensity color plot for the RV x 1 , corresponding to the proposed case 1 from Table 3 and, in counterclockwise direction, the remaining quadrants illustrate the resulting estimation of x 2 , x 3 and x 4 , respectively, related to cases 2, 3 and 4, in this order. The plot from Figure 3 reproduces the results for the LMMSE estimator and are equivalently obtained when employing STPD or MAP estimators. Figure 4 and Figure 5 show the relative frequencies obtained by the LMS estimation and the CWCU-LMMSE estimator, respectively.
For the case 1, where the scaling factor equals one and the SNR is high, all estimators perform equally in terms of the MSE. This result is also observed for cases 2 (low scaling factor and high SNR) and case 3 (high scaling factor and low SNR). For case 4, with low scaling factor and low SNR, the MSE increases in relation to the other cases. In this situation, one can observe that the LMMSE, STPD and MAP estimators, whose restriction relies only on the linearity constraint, achieve the smallest MSE. The exception is the LMS estimator, which shows an poorer performance, mainly because of its built-in approximations. The CWCU-LMMSE estimator cannot outperform the LMMSE estimator in a MSE sense since it has additional constraints. However, the CWCU-LMMSE estimator features its inherent conditional unbiased property, which is evidenced in study case 4, whose result can be visualized at Figure 5. The CWCU-LMMSE has its estimates centered around the true RV events, since this estimator holds the established constraints. In contrast, the LMMSE based estimators introduce a small bias towards the prior mean, μ xj in order to avoid noise enhancement and attains the MMSE. Table 4 summarizes the most relevant results for the performance analysis of the estimators considered in this review. Columns 2, 3 and 4 from Table 4 present the diagonal components of the coefficient matrix A, the elements of the offset vector b and the diagonal elements of the resulting MSE matrix E, respectively. Each element corresponds, from left to right, to the proposed cases 1 to 4, in this order. The last column shows the required computational cost in FLOPs on estimatingx for each method, considering all involved operations, from the computation of A and b tox itself. It is worth to highlight that, since H is a square diagonal matrix, the overall matrix algebra complexity is reduced, except for the LMS algorithm, where, as stated in Section 4.3, it does not take advantage from a diagonal transformation matrix. Furthermore, the presented complexities for the STPD and the LMS algorithms relate to only one iteration. Although, in this specific example, the LMS algorithm exhibits to be the most costly procedure, when considering nondiagonal transformation matrices, this algorithm requires no more FLOPs than the other presented methods. Figure 6 illustrates the convergence behavior for the iterative STPD and LMS methods in terms of the instantaneous squared error estimation and, as expected, it decays along successive iterations. In this example, the error decreases rapidly in the first iterations, and few more are necessary to reach a stable convergence state. Figure 7 represents the amount of FLOPs as a function of the dimensions of the m × n non-diagonal transformation matrix. The STPD and the LMS iterative methods are the less costly algorithms, since no matrix inversion is involved and their complexity accounts only the amount of FLOPs per iteration. However, in practical applications, these algorithms requires no more than few iterations to converge, as exemplified by Figure 6. The LMMSE and the MAP estimators are equivalent and show intermediary complexity, while the CWCU-LMMSE is the most complex algorithm because it employs an additional weighting operation.
In Figure 8, one can visualize the MSE of the linear estimators against the BCRB. Parameter α is the noise energy and parameter β is the transformation matrix gain, where C ww = αI n and H = βI n , for n = 4. The LMMSE and the MAP estimators attain the BCRB and are considered optimal in a Bayesian sense. The iterative STPD and LMS methods are, at least in theory, lower bounded by the LMMSE performance once they depend on further convergence aspects of the algorithms. As expected, the CWCU-LMMSE does not attain the BCRB and it shows low performance in terms of MSE, especially in cases with intense noise enhancement (high α and low β).

Numerical example of non-iterative and uncoded detection in a digital communication system
In this subsection, an n × m SM-MIMO digital communication system based on the model presented in Section 2 is considered. We adopted n = m = 4 antennas for both the transmitter and receiver, such that n different data streams are transmitted simultaneously. Furthermore, no channel coding neither iterative detection are used, these are the reasons why we excluded the MMSE-PIC from this analysis. The system uses a 16-QAM to map bits into symbols, which are transmitted through a time-varying and frequency selective channel employing an orthogonal multicarrier scheme, i.e., CP Frontiers in Communications and Networks frontiersin.org protected OFDM. Assuming a symbol length with K = 64 samples, a CP with N CP = 16 samples, which is larger than the maximum channel delay profile. In this case, the channel coherence bandwidth is wider than the bandwidth of a subcarrier and the channel frequency response can be considered to be a flat Rayleigh channel per subcarrier. We also assume perfect symbol time and carrier frequency synchronization and that the CSIR is available. The simulation parameters are given in Table 5. Figure 9 illustrates some selected energy efficiency Monte Carlo simulation results. It relates the BER versus the E b /N 0 ratio for the main presented estimation and detection techniques. The so called MMSE process performs a LMMSE equalization over the received and demodulated symbols, which can be applied in this case, since the system model is linear. The denominated CWCU-LMMSE further employs a diagonal weighting on the LMMSE equalization matrix, as demonstrated in Section 4. Then, right after equalization, ML detection is used to find the most probable transmitted sequence. According to (63), an equalization matrix is required to estimate the SNR in the MMSE-OSIC described in Algorithm 2. In this example, we used the aforementioned equalization matrices, the LMMSE and its weighted CWCU-LMMSE version. Besides these detectors, we present the results for the SD, described by Algorithm 3 The MLD, described by Algorithm 1, has also been implemented as a benchmark for the techniques considered in this paper.
The detectors exploiting the diagonal weighting, namely CWCU-LMMSE and CWCU-LMMSE-OSIC, show a slight improvement in the BER performance when compared with the corresponding MMSE and MMSE-OSIC, in this order. This small improvement is a result of a better fitting of the CWCU-LMMSE equalized signal on the constellation grid prior to detection. This characteristic holds for the Rayleigh flat-fading channel, where the symbols after CWCU LMMSE equalization remains unbiased, while the symbols equalized by the LMMSE tends to introduce a small bias towards the expected value of the discrete RV set, which, for a symmetric equiprobable QAM constellation, is zero. On the other hand, for AWGN channel, both equalizers performs equally in terms of BER (Lang and Huemer, 2015). The SD achieves a performance that is equivalent to the one observed for the MLD and overperforming all previous detectors.
In Figure 10, we analyze the SD complexity for the proposed example in terms of the average number of visited nodes at each layer. We notice that this parameters is not so dependent on the SNR as the average amount of visited nodes slight decays with the E b /N 0 . Furthermore, bottom layers are more commonly visited once the tree search structure exponentially expands towards the underneath layers. From the average number of visited nodes and (72), we obtain the average complexity in FLOPs. This is an important parameter once it allows to compare the upper bound of the SD complexity, given by (73), its average computational cost, and the MLD complexity given by (60). This behavior can be seen graphically in Figure 11, that brings the complexity growth, in log scale, of the presented detection methods, in terms of FLOPs counting as a function of the constellation size M, while assuming n = m. Among the presented detectors, the MMSE-OSIC has the lowest and restrained computational cost once that (66) follows a cubic expansion rate with the number of receive antennas. This behavior can be extended for the iterative MMSE-PIC considering the linear dependency on the number of iterations. We can also infer, as described in (73) and (60), an exponential growth in the worst case scenario for the SD and for the MLD, respectively. The average complexity of SD for the proposed example is also pointed in the graph, which confirms that, in average, the SD achieves the MLD performance at an smaller complexity (approximately 4 O osic in this example), although it can still reach exponential computational cost.
It is easy to see that both SD and MLD algorithms, although optimal in terms of BER performance compared with the OSIC approach, exhibits prohibitive complexity when the modulation order or the number of antennas increases, which is the case of high order communication systems, such as massive MIMO (Albreem et al., 2019). In these cases, different solutions, in general some slight variances of the presented methods, as those already cited in Section 1, or even new proposals that might emerge from that, should also be considered.

Numerical example of iterative detection in a digital communication system
This example focus on the iterative MMSE-PIC detector, described in Section 5.4, employing a convolutional code aiming error correction capability. In order to provide an optimal reference for energy efficient comparison, we consider the GAD to emulate a SISO MLD, resulting in a hypothetical FER lower bound. We also present the results for the non-iterative detectors MMSE-OSIC and SD, both employing intra-frame interleaving, hard demapping and hard channel decoding. The simulation employs the same parameters used in the uncoded example with supplementary information given by Table 6. Figure 12 illustrates the BER behavior for the same channel coding parameters, accenting the difference between noniterative hard detectors and an iterative soft approach. The MMSE-OSIC poorly performs as a result of error propagation and single iteration execution. The conventional SD, although able to close achieves the ML criteria, are also still far away from the iterative MMSE-PIC once its detection is based only on the constellation constraint. It is worth to mention the availability of SISO-SD in literature (Studer and Bölcskei, 2010;Witte et al., 2010) that approaches the optimal performance in SM-MIMO applications. In this work, we replace this intricate detector by the so called genie detection.
In Figure 13, we analyze the FER performance w.r.t. a reference lower bound given by the GAD, based on the detected codeword from the MMSE-PIC. Whenever a bit error is detected in a codeword, the simulation accounts for a frame error. Once again, both non-iterative MMSE-OSIC and SD are positioned on the right of the iterative MMSE-PIC, this last one exhibiting a close to optimal result. This demonstrates that, among all detectors analyzed, the MMSE-PIC shows a prominent performance not only in terms of energy efficiency but also in complexity meaning, comparable to the cubic order of the MMSE-OSIC.

Conclusion
In this review, linear estimators based on the classical LMMSE was discussed, including its intricate derivations, Frontiers in Communications and Networks frontiersin.org generally omitted even in textbooks. These detailed descriptions might be helpful on proposing new estimators or low complexity approximations. Regarding on complexity, a substantiated computing method is employed, based on FLOP counting for diverse involved operations, which can be a useful tool on the evaluation and comparison of different procedures. Moreover, the proposition of a BCRB permits to use an absolute reference for the MSE analysis. Although the CWCU-LMMSE MSE diverges from the BCRB in cases with potential for noise enhancement, this result does not reflect in the BER, being attributed to its unbiasedness characteristic, yielding to a better constellation grid fitting prior to detection. Note that, in similar situations, the LMMSE would also require a re-scaling in order to fit the constellation grid. In the case of the CWCU-LMMSE, this process is inherent. With respect to detectors, those visited in this work were described through easy implementable algorithms and analysed in terms of its energy efficiency and computing cost. Although both MLD and SD achieves optimal performance, its intrinsic exponential complexity implies harsh restrictions on practical MIMO applications, s.t. small constellation sizes and few simultaneous data streams, limited by the number of transmit and receive antennas. Although the MMSE-OSIC presents affordable complexity, its performance is far from optimal. In contrast, the iterative MMSE-PIC associates expressive results, being able to closely achieves optimal performance with a computational cost comparable to the MMSE-OSIC. The iterative MMSE-PIC poses as a feasible alternative to SM-MIMO, even for massive MIMO applications, especially considering recent researches that seeks to reduce the complexity on the system solution problem, employing suitable approximations to the matrix inversion (Park, 2022).
In summary, in order to achieve the challenging and contrasting requirements of future mobile communication, different radio access techniques, among them the SM-MIMO scheme, will be a fundamental tool. In this sense, the presented tutorial contributes with a common framework that allows a fair comparison among the settled studied solutions and provides an initial guideline for researchers that are looking for a general view of the main techniques available for SM-MIMO detection and estimation.
Future works on these topics might also embrace the use of iterative estimators, specially some mixture of the STPD or the LMS with the CWCU-LMMSE weighting diagonal, in order to attain unbiasedness and avoid costly matrix inversion while keeping a channel tracking mechanism, jointly with parallel interference cancellation methods applied on non-orthogonal waveforms in SM-MIMO applications, with potential to harvest diversity while achieving multiplexing gain at the same time. Moreover, artificial intelligence is a prominent alternative to replace the statistical-based solvers discussed here by more generalist algorithms.