Partial maximum correntropy regression for robust electrocorticography decoding

The Partial Least Square Regression (PLSR) method has shown admirable competence for predicting continuous variables from inter-correlated electrocorticography signals in brain-computer interfaces. However, PLSR is essentially formulated with the least square criterion and is thus considerably prone to performance deterioration caused by noise in brain recordings. To address this problem, this study proposes a new robust variant of PLSR. To this end, the maximum correntropy criterion (MCC) is utilized to develop a robust implementation of PLSR, called Partial Maximum Correntropy Regression (PMCR). Half-quadratic optimization is utilized to calculate the robust projectors for dimensionality reduction, and the regression coefficients are optimized by a fixed-point method. The proposed PMCR is evaluated on a synthetic example and a public electrocorticography dataset under three performance indicators. On the synthetic example, PMCR realized better prediction results than the existing methods, and could also extract valid information with a limited number of decomposition factors in a noisy regression scenario. On the electrocorticography dataset, PMCR achieved superior decoding performance in most cases, and also realized the minimal deterioration of the neurophysiological patterns under the interference of noises. The experimental results demonstrate that the proposed PMCR outperforms the existing methods in a noisy, inter-correlated, and high-dimensional decoding task. PMCR alleviates the performance degradation caused by adverse noises and improves the robustness of electrocorticography decoding for brain-computer interfaces.


I. INTRODUCTION
Brain-computer interface (BCI) has been conceived as a promising technology that translates cerebral recordings generated by cortical neurons into appropriate commands for controlling neuroprosthetic devices [1]. The capability of BCI for repairing or reproducing sensory-motor functions has been increasingly intensified by recent scientific and technological advances [2]-[4]. Noninvasive recordings, especially the electroencephalogram (EEG) and magnetoencephalogram (MEG), are widely exploited to construct BCI systems due to their ease of use and satisfactory temporal resolution, whereas noninvasive BCI systems can be limited in their capabilities and customarily require considerable training [5]. Invasive single-unit activities and local field potentials usually provide better decoding performance, which, however, suffer from pessimistic long-term stability due to capriciousness in the recorded neuronal ensembles [6]. A sophisticated alternative that exhibits higher signal amplitudes than EEG while presenting superior long-term stability compared to invasive modalities is the semi-invasive electrocorticography (ECoG) [7]. Many studies in recent years have investigated the potential of ECoG recordings for decoding motions [8]-[14]. The serviceability of ECoG signals for online practice was also demonstrated in [9], [15], [16].
The partial least square regression (PLSR), which has been widely utilized in ECoG decoding tasks, is particularly suited for inter-correlated and high-dimensional problems, where the number of variables is larger than that of the observations. For instance, in [13], the authors successfully predicted the three-dimensional continuous hand trajectories of two monkeys during asynchronous food-reaching tasks from the time-frequency representations of subdural ECoG signals with the PLSR algorithm. They further demonstrated the admirable prediction capability of PLSR in an epidural ECoG study [14]. Although PLSR was originally developed for econometrics and chemometrics [17], it has emerged as a prevailing approach for neuroimaging and decoding [18].
The PLSR algorithm answers regression problems primarily with dimensionality reduction techniques on both the independent and dependent variables, in which the dimensionality-reduced samples (usually called latent variables) of the respective sets exhibit maximal correlation, thus structuring the association from independent to dependent variables. Multi-way PLSR, NPLSR in short, was proposed as a generalization for tensor variables [19]. Recently, further studies have been conducted to improve the performance of PLSR, most of which proposed to calculate the projectors or the regression coefficients with an additive regularization term [20]-[23]. Nevertheless, PLSR and its regularized variants essentially employ the least square criterion, which assigns superfluous importance to deviated noises and, as a result, leads to poor robustness.
In the present study, we aim to propose a new robust version of PLSR by introducing the maximum correntropy criterion (MCC) to replace the non-robust least square criterion. MCC was proposed in information theoretic learning (ITL) [24], and has yielded state-of-the-art robust approaches in different tasks, including regression [25]-[27], classification [28], [29], principal component analysis [30], and feature extraction [31].
Recently, a rudimentary implementation of MCC in the PLSR algorithm was investigated in [32], in which the authors proposed to employ MCC in the process of dimensionality reduction. However, the algorithm proposed in that study could be limited in some respects. First, except for the MCC-based dimensionality reduction, it still computes the regression coefficients under the least square criterion. Second, it only executes the dimensionality reduction on the explanatory matrix. Thus, one has to estimate the regression coefficients separately for each dimension of the output matrix, which means it could be inadequate for multivariate responses.
By comparison, in the present study, we pursue a more comprehensive implementation of the MCC framework in PLSR. The main contributions of this study are summarized as follows.
1) We reformulate PLSR thoroughly with the MCC framework, so that not only the dimensionality reduction but also the regression relations between the different variables are established under MCC.
2) Both the explanatory matrix and the response matrix are processed with dimensionality reduction. As a result, the proposed algorithm is adequate for multivariate response prediction.
3) We use Gaussian kernel functions with individual kernel bandwidths for the different reconstruction errors and prediction errors. Furthermore, each kernel bandwidth can be calculated from the corresponding set of errors.

The remainder of this paper is organized as follows. Section II introduces the mathematical derivation of the conventional PLSR algorithm. In Section III, we present a brief introduction to MCC and reformulate PLSR with MCC. In Section IV, we evaluate the proposed method on synthetic and real ECoG datasets, respectively. We discuss the proposed method in Section V. Finally, the conclusion of this paper is given in Section VI.

II. PARTIAL LEAST SQUARE REGRESSION
Consider a data set with the explanatory matrix $X \in \mathbb{R}^{L \times N}$ and the response matrix $Y \in \mathbb{R}^{L \times M}$, in which $N$ and $M$ denote the respective dimensionalities, and $L$ is the number of observations. PLSR is an iterative regression algorithm that executes dimensionality reduction on the explanatory and response matrices, such that the resultant latent variables in each iteration exhibit maximal covariance.
In the first iteration, the original matrices are employed as the current residual matrices, i.e., $X_1 = X$ and $Y_1 = Y$. PLSR calculates two projectors $w_1 \in \mathbb{R}^N$ and $c_1 \in \mathbb{R}^M$ to acquire the corresponding latent variables, denoted as $t_1 = X_1 w_1$ and $u_1 = Y_1 c_1$, by maximizing the covariance

$$\max_{w_1, c_1} \; t_1^T u_1 = w_1^T X_1^T Y_1 c_1, \quad \text{s.t. } \|w_1\|_2 = \|c_1\|_2 = 1, \tag{1}$$

where $T$ means transpose and $\|\cdot\|_2$ denotes the $L_2$-norm. One solves the aforesaid problem by singular value decomposition (SVD) on $X_1^T Y_1$. Then, one computes the loading vector $p_1$ by the least square criterion as

$$p_1 = \frac{X_1^T t_1}{t_1^T t_1}, \tag{2}$$

thus organizing the regression from $t_1$ to $X_1$. PLSR furthermore supposes a linear association from $t_1$ to $u_1$ by calculating a regression scalar $b_1$ with the least square criterion as

$$b_1 = \frac{u_1^T t_1}{t_1^T t_1}. \tag{3}$$

The residual matrices for the next iteration are updated by

$$X_2 = X_1 - t_1 p_1^T, \qquad Y_2 = Y_1 - b_1 t_1 c_1^T. \tag{4}$$

Such procedures are repeated by PLSR for the optimal number of factors $S$, which is usually selected by cross-validation. One then collects the outcomes of each iteration,

$$T = [t_1, \ldots, t_S], \quad P = [p_1, \ldots, p_S], \quad B = \mathrm{diag}(b_1, \ldots, b_S), \quad C = [c_1, \ldots, c_S].$$

As a result, one rewrites the decomposition of $X$ and the predicted response $\hat{Y}$ as

$$X \approx T P^T, \qquad \hat{Y} = T B C^T. \tag{5}$$

Thus, the prediction relationship from $X$ to $\hat{Y}$ is structured as

$$\hat{Y} = X H, \tag{6}$$

in which $H = P^{T+} B C^T \in \mathbb{R}^{N \times M}$, and $P^{T+}$ is the pseudo-inverse of $P^T$. One notes that the PLSR algorithm utilizes the least square criterion not only in (2)(3) to build the regression relationships, but also in (1) for the calculation of the projectors. Maximizing the covariance in (1) could be rewritten as [33]

$$\min_{w, c} \sum_{l=1}^{L} \left( \|x_l - x_l w w^T\|_2^2 + \|y_l - y_l c c^T\|_2^2 + (x_l w - y_l c)^2 \right), \quad \text{s.t. } w^T w = c^T c = 1, \tag{7}$$

in which $x_l$ and $y_l$ denote the $l$-th observations (rows) of $X$ and $Y$, respectively. Note that the subscripts for the number of factors are omitted for simplicity. The connotation of (7) could be interpreted as follows. The first and second terms in the summation denote the quadratic reconstruction errors for the input and output, respectively; for example, $x_l w$ indicates the dimensionality-reduced input of the $l$-th observation, and $x_l w w^T$ denotes its reconstruction. The third term is the prediction error between the latent variables. PLSR obtains the two projectors by minimizing the sum of three quadratic errors, which is non-robust with respect to noises.
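As a concrete reference for the iteration above, the following is a minimal NumPy sketch of PLSR with $S$ factors (a hypothetical helper, not the authors' implementation):

```python
import numpy as np

def plsr_fit(X, Y, S):
    """Minimal PLSR sketch. X: (L, N) explanatory matrix; Y: (L, M) response
    matrix; S: number of factors. Returns H of shape (N, M) such that
    Y_hat = X @ H, as in Eq. (6)."""
    Xs, Ys = X.astype(float).copy(), Y.astype(float).copy()
    P, C, b = [], [], []
    for _ in range(S):
        # Projectors from the leading singular pair of Xs^T Ys (Eq. (1)).
        U, _, Vt = np.linalg.svd(Xs.T @ Ys, full_matrices=False)
        w, c = U[:, 0], Vt[0, :]
        t, u = Xs @ w, Ys @ c
        tt = t @ t
        p = Xs.T @ t / tt            # loading vector, Eq. (2)
        bs = (u @ t) / tt            # regression scalar, Eq. (3)
        Xs -= np.outer(t, p)         # residual updates, Eq. (4)
        Ys -= bs * np.outer(t, c)
        P.append(p); C.append(c); b.append(bs)
    P, C, B = np.stack(P, axis=1), np.stack(C, axis=1), np.diag(b)
    return np.linalg.pinv(P.T) @ B @ C.T   # H = P^{T+} B C^T, Eq. (6)
```

When the number of factors equals the rank of $X$, the PLSR fit coincides with ordinary least squares, so a noiseless linear relation is fitted exactly.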

III. PARTIAL MAXIMUM CORRENTROPY REGRESSION

A. Maximum Correntropy Criterion
The correntropy concept was developed in the field of ITL as a generalized correlation function of random processes [34], which measures the similarity and interaction between vectors in a kernel space. Correntropy is associated with the information potential of quadratic Renyi's entropy [25], in which the probability density function (PDF) of the data is estimated by the Parzen window approach [35], [36]. The correntropy between two arbitrary variables $A$ and $B$, denoted by $V(A, B)$ in this paper, is

$$V(A, B) = E[k(A, B)], \tag{8}$$

where $k(\cdot, \cdot)$ is a kernel function satisfying Mercer's theory [37] and $E[\cdot]$ signifies the expectation operator. In practical applications, one calculates the correntropy with $L$ observations by the following empirical estimation

$$\hat{V}(A, B) = \frac{1}{L} \sum_{l=1}^{L} k(a_l, b_l), \tag{9}$$

where the Gaussian kernel function $g_\sigma(x) = \exp(-x^2 / 2\sigma^2)$ with kernel bandwidth $\sigma$ is widely used for $k(\cdot, \cdot)$, thus leading to

$$\hat{V}(A, B) = \frac{1}{L} \sum_{l=1}^{L} g_\sigma(a_l - b_l). \tag{10}$$

Maximizing the correntropy in (10), called the maximum correntropy criterion (MCC), exhibits numerous advantages. Correntropy is essentially a local similarity measure, whose value is chiefly determined along $A = B$, i.e., around zero error. As a result, the effect of large errors caused by adverse noises is alleviated, leading to better robustness. In addition, correntropy can extract sufficient information from observations, since it considers all the even moments of the errors. Moreover, MCC closely relates to M-estimation and can be regarded as a robust formulation of the Welsch M-estimator [25], [38].
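The empirical estimator (10) is a one-liner; the sketch below (an illustrative helper, not from the paper) also makes the locality property concrete: a single arbitrarily large outlier changes the estimate by at most $1/L$.

```python
import numpy as np

def correntropy(a, b, sigma):
    """Empirical correntropy of Eq. (10): mean Gaussian kernel of the errors."""
    e = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.mean(np.exp(-e**2 / (2.0 * sigma**2))))
```

With $L = 100$ identical pairs the correntropy is exactly 1; corrupting one entry with a huge outlier lowers it only to $0.99$, whereas the mean squared error explodes.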

B. Partial Maximum Correntropy Regression
Substituting the three least-square terms in the conventional PLSR objective (7) with the maximum correntropy yields

$$\max_{w, c} \sum_{l=1}^{L} \left[ g_{\sigma_x}(x_l - x_l w w^T) + g_{\sigma_y}(y_l - y_l c c^T) + g_{\sigma_r}(x_l w - y_l c) \right], \tag{11}$$

where $\sigma_x$, $\sigma_y$, and $\sigma_r$ denote the Gaussian kernel bandwidths for the $X$-reconstruction errors, the $Y$-reconstruction errors, and the prediction errors, respectively. Then, one transforms the vectors $(x_l - x_l w w^T)$ and $(y_l - y_l c c^T)$ into scalars by taking their $L_2$-norms, provided that the two projectors $w$ and $c$ are unit-length vectors, i.e., $w^T w = c^T c = 1$:

$$\|x_l - x_l w w^T\|_2, \qquad \|y_l - y_l c c^T\|_2. \tag{12}$$

Subsequently, one obtains the following optimization problem to acquire the projectors

$$\max_{w, c} \sum_{l=1}^{L} \left[ g_{\sigma_x}(\|x_l - x_l w w^T\|_2) + g_{\sigma_y}(\|y_l - y_l c c^T\|_2) + g_{\sigma_r}(x_l w - y_l c) \right], \quad \text{s.t. } w^T w = c^T c = 1. \tag{13}$$

After obtaining $w$ and $c$, one calculates the latent variables as in the conventional PLSR by $t = Xw$ and $u = Yc$. We then compute the loading vector $p$ and the regression scalar $b$ with MCC by

$$\max_{p} \sum_{l=1}^{L} g_{\sigma_p}(\|x_l - t_l p^T\|_2), \tag{14}$$

$$\max_{b} \sum_{l=1}^{L} g_{\sigma_b}(u_l - b\, t_l), \tag{15}$$

in which $t_l$ and $u_l$ are the $l$-th elements of the latent variables $t$ and $u$, respectively, and $\sigma_p$ and $\sigma_b$ are the corresponding kernel bandwidths. The residual matrices are then updated in the same manner as in the conventional PLSR.
One repeats such procedures for the optimal number of factors and collects the acquired vectors from each iteration to organize the matrices $T$, $P$, $B$, and $C$, as in the original PLSR. Ultimately, the predicted response $\hat{Y}$ can be obtained from $X$ by the regression relationship (6). The above-mentioned PLSR variant, reformulated based on MCC, is called partial maximum correntropy regression (PMCR).
In what follows, we discuss the optimization, the convergence analysis, and the determination of the hyper-parameters for the proposed PMCR algorithm.
1) Optimization: Three optimization problems, (13), (14), and (15), need to be addressed in PMCR. The two regression problems (14) and (15) can be well solved by a fixed-point method proposed in [39]. We mainly consider the problem (13) for the projectors $w$ and $c$. Based on the half-quadratic (HQ) optimization [29], for the Gaussian kernel there exists a convex conjugated function $\varphi(\cdot)$ of $g(\cdot)$ such that

$$g_\sigma(e) = \max_{s < 0} \left( s\, e^2 - \varphi(s) \right), \tag{16}$$

where the maximum is reached at $s = -g_\sigma(e) / (2\sigma^2)$. By introducing three sets of auxiliaries $\{\alpha_l\}_{l=1}^{L}$, $\{\beta_l\}_{l=1}^{L}$, and $\{\gamma_l\}_{l=1}^{L}$ for the three kernel terms, (13) could be rewritten as

$$\max_{w, c, \{\alpha_l\}, \{\beta_l\}, \{\gamma_l\}} \sum_{l=1}^{L} \Big[ \alpha_l \|x_l - x_l w w^T\|_2^2 - \varphi(\alpha_l) + \beta_l \|y_l - y_l c c^T\|_2^2 - \varphi(\beta_l) + \gamma_l (x_l w - y_l c)^2 - \varphi(\gamma_l) \Big], \quad \text{s.t. } w^T w = c^T c = 1. \tag{17}$$

Hence, optimizing (13) equals updating $(\alpha_l, \beta_l, \gamma_l)$ and $(w, c)$ alternately. Since the HQ optimization is an iterative process, we denote the $k$-th HQ iteration with the superscript $k$. First, according to the HQ mechanism, we update the auxiliaries with the current projectors as

$$\alpha_l^{k} = -\frac{g_{\sigma_x}(\|x_l - x_l w^{k} (w^{k})^T\|_2)}{2\sigma_x^2}, \quad \beta_l^{k} = -\frac{g_{\sigma_y}(\|y_l - y_l c^{k} (c^{k})^T\|_2)}{2\sigma_y^2}, \quad \gamma_l^{k} = -\frac{g_{\sigma_r}(x_l w^{k} - y_l c^{k})}{2\sigma_r^2}. \tag{18}$$

Then, to optimize the projectors, we rewrite (17) by collecting the terms of the projectors and omitting the auxiliaries as

$$\max_{w, c} \; J_p = \sum_{l=1}^{L} \Big[ \alpha_l \|x_l - x_l w w^T\|_2^2 + \beta_l \|y_l - y_l c c^T\|_2^2 + \gamma_l (x_l w - y_l c)^2 \Big], \quad \text{s.t. } w^T w = c^T c = 1. \tag{19}$$

Since the auxiliaries are negative, maximizing $J_p$ amounts to a weighted least-squares problem. This exhibits a quadratic programming problem constrained by nonlinear equations, for which there exist numerous solutions in the literature, such as the sequential quadratic programming (SQP) that represents the state of the art in nonlinear programming methods [40]. The comprehensive procedures for PMCR are summarized in Algorithm 1.

2) Convergence Analysis: For the regression coefficients $p$ and $b$, one could find the detailed convergence analysis in [39].
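The auxiliaries-step admits a closed form: each auxiliary is proportional to the negative Gaussian kernel of the current error, so heavily deviated samples receive exponentially small influence. A hedged NumPy sketch (the function name is illustrative; the positive magnitudes $g(e)$ are returned as per-sample weights, with the $-1/(2\sigma^2)$ factor of (18) left implicit):

```python
import numpy as np

def hq_auxiliaries(X, Y, w, c, sx, sy, sr):
    """Per-sample HQ weight magnitudes for the current projectors (w, c),
    in the spirit of Eq. (18); outlying samples get weights close to zero."""
    g = lambda e, s: np.exp(-e**2 / (2.0 * s**2))
    ex = np.linalg.norm(X - np.outer(X @ w, w), axis=1)  # X-reconstruction errors
    ey = np.linalg.norm(Y - np.outer(Y @ c, c), axis=1)  # Y-reconstruction errors
    er = X @ w - Y @ c                                   # prediction errors
    return g(ex, sx), g(ey, sy), g(er, sr)
```

A gross outlier row in the explanatory matrix receives the smallest reconstruction weight, which is precisely the mechanism that makes the subsequent projectors-step robust.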

Algorithm 1 Partial Maximum Correntropy Regression
Here we mainly consider the convergence of the projectors $w$ and $c$ in the optimization problem (13). Because correntropy is in nature an M-estimator [25], the local optima of (13) are close to the global optimum, which is proved in a theoretical study [41]. Accordingly, in what follows, we show that (13) converges to a local optimum under the HQ optimization.
Proposition 1: Provided that each projectors-step of Algorithm 1 does not decrease the objective function $J_p$ in (19), the optimization problem with respect to the projectors in (13) converges to a local optimum.
Proof: Denote the objective function of (13) by $J(w, c)$ and the augmented objective function of (17) by $\hat{J}(w, c, \{\alpha_l, \beta_l, \gamma_l\})$. The convergence is proved as

$$J(w^{k+1}, c^{k+1}) = \hat{J}(w^{k+1}, c^{k+1}, \{\alpha_l^{k+1}, \beta_l^{k+1}, \gamma_l^{k+1}\}) \geq \hat{J}(w^{k+1}, c^{k+1}, \{\alpha_l^{k}, \beta_l^{k}, \gamma_l^{k}\}) \geq \hat{J}(w^{k}, c^{k}, \{\alpha_l^{k}, \beta_l^{k}, \gamma_l^{k}\}) = J(w^{k}, c^{k}),$$

where the first inequality is guaranteed by the HQ mechanism [29] (the auxiliaries updated by (18) maximize $\hat{J}$ for fixed projectors), and the second inequality arises from the assumption of the present proposition, since $\hat{J}$ differs from $J_p$ only by terms independent of the projectors. The objective $J$ is thus nondecreasing and bounded above, and hence converges. Note that, to guarantee the convergence of the projectors in (13), it is unnecessary to strictly attain the maximum of (19) at every projectors-step of Algorithm 1. On the contrary, as long as the updated projectors lead to a larger objective function $J_p$ at each projectors-step, problem (13) converges to a local optimum. This reveals great convenience in practice: one only needs a few SQP iterations, and can finish the projectors-step once an increase in $J_p$ is confirmed, thus accelerating the convergence.
3) Hyper-Parameter Determination: There exist five Gaussian kernel bandwidths, $\sigma_x$, $\sigma_y$, $\sigma_r$, $\sigma_p$, and $\sigma_b$, to be determined in practice. In the literature, an effective method to acquire a suitable kernel bandwidth for probability density estimation, known as Silverman's rule, was proposed in [35]. Denoting the current set of $L$ errors as $E$, the bandwidth is

$$\sigma = 0.9 \min\left(\sigma_E, \frac{R}{1.34}\right) L^{-1/5}, \tag{20}$$

in which $\sigma_E$ is the standard deviation of the $L$ errors, and $R$ denotes their interquartile range.
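A sketch of the rule (the 0.9 constant is the standard textbook value from density estimation; the constant actually used in the paper may differ):

```python
import numpy as np

def silverman_bandwidth(errors):
    """Silverman's rule of thumb for a Gaussian-kernel bandwidth, Eq. (20)."""
    e = np.asarray(errors, dtype=float)
    L = e.size
    sigma_E = e.std(ddof=1)                          # standard deviation of errors
    R = np.percentile(e, 75) - np.percentile(e, 25)  # interquartile range
    return 0.9 * min(sigma_E, R / 1.34) * L ** (-0.2)
```

For roughly Gaussian errors the two spread estimates agree, and the bandwidth shrinks slowly (as $L^{-1/5}$) with the number of errors.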

IV. EXPERIMENTS
In this section, we evaluated the proposed PMCR algorithm on a synthetic example and a real ECoG dataset, respectively, by comparing it to existing PLSR methods. Specifically, we compared PMCR to the conventional PLSR method. Next, we involved a regularized PLSR (RPLSR), proposed in a recent work [23], in the comparison. Furthermore, PMCR was compared to the rudimentary MCC-based variant proposed in [32], denoted by MCC-PLSR in what follows. For an evenhanded comparison, all the algorithms used an identical number of factors, selected for the conventional PLSR by five-fold cross-validation. The maximal number of factors was set to 100.
Considering the performance indicators for the evaluation, we used three typical measures in regression tasks: (i) Pearson's correlation coefficient ($r$); (ii) the root mean squared error,

$$\mathrm{RMSE} = \sqrt{\frac{1}{L} \sum_{l=1}^{L} (\hat{y}_l - y_l)^2},$$

in which $\hat{y}_l$ and $y_l$ are the $l$-th observations of the prediction $\hat{Y}$ and the target $Y$, respectively; and (iii) the mean absolute error,

$$\mathrm{MAE} = \frac{1}{L} \sum_{l=1}^{L} |\hat{y}_l - y_l|.$$

One notes that, to compare the robustness of different algorithms, contaminating only the training samples while isolating the testing data from contamination is an extensively approved and implemented method in the literature, as advised in [42]. Accordingly, only the training samples suffer the adverse contamination in the following experiments.
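For completeness, the two error measures in one short sketch:

```python
import numpy as np

def rmse(y_hat, y):
    """Root mean squared error between prediction and target."""
    d = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean(d**2)))

def mae(y_hat, y):
    """Mean absolute error between prediction and target."""
    d = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(np.abs(d)))
```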

A. Synthetic Dataset
First, we considered an inter-correlated, high-dimensional, and noisy synthetic example, in which the various PLSR methods were assessed under different levels of contamination. We randomly created 300 i.i.d. latent samples $t \sim U(0, 1)$ for training and another 300 for testing, in which $U$ denotes the uniform distribution, and the dimension of $t$ was set to 20. We then generated the mappings from the latent variables to the explanatory and response matrices. Specifically, we randomly generated the transformation matrices with entries drawn from the standard normal distribution. The latent variables $t$ were multiplied by a $20 \times 500$ transformation matrix, resulting in a $300 \times 500$ explanatory matrix for the input. Similarly, we used a $20 \times 3$ transformation matrix to acquire a $300 \times 3$ response matrix for the output. Hence, we predicted the multivariate responses from 500-dimensional explanatory variables with 300 training samples, and evaluated the prediction performance on the other 300 testing samples.
Considering the contamination for the synthetic dataset, we supposed the explanatory matrix to be contaminated, since the adverse noise mainly occurs in the brain recordings, which usually serve as the explanatory variables in a BCI system. Therefore, some training samples were randomly selected, and their inputs were replaced by noises with large amplitude. The noise level denotes the proportion of contaminated samples in the entirety, which was increased from 0 to 1.0 with a step of 0.05. For the distribution of the noise, we utilized zero-mean Gaussian distributions with large standard deviations to imitate outliers, where 30, 100, and 300 were employed, respectively. We evaluated the algorithms over 100 Monte-Carlo repetitive trials, and illustrate the results in Fig. 1. Note that the results were further averaged across the three dimensions of the output.
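The generation and contamination procedure above can be sketched as follows (the seed and the specific noise standard deviation are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 300 training and 300 testing latent samples, each 20-dimensional, i.i.d. U(0, 1).
t_train = rng.uniform(0.0, 1.0, size=(300, 20))
t_test = rng.uniform(0.0, 1.0, size=(300, 20))

# Random transformation matrices with standard-normal entries.
A = rng.standard_normal((20, 500))   # latent -> 500-dim explanatory variables
Bm = rng.standard_normal((20, 3))    # latent -> 3-dim response variables
X_train, Y_train = t_train @ A, t_train @ Bm
X_test, Y_test = t_test @ A, t_test @ Bm

# Contaminate a fraction (the noise level) of the training inputs with
# large-amplitude zero-mean Gaussian noise (std = 100 here).
noise_level, noise_std = 0.5, 100.0
idx = rng.choice(300, size=int(noise_level * 300), replace=False)
X_train[idx] = rng.normal(0.0, noise_std, size=(idx.size, 500))
```

Only the training inputs are contaminated; the testing set remains clean, following the evaluation protocol described above.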
One observes from Fig. 1 that, for all three noise distributions, the proposed PMCR consistently achieved superior prediction performance compared to the other existing PLSR algorithms in terms of $r$, RMSE, and MAE, in particular when the training set suffered considerable contamination.

B. ECoG Dataset
To further demonstrate the superior robustness of the PMCR algorithm, we evaluated the various PLSR algorithms on the following practical brain decoding task. In this subsection, we used the publicly available Neurotycho ECoG dataset, which was initially presented in [14].
1) Dataset Description: Two Japanese macaques, denoted by Monkey B and Monkey C, respectively, were trained to capture food with their right hands, during which the continuous three-dimensional hand positions were recorded by an optical motion capture system at a sampling rate of 120 Hz. For both Monkey B and Monkey C, ten recording sessions were performed, and each recording session lasted 15 minutes. Both macaques had been implanted in advance with customized 64-channel ECoG electrodes on the contralateral (left) hemisphere, covering the regions from the prefrontal cortex to the parietal cortex. ECoG signals were recorded simultaneously during each session at a sampling rate of 1,000 Hz. In accordance with [14], for each recording session, the data of the first ten minutes was used to train a prediction model, while the data of the remaining five minutes was used to evaluate the prediction performance of the trained model. The schematic drawings of the experiments and the ECoG electrodes are shown in Fig. 2 (a) and (b), respectively.
2) Decoding Paradigm: Considering feature extraction, we used an identical offline decoding paradigm as in [14]. Initially, the ECoG signals were preprocessed by a tenth-order Butterworth bandpass filter with cutoff frequencies from 1 to 400 Hz, and then re-referenced by common average referencing (CAR). The continuous three-dimensional trajectories of the right wrist were down-sampled to 10 Hz, leading to 9,000 samples in one session (10 Hz × 60 sec × 15 min). The three-dimensional position at time $t$ was predicted from the ECoG signals in the previous one second. To describe the features of the ECoG signals, we utilized the time-frequency representation method. For the position at time $t$, the ECoG signals at each electrode from $t - 1.1$ s to $t$ were processed by the Morlet wavelet transformation. Ten center frequencies ranging from 10 to 120 Hz with equal gaps on the logarithmic scale were considered for the wavelet transformation, covering the frequency bands that are most relevant to the motion tasks [14]. The resultant time-frequency scalogram was then resampled at ten temporal lags with a 0.1 s gap ($t - 1$ s, $t - 0.9$ s, ..., $t - 0.1$ s). Therefore, the input variable of each sample was a 6,400-dimensional vector (64 channels × 10 frequencies × 10 temporal lags), and the output variable was the three-dimensional position of the right hand. As a result, we trained a regression model with 6,000 samples (the first ten minutes) to predict the three-dimensional output variable from the 6,400-dimensional input variable, and evaluated the algorithms with the other 3,000 testing samples (the remaining five minutes). The illustrative diagrams for ECoG decoding are summarized in Fig. 2 (c).
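A simplified, single-channel sketch of the time-frequency step (the Morlet implementation, its unit-energy normalization, and the `n_cycles` value are assumptions for illustration, not the paper's exact settings):

```python
import numpy as np

def morlet_power(sig, fs, freqs, n_cycles=6.0):
    """Power scalogram (len(freqs) x len(sig)) of one channel, computed by
    convolving with unit-energy complex Morlet wavelets."""
    out = np.empty((len(freqs), len(sig)))
    for i, f in enumerate(freqs):
        dur = n_cycles / f                       # wavelet support in seconds
        t = np.arange(-dur / 2, dur / 2, 1.0 / fs)
        sigma_t = n_cycles / (2.0 * np.pi * f)   # Gaussian envelope width
        wav = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
        wav /= np.sqrt(np.sum(np.abs(wav)**2))   # unit energy
        out[i] = np.abs(np.convolve(sig, wav, mode='same'))**2
    return out

# Ten center frequencies from 10 to 120 Hz, equally spaced on a log scale,
# as in the decoding paradigm above.
center_freqs = np.logspace(np.log10(10.0), np.log10(120.0), 10)
```

Resampling such a scalogram at ten temporal lags per channel, for all 64 channels, yields the 6,400-dimensional input vector described above.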
3) Contamination: To demonstrate the robustness of the different algorithms in the practical ECoG decoding task, the ECoG signals were artificially contaminated by outliers to simulate detrimental artifacts. To be specific, we stochastically selected a certain quantity of the training ECoG samplings and replaced them with outliers drawn from a zero-mean Gaussian distribution whose variance was 50 times that of the signals from the corresponding channel. As stated in [43], blink-related artifacts are frequently found in ECoG signals and exhibit much larger amplitudes than normal ECoG recordings. Hence, we used the above-mentioned approach to artificially generate adverse artifacts, so as to contaminate the ECoG signals. This method has been widely utilized in the literature to deteriorate brain recordings for evaluating the robustness of different algorithms [44], [45].
Note that, for this ECoG dataset, the 'noise level' signifies the ratio of the contaminated ECoG samplings in the entirety, which is different from the ratio of deteriorated samples among the 6,000 training samples. The proportion of affected training samples can be evidently larger than the indicated noise level, since one contaminated ECoG sampling can deteriorate several time windows in the feature extraction. For example, when the noise level is 10^{-3}, the deteriorated proportion of the training set is 0.6645 ± 0.0089. Furthermore, we illustrate how the noises influence the time-frequency features in Fig. 3. One can clearly perceive the heavy-tailed characteristic of the feature noise, which is particularly intractable for the least square criterion. In addition, the effects on the high-frequency bands are more prominent, due to the impulsive property of the noise.
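The quoted proportion can be sanity-checked: each feature window spans 1.1 s of 1 kHz ECoG, i.e., 1,100 samplings per channel, and a training sample is deteriorated if any sampling in its window was replaced. Assuming samplings are contaminated independently at rate 10^{-3} (a simplifying assumption about the sampling-selection process):

```python
# Probability that a feature window of n samplings contains at least one
# contaminated sampling, given a per-sampling contamination rate p.
p, n = 1e-3, 1100
affected = 1.0 - (1.0 - p) ** n   # ~0.667, close to the measured 0.6645
```

The computed value (about 0.667) agrees closely with the empirically measured 0.6645 ± 0.0089, supporting the explanation above.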
4) Spatio-spectro-temporal Pattern: Studying how the spatio-spectro-temporal weights in the regression model contribute to the entirety can help investigate the neurophysiological pattern. The element of the trained prediction model $H$ is denoted by $h_{ch,freq,temp}$, which corresponds to the ECoG electrode 'ch', the frequency 'freq', and the temporal lag 'temp'. Thus, one could compute the spatio-spectro-temporal contributions as the ratio between the summation of the absolute values over each domain and the summation of the absolute values of the entire model:

$$W_c(ch) = \frac{\sum_{freq, temp} |h_{ch,freq,temp}|}{\sum_{ch,freq,temp} |h_{ch,freq,temp}|}, \quad W_f(freq) = \frac{\sum_{ch, temp} |h_{ch,freq,temp}|}{\sum_{ch,freq,temp} |h_{ch,freq,temp}|}, \quad W_t(temp) = \frac{\sum_{ch, freq} |h_{ch,freq,temp}|}{\sum_{ch,freq,temp} |h_{ch,freq,temp}|},$$

where $W_c(ch)$, $W_f(freq)$, and $W_t(temp)$ denote the contributions of the ECoG electrode 'ch', the frequency 'freq', and the temporal lag 'temp', respectively.

5) Results: First, we assessed the different algorithms with the uncontaminated ECoG recordings. Hence, when the noise level was zero, the average performance indicators were obtained from the 20 intact sessions (Monkey B and C). Afterwards, we contaminated each session in 5 repetitive trials. Thus, for each noise level, the different algorithms were assessed 100 times (20 sessions × 5 repetitive trials). In TABLE I, we give the performance indicators for each algorithm under the noise levels 0, 10^{-3}, and 10^{-2}, respectively. One could observe that the proposed PMCR realized the optimal prediction capability consistently, except for the Y-axis under noise level 0. Further, one can note that, even when the noise level was zero, the proposed PMCR achieved better results than the other algorithms on the X-axis and Z-axis. The main reason is that motion-related artifacts have been considerably observed in these ECoG recordings [14], which further confirms the exceptional robustness of PMCR in practical brain decoding tasks.
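The contribution ratios defined above can be sketched as follows (the reshape order, channels × frequencies × temporal lags, is an assumption about the layout of the 6,400-dimensional feature vector):

```python
import numpy as np

def sst_contributions(H, n_ch=64, n_freq=10, n_temp=10):
    """Spatio-spectro-temporal contributions of a trained model H of shape
    (N, M): absolute weights summed over the output dimensions and over two
    of the three domains, normalized by the total."""
    A = np.abs(np.asarray(H)).sum(axis=1).reshape(n_ch, n_freq, n_temp)
    total = A.sum()
    W_c = A.sum(axis=(1, 2)) / total   # per electrode
    W_f = A.sum(axis=(0, 2)) / total   # per frequency
    W_t = A.sum(axis=(0, 1)) / total   # per temporal lag
    return W_c, W_f, W_t
```

By construction each of the three contribution vectors sums to one, so they can be compared directly across algorithms and noise levels.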
In addition, we studied how the neurophysiological patterns of the different algorithms were influenced by the sampling noises. We examined the changes between the spatio-spectro-temporal weights attained from the intact sessions and from the contaminated sessions, respectively. Fig. 4 illustrates how the neurophysiological patterns were influenced under the noise level 10^{-3} for each algorithm; the prediction model of Monkey B's Z-position was employed here. In Fig. 4 (a), for each algorithm we plot the spatial patterns attained from the intact and the contaminated sessions, respectively, and quantify how much the spatial pattern was affected by the summation of the absolute differences over all electrodes, i.e., $\sum_{ch} |W_c(ch) - \tilde{W}_c(ch)|$, where $W_c(ch)$ denotes the spatial weight from the intact sessions and $\tilde{W}_c(ch)$ is obtained from the contaminated sessions. In Fig. 4 (b), for the spectral pattern, we similarly show $W_f(freq)$ and $\tilde{W}_f(freq)$ for each algorithm, together with $\sum_{freq} |W_f(freq) - \tilde{W}_f(freq)|$. For the temporal pattern, we illustrate $W_t(temp)$ and $\tilde{W}_t(temp)$ for each algorithm in Fig. 4 (c), together with $\sum_{temp} |W_t(temp) - \tilde{W}_t(temp)|$. One sees that the proposed PMCR realized the minimum summed absolute difference between the original and the affected pattern in each domain, which demonstrates the robustness of PMCR in extracting accurate neurophysiological information.

V. DISCUSSION

A. Number of Factors
The number of factors $S$ plays a vital role in PLSR methods, representing the number of iterations used to decompose the input and output matrices. Since it usually has a notable effect on the results, we additionally assessed the performance with respect to the number of factors for each method. To this end, we utilized the synthetic dataset from Subsection IV-A with the noise standard deviation being 100, under three different noise levels, 0.2, 0.5, and 0.8, respectively. The resultant performance indicators for each algorithm with respect to the number of factors are shown in Fig. 5, averaged over 100 repetitive trials.
One perceives from Fig. 5 that the proposed PMCR not only eventually achieved superior regression performance with the optimal number of factors, but could also realize rather admirable performance with a small number of factors. Specifically, for example, when the noise level was 0.5, the proposed PMCR achieved its optimal performance with no more than 20 factors. By comparison, for the other algorithms, when the number of factors was larger than 20, their regression performance was still improving significantly. When the noise level was 0.8, PMCR revealed consistency for all performance indicators once the number of factors exceeded 40, whereas the other methods acquired their optimal performance with a larger number of factors. This suggests that PMCR can extract substantial information from the training samples with a rather small number of factors in a noisy regression task.

TABLE I: Regression performance indicators of different algorithms on the Neurotycho epidural ECoG dataset with three noise levels 0, 10^{-3}, and 10^{-2}, respectively. For the noise level 0, the results were averaged across the 20 intact sessions (Monkey B and C). For the other noise levels, the average results were acquired from 100 trials (20 sessions × 5 repetitive trials). The results are given as mean±deviation, and the optimal results under each condition are marked in bold. The proposed PMCR realized the optimal prediction performance consistently, except for the Y-position with the noise level being 0.

B. PMCR with Regularization
One should additionally note that PMCR was proposed by reformulating the conventional PLSR algorithm with the robust MCC instead of the non-robust least square criterion. Hence, the proposed PMCR exhibits supplementary potential for further performance improvements with regularization techniques, as in the existing regularized PLSR methods. The $L_1$-norm regularization could be utilized in (13) to encourage sparse and robust projectors. In addition, if one requires better smoothness on the output matrix, polynomial or Sobolev-norm penalization could be introduced into PMCR. In the literature, MCC-based algorithms with regularization techniques have been widely investigated. For instance, a robust implementation of the sparse representation classifier (SRC) for face recognition was developed by regularizing the MCC-based SRC loss with the $L_1$-norm [46].

C. Extension to Multi-Way Application
The multi-way PLSR establishes the regression relationship between tensor variables, with dimensionality reduction performed by tensor factorization techniques. In the literature, the multi-way PLSR is usually reported to achieve better decoding capability than the generic PLSR algorithm in brain decoding tasks, where the high-dimensional spatio-spectro-temporal features are organized in tensor form. Essentially, the multi-way PLSR decomposes the input and output tensors with the least square criterion by minimizing the Frobenius norm, which represents a generalization of the $L_2$-norm [47]. Hence, the multi-way PLSR is prone to performance deterioration caused by noises as well.
The proposed PMCR method treats the regression problem for matrix variables, i.e., two-way variables. Extending the PMCR algorithm to multi-way applications could probably further improve the prediction performance, which would be investigated in depth in our future works. Promisingly, MCC has been verified effective for tensor variable analysis in a recent study [48].

Fig. 5. The proposed PMCR algorithm not only acquired better prediction results than the other algorithms ultimately with the optimal number of factors, but also achieved admirable regression performance with a small number of factors.

VI. CONCLUSION
This paper proposed a new robust variant of the PLSR algorithm by replacing the non-robust least square criterion with the MCC framework. The proposed robust objective functions can be efficiently optimized by half-quadratic and fixed-point optimization methods. Extensive experimental results on a synthetic dataset and the Neurotycho epidural ECoG dataset demonstrated that the proposed PMCR can outperform the existing PLSR algorithms, showing promising robustness in high-dimensional and noisy brain decoding tasks.

Algorithm (fragment, steps 14-19):
14: compute latent variables t_s = X_s w_s and u_s = Y_s c_s;
15: compute p_s and b_s by the fixed-point method via Eqs. (14)-(15);
16: update the residual matrices X_{s+1} = X_s - t_s p_s^T and Y_{s+1} = Y_s - b_s t_s c_s^T;
17: end for
18: organize the matrices T = [t_1, ..., t_S], P = [p_1, ..., p_S], B = diag(b_1, ..., b_S), and C = [c_1, ..., c_S];
19: compute H = (P^T)^+ B C^T
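The data flow of this deflation loop can be sketched in numpy. This is only a structural sketch under stated assumptions: the projectors w_s, c_s and loadings p_s, b_s are obtained here by plain least-square/SVD stand-ins, whereas PMCR computes them with the half-quadratic and fixed-point MCC optimizations; the function name `deflation_sketch` is hypothetical.

```python
import numpy as np

def deflation_sketch(X, Y, S=3):
    """Structural sketch of the deflation loop (steps 14-19).

    w_s, c_s, p_s, b_s are least-square stand-ins; in PMCR they come from
    the half-quadratic and fixed-point MCC optimizations instead.
    """
    T, P, C, b = [], [], [], []
    Xs, Ys = X.copy(), Y.copy()
    for s in range(S):
        # stand-in projectors: dominant singular directions of the cross-product
        U, _, Vt = np.linalg.svd(Xs.T @ Ys, full_matrices=False)
        w_s, c_s = U[:, 0], Vt[0, :]
        t_s = Xs @ w_s                        # step 14: latent variables
        u_s = Ys @ c_s
        p_s = Xs.T @ t_s / (t_s @ t_s)        # step 15 (least-square stand-in)
        b_s = (u_s @ t_s) / (t_s @ t_s)
        Xs = Xs - np.outer(t_s, p_s)          # step 16: deflate the residuals
        Ys = Ys - b_s * np.outer(t_s, c_s)
        T.append(t_s); P.append(p_s); C.append(c_s); b.append(b_s)
    T, P, C = np.array(T).T, np.array(P).T, np.array(C).T  # step 18
    B = np.diag(b)
    H = np.linalg.pinv(P.T) @ B @ C.T         # step 19: regression coefficients
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Y = X @ rng.normal(size=(8, 3)) + 0.01 * rng.normal(size=(50, 3))
H = deflation_sketch(X, Y, S=3)
print(H.shape)  # (8, 3)
```

The resulting coefficient matrix H maps new inputs directly to predictions via X_new @ H, with the number of factors S controlling the trade-off examined in Fig. 5.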

Fig. 1. Average regression performance indicators on the inter-correlated, high-dimensional, and contaminated synthetic dataset under different noise standard deviations, with noise levels from 0 to 1.0. (a): noise standard deviation = 30; (b): noise standard deviation = 100; (c): noise standard deviation = 300. The performance indicators were obtained from 100 Monte-Carlo repetitive trials and averaged across the three dimensions of the output. The proposed PMCR algorithm consistently realized better performance than the existing PLSR algorithms for r, RMSE, and MAE, in particular when the training set was contaminated considerably.

Fig. 3. Distributions and scalograms of the resultant time-frequency noise caused by the ECoG sampling noises. (a): noise level = 10^-3 (deteriorated proportion of the training set = 0.6645 ± 0.0089); (b): noise level = 10^-2 (deteriorated proportion of the training set ≈ 1). The time-frequency noises were calculated by subtracting the training sets obtained from the acoustic and contaminated ECoG signals, respectively. The distributions are averaged across 20 sessions of Monkeys B and C, while the scalograms are averaged across electrodes. The peaks of the distributions are truncated to emphasize the heavy-tailed characteristic.

Fig. 4. Spatio-spectro-temporal contributions of the prediction model for Monkey B's Z-position under noise levels 0 and 10^-3. (a): spatial contributions W_c(ch) for each electrode. The upper row shows the contributions of the acoustic sessions and the bottom row presents those of the contaminated sessions. (b): spectral contributions for each frequency. (c): temporal contributions for each temporal lag. For both (b) and (c), the contributions of the acoustic sessions are shown in solid lines, while those of the contaminated sessions are shown in dashed lines. For each domain, the summation of the absolute differences between the original pattern and the deteriorated pattern is given for each algorithm on the right. PMCR achieved the minimum difference between the original and deteriorated patterns in each domain.

Fig. 5. Average regression performance indicators on the synthetic dataset with noise standard deviation 100 under three noise levels, with the number of factors increasing from 1 to 100. (a): noise level = 0.2; (b): noise level = 0.5; (c): noise level = 0.8. The performance indicators were obtained from 100 repetitive trials and averaged across the three dimensions of the output. The proposed PMCR algorithm not only acquired better prediction results than the other algorithms with the optimal number of factors, but also achieved admirable regression performance with a small number of factors.