Multi-Task Learning Based on Stochastic Configuration Networks

When the human brain learns multiple related or continuous tasks, it will produce knowledge sharing and transfer. Thus, fast and effective task learning can be realized. This idea leads to multi-task learning. The key of multi-task learning is to find the correlation between tasks and establish a fast and effective model based on these relationship information. This paper proposes a multi-task learning framework based on stochastic configuration networks. It organically combines the idea of the classical parameter sharing multi-task learning with that of constraint sharing configuration in stochastic configuration networks. Moreover, it provides an efficient multi-kernel function selection mechanism. The convergence of the proposed algorithm is proved theoretically. The experiment results on one simulation data set and four real life data sets verify the effectiveness of the proposed algorithm.


INTRODUCTION
In supervised machine learning, we often encounter situations that establishing models for several related tasks, such as searching cancer sites, identifying cancer types, judging cancer stages and so on, based on cancer image data. Generally, these tasks are undertaken separately, which we refer to as single-task supervised learning (STSL) in traditional machine learning (Ben-David and Schuller, 2003). These models do not consider the correlation among multiple tasks so some common information in model parameters or data features is lost. In particular, when the training sample size of a single task is insufficient, it is difficult for STSL to capture enough information, which results in poor generalization performance. Multi-task supervised learning (MTSL) provides a solution for such a situation. It improves the performance of each task by setting shared representations among related tasks (Baxter, 2000;Argyriou et al., 2007;Liu et al., 2017). In a sense, a very important reason why human beings can learn based on a small number of samples is that human beings can make full use of various senses to obtain enough information and synthesize relevant information. MTSL is one of the ways to realize this idea.
Classical MTSL can be roughly divided into two categories, namely, MTSL based on constraint sharing and MTSL based on parameter sharing. In relation to the first method, Argyriou et al. (2008) proposed the MTL-L 21 based on regularization strategies, that was achieved by adding a regularization term for all the tasks' objective function coefficients on the cost function. But this method performs poorly when data features have the problem of collinearity. To reduce the impact of this problem, Chen et al. (2012) added a quadratic regularization term for all the tasks' objective function coefficients based on MTL-L 21 . In 2015, Duong et al. (2015) used L 2 distance to regularize the parameters in their multi-task neural networks, so that each task has similar but different model parameters. In 2017, Yang and Hospedales (2017) used the trace norm to implement Duong's model. In 2019, Oliveira et al. (2019) attempted to conceive a group LASSO with asymmetric transference formulation in multi-task learning, looking for the best of both worlds in a framework that admits the overlap of groups. Since all of these MTSL methods need to learn sparse features, their performance is not ideal when the data has few features. The MTSL methods based on parameter sharing (Caruana, 1997;Jacob et al., 2009) are not affected by this problem. In 1997, Caruana (Caruana, 1997) proposed a MTSL method (MTL) based on backpropagation neural networks. He mirrored the correlation information by sharing the input and hidden layer neurons among different tasks. In 1998, Lecun et al. (1998) used convolutional neural networks, named as LeNet-5, for document recognition on the basis of MTL. Their results clearly demonstrated the advantages of training a recognizer at the word level, rather than training it on presegmented, hand-labeled, isolated characters. In 2018, Ma et al. (2018) proposed multi-gate mixture-of-experts (MMoE), which adapted the mixture-of-experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also had a gating network trained to optimize each task. In 2021, Zhang et al. (Zhang et al., 2021) developed a programming framework, AutoMTL, which generates compact multi-task models given an arbitrary input backbone convolutional neural network model and a set of tasks. However, these methods have high computational complexity and poor learning performance when the training samples are insufficient.
To address the aforementioned problems, this paper proposes a MTSL method based on a constraint sharing framework of stochastic configuration networks (SCNs) proposed by Wang et al. (Wang and Li, 2017a;Wang and Li, 2017b) Instead of the complex gradient descent method for solving the weight parameters of hidden layer nodes in general neural networks, SCNs use a supervision mechanism to stochastically configure these parameters. This stochastic configuration mechanism greatly reduces the computational complexity. Inspired by this idea, we establish a multi-task supervised learning algorithm based on stochastic configuration radial basis networks (MTSL-SCRBN). The main contributions of this study are as follows.
1. We combine constraint sharing of SCNs and parameter sharing of MTSL organically. The shared parameters are stochastically configured under certain constraint, which has low computational complexity. At the same time, to improve learning performance, the radial basis functions (Powell, 1987;Broomhead and Lowe, 1988) with different scale parameters are used as the basis functions to replace the original sigmoid functions of SCNs. 2. Two types of difficult to choice hyper parameters of the proposed model, the scale parameters and the centers of the radial basis functions, are stochastically configured during the learning process.
The rest of the paper is organized as follows. In Section 2, we briefly review MTL-L 21 , MTEN, MTL, and SCNs. Section 3 details our proposed algorithm MTSL-SCRBN and proves its convergence. The experimental results of these algorithms on one simulation data set and five real data sets are detailed in Section 4. Section 5 summarizes this paper.

RELATED WORK
Firstly, we introduce some notations. Suppose that there are M supervised learning tasks. The samples of the m-th task are given by, . ., N m , m = 1, . . ., M, and T means transpose transform.

Multi-Task Learning Methods Based on Constraint Sharing
Inspired by group sparsity, Argyrios et al. (Argyriou et al., 2008) proposed the MTL-L 21 method to learn the correlation among multiple tasks under a regularization strategy. It can be described as the following optimization problem, represents the m-th column of V, which is the coefficient vector of the m-th task. λ represents the regularization coefficient and V 2,1 For the inputx m of the m-th task, MTL-L 21 gives the predicted value f(x m ) x m T v mp MTL−L21 . When data features have the problem of collinearity, MTL-L 21 will have an unstable prediction performance. Xi Chen et al. (Chen et al., 2012) proposed the MTEN method by adding another quadratic regularization term for the objective function coefficients of all tasks on the basis of MTL-L 21 . It can be described as the following optimization problem, where ρ ∈ [0, 1] represents the elastic net mixing parameter. For the inputx m of the m-th task, MTEN gives the predicted value f(x m ) x m T v mp MTEN . In the case of insufficient data features and data size, the two algorithms MTL-L 21 and MTEN cannot obtain enough information by learning sparse features, which leads to poor prediction performance. Caruana (1997) implemented MTSL on backpropagation nets by sharing input and hidden layer neurons among different tasks. Essentially, this method optimizes the choice of function space by the correlation among tasks and obtains better internal weight parameters. Figure 1 shows the process of traditional backpropagation nets to deal with four related tasks. This method ignores the information among related tasks. Especially in the case of insufficient data samples, these models may have problems such as over-fitting. Figure 2 shows the multi-task backpropagation net (MTL) conceived by Caruana. In MTL, each task shares input and hidden layer neurons.

Multi-Task Learning Methods Based on Parameter Sharing
Compared with the data form given in Eq. 1, the data form suitable for MTL is that different tasks have the same input, and the output is, MTL can be described as follows where S is the number of hidden layer nodes, β m j is the external weight parameter of the m-th task in the j-th hidden layer node. g j ≔ g j (X) [g(x T 1 w j + b j ), . . . , g(x T N w j + b j )] T (here g represents the sigmoid function), w j [w j,1 , . . . , w j,d ] T and b j represent the internal weight parameter vector and the bias internal weight parameter shared by the backpropagation net. β ∈ R N×M , W ∈ R d×N , b ∈ R N are the corresponding parameter matrix. For the inputx m of the m-th task, MTL gives the predicted value f(x m ) From a mathematical point of view, the essence of the backpropagation net is the gradient descent algorithm. In single-task supervised learning, the backpropagation net may fall into local optimum. However, in multi-task supervised learning, the local optimum of different tasks is in different positions, and the interaction among tasks can help the hidden layer to escape from local optimums (Caruana, 1997).

Stochastic Configuration Networks
Wang and Li (Wang and Li, 2017a;Wang and Li, 2017b) proposed supervised stochastic configuration networks, and implemented SCNs using three algorithms SC-i (i=I,II,III). SCi starts with a small network structure (Tin-Yan Kwok and Dit-Yan Yeung, 1997), and uses a supervision mechanism to add hidden layer neurons until the model meets a predetermined error criterion. Since SC-III performs the best of the three algorithms, we next describe the implementation of the SC-III algorithm.  Suppose a SC-III with L − 1 hidden layer nodes has already been constructed, that is, ∈ R q represents the optimal external weight parameter of the j-th hidden layer node, w p j and b p j represent the optimal internal weight parameters of the jth hidden layer node.
For training data set , which is the residual error matrix of the (L − 1)-th hidden layer node. If e L−1 F does not meet the predetermined error criteria, SC-III needs to generate a new hidden layer node, that is, stochastically if min n (ξ L n ) ≥ 0, then w L , b L are considered to meet the condition, otherwise w L , b L need to be configured again. With the qualified internal weight parameters w L * and b L * , SC-III obtains the optimal external weight parameter vector by the following optimization problem, The leading model f L (x) L j 1 g(x T w j * + b j *)β L j will have an improved residual error. Repeat the above steps to add hidden layer nodes until the residual error meets the predetermined error criteria.

MULTI-TASK SUPERVISED LEARNING BASED ON STOCHASTIC CONFIGURATION RADIAL BASIS NETWORKS
In this section, we introduce the proposed MTSL-SCRBN algorithm.

Model Introduction
In order to combine SCNs and MTL organically, we need to change the data form given in Eq. 1. In MTSL-SCRBN, first, we require each task to have a same number of samples, namely, N 1 =. . .= N M =: N. (If the number of samples for each task is different, this requirement can be achieved by random sampling.) Then we merge the input data of different tasks into a new input data, that is, In order to obtain good learning performance, we use the following radial basis function k σ (x, x′) as model's basis function, where x is the input, x′ represents the center and σ is the scale parameter.
Suppose f 0 = [0, 0, . . ., 0] ∈ R M , for L = 2, 3, . . ., it is assumed that a MTSL-SCRBN with L − 1 hidden layer nodes has already been constructed as follows, ∈ R M represents the optimal external weight parameter vector of the m-th task in the j-th hidden layer node, x j * and σ j * are the optimal center and the optimal scale parameter of the radial basis function in the j-th hidden layer node, respectively. Different from the traditional learning of radial basis neural network, in our MTSL-SCRBN, the optimal centers and the optimal scale parameters at each step are randomly assigned by a shared supervision mechanism given in the following. This is simple to implement and easy to obtain a learning model with good performance.
Denote Here k m x L are considered to meet the condition, otherwise, σ L , x L need to be configured again. With the qualified parameters σ L * and x Lp , MTSL-SCRBN obtains the optimal external weight parameter vector by the following optimization problem, The leading model, Frontiers in Bioengineering and Biotechnology | www.frontiersin.org August 2022 | Volume 10 | Article 890132 will have an improved residual error. Repeat the above steps to add hidden layer nodes until the residual error meets the predetermined error criteria.
The above implementation process of the proposed MTSL-SCRBN algorithm is described as follows.
Algorithm 1. The MTSL-SCRBN algorithm Notice that in step 19, we calculate the parameter matrix based on the standard least squares method, where K † L is the Moore-Penrose generalized inverse (Lancaster and Tismenetsky, 1985) of K L . The setting of μ L,r in step 7 and the updating idea of r in step 14 can be referred to literature (Wang and Li, 2017a).

The Convergence Theorem of the MTSL-SCRBN Algorithm
We extend the method in (Wang and Li, 2017a) to the multi-task learning framework of this paper and prove the convergence of the proposed algorithm. Theorem 1. Assume that there are some p k ∈ R + , satisfying 0 < k m j 2 < p k . Given 0 < r < 1 and a non-negative real value sequence {μ L } with lim L→+∞ μ L = 0 and μ L ≤ (1 − r). For L = 2, 3 . . ., denoted by If the basis function k m L is generated to satisfy the following inequality, and the external weight parameter vector is evaluated by, F } is monotonically decreasing and convergent. Hence, we have, Then, the following inequality holds, Since lim L→+∞ μ L = 0, and 0 < r < 1, we have lim L→+∞ e L 2 F 0, and lim L→+∞ e L F = 0. Remark 1. Unlike SC-III, we relax the condition for the configuration parameters in the formula (p). SC-III requires each task to meet the inequality conditions, but MTSL-SCRBN only requires the sum of all tasks to satisfy the inequality condition. The rationality of this condition will also be verified in the experiment results of next section.

EXPERIMENT RESULTS
In order to show the effectiveness of the proposed algorithm, this section uses the classical STSL algorithms SVM (Cortes and Vapnik, 1995), SC-III (Wang and Li, 2017a) and seven MTSL algorithms MTSL-SCRBN, MTL (Caruana, 1997), MTEN (Chen et al., 2012), DMTRL (Yang and Hospedales, 2017), MMoE (Ma et al., 2018), GAMTL (Oliveira et al., 2019), AUTOMTL (Zhang et al., 2021) to perform comparative experiments. All calculations are conducted using Python 3.6.5 on a computer with 2.60 GHz CPU and 8 GB RAM. The input features are scaled into [ − 1, 1] and the output remains unchanged. All the results reported in this paper take averages over 20 independent trials, except for the SVM and MTEN algorithms, which have fixed experiment results. The accuracy (ACC) and root mean square error (RMSE) are chosen as the classification and regression evaluation indicators, where with y m i andŷ m i representing the target output and the learner's output of i − th sample for task m respectively.
For different data sets, some algorithms used in the following experiments can stochastically configure hyperparameters within specified ranges or determine parameters by cross-validation. Table 1 gives the specific selection range of each parameter.

Experimental Results and Analysis on Simulated Data
The simulation data set selected in this paper is generated by the following five functions, which we refer as five tasks,  The results with the minimum test errors are marked in bold.
Frontiers in Bioengineering and Biotechnology | www.frontiersin.org August 2022 | Volume 10 | Article 890132 Task 1: Task 2: Task 3: Task 4: Task 5: Figure 3 depicts the distributions of five functions on [0, 1]. As we can see, when the independent variables of the five functions are the same, the function values follow similar trends. Therefore, learning the function values of five functions with the same independent variable can be regarded as a multi-task learning. Here, we independently extract 100 one-dimensional input data from the same uniform distribution, then calculate the corresponding function values according to these five functions, and add white Gaussian noise with a standard deviation of 0.01 to form 100 five-dimensional output data. In the following experiments, we randomly select 70% of the data as training data and 30% of the data as test data. From the figures of the five functions, it can be seen that only 70 training samples are not enough to achieve good single-task learning results. We verify this point by experimenting with single-task and multi-task algorithms.
Firstly, we compare the learning performance of the proposed MTSL-SCRBN with other two STSL algorithms, SVM and SC-III, on five tasks. For MTSL-SCRBN, these five tasks are combined to learn together. The training and test RMSEs on five tasks for these three methods are given in Table 2. Clearly, the proposed multi-task learning model can product better performance on each task than the two STSL models, which only use 70 samples to learn each task independently. Furthermore, we show the learning effects of the three algorithms on Task 1 in Figure 4. It is can be seen that the proposed MTSL-SCRBN has good learning performance where the data changes dramatically.
Next, the comparison results of three MTSL algorithms, MTSL-SCRBN, MTL, MTEN, on the simulation data set are recorded in Table 3 and Figure 5. In Table 3, the values in parentheses represent the standard deviations of 20 experiments' results. According to these results, compared with MTEN and MTL, the proposed MTSL-SCRBN has better approximation ability.

Experimental Results and Analysis on Benchmark Datasets
This subsection further compares seven MTSL algorithms on four benchmark datasets. They are MTL, MTEN, DMTRL, MMoE, GAMTL, AUTOMTL and the proposed MTSL-SCRBN. According to the characteristics of data sets and algorithms, different algorithms will be selected for comparative analysis on different data sets. The four benchmark datasets include three regression problems on the stock portfolio performance data set, the bionic robot data set SARCOS and the School data set, one classification problem on the Mnist data set from Yann LeCun 1 The basic information of the four datasets are summarized in Table 4.
Firstly, we compare the performance of MTL, MTEN, DMTRL, MMoE, GAMTL and MTSL-SCRBN on different sizes of three regression data sets. For the three data sets, we randomly choose 15/30/3,500 samples outside the training set as test set, respectively. At the same time, we select 5 tasks, all of which have more than 230 samples, from 139 tasks in School data set. Table 5 below shows the specific experiment results. As we can see, the performance of each algorithm tends to be better and more stable with an increasing number of training samples. Furthermore, the proposed MTSL-SCRBN algorithm exhibits good performance even with a small number of training observations. Then, in order to further verify the performance of MTSL-SCRBN for classification cases, we compare the results of MTSL-SCRBN, DMTRL, MMoE, GAMTL and AUTOMTL on the Mnist data set. We randomly choose 50/100/150 samples from each task in the Mnist data set as training set, and 10000 samples in the remaining samples as test set. Considering that this is a high The results with the minimum test errors are marked in bold.  The results with the minimum test errors are marked in bold.   dimensional small sample problem, we firstly reduce the dimensionality of the data set, and then use MTSL-SCRBN for training and prediction. There are many dimensionality reduction methods, such as Principal Component Analysis(PCA) (Pearson, 1901), Latent Dirichlet Allocation(LDA) (Blei et al., 2003), Sequential Markov Blanket Criterion (SMBC) (Pratama et al., 2017), Auto Encoder (Hinton and Salakhutdinov, 2006) and so on. Here, we use the performance of the dimensionality-reduced data in the MTSL-SCRBN as the selection criterion, and choose Auto Encoder to reduce the dimension of original data set into 30 dimensions. It can be seen from Table 6 that the performance of MTSL-SCRBN which uses dimensionality reduction data set is a little bit better than that of other Multi-task deep learning algorithms, but as the sample size increases, the performance of the two algorithms gradually approaches.

Comparison Experiment Results for Different Activation Functions
The previous results show that the proposed MTSL-SCRBN algorithm is effective for multi-task learning in the case of small samples. This subsection mainly discusses the impact of selecting different activation functions on algorithm performance. Here we select other three usually used activation functions. They are sigmoid function, Tanh function and ReLU function. After replacing the radial basis functions in MTSL-SCRBN with these three functions respectively, the model names are respectively called MTSL-SCSGM, MTSL-SCTANH and MTSL-SCReLU. We choose to conduct comparative experiments on the Stock and SARCOS data sets.
For different data sets, the parameters contained in each algorithm need to be randomly set or cross verified within a certain range. The specific selection range of each parameter is given in Table 7. Table 8 depicts the RMSE results of stochastic configuration multi-task learning models based on four different activation functions. Under the training samples with different sample sizes in the two data sets, the MTSL-SCRBN, which based on radial basis functions, has certain advantages over other three models in terms of performance.

CONCLUSION
In this paper, we propose a multi-task supervised learning framework based on stochastic configuration radial basis network. It can be effectively used in classification and regression problems when a single task has a small number of samples. The series experiment results on the four data sets show the proposed MTSL-SCRBN achieves a good performance compared with some existing methods.
Interesting areas for further directions include using the proposed algorithm in hyperspectral remote sensing image classification and other related research areas, considering the impact of using different activation functions in the network, and trying to explore the range of the sample size of the data set to use the multi-task learning method.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

FUNDING
This work was supported in part by the Characteristic and Preponderant Discipline of Key Construction Universities in Zhejiang Province (Zhejiang Gongshang University-Statistics) and the special fund for basic business expenses of colleges and universities in Zhejiang Province. The results with the minimum test errors are marked in bold.