Edited by: Shu Tao, UCLA Jonsson Comprehensive Cancer Center, United States
Reviewed by: Kumardeep Chaudhary, Icahn School of Medicine at Mount Sinai, United States; Wei Chen, City of Hope National Medical Center, United States
This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Breast cancer is one of the most common cancers in women, and its rapid, accurate diagnosis is of great significance for treatment. Artificial intelligence and machine learning algorithms can identify malignant breast tumors and thereby address the insufficient recognition accuracy and long diagnosis times of traditional breast cancer diagnosis methods. To this end, we propose a method combining attribute selection based on random forest (RF) with feature extraction based on principal component analysis (PCA) for rapid and accurate diagnosis of breast cancer. First, RF was used to reduce the 30 attributes of the breast cancer categorical data: according to the average importance of the attributes and the out-of-bag error, 21 relatively important attributes were retained for PCA-based feature extraction. The seven features extracted by PCA were then used to establish extreme learning machine (ELM) classification models with different activation functions. By comparing the classification accuracy and training time of these models, the sigmoid function was chosen as the hidden-layer activation function. With 27 neurons in the hidden layer, the accuracy on the test set was 98.75%, the accuracy on the training set was 99.06%, and the training time was only 0.0022 s. Finally, to verify the superiority of this method for breast cancer diagnosis, we compared it with an ELM model built on the original breast cancer data and with other intelligent classification algorithms; the proposed algorithm achieved both faster recognition and higher recognition accuracy. We also used breast cancer data based on breast tissue impedance features to verify the reliability of the method, and ideal results were obtained.
The experimental results show that RF-PCA combined with ELM can significantly reduce the time required for breast cancer diagnosis, enabling rapid and accurate identification of breast cancer and providing a theoretical basis for its intelligent diagnosis.
Cancer is a disease that seriously threatens human health. The latest annual report on cancer incidence in the United States (
The traditional diagnosis of breast cancer relies mainly on fine-needle aspiration cytology (
Random forest (RF) is a supervised learning algorithm that can select features according to attribute importance, reducing the complexity of the model (
The present work is concerned with the development of an analytical method for rapid identification of breast cancer categorical data based on attribute selection and feature extraction. First, RF performs attribute selection on the original breast cancer data, and the samples are divided into a training set and a test set. Then, PCA performs feature extraction and dimensionality reduction on the selected attribute data. Finally, the extracted feature data are used as the input of the ELM to establish the identification model for malignant breast tumors. Brief conclusions and future work are summarized at the end of the article.
The validity and feasibility of the methods described in this article were verified by the University of Wisconsin breast cancer data sets (
The computer used in the experiments has an Intel Core i7 processor, an NVIDIA RTX 2070 graphics card, and 16 GB of Kingston memory. The algorithm simulations are run in the MATLAB R2016b (MathWorks, United States) environment.
Feature selection chooses features from the original attribute data, yielding a new feature subset composed of original features and thus reducing the number of attributes in the attribute set. It is an inclusive relationship and does not change the original feature space (
The steps of the RF attribute selection algorithm are as follows:
Attribute importance is measured by the increase in out-of-bag (OOB) error when the values of an attribute are randomly permuted: the larger the increase, the more important the attribute.
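The iterative threshold-based selection loop can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB implementation; `importance_fn` is a hypothetical stand-in for one RF training pass that returns an importance score for each remaining attribute:

```python
def select_attributes(importances, threshold=0.1):
    """Indices of attributes whose importance meets the threshold."""
    return [i for i, imp in enumerate(importances) if imp >= threshold]


def iterative_rf_selection(columns, importance_fn, threshold=0.1):
    """Repeat RF ranking and thresholding until no attribute is dropped,
    mirroring the iterative selection described above."""
    kept = list(columns)
    while True:
        imps = importance_fn(kept)                     # one score per kept column
        survivors = [c for c, s in zip(kept, imps) if s >= threshold]
        if len(survivors) == len(kept):                # stable: nothing dropped
            return kept
        kept = survivors
```

Each pass retrains the forest on the surviving attributes, so importances are recomputed on a smaller attribute set until the set stabilizes.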
Standardization refers to pre-processing the data so that the values fall into a unified range, reducing differences in feature scale during modeling (
[0,1] normalization: x′ = (x − x_min)/(x_max − x_min).
[−1,1] normalization: x′ = 2(x − x_min)/(x_max − x_min) − 1.
Where x_min and x_max are the minimum and maximum values of the corresponding feature.
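Assuming the conventional min-max and z-score definitions, the three pre-processing mappings can be sketched as:

```python
def normalize_01(x, xmin, xmax):
    """Min-max normalization into [0, 1]: x' = (x - xmin) / (xmax - xmin)."""
    return (x - xmin) / (xmax - xmin)


def normalize_pm1(x, xmin, xmax):
    """Min-max normalization into [-1, 1]: x' = 2(x - xmin)/(xmax - xmin) - 1."""
    return 2.0 * (x - xmin) / (xmax - xmin) - 1.0


def standardize(x, mean, std):
    """Z-score standardization: x' = (x - mean) / std."""
    return (x - mean) / std
```

In practice x_min, x_max, mean, and std are computed per feature on the training set and reused unchanged on the test set.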
Feature extraction transforms the feature space using the relationships between attributes, mapping the original feature space to a low-dimensional one and thereby accomplishing dimension reduction (
The steps of the PCA algorithm for feature extraction are as follows:
The cumulative contribution rate of the principal components serves as the basis for selecting the number of components to retain.
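A minimal NumPy sketch of this procedure (standardization, eigendecomposition of the covariance matrix, and component selection by cumulative contribution rate); this is an illustration under standard PCA assumptions, not the authors' MATLAB code:

```python
import numpy as np

def pca_extract(X, cum_threshold=0.95):
    """Standardize each column, eigendecompose the covariance matrix,
    and keep the fewest components whose cumulative variance
    contribution rate reaches the threshold."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)       # z-score each feature
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(eigvals)[::-1]               # descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()                 # variance contribution rate
    k = int(np.searchsorted(np.cumsum(ratio), cum_threshold) + 1)
    return Xs @ eigvecs[:, :k], ratio[:k]
```

With a 0.95 threshold this returns the smallest number of components whose cumulative contribution reaches 95%, matching the selection rule used in the article.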
The ELM is a simple and efficient learning algorithm proposed by Professor Huang (
Network structure diagram of ELM.
Given the hidden-layer output matrix H and the target matrix T, the output weights β satisfy Hβ = T and are obtained in a single step as β = H†T, where H† is the Moore-Penrose generalized inverse of H.
The steps of the ELM algorithm are as follows:
Sin function, Hardlim function, and Sigmoid function can be selected as the activation function of hidden layer neurons (
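The one-step training described above can be sketched in NumPy. This is a minimal sketch of the standard ELM procedure with the sigmoid activation, not the authors' MATLAB code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, rng):
    """Draw random input weights and biases, then solve the output
    weights in one step with the Moore-Penrose pseudoinverse."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)              # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T        # output weights, no iteration
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta
```

Because the input weights are never updated, training cost is dominated by a single least-squares solve, which is why ELM training times in the article are on the order of milliseconds.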
In order to better evaluate classifier performance, we introduce the confusion matrix. In machine learning, the confusion matrix is a visual tool for evaluating the performance of classification models: each column of the matrix represents the predicted class, and each row represents the actual class (
Accuracy is the ratio of the correctly classified examples to the total sample size.
Precision is the proportion of samples predicted as positive that are actually positive.
Sensitivity is the proportion of actual positive samples that are correctly classified as positive.
Specificity is the proportion of actual negative samples that are correctly classified as negative.
F1-score is an index used to measure the accuracy of a binary classification model.
MCC is essentially a balanced index that describes the correlation coefficient between the actual classification and the predicted classification, which is used to measure the classification performance of binary classification. The value range of MCC is [−1,1]. The closer the MCC value is to 1, the better the classifier performance.
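Assuming the standard confusion-matrix counts (true positives TP, false positives FP, true negatives TN, false negatives FN), the six evaluation indexes above can be computed as:

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Evaluation indexes computed from the binary confusion matrix."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)        # true-positive rate (recall)
    specificity = tn / (tn + fp)        # true-negative rate
    f1  = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, precision, sensitivity, specificity, f1, mcc
```

MCC stays in [−1, 1] by construction, with 1 indicating perfect agreement between predicted and actual classes.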
There are 30 attributes in the original breast cancer data, each containing information about the breast tumor lesion tissue. Different attributes play different roles in the analysis of the data. Redundant and less important attributes hinder the establishment of a breast cancer predictive model: they prevent high prediction accuracy, increase model complexity, and reduce prediction efficiency. RF-based attribute selection is therefore used to retain the more important attributes, improving modeling efficiency and prediction ability. Before RF is applied, we set the number of trees to 200, the number of leaf node samples to 1, and the fboot parameter to 1.
The importance ranking from the initial attribute selection is shown in
Ranking of attribute importance for RF initial selection.
The threshold of RF attribute selection is set to 0.1; attributes whose importance falls below the threshold are deleted, and the remaining 27 attributes form the result of the initial RF selection. RF selection is then repeated on these 27 attributes: redundant attributes below the threshold are deleted, the importance of each attribute in the remaining set is recalculated, and the attributes are re-ranked in descending order of importance.
The ranking of attribute importance for four iterations is shown in
Ranking of attribute importance. The threshold value of attribute selection based on RF is set to 0.1.
Evaluation indexes of five iterations.
Iteration | Number of attributes | Average importance | Out-of-bag error
1 | 27 | 0.4315 | 0.0335
2 | 26 | 0.4381 | 0.0320
3 | 22 | 0.4792 | 0.0337
4 | 21 | 0.5214 | 0.0318
5 | 21 | 0.4987 | 0.0322
After RF selection, the number of attributes is reduced by 9 compared with the original data, but considerable redundant information remains among the retained attributes. To meet the requirement of accurate breast cancer prediction, PCA is needed to further simplify the data. When PCA is used for feature extraction, each feature is first standardized so that all features fall within the same range; this prevents PCA from over-weighting features with large values, which would lose a large amount of information and let those features dominate the result. PCA is then applied to the 21 attributes of the breast cancer data retained after attribute selection, with a cumulative contribution rate of 95%.
160 samples from each class are selected, giving a total of 320 breast cancer samples for the training set; the remaining 40 samples from each class, 80 in total, form the test set. The [0, 1] normalization, [−1, 1] normalization, and standardization pre-processing methods are compared before feature extraction.
The predictive results of different normalization methods are shown in
Predictive results of different normalization methods.
Pre-processing method | Number of principal components | Training accuracy/% | Test accuracy/%
[0,1] | 2 | 90.94 (291/320) | 96.25 (77/80)
[−1,1] | 2 | 90.63 (290/320) | 95 (76/80)
Standardization | 7 | 99.06 (317/320) | 98.75 (79/80)
From the variance contribution rate of the principal components in
Variance contribution rate of the principal components.
Cumulative contribution of principal components.
Principal component | 1 | 2 | 3 | 4 | 5 | 6 | 7
Cumulative contribution rate/% | 56.43 | 71.56 | 80.15 | 86.97 | 91.17 | 94.22 | 95.99
The prediction performance of the ELM model is affected by the type of activation function. By comparing the predictive results for breast cancer under the three activation functions sin, hardlim, and sigmoid, the activation function with the best prediction effect is selected. The seven extracted features are used to establish ELM predictive models under the different activation functions, and the predictive results are shown in
Predictive results of different activation functions.
Activation function | Training time/s | Training accuracy/% | Test accuracy/% | Number of hidden neurons
Sin | 0.0067 | 97.81 (313/320) | 95 (76/80) | 104
Hardlim | 0.0029 | 98.13 (314/320) | 98.75 (79/80) | 53
Sigmoid | 0.0022 | 99.06 (317/320) | 98.75 (79/80) | 27
In the ELM predictive model, the numbers of input layer, hidden layer, and output layer neurons and the network structure must be determined. Since seven features are extracted, the number of input layer neurons is 7. Because two types of breast tumors are predicted, the number of output neurons is 2. The number of hidden layer neurons is the key parameter affecting the prediction ability and generalization performance of the ELM. The initial number of hidden layer neurons is set to 1, and ELM models with different numbers of hidden layer neurons are analyzed for breast cancer prediction. To limit the training time of the model, the number of hidden layer neurons is capped at 200.
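The hidden-layer sweep described above can be sketched as follows; `evaluate` is a hypothetical callback that trains an ELM with the given number of hidden neurons and returns its accuracy:

```python
def best_hidden_size(evaluate, max_neurons=200):
    """Sweep hidden-layer sizes 1..max_neurons and keep the size with
    the highest accuracy, preferring the smaller network on ties."""
    best_n, best_acc = 1, -1.0
    for n in range(1, max_neurons + 1):
        acc = evaluate(n)       # e.g. train an ELM with n hidden neurons
        if acc > best_acc:      # strict '>' keeps the smaller tied network
            best_n, best_acc = n, acc
    return best_n, best_acc
```

Because ELM training is a single least-squares solve, sweeping all 200 candidate sizes remains cheap.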
As shown in
Predictive accuracy of different hidden layer neurons.
Training time of different hidden layer neurons.
In order to prove the reliability of the attribute selection and feature extraction algorithms for breast cancer data modeling, the predictive results on the original data, the data after attribute selection, and the data after feature extraction are compared, and the results are shown in
Predictive results of different dimensionality reduction methods.
Method | Number of features | Training accuracy/% | Test accuracy/% | Training time/s | Number of hidden neurons
ELM | 30 | 95.31 (305/320) | 95 (76/80) | 0.0020 | 14
RF + ELM | 21 | 97.5 (312/320) | 96.25 (77/80) | 0.0023 | 24
PCA + ELM | 10 | 97.19 (311/320) | 97.5 (78/80) | 0.0028 | 13
RF-PCA + ELM | 7 | 99.06 (317/320) | 98.75 (79/80) | 0.0022 | 27
In order to verify the superiority of the predictive model based on the RF-PCA reduced breast cancer data, we also compared the prediction performance of several different modeling methods built on the reduced data: a probabilistic neural network (PNN), a support vector machine (SVM), a back-propagation (BP) neural network, and a decision tree (DT). The optimal parameter
The predictive results of different modeling methods are shown in
Predictive results of different modeling methods.
Model | Training time/s | Training accuracy | Training precision | Training sensitivity | Training specificity | Training F1-score | Training MCC | Test accuracy | Test precision | Test sensitivity | Test specificity | Test F1-score | Test MCC
PNN | 0.0339 | 99.69% | 99.38% | 100% | 99.38% | 99.69% | 0.99 | 95% | 95% | 95% | 95% | 95% | 0.9
SVM | 1.4601 | 99.06% | 98.16% | 100% | 98.13% | 99.07% | 0.98 | 95% | 97.37% | 92.5% | 97.5% | 94.87% | 0.9
BP | 9.6259 | 100% | 100% | 100% | 100% | 100% | 1 | 93.75% | 92.68% | 95% | 92.5% | 93.83% | 0.88
DT | 0.1669 | 98.13% | 98.13% | 98.13% | 98.13% | 98.13% | 0.96 | 95% | 95% | 95% | 95% | 95% | 0.9
ELM | 0.0022 | 99.06% | 98.16% | 100% | 98.13% | 99.07% | 0.98 | 98.75% | 97.56% | 100% | 97.5% | 98.76% | 0.98
Applying the same algorithm to different data sets tests its reliability: if the algorithm proposed in this article achieves good prediction results on different data sets, it has strong adaptability and generalization performance. The generalization performance of the algorithm is verified using the data (
We also compare classifiers commonly used in the breast cancer recognition literature on the new data. The reduced-dimension data are fed into the other classifiers to establish predictive models. The ELM model achieves its best prediction performance when the number of hidden layer neurons is 97. The optimal parameter
Predictive results of dimensionality reduction by RF-PCA.
Model | Training time/s | Training accuracy | Training precision | Training sensitivity | Training specificity | Training F1-score | Training MCC | Test accuracy | Test precision | Test sensitivity | Test specificity | Test F1-score | Test MCC
Raw + ELM | 0.0096 | 95% | 85.71% | 100% | 85.71% | 92.31% | 0.86 | 92.31% | 92.86% | 97.5% | 92.5% | 95.12% | 0.9
ELM | 0.0011 | 100% | 100% | 100% | 100% | 100% | 1 | 96.15% | 92.31% | 100% | 92.86% | 96% | 0.93
PNN | 0.0314 | 91.25% | 94.59% | 87.5% | 95% | 90.91% | 0.83 | 88.46% | 80% | 100% | 78.57% | 88.89% | 0.79
SVM | 0.1592 | 97.5% | 95.24% | 100% | 95% | 97.56% | 0.95 | 96.15% | 100% | 91.67% | 100% | 95.65% | 0.93
BP | 1.3080 | 100% | 100% | 100% | 100% | 100% | 1 | 84.62% | 78.57% | 91.67% | 78.57% | 84.62% | 0.7
DT | 0.0551 | 96.25% | 93.02% | 100% | 92.5% | 96.39% | 0.93 | 92.31% | 91.67% | 91.67% | 92.86% | 91.67% | 0.85
All of these results show that the method proposed in this article still achieves better prediction performance and faster speed when applied to the new dataset to predict new samples. To a certain extent, the proposed method reduces the possibility of model overfitting.
In this article, we put forward a new solution for rapid breast cancer diagnosis based on attribute selection and feature extraction, called RF-PCA. First, we used RF-based attribute selection to select the useful attributes of the quantitative feature data of breast tumor cell images, and then used PCA-based feature extraction to reduce the dimension of the selected data. Finally, an ELM model was established to test the prediction effect for breast cancer. To verify the reliability of this algorithm, we compared the prediction accuracy of ELM models using RF or PCA alone. To verify its superiority, we also compared the prediction performance of different models and used breast tissue impedance data to verify the adaptability of the algorithm.
The results show that (1) RF-based feature selection or PCA-based feature extraction can not only reduce the complexity of the training model but also improve prediction accuracy to a certain extent; (2) combining feature selection with feature extraction exploits the advantages of both dimension-reduction methods: compared with a single method, it reflects the effective information of the original data with fewer features, keeps the model simple, and improves modeling efficiency and reliability; (3) the ELM model achieves high prediction accuracy with short training time, effectively avoids over-fitting, and has a certain generalization ability; (4) RF-PCA combined with the ELM model significantly reduces network training time and better meets the requirements of rapid, accurate computer-aided breast cancer diagnosis.
Despite these research results, this study has some limitations. When the proposed algorithm is used for breast cancer diagnosis, training time is reduced and prediction accuracy is good, but the main advantage lies in fast prediction, and the model does not reach optimal accuracy on all samples. In future work, it will therefore be necessary to study optimization algorithms that improve model performance and achieve the highest prediction accuracy while maintaining fast prediction speed.
Publicly available datasets were analyzed in this study. This data can be found here:
KB conceived the study. MZ developed the method and supervised the study. KB and WL implemented the algorithms. KB and FH analyzed the data. KB wrote the manuscript. All authors read and approved the final version of the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank the Departments of Computer Science, Surgery, and Human Oncology at the University of Wisconsin-Madison, United States, for making the database available.
The Supplementary Material for this article can be found online at: