Construction and validation of a nomogram for predicting the prognosis of patients with lymph node-positive invasive micropapillary carcinoma of the breast: based on SEER database and external validation cohort

Background Invasive micropapillary carcinoma (IMPC) of the breast is a rare subtype of breast cancer characterized by aggressive clinical behavior, a high incidence of lymph node metastasis (LNM), and poor prognosis. In the present study, using the Surveillance, Epidemiology, and End Results (SEER) database, we analyzed the clinicopathological characteristics and prognostic factors of IMPC with LNM and constructed a prognostic nomogram. Methods We retrospectively analyzed data for 487 breast IMPC patients with LNM in the SEER database from January 2010 to December 2015 and randomly divided these patients into a training cohort (70%) and an internal validation cohort (30%) for the construction and internal validation of the nomogram, respectively. In addition, 248 patients diagnosed with IMPC and LNM at the Fourth Hospital of Hebei Medical University from January 2010 to December 2019 were collected as an external validation cohort. LASSO regression, along with Cox regression, was used to screen risk factors. Furthermore, the discrimination, calibration, and clinical utility of the nomogram were assessed using the consistency index (C-index), time-dependent receiver operating characteristic (ROC) curves, calibration curves, and decision curve analysis (DCA). Results Six variables, namely molecular subtype of breast cancer, first malignant primary indicator, tumor grade, AJCC stage, radiotherapy, and chemotherapy, were identified as independent prognostic factors for IMPC patients with LNM (P < 0.05). Based on these factors, a nomogram was constructed for predicting 3- and 5-year overall survival (OS) of patients. The nomogram achieved a C-index of 0.789 (95% CI: 0.759-0.819) in the training cohort, 0.775 (95% CI: 0.731-0.819) in the internal validation cohort, and 0.788 (95% CI: 0.756-0.820) in the external validation cohort. According to the calculated risk score, patients were divided into a high-risk group and a low-risk group, and survival differed significantly between the two groups (P < 0.0001). The time-dependent ROC curves, calibration curves, and DCA curves supported the predictive performance and clinical utility of the nomogram. Conclusions We have successfully constructed a nomogram that could predict 3- and 5-year OS of IMPC patients with LNM and may assist clinicians in decision-making and personalized treatment planning.


Introduction
Invasive micropapillary carcinoma (IMPC) of the breast is a special type of invasive breast cancer, accounting for 0.9%-8% of all breast cancers (1,2). The tumor cells of IMPC are arranged in a pseudopapillary structure without a fibrovascular core. Epithelial membrane antigen (EMA) immunohistochemistry confirms the reverse polarity of the neoplastic cells, and there is an irregular, narrow gap between the cancer cell clusters and the surrounding stroma (Figure 1). Previous studies have suggested that the morphological characteristics of IMPC are related to tumor biological behavior, particularly invasion, metastasis, and prognosis (3,4). Even when micropapillary structures make up less than 10% of the tumor, its invasive capacity is significantly higher than that of breast cancer of the same pathological type without micropapillary components (5). Compared with invasive ductal carcinoma of no special type (IDC-NST), IMPC is prone to local recurrence, distant metastasis, and lymph node metastasis (LNM), with a high incidence of LNM of 44%-85% (6). At primary diagnosis, 24.9% of patients present with lymph node invasion (3). Lymph node status is not only an important basis for breast cancer staging, but also an independent prognostic indicator of breast cancer (7). Research has shown that breast IMPC patients with LNM have a higher risk of recurrence and a poorer prognosis (8). In a population-based study by Chen et al. (9), 52.9% of breast IMPC patients had LNM, with a 5-year overall survival (OS) of only 83.8%.
The American Joint Committee for Cancer (AJCC) staging system is a widely used tool for clinicians to predict disease outcomes and guide therapeutic decision making (10). However, this staging system includes only anatomical factors and does not cover factors such as cancer biology and treatment, so it is insufficient to accurately predict the prognosis of all IMPC patients. Because lymph node-positive IMPC patients derive limited benefit from neoadjuvant chemotherapy, applying a single predictive model to all IMPC patients can lead to erroneous survival estimates. In the era of precision medicine, the application of nomograms in individualized risk prediction is well recognized in a wide variety of cancers. Although various nomograms currently exist to predict the prognosis of IMPC, none is available for IMPC patients with LNM. Given the crucial role of powerful prognostic prediction tools in determining appropriate treatment methods to improve survival, it is necessary to discuss and construct predictive models for IMPC patients with positive lymph nodes to improve the accuracy of

FIGURE 1 (A) The tumor cells of IMPC are arranged in a pseudopapillary structure without a fibrovascular core, and there are irregular, narrow gaps between the cancer cell clusters and the surrounding stroma (hematoxylin and eosin (HE) staining, 100×). (B) Epithelial membrane antigen (EMA) staining shows that positive staining is located on the stromal side of the cancer cell clusters, at the edge of the cell membrane and the interstitial space, which is characteristic of polarity reversal (100×).

Data sources and patient selection
The Surveillance, Epidemiology, and End Results (SEER) database is a cancer database in the United States that includes 18 cancer registries, covering 47.9% of the US population (13). In this study, we used SEER*Stat 8.4.0 software to extract clinicopathological data and prognosis results of patients from the SEER 18 registry database. All selected patients were lymph node-positive IMPC patients (n=495) diagnosed from January 2010 to December 2015. During data screening, we found that records with missing values or outliers (n=8) accounted for a small proportion of the total sample (495). These records were therefore removed to improve the reliability of the statistical results. The final sample (n=487), with complete clinicopathological characteristics and follow-up data, was used for subsequent analysis. Following previous studies, these patients were randomly divided into a training cohort (70%) and an internal validation cohort (30%) for the construction and validation of the nomogram, respectively (14, 15). We consider 7:3 to be an appropriate ratio for this study. Using most of the data to construct the nomogram supports the accuracy of the model, while holding out a smaller portion for validation helps detect overfitting. In addition, we collected all IMPC patients with positive lymph nodes who visited the Fourth Hospital of Hebei Medical University from January 2010 to December 2019 as an external validation cohort (n=248) to further validate the constructed nomogram.
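The 7:3 random split described above can be illustrated with a minimal Python sketch. The function name `split_cohort`, the integer patient identifiers, and the seed value are assumptions made for illustration only; the study itself was carried out in R and does not report a seed.

```python
import random


def split_cohort(patient_ids, train_fraction=0.7, seed=2023):
    """Randomly split patient IDs into training and internal validation cohorts.

    The seed is an assumed value, used only to make the example reproducible.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# 487 eligible SEER patients, represented here by hypothetical integer IDs.
train_ids, valid_ids = split_cohort(range(487))
print(len(train_ids), len(valid_ids))
```

With 487 patients and a training fraction of 0.7, rounding gives 341 training and 146 validation patients, and every patient belongs to exactly one cohort.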
Eligible patients were determined based on the following inclusion criteria: 1) IMPC confirmed by pathology; 2) women aged ≥18 years at diagnosis; 3) pathologically confirmed lymph node metastases; 4) surgical treatment received. The exclusion criteria were as follows: 1) distant metastasis; 2) bilateral breast cancer; 3) missing clinicopathological characteristics or follow-up data.
Ethical approval was not required for the SEER data, which are publicly available. The study was approved by the Ethics Review Committee of the Fourth Hospital of Hebei Medical University (Approval Number: 2020K-1334), and written informed consent was obtained from all participants.

Variable collection
We recorded the following patient information: baseline demographics (age, race, marital status), tumor characteristics (laterality, location, size, first malignant primary indicator, histological grade, tumor node metastasis (TNM) staging, estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), molecular typing, number of positive lymph nodes, etc.), treatment information (surgical methods, radiotherapy, and chemotherapy), and survival outcomes (survival status, survival time). We restaged all patients in the study according to the AJCC 8th edition stage groups (16). For the analysis, some continuous variables were converted into categorical variables, namely age, tumor size, and number of positive lymph nodes. Patients were divided into two groups by age at diagnosis (<50 and ≥50 years); tumor size was divided into three groups (<2 cm, 2 to <5 cm, and ≥5 cm); and the number of lymph node metastases was divided into two groups (1-3 and ≥4). The endpoint of this study was OS, defined as the time from surgery to the date of last follow-up or death from any cause. Patients in the external validation cohort were followed up by inpatient medical record review, outpatient visits, and telephone. The last follow-up date was November 10, 2022.
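The conversion of continuous variables into categories can be sketched as follows. The cut points are those stated above; the function name `categorize_patient` and the label strings are hypothetical and chosen only for readability.

```python
def categorize_patient(age_years, tumor_size_cm, positive_nodes):
    """Map continuous values to the categories used in the analysis."""
    if positive_nodes < 1:
        # The study cohort only contains lymph node-positive patients.
        raise ValueError("at least one positive lymph node is required")

    age_group = "<50" if age_years < 50 else ">=50"

    if tumor_size_cm < 2:
        size_group = "<2 cm"
    elif tumor_size_cm < 5:
        size_group = "2 to <5 cm"
    else:
        size_group = ">=5 cm"

    node_group = "1-3" if positive_nodes <= 3 else ">=4"

    return {
        "age_group": age_group,
        "size_group": size_group,
        "node_group": node_group,
    }


# Hypothetical patient: 62 years old, 2.0 cm tumor, 4 positive lymph nodes.
print(categorize_patient(62, 2.0, 4))
```

Boundary values fall into the upper category, so a tumor of exactly 2 cm is grouped as 2 to <5 cm and a patient aged exactly 50 is grouped as ≥50.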

Statistical analysis
The raw data of the training cohort were preprocessed by Z-score normalization, and the same preprocessing procedure was applied to the validation cohort (Supplementary Figures 1, 2). The LASSO regression algorithm was used to screen clinicopathological characteristics that were significantly correlated with prognosis. Then, based on the final results of LASSO regression, the independent prognostic factors for OS were identified using multivariate Cox regression analysis in the training cohort, and the hazard ratio (HR) and 95% confidence interval (95% CI) of these variables were calculated. Based on these independent prognostic factors, a nomogram was constructed to predict the 3- and 5-year OS of IMPC patients with positive lymph nodes. Internal and external validation was also performed to further evaluate the nomogram. The consistency index (C-index), time-dependent receiver operating characteristic (ROC) curves, and time-dependent area under the ROC curve (AUC) were used to evaluate discrimination. The AUC and C-index range from 0.5 to 1.0, with 0.5 indicating discrimination no better than chance and 1.0 indicating perfect discrimination. Usually, C-index and AUC values >0.7 indicate satisfactory discriminative ability of a predictive tool. Calibration curves were plotted to assess the calibration of the nomogram. Each calibration curve was constructed using the bootstrap method (1000 resamples) to show the deviation between the predicted probability and the actual probability of occurrence. The standard curve is a straight line passing through the origin with a slope of 1. The closer the calibration curve is to the standard curve, the better the prediction ability of the nomogram. To compare the accuracy of the new model with that of the traditional AJCC staging model, the net reclassification improvement (NRI) and the integrated discrimination improvement (IDI) were determined. The clinical application value of the nomogram was evaluated using decision curve analysis (DCA). In addition, we divided all patients into high-risk and low-risk groups according to their risk values on the nomogram. The log-rank test was used to compare survival between the two groups, and Kaplan-Meier curves were used to visualize the results.
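As an illustration of the normalization step, the sketch below standardizes a variable with the mean and standard deviation of the training cohort and reuses those same parameters for a validation cohort. Reusing the training parameters, the use of the sample standard deviation, and the function names are assumptions; the text states only that the same preprocessing procedure was applied to both cohorts.

```python
from statistics import mean, stdev


def fit_zscore(train_values):
    """Estimate normalization parameters from the training cohort only."""
    mu = mean(train_values)
    sigma = stdev(train_values)  # sample standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot normalize a constant variable")
    return mu, sigma


def apply_zscore(values, mu, sigma):
    """Standardize values with previously fitted parameters."""
    return [(v - mu) / sigma for v in values]


# Hypothetical values of one variable in each cohort.
train = [1.0, 2.0, 3.0]
valid = [4.0]

mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
valid_z = apply_zscore(valid, mu, sigma)
print(train_z, valid_z)
```

Fitting the parameters on the training cohort alone keeps information from the validation cohorts out of model development.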
Continuous variables are described as mean ± standard deviation, and categorical variables are expressed as numbers (percentages). All statistical analyses were conducted using R software (version 4.1.1; http://www.R-project.org). All statistical tests were two-sided, and a P value < 0.05 was considered statistically significant.

Baseline characteristics of patients
According to the inclusion and exclusion criteria, a total of 487 IMPC patients were collected from the SEER database and randomly assigned to a training cohort (n=341) and an internal validation cohort (n=146) at a 7:3 ratio. In addition, we employed an external validation cohort composed of 248 Chinese patients who received treatment at the Fourth Hospital of Hebei Medical University from January 2010 to December 2019, selected using the same criteria. The study flow chart is shown in Figure 2, and baseline characteristics of the enrolled patients are summarized in Table 1. There were no statistically significant differences in clinicopathological characteristics between the training and internal validation cohorts (P > 0.05).

Construction of nomogram
A total of 19 candidate variables were entered into the LASSO regression with 10-fold cross-validation to determine the prognostic factors of 3- and 5-year overall survival (OS) in breast IMPC patients with LNM. At the optimal penalty (minimum lambda of 0.04532564), 10 variables were selected: marital status, first malignant primary indicator, tumor size, clinical T stage, TNM stage, tumor grade, molecular subtype of breast cancer, operation mode, chemotherapy, and radiotherapy (Figure 3). The variables selected by LASSO regression were then included in the multivariable Cox regression analysis, and the results are presented as HR and 95% CI. The following factors were significantly related to patient prognosis: first malignant primary indicator (HR=0.40, 95% CI=0.21 4).
On the basis of the multivariate Cox regression in the training cohort, a nomogram integrating six independent risk factors was established to predict 3-year and 5-year OS in lymph node-positive IMPC patients (Figure 5). The value of each risk factor is assigned a score by drawing a vertical line from that value to the point scale axis. A total score is calculated by adding the individual scores and is located on the total point scale axis. The probabilities of 3-year and 5-year OS are then estimated by drawing a vertical line from the total point axis down to the survival axes. The breast IMPC prognosis nomogram established in this study can be accessed and used online at https://liyifei-1996.shinyapps.io/IMPCDynNomapp/.
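The point-assignment logic of a Cox-based nomogram can be sketched in Python. The sketch follows the common convention in which the variable with the widest range of regression coefficients spans 0 to 100 points and all other variables are scaled against it. The coefficients below are hypothetical and are not the fitted values of this study, which are shown only graphically in Figure 5; the function names are likewise illustrative.

```python
def build_point_table(coefficients):
    """Convert Cox coefficients (log hazard ratios) into nomogram points.

    coefficients: {variable: {level: coefficient}}, reference levels at 0.0.
    The variable with the widest coefficient range spans 0 to 100 points.
    """
    widest = max(
        max(levels.values()) - min(levels.values())
        for levels in coefficients.values()
    )
    table = {}
    for variable, levels in coefficients.items():
        lowest = min(levels.values())
        table[variable] = {
            level: 100 * (beta - lowest) / widest
            for level, beta in levels.items()
        }
    return table


def total_points(table, patient):
    """Sum the points of each variable for one patient."""
    return sum(table[variable][patient[variable]] for variable in table)


# Hypothetical coefficients for two of the six predictors (illustration only).
hypothetical_coefficients = {
    "grade": {"I-II": 0.0, "III": 0.5},
    "chemotherapy": {"yes": 0.0, "no": 1.0},
}
table = build_point_table(hypothetical_coefficients)
score = total_points(table, {"grade": "III", "chemotherapy": "no"})
print(table, score)
```

In this toy example chemotherapy has the widest coefficient range and therefore spans the full 0 to 100 scale, while grade III contributes 50 points, giving a total of 150 points for the example patient.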

Evaluation and validation of the nomogram
The C-indices of the training cohort, internal validation cohort, and external validation cohort are 0.789 (95% CI: 0.759-0.819), 0.775 (95% CI: 0.731-0.819), and 0.788 (95% CI: 0.756-0.820), respectively. The time-dependent ROC curves show that the nomogram has good predictive performance for 3-year and 5-year OS in breast IMPC patients with LNM (Figure 6). As illustrated in Figure 6, the AUCs for 3- and 5-year OS are 0.741 and 0.748 in the training cohort, 0.740 and 0.741 in the internal validation cohort, and 0.804 and 0.767 in the external validation cohort, respectively. The calibration curves of the training, internal validation, and external validation cohorts indicate that the probabilities predicted by the nomogram are close to the observed probabilities, showing strong consistency (Figures 7, 8). DCA curves, a more recent method for evaluating diagnostic and prognostic prediction models, were also drawn to assess the clinical application value of the nomogram; they show that, compared with the traditional TNM staging method, the nomogram predicts the 3- and 5-year OS of IMPC patients more accurately (Figures 9, 10).
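For readers unfamiliar with the C-index, the following minimal sketch computes Harrell's concordance index for right-censored survival data. It is a plain-Python illustration with hypothetical data, not the R routine used in the study; pairs with tied survival times are skipped and tied risk scores count as half concordant, which is one common convention.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i has an observed event and a
    shorter follow-up time than patient j. It is concordant when patient i
    also has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up (months), event indicators, and predicted risks.
times = [5, 10, 15, 20]
events = [1, 0, 1, 0]
risks = [0.9, 0.4, 0.05, 0.1]
c_index = harrell_c_index(times, events, risks)
print(c_index)
```

In this toy example four pairs are comparable and three are ranked correctly, so the C-index is 0.75.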

Ability of nomogram to stratify patient risk
Based on the prognostic signature, we calculated the risk score for each patient and stratified all patients into a high-risk group (score ≥ 152.884) or a low-risk group (score < 152.884). Compared with the low-risk group, OS is significantly lower in patients in the high-risk group (Figure 11A). In addition, the Kaplan-Meier curves of the internal and external validation cohorts show similar performance to those of the training cohort, demonstrating a significant difference in survival between the predicted high- and low-risk groups (Figures 11B, C).
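The stratification and the survival curves it produces can be sketched as follows. The cut-off of 152.884 points is taken from the text; the patient data, function names, and the plain-Python Kaplan-Meier estimator are illustrative assumptions rather than the authors' implementation.

```python
CUTOFF = 152.884  # total-points threshold reported in the text


def risk_group(points, cutoff=CUTOFF):
    """Assign a patient to the high- or low-risk group."""
    return "high" if points >= cutoff else "low"


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Hypothetical patients: (total points, follow-up in months, death observed).
patients = [
    (180.0, 12, True), (160.5, 30, False), (152.884, 45, True),
    (90.0, 60, False), (120.0, 50, True), (75.2, 72, False),
]

groups = {"high": ([], []), "low": ([], [])}
for points, months, died in patients:
    group_times, group_events = groups[risk_group(points)]
    group_times.append(months)
    group_events.append(died)

curves = {name: kaplan_meier(t, e) for name, (t, e) in groups.items()}
print(curves)
```

A patient scoring exactly 152.884 falls into the high-risk group, consistent with the "≥" in the definition. Comparing the two curves formally would require a log-rank test, which is not implemented here.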

Discussion
IMPC is a special type of breast cancer with poor prognosis. Although recent studies have shown no statistically significant differences between IMPC and IDC-NST in OS and disease-free survival (DFS) (17, 18), most IMPC patients are more likely to receive intensive treatment because of the tumor's unique morphological structure and invasive biological behavior. An accurate risk model can therefore help clinicians identify high-risk patients and formulate more personalized treatment plans for IMPC patients. As far as we know, this is the first study to construct a nomogram integrating clinicopathological characteristics to predict the prognosis of IMPC patients with LNM. Our model has a higher C-index in the training cohort and external validation cohort than the nomogram previously published by Chen et al. (11) (training cohort C-index: 0.789 vs 0.756; external validation cohort C-index: 0.788 vs 0.742), and a higher AUC in the external validation cohort (3-year OS: 0.804 vs 0.766; 5-year OS: 0.767 vs 0.725), indicating that the nomogram has higher accuracy in predicting patient prognosis. In the training cohort and the two validation cohorts, the calibration curves also showed a high degree of agreement between predicted and observed results, reflecting the reliability of the prediction model. DCA further demonstrated that our nomogram has promising clinical applicability compared with the traditional AJCC staging system. In addition, the risk stratification model based on this nomogram can effectively classify patients into high-risk and low-risk groups with distinct OS. Patients in the low-risk group may have a good prognosis, while for patients at higher risk, clinicians can intervene and adjust treatment plans in a timelier manner to improve the prognosis.
This study included 487 lymph node-positive IMPC patients from the SEER database and 248 patients from the Fourth Hospital of Hebei Medical University. The SEER database is one of the most comprehensive databases, including sociodemographic data, treatment history, and clinicopathological and molecular factors, which allowed us to adjust for a large number of important confounders and the interactions between them (19). Compared with the nomogram published by Wang et al. (12), we established an external validation dataset from populations of different races, regions, and economic and social environments. The nomogram achieves good accuracy and stability in internal and external validation, and is applicable in various clinical scenarios. To avoid overfitting or underfitting, we tried to determine the optimal model using LASSO regression and Cox regression (20). The former can effectively screen variables, while the latter can be used for modeling and visualization with direct interpretation. Chen et al. (11) and Wang et al. (12) each established a nomogram for predicting the prognosis of IMPC patients through univariate and multivariate Cox analysis. However, we considered that too many predictors are unnecessary for clinical application.
Our study determined that six variables, namely molecular subtype of breast cancer, first malignant primary indicator, tumor grade, AJCC stage, radiotherapy, and chemotherapy, are independent risk factors for OS in breast IMPC patients with positive lymph nodes. Compared with costly molecular markers that are not routinely measured, these variables are convenient, easily accessible, and inexpensive, which may improve follow-up compliance and patient survival. Traditionally, histological grade and AJCC stage are key factors for the prognosis of breast cancer patients. The higher the histological grade and AJCC stage, the worse the prognosis of IMPC patients (24,25). A sizeable study of 1268 patients suggested that pathologic data (i.e., grade/stage) were sufficient to replace the Oncotype RS in distinguishing between low-risk and high-risk populations. Our model also indicates that AJCC stage contributes the most to the prognosis, followed by histological grade. The inclusion of additional clinicopathological characteristics gives our nomogram a more accurate prognosis prediction ability, which can be used to improve the AJCC TNM staging system or as a supplementary version [...] tumor, clinicians can provide appropriate strategies for follow-up and treatment (30). In clinical practice, surgery combined with radiotherapy and chemotherapy is currently an important part of the standardized treatment system for breast cancer, and active treatment is of great clinical significance in improving the quality of life and prolonging the survival of patients (31-33). Our research also confirmed that radiotherapy and chemotherapy are important factors affecting the prognosis of breast cancer. In the analysis of prognostic factors, we also found that patients whose breast cancer was not the first primary malignancy tended to have a poorer prognosis, which is also true in other cancers (34-36). In general, a higher number of positive lymph nodes and a higher lymph node metastasis rate are associated with poor prognosis in breast cancer patients (37-39). For IMPC patients, a previous study (15) showed by univariate analysis that IMPC patients with ≥4 positive lymph nodes have shorter OS than lymph node-negative patients, while the OS of IMPC patients with 1-3 positive lymph nodes is similar to that of patients with lymph node-negative disease. Our study also found that the number of positive lymph nodes had no significant impact on the prognosis of IMPC patients with LNM. This finding may be unique to this particular subtype of breast cancer, although several contributing and confounding factors may also play a role. In addition, our nomogram excludes unimportant factors such as race and marital status, which helps doctors save time and effort in collecting unnecessary information.
This study has limitations. Firstly, the nomogram was constructed on the basis of a retrospective cohort, and selection bias and recall bias may have influenced the results. Prospective studies are required to validate our results. Secondly, the external validation cohort comes from a single center and has a relatively small sample size. In the future, multicenter clinical trials with larger sample sizes and different ethnic groups are needed to evaluate the predictive performance of this prognostic model.

LASSO coefficient distribution of predictive factors (A) and selection of the optimal parameter (lambda) in the LASSO model (B).
Forest plot of the multivariate Cox regression analysis.

FIGURE 2 Study flow chart of IMPC patients with LNM.

FIGURE 5 A nomogram for predicting 3- and 5-year OS of IMPC patients with LNM.

FIGURE 6 Time-dependent ROC curves for the nomogram's prediction of 3- and 5-year OS in the training cohort (A), internal validation cohort (B), and external validation cohort (C).

FIGURE 8 Calibration curves for the nomogram's prediction of 5-year OS in the training cohort (A), internal validation cohort (B), and external validation cohort (C).
Compared with previous studies (11,12), our study used LASSO regression to screen for significant factors related to OS, in order to avoid redundancy and overfitting, and then constructed a nomogram. In addition, we not only built a web-based calculator based on the nomogram, but also performed risk stratification, making the model more convenient for clinical practice. Finally, we validated the predictive performance of the developed nomogram internally and in the largest external cohort in China. Therefore, the aim of this study is to construct and validate a nomogram based on clinicopathological characteristics to predict the prognosis of IMPC patients with LNM. The nomogram is used to divide patients into high-risk and low-risk groups, which has important clinical significance for improving the accuracy of staging, adjuvant treatment strategies, and prognosis evaluation.

TABLE 1 Clinicopathological characteristics of IMPC patients in the training, internal validation, and external validation cohorts.