Development and Validation of a Deep Learning Algorithm for Automatic Detection of Pituitary Microadenoma From MRI

Background: Pituitary microadenoma (PM) is often difficult to diagnose by MRI alone because of its small size, the variable anatomy of the pituitary region, and the complex clinical symptoms and signs that differ among individuals. We developed and validated a deep learning-based system to diagnose PM from MRI. Methods: A total of 11,935 participants with infertility were initially recruited for this project. After applying the exclusion criteria, 1,520 participants (556 PM patients and 964 control subjects) were included and stratified into 3 non-overlapping cohorts. The training data were derived from a retrospective study, and the validation data comprised prospective temporal and geographical validation sets. A total of 780 participants were used for training, 195 for testing, and 545 for validating the diagnostic performance. The PM-computer-aided diagnosis (PM-CAD) system consists of two parts: pituitary region detection and PM diagnosis. Its diagnostic performance was measured using the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the calibration curve, accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1-score. Results: The PM-CAD system achieved a diagnostic accuracy of 94.36% and an AUC of 98.13% on the testing dataset. To confirm the robustness and generalization of the system, we evaluated it on further datasets: the diagnostic accuracy was 96.50% in the internal dataset and 92.26 and 92.36% in the external datasets, and the AUC was 95.5, 94.7, and 93.7%, respectively. In a human-computer competition, the diagnostic performance of the PM-CAD system was comparable to that of radiologists with >10 years of professional expertise (diagnostic accuracy of 94.0% vs. 95.0%, AUC of 95.6% vs. 95.0%). For the cases misdiagnosed by radiologists, our system gave a correct diagnosis in 100% of cases. Browser-based software was designed to assist PM diagnosis. Conclusions: This is the first report showing that the PM-CAD system is a viable tool for detecting PM. Our results suggest that the PM-CAD system is applicable to radiology departments, especially in primary health care institutions.


INTRODUCTION
A pituitary microadenoma (PM) is a pituitary tumor <10 mm in diameter. PMs can occur in either sex. As many as 10% of the population may have a microadenoma, but most do not cause symptoms (1,2). However, some PMs cause symptoms by secreting hormones with harmful consequences, for example, in Cushing's disease, acromegaly, infertility, and hyperprolactinemia (1). Because of its small size and the anatomical variability among individuals, PM is not easy to diagnose by MRI alone (3). Manual analysis of MRI data is time-consuming and prone to bias, and the diagnostic accuracy is closely related to the experience of the radiologist. A shortage of experienced radiologists may delay diagnosis and compromise the overall quality of service to patients with PM (4,5). Deep learning has the potential to improve the diagnostic accuracy of PM while reducing the workload of radiologists. The development of convolutional neural networks (CNNs) has significantly improved the performance of image classification and object detection (6). Recent reports showed that a computer-aided diagnosis (CAD) system can accurately diagnose patients with pituitary adenoma from MR images (7-9). In this work, we developed and validated an image-based deep learning model to aid the detection of PM.

Data Collection and Pre-processing of MRI Data
The development and validation of a deep learning algorithm to assist PM diagnosis was prompted by several misdiagnosed PM cases in our hospital (Supplementary Figure 1). We developed and validated an automatic diagnosis model for the detection of PM. The training set came from a retrospective study, with data extracted from January 2012 to September 2019 at The Third Affiliated Hospital of Sun Yat-sen University (TianHe and LuoGang hospital). Validation set 1 was a prospective temporal validation using data from October 2019 to April 2021 at The Third Affiliated Hospital of Sun Yat-sen University. Validation sets 2 and 3 were prospective geographic external validations with data from two additional hospitals (Sun Yat-sen Memorial Hospital of Sun Yat-sen University, and The Second Affiliated Hospital of Harbin Medical University) from March 2020 to April 2021. All participants were recruited using the same inclusion and exclusion criteria.
The workflow diagram for the overall experimental design is shown in Figure 1 and Supplementary Figure 2. Participants were included if they suffered from infertility (defined as the inability of a sexually active couple to achieve pregnancy within a year or more of regular unprotected intercourse) and exhibited at least one of the following clinical symptoms/signs: menstrual irregularity, amenorrhea, galactorrhea, premature ejaculation, erectile dysfunction, or hypogonadism. Exclusion criteria were as follows: lactation, pregnancy, primary thyroid, adrenal and/or gonadal diseases, malignant tumors, pituitary macroadenoma, sellar/pituitary masses or cyst, congenital disease of the pituitary gland, pituitaries, and MR images without a complete pituitary scan or with too many MRI artifacts. The participants then underwent further examination: we measured their serum hormone levels (such as prolactin, adrenocorticotrophic hormone, follicle-stimulating hormone, luteinizing hormone, serum thyroid-stimulating hormone, and growth hormone) and performed a pituitary MR examination. Patients with functional and nonfunctional PM and patients with normal pituitary function were included for further deep learning analysis. The coronal dynamic enhancement T1-weighted imaging (T1WI) sequences of MRI (DICOM) from those participants were downloaded in a standard image format according to the software and instructions of the manufacturer. All pituitary images were read by two junior neuroradiologists (with <10 years of professional experience) and one senior neuroradiologist (with >10 years of professional experience), and only cases whose final diagnosis was agreed upon by all three neuroradiologists proceeded to further investigation.
In the training set, all images showing PM or a normal pituitary were selected by four general radiologists (>5 years of professional experience) and reviewed by two neuroradiologists (with >10 years of professional experience). In the validation set, all images of the coronal dynamic enhancement T1WI sequence were used without additional human intervention. MRI was performed with a 1.5 or 3.0 T MRI unit (GE, Philips company, Amsterdam, the Netherlands) in the headfirst supine position, with a repetition time/echo time of 380 ms/12.5 ms and 1 or 3 mm thick sections. Six medical fellows in the division of clinical endocrinology collected patient clinical information, and the dataset was reviewed and verified by two endocrinologists.

Model Structure (Overview of Our PM-CAD System)
The pipeline of our PM-CAD system is shown in Supplementary Figure 3. It consists of two parts: (1) pituitary region detection and (2) PM diagnosis. All programs were implemented in the Python (https://www.python.org/) language on the PyTorch (https://pytorch.org/) platform. For pituitary region detection, we developed a pituitary detection model based on Faster R-CNN (10) [with ResNet-50 FPN (11) as its backbone]. The input MR image is processed by this model to generate classification and regression maps, which are then used to extract the pituitary bounding box in the MR image. The bounding box is used to crop the pituitary region patch from the MR image (Supplementary Method A). For PM diagnosis, we propose a novel CNN (namely, PM-CAD) to diagnose PM from the cropped MR images. All cropped pituitary region images are resized to 256 × 256, normalized into (0,1), and processed with histogram matching normalization (HM) to enhance microadenoma features. In the PM-CAD system, we modified the ResNet architecture to preserve fine-grained features during forward propagation. An attention module is used to further improve the discriminative power of the feature representation. To limit overfitting, HM normalization, intensity shift data augmentation, and a label-smoothing loss are used (Supplementary Method B). Training was stopped after 500 epochs (iterations through the entire dataset) because neither the area under the receiver operating characteristic curve (AUC) nor the label-smoothing loss improved further (Supplementary Figure 4).
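As an illustration of one of the regularization techniques named above, the following is a minimal pure-Python sketch of a label-smoothing cross-entropy loss for the two-class case (normal vs. PM). The smoothing factor `epsilon = 0.1`, the function names, and the example logits are assumptions made for illustration; the formulation and hyperparameters actually used are those of Supplementary Method B and may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_smoothing_loss(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed one-hot target.

    The true class receives probability 1 - epsilon + epsilon / K and every
    other class receives epsilon / K, where K is the number of classes.
    epsilon = 0.1 is an illustrative value, not taken from the paper.
    """
    k = len(logits)
    probs = softmax(logits)
    loss = 0.0
    for cls, p in enumerate(probs):
        q = epsilon / k + (1.0 - epsilon if cls == target else 0.0)
        loss -= q * math.log(p)
    return loss


# Hypothetical logits for one cropped pituitary patch: [normal, PM].
logits = [0.2, 2.3]
hard = label_smoothing_loss(logits, target=1, epsilon=0.0)
smooth = label_smoothing_loss(logits, target=1, epsilon=0.1)
print(f"hard-label loss: {hard:.4f}, smoothed loss: {smooth:.4f}")
```

With `epsilon = 0` the function reduces to the ordinary cross-entropy. For a confident correct prediction the smoothed loss is larger than the hard-label loss, which is how label smoothing discourages over-confident outputs.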

Model Discrimination and Calibration
A total of 1,520 participants were included in the study. We partitioned the data into three non-overlapping sets, with 780 participants for model development, 195 participants for model testing (a development-to-testing ratio of 8:2), and 545 participants for model validation. To reduce time bias, the training set came from a retrospective study covering January 2012 to September 2019, and the validation set came from a prospective study covering October 2019 to April 2021. The detailed statistics for each set are summarized in Figure 1 and Supplementary Figure 2.
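The 8:2 partition described above can be sketched as a participant-level split. This is a minimal illustration only: the participant IDs, the random seed, and the use of a plain shuffle are assumptions, and the source does not state how participants were actually assigned to the development and testing sets.

```python
import random


def split_participants(participant_ids, dev_parts=8, test_parts=2, seed=0):
    """Shuffle participant IDs and split them dev:test (8:2 by default).

    Splitting by participant, not by image, keeps all slices from one
    person in the same set so the two sets do not overlap.
    """
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding of 0.8 * len(ids).
    n_dev = len(ids) * dev_parts // (dev_parts + test_parts)
    return ids[:n_dev], ids[n_dev:]


# Hypothetical IDs standing in for the 975 retrospective participants.
all_ids = [f"P{i:04d}" for i in range(975)]
dev_ids, test_ids = split_participants(all_ids)
print(len(dev_ids), len(test_ids))  # 780 195
```

Splitting on participants rather than on individual images matters here because each participant contributes many MR slices; an image-level split could place slices from the same person in both sets.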

Evaluation of the Diagnosis Performance of Our PM-CAD System and Statistical Analysis
In the testing set, we used accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1-score to evaluate our PM-CAD system. Validation set A was used to evaluate the generalization ability and stability of the system. Receiver operating characteristic (ROC) curves, which show the true-positive rate against the false-positive rate, and the AUC were used in the testing, internal validation, and external validation sets (12,13). We also used binary logistic regression to re-fit the prediction probabilities produced by PM-CAD, and calibration curves were used to test the fit of the model (14). Validation set B consists of 100 participants and was used to compare the performance of the PM-CAD system with that of general radiologists. A wide range of performance metrics was adopted: diagnostic accuracy, sensitivity, specificity, PPV, NPV, F1-score, weighted error, positive likelihood ratio (PLR), negative likelihood ratio (NLR), and AUC (12). For the weighted error, a penalty weight of 2 was assigned to false-negative cases and a penalty weight of 1 to false-positive cases (12). Six radiologists were recruited for this study. Radiologists 1 and 2 have <5 years of professional experience, Radiologists 3 and 4 have between 5 and 10 years, and Radiologists 5 and 6 have over 10 years. Each radiologist read the MR images of the 100 participants independently. A Bland-Altman plot was used to evaluate the interobserver consistency of the pituitary MRI findings independently reported by the six radiologists. The diagnostic accuracy of the radiologists was evaluated, and the experience of each radiologist in reading cranial and pituitary MR or CT images is shown in Supplementary Table 1.
In validation set C, we tested the diagnosis accuracy of our PM-CAD system on three cases misdiagnosed by radiologists. Descriptive statistics included mean (SD) for continuous variables and proportions for categorical variables. All the metrics were calculated using Python-3.9.5 (https://www.python.org/), and R-4.0.3 (15) was used to provide visual analyses.
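The metrics listed above can all be computed from the four cells of a confusion matrix. The following is a minimal sketch with hypothetical function and key names. Normalizing the weighted error by the total number of cases is an assumption, as the exact formula is given only by reference (12). The example counts are likewise not copied from the supplementary tables; they were chosen to be consistent with a 50/50 cohort at 92% sensitivity and 96% specificity.

```python
def diagnostic_metrics(tp, fp, fn, tn, fn_weight=2, fp_weight=1):
    """Return common diagnostic metrics from confusion-matrix counts.

    Guards against empty cells (division by zero) are omitted for brevity.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)          # true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        # F1 is the harmonic mean of PPV and sensitivity.
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        # Assumption: penalties are summed and divided by the cohort size.
        "weighted_error": (fn_weight * fn + fp_weight * fp) / n,
        "plr": sensitivity / (1 - specificity),
        "nlr": (1 - sensitivity) / specificity,
    }


# Illustrative counts for 50 PM patients and 50 controls.
metrics = diagnostic_metrics(tp=46, fp=2, fn=4, tn=48)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the example yields an accuracy of 94%, a PPV of 95.83%, an NPV of 92.31%, an F1-score of 93.88%, and a weighted error of 10%; because false negatives carry twice the penalty, a missed PM costs more than a false alarm.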

Browser-Based Software Application
Browser-based software was designed to assist diagnosis from pituitary MR images. Once pituitary MR images (DICOM files) are uploaded to the software, the PM diagnosis results are presented.

Study Participants
A total of 11,935 participants with infertility were initially recruited for this project. After applying the exclusion criteria, 1,520 participants (556 PM patients and 964 control subjects) were included in the study. Data from 975 participants (340 PM patients and 635 control subjects) formed the training set, comprising 780 participants (19,573 images) in the development set and 195 participants (4,927 images) in the testing set. The validation set comprised 545 participants (13,239 images). Validation set A consisted of 163 PM patients and 279 control subjects from three hospitals. Validation set B consisted of 100 participants (50 PM patients and 50 control subjects). In validation set C, we tested the diagnostic accuracy of our PM-CAD system on three misdiagnosed PM cases. The detailed statistics for each set are summarized in Figure 1 and Table 1.

Performance of PM-CAD System
The PM-CAD system consists of two parts: pituitary region detection and PM diagnosis. For pituitary region detection, we used average precision (AP) as the evaluation metric and achieved an AP of 0.9783 at an intersection-over-union (IoU) threshold of 0.5 (Supplementary Method A). To assess the accuracy of PM diagnosis, 975 participants were used for the development and testing sets (Supplementary Method B). On the testing set, our PM-CAD system achieved an AUC of 98.13% (Figure 2A), an F1-score of 92.09%, an accuracy of 94.36%, a sensitivity of 96.97%, a PPV of 87.67%, a specificity of 93.02%, and an NPV of 98.36%. The calibration curve of the testing set is shown in Figure 3A; for the testing set, the intercept is −6.098 and the probability weight W is 10.069. We employed PM-CAD for further investigation.
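For readers unfamiliar with the detection metric, the overlap criterion behind an AP at a threshold of 0.5 can be sketched as follows. The box coordinates are hypothetical, and the sketch covers only the overlap test for a single predicted box; a full AP computation additionally ranks detections by confidence and integrates the precision-recall curve, which is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and annotated pituitary bounding boxes.
predicted = (100, 120, 160, 170)
annotated = (105, 125, 165, 175)
overlap = iou(predicted, annotated)
is_correct_detection = overlap >= 0.5
print(f"IoU = {overlap:.3f}, counted as correct: {is_correct_detection}")
```

In this example the two boxes are offset by 5 pixels in each direction and still overlap by about 0.70, so the detection would count as correct at the 0.5 threshold.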

Performance of the PM-CAD System vs. Radiologists
An independent validation set B (100 participants: 50 PM patients and 50 controls from hospital 1) was used to compare the performance of the PM-CAD system with that of radiologists. For this comparison, six radiologists were recruited. The PM-CAD system achieved an F1-score of 93.88%, an accuracy of 94.00%, a sensitivity of 92.00%, a PPV of 95.83%, a specificity of 96.00%, and an NPV of 92.31% (Supplementary Table 3). In contrast, our best radiologist (#6) achieved an F1-score of 94.95%, an accuracy of 95.00%, and a sensitivity of 94.00% (Supplementary Table 3). The ROC curves are shown in Supplementary Figure 5A. The AUC of the PM-CAD system was 95.56%, which exceeded that of all six radiologists (best radiologist #6: 95.00%); at the same false-positive rate, the true-positive rate of the PM-CAD system was higher than that of the six radiologists (Supplementary Figure 5A). Weighted error scoring (10) was incorporated during modeling and evaluation. The PM-CAD system produced a weighted error of 10.00%, far below the average weighted error of 21.67% of the six radiologists (Supplementary Figure 5B). The differences in NLR and PLR (10) between our PM-CAD system and the radiologists are shown in Supplementary Figures 5C,D; by both measures, our model shows excellent diagnostic performance. The confusion matrices in Supplementary Table 4 report the numbers of true-positive, false-positive, true-negative, and false-negative results for the PM-CAD system and the radiologists. Thus, we showed that the diagnostic performance of our PM-CAD system is comparable to that of general radiologists with more than 10 years of professional experience.

Further Assessment for the Diagnosis Performance of the PM-CAD System
We sampled three double-positive PM cases (diagnosed by both the radiologists and the PM-CAD system) that underwent surgical treatment; these cases were confirmed by subsequent pathological examination (one case of Cushing's disease, one case of acromegaly, and one case of prolactinoma; Supplementary Figure 6A). Because a false-negative diagnosis delays the treatment of PM, we also tested the system on three clinically misdiagnosed PM cases that subsequently underwent surgical treatment; the PM-CAD system detected all three, a diagnostic accuracy of 100% (two cases of Cushing's disease and

Browser-Based Software Application
The browser-based software, hosted at http://www.pituitarymicroadenoma.com, was designed to assist PM diagnosis from pituitary MR images acquired at different hospitals. Even without graphics processing unit (GPU) acceleration, the application takes only 1-2 s to analyze all MR images from a patient. Once DICOM files of the coronal dynamic enhancement T1-weighted imaging (T1WI) sequence are uploaded to the software, the PM diagnosis results are presented. The software interface is shown in Supplementary Figure 7. In a prospective study, we tested the efficacy of our PM-CAD system in the division of endocrinology in our hospital. Our results indicate that the PM-CAD system is an excellent screening test for the presence of PM. Over a period of 1 month, our PM-CAD system detected 11 PM patients with a 97% accuracy rate (among 48 infertile patients and 25 patients with pituitary MR examination).

DISCUSSION
In this work, we developed a deep learning system (namely, PM-CAD) to diagnose PM from MRI. To our knowledge, this is the first attempt to focus on PM diagnosis using deep learning, although similar approaches have been proposed for pituitary adenoma (7-9, 16). Diagnosis of PM is challenging because of its tiny size and variable anatomical structure (1-3). We found that our PM-CAD system can accurately diagnose PM from MRI without additional information; the system achieves a diagnostic accuracy of 96.5%, which is comparable to that of radiologists with over 10 years of professional expertise. The authors of (16) created an automated segmentation method for the sellar region and several tools to extract invasiveness-related features of pituitary adenoma, and evaluated their clinical usefulness by predicting tumor consistency. In this study, we focused on diagnosing PM with the PM-CAD system using a large dataset. We show that our PM-CAD system outperforms the model developed by Qian et al. (7), because our system specifically extracts PM features from pituitary MR images and was trained with more data. In addition, our model was validated in three hospitals and showed excellent generalization ability.

Strengths and Limitations
Our work has the following strengths. First, we showed that the PM-CAD system is a rapid, reliable tool that diagnoses PM with high accuracy in both internal and external datasets. Second, PM diagnosis requires experienced radiologists, and an exhausting workload raises the misdiagnosis rate. Our PM-CAD system can serve as an assistive tool to reduce the workload of radiologists: it achieves diagnostic accuracy comparable to that of experienced radiologists and reaches a decision in 1-2 s. Third, medical resources are not evenly distributed; experienced radiologists mostly work in hospitals in economically developed areas, while economically underdeveloped areas lack experienced radiologists (4, 5). Our PM-CAD system is accessible online and can provide PM diagnosis to these areas, improving their diagnostic capabilities. Last, training a radiologist is costly and time-consuming; it usually takes more than 10 years to train a qualified radiologist (4,5). Our PM-CAD system is trained from annotated data and takes little time (about 30 s per patient) to improve its performance when more data are provided.
Several problems with our PM-CAD system remain to be solved. First, although the system achieves a diagnostic accuracy of 96.5%, this implies that 3.5% of cases may be misdiagnosed in practice. To further improve its diagnostic performance, more data should be collected and used to train our models. Second, it would be preferable for the PM-CAD system to update itself as new data become available; a continual learning approach could be introduced to keep the system learning. Third, MRI scan data are unique to patients and, because of privacy concerns, may not be distributed outside the hospitals. Therefore, our PM-CAD system cannot currently be fine-tuned for a specific hospital. In future work, we will use a federated learning framework to fine-tune our models in a privacy-preserving manner.

CONCLUSIONS
In summary, we have developed a deep learning-based system (namely, PM-CAD) to detect PM from MRI. Data from a total of 1,520 participants were used to train, validate, and test our system. Our PM-CAD system achieves a diagnostic accuracy comparable to that of radiologists with over 10 years of professional expertise and, in this study, showed excellent generalization ability. Results from this work highlight the potential applications of deep learning in the diagnosis of patients with PM. With the rapid development of computing power, deep learning algorithms can surpass the gold standard of diagnosis for the detection of PM. Machine learning for the diagnosis of PM will serve as an important component in improving patient care and outcomes.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.

CODE AVAILABILITY STATEMENT
The software and code of the proposed method have been separated into two files and are available as Supplementary Software files: https://github.com/MinglinChen94/PituitaryMicroadenomaDiagnosis.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Research Ethics Committee of the Institute of Basic Research in Clinical Medicine, The Third Affiliated Hospital of Sun Yat-sen University ([2020]02-089-01). This research is registered at the Chinese Clinical Trials Registry (http://www.chictr.org.cn/index.aspx) with the number ChiCTR2000032762. The patients/participants provided their written informed consent to participate in this study.

ACKNOWLEDGMENTS
We thank all the patients and investigators for their participation in this study. We thank Chunkui Shao and Jing Wang (Professor, Department of Pathology and Radiology, The Third Affiliated Hospital of Sun Yat-sen University) for their assistance in the research. We thank the six fellows and six radiologists involved in data collection and the human-computer competition.

Supplementary
Supplementary Figure 5 | (A) The ROC curve shows that the PM-CAD system outperforms the 6 radiologists. The AUC of the PM-CAD system is 95.6%, better than that of our best radiologist, #6 (AUC 95.0%). (B) Weighted error. A penalty weight of 2 is applied to false-negatives and a penalty weight of 1 to false-positives. The PM-CAD system produces a weighted error of 10%, whereas the radiologists produce a weighted error of 21.67%. (C,D) The negative likelihood ratio and the positive likelihood ratio. The negative likelihood ratio is defined as the false-negative rate over the true-negative rate, so that a decreasing likelihood ratio < 1 indicates an increasing probability of the absence of PM. The positive likelihood ratio is defined as the true-positive rate over the false-positive rate, so that an increasing likelihood ratio > 1 indicates an increasing probability of the diagnosis of PM. The confidence intervals show that the PM-CAD system demonstrates statistically better screening performance than the radiologists in terms of both negative likelihood ratio and positive likelihood ratio.
Table 5. PM, pituitary microadenoma; MRI, magnetic resonance imaging; AI, artificial intelligence; HE, hematoxylin and eosin; ACTH, adrenocorticotropic hormone; GH, growth hormone; TSH, thyroid stimulating hormone; PRL, prolactin. MR bar = 5 mm. Pathology bar = 100 µm. The yellow arrow indicates a pituitary microadenoma.
Supplementary Figure 7 | The browser-based software to aid the diagnosis of PM. Once pituitary MR images (DICOM) are uploaded, the software reports whether the patient has PM. This browser-based tool can be accessed at http://82.157.181.77/.
Supplementary Table 1 | The workload of radiologists with different professional experience in human-computer competition. All participating radiologists are general radiologists (no specialization). Workload analysis was performed on the participating radiologists for 1 year.
Supplementary Table 2 | Confusion Matrices for testing and validation of dataset A (internal and external datasets). Data are numbers of images. a, true-positive; b, false-positive; c, false-negative; d, true-negative.
Supplementary Table 3 | The diagnostic performance in the human-computer competition on temporal validation set B (n = 100). Unless otherwise specified, data are percentages, with numbers of images in parentheses and 95% confidence intervals in brackets. F1 score, the harmonic mean of PPV and sensitivity; NPV, negative predictive value; PPV, positive predictive value. Radiologists 1 & 2, <5 years of professional experience; Radiologists 3 & 4, 5-10 years of professional experience; Radiologists 5 & 6, >10 years of professional experience.