Three-Dimensional Multi-Task Deep Learning Model to Detect Glaucomatous Optic Neuropathy and Myopic Features From Optical Coherence Tomography Scans: A Retrospective Multi-Centre Study

Purpose: We aimed to develop a multi-task three-dimensional (3D) deep learning (DL) model to detect glaucomatous optic neuropathy (GON) and myopic features (MF) simultaneously from spectral-domain optical coherence tomography (SDOCT) volumetric scans.
Methods: Each volumetric scan was labelled as GON according to the criteria of retinal nerve fibre layer (RNFL) thinning, with a structural defect that correlated in position with the visual field defect (i.e., reference standard). MF were graded from the SDOCT en face images and defined as the presence of peripapillary atrophy (PPA), optic disc tilting, or fundus tessellation. The multi-task DL model was built on ResNet, with outputs of Yes/No GON and Yes/No MF. SDOCT scans were collected in a tertiary eye hospital (Hong Kong SAR, China) for training (80%), tuning (10%), and internal validation (10%). External testing was performed on five independent datasets from eye centres in Hong Kong, the United States, and Singapore. For GON detection, we compared the model with the average RNFL thickness measurement generated by the SDOCT device. To investigate whether MF can affect the model’s performance on GON detection, we conducted subgroup analyses in groups stratified by Yes/No MF. The area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and accuracy were reported.
Results: A total of 8,151 SDOCT volumetric scans from 3,609 eyes were collected. For detecting GON, in the internal validation, the proposed 3D model had a significantly higher AUROC (0.949 vs. 0.913, p < 0.001) than average RNFL thickness in discriminating GON from normal. In the external testing, the two approaches had comparable performance. In the subgroup analysis, the multi-task DL model performed significantly better in the “no MF” group (0.883 vs. 0.965, p < 0.001) in one external testing dataset, but showed no significant difference in the internal validation and the other external testing datasets. The multi-task DL model’s performance in detecting MF was also generalizable across all datasets, with AUROC values ranging from 0.855 to 0.896.
Conclusion: The proposed multi-task 3D DL model demonstrated high generalizability in all the datasets, and the presence of MF generally did not affect the accuracy of GON detection.


INTRODUCTION
Glaucoma is the leading cause of visual morbidity and blindness worldwide, and it is projected to affect 111.8 million people by 2040 (1,2). Prompt and accurate detection of glaucoma is extremely important in preventing and reducing irreversible visual loss. Spectral-domain optical coherence tomography (SDOCT), a non-contact and non-invasive imaging technology for cross-sectional and three-dimensional (3D) view of the retina and optic nerve head (ONH), is now commonly used to evaluate glaucomatous optic neuropathy (GON), the structural change of glaucoma (3)(4)(5). SDOCT is widely used to quantify retinal nerve fibre layer (RNFL), neuro-retinal rim, and other inner retinal layers (e.g., ganglion cell layer, inner plexiform layer). SDOCT is sensitive and specific for detecting glaucoma, especially when combined with other ophthalmoscopic modalities (3,4,6).
Nevertheless, myopic features (MF), including peripapillary atrophy (PPA), optic disc tilting, and fundus tessellation, can influence GON identification based on RNFL thickness measurement alone and should be considered when interpreting the optic disc and its circumpapillary regions for diagnosis (7). For example, PPA beta zone correlates with glaucoma, while gamma zone is related to axial globe elongation. A higher degree of vertical optic disc tilting is associated with a more temporally positioned RNFL thickness peak, and a higher degree of fundus tessellation is associated with thinner RNFL (8)(9)(10). Eyes with longer axial length are associated with significantly higher percentages of false-positive errors based on an SDOCT built-in normative database (11). Hence, evaluating GON using SDOCT based on RNFL thickness and built-in normative databases alone may not be reliable. As illustrated in Supplementary Figure 1A, MF can also result in RNFL thinning (i.e., thickness outside the normal RNFL range) in eyes without GON, similar to that seen in eyes with GON (Supplementary Figure 1B). Other diagrams and metrics, such as topographical ONH measurements, the RNFL thickness map, the RNFL deviation map, and the circumpapillary RNFL thickness profile with its "double-hump pattern," should also be evaluated to differentiate these two pathologies carefully. For example, in purely myopic eyes, the "double-hump pattern" still exists but is shifted temporally due to optic disc tilting, and the RNFL thickness map shows normal thickness except that the angle between the superior and inferior RNFL bundles is smaller. In GON eyes, by contrast, the RNFL "double-hump pattern" is altered and RNFL thinning appears at specific regions. Thus, interpretation of the results requires experienced glaucoma specialists or highly trained assessors who have good knowledge of both GON and OCT limitations.
Deep learning (DL), composed of multiple processing layers, allows computational models to learn representative features with multiple levels of abstraction. These models have shown promise in pattern recognition and image analysis (12). Automated image analysis based on DL has been developed to detect GON from different kinds of images, such as fundus photographs (FP) and OCT images (13)(14)(15)(16)(17)(18)(19)(20). We previously developed a 3D DL model to detect GON from SDOCT volumetric scans, which performed non-inferiorly to glaucoma specialists (21). However, all these DL algorithms only detected GON without learning features of Yes/No MF. Previous studies showed an increased risk of glaucoma among myopic eyes, and high myopia is also a risk factor for GON progression (22,23). In addition, it is not known whether MF affect a DL model's discriminative ability for GON detection. Additional information on Yes/No MF may help in further evaluating subjects with glaucoma. Multi-task learning is a training paradigm in which DL models are trained with data from multiple tasks simultaneously, using shared representations to learn the common features across a collection of related tasks (24). It has been applied to ultrasound, CT, and dermoscopic images (25)(26)(27), showing potential advantages in integrating information across domains and extracting more general features for different tasks. Our previous work also applied the multi-task technique to detecting GON and predicting visual field (VF) metrics (28), and to detecting different kinds of retinal diseases (29,30) from OCT images.
In this study, we aimed to train and validate a multi-task 3D DL model to analyse SDOCT volumetric scans and identify GON and MF simultaneously. We hypothesised that a model trained with the multi-task technique would extract common features and achieve high generalizability for both tasks, and that it would achieve better or comparable performance compared with conventional RNFL thickness.

MATERIALS AND METHODS
This study was a multi-centred retrospective study. It was approved by the Research Ethics Committee of the Hospital Authority, Hong Kong (HK), the SingHealth Centralised Institutional Review Board, Singapore, and Stanford University's Institutional Review Board/Ethics Committee, the United States (US). The study adhered to the Declaration of Helsinki. As the study involved only retrospective analysis using fully anonymized SDOCT images, informed consent was exempted.

Training, Tuning, and Internal Validation Dataset
The dataset for training, tuning, and internal validation was inherited from our previous study (21). It was collected from the existing database of electronic medical and research records at the Chinese University of Hong Kong (CUHK) Eye Centre and the Hong Kong Eye Hospital (HKEH). The inclusion criteria were (1) age of 18 years or older, (2) gradable SDOCT optic disc scans and en face images, (3) reliable VF tests, and (4) confirmed diagnosis of glaucoma or healthy subjects. The exclusion criteria were (1) other ocular or systemic diseases that may cause optic disc abnormalities or VF defect; (2) missing data on VF, SDOCT optic disc scans, or en face images. The non-glaucomatous subjects from the research centre were volunteers from existing cohorts in CUHK Eye Centre. The non-glaucomatous subjects from the eye clinics were subjects who sought opportunistic eye check-ups. All study subjects underwent SDOCT imaging by Cirrus HD-OCT (Carl Zeiss Meditec, Inc., Dublin, CA, United States) using the optic disc cube scanning protocol, which generated the RNFL thickness map (6 × 6 mm²) around the optic disc. The VF of each study subject was determined by static automated white-on-white threshold perimetry using the Humphrey Field Analyzer II (Carl Zeiss Meditec, Inc., Dublin, CA, United States). For each eligible eye, the 3D SDOCT optic disc scan and the 2D en face fundus image could be extracted simultaneously.

External Testing Datasets
We used five independent datasets from other eye centres to test the performance of the DL model: External testing 1, Prince of Wales Hospital (PWH), HK; External testing 2, Tuen Mun Eye Centre (TMEC), HK; External testing 3, Alice Ho Miu Ling Nethersole Hospitals (AHNH), HK; External testing 4, Byers Eye Institute, Stanford University (Stanford), United States; and External testing 5, Singapore Eye Research Institute (SERI), Singapore. The inclusion criteria, exclusion criteria, VF and OCT device, and ground truth labelling for the external testing datasets were the same as the development dataset.

Ground Truth Labelling
For the ground truth labelling, we first excluded ungradable images and then classified GON and MF in each gradable image, consistent with our previous studies (21,31). The criteria were as follows. Ungradable SDOCT images were defined as those with signal strength (SS) less than 5 or any artefacts influencing the measurement circle or > 25% of the peripheral area (31). GON was defined as RNFL thinning on gradable SDOCT images, with a structural defect that correlated in position with the VF defect which fulfilled the definition of glaucomatous VF defect (32). These eyes were labelled as "Yes GON." Eyes without GON were defined as those with normal VF and no obvious glaucomatous optic disc cupping or loss of neuro-retinal rim, and these eyes were labelled as "No GON." Myopic features included the presence of (1) PPA, chorioretinal thinning and disruption of the retinal pigment epithelium (RPE) (33), (2) optic disc tilting, a ratio between the shortest and longest diameters (tilt ratio) of less than 0.8 (34), and (3) fundus tessellation, increased visibility of the large choroidal vessels outside of the parapapillary area (8). Eyes with one or more of these features were labelled as "Yes MF," while eyes without any of these features were labelled as "No MF" (Supplementary Figure 2).
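The gradability and MF criteria above can be written as simple rules. The following is a minimal sketch only: the `Grading` record and its field names are hypothetical and are not part of the study's software, and the thresholds are taken directly from the definitions in the text.

```python
from dataclasses import dataclass

SS_MIN = 5            # signal strength below 5 makes a scan ungradable
TILT_RATIO_MAX = 0.8  # tilt ratio below 0.8 counts as optic disc tilting
PERIPHERAL_MAX = 0.25 # artefacts on more than 25% of the peripheral area


@dataclass
class Grading:
    """Hypothetical per-scan grading record (illustrative field names)."""
    signal_strength: int
    artefact_on_measurement_circle: bool
    artefact_peripheral_fraction: float
    has_ppa: bool
    shortest_diameter: float
    longest_diameter: float
    has_tessellation: bool


def is_gradable(g: Grading) -> bool:
    """Apply the ungradable-image exclusion criteria."""
    if g.signal_strength < SS_MIN:
        return False
    if g.artefact_on_measurement_circle:
        return False
    return g.artefact_peripheral_fraction <= PERIPHERAL_MAX


def has_myopic_features(g: Grading) -> bool:
    """Return True ("Yes MF") if one or more of the three features is present."""
    tilt_ratio = g.shortest_diameter / g.longest_diameter
    tilted = tilt_ratio < TILT_RATIO_MAX
    return g.has_ppa or tilted or g.has_tessellation
```

In the study these judgements were made by human graders; the sketch only makes the decision thresholds explicit.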
Following these definitions, SDOCT scans and line-scanning ophthalmoscope (LSO) en face images were subjected to a tiered grading system in which trained assessors, ophthalmologists, and glaucoma specialists assessed image quality, MF, and GON, respectively. An image was labelled when two graders independently arrived at the same categorisation. Cases on which the two graders disagreed in their independent evaluations were reviewed and categorised by senior graders.
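The agreement rule in this grading workflow can be sketched as follows. This is an illustration under assumptions: the function name `adjudicate` and the callable `senior_review` are hypothetical stand-ins for the human review step.

```python
from typing import Callable


def adjudicate(grade_a: str, grade_b: str,
               senior_review: Callable[[], str]) -> str:
    """Accept the label when two independent graders agree;
    otherwise defer to the senior grader's categorisation."""
    if grade_a == grade_b:
        return grade_a
    return senior_review()
```

Passing the senior review as a callable means it is only invoked for discordant cases, mirroring the tiered process described above.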

Data Pre-processing
We applied standardisation and normalisation for data preprocessing. Specifically, standardisation transformed the data to have zero mean and unit variance, and normalisation rescaled the data to the range of 0-1. To alleviate over-fitting, we used several data augmentation techniques during training, including random cropping and random flipping along the three axes, to enrich the training samples for the 3D SDOCT volumetric data. Consequently, the final input size of the network was 200 × 1000 × 200.
We implemented the DL model using the Keras package in Python on a workstation equipped with a 3.5 GHz Intel® Core™ i7-5930K CPU and Nvidia GeForce GTX Titan X GPUs. We set the learning rate to 0.0001 and optimised the weights of the network with the Adam stochastic optimisation algorithm.
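The standardisation, normalisation, and random flipping steps described in this section can be illustrated with a small standard-library sketch. It is not the study's implementation: the per-volume statistics, the order of the two rescaling steps, and the flip probability of 0.5 are assumptions, and random cropping is omitted for brevity.

```python
import random
from statistics import fmean, pstdev


def standardise(values):
    """Shift and scale intensities to zero mean and unit variance."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def rescale_01(values):
    """Min-max rescale intensities to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def random_flip(volume, rng):
    """Flip a nested-list 3D volume along each axis with probability 0.5
    (the probability is an assumed value)."""
    if rng.random() < 0.5:
        volume = volume[::-1]                                        # axis 0
    if rng.random() < 0.5:
        volume = [plane[::-1] for plane in volume]                   # axis 1
    if rng.random() < 0.5:
        volume = [[row[::-1] for row in plane] for plane in volume]  # axis 2
    return volume
```

Flipping only reorders voxels, so the intensity distribution of an augmented volume is unchanged.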

Development of the Multi-Task Deep Learning Model
The proposed network included three modules: (1) a shared feature extraction module, (2) a GON classification module, and (3) an MF detection module. The network was similar to that in our previous study (21), with ResNet-37 as the backbone. We used shortcut connections to perform identity mapping and mitigate the vanishing gradient problem during backpropagation. We removed the fully connected layer from the 3D ResNet-37, and the remaining network acted as the shared feature extraction module. In the GON classification module, a fully connected layer with softmax activation accepted the features from the first module and output the classification probabilities for "Yes GON" and "No GON." Likewise, the MF detection module consisted of a fully connected layer with softmax activation. Figure 1 displays the structure of the multi-task 3D DL model.
All gradable SDOCT volumetric scans collected from CUHK Eye Centre and HKEH were randomly divided for training (80%), tuning (10%), and internal validation (10%) at the patient level. In each set, the ratio of "Yes GON and Yes MF," "Yes GON and No MF," "No GON and Yes MF," and "No GON and No MF" was similar, and multiple images from the same subject were kept in the same set to prevent leakage and performance overestimation. We trained the multi-task DL model from scratch, and the tuning dataset was used to select and adjust the optimum model during training. During training, tuning, and internal validation, we inspected the training-validation curve to check for overfitting, which also provided a further reference for the generalizability of the models. Additionally, SDOCT volumetric scans from PWH, TMEC, AHNH, Stanford, and SERI were used for external testing.
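The two task-specific modules can be illustrated with a toy forward pass in which one shared feature vector feeds two independent fully connected softmax heads. This is a minimal sketch in plain Python rather than the Keras model used in the study; the weights, the feature size, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(features, weights, biases):
    """Fully connected layer: one output unit per row of `weights`."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def multi_task_forward(shared_features, gon_head, mf_head):
    """Feed the same shared features to both task-specific heads.
    Each head is a (weights, biases) pair with two output units (No, Yes)."""
    gon_no, gon_yes = softmax(dense(shared_features, *gon_head))
    mf_no, mf_yes = softmax(dense(shared_features, *mf_head))
    return {"Yes GON": gon_yes, "No GON": gon_no,
            "Yes MF": mf_yes, "No MF": mf_no}
```

Because both heads read the same features, gradients from both tasks would update the shared extractor during training, which is the mechanism by which multi-task learning encourages common representations.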
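A patient-level split that keeps the four label combinations in similar proportions could look like the sketch below. It is an illustration under assumptions: the dictionary keys `patient_id`, `gon`, and `mf` are hypothetical, and each patient is assigned to a stratum by the labels of their first scan, which simplifies the case of patients whose scans carry different labels.

```python
import random
from collections import defaultdict


def split_by_patient(scans, seed=0, fractions=(0.8, 0.1, 0.1)):
    """Split scans into train/tune/validation so that all scans of one
    patient land in the same set, stratified by (GON, MF) combination."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    seen = set()
    for scan in scans:
        pid = scan["patient_id"]
        if pid not in seen:  # stratum taken from the patient's first scan
            seen.add(pid)
            strata[(scan["gon"], scan["mf"])].append(pid)

    assignment = {}
    for pids in strata.values():
        rng.shuffle(pids)
        n_train = round(fractions[0] * len(pids))
        n_tune = round(fractions[1] * len(pids))
        for i, pid in enumerate(pids):
            if i < n_train:
                assignment[pid] = "train"
            elif i < n_train + n_tune:
                assignment[pid] = "tune"
            else:
                assignment[pid] = "validation"

    return {name: [s for s in scans if assignment[s["patient_id"]] == name]
            for name in ("train", "tune", "validation")}
```

Splitting on patient identifiers rather than on scans is what prevents images of the same subject from appearing in both training and evaluation sets.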
Finally, we generated heatmaps for selected eyes by class activation map (CAM) (35) to evaluate the model performance qualitatively.
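For readers unfamiliar with CAM, the core computation is a weighted sum of the final convolutional feature maps, using the fully connected weights of the class of interest. The sketch below assumes the standard CAM formulation with flattened feature maps; the function name and the min-max rescaling for display are illustrative choices, not details confirmed by the study.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps for one class, rescaled to 0-1.

    feature_maps: list of K flattened maps, each with the same length.
    class_weights: K fully connected weights of the target class.
    """
    n_positions = len(feature_maps[0])
    cam = [sum(w * fmap[i] for w, fmap in zip(class_weights, feature_maps))
           for i in range(n_positions)]
    lo, hi = min(cam), max(cam)
    if hi == lo:
        return [0.0 for _ in cam]
    return [(v - lo) / (hi - lo) for v in cam]
```

Positions with values near 1 contribute most to the class score and would appear as the warm-coloured regions of a heatmap.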

Development of Single-Task Models and a 2D Model for Performance Comparison
We trained and tested two 3D single-task DL models using the same split dataset as the proposed 3D multi-task model for GON and MF detection, respectively. We also used segmentation-free 2D OCT B-scans as the input to train and test a 2D multi-task DL model using the same split dataset as the proposed 3D multi-task model. Each OCT volumetric scan contained 200 B-scans and the mean predictions of these B-scans were used as the volume-level prediction. All the models were tested on the same testing sets for final performance comparison.
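The aggregation of B-scan predictions into a volume-level prediction for the 2D model can be sketched as a simple mean. The decision threshold of 0.5 is an assumption for illustration and is not stated in the text.

```python
from statistics import fmean


def volume_prediction(bscan_probs, threshold=0.5):
    """Average per-B-scan probabilities into one volume-level probability
    and apply an (assumed) decision threshold."""
    mean_prob = fmean(bscan_probs)
    return mean_prob, mean_prob >= threshold
```

For a volumetric scan with 200 B-scans, `bscan_probs` would hold the 200 per-slice probabilities for the positive class.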

Statistical Analysis
The statistical analyses were performed with RStudio Version 1.1.463 (2009-2018 RStudio, Inc.). One-way ANOVA and the chi-square test were used for numerical and categorical data, respectively, to analyse the demographic characteristics of all participants and the data variances of the different datasets. The area under the receiver operating characteristic curve (AUROC) with 95% confidence interval (CI), sensitivity, specificity, and accuracy were calculated to evaluate the discriminative performance (Yes/No GON and Yes/No MF) of the 3D multi-task DL model, the 3D single-task models, and the 2D multi-task DL model in all the datasets. For GON detection, we further compared the performance of the proposed multi-task 3D DL model with that of the average RNFL thickness measurement generated by the SDOCT device. The DeLong test was used to compare AUROC values.

RESULTS
Table 2 shows the performance of the 3D multi-task DL model for GON identification and its comparison with average RNFL thickness, a 3D single-task DL model, and a 2D multi-task model. In the internal validation dataset, the proposed model had a significantly higher AUROC than RNFL thickness (0.949 vs. 0.913, p < 0.001). In the five external testing datasets, the two methods (DL model vs. RNFL thickness) had comparable performance. Figures 2A,B show the ROC curves and AUROC values of the 3D multi-task DL model and RNFL thickness for identifying GON in internal validation and external testing. The proposed 3D multi-task model also achieved generally better performance than a 3D single-task model and a 2D multi-task model.
In [...] (Table 3). Table 4 [...], respectively. When compared with a 3D single-task model and a 2D multi-task model, the proposed 3D multi-task model showed better performance, with higher generalizability in external testing.
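The discriminative metrics reported in this study (AUROC, sensitivity, specificity, and accuracy) can be computed as in the following sketch. The analyses in the study were run in R; this Python version is only an illustration, the AUROC is computed through its pairwise-ranking interpretation, and the 0.5 threshold is an assumed operating point.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive case scores higher than a
    negative case, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }
```

The same functions apply to both tasks, with `labels` holding the Yes/No GON or Yes/No MF reference standard and `scores` holding either the model's positive-class probability or, for the RNFL comparison, a score derived from average RNFL thickness.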
The training-tuning curve (Supplementary Figure 4) showed that the multi-task 3D DL model converged at around the 30th epoch and remained stable without significant oscillation after the 50th epoch. This finding, combined with the discriminative performance in all the datasets, suggested that the multi-task 3D DL model had learned general knowledge to identify both GON and MF and was not overfitted.
On the heatmaps (Figure 3), the red-orange-coloured areas had the most discriminatory power for detecting MF. In the correctly detected eye with PPA, the optic disc and PPA areas were red-orange-coloured, demonstrating that the 3D multi-task DL model could discriminate MF around the ONH. In SDOCT scans correctly detected as "no MF," the heatmaps showed that only the optic disc area was red-orange-coloured.

DISCUSSION
To the best of our knowledge, the proposed multi-task 3D DL model is the first reported attempt to automatically detect GON and MF simultaneously from SDOCT volumetric data. The results showed generalizable performance for both tasks across the datasets, and the presence of MF did not significantly affect the DL model's ability to identify GON. Compared with a single-task model, a multi-task model learns shared features from multiple tasks simultaneously. These shared features can potentially increase data efficiency and yield faster learning for related or downstream tasks, which may alleviate DL's weakness of requiring large-scale data and computation power (24). The proposed 3D multi-task DL model showed higher generalizability in detecting GON than our previous single-task model (21). When tested on two newly collected unseen datasets from HK (External testing 3) and Singapore (External testing 5), it achieved AUROC values of 0.906 and 0.930, respectively. For a more precise comparison, we also trained single-task models with exactly the same data split as the multi-task model, and found that the multi-task model had generally better performance for GON detection. The sub-analysis also showed that, except in External testing 3, the presence or absence of MF did not influence the discriminative performance of the proposed multi-task DL model for GON detection in internal validation and external testing. This suggests that, during training, the multi-task model learned effective features to identify GON in eyes with or without MF, so that GON detection on unseen datasets was generally not influenced by the presence or absence of MF.
Hence, after applying the multi-task learning strategy and providing additional information on Yes/No MF during training, the proposed multi-task 3D DL model learned more general features and demonstrated robust discriminative ability in identifying GON when tested on different datasets.
Compared with the conventional method, i.e., RNFL thickness measurement, the DL model performed better in internal validation and comparably in the five external testing datasets for GON identification. In addition to its good performance, this automated DL model provides a straightforward classification result, i.e., Yes or No GON. This is an advantage over discrimination based on RNFL thickness and built-in normative databases, because RNFL thickness is affected by various factors, including axial length, myopia, age, optic disc size, and SS (36)(37)(38)(39)(40)(41), and experienced ophthalmologists are needed to interpret and classify GON from the series of outputs in the SDOCT report. Therefore, the proposed multi-task model can potentially be applied in primary care settings without glaucoma specialists or even ophthalmologists on site, where it may help primary care clinicians to identify GON from the binary output of the DL model and then refer patients to ophthalmologists.
In addition, the multi-task DL model offered an output of "Yes/No MF" for each image, with good and consistent performance in the internal and external datasets. Myopia is one of the risk factors for glaucoma (23,42,43,44), and the ONH deformations in myopic eyes may predispose toward glaucoma (45). Features observed in areas around the ONH on the fundus, such as the location of the principal RNFL bundles, optic disc tilting, and optic disc torsion, were also related to spherical error and glaucoma severity (7,46). Our multi-task 3D DL model can detect the presence or absence of these features and give clinicians more information. The heatmaps also indicated that, for MF discrimination, the 3D multi-task DL model paid more attention to the ONH and the surrounding areas. Besides, the multi-task model showed significantly higher generalizability than the single-task model for MF detection. This further supports the aforementioned advantage of the multi-task model: learning both GON and MF features during training could improve the model's generalizability for both tasks when tested on unseen datasets.
Our study has several strengths. First, we collected our datasets from different eye centres in different countries and regions, covering different ethnic backgrounds. The model performed consistently well in all the datasets. The training-tuning curves also illustrated that the proposed DL model was not overfitted. Thus, our multi-task 3D DL model could potentially be applied on other [...] (31), which will further strengthen SDOCT as a screening tool in settings without sufficient ophthalmologists on site. Fourth, we developed a 3D multi-task model which could make use of all the information in the volumetric scan and showed generally better performance than the 2D model trained with B-scans for both tasks. It provides a volume-level output, which would be more straightforward for physicians (non-ophthalmologists) and could also save the manpower or computation power needed to deal with a large number of B-scans. One of the limitations was that we used the en face photographs paired with SDOCT instead of colour FP to label MF, owing to the lack of paired OCT and FP in most patients. Besides, we only labelled Yes/No MF instead of Yes/No myopia, as the data on spherical error and axial length (AL) were lacking for most of the patients due to the retrospective nature of the study. However, information on anatomical features could be more useful when detecting GON in myopic eyes. In future, we may obtain a small dataset with paired OCT, FP, spherical error, or AL for data annotation and apply advanced DL techniques, such as generative adversarial networks (GAN) (47) or semi-supervised learning (28), to generate pseudo-labels for other data, which will further refine our model and enhance its feasibility for detecting GON in highly myopic eyes. We also intend to investigate whether increasing the number of related tasks will further enhance the DL model's discriminative ability and data efficiency for GON detection.
In conclusion, with the multi-task learning technique, the proposed 3D DL model demonstrated high generalizability in all the datasets in detecting GON and MF simultaneously. It could potentially enhance clinicians' capability to identify GON in eyes with MF and be applied in primary care settings without sufficient specialists on site.

DATA AVAILABILITY STATEMENT
The de-identified individual participant data, the study protocol, the statistical analysis plan, and the code are available upon reasonable request. Such requests are decided on a case-by-case basis. Proposals should be directed to CYC, carolcheung@cuhk.edu.hk.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Research Ethics Committee of the Hospital Authority, Hong Kong; the SingHealth Centralized Institutional Review Board, Singapore; Stanford University's Institutional Review Board/Ethics Committee, the United States. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
CYC and AR conceived and designed the study and wrote the initial draft. XW developed and validated the DL model, supervised by HC and P-AH, with clinical input from AR, CYC, PC, NC, MW, and CT. PC, MW, and CT provided data from CUHK Eye Centre and HKEH. NC, WY, and AY provided data from PWH and AHNH. H-WY provided the data from TMEC. SM and RC provided the data from Stanford. YT and C-YC provided the data from SERI. AR and XW performed the statistical analysis. XZ and FL provided clinical support during the model development. All authors subsequently critically edited the report and read and approved the final report.