Analysis of EPID Transmission Fluence Maps Using Machine Learning Models and CNN for Identifying Position Errors in the Treatment of GO Patients

Purpose To find a suitable method for analyzing electronic portal imaging device (EPID) transmission fluence maps for the identification of position errors in the in vivo dose monitoring of patients with Graves’ ophthalmopathy (GO). Methods Position errors combining 0-, 2-, and 4-mm errors in the left-right (LR), anterior-posterior (AP), and superior-inferior (SI) directions in the delivery of 40 GO patient radiotherapy plans to a human head phantom were simulated and EPID transmission fluence maps were acquired. Dose difference (DD) and structural similarity (SSIM) maps were calculated to quantify changes in the fluence maps. Three types of machine learning (ML) models that utilize radiomics features of the DD maps (ML 1 models), features of the SSIM maps (ML 2 models), and features of both DD and SSIM maps (ML 3 models) as inputs were used to perform three types of position error classification, namely a binary classification of the isocenter error (type 1), three binary classifications of LR, SI, and AP direction errors (type 2), and an eight-element classification of the combined LR, SI, and AP direction errors (type 3). Convolutional neural network (CNN) was also used to classify position errors using the DD and SSIM maps as input. Results The best-performing ML 1 model was XGBoost, which achieved accuracies of 0.889, 0.755, 0.778, 0.833, and 0.532 in the type 1, type 2-LR, type 2-AP, type 2-SI, and type 3 classification, respectively. The best ML 2 model was XGBoost, which achieved accuracies of 0.856, 0.731, 0.736, 0.949, and 0.491, respectively. The best ML 3 model was linear discriminant classifier (LDC), which achieved accuracies of 0.903, 0.792, 0.870, 0.931, and 0.671, respectively. The CNN achieved classification accuracies of 0.925, 0.833, 0.875, 0.949, and 0.689, respectively. Conclusion ML models and CNN using combined DD and SSIM maps can analyze EPID transmission fluence maps to identify position errors in the treatment of GO patients. Further studies with large sample sizes are needed to improve the accuracy of CNN.


INTRODUCTION
As a treatment method for Graves' ophthalmopathy (GO), an eye disease related to autoimmune thyroid disease (1), radiotherapy can be applied with satisfactory control while producing relatively slight post-radiotherapeutic complications (2). The organs-at-risk (OARs) around the target volumes of GO include the lens, optic nerves, and similar tissues. To enable the delivery of conformal and uniform doses to these target volumes while reducing the dose received by normal tissue, intensity-modulated radiotherapy (IMRT) and volumetric modulated arc therapy (VMAT) are often used for GO patients (1-3) because these approaches generate a steep dose gradient between the target volume and OARs (4,5). However, this implies that errors during treatment, such as position errors, have a significant impact on the treatment results (6). Cone beam computed tomography (CBCT), which is often used for correcting position errors before treatment, introduces additional radiation doses to patients (7); furthermore, intra-fractional movement is still present after pre-treatment CBCT scanning (8)(9)(10). As a method for monitoring treatment, in vivo dosimetry for obtaining information on the doses delivered to patients has significant potential.
Currently, amorphous silicon electronic portal imaging devices (a-Si EPIDs), which have high spatial resolutions, large twodimensional arrays, and approximately linear dose responses (11)(12)(13), are commonly used in clinical in vivo dosimetry (14)(15)(16)(17)(18). Although gamma pass rate threshold-based EPID in vivo dosimetry can be used to monitor treatment through single pass rate values (16)(17)(18)(19), research on EPID dosimetry by Hsieh Emmelyn S et al. (20) has revealed that, under 3%/3 mm and 95% pass rate threshold criteria, position errors greater than 2 cm can be detected, which is unsatisfactory. Meanwhile, gamma pass rate threshold-based dosimetry compresses 2D image information and therefore causes information loss (21). In addition, using gamma pass rate, errors can be detected; however, their direction cannot be detected. Thus, a tool for more comprehensive detection of errors and their direction is required.
Structural similarity (SSIM), which can measure the similarities between pairs of images based on the image luminance, contrast, and structural similarity, has been used for dose distribution error detection (22). Meanwhile, research showed that linac mechanical errors can be detected by combining dose difference (DD) maps with SSIM maps to analyze EPID fluence (23). However, DD maps and SSIM maps cannot be analyzed manually to detect errors.
The collection of methods for extracting information such as shape, grayscale, and texture contained in digital medical images into high-dimensional data is known as radiomics (24,25). Because radiomics information contains significant quantities of data that are difficult to process manually, radiomics is often combined with artificial intelligence for diagnosis, treatment selection, and prediction (26)(27)(28)(29). In several recent studies, radiomics has been combined with machine learning (ML) for conducting EPID fluence-based dose verification (21,23,30).
Convolutional neural networks (CNNs) can effectively perform image-related tasks by analyzing images at different scales using convolutional layers to extract useful information and generate final outputs (31). Accordingly, a number of researchers have proposed CNN-based patient-specific dose verification using dose maps as the CNN input (32)(33)(34)(35).
This study aims to find a suitable method for analyzing EPID transmission fluence maps in the identification of position errors during the in vivo dose monitoring of GO patients. To this end, we measured EPID transmission fluence maps with and without position errors obtained from the treatment of a head phantom and used the maps without position errors as a baseline for calculating dose difference and structural similarity maps reflecting position errors. The radiomics features of these DD and SSIM maps were then combined with ML models to classify position errors. CNN was also established to classify position errors using the DD and SSIM maps as inputs, and the ML and CNN classification results were compared.

Position Error Simulation and EPID Transmission Fluence Map Measurement
Forty VMAT plans of patients with GO who received radiotherapy in our department from November 2019 to October 2020 were selected, with a prescription dose of 20 Gy in 10 fractions. VMAT with two partial arcs rotating from 240 to 120°clockwise and from 120 to 240°counterclockwise was used to design the radiotherapy plans. To simulate clinical treatment, the plans were delivered to a head phantom (Chengdu Dosimetric Phantoms, Chengdu, China) on an EDGE linac (Varian Medical Systems, Palo Alto, CA) ( Figure 1A). The beam transmission fluence maps were measured by an as1200 EPID (Varian Medical Systems, Palo Alto, CA) with 1190 × 1190 pixels and pixel size of 0.336 mm × 0.336 mm in dosimetry mode. For each patient, one measurement without position errors was conducted first to obtain a baseline transmission fluence map. Then, 27 fluence maps with position errors were measured while 27 possibilities of position errors ( Table 1) combining 0-, 2-, and 4-mm left-right (LR), anterior-posterior (AP), and superior-inferior (SI) position errors were simulated by translating the head phantom along the LR, AP, and SI axis. The 0, 0, 0 error position was the same as the position where baseline maps were measured. Prior to measurement, the background and  pixel response of the EPID were calibrated through a dark field and a flood field and the EPID dose was calibrated.

Calculation of DD and SSIM Maps
The DD map DD(x,y) used in this study is defined as where V x and V y are the pixel values at a given spatial location on the baseline fluence map and fluence map with position errors, respectively. SSIM is an indicator used to measure the similarity of two images, with SSIM = 1 indicating that the two images are the same (36). We define SSIM as where a, b, and g are constants to control the relative weight of the three components, and l(x,y), c(x,y), and s(x,y) are the luminance, contrast, and structure maps, respectively, which are given by where m x and where m y are the local mean values of the pixels of images x and y, respectively, d x and d y are the local standard deviations of the pixel values of images x and y, respectively, and d xy is the covariance of images x and y. In this study, a square window with a side length of 11 pixels was used as the local window to calculate m x , m y , d x , d y , and d xy , the constants C 1 = (K 1 L) 2 , C 2 = (K 2 L) 2 , and C 3 = C 2 /2 were used to avoid zero denominators, the default values K 1 = 0.01 and K 2 = 0.03 were obtained from Wang (36), and L = 200 was selected to represent the fraction dose 200 cGy. As shown in the lower left corner of Figure 5, to quantify EPID transmission fluence changes caused by the position error, DD maps and SSIM maps, including luminance, contrast, and structure maps, were calculated. The baseline fluence maps were compared to fluence maps with 27 possibilities of position errors; thus, for each patient, 27 sets of DD, luminance, contrast, and structure maps were calculated. Some of the measured fluence maps and calculated maps are shown in Figure 2.

Position Error Classification
Three types of position error classifications were performed ( Figure 3): type 1 classified errors into two classes based on whether the isocenter error was over 3 mm, and the isocenter is defined as where LR errors, SI errors, and AP errors are the position errors in LR, SI and AP directions, respectively. Type 2, which contained three classification tasks, classified errors into two classes depending on whether the error in the AP, SI, or LR direction was over 3 mm; type 3 combined the three types of classification methods of type 2 and classified errors into eight classes with the goal of predicting the direction from which any error larger than 3 mm originated.

Machine Learning Method
The ML method used in the study is shown in Figure 4. The Python opensource library Pyradiomics (37) was used to extract radiomics features from the DD, luminance, contrast, and structure maps. Ninety-four types of features derived from six classes were extracted from the 512 × 512 center pixel matrix of each map, including first-order (19 features), gray level cooccurrence matrix (GLCM, 24 features), gray level dependence matrix (GLDM, 14 features), gray level run length matrix (GLRLM, 16 features), gray level size zone matrix (GLSZM, 16 features), and neighboring gray tone difference matrix (NGTDM, 5 features) features. Because of the constant shape of the fluence maps, the shape 2D (10 features) and shape 3D (17 features) features remained the same and were excluded. The classification models in the Python scikit-learn (38) and XGBoost (39) libraries, which included linear discriminant classifier (LDC), linear function kernel-supporting vector machine (linear-SVM), radial basis function kernel-supporting vector machine (RBF-SVM), Knearest neighbor (KNN), and extreme gradient boosting (XGBoost) models, were used to classify the position errors.
Three types of ML models-ML 1 models, which used the radiomics features of DD maps as inputs, ML 2 models, which used the features of SSIM maps as inputs, and ML 3 models, which used the features of DD and SSIM maps as inputs-were used to classify the position errors. Before using the radiomics features for ML, Pearson correlation analysis was performed between the radiomics features and classification labels to exclude features with low correlation and prevent the model from overfitting. To evaluate the performance of each classifier, the data for 32 out of the 40 patients were randomly selected as the training set and the data for the remaining eight patients were used as the testing set. In the training set, leave-one-out crossvalidation was combined with a grid search to tune the hyperparameters of the classifiers. The classification labels are listed in Table 1.
For the KNN classification, GridSearchCV was used to find a suitable number of neighbors from 1 to 10 with a step of 1. For SVM, GridSearchCV was used to find suitable kernels (kernel = linear, rbf), suitable C (C = 0.01, 0.1, 1, 10), and gamma (gamma = 0.001, 0.01, 0.1, 1) values, with the gamma being used only to find the rbf kernel. For LDC, GridSearchCV was used to find a suitable solver (solver = svd, lsqr, eigen). For XGBoost, 10-100 boosting rounds with a step of 10 and a maximum tree depth of 3-10 were searched for the type 1 and type 2 classification; for the type 3 classification, 100-250 boosting rounds with a step of 10 and a maximum tree depth of 3-10 were searched. The final hyperparameters are shown in Supplementary Material A.

CNN Method
The process flow of the CNN method is illustrated in Figure 4.
The input data of the CNN method comprised a 512 × 512 center pixel matrix of DD, luminance, contrast, and structure maps. As shown in Figure 5, the overall network included a feature extraction module and a feature integration module. In the feature extraction module, four types of maps were input into four independent feature extractors to extract features from different angles. Each extractor comprised 13 convolutional layers, four max pooling layers, and one average pooling layer. The convolutional layers, which did not change the dimension of the data, could extract features of a given dimension deeply, and the rectified linear unit (ReLu) function was used to improve the nonlinear ability of the network. The max pooling layer was used to extract the largest features within a 2 × 2 area and to double the number of channels. The average pooling layer calculated all the eigenvalues of each channel on average. The feature integration module contained two fully connected layers. After the extracted features were integrated, they were each connected to the same latitude within the integration module. The two fully connected layers extracted the relationships between features and output the probabilities of the respective predictions of n classes, with the class with the largest predicted value used as the final predicted class. To enable comparison of the ML model and CNN results, the ML training and testing sets were used for the CNN.

Comparison of ML Models and CNN
The classification accuracies of the ML models and CNN on the test sets were calculated. To evaluate the type 1 and type 2 error classification, the receiver operating characteristic (ROC) curves and the areas under the ROC curve (AUCs) were calculated; to evaluate the type 3 classification, precision, recall, f1-score, and confusion matrices were calculated for the CNN and the best ML 1, 2, and 3 models, respectively. The accuracy, precision, recall, and f1-score are defined as follows: where TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative, respectively.

Classification Results of Machine Learning Method
As shown in Table 2, the best ML 1 model was XGBoost, which had classification accuracies of 0.889, 0.755, 0.778, 0.833, and 0.532 on the type 1, 2-LR, 2-AP, 2-SI, and 3 test sets, respectively. The AUCs for type 1, 2-LR, 2-AP, and 2-SI were 0.945, 0.787, 0.834, and 0.903, respectively ( Figure 6). As shown in Table 3 and Figure 7, for type 3 errors, the recall value of XGBoost ranged from 0.062 to 0.859, and only the recall values of the first, fourth, and seventh classes were equal to or greater than 0.5.  Table 3 and Figure 7, for type 3 errors, the recall value of XGBoost ranged from 0.438 to 0.875, whereas the recall values of all classes were equal to or greater than 0.5, except the sixth class, which had a recall value of 0.438. However, confusions can arise between the third and fifth, fourth and sixth, and seventh and eighth classes.

Classification Results of CNN Method
As shown in Table 2, the CNN achieved classification accuracies of 0.925, 0.833, 0.875, 0.949, and 0.689 on the type 1, 2-LR, 2-AP,  2-SI, and 3 test sets, respectively. The AUCs of the type 1, 2-LR, 2-AP, and 2-SI errors were 0.943, 0.832, 0.926, and 0.992, respectively ( Figure 6). As shown in Table 3 and Figure 7, for type 3 errors, the recall value ranged from 0.375 to 0.906, with the recall value of all the classes equal to or greater than 0.5, with the exception of the eighth class, for which the recall value is 0.375. The second and fifth, fourth and seventh, and seventh and eighth classes are easy to confuse.

DISCUSSION
In this study, methods for analyzing EPID transmission fluence maps for the identification of position errors in the treatment of GO patients were developed. Although CBCT is generally used clinically to correct position errors prior to treatment, the dose delivered to a patient cannot be verified using CBCT alone because position or other errors can potentially occur during treatment. To overcome this problem, EPID-based in vivo dosimetry has been developed as a monitoring method to help therapists detect problems in a timely manner and enable the accurate treatment of patients. For patients with GO, the rigid anatomical structure around the target volumes ensures that anatomical change has a negligible impact ( Figure 1B), and the effect of random mechanical error on fluence maps is negligible relative to the effect of position errors. Therefore, this study focused on detecting positional errors in patients with GO. Most previous studies analyzed EPID fluence maps to detect MLC or MU errors (21,23,30). Cecile (32) used CNNs to analyze EPID fluence maps in classifying anatomical changes, position errors, and linac mechanical errors in the treatment of lung cancer patients; however, they used software to simulate fluence maps and applied a 1-cm position error classification boundary. In this study, we measured transmission fluence maps using a linac and classified position errors using a more precise standard; to the best of our knowledge, ours was the first study to classify patient positioning errors using measured EPID transmission fluence maps.
The method of calculating DD is often used clinically to compare dose maps; in this study, the radiomics features of DD maps were used as inputs for the ML 1 models. For type 1 and type 2 classification, the accuracy of XGBoost exceeded 0.75 and the AUCs all exceeded 0.78; however, for type 3 classification, the accuracy was only 0.532 which was unsatisfactory. Meanwhile, for XGBoost in ML 2 models, which used the features of SSIM maps as input, the accuracy and AUCs exceeded 0.73 in type 1 and 2 classification, whereas in type 3 classification, the accuracy was 0.491, which also needed improvement. Combining the DD map features with those of luminance, contrast, and structure maps to classify position errors improved both the accuracies and AUC values; for the type 1 and 2 classification, the accuracy of LDC exceeded 0.79 and the AUCs all exceeded 0.83. A considerable improvement in accuracy was also achieved for the type 3 classification, with the recall value of all but one class equal to or greater than 0.5. These results suggest the usefulness of combining SSIM maps with DD maps in the analysis and detection of errors in EPID transmission fluence maps. SSIM, which can measure the similarities between pairs of images, has  been previously investigated as a tool for analyzing EPID fluence maps. Peng Jiayuan et al. demonstrated that the three components of an SSIM map-the luminance, contrast, and structure maps-are capable of detecting the absolute dose error, gradient discrepancy, and dose structure error on two dose planes (22). The use of DD maps can quantify the pixelwise relative differences between baseline maps and maps with errors to compensate for the insensitivity of luminance maps to small luminance differences between images (23). The type 1 classification, which initially detects the presence of significant position errors and prompts the therapist and physicist to determine the error in time, was the simplest approach used in this study; the type 2 and type 3 classifications, in contrast, were designed to detect the direction of position errors. The type 3 classification is essentially a combination of the three classification types of the type 2 classification. For the ML 1 and ML 3 models XGBoost and LDC, respectively, simply combining the type 2 classification approaches to conduct type 3 classification produced accuracies of only 0.489 and 0.641, which were both lower than the accuracies of 0.532 and 0.671 achieved by the corresponding type 3 approaches. Therefore, it was necessary to train models specifically for the type 3 classification. All the radiotherapy plans in this study used coplanar VMAT with partial arcs rotating from 240 to 120°c lockwise and from 120 to 240°counterclockwise ( Figures 1C-E), and position errors in the SI direction were perpendicular to the gantry rotation plane. Therefore, the final EPID transmission fluence map integrated the fluence maps from each angle, which contained information on SI direction errors; by contrast, the LR and AP errors all lay on the linac rotational plane and, as a result, the final fluence maps contained less information on LR and AP direction position errors. In addition, the final fluence maps did not contain information covering the gantry angle ranges from 181 to 240°and from 120 to 180°, which could have been used to help classify errors in the LR direction; as a result, in the type 2 classification, the accuracy along the SI direction was the highest whereas that along the LR direction was the lowest. The results of the EPID-based 3D in vivo dosimetry study of Li Yinghui also revealed that the SI position error had the most significant impact on the g pass rate (40). For the type 3 classification, several classesincluding the third and fifth classes for the ML 3 models and the seventh and eighth classes for the CNN-were easy to confuse. This confusion arose from the fact that both the fifth and eighth classes introduced LR errors, which had a considerably lesser influence on the fluence maps than that of the AP and SI errors introduced by the third and seventh classes. Linear discriminant analysis was used to project the ML 3 model input data onto a two-dimensional scatter plot for visualization (Figure 8), from which it was observed that the data for classes that were easily confused (such as the third and fifth classes) overlapped ( Figure 8A) and were more difficult to classify relative to the data for easier-to-distinguish classes ( Figure 8B), such as the second and seventh classes.
The EPID dosimetry study conducted by Ma Chaoqiong (23) used LDC, SVM, and random forest (RF) methods to train models, whereas Nyflot Matthew J (30) used SVM, KNN, and decision tree (DT) approaches. In this study, LDC, SVM, and KNN were used and XGBoost was used instead of RF or DT in training the ML models, and the four approaches were found to have equivalent classification power. Although the performance of KNN in the type 1 and type 2 classifications was found to be acceptable, the accuracies of KNN in the ML 1, 2, and 3 models for the type 3 classification were sub-optimal at 0.458, 0.324, and 0.477, respectively. These low accuracies may be attributed to the unbalanced distribution of sample sizes among the eight categories, as KNN is prone to classification errors in situations involving imbalances in sample size (41). The ML 1 and 2 model achieved relatively good results using XGBoost; for the ML 3 model, by contrast, XGBoost was not as effective as LDC because the dimensionality of the ML 3 model input data was higher than that of the ML 1 and 2 model data and XGBoost is prone to overfitting when processing high-dimensional feature data (42). The CNN achieved a slight improvement in classification accuracy relative to that of the ML 3 model. The two models also produced similar AUC values. In this study, for each patient, 28 fluence maps were measured, and there were 1120 fluence maps in total. After classification, there were less maps corresponding to each label, the size of the data set (40 patients) used as CNN input may have been insufficient for the network, as CNNs generally require large-scale datasets for training (43,44); as a result, the advantage of using the CNNs was not obvious. In this study, a method to identify position errors was developed by analyzing the EPID fluence maps, which can be combined with CBCT in clinical treatment. At the first few fractions, fluence maps should be acquired after using CBCT to correct the position errors. The maps of the first fraction should be used as baseline, and a comparison should be conducted between the baseline maps and other fluence maps. If the results show no obvious errors, the frequency of using CBCT should be reduced, and the method of using fluence maps to identify position errors should be used to monitor the next fractions. However, small position errors such as 1 mm errors can be more common in clinical, therefore the feasibility of this monitoring process needs to be further studied. All the EPID transmission fluence maps in this study were measured on the same day for each patient plan and were based on the Chengdu dosimetric phantoms using as1200 EPID; however, deviation of absolute dose, flatness, and symmetry should be considered when monitoring the patients clinically. Meanwhile, the impact brought by different resolution of different EPID detectors and different anatomical structures of different patients on the method we proposed in this study requires further exploration. In addition, the fluence map analysis was limited to the identification of translational position errors in the treatment of GO patients. Study of Cecile (32) has shown different types of errors such as rotational position errors, anatomical changes, and linac mechanical errors that can be detected by EPID dosimetry for lung cancer patients. Further research involving the use of EPID transmission fluence maps to monitor more types of errors and diseases should be conducted.

CONCLUSION
DD and SSIM maps can be combined to analyze EPID transmission fluence maps. ML models as well as CNN trained on small-sized samples can be used to identify position errors in the treatment of GO patients. Further studies with large sample sizes are required to improve the accuracy of CNN. The feasibility of using this method in clinical treatment should be further investigated.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the ethics committee of the West China Hospital (ChiCTR2100043576). Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
XZ, GL, and GD contributed conception and design of the study. SB and GL contributed to administrative support and provision of study materials or patients. GD, XZ, ZL, GW, YL, QX, LD, JL, and XS collected the data. WL, GD, XZ, and GL analyzed the data. All authors contributed to the article and approved the submitted version.