Lung Cancer Segmentation With Transfer Learning: Usefulness of a Pretrained Model Constructed From an Artificial Dataset Generated Using a Generative Adversarial Network

Purpose: The purpose of this study was to develop and evaluate lung cancer segmentation with a pretrained model and transfer learning. The pretrained model was constructed from an artificial dataset generated using a generative adversarial network (GAN). Materials and Methods: Three public datasets containing images of lung nodules/lung cancers were used: LUNA16 dataset, Decathlon lung dataset, and NSCLC radiogenomics. The LUNA16 dataset was used to generate an artificial dataset for lung cancer segmentation with the help of the GAN and 3D graph cut. Pretrained models were then constructed from the artificial dataset. Subsequently, the main segmentation model was constructed from the pretrained models and the Decathlon lung dataset. Finally, the NSCLC radiogenomics dataset was used to evaluate the main segmentation model. The Dice similarity coefficient (DSC) was used as a metric to evaluate the segmentation performance. Results: The mean DSC for the NSCLC radiogenomics dataset improved overall when using the pretrained models. At maximum, the mean DSC was 0.09 higher with the pretrained model than that without it. Conclusion: The proposed method comprising an artificial dataset and a pretrained model can improve lung cancer segmentation as confirmed in terms of the DSC metric. Moreover, the construction of the artificial dataset for the segmentation using the GAN and 3D graph cut was found to be feasible.


INTRODUCTION
Segmentation of lung cancer is an important research topic, and various studies have been conducted so far. Segmentation results are used to determine the effectiveness of anticancer drugs (Mozley et al., 2012; Hayes et al., 2016) and to perform texture analyses on medical images (Bashir et al., 2017; Yang et al., 2020). For these uses, segmentation accuracy is an important factor. Segmentation is typically done manually by radiologists; however, manual segmentation can sometimes yield inaccurate results because of interobserver variability. Semiautomatic segmentation has lower interobserver variability than manual segmentation (Pfaehler et al., 2020). To overcome this interobserver variability, automatic segmentation of lung cancer is desirable.
Recent years have witnessed significant development in the application of deep learning to various domains, including segmentation. For example, deep learning has been applied to the automatic segmentation of organs, such as the lungs, liver, pancreas, uterus, and bones, and to the automatic segmentation of tumors in these organs, with good segmentation performance (Roth et al., 2015; Chlebus et al., 2018; Isensee et al., 2018; Chen et al., 2019; Gordienko et al., 2019; Kurata et al., 2019; Noguchi et al., 2020; Hodneland et al., 2021).
One of the problems in applying deep learning is dataset size. Deep learning does not perform well when the dataset is small. In general, datasets of medical images are more difficult to enlarge than datasets in other domains because of the high cost of acquiring medical images and the need to protect personal information. To address this problem, transfer learning with pretrained models (Shin et al., 2016; Tschandl et al., 2019), data augmentation (Zhang et al., 2017; Yun et al., 2019), and artificial generation of datasets using generative adversarial networks (GANs) have been developed.
The GAN was first proposed by Goodfellow et al. (2014). The recent improvements made to the GAN have made it possible to generate high-quality, high-resolution images. Various attempts have been made to apply the GAN to medical image processing. Several studies have shown that it is possible to generate CT images of lung nodules using the GAN (Jin et al., 2018; Onishi et al., 2019; Yang et al., 2019; Yi et al., 2019; Armanious et al., 2020).
To overcome the small dataset problem for segmentation, we proposed to use deep learning models pretrained with an artificially generated dataset using the GAN. We hypothesized that transfer learning with the proposed pretrained models could improve the automatic segmentation accuracy when using the lung cancer dataset. In general, a segmentation model obtained through supervised learning requires an image and its label as the dataset. In our study, to generate a dataset for segmentation, we used the GAN for image generation and the 3D graph cut method for generating labels of the generated images. No manual task for labeling was required to generate the dataset for pretraining.

MATERIALS AND METHODS
Our study used anonymized data extracted from public databases. Therefore, institutional review board approval was waived in accordance with the regulations of our country. Figure 1 shows the outline of the proposed method for the segmentation model.

Dataset
Three public datasets containing computed tomography (CT) images were used: LUng Nodule Analysis 2016 (LUNA16) dataset, Decathlon lung dataset, and NSCLC radiogenomics. Table 1 shows a summary of the three datasets.
The LUNA16 dataset includes 888 sets of 3D CT images (Grand-Challenges, 2016; Setio et al., 2017) constructed for lung nodule detection. Therefore, the original LUNA16 dataset is unsuitable for segmentation. A previous study used the LUNA16 dataset to generate images of lung nodules using the GAN (Nishio et al., 2020a). We used the same dataset and a GAN model to generate the dataset for segmentation. For image preprocessing, the 3D CT images in the LUNA16 dataset were resampled to an isotropic voxel size of 1 mm × 1 mm × 1 mm. Large true nodules are problematic when generating lung cancer-like nodules and their labels, because labels of true nodules are not available in the LUNA16 dataset. Therefore, sets of 3D CT images with small lung nodules (the size of each nodule was <6 mm) were selected. As a result, 165 sets of 3D CT images in the LUNA16 dataset were used to generate an artificial dataset for segmentation.
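The effect of resampling to 1 mm isotropic voxels on the image grid can be illustrated with a minimal sketch. It only computes the target grid size; the interpolation itself would be done by an imaging library, which the text does not name. The function name `isotropic_shape` and the example scan geometry are hypothetical.

```python
def isotropic_shape(shape, spacing_mm, target_mm=1.0):
    """Grid size (voxels per axis) after resampling to target_mm spacing.

    shape      -- voxels per axis of the original scan, e.g. (z, y, x)
    spacing_mm -- physical voxel size per axis of the original scan
    """
    # The physical extent (n * s) is preserved; only the sampling step changes.
    return tuple(max(1, round(n * s / target_mm)) for n, s in zip(shape, spacing_mm))


# Hypothetical scan: 120 slices of 512 x 512 pixels,
# 2.5 mm slice spacing and 0.7 mm in-plane pixel size.
print(isotropic_shape((120, 512, 512), (2.5, 0.7, 0.7)))
```

With 1 mm isotropic voxels, one voxel corresponds to 1 mm³, so sizes given in millimetres map directly onto voxel counts.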
The Decathlon challenge (http://medicaldecathlon.com/) was held to provide a fully open source and comprehensive benchmark for general purpose algorithmic validation and testing, covering several segmentation tasks. Decathlon includes several segmentation datasets, from which the Decathlon lung dataset (Task06) was used as the training and validation sets for our study. The Decathlon lung dataset includes 63 sets of 3D CT images and their segmentation labels. To simulate small datasets, 10 and 30 sets of 3D CT images were selected from the Decathlon lung dataset; the image files of the Decathlon lung dataset (NIfTI files) were sorted by file name, and the first 10 or 30 files were selected. As a result, three types of training datasets were prepared from the Decathlon lung dataset: 63 sets from the original Decathlon lung dataset (Decathlon full), 30 sets (Decathlon mid), and 10 sets (Decathlon small). No image preprocessing was performed on the Decathlon lung dataset.
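The subset selection described above (sort by file name, keep the first 10 or 30 files) can be sketched as follows. This is a minimal illustration, not the authors' script; the function name `select_subset` and the example file names are hypothetical.

```python
def select_subset(file_names, n_cases=None):
    """Sort file names and keep the first n_cases (all files if n_cases is None)."""
    ordered = sorted(file_names)
    return ordered if n_cases is None else ordered[:n_cases]


# Made-up file names for illustration; the real NIfTI file names may differ.
example_files = [f"lung_{i:03d}.nii.gz" for i in (64, 1, 10, 3, 22)]

# Keep the first three files after sorting (the study used 10 and 30).
print(select_subset(example_files, 3))
```

Because the selection depends only on the sorted file names, it is deterministic and can be reproduced without a random seed.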
The NSCLC radiogenomics dataset (https://wiki.cancerimagingarchive.net/display/Public/NSCLC-Radiomics) contains images from 211 patients with non-small-cell lung cancer (Cancer Imaging Archive, 2021; Bakr et al., 2018; Gevaert et al., 2012; Clark et al., 2013). The dataset comprises CT, positron emission tomography/CT images, and segmentation maps of tumors in the CT scans. From the 211 patients, 3D CT images of 144 patients and their segmentation labels were selected for the current study. Segmentation labels are not available for the other 67 patients. The NSCLC radiogenomics dataset was used as the test set. For image preprocessing, the voxel size of the 3D CT images in the NSCLC radiogenomics dataset was changed (1 mm × 1 mm × 1 mm isotropic). The median volume of the lung cancer was 8,219 mm³ (interquartile range: 3,461.5-25,357 mm³) in the NSCLC radiogenomics dataset.
Frontiers in Artificial Intelligence | www.frontiersin.org July 2021 | Volume 4 | Article 694815

Dataset Generation
The LUNA16 dataset was used to generate an artificial dataset for segmentation. First, lung segmentation was performed for the chest CT images of the LUNA16 dataset, covering the lungs entirely. A pretrained deep learning model (a variant of U-net (Ronneberger et al., 2015)) was used for the lung segmentation (https://github.com/JoHof/lungmask (Hofmanninger et al., 2020)). Subsequently, 3D images of the nodule were generated using the GAN model, which is based on a variant of 3D pix2pix (Nishio et al., 2020a). While the GAN model can generate lung nodules at any location in the lungs, we used the locations of true nodules for nodule generation. In addition, we generated only one nodule per CT scan. To determine the location of the generated lung nodule, one true nodule location was selected from the annotation of the LUNA16 dataset for each CT scan. Therefore, the locations of the generated lung nodules were fixed (no randomness). The true nodule was replaced with the nodule generated using the GAN model. For the nodule generation, 3D CT images were cropped with a volume of interest of 40 × 40 × 40 voxels at the location of the true nodules, and the cropped images were fed to the GAN model. While the size of the generated lung nodules can be adjusted with the GAN model, the GAN model was set to generate the largest nodule that it can produce (the generation target size was 3 cm or larger). After nodule generation, the segmentation label was automatically generated using the 3D graph cut and Gaussian mixture (https://github.com/mjirik/imcut) (Jirík et al., 2013). Because the intensity of the seed points on the CT images was used to train the Gaussian mixture model, the center area of the generated images (40 × 40 × 40 voxels) was specified as seed points of the nodule, and the marginal area of the generated images was specified as seed points of the non-nodule (background). The output of the 3D graph cut was used as the segmentation label of the generated nodule.
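The cropping of the 40 × 40 × 40 volume of interest and the placement of seed points can be sketched as below. This is only an illustration of the idea: the widths of the center and marginal seed regions (`core`, `margin`) are not stated in the text and are assumed here, the seed codes 1 and 2 are arbitrary, and neither function calls the imcut library.

```python
def crop_bounds(center, size=40, volume_shape=None):
    """(start, stop) index pairs of a cubic VOI centred on `center`.

    If volume_shape is given, the window is shifted to stay inside the
    volume (assumes every axis of the volume has at least `size` voxels).
    """
    half = size // 2
    bounds = []
    for axis, c in enumerate(center):
        start = c - half
        if volume_shape is not None:
            start = min(max(start, 0), volume_shape[axis] - size)
        bounds.append((start, start + size))
    return bounds


def seed_label(index, size=40, core=8, margin=4):
    """Seed code for one voxel of the VOI (core and margin widths are assumed).

    1 = nodule seed (central cube), 2 = background seed (outer shell),
    0 = no seed, left for the graph cut to decide.
    """
    lo = (size - core) // 2
    hi = lo + core
    if all(lo <= i < hi for i in index):
        return 1
    if any(i < margin or i >= size - margin for i in index):
        return 2
    return 0


# Hypothetical nodule location (z, y, x) in a 300 x 512 x 512 volume.
print(crop_bounds((100, 200, 30), volume_shape=(300, 512, 512)))
```

Seeding the center as nodule and the margin as background relies on the generated nodule being placed in the middle of the VOI, which follows from cropping around the true nodule location.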
Next, the generated CT images of the nodule were merged with the original CT images. When merging the CT images of the generated nodules, only the areas that were assigned as lung labels in the lung segmentation were targeted for the merging. The areas of the generated CT images that were assigned as nonlung labels were not merged. Figure 2 shows the representative images of the generated nodules and their labels. In total, 165 lung nodules were generated for the 165 sets of 3D CT images in the LUNA16 dataset.
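The lung-restricted merging described above amounts to a voxel-wise selection. The sketch below shows it on flattened voxel lists in plain Python; the function name `merge_nodule` and the intensity values are hypothetical, and an array library would express the same rule in one vectorized call.

```python
def merge_nodule(original, generated, lung_mask):
    """Voxel-wise merge of a generated patch into the original CT patch.

    original, generated -- CT intensities of the same voxels (flattened)
    lung_mask           -- True where the lung segmentation labels the voxel as lung
    The generated value is used only inside the lung mask.
    """
    return [g if inside else o for o, g, inside in zip(original, generated, lung_mask)]


# Made-up intensities: the last two voxels lie outside the lung (e.g. chest wall).
original = [-800, -750, 40, 60]
generated = [-20, 10, 30, 35]
lung_mask = [True, True, False, False]
print(merge_nodule(original, generated, lung_mask))
```

Restricting the merge to lung voxels keeps the chest wall and mediastinum of the original scan unchanged even when the generated patch extends beyond the lung boundary.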

Segmentation Model
Open-source software (nnUnet) (Isensee et al., 2018) was used for the deep learning model of lung cancer segmentation, which is available at https://github.com/MIC-DKFZ/nnUNet. nnUnet is a variant of U-net (Ronneberger et al., 2015). Originally, nnUnet was used for the Decathlon datasets (Isensee et al., 2018). Because the original nnUnet has no transfer learning functionality, we modified the source code of nnUnet. With the modification, nnUnet could use a pretrained model and perform transfer learning. In addition, the number of epochs for training nnUnet could be changed. Except for these two points, no modifications were made to nnUnet. Dataset splitting (training and validation sets) was performed with the default setting of nnUnet.
First, the generated dataset for segmentation obtained from the LUNA16 dataset was used to construct the pretrained model. Two pretrained models were built: one obtained from 300 epochs of training (PM 300) and the other obtained from 500 epochs of training (PM 500). Next, transfer learning using the two pretrained models was performed for the three Decathlon lung datasets (Decathlon full, Decathlon mid, and Decathlon small) using the modified nnUnet. At this stage, no new layer was added to the model. Although several studies used layer freezing in transfer learning (Nishio et al., 2020b), no layers of the pretrained model were frozen in this study. To evaluate the effect of transfer learning, models were constructed without transfer learning (original nnUnet). Here, "original nnUnet" means that the source code of nnUnet was not changed, except for changing the number of epochs. The original nnUnet and its default setting were used for the model construction without transfer learning. In the model training, the epochs were set to 100, 300, and 500. The training of each model was started from epoch 1.
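The experimental design described above can be enumerated as a grid. The sketch below assumes that every combination of training set, initialization, and epoch count was trained, which the text implies but does not state explicitly; the dictionary keys and the function name `experiment_grid` are hypothetical.

```python
from itertools import product

# Training sets and their number of CT cases, as given in the Dataset section.
TRAINING_SETS = {"Decathlon full": 63, "Decathlon mid": 30, "Decathlon small": 10}
# "none" stands for the original nnUnet trained without a pretrained model.
INITIALIZATIONS = ("none", "PM 300", "PM 500")
EPOCHS = (100, 300, 500)


def experiment_grid():
    """List every (training set, initialization, epochs) configuration."""
    return [
        {"training_set": name, "n_cases": n_cases, "pretrained": init, "epochs": epochs}
        for (name, n_cases), init, epochs in product(
            TRAINING_SETS.items(), INITIALIZATIONS, EPOCHS
        )
    ]


grid = experiment_grid()
print(len(grid))  # number of segmentation models under this assumption
```

Under this assumption, each model trained with a pretrained model has a counterpart trained without it on the same training set and for the same number of epochs, which is what makes the paired comparison of DSC possible.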

Evaluation of Segmentation Models
As the test dataset, 144 sets of 3D CT images from the NSCLC radiogenomics dataset were used. For each set, the Dice similarity coefficient (DSC) was used to evaluate the segmentation models. In addition, the Jaccard index (JI), sensitivity (SE), and specificity (SP) were calculated as evaluation metrics. The four metrics are expressed as follows:
$$\mathrm{DSC} = \frac{2\,|P \cap L|}{|P| + |L|}$$
$$\mathrm{JI} = \frac{|P \cap L|}{|P| + |L| - |P \cap L|}$$
$$\mathrm{SE} = \frac{|P \cap L|}{|L|}$$
$$\mathrm{SP} = \frac{|I| - |P| - |L| + |P \cap L|}{|I| - |L|}$$
where |P|, |L|, and |I| denote the number of voxels for the segmentation results, label of the lung cancer segmentation, and 3D CT images, respectively. |P ∩ L| represents the number of voxels where nnUnet can accurately segment the lung cancer (true positive). Before calculating the four metrics, a threshold of 0.5 was used to obtain the segmentation mask from the output of nnUnet.
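The four metrics can be computed directly from sets of voxel coordinates, mirroring the |P|, |L|, |I|, and |P ∩ L| notation. The sketch below uses the standard definitions of DSC, JI, SE, and SP, which are assumed here to be the ones used in the study; the function names are hypothetical, and whether the 0.5 threshold is inclusive is also an assumption.

```python
def binarize(probabilities, threshold=0.5):
    """Turn {voxel: probability} into the set of predicted voxels.

    A strict '>' comparison is an assumption; the text only gives the value 0.5.
    """
    return {voxel for voxel, p in probabilities.items() if p > threshold}


def segmentation_metrics(pred, label, n_voxels):
    """DSC, JI, SE, and SP from voxel sets.

    pred     -- set P of voxels segmented by the model
    label    -- set L of voxels in the ground-truth label
    n_voxels -- |I|, total number of voxels in the 3D CT image
    Assumes pred and label are non-empty and label does not fill the image.
    """
    tp = len(pred & label)                        # |P ∩ L|
    fp = len(pred) - tp
    fn = len(label) - tp
    tn = n_voxels - tp - fp - fn                  # |I| - |P| - |L| + |P ∩ L|
    return {
        "DSC": 2 * tp / (len(pred) + len(label)),
        "JI": tp / (tp + fp + fn),                # |P ∩ L| / |P ∪ L|
        "SE": tp / (tp + fn),                     # |P ∩ L| / |L|
        "SP": tn / (tn + fp),                     # true negatives / (|I| - |L|)
    }


# Toy example in a 3 x 3 x 3 image (27 voxels).
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
label = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0)}
print(segmentation_metrics(pred, label, n_voxels=27))
```

In this toy case two voxels overlap, so DSC = 4/7 and JI = 2/5. SP stays close to 1 (22/23) because most voxels are true negatives, which is the same effect that makes SP extremely high on whole CT volumes.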
Differences in DSC were statistically tested with the Wilcoxon signed rank test. To control the family-wise error rate, the Bonferroni correction was used; p-values less than 0.01666 were considered statistically significant. Statistical analyses were performed using R (version 4.0.4, https://www.r-project.org/).

RESULTS
Figures 3-5 show the mean DSC of the test set with and without PM 300 and PM 500 when Decathlon full, Decathlon mid, and Decathlon small are used as the training sets, respectively. In these figures, the results without PM correspond to those of original nnUnet. Generally, PM 300 and PM 500 improved the mean DSC of nnUnet, compared with the original nnUnet (without the pretrained model). In particular, the effectiveness of the pretrained model was high when using Decathlon mid as the training set. Neither PM 300 nor PM 500 was useful for DSC improvement when Decathlon full and Decathlon small were used in the 500-epoch training. The DSC improvement was greater in the 100 and 300 epochs than that in the 500 epochs.
Tables 2-4 list the mean and standard deviation of the four metrics of the test set with and without PM 300 and PM 500 when Decathlon full, Decathlon mid, and Decathlon small are used as the training sets, respectively. Because the volume ratio between cancer and noncancerous regions is extremely low, SP was extremely high in the current study. Regarding DSC, JI, and SE, the same trend can be observed. PM 300 and PM 500 improved the mean values of the three metrics; improvement in JI and SE was greater in the 100 and 300 epochs than that in the 500 epochs. Table 5 shows p-values for differences of DSC in Decathlon full, Decathlon mid, and Decathlon small. Figure 6 shows all the DSC values of the test set when using Decathlon mid with and without the pretrained model. Figures 7, 8
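The significance threshold applied to the p-values of the DSC differences can be reproduced with a short sketch of the Bonferroni correction. The family-wise level of 0.05 and the count of three comparisons are assumptions chosen because they reproduce the stated threshold of 0.01666; the p-values in the example are made up and are not results from this study.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    return family_alpha / n_comparisons


def is_significant(p_values, family_alpha=0.05):
    """Flag each p-value against the Bonferroni-corrected threshold."""
    alpha = bonferroni_alpha(family_alpha, len(p_values))
    return [p < alpha for p in p_values]


# Assumed setting: family-wise level 0.05 split over three comparisons.
print(bonferroni_alpha(0.05, 3))

# Made-up p-values, for illustration only.
print(is_significant([0.001, 0.02, 0.5]))
```

The Wilcoxon signed rank test itself is not reimplemented here; only the thresholding step is shown.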

DISCUSSION
In this study, we proposed a pretrained model for segmentation constructed from an artificial dataset of lung nodules generated using the GAN and 3D graph cut. Our results show that the accuracy of lung cancer segmentation could be improved when this pretrained model was used for transfer learning in the segmentation process. The effectiveness of the pretrained model was higher on the Decathlon mid and Decathlon small datasets than on the Decathlon full dataset, suggesting that our proposed method may be effective on small datasets. The pretrained model was more effective when the number of training epochs was low. In other words, the number of epochs required to achieve a sufficient segmentation performance was lower with the pretrained model than without it. This may be attributed to the fact that the pretrained model provides good initial values for the trainable parameters of nnUnet.
Previously, a study used U-net and GAN in combination for multiorgan segmentation on 3D CT images (Dong et al., 2019). However, the study did not use a pretrained model. Another study was conducted on a classification model using a dataset generated with GANs and a pretrained model (Onishi et al., 2020). To the best of our knowledge, no studies have been reported on segmentation models with GANs and a pretrained model. Our results and those of Onishi et al. (2020) indicate that GAN-generated datasets and the models pretrained on them may be useful for various tasks.
Several studies have reported the use of artificially generated datasets using the GAN for data augmentation (Jin et al., 2018; Onishi et al., 2019; Yang et al., 2019; Muramatsu et al., 2020). Similarly, in this study, we tried to use a dataset generated using the GAN for data augmentation. However, we could not obtain effective results for lung cancer segmentation when the artificial dataset was used for data augmentation (data not shown in this article). Instead, we constructed a pretrained model for the segmentation using the generated lung nodules and performed transfer learning based on the pretrained model, yielding higher lung cancer segmentation accuracy. Although it was difficult to perform accurate classification between the generated lung nodules and the true lung nodules (Nishio et al., 2020a), the generated lung nodules had little variation as lung cancer. It is speculated that mixing the generated lung nodules with the true lung nodules could distort the distribution as the dataset of lung
Generally, supervised learning (e.g., nnUnet) requires annotation data as the dataset. For lung cancer segmentation datasets, clinicians frequently annotate lung cancer on CT images, which is time consuming and labor intensive. Although it is possible to manually annotate the generated data of our dataset, we decided to use the 3D graph cut to obtain annotation data of the generated lung nodules. This made it possible to build an artificial dataset for the segmentation without requiring any manual task.
Although the generated lung nodules and the pretrained model based on them could effectively improve the accuracy of lung cancer segmentation, this pretrained model is not always effective. For example, the effectiveness of the pretrained model was not observed in the 500-epoch training of Decathlon full and Decathlon small. In the former case, this was attributed to the fact that Decathlon full had a sufficient amount of data and the number of training epochs was high. In the latter case, the number of cases was very small (10 cases). Therefore, even when the pretrained model was used, the training of the segmentation model was unstable, and the effectiveness of the pretrained model was limited.
Our study has some limitations. First, we used three public datasets containing images of lung nodules and/or lung cancer. However, we did not verify whether the generalizability of our segmentation model can be improved under external variation. Second, we focused on lung nodules and/or lung cancer in the current study. Therefore, the effectiveness of our method for other diseases or other organs has not been validated. In particular, it is necessary to confirm whether the automatic generation of annotation data using the 3D graph cut can be applied to other diseases and other organs. Third, because of the GAN model's limitation (Nishio et al., 2020a), it was impossible to generate lung nodules larger than 40 mm. Therefore, the effect of large generated nodules was not investigated in the current study.
In conclusion, the proposed method comprising an artificial dataset and a pretrained model can improve the accuracy of lung cancer segmentation; however, it should be further investigated for other diseases and other organs.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
MN contributed to conception and design of the study. MN and KF organized the database. MN and HM developed the software. MN performed the statistical analysis. MN wrote the first draft of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

FUNDING
This study was supported by JSPS KAKENHI (grant numbers: 19H03599 and JP19K17232). The funder had no role in the present study.