U-NTCA: nnUNet and nested transformer with channel attention for corneal cell segmentation

Background Automatic segmentation of corneal stromal cells can assist ophthalmologists to detect abnormal morphology in confocal microscopy images, thereby assessing viral infection or conical deformation of corneas and avoiding irreversible pathological damage. However, images of corneal stromal cells often suffer from uneven illumination and disordered vascular occlusion, resulting in inaccurate segmentation. Methods In response to these challenges, this study proposes a novel approach: a nnUNet and nested Transformer-based network integrated with dual high-order channel attention, named U-NTCA. Unlike nnUNet, this architecture allows the recursive transmission of crucial contextual features and the direct interaction of features across layers to improve the accuracy of cell recognition in low-quality regions. The proposed methodology involves multiple steps. First, three underlying features with the same channel number are sent into a channel attention module named gnConv to facilitate higher-order interaction of local context. Second, we leverage different layers in U-Net to integrate a Transformer nested with gnConv, and concatenate multiple Transformers to transmit multi-scale features in a bottom-up manner. We encode the downsampling features, the corresponding upsampling features, and the low-level feature information transmitted from lower layers to model potential correlations between features of varying sizes and resolutions. These multi-scale features play a pivotal role in refining the position information and morphological details of the current layer through recursive transmission. Results Experimental results on a clinical dataset of 136 images show that the proposed method achieves competitive performance with a Dice score of 82.72% and an AUC (Area Under Curve) of 90.92%, both higher than the performance of nnUNet.
Conclusion The experimental results indicate that our model provides a cost-effective and high-precision segmentation solution for corneal stromal cells, particularly in challenging image scenarios.


Introduction
The corneal stroma comprises collagen fibers and accounts for 90% of the overall thickness of the cornea. Corneal stromal cells, the major cell type of the stroma, produce proteins that provide structure to the stroma and maintain corneal homeostasis (Barrientez et al., 2019). Injury to stromal cells tends to cause irreversible corneal damage (Barrientez et al., 2019). Previous studies have shown that the segmentation of corneal stromal cells makes it possible to quantify cell density and other morphological changes (Arıcı et al., 2014). This process assists ophthalmologists in intuitively acquiring geometric variations to support clinical analysis (Al-Fahdawi et al., 2018). Consequently, it enables the identification of deformities or erosion caused by viruses, helping prevent irreversible pathological damage that could lead to significant visual impairment or even blindness (Subramaniam et al., 2021). In particular, compared with healthy corneas, keratoconus presents a conical protrusion and a significantly thinner stroma (Lagali, 2020). Thus, the segmentation of stromal cells and the subsequent morphological measurements help ophthalmologists judge the severity and progression of the disease.
Automatic cell segmentation significantly enhances the efficiency of ophthalmologists, reducing the dependency on highly experienced experts (Shang et al., 2022). Various widely employed algorithms, including K-means clustering (Yan et al., 2012), edge detection (Pan et al., 2015), and watershed (Sharif et al., 2012), have been utilized to achieve automatic cell segmentation. Among them, watershed stands out for its ability to identify challenging regions by incorporating distance transform, variance filtering, and gradient analysis (Lux and Matula, 2020). Dagher and El Tom (2008) proposed a hybrid snake-shape parameter optimization that combines the watershed algorithm with active contours, employing region merging and multi-scale techniques to alleviate under-segmentation. Al-Fahdawi et al. (2018) employed the Fourier transform to mitigate image noise and combined it with watershed for endothelial cell boundary detection. However, watershed approaches are prone to over-segmentation and often rely heavily on empirically tuned parameter settings.
Recent advancements in deep learning provide promising possibilities for achieving more accurate cell segmentation. Many researchers have exploited representative networks, including U-Net (Ronneberger et al., 2015), SegNet (Badrinarayanan et al., 2017), and DeepLab (Chen et al., 2017), to segment and quantify cell morphological changes. Fabijańska (2018) trained U-Net to differentiate pixels surrounding cell boundaries and skeletons, and obtained segmentation results by binarizing a boundary probability map. Vigueras-Guillén et al. (2019) introduced a local sliding window in UNet and generated probability labels to enhance the contrast between positive samples and background. Subsequently, they proposed a plug-and-play attention mechanism called feedback non-local attention to assist in inferring occluded cell regions (Vigueras-Guillén et al., 2022). Given the boundary discontinuity encountered when neural networks predict ambiguous cell boundaries, some studies combined the advantages of CNNs and watershed. Lux and Matula (2020) integrated label-controlled watershed and convolutional networks to segment densely distributed cells, incorporating segmentation function criteria to describe object boundaries.
CNN-based models are suitable for segmenting large cells, but for cells exhibiting artifacts within their bodies, complex post-processing algorithms are essential for separating adjacent cells or reconstructing fragmented cells into a complete cellular structure. Moreover, the segmentation performance of CNNs decreases when cells of different sizes appear within the same field of view. With the popularity of the Transformer (Vaswani et al., 2017), some studies have introduced its global perspective to support the segmentation process (Zhang et al., 2021; Zhu et al., 2022). Zhang et al. (2021) proposed a multi-branch hybrid transformer network (MBT-Net) based on edge information, which utilized the Transformer and residual connections to establish long-term dependencies between space and channels; it also incorporated body and edge branches to provide edge positions.
Previous methods frequently employed the Transformer to model dependency relationships among features of the same size within a layer. At the same time, a feature in a specific layer interacts directly only with its adjacent feature layers, making it difficult to transmit hierarchical difference information between features and to integrate multi-scale information from non-adjacent layers at the macro scale. Our method leverages the Transformer to model the hierarchical relationships among features across different layers, with the aim of reducing the deviation and loss of edge pixels caused by interpolation and sampling between layers. We recursively convey context information across different feature layers within the structure of nnUNet. This approach allows the network to acquire high-dimensional semantic relationships between pixels and their neighbors from various perspectives. Our contributions can be summarized as follows:
• We propose a Transformer-based network called U-NTCA to segment corneal stromal cells. It integrates dual high-order channel attention and allows the recursive transmission of crucial contextual features to better preserve detailed cell information.
• We introduce a high-order channel attention mechanism that extends the spatial interaction among pixels from second order to higher orders. It enables feature interaction at low computational complexity by recursively increasing the channel width.
• We design a novel Transformer-based method that combines channel attention to generate multi-scale features, facilitating direct feature transmission across non-adjacent layers of the network.

Dataset
All study subjects were scanned with a laser scanning corneal confocal microscope, HRTIII (Heidelberg Engineering, Heidelberg, Germany), at the affiliated Eye Hospital of Wenzhou Medical University. The study adhered to the tenets of the Declaration of Helsinki and was approved by the Institutional Review Board of the Affiliated Eye Hospital of Wenzhou Medical University. All participants provided written informed consent after receiving an explanation of the risks and benefits of the study. The dataset utilized for this study on corneal stromal cells includes 136 images, each with a resolution of 384 × 384. The training set contains 96 images, while the test set consists of 40 images. The segmentation labels were manually annotated by one senior ophthalmologist using the ITK-SNAP software. During training, the data augmentation operations applied to the training images include rotation, contrast enhancement, noise addition, translation, and flipping. The dataset comprises corneal stromal cells sourced from three conditions: healthy corneas (named "normal"), corneas with keratoconus (named "cone"), and corneas eroded by viruses (named "HSK"). In general, these cells appear in three different types. The first type exhibits a clear field of view and a clear cell structure; the second type shows background blood vessels traversing the majority of the visual field, causing partial occlusion of some corneal cells.
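The augmentation operations listed above can be sketched with NumPy. This is a minimal illustration; the parameter ranges (rotation multiples, contrast factor, noise level, shift magnitude) are our assumptions, not the paper's exact settings:

```python
import numpy as np

def augment(img, rng):
    """Hypothetical pipeline mirroring the listed operations:
    flipping, rotation, contrast increase, additive noise, translation."""
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)                         # horizontal flip
    img = np.rot90(img, k=int(rng.integers(4)))            # rotate by k*90 deg
    img = np.clip(img * rng.uniform(1.0, 1.3), 0.0, 1.0)   # increase contrast
    img = img + rng.normal(0.0, 0.01, img.shape)           # additive noise
    img = np.roll(img, int(rng.integers(-5, 6)), axis=0)   # translation (wrap)
    return img

rng = np.random.default_rng(0)
img = rng.random((384, 384))      # one confocal image, intensities in [0, 1]
aug = augment(img, rng)
```

In practice such transforms would be applied on the fly each epoch so the network rarely sees the same sample twice.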

Methodology

nnUNet
In medical image segmentation, researchers often develop specific algorithms tailored to distinct research tasks and targeted problems. This practice, however, can result in weak generalization and robustness. nnUNet was proposed specifically to solve such issues in medical semantic segmentation. It places greater emphasis on pre-processing, training, and post-processing procedures, with a primary focus on the images themselves. By systematically modeling various configuration strategies as a set of fixed parameters (such as learning rate and batch size), it adapts to a wide range of medical image segmentation tasks.
The network architecture of nnUNet is the same as that of UNet, following the encoder-decoder paradigm with a series of dense convolutional blocks. Skip connections are employed between the encoder and decoder. By concatenating the generated features as complementary information, efficient feature mapping occurs between internal blocks, establishing convolutional and nonlinear connections. Notably, to enhance stability and adaptability during training while avoiding limitations imposed by batch size, nnUNet substitutes the original ReLU activation functions in UNet with leaky ReLUs, and replaces the more popular batch normalization with instance normalization. These adaptations give nnUNet a stronger adaptive capability, effectively resolving training instability stemming from variations in imaging methods, sizes, and voxel spacing, and enable nnUNet to be employed across a variety of scenarios.
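The two substitutions can be illustrated in a few lines of NumPy. This is a simplified sketch of the operations themselves, not nnUNet's actual implementation; note that instance normalization computes statistics per sample and per channel, so it is independent of batch size:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # x: (N, C, H, W). Normalize each channel of each sample over its own
    # spatial dimensions -- unlike batch norm, no statistics cross the batch.
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, negative_slope=0.01):
    # Leaky ReLU keeps a small negative slope instead of zeroing activations.
    return np.where(x >= 0, x, negative_slope * x)

x = np.random.default_rng(0).standard_normal((2, 4, 8, 8))  # (N, C, H, W)
y = leaky_relu(instance_norm(x))
```

In PyTorch these correspond to `nn.InstanceNorm2d` and `nn.LeakyReLU`.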
U-NTCA network

Considering nnUNet's outstanding data processing capability and adaptive parameter adjustment, we utilize it as the backbone network and enhance it to improve the information interaction between pixels and the utilization of feature information. Figure 2 shows the overall structure of the proposed U-NTCA network. First, to highlight the relationship between neighboring pixels, we focus on three adjacent feature layers in the UNet. For the three feature layers with the same channel number, we transform the feature dimensions of their heights and widths. The transformed outputs are used as inputs to the proposed gnConv channel attention, facilitating higher-order operations and fostering efficient interaction between neighboring pixel regions. Subsequently, the enhanced features are integrated into the current aggregated features, which are then fed into the nested Transformer to aid in generating full-resolution features. Additionally, the recursive transfer of underlying feature information mitigates ambiguity and reduces the information loss resulting from the sampling process.

gnConv high-order attention mechanism
To enhance the interactive capability of local context across varying resolutions, we introduce the gnConv module (Rao et al., 2022), which achieves explicit higher-order spatial interaction within a neighborhood. gnConv implements channel attention through a combination of gated convolution and a recursive strategy. It consists of three components: standard convolution, linear projections, and element-wise multiplication. It inherits the translation equivariance of standard convolution, thereby introducing inductive biases and avoiding the asymmetry arising from local attention.
Unlike the conventional approach of using gnConv directly for attention interaction, we perform a morphological operation on the feature $x_0 \in \mathbb{R}^{H_0 \times W_0 \times C_0}$: we reshape its width and height to obtain $x \in \mathbb{R}^{H \times W \times C}$, where $H = W = \sqrt{C_0}$ and $C = H_0 \times W_0$. This strategy aims to achieve high-order interaction between global pixels across diverse fields of view, enabling the network to learn morphological characteristics and distribution patterns from varying perspectives and directions. For the transformed feature $x$, we obtain the mapping feature $p_0$ and the auxiliary feature set $\{q_k\}_{k=0}^{n-1}$ with rich information embedding through the projection $\phi_{\mathrm{in}}$. The operation doubles the feature dimension and then divides the expanded dimension according to the rule $C_k$:

$$[p_0, q_0, q_1, \ldots, q_{n-1}] = \phi_{\mathrm{in}}(x) \in \mathbb{R}^{H \times W \times \left(C_0 + \sum_{k=0}^{n-1} C_k\right)}$$

Subsequently, gated convolution is executed recursively, introducing the interaction between adjacent features $p_0$ and $q_0$ through element-wise multiplication. This process achieves a spatially mixed input function with adaptive self-attention via

$$p_{k+1} = f_k(q_k) \odot g_k(p_k), \quad k = 0, 1, \ldots, n-1,$$

where $f_k$ is a depth-wise convolution and $g_k$ a linear projection that matches the channel dimension of the next order. The channel dimension of each order can be written as

$$C_k = \frac{C}{2^{\,n-k-1}}, \quad 0 \le k \le n-1.$$

Unlike the Transformer, which achieves spatial global interaction by mixing spatial tokens, gnConv incrementally increases the channel width. It utilizes the global computation of convolutions and fully connected layers to expand the spatial interaction between pixels, progressing from second-order to higher-order interactions at lower complexity.
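The channel-splitting and recursive gating pattern can be sketched in NumPy. This is only a schematic of the gnConv recursion: the depth-wise convolutions $f_k$ are omitted (replaced by identity on the split features) and all projections are random, so it illustrates the shape arithmetic and gating order, not a trained module:

```python
import numpy as np

def gnconv_sketch(x, n=3, rng=None):
    """Schematic g^nConv: phi_in doubles the channel width, splits it into
    [p_0, q_0, ..., q_{n-1}] with C_k = C / 2^(n-1-k), then gates recursively:
    p_{k+1} = g_k(p_k) * q_{k+1}. Depth-wise convs f_k are omitted."""
    rng = rng or np.random.default_rng(0)
    tokens, C = x.shape
    dims = [C // 2 ** (n - 1 - k) for k in range(n)]           # C_k rule
    W_in = rng.standard_normal((C, 2 * C)) / np.sqrt(C)        # phi_in
    proj = x @ W_in                                            # doubled width
    p = proj[:, :dims[0]]                                      # p_0
    qs, start = [], dims[0]
    for d in dims:                                             # q_0 .. q_{n-1}
        qs.append(proj[:, start:start + d])
        start += d
    p = p * qs[0]                                              # first-order gate
    for k in range(n - 1):                                     # higher orders
        W_g = rng.standard_normal((dims[k], dims[k + 1])) / np.sqrt(dims[k])
        p = (p @ W_g) * qs[k + 1]                              # widen, then gate
    W_out = rng.standard_normal((dims[-1], C)) / np.sqrt(dims[-1])
    return p @ W_out                                           # phi_out, back to C

x = np.random.default_rng(1).standard_normal((484, 16))  # HW tokens x C channels
y = gnconv_sketch(x, n=3)
```

With $C = 16$ and $n = 3$ the split widths are $[4, 8, 16]$, which sum with $C_0 = 4$ to exactly $2C$, matching the "doubled dimension" described above.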

Transformer nested with channel attention mechanism
In nnUNet, we transmit the features processed by the gnConv module as part of the multi-scale features sent to the Transformer. For the downsampling feature $x_d \in \mathbb{R}^{H \times W \times d}$ and the upsampling feature $x_u \in \mathbb{R}^{H \times W \times d}$, we flatten them to generate $x_d \in \mathbb{R}^{d \times HW}$ and $x_u \in \mathbb{R}^{d \times HW}$. We utilize gnConv to encode $x_u$ and generate $g^n(x_u)$, which interacts with neighboring pixels in a high-order space. Then $x_u$, $x_d$, and $g^n(x_u)$ are sent to the encoder to generate the enhanced feature $\hat{x}_u$ through self-attention.
On one hand, the upsampling feature $x_u$ is sent into the encoder, accompanied by its counterpart $g^n(x_u)$ that has undergone spatial element-wise multiplication to facilitate higher-order interactions. This prompts the network to devote more attention to the decisive channels, implicitly reflecting the position of cells. On the other hand, $\hat{x}_u$ brings more semantic information by fully interacting with the multi-scale feature $x_c$ transmitted from lower layers in the decoder, guiding $x_c$ to learn the constraints between pixels and their neighbors from multi-scale perspectives. This aids the inference of missing or incorrect cell regions caused by the rough interpolation process. The attention mechanism follows the standard scaled dot-product form

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V.$$

The upsampling feature $x_u$, downsampling feature $x_d$, and enhanced feature $\hat{x}_u$ are encoded into $\tilde{x}_u$, which contains higher-order pixel information and a highly reliable distribution. At the same time, $x_u$ also benefits from this attention interaction, creating conditions for comprehensively learning the morphological structure and layout of corneal cells in the original image. This step can be written as

$$\tilde{x}_u = \mathrm{SelfAttn}\big([x_u;\, x_d;\, \hat{x}_u]\big).$$

Subsequently, the joint multi-scale feature $x_c$ transmitted from the lower layers is updated to $\tilde{x}_c$ through cross-attention, in which $\tilde{x}_u$ and $x_d$ collaboratively guide $\tilde{x}_c$ to learn the potential mapping between low-resolution targets and current targets of different scales. Because there is a size difference between the concatenated features transmitted from the bottom layer and the current-layer features, we feed the concatenated features into the decoder to interact with the current-layer features, exploring the implicit correspondence between the downsampling and upsampling features of adjacent layers. The concatenated multi-scale features thus act as a medium for direct interaction among different layers, which facilitates the discrimination of ambiguous pixels:

$$\tilde{x}_c = \mathrm{CrossAttn}\big(x_c,\; \tilde{x}_u,\; x_d\big).$$

The $\tilde{x}_c$ generated by cross-attention is fed into the FFN (feedforward neural network) in residual form, which is a linear network of the form

$$\mathrm{FFN}(z) = \max(0,\, zW_1 + b_1)\,W_2 + b_2,$$

so that the refined multi-scale feature is

$$\hat{x}_c = \mathrm{LN}\big(\tilde{x}_c + \mathrm{FFN}(\tilde{x}_c)\big).$$

Finally, we fuse the advanced multi-scale feature $\hat{x}_c$ generated by the decoder with the upsampling feature of the current layer in proportion, providing more low-level local contextual information to the upsampling feature $x_u$ that has suffered information loss:

$$\hat{x}_u' = \lambda\, \hat{x}_c + (1 - \lambda)\, x_u,$$

where $\lambda$ is the fusion ratio.
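The encode-then-cross-attend flow can be sketched with single-head attention in NumPy. The shapes, the stacking of $[x_u; x_d; g^n(x_u)]$ as keys/values, and the query/key/value assignment are illustrative assumptions about the data flow, with learned projections omitted:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attn(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
tokens, d = 49, 32
x_u = rng.standard_normal((tokens, d))    # upsampling feature (flattened)
x_d = rng.standard_normal((tokens, d))    # downsampling feature
g_xu = rng.standard_normal((tokens, d))   # gnConv-enhanced x_u (placeholder)
x_c = rng.standard_normal((tokens, d))    # multi-scale feature from below

# encoder: self-attention of x_u over the stacked context [x_u; x_d; g(x_u)]
ctx = np.concatenate([x_u, x_d, g_xu], axis=0)
x_u_tilde = attention(x_u, ctx, ctx)

# decoder: cross-attention where x_u_tilde and x_d jointly guide x_c
kv = np.concatenate([x_u_tilde, x_d], axis=0)
x_c_tilde = attention(x_c, kv, kv)
```

The key point the sketch shows is that the lower-layer feature `x_c` never needs to match the context length: attention re-weights an arbitrary number of key/value tokens into the query's own resolution.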

Recursive transmission of multi-scale features in U-shaped structures
We recursively implement the nested mechanism consisting of gnConv and the Transformer to deliver multi-scale features across different layers. Figure 3 displays the recursive transmission strategy. In the process of generating upsampling features at full resolution, we need to consider the cascaded features transmitted from lower layers.
For the upsampling feature of layer $i+1$, its multi-scale feature $x_c^{i+1}$ is assembled from the downsampling feature $x_d^{i+1}$ of that layer together with the features of layer $i$: the advanced encoding feature $\tilde{x}_u^{i}$, the decoded multi-scale feature $\hat{x}_c^{i}$, the downsampling feature $x_d^{i}$, and $g^n(x_u^{i})$. Thus, $x_c^{i+1}$ ($i > 1$) is formulated as

$$x_c^{i+1} = \mathrm{Concat}\left(x_d^{i+1},\; \tilde{x}_u^{i},\; x_d^{i},\; \hat{x}_c^{i},\; g^n(x_u^{i})\right).$$

For the lowest-level features, no multi-scale features are transmitted from below, so the composition reduces to

$$x_c^{1} = \mathrm{Concat}\left(x_d^{1},\; g^n(x_u^{1})\right).$$
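The bottom-up assembly amounts to channel-wise concatenation of the lower layer's outputs with the current layer's downsampling feature. A toy sketch, in which the channel widths and the concatenation ordering are illustrative assumptions:

```python
import numpy as np

def assemble_multiscale(x_d_next, x_u_tilde, x_d_cur, x_c_hat, g_xu):
    # Multi-scale feature for layer i+1: the lower layer's encoded, decoded,
    # downsampled, and gnConv-enhanced features ride along to the layer above.
    return np.concatenate([x_d_next, x_u_tilde, x_d_cur, x_c_hat, g_xu],
                          axis=-1)

rng = np.random.default_rng(0)

def feat(channels, tokens=49):
    # placeholder feature map: tokens x channels
    return rng.standard_normal((tokens, channels))

# hypothetical widths: 64 channels at layer i+1, 32 at layer i
x_next = assemble_multiscale(feat(64), feat(32), feat(32), feat(32), feat(32))
```

In the real network the spatial resolutions also differ between layers, so flattening (or interpolation) must bring the token dimensions into agreement before concatenation.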

Experiments

Parameter settings
The experiments were conducted using PyTorch 1.7.1 on a GeForce RTX 3090 with 24 GB of memory. For the parameterization of gnConv, the number of recursive orders was set to n = 3, and the input features had a width (W) and height (H) of 22. The input feature channels followed the rule $9 \times 2^{2i}$ ($i = 1, 2, 3, 4$). For the Transformer, the dropout rate was set to 0.1 and the feed-forward dimension to 2048. For the nested network features across different layers, the first three layers had 484 channels, and the fourth layer had 256 channels. Training employed 5-fold cross validation, further dividing the images into training and validation sets in an 8:2 ratio. The fusion ratio of upsampled features to the corresponding multi-scale features was set to 3:7.

. Evaluation metrics
In this experiment, we employ Dice, Acc, recall, pre (precision), and AUC as evaluation metrics to assess segmentation performance. Dice quantifies the similarity between two samples, with values in [0, 1]. Precision denotes the proportion of correctly identified positive samples among all predicted positive samples, while recall represents the proportion of actual positive samples that are correctly predicted. Acc directly reflects the overall classification accuracy of the classifier. AUC quantifies the area under the ROC (Receiver Operating Characteristic) curve.
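For binary masks, all of these except AUC (which needs the probability map rather than the thresholded mask) reduce to counts of true/false positives and negatives. A minimal sketch:

```python
import numpy as np

def seg_metrics(pred, gt):
    """Dice, precision, recall, and accuracy from two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # predicted positive, actually positive
    fp = np.sum(pred & ~gt)     # predicted positive, actually negative
    fn = np.sum(~pred & gt)     # missed positives
    tn = np.sum(~pred & ~gt)    # correctly rejected background
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),      # overlap similarity
        "precision": tp / (tp + fp),              # correct among predicted +
        "recall": tp / (tp + fn),                 # correct among actual +
        "acc": (tp + tn) / (tp + tn + fp + fn),   # overall correctness
    }

m = seg_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```

On this toy example every count equals 1, so all four metrics evaluate to 0.5; in segmentation the metrics would be computed over whole image masks.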

. Comparative analysis
To verify the effectiveness of the proposed method, we compared the results of UNet++ (Zhou et al., 2018), Segformer (Xie et al., 2021), SwinUNet (Cao et al., 2022), and TransUNet (Chen et al., 2021) with the segmentation results of our method on the test set, as shown in Table 2. The proposed method outperforms the other models on all metrics. On the Dice measure, the improved nnUNet reaches 82.71%, which is 23.35% higher than UNet++, 20.73% higher than Segformer, 10.85% higher than SwinUNet, 11.15% higher than TransUNet, and 0.95% higher than nnUNet. Compared to nnUNet, the quantitative measurements of Dice, Acc, recall, pre, and AUC are improved by 0.08%, 0.62%, 0.55%, and 0.29%, respectively. This demonstrates that our algorithm meets the requirement of accurate localization, validating the effectiveness of the improved model. The results on the three classes of the dataset, Cell, HSK, and Cone, intuitively show that our algorithm achieves the optimal performance on the Dice and Acc measures within these classes. These results indicate that our method comprehensively improves the segmentation performance of nnUNet across multiple scenarios, rather than solving a single segmentation challenge alone. Figure 4A shows the comparison of our method with other methods on different metrics, while Figure 4B shows the Dice values of the different methods on the corneal test images. Our method achieves the best results on all metrics and outperforms the other approaches on most of the test images.

Comparisons of different encoding strategies
As shown in Figure 5, we performed two comparative experiments to verify the influence of different encoding strategies for gnConv and the Transformer. To align with the dimension of the high-level features, our method concatenates four smaller low-level features. For the comparative analysis in Table 3, we instead expanded the dimensions of the four low-level features via interpolation and then fused them with fixed proportional weights. The comparison in Table 3 reveals that the concatenation strategy is superior to the interpolation strategy on most evaluation metrics. The multi-scale features based on concatenation achieve 82.72%, 97.43%, and 83.06% on Dice, Acc, and recall, respectively, which are 0.29%, 0.11%, and 2.51% higher than those achieved via interpolation. This performance demonstrates the effectiveness of the concatenation strategy in conveying cell morphology and position distribution. This capability improves the localization of corneal cells with weaker contrast at upper layers, whereas the features generated through interpolation suffer from information loss and ambiguous pixels, diminishing the segmentation accuracy.

Ablation experiments
The ablation experiment is reported in Table 4.

Qualitative evaluation
As illustrated in Figure 6, a detailed visualization comparison is performed between nnUNet and our method on local image patches. In Patch 1 (a), nnUNet exhibits a larger area of false positives (magenta). In Patch 2 (a), nnUNet predicts more false-positive cell parts than our method, which achieves a more precise detection of cell boundaries in Patch 2 (b). The two cells in Patch 3 belong to the challenging case of low visibility. nnUNet misses one of the corneal stromal cells, while our method detects both. Figure 7 visualizes the heatmaps of TransUNet, nnUNet, and our method. TransUNet, which is designed around the Transformer, shows fewer cells in warm colors (such as red and yellow) than the other two methods, but a significantly larger number of cells in cold colors (cyan and blue). In the heatmap of nnUNet, cells are predominantly warm-colored, with clear classification boundaries between positive and negative samples. The comparison between TransUNet and nnUNet highlights the distinction between CNN and Transformer: the latter focuses on the interaction of global context, and thus performs better at identifying cells (in cyan) that are difficult to recognize under blurred conditions. Our algorithm effectively combines the advantages of both approaches.
As demonstrated in the two zoomed patches, our method not only has high predictive scores (with more red area) for the majority of cells in patch 1, but also successfully identifies a larger number of cells (in cyan) that were overlooked by the nnUNet in patch 2.
Figure 8 presents the segmentation results of the different algorithms. In Image 1, background vascular occlusion causes some intact cells to be segmented into small fragments. nnUNet struggles to recognize some of the tiny cell fragments, whereas the proposed U-NTCA network successfully extracts the overall cell structures. Due to uneven illumination in Image 2, some cell edges are blurry with significant feature differences, which makes it challenging to recognize cells under dim illumination. Nevertheless, our method detects more cells in low-contrast conditions. In Image 3, severe background interference obscures the cell edges. Although the cells are located in areas with fair illumination, accurately recognizing cell morphology and structure remains challenging. All the state-of-the-art approaches exhibit a notable disparity in achieving precise cell recognition, while nnUNet and our method outperform the others in detecting more complete cell contours.

Conclusion
The automatic and accurate segmentation of corneal stromal cells is essential for the rapid identification of abnormal lesions and the timely prevention of the relevant diseases. To address the low segmentation accuracy of existing methods under uneven illumination and occlusion, we designed a nested Transformer incorporated into nnUNet to model implicit feature transmission across layers. The proposed model generates low-level positional and morphological features that are subsequently transmitted to upper layers to facilitate multi-scale feature fusion. In future research, we intend to incorporate edge constraints to address challenges such as incorrectly connected cells or cells with broken edges, and to further establish a multi-task framework.

FIGURE Example of different types of corneal stromal cells.

FIGURE Schematic diagram of the recursive transmission strategy.

FIGURE Comparison of visualization results of different methods on the test set. (A) Comparison on various metrics; (B) Comparison on the Dice index.

FIGURE A detailed visualization comparison between nnUNet and our algorithm on local image patches.

FIGURE Heatmap visualization of different methods.
TABLE Comparison of experimental results of different encoding strategies for multi-scale features. Bold values indicate the best performance among all compared methods.
The ablation experiment in Table 4 verifies the impact of gnConv and the Transformer in the proposed framework. When leveraging only gnConv to enhance feature interactions, Dice and AUC increased by 0.96% and 0.46%, respectively. Moreover, by incorporating the recursive Transformer into the U-shaped architecture of nnUNet, the improved model achieved Dice and AUC values of 82.72% and 90.93%, indicating further improvement in accuracy.
FIGURE Schematic diagram of different encoding strategies.