Self-supervised monocular depth estimation for high field of view colonoscopy cameras

Optical colonoscopy is the gold standard procedure to detect colorectal cancer, the fourth most common cancer in the United Kingdom. Up to 22%–28% of polyps can be missed during the procedure, a shortfall associated with interval cancer. A vision-based autonomous soft endorobot for colonoscopy could drastically improve the accuracy of the procedure by inspecting the colon more systematically and with reduced discomfort. A three-dimensional understanding of the environment is essential for robot navigation and can also improve the adenoma detection rate. Monocular depth estimation with deep learning methods has progressed substantially, but collecting ground-truth depth maps remains a challenge because no 3D camera can be fitted to a standard colonoscope. This work addresses this issue with a self-supervised monocular depth estimation model that learns depth directly from video sequences via view synthesis. In addition, our model accommodates the wide field-of-view cameras typically used in colonoscopy and specific challenges such as deformable surfaces, specular lighting, non-Lambertian surfaces, and high occlusion. We performed a quantitative analysis on a synthetic data set and qualitative examinations on a colonoscopy training model and real colonoscopy videos in near real-time.


Introduction
The robotics community has also shown interest in new technologies for colonoscopy (Ciuti et al., 2020), including capsules (Formosa et al., 2019), endorobots (Manfredi, 2021, 2022), meshworm robots (Bernth et al., 2017), soft robots (Manfredi et al., 2019), and autonomous robots (Kang et al., 2021). A colonoscope is a long, flexible tubular instrument, generally 12 mm in diameter, with a single camera head, lights, irrigation, and an instrument channel port. The miniature camera and the lights allow visual inspection of the colon, and the irrigation port, supplied with water, helps clean the colon of residual stools. The instrument port enables the passage of surgical tools to remove tissue or polyps for further examination. The size of the colonoscope is restricted by the colon size; thus, the number of sensors that can be included, such as three-dimensional (3D) cameras, is limited. This has led researchers to investigate 3D mapping from the images of a monocular camera inside the colon.
In recent years, monocular depth estimation has been studied with supervised learning (Nadeem and Kaufman, 2016), conditional random fields, generative adversarial networks (GANs), conditional GANs (Rau et al., 2019), and self-supervised learning (Hwang et al., 2021). However, previous works have not considered that the colonoscope camera yields highly distorted images because it uses high field-of-view (FOV) lenses. Cheng et al. (2021) used synthetic data with ground-truth depth to train the baseline supervised depth model used in a self-supervised pipeline and used optical flow from a pretrained flow network for temporal structural consistency. However, optical flow in a deformable environment like the colon is unreliable. In Bae et al. (2020), the training pipeline follows multiple steps: sparse depth from Structure from Motion (SfM), a depth network trained on the sparse depth, pixel-wise embedding, and multiview reconstruction, which makes it slow and unsuitable to run alongside a robot. The main contributions of the proposed work are:
1. It introduces self-supervised monocular depth estimation methods specifically designed for wide FOV colonoscopy cameras, addressing the challenges posed by the distorted, wide-angle nature of the images.
2. It effectively handles low-texture surfaces in view synthesis by incorporating a validity mask, which removes pixels affected by specular lighting, lens artifacts, and regions with zero optical flow, ensuring accurate depth estimation even in challenging areas of the colon.

Motivation
Colonoscopy cameras generally have a wide FOV that enables faster diagnosis with complete coverage of the colon wall. State-of-the-art colonoscopy devices have a 140°–170° FOV camera. As a result, these cameras suffer from high lens distortion, especially radial distortion, as shown in Figure 1, but this has not yet been studied sufficiently because most endoscopy depth estimation methods are trained on low FOV images or synthetic images without distortion.

Low FOV cameras
Low FOV cameras have <120° horizontal FOV lenses, as shown in the UCL 2019 data set (Rau et al., 2019). The Brown-Conrady camera model (Kannala and Brandt, 2006) covers these lenses with both radial and tangential distortion. The UCL 2019 data set (Rau et al., 2019) contains synthetic images generated with the ideal pinhole camera approximation. In real images, distortion is prevalent, caused by varying lens magnification with increasing angular distance; due to imperfect alignment of the lens or sensor, the distortion may also be decentered. The Brown-Conrady (Kannala and Brandt, 2006) projection function π(P, i): P(x, y, z) → p(u, v) is defined as

u = f_x [x′(1 + k_1 r² + k_2 r⁴ + k_3 r⁶) + 2p_1 x′y′ + p_2 (r² + 2x′²)] + c_x,
v = f_y [y′(1 + k_1 r² + k_2 r⁴ + k_3 r⁶) + p_1 (r² + 2y′²) + 2p_2 x′y′] + c_y, (1)

where P is a 3D camera point with coordinates (x, y, z), p is a 2D image point with coordinates (u, v), x′ = x/z, y′ = y/z, and r² = x′² + y′². Camera parameters i consist of radial distortion coefficients (k_1, k_2, k_3), tangential distortion coefficients (p_1, p_2) of the lens, focal lengths (f_x, f_y), and principal points (c_x, c_y).
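As an illustrative sketch of this projection (the function and parameter names are ours, not from the paper):

```python
def project_brown_conrady(P, fx, fy, cx, cy, k1, k2, k3, p1, p2):
    """Project a 3D camera-frame point P = (x, y, z) to a pixel (u, v) with
    the Brown-Conrady model: pinhole projection plus radial (k1..k3) and
    tangential (p1, p2) distortion. Illustrative sketch only."""
    x, y, z = P
    # Normalized image-plane coordinates.
    mx, my = x / z, y / z
    r2 = mx * mx + my * my
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    xd = mx * radial + 2.0 * p1 * mx * my + p2 * (r2 + 2.0 * mx * mx)
    yd = my * radial + p1 * (r2 + 2.0 * my * my) + 2.0 * p2 * mx * my
    return fx * xd + cx, fy * yd + cy
```

With all distortion coefficients set to zero, the function reduces to the ideal pinhole projection used for the UCL 2019 synthetic images.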

Wide FOV colonoscopy cameras
The omnidirectional camera model (Scaramuzza et al., 2006), unified camera model (UCM) (Geyer and Daniilidis, 2000), extended UCM (eUCM) (Khomutenko et al., 2015), and double sphere (DS) camera model (Usenko et al., 2018) can model wide FOV lenses. The omnidirectional camera model, along with the pinhole radial distortion (Fryer and Brown, 1986) and Brown-Conrady (Kannala and Brandt, 2006) camera models, falls in the high-order polynomial distortion family. These models require solving for the root of a high-order polynomial during unprojection, which is computationally expensive. On the other hand, UCM, eUCM, and DS fall into the unified camera model family, which is easy to compute and has a closed-form unprojection function. For brevity, we only describe the DS model in this article, but the pipeline can be extended to all unified camera models. DS is modeled with six camera parameters i: f_x, f_y, c_x, c_y, α, and ξ.
In the study by Geyer and Daniilidis (2000), it was observed that the UCM effectively represents systems with parabolic, hyperbolic, elliptic, and planar mirrors. The UCM has also been successfully employed for cameras equipped with fisheye lenses (Ying and Hu, 2004). However, the UCM may not provide an ideal match for most fisheye lenses, often necessitating an additional distortion model. The UCM projects a 3D point onto the unit sphere and then onto the image plane of a pinhole camera that is shifted by ξ from the center of the unit sphere. The eUCM can be considered a generalization of the UCM, where the point is projected onto an ellipsoid symmetric around the z-axis using the coefficient β. Alternatively, the DS (Usenko et al., 2018) camera model is a more suitable choice for cameras with fisheye lenses. It offers a closed-form inverse and avoids computationally expensive trigonometric operations. In DS, a 3D point is projected onto two unit spheres with centers shifted by ξ. The point is then projected onto an image plane using a pinhole camera shifted by α/(1 − α), which accounts for the distortion that occurs when the camera's image plane is not perfectly aligned with the object plane. The projection function π(P, i): P(x, y, z) → p(u, v) is given as follows:

u = f_x x / (α d_2 + (1 − α)(ξ d_1 + z)) + c_x,
v = f_y y / (α d_2 + (1 − α)(ξ d_1 + z)) + c_y, (2)

where

d_1 = √(x² + y² + z²), d_2 = √(x² + y² + (ξ d_1 + z)²). (3)

The unprojection function π⁻¹(p, i): p(u, v) → P(x, y, z) is given as follows:

P = (m_z ξ + √(m_z² + (1 − ξ²) r²)) / (m_z² + r²) · (m_x, m_y, m_z)ᵀ − (0, 0, ξ)ᵀ, (4)

where

m_x = (u − c_x)/f_x, m_y = (v − c_y)/f_y, r² = m_x² + m_y², m_z = (1 − α² r²) / (α √(1 − (2α − 1) r²) + 1 − α).

FIGURE 1
Illustration of a wide FOV colonoscopy camera and its distortion: raw and rectified images, respectively, when the camera is placed inside the colonoscopy training model (A, B) and in front of a calibration target (C, D). The yellow box in (B) and (D) crops out the black pixels after undistortion, causing a loss of FOV.
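A NumPy sketch of the DS projection and its closed-form unprojection, following Usenko et al. (2018) (function names and parameter values are ours):

```python
import numpy as np

def ds_project(P, fx, fy, cx, cy, alpha, xi):
    """Double sphere projection: project a 3D point via two unit spheres
    whose centers are offset by xi, then through a pinhole shifted by
    alpha/(1 - alpha). Illustrative sketch, not the authors' code."""
    x, y, z = P
    d1 = np.sqrt(x * x + y * y + z * z)
    zs = xi * d1 + z                      # z shifted to the second sphere
    d2 = np.sqrt(x * x + y * y + zs * zs)
    denom = alpha * d2 + (1.0 - alpha) * zs
    return fx * x / denom + cx, fy * y / denom + cy

def ds_unproject(u, v, fx, fy, cx, cy, alpha, xi):
    """Closed-form double sphere unprojection to a unit-norm viewing ray."""
    mx, my = (u - cx) / fx, (v - cy) / fy
    r2 = mx * mx + my * my
    mz = (1.0 - alpha * alpha * r2) / (
        alpha * np.sqrt(1.0 - (2.0 * alpha - 1.0) * r2) + 1.0 - alpha)
    scale = (mz * xi + np.sqrt(mz * mz + (1.0 - xi * xi) * r2)) / (mz * mz + r2)
    ray = np.array([scale * mx, scale * my, scale * mz - xi])
    return ray / np.linalg.norm(ray)
```

Projecting a point and unprojecting the resulting pixel recovers the original viewing direction, which is the round-trip property that makes the model convenient for view synthesis.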

Methods
In this work, we aim to estimate depth from a colonoscopy image stream via view synthesis. View synthesis enables us to train the depth estimation model in a self-supervised manner. The depth network takes one image at a time, the source I_s or target I_t image, and predicts the corresponding per-pixel depth map. The target depth map D_t from the depth network and the relative pose T_t→s from the pose network enable the reconstruction of the target image Î_t from the source image I_s with geometric projection. The pose network predicts a rigid transformation with six degrees of freedom. With a chosen camera model, the unprojection function π⁻¹ maps the target image coordinates I_t(p) to 3D space, forming a 3D point cloud, and the projection function π maps the 3D points back to image coordinates. An overview of the training pipeline is shown in Figure 2. Section 2 describes the camera models for different lenses with varying lens distortion. For camera models that must find the root of a high-order polynomial, a pre-calculated lookup table can be used for a computationally efficient unprojection operation.

Generic image synthesis
Irrespective of the camera model, shown in Eq. 1 or Eqs 2–4, the generic image synthesis block can reconstruct the target image from source images with the projection π and unprojection π⁻¹ functions, as shown in Figure 3. A point cloud P_t can be generated from the estimated depth D_t for the corresponding input image I_t as follows:

P_t = π⁻¹(p_t, D_t),

where p_t is the set of image coordinates and P_t is the 3D point cloud corresponding to the target image I_t. The relative pose T_t→s from the

FIGURE 2
Self-supervised depth training pipeline: input images (I_s, I_t) are adjacent frames sampled from a video sequence. The depth network predicts independent per-pixel dense depth maps (D_s, D_t) from the source I_s and target I_t images, respectively. The pose network takes the concatenated image sequence (I_s, I_t) as input and predicts the rotation R and translation t that define the transformation T between the source and target images. The warp function takes the predicted depths and poses to warp the source image coordinates to the target or vice versa. The image synthesis projects the source image I_s to the target image I_t with the warped image coordinates. The synthesized images are compared with the real images to construct the loss. Outliers in the image synthesis loss are removed using the clip function. Image pixels affected by specular lighting and violations of the Lambertian surface assumption are ignored in the loss calculation via the valid mask. The depth consistency loss ensures that the depth is consistent across the sequence. At inference, only the depth network is used.

FIGURE 3
Generic image coordinate warp block: the source I_s and target I_t images are captured at times t − 1 and t. The depth D_t predicted by the depth network, along with the relative pose T_t→s estimated by the pose network, is used to synthesize the target image from the source. The view synthesis block allows the use of various camera models that can handle cameras with higher lens distortion. We demonstrate this on the UCL 2019 synthetic data set with a 90° horizontal FOV camera (no lens distortion) and on our colonoscopy training model data set with a 132° horizontal FOV camera (high radial lens distortion).
target to source image is estimated by the pose network. This relative pose can be used to transform the target point cloud into the source point cloud P_s = T_t→s P_t. With the projection model π, the 3D point cloud P_s can be mapped to the source image coordinates p_s as follows:

p_s = π(T_t→s π⁻¹(p_t, D_t)).

The target image can then be synthesized by inverse warping the source image coordinates and sampling (Godard et al., 2017) with ζ.
The projected source image coordinates p_s(i′, j′) are used to sample pixel values with the sampling function ζ to generate the target image Î_t. A differentiable spatial transformer network (Jaderberg et al., 2015) is used to synthesize images with bilinear interpolation.
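The sampling step can be sketched in NumPy as follows (a stand-in for the differentiable spatial transformer; the function name and the (u, v) coordinate layout are our assumptions):

```python
import numpy as np

def bilinear_sample(img, coords):
    """Sample a single-channel image (H, W) at continuous pixel coordinates
    coords (N, 2), given as (u, v) pairs, with bilinear interpolation --
    the sampling function zeta. Coordinates are clamped to the image."""
    H, W = img.shape
    u, v = coords[:, 0], coords[:, 1]
    u0 = np.clip(np.floor(u).astype(int), 0, W - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, H - 2)
    du, dv = u - u0, v - v0
    # Interpolate along x on the top and bottom rows, then along y.
    top = img[v0, u0] * (1 - du) + img[v0, u0 + 1] * du
    bot = img[v0 + 1, u0] * (1 - du) + img[v0 + 1, u0 + 1] * du
    return top * (1 - dv) + bot * dv
```

In the actual pipeline this operation must be differentiable so gradients can flow back into the depth and pose networks; frameworks such as PyTorch provide it natively.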

Image reconstruction loss
The photometric loss L_v between the reference frame and the reconstructed frame acts as a self-supervised loss for the depth network. Following the previous work of Zhou et al. (2017), we use structural similarity (SSIM) (Wang et al., 2004) along with the least absolute deviation (L1) for the photometric loss, weighted by η. This leverages the ability of SSIM, denoted as S, to handle complex illumination changes and adds robustness to outliers as follows:

L_v = (1/|V|) Σ_{p∈V} [η (1 − S(I_t(p), Î_t(p)))/2 + (1 − η) |I_t(p) − Î_t(p)|].

Here, V denotes the set of valid pixels and |V| denotes its cardinality. Masking out pixels with V that violate the non-zero optical flow and Lambertian surface assumptions avoids the propagation of corrupted gradients into the networks. The valid mask is derived from M_ego, a binary mask that initially marks all mapped pixels as valid; pixels that appear static between the source and target images, caused by the absence of ego motion, specular lighting, lens artifacts, and low texture, are then marked invalid. Parts of the photometric loss that violate these assumptions generate a large loss, yielding a large gradient that can worsen performance. Following the study of Zhou et al. (2018), clipping the loss with θ, the p-th percentile of L_v, removes these outliers as L_v(s_i) = min(s_i, θ), where s_i is the i-th cost in L_v.
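The combined loss with masking and percentile clipping can be sketched as follows. For brevity, this sketch uses a crude global-statistics SSIM rather than the windowed SSIM of Wang et al. (2004); function names and the clipping percentile are our assumptions:

```python
import numpy as np

def photometric_loss(I_t, I_hat, valid, eta=0.85, clip_pct=95):
    """Per-pixel photometric loss eta * (1 - SSIM)/2 + (1 - eta) * L1,
    averaged over valid pixels after clipping costs at the clip_pct-th
    percentile. Inputs are (H, W) arrays in [0, 1]; valid is boolean."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = I_t.mean(), I_hat.mean()
    var_x, var_y = I_t.var(), I_hat.var()
    cov = ((I_t - mu_x) * (I_hat - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    per_pixel = eta * (1.0 - ssim) / 2.0 + (1.0 - eta) * np.abs(I_t - I_hat)
    costs = per_pixel[valid]                 # keep only valid-mask pixels
    theta = np.percentile(costs, clip_pct)   # clip outliers at the percentile
    return np.minimum(costs, theta).mean()
```

A perfectly reconstructed frame yields zero loss; large reconstruction errors are bounded by the clipping threshold so they cannot dominate the gradient.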

Depth consistency loss
The depth consistency loss L_c (Bian et al., 2021) ensures that the depth maps D_s and D_t, estimated by the depth network from the source and target images I_s and I_t, adhere to the same 3D scene. Minimizing this loss encourages depth consistency not only within an image pair but across the entire sequence through overlapping image batches. The depth consistency loss can be defined as follows:

L_c = (1/|V|) Σ_{p∈V} DC(p), with DC(p) = |D̂_t^s(p) − D̃_s(p)| / (D̂_t^s(p) + D̃_s(p)),

where D̂_t^s is the depth map D_t projected with pose T_t→s into the source view, corresponding to the depth at image I_s. Comparing D̂_t^s with D_s directly is not possible because the projected points do not lie on the pixel grid of D_s; thus, an interpolated depth D̃_s is generated using differentiable bilinear interpolation on D_s and compared with D̂_t^s. The inverse of DC acts as an occlusion mask M_o = 1 − DC that masks out pixels occluded between the source and target views, mainly around the haustral folds of the colon.
The image reconstruction loss L_v is reweighted with the M_o mask as follows:

L_v(p) ← M_o(p) · L_v(p).
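A minimal sketch of the consistency term and the derived occlusion mask, following Bian et al. (2021) (the helper name and the epsilon guard are ours):

```python
import numpy as np

def depth_consistency(D_proj, D_interp, eps=1e-7):
    """Per-pixel depth consistency DC = |D_proj - D_interp| / (D_proj + D_interp)
    and the occlusion mask M_o = 1 - DC used to reweight the reconstruction
    loss. D_proj is the target depth projected into the source view; D_interp
    is the source depth bilinearly sampled at the projected coordinates."""
    dc = np.abs(D_proj - D_interp) / (D_proj + D_interp + eps)
    m_o = 1.0 - dc   # near 1 where the depths agree, near 0 at occlusions
    return dc.mean(), m_o
```

Because DC is a relative difference, it treats near and far regions of the colon evenly instead of being dominated by large absolute depths.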

Edge-guided smoothness loss
The depth maps are regularized with an edge-guided smoothness loss, as homogeneous, low-texture regions do not induce an image reconstruction loss. The smoothness loss is adapted from Wang et al. (2018) as follows:

L_e = |▽_x D′_t| e^{−|▽_x I_t|} + |▽_y D′_t| e^{−|▽_y I_t|},

where ▽ denotes the first-order gradient and D′_t = (1/D_t)/mean(1/D_t) is the mean-normalized inverse depth. To avoid the optimizer getting trapped in local minima (Zhou et al., 2017; Godard et al., 2019), the depth estimation and losses are computed at multiple scales. The total loss is averaged over the scales (s = 4) and image batches as follows:

L = (1/s) Σ (σ_1 L_v + σ_2 L_c + σ_3 L_e).
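A NumPy sketch of the smoothness term, assuming single-channel (H, W) arrays (the helper name is ours):

```python
import numpy as np

def edge_guided_smoothness(D_t, I_t):
    """Edge-guided smoothness on the mean-normalized inverse depth: depth
    gradients are penalized, but the penalty is down-weighted where the
    image itself has strong gradients (edges). Sketch after Wang et al.
    (2018); D_t and I_t are (H, W) arrays."""
    d = (1.0 / D_t) / np.mean(1.0 / D_t)   # normalized inverse depth D'_t
    dx_d, dy_d = np.abs(np.diff(d, axis=1)), np.abs(np.diff(d, axis=0))
    dx_i, dy_i = np.abs(np.diff(I_t, axis=1)), np.abs(np.diff(I_t, axis=0))
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

A perfectly constant depth map incurs zero penalty regardless of image content, while depth discontinuities are tolerated only where the image shows an edge.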

Experiments
In this work, all models are trained on the UCL 2019 synthetic (Rau et al., 2019) and colonoscopy training model (CTM) data sets and evaluated on a variety of synthetic (Rau et al., 2019; Azagra et al., 2022) and real (Azagra et al., 2022; Ma et al., 2021) colonoscopy data.

Implementation details
The depth network is inspired by Godard et al. (2019), with ResNet18 (He et al., 2016) as the encoder and a multiscale decoder that predicts depth at four different scales. The pose network shares the same encoder architecture as the depth network, followed by four 2D convolution layers. The models are built with PyTorch (Paszke et al., 2017) and use the Adam optimizer (Kingma and Ba, 2014) to minimize the training error. The models are trained on an Nvidia A6000 with batch size 25 for 20 epochs at an initial learning rate of 0.0001, halved after 15 epochs. The depth network sigmoid output d is scaled as D = 1/(d(1/d_min − 1/d_max) + 1/d_max), where d_min = 0.1 and d_max = 20.0 units, corresponding to 0.1–20 cm. The network input image size is 256 × 256 (height × width) for the UCL 2019 synthetic images (Rau et al., 2019) and 512 × 768 for the CTM. The hyperparameters are set empirically as follows: η = 0.85, σ_1 = 1.0, σ_2 = 0.5, and σ_3 = 0.001. We apply horizontal flipping to 50% of the images in the UCL 2019 data set but not in the CTM data set because of the off-centered principal point of the wide-angle camera. We apply random brightness, contrast, saturation, and hue jitter within ±0.2, ±0.2, ±0.2, and ±0.1, respectively. For stability, the first epoch of training is a warm-up run with minimum reprojection loss (Godard et al., 2019) when training on the CTM, whereas no warm-up is used for the UCL 2019 synthetic data set. The depth estimation model runs at 10 FPS (frames per second) on the GPU at inference, making it suitable for robotic control.
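The sigmoid-to-depth scaling can be sketched as follows, assuming the standard disparity-space interpolation of Godard et al. (2019) (the function name is ours):

```python
def sigmoid_to_depth(d, d_min=0.1, d_max=20.0):
    """Map the depth network's sigmoid output d in (0, 1) to metric depth in
    [d_min, d_max] by interpolating linearly in inverse-depth (disparity)
    space. d = 0 maps to d_max and d = 1 maps to d_min."""
    min_disp, max_disp = 1.0 / d_max, 1.0 / d_min
    return 1.0 / (min_disp + (max_disp - min_disp) * d)
```

Interpolating in disparity rather than depth gives the network finer resolution at close range, which is where most of the colon wall lies.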

UCL 2019 synthetic data set
The UCL 2019 synthetic data (Rau et al., 2019) are generated from human CT colonography with manual segmentation and meshing. The Unity engine is used to simulate the RGB images with a virtual pinhole camera (90° FOV and two attached light sources) and their corresponding depth maps with a maximum range of 20 cm. The data set contains 15,776 RGB images and ground-truth depth maps, split 8:1:1 into train, validation, and test sets. The data set is recorded as nine subsets, each with one of three lighting conditions and one of three different materials. The light sources vary in spot angle, range, color, and intensity, while the materials vary in color, reflectiveness, and smoothness.

Colonoscopy training model data set
Our colonoscopy training model data set is recorded with an off-the-shelf full HD camera (1920 × 1080, 30 Hz, 132° HFOV, 65° VFOV, 158° DFOV, FID 45-20-70, white ring light around the camera) inside a plastic phantom used for training medical professionals. The phantom mimics the 1:1 anatomy of the human colon, including the internal diameter, overall length, and haustral folds, simulating images from an optical colonoscopy. The camera is attached to a motorized shaft that is moved forward and backward while recording videos. The data set contains around 10,000 images, split 8:2 into train and validation sets, with varying internal and external light settings, and does not provide ground-truth depth maps.

Quantitative study
The structural constraints of a real colon preclude the collection of ground-truth depth with a true depth sensor. Thus, quantitative depth evaluation can only be conducted on synthetic data such as UCL 2019, even though current synthetic data generation is not realistic enough to mimic real colon surface properties. The UCL synthetic data contain some texture, such as blood veins and haustral folds. The proposed depth estimation is assessed on the UCL 2019 test data set with the evaluation metrics of Eigen et al. (2014) and compared with other state-of-the-art methods, namely, SfMLearner (Zhou et al., 2017), Monodepth1 (Godard et al., 2017), DDVO, Monodepth2 (Godard et al., 2019), HR Depth (Lyu et al., 2021), AF-SfMLearner (Shao et al., 2021), and AF-SfMLearner2 (Shao et al., 2022), in Table 1.

Qualitative analysis
Quantitative depth evaluation on real colonoscopy data like EndoMapper (Azagra et al., 2022) or LDPolypVideo (Ma et al., 2021) is challenging. Many of the real characteristics of the colon, like the reflectance of the surface and the light diffusion of different tissue types, are hard to mimic in a simulator. In addition, a real colon is not rigid; the deformability of the colon wall makes view synthesis even more problematic. We observed that the depth network left holes in the depth maps when trained on the low-textured CTM with just the image reconstruction loss. Low-textured regions often occur on the real colon wall, and handling these regions is critical for data-driven methods that rely on view synthesis. The valid mask described in Section 3.2 works well for removing pixels affected by specular lighting, lens artifacts, and zero optical flow regions. The data collection on the CTM was limited to the linear motion of the steady-speed motorized shaft on which the wide-angle camera was fixed inside the phantom. This limits the predictions of the depth network when it is introduced to real colon images with sharp turns. The depth consistency loss described in Section 3.3 helps the network learn geometric consistency and avoid depth map flickering between frames. The depth network trained on the CTM learns to predict depth directly from highly distorted images with the help of the generic image synthesizer described in Section 3.1, which can handle a variety of camera models. The model trained with a pinhole camera was tested against an unseen EndoMapper synthetic data set (Azagra et al., 2022). The model trained on the wide-angle camera was tested against real colonoscopy data sets like EndoMapper (Azagra et al., 2022) and LDPolypVideo (Ma et al., 2021). Notably, the model trained on a wide-angle camera performed better on real colonoscopy data sets that used wide FOV cameras than the model trained on synthetic data, as shown in Figure 4.
A model trained on rectified or perfect pinhole camera images, as in UCL 2019, is suboptimal for the target colonoscopy camera, which is distorted and wide-angled. If trained on raw images, as in the CTM, the network learns the distortion as part of the transfer function; the distortion is only weakly encoded, and the model is thus expected to be more robust. Table 2 demonstrates the impact of each loss when trained with the UCL synthetic data. The depth evaluation reflects the benefit of the valid mask in Section 3.2, which ignores pixels that do not contribute to the photometric loss. We also found that clipping off high photometric loss improves model performance. Even though the depth consistency loss in Section 3.3 leads to consistent depth in the sequence, it did not improve the quantitative results but helped qualitatively on the CTM data set.

Conclusion
This work presents a self-supervised depth estimation model for wide-angle colonoscopy cameras. To the best of our knowledge, this is the first time data from a wide-angle colonoscopy camera are used to estimate depth from a video sequence. Most previous works assumed a pinhole model for the colonoscopy camera, but real colonoscopy cameras have wide FOV lenses. Our network predicts depth directly on the highly distorted raw images typical of such real cameras. The pipeline also focuses on handling low-texture regions in the images, which generate little photometric loss, and on geometrically consistent depth estimation across the sequence. Our method fills a gap in modeling wide FOV cameras and low-texture regions in colonoscopy imaging. We also achieved near-real-time prediction with the depth estimation models, allowing them to operate alongside a colonoscopy robot.
A limitation of our work is that the UCL 2019 synthetic data set does not fully capture the complex surface properties and characteristics of real colon tissue. The absence of ground-truth depth maps for the CTM data set further limits the ability to quantitatively assess performance on it. Further evaluation on diverse and realistic data sets could enhance the generalizability and reliability of the proposed depth estimation method.

Data availability statement
The data sets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article.