TECHNOLOGY AND CODE article

Front. Robot. AI, 25 July 2023
Sec. Biomedical Robotics
Volume 10 - 2023 | https://doi.org/10.3389/frobt.2023.1212525

Self-supervised monocular depth estimation for high field of view colonoscopy cameras

  • 1Division of Imaging Science and Technology, School of Medicine, University of Dundee, Dundee, United Kingdom
  • 2Discipline of Computing, School of Science and Engineering, University of Dundee, Dundee, United Kingdom

Optical colonoscopy is the gold standard procedure to detect colorectal cancer, the fourth most common cancer in the United Kingdom. Up to 22%–28% of polyps can be missed during the procedure, and missed polyps are associated with interval cancer. A vision-based autonomous soft endorobot for colonoscopy can drastically improve the accuracy of the procedure by inspecting the colon more systematically with reduced discomfort. A three-dimensional understanding of the environment is essential for robot navigation and can also improve the adenoma detection rate. Monocular depth estimation with deep learning methods has progressed substantially, but collecting ground-truth depth maps remains a challenge, as no 3D camera can be fitted to a standard colonoscope. This work addresses this issue with a self-supervised monocular depth estimation model that learns depth directly from video sequences through view synthesis. In addition, our model accommodates the wide field-of-view cameras typically used in colonoscopy and specific challenges such as deformable surfaces, specular lighting, non-Lambertian surfaces, and high occlusion. We performed a quantitative evaluation on a synthetic data set and qualitative analyses on a colonoscopy training model and real colonoscopy videos in near real-time.

1 Introduction

Colorectal cancer (CRC) causes approximately 900,000 deaths worldwide each year (Bray et al., 2018) and around 16,800 deaths in the United Kingdom annually. A total of 42,900 new cases were reported in the UK in 2018, making CRC the fourth most common cancer, accounting for 11% of all new cancer cases. Optical colonoscopy is the gold standard procedure for CRC screening. However, its quality is strongly dependent on the gastroenterologist’s skill level (Baxter et al., 2009). A study by le Clercq et al. (2014) showed that around 60% of post-colonoscopy CRC cases can be attributed to lesions missed during a previous colonoscopy. Chromoendoscopy can increase the visibility of a lesion by spraying and rinsing the colon with a topical dye that increases the contrast of the colonic wall surface (Durr et al., 2014). However, this procedure takes roughly twice as long (Pohl et al., 2011), making it less practical than standard optical colonoscopy.

The recent introduction of computer-assisted technologies using artificial intelligence for colonoscopy has demonstrated promising results in polyp detection (Lee et al., 2020), size classification (Itoh et al., 2018), and colon coverage (Freedman et al., 2020). The field of robotics has also shown interest in new technologies for colonoscopy (Ciuti et al., 2020), including capsules (Formosa et al., 2019), endorobots (Manfredi, 2021; 2022), meshworm robots (Bernth et al., 2017), soft robots (Manfredi et al., 2019), and autonomous robots (Kang et al., 2021). A colonoscope is a long, flexible tubular instrument, generally 12 mm in diameter, with a single camera head, lights, an irrigation port, and an instrument channel port. The miniature camera and the lights allow visual inspection of the colon, and the irrigation port supplies water to help clean the colon of residual stool. The instrument port enables the passage of surgical tools to remove tissue or polyps for further examination. The size of the colonoscope is restricted by the size of the colon; thus, the number of additional sensors that can be included, such as three-dimensional (3D) cameras, is limited. This has led researchers to investigate 3D mapping from the images of a monocular camera inside the colon (Mahmood et al., 2018).

In recent years, monocular depth estimation has been studied with supervised learning (Nadeem and Kaufman, 2016), conditional random fields (Mahmood and Durr, 2018), generative adversarial networks (GANs) (Mahmood et al., 2018), conditional GANs (Rau et al., 2019), and self-supervised learning (Hwang et al., 2021). However, previous works have not considered that the colonoscope yields highly distorted images because it uses a high field-of-view (FOV) camera. Cheng et al. (2021) used synthetic data with ground-truth depth to train the baseline supervised depth model of a self-supervised pipeline and relied on optical flow from a pre-trained flow network for temporal structural consistency; however, optical flow is unreliable in a deformable environment such as the colon. In Bae et al. (2020), the training pipeline follows multiple steps, sparse depth from Structure from Motion (SfM), a depth network trained on the sparse depth, pixel-wise embedding, and multi-view reconstruction, which makes it slow and unsuitable to run alongside a robot. The main contributions of the proposed work are:

1. A self-supervised monocular depth estimation method specifically designed for wide FOV colonoscopy cameras, addressing the challenges posed by the distorted and wide-angle nature of the images.

2. A validity mask that handles low-texture surfaces in view synthesis by removing pixels affected by specular lighting, lens artifacts, and regions with zero optical flow, ensuring accurate depth estimation even in challenging areas of the colon.

2 Motivation

Colonoscopy cameras generally have a wide FOV that enables faster diagnosis with complete coverage of the colon wall. State-of-the-art colonoscopy devices have a 140°–170° FOV camera. As a result, these cameras suffer from high lens distortion, especially radial distortion, as shown in Figure 1; this distortion has not yet been studied sufficiently because most endoscopy depth estimation methods are trained on low FOV images or synthetic images without distortion.

FIGURE 1. Illustration of wide FOV colonoscopy camera and its distortion: raw and rectified images, respectively, when the camera is placed inside the colonoscopy training model (A) and (B), and in front of a calibration target (C) and (D). The yellow box in (B) and (D) crops out the black pixels after undistortion, causing a loss of FOV.

2.1 Low FOV cameras

Low FOV cameras have <120° horizontal FOV lenses, as shown in the UCL 2019 data set (Rau et al., 2019). The Brown–Conrady camera model (Kannala and Brandt, 2006) describes these lenses with both radial and tangential distortion. The UCL 2019 data set (Rau et al., 2019) contains synthetic images generated with the ideal pinhole camera approximation. Distortion is prevalent in real images because the lens magnification varies with increasing angular distance from the optical axis, and imperfect alignment of the lens or sensor can decenter the distortion. The Brown–Conrady (Kannala and Brandt, 2006) projection function π(P, i): P(x, y, z) → p(u, v) is defined as

$$
\begin{aligned}
x' &= x/z, \qquad y' = y/z,\\
x'' &= x'\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x' y' + p_2 (r^2 + 2 x'^2),\\
y'' &= y'\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y'^2) + 2 p_2 x' y',\\
u &= f_x x'' + c_x, \qquad v = f_y y'' + c_y,
\end{aligned} \tag{1}
$$

where P is a 3D camera point with coordinates (x, y, z), p is a 2D image point with coordinates (u, v), and $r^2 = x'^2 + y'^2$. The camera parameters i consist of the radial distortion coefficients (k1, k2, k3) and tangential distortion coefficients (p1, p2) of the lens, the focal lengths (fx, fy), and the principal point (cx, cy).
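For concreteness, a minimal PyTorch sketch of this projection for a batch of 3D points is shown below; the intrinsic and distortion values in the example are placeholders, not a calibration used in this work.

```python
import torch

def brown_conrady_project(P, fx, fy, cx, cy, k1, k2, k3, p1, p2):
    """Project 3D camera points P (N, 3) to pixel coordinates (N, 2)
    using the Brown-Conrady radial/tangential distortion model (Eq. 1)."""
    x = P[:, 0] / P[:, 2]                        # normalize by depth
    y = P[:, 1] / P[:, 2]
    r2 = x ** 2 + y ** 2
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x ** 2)
    y_d = y * radial + p1 * (r2 + 2 * y ** 2) + 2 * p2 * x * y
    return torch.stack([fx * x_d + cx, fy * y_d + cy], dim=1)

# Illustrative call with placeholder intrinsics (not a calibration from this work)
P = torch.tensor([[0.01, 0.02, 0.10], [-0.03, 0.01, 0.15]])
print(brown_conrady_project(P, fx=300.0, fy=300.0, cx=128.0, cy=128.0,
                            k1=-0.2, k2=0.05, k3=0.0, p1=0.001, p2=0.001))
```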

2.2 Wide FOV colonoscopy cameras

Wide FOV lenses can be modeled with the omnidirectional camera model (Scaramuzza et al., 2006), the unified camera model (UCM) (Geyer and Daniilidis, 2000), the extended UCM (eUCM) (Khomutenko et al., 2015), or the double sphere (DS) camera model (Usenko et al., 2018). The omnidirectional camera model, along with the pinhole radial distortion (Fryer and Brown, 1986) and Brown–Conrady (Kannala and Brandt, 2006) camera models, falls into the high-order polynomial distortion family. These models require solving for the root of a high-order polynomial during unprojection, which is computationally expensive. On the other hand, the UCM, eUCM, and DS fall into the unified camera model family, which is easy to compute and has a closed-form unprojection function. For brevity, we only describe the DS model in this article, but the pipeline can be extended to all unified camera models. The DS model has six camera parameters i: fx, fy, cx, cy, α, and ξ.

In the study by Geyer and Daniilidis (2000), it was observed that the UCM effectively represents systems with parabolic, hyperbolic, elliptic, and planar mirrors. Furthermore, the UCM has been successfully employed for cameras equipped with fisheye lenses (Ying and Hu, 2004). However, it is important to acknowledge that the UCM may not provide an ideal match for most fisheye lenses, often necessitating the incorporation of an additional distortion model. The UCM projects a 3D point onto the unit sphere and then onto the image plane of a pinhole camera that is shifted by ξ from the center of the unit sphere. The eUCM can be considered a generalization of the UCM, where the point is projected onto an ellipsoid symmetric around the z-axis using the coefficient β. Alternatively, the DS camera model (Usenko et al., 2018) is a more suitable choice for cameras with fisheye lenses. It offers a closed-form inverse and avoids computationally expensive trigonometric operations. In the DS model, a 3D point is projected onto two unit spheres with centers shifted by ξ; the point is then projected onto an image plane using a pinhole model shifted by α/(1 − α), which accounts for the remaining distortion when the camera's image plane is not perfectly aligned with the object plane. The projection function π(P, i): P(x, y, z) → p(u, v) is given as follows:

$$
\pi(P, i) = \begin{bmatrix} f_x \dfrac{x}{\alpha d_2 + (1 - \alpha)(\xi d_1 + z)} \\[2ex] f_y \dfrac{y}{\alpha d_2 + (1 - \alpha)(\xi d_1 + z)} \end{bmatrix} + \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \tag{2}
$$

where

$$
d_1 = \sqrt{x^2 + y^2 + z^2}, \qquad d_2 = \sqrt{x^2 + y^2 + (\xi d_1 + z)^2}. \tag{3}
$$

The unprojection function π⁻¹(p, i): p(u, v) → P(x, y, z) is given as follows:

$$
\pi^{-1}(p, i) = \frac{m_z \xi + \sqrt{m_z^2 + (1 - \xi^2) r^2}}{m_z^2 + r^2} \begin{bmatrix} m_x \\ m_y \\ m_z \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ \xi \end{bmatrix}, \tag{4}
$$

where

$$
\begin{aligned}
m_x &= \frac{u - c_x}{f_x}, \qquad m_y = \frac{v - c_y}{f_y}, \qquad r^2 = m_x^2 + m_y^2,\\
m_z &= \frac{1 - \alpha^2 r^2}{\alpha \sqrt{1 - (2\alpha - 1) r^2} + 1 - \alpha}.
\end{aligned} \tag{5}
$$
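A minimal PyTorch sketch of the DS projection and its closed-form unprojection, following Eqs 2–5, is given below; the camera parameters in the round-trip check are illustrative values, not the calibration of our camera.

```python
import torch

def ds_project(P, fx, fy, cx, cy, alpha, xi):
    """Double sphere projection (Eqs 2-3): 3D points P (N, 3) -> pixels (N, 2)."""
    x, y, z = P[:, 0], P[:, 1], P[:, 2]
    d1 = torch.sqrt(x ** 2 + y ** 2 + z ** 2)
    d2 = torch.sqrt(x ** 2 + y ** 2 + (xi * d1 + z) ** 2)
    denom = alpha * d2 + (1 - alpha) * (xi * d1 + z)
    return torch.stack([fx * x / denom + cx, fy * y / denom + cy], dim=1)

def ds_unproject(p, fx, fy, cx, cy, alpha, xi):
    """Closed-form double sphere unprojection (Eqs 4-5): pixels (N, 2) -> ray directions (N, 3)."""
    mx = (p[:, 0] - cx) / fx
    my = (p[:, 1] - cy) / fy
    r2 = mx ** 2 + my ** 2
    mz = (1 - alpha ** 2 * r2) / (alpha * torch.sqrt(1 - (2 * alpha - 1) * r2) + 1 - alpha)
    scale = (mz * xi + torch.sqrt(mz ** 2 + (1 - xi ** 2) * r2)) / (mz ** 2 + r2)
    return scale.unsqueeze(1) * torch.stack([mx, my, mz], dim=1) - torch.tensor([0.0, 0.0, xi])

# Round-trip check with illustrative parameters: any positive multiple of the
# unprojected ray re-projects to (approximately) the original pixel.
cam = dict(fx=380.0, fy=380.0, cx=384.0, cy=256.0, alpha=0.6, xi=-0.2)
pix = torch.tensor([[400.0, 300.0], [150.0, 80.0]])
rays = ds_unproject(pix, **cam)
print(ds_project(rays * 0.1, **cam))   # ~ pix
```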

3 Methods

In this work, we aim to estimate depth from a colonoscopy image stream via view synthesis. View synthesis enables us to train the depth estimation model in a self-supervised mode. The depth network takes one image at a time, the source Is or target It image, and predicts the corresponding per-pixel depth map. The target depth map Dt from the depth network and the relative pose Tt→s from the pose network enable reconstruction of the target image $\hat{I}_t$ from the source image Is by geometric projection. The pose network predicts a rigid transformation with six degrees of freedom. With a chosen camera model, the unprojection function π⁻¹ maps the target image coordinates It(p) to 3D space, forming a 3D point cloud, and the projection function π maps the 3D points back to image coordinates. An overview of the training pipeline is shown in Figure 2. Section 2 describes the camera models for lenses with varying degrees of distortion. For camera models that require finding the root of a high-order polynomial, a pre-calculated lookup table can be used for a computationally efficient unprojection operation, as sketched below.
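One way to realize such a lookup table is sketched below. This is an illustrative pattern rather than the exact implementation of our pipeline; unproject_fn stands for any camera model's unprojection routine.

```python
import torch

def build_unprojection_lut(height, width, unproject_fn):
    """Precompute the unprojection ray for every pixel once, so that camera models
    requiring an expensive (e.g., polynomial root-finding) unprojection pay that
    cost a single time rather than at every training step."""
    v, u = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                          torch.arange(width, dtype=torch.float32), indexing="ij")
    pixels = torch.stack([u.reshape(-1), v.reshape(-1)], dim=1)   # (H*W, 2) pixel coordinates
    rays = unproject_fn(pixels)                                   # (H*W, 3) rays, any camera model
    return rays.reshape(height, width, 3)

# At training time, a predicted depth map D of shape (H, W) is turned into a point
# cloud by scaling the cached rays, e.g., points = lut * D.unsqueeze(-1)
# (with the rays normalized to match how depth is defined, z-depth or range).
```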

FIGURE 2. Self-supervised depth training pipeline: input images (Is, It) are adjacent frames sampled from a video sequence. The depth network predicts independent per-pixel dense depth maps (Ds, Dt) from the source Is and target It images, respectively. The pose network takes the concatenated image sequence (Is, It) as input and predicts the rotation R and translation t that define the transformation T between the source and target images. The warp function takes the predicted depths and poses to warp the source image coordinates to the target or vice versa. The image synthesis projects the source image Is to the target image It with the warped image coordinates. The synthesized images are compared with ground-truth images to construct the loss. The image synthesis loss outliers are removed using the clip function. Image pixels affected by specular lighting and violation of the Lambertian surface assumption are ignored in loss calculation with the valid mask. The depth consistency loss ensures that the depth is consistent in the sequence. At inference, only the depth network will be used.

3.1 Generic image synthesis

Irrespective of the camera model (Eq. 1 or Eqs 2–4), the generic image synthesis block can reconstruct the target image from source images using the projection π and unprojection π⁻¹ functions, as shown in Figure 3. A point cloud Pt can be generated from the estimated depth Dt for the corresponding input image It as follows:

$$
P_t = \pi^{-1}(p_t, D_t), \tag{6}
$$

where pt is the set of image coordinates and Pt is the 3D point cloud corresponding to the target image It. The relative pose Tt→s from the target to the source image is estimated by the pose network and can be used to transform the target point cloud into the source point cloud Ps = Tt→sPt. With the projection model π, the 3D point cloud Ps is mapped to the source image coordinates ps, and the target image is synthesized by inverse warping the source image coordinates and sampling (Godard et al., 2017) with ζ:

$$
p_s = \pi\!\left(T_{t \to s}\, \pi^{-1}(p_t, D_t)\right), \qquad \hat{I}_t(i, j) = \zeta\!\left(I_s, p_s(i, j)\right). \tag{7}
$$

FIGURE 3. Generic image coordinate warp block: source Is and target It images are captured at times t − 1 and t. The depth Dt predicted by the depth network, along with the relative pose Tt→s estimated by the pose network, is used to synthesize the target image from the source image. The view synthesizing block allows the use of various camera models that can handle cameras with higher lens distortion. We have demonstrated this use case on the UCL 2019 synthetic data set with a 90° horizontal FOV camera (no lens distortion) and on our colonoscopy training model data set with a 132° horizontal FOV camera (high radial lens distortion).

The projected source image coordinates ps(i, j) are used to sample pixel values with the sampling function ζ to generate the target image $\hat{I}_t$. The differentiable spatial transformer network (Jaderberg et al., 2015) is used to synthesize images with bilinear interpolation.
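The sketch below shows one possible form of this generic warp in PyTorch, with F.grid_sample acting as the bilinear sampler ζ; the project_fn/unproject_fn arguments and the tensor shapes are assumptions standing in for the chosen camera model, not the exact interfaces of our implementation.

```python
import torch
import torch.nn.functional as F

def synthesize_target(I_s, D_t, T_t2s, project_fn, unproject_fn):
    """Warp the source image into the target view (Eqs 6-7).
    I_s: (B, 3, H, W) source image, D_t: (B, 1, H, W) target depth,
    T_t2s: (B, 4, 4) target-to-source pose; camera functions work on (B, H*W, ...) tensors."""
    B, _, H, W = I_s.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    p_t = torch.stack([u, v], dim=-1).reshape(1, -1, 2).expand(B, -1, -1)    # target pixel grid

    P_t = unproject_fn(p_t, D_t.reshape(B, -1, 1))                           # Eq. 6: target point cloud
    P_t_h = torch.cat([P_t, torch.ones_like(P_t[..., :1])], dim=-1)          # homogeneous coordinates
    P_s = torch.bmm(P_t_h, T_t2s.transpose(1, 2))[..., :3]                   # transform into source frame
    p_s = project_fn(P_s)                                                    # Eq. 7: source pixel coords

    # Normalize to [-1, 1] and sample the source image bilinearly (differentiable).
    grid = torch.stack([2 * p_s[..., 0] / (W - 1) - 1,
                        2 * p_s[..., 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(I_s, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```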

3.2 Image reconstruction loss

The photometric loss Lv between the reference frame and the reconstructed frame acts as a self-supervised loss for the depth network. Following the previous work of Zhou et al. (2017), we use structural similarity (SSIM) (Wang et al., 2004) along with the least absolute deviation (L1) for the photometric loss, weighted by η. This leverages the ability of SSIM, denoted S, to handle complex illumination changes and adds robustness to outliers:

$$
L_v^s\!\left(I_t, \hat{I}_t\right) = \frac{1}{|V|} \sum_{p \in V} \left[ \eta\, \frac{1 - S\!\left(I_t(p), \hat{I}_t(p)\right)}{2} + (1 - \eta) \left| I_t(p) - \hat{I}_t(p) \right| \right]. \tag{8}
$$

Here, V denotes the set of valid pixels and |V| its cardinality. Masking out pixels that violate the non-zero optical flow and Lambertian surface assumptions with V avoids propagating corrupted gradients into the networks. The valid mask is computed as follows:

$$
V = M_{\text{ego}} \cdot \left[\, \left| I_t(p) - \hat{I}_t(p) \right| < \left| I_t(p) - I_s(p) \right| \,\right]. \tag{9}
$$

Here, Mego is a binary mask that initially marks all mapped pixels as valid. Pixels that appear static between the source and target images, caused by the absence of ego motion, specular lighting, lens artifacts, or low texture, are then marked invalid. Pixels that violate these assumptions produce a large photometric loss and hence large gradients that can degrade performance. Following Zhou et al. (2018), clipping the loss with θ, the p-th percentile of Lv, removes these outliers as Lv(si) = min(si, θ), where si is the i-th cost in Lv.
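A compact PyTorch sketch of this masked and clipped photometric loss is given below; ssim_fn is assumed to be an available per-pixel SSIM implementation, and the 95th-percentile clipping value is illustrative rather than the value used in our experiments.

```python
import torch

def photometric_loss(I_t, I_t_hat, I_s, M_ego, ssim_fn, eta=0.85, clip_pct=95):
    """Masked SSIM + L1 photometric loss with percentile clipping (Eqs 8-9).
    Images: (B, 3, H, W); M_ego: (B, 1, H, W) binary mask; ssim_fn returns per-pixel SSIM."""
    l1_warp = (I_t - I_t_hat).abs().mean(1, keepdim=True)
    l1_static = (I_t - I_s).abs().mean(1, keepdim=True)
    valid = M_ego * (l1_warp < l1_static).float()              # Eq. 9: drop "static" pixels

    ssim = ssim_fn(I_t, I_t_hat).mean(1, keepdim=True)         # per-pixel structural similarity
    cost = eta * (1 - ssim) / 2 + (1 - eta) * l1_warp          # Eq. 8, per pixel

    theta = torch.quantile(cost[valid > 0], clip_pct / 100.0)  # clip outliers (Zhou et al., 2018)
    cost = torch.minimum(cost, theta)
    return (cost * valid).sum() / valid.sum().clamp(min=1.0)
```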

3.3 Depth consistency loss

The depth consistency loss Lc (Bian et al., 2021) ensures that the depth maps Ds and Dt, estimated by the depth network from the source and target images Is and It, adhere to the same 3D scene. Minimizing this loss encourages depth consistency not only within each image pair but also across the entire sequence through overlapping image batches. The depth consistency loss is defined as follows:

$$
L_c^s\!\left(D_s, D_t\right) = \frac{1}{|V|} \sum_{p \in V} DC^s, \qquad
DC^s = \frac{\left| \hat{D}_s^t(p) - \hat{D}_s(p) \right|}{\hat{D}_s^t(p) + \hat{D}_s(p)}, \tag{10}
$$

where $\hat{D}_s^t$ is the depth map Dt projected with the pose Tt→s, which corresponds to the depth at image Is. However, $\hat{D}_s^t$ cannot be compared directly with Ds because the projected coordinates do not lie on the pixel grid of Ds; thus, $\hat{D}_s$ is generated by differentiable bilinear interpolation of Ds at the projected coordinates and compared with $\hat{D}_s^t$. The inverse of DC acts as an occlusion mask Mo that masks out pixels occluded between the source and target views, mainly around the haustral folds of the colon. The image reconstruction loss Lv is reweighted with the Mo mask as follows:

$$
L_v^s = \frac{1}{|V|} \sum_{p \in V} M_o^s\, L_v^s(p), \qquad M_o^s = 1 - DC^s. \tag{11}
$$
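As a sketch, assuming the projected depth $\hat{D}_s^t$ and the interpolated source depth $\hat{D}_s$ have already been produced by the warp of Section 3.1, the consistency term and the occlusion mask can be computed as follows:

```python
import torch

def depth_consistency(D_st, D_s_interp, eps=1e-7):
    """Per-pixel depth inconsistency DC^s and occlusion mask M_o (Eqs 10-11).
    D_st: target depth projected into the source view; D_s_interp: source depth
    bilinearly sampled at the same (sub-pixel) locations. Both (B, 1, H, W)."""
    dc = (D_st - D_s_interp).abs() / (D_st + D_s_interp + eps)
    consistency_loss = dc.mean()
    occlusion_mask = 1.0 - dc          # down-weights occluded pixels in the photometric loss
    return consistency_loss, occlusion_mask
```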

3.4 Edge-guided smoothness loss

The depth maps are regularized with an edge-guided smoothness loss because homogeneous, low-texture regions provide little photometric signal for image reconstruction. The smoothness loss is adapted from Wang et al. (2018) as follows:

$$
L_e^s\!\left(I_t, D_t\right) = \sum_{p} \left| \nabla D_t^{*}(p) \right| \cdot e^{-\left| \nabla I_t(p) \right|}, \tag{12}
$$

where ∇ denotes the first-order gradient and $D_t^{*} = (1/D_t)/\mathrm{mean}(1/D_t)$ is the mean-normalized inverse depth. To avoid the optimizer getting trapped in local minima (Zhou et al., 2017; Godard et al., 2019), the depth estimation and loss are computed at multiple scales. The total loss is averaged over the scales (s = 1, …, 4) and image batches as follows:

$$
L = \frac{1}{4} \sum_{s=1}^{4} \left( \sigma_1 L_v^s + \sigma_2 L_c^s + \sigma_3 L_e^s \right). \tag{13}
$$
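A PyTorch sketch of the edge-guided smoothness term and of the multi-scale total loss follows; the per-scale photometric and consistency losses are assumed to be computed as in the previous sections, and the weights are those given in Section 4.1.

```python
import torch

def edge_guided_smoothness(I_t, D_t):
    """Edge-aware smoothness on the mean-normalized inverse depth (Eq. 12)."""
    disp = (1.0 / D_t) / (1.0 / D_t).mean(dim=(2, 3), keepdim=True)
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (I_t[:, :, :, 1:] - I_t[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (I_t[:, :, 1:, :] - I_t[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def total_loss(per_scale_losses, s1=1.0, s2=0.5, s3=0.001):
    """Average the weighted losses over the four scales (Eq. 13).
    per_scale_losses: list of (L_v, L_c, L_e) tuples, one per scale."""
    return sum(s1 * lv + s2 * lc + s3 * le for lv, lc, le in per_scale_losses) / len(per_scale_losses)
```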

4 Experiments

In this work, all models are trained on the UCL 2019 synthetic (Rau et al., 2019) and colonoscopy training model (CTM) data sets and evaluated on a variety of synthetic (Rau et al., 2019; Azagra et al., 2022) and real (Azagra et al., 2022; Ma et al., 2021) colonoscopy data.

4.1 Implementation details

The depth network is inspired by Godard et al. (2019), with ResNet18 (He et al., 2016) as the encoder and a multiscale decoder that predicts depth at four different scales. The pose network shares the same encoder architecture as the depth network, followed by four 2D convolution layers. The models are built in PyTorch (Paszke et al., 2017) and trained with the Adam optimizer (Kingma and Ba, 2014) to minimize the training error. The models are trained on an Nvidia A6000 GPU with a batch size of 25 for 20 epochs, with an initial learning rate of 0.0001 that is halved after 15 epochs. The depth network sigmoid output d is scaled as D = 1/(d(1/dmax − 1/dmin) + 1/dmin), where dmin = 0.1 and dmax = 20.0 units, corresponding to 0.1–20 cm. The network input image size is 256 × 256 (height × width) for the UCL 2019 synthetic data set (Rau et al., 2019) and 512 × 768 for the CTM. The hyperparameters are set empirically as η = 0.85, σ1 = 1.0, σ2 = 0.5, and σ3 = 0.001. We apply horizontal flipping to 50% of the images in the UCL 2019 data set but not in the CTM data set because of the off-centered principal point of the wide-angle camera. We apply random brightness, contrast, saturation, and hue jitter within the ±0.2, ±0.2, ±0.2, and ±0.1 ranges, respectively. For stability, the first epoch of training is a warm-up run with the minimum reprojection loss (Godard et al., 2019) when training on the CTM, whereas no warm-up is used for the UCL 2019 synthetic data set. The depth estimation model runs at 10 FPS (frames per second) on the GPU at inference, making it suitable for robotic control.
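A small sketch of this output scaling, assuming the inverse-depth mapping written above, is:

```python
import torch

def sigmoid_to_depth(d, d_min=0.1, d_max=20.0):
    """Map the depth network's sigmoid output d in (0, 1) to depth in [d_min, d_max]
    units via inverse-depth scaling: d near 0 maps to d_min, d near 1 maps to d_max."""
    inv_depth = d * (1.0 / d_max - 1.0 / d_min) + 1.0 / d_min
    return 1.0 / inv_depth

print(sigmoid_to_depth(torch.tensor([0.01, 0.5, 0.99])))   # ~0.10, ~0.20, ~6.7 units
```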

4.1.1 UCL 2019 synthetic data set

The UCL 2019 synthetic data (Rau et al., 2019) are generated from human CT colonography with manual segmentation and meshing. The Unity engine is used to simulate the RGB images with a virtual pinhole camera (90° FOV and two attached light sources) and their corresponding depth maps with a maximum range of 20 cm. The data set contains 15,776 RGB images and ground-truth depth maps, split 8:1:1 into train, validation, and test sets. The data set is recorded as nine subsets, each combining one of three lighting conditions with one of three materials. The light sources vary in spot angle, range, color, and intensity, while the materials vary in color, reflectiveness, and smoothness.

4.1.2 Colonoscopy training model data set

Our colonoscopy training model data set is recorded with an off-the-shelf full HD camera (1920 × 1080, 30 Hz, 132° HFOV, 65° VFOV, 158° DFOV, FID 45-20-70, white ring light around the camera) inside a plastic phantom used for training medical professionals. The phantom mimics the 1:1 anatomy of the human colon, including the internal diameter, overall length, and haustral folds, simulating images from an optical colonoscopy. The camera is attached to a motorized shaft that is moved forward and backward while recording videos. The data set contains around 10,000 images, split into an 8:2 train and validation set with varying internal and external light settings, and does not provide ground-truth depth maps.

4.2 Quantitative study

The unique structural restrictions of a real colon preclude the collection of ground-truth depth with a true depth sensor. Thus, quantitative depth evaluation can only be conducted on synthetic data such as UCL 2019, even though current synthetic data generation is not realistic enough to mimic real colon surface properties. The UCL synthetic data contain some texture, such as blood vessels and haustral folds. The proposed depth estimation is assessed on the UCL 2019 test data set with the evaluation metrics of Eigen et al. (2014) and compared with other state-of-the-art methods, namely, SfMLearner (Zhou et al., 2017), Monodepth1 (Godard et al., 2017), DDVO (Wang et al., 2018), Monodepth2 (Godard et al., 2019), HR-Depth (Lyu et al., 2021), AF-SfMLearner (Shao et al., 2021), and AF-SfMLearner2 (Shao et al., 2022), in Table 1.

TABLE 1. Quantitative study of the proposed depth model on UCL 2019 synthetic data. Best values are in bold, and second-best values are underlined. Abs Rel, Sq Rel, RMSE, and RMSE log are in mm; δ < 1.25, δ < 1.25², and δ < 1.25³ are in percentage.

4.3 Qualitative analysis

Quantitative depth evaluation on real colonoscopy data such as EndoMapper (Azagra et al., 2022) or LDPolypVideo (Ma et al., 2021) is challenging. Many of the real characteristics of the colon, such as the reflectance of the surface and the light diffusion of different tissue types, are hard to mimic in a simulator. In addition, a real colon is not rigid; the deformable colon wall makes view synthesis even more problematic. It was observed that the depth network left holes in the depth maps when trained on the low-textured CTM with just the image reconstruction loss. Low-texture regions often occur on the real colon wall, and handling these regions is critical for data-driven methods that rely on view synthesis. The valid mask described in Section 3.2 works well for removing pixels affected by specular lighting, lens artifacts, and zero optical flow regions. The data collection on the CTM was limited to the linear motion of the steady-speed motorized shaft inside the phantom on which the wide-angle camera was fixed. This limits the predictions of the depth network when it is introduced to real colon images with sharp turns. The depth consistency loss described in Section 3.3 helps the network learn geometric consistency and avoid depth map flickering between frames. The depth network trained on the CTM learns to predict depth directly from highly distorted images with the help of the generic image synthesizer described in Section 3.1, which can handle a variety of camera models. The model trained with a pinhole camera was tested on the unseen EndoMapper synthetic data set (Azagra et al., 2022). The model trained on the wide-angle camera was tested on real colonoscopy data sets, namely, EndoMapper (Azagra et al., 2022) and LDPolypVideo (Ma et al., 2021). Notably, the model trained on the wide-angle camera performed better on real colonoscopy data sets, which use wide FOV cameras, than the model trained on synthetic data, as shown in Figure 4. A model trained on rectified or perfect pinhole camera images, as in UCL 2019, is suboptimal for the target colonoscopy camera, which is distorted and wide-angled. When trained on raw images, as with the CTM, the network learns the lens distortion as part of its transfer function; since the distortion is only weakly (implicitly) encoded, the model is expected to be more robust.

FIGURE 4. Qualitative analysis on synthetic data sets like UCL 2019 synthetic and EndoMapper synthetic data sets, CTM data set, and real colonoscopy images like LDPolypVideo and EndoMapper data set (top to bottom) (Azagra et al., 2022; Ma et al., 2021).

4.4 Ablation study

Table 2 demonstrates the impact of each loss when trained with UCL synthetic data. The depth evaluation reflects the benefit of the valid mask of Section 3.2, which ignores pixels that do not contribute to the photometric loss. Clipping off high photometric losses is also found to improve model performance. Even though the depth consistency loss of Section 3.3 leads to consistent depth across the sequence, it did not improve the quantitative results, but it helped qualitatively on the CTM data set.

TABLE 2. Ablation study on UCL 2019 synthetic data. View, Valid, Consistency, and Clip represent image view synthesis, the valid mask, depth consistency, and loss clipping, respectively. Best values are in bold, and second-best values are underlined. Abs Rel, Sq Rel, RMSE, and RMSE log are in mm; δ < 1.25, δ < 1.25², and δ < 1.25³ are in percentage.

5 Conclusion

This work presents a self-supervised depth estimation model for wide-angle colonoscopy cameras. To the best of our knowledge, this is the first time data from a wide-angle colonoscopy camera have been used to estimate depth from a video sequence. Most previous works assumed a pinhole model for the colonoscopy camera, but real colonoscopy cameras have wide FOV lenses. Our network predicts depth directly on the highly distorted raw images typical of such cameras. The pipeline also focuses on handling low-texture regions in the images, which generate little photometric loss, and on geometrically consistent depth estimation across the sequence. Our methods fill a gap in modeling wide FOV cameras and low-texture regions in colonoscopy imaging. We also achieve near-real-time prediction with the depth estimation model, allowing it to operate alongside a colonoscopy robot.

A limitation of our work is that the UCL 2019 synthetic data set does not fully capture the complex surface properties and characteristics of real colon tissue. The absence of ground-truth depth maps for the CTM data set further limits the ability to quantitatively assess performance on this data set. Further exploration and evaluation of diverse and realistic data sets can enhance the generalizability and reliability of the proposed depth estimation method.

Data availability statement

The data sets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article.

Author contributions

AM contributed to the design and implementation of the code, experimental setup, measurement, and data analysis and drafted the first version of the manuscript. LiM, ET, and LdM contributed to the scientific contents, coordination, and manuscript writing. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) (grant no. EP/W00433X/1).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, editors, and reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Azagra, P., Sostres, C., Ferrandez, Á., Riazuelo, L., Tomasini, C., Barbed, O. L., et al. (2022). EndoMapper dataset of complete calibrated endoscopy. arXiv preprint arXiv:2204.14240.

Bae, G., Budvytis, I., Yeung, C.-K., and Cipolla, R. (2020). “Deep multi-view stereo for dense 3d reconstruction from monocular endoscopic video,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 29 September 2020 (Springer), 774–783. doi:10.1007/978-3-030-59716-0_74

Baxter, N. N., Goldwasser, M. A., Paszat, L. F., Saskin, R., Urbach, D. R., and Rabeneck, L. (2009). Association of colonoscopy and death from colorectal cancer. Ann. Intern. Med. 150, 1–8. doi:10.7326/0003-4819-150-1-200901060-00306

Bernth, J. E., Arezzo, A., and Liu, H. (2017). A novel robotic meshworm with segment-bending anchoring for colonoscopy. IEEE Robotics Automation Lett. 2, 1718–1724. doi:10.1109/lra.2017.2678540

Bian, J.-W., Zhan, H., Wang, N., Li, Z., Zhang, L., Shen, C., et al. (2021). Unsupervised scale-consistent depth learning from video. Int. J. Comput. Vis. 129, 2548–2564. doi:10.1007/s11263-021-01484-6

Bray, F., Ferlay, J., Soerjomataram, I., Siegel, R. L., Torre, L. A., and Jemal, A. (2018). Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA a cancer J. Clin. 68, 394–424. doi:10.3322/caac.21492

Cheng, K., Ma, Y., Sun, B., Li, Y., and Chen, X. (2021). “Depth estimation for colonoscopy images with self-supervised learning from videos,” in International conference on medical image computing and computer-assisted intervention, 21 September 2021 (Springer), 119–128. doi:10.1007/978-3-030-87231-1_12

Ciuti, G., Skonieczna-Zydecka, K., Marlicz, W., Iacovacci, V., Liu, H., Stoyanov, D., et al. (2020). Frontiers of robotic colonoscopy: A comprehensive review of robotic colonoscopes and technologies. J. Clin. Med. 9, 1648. doi:10.3390/jcm9061648

Durr, N. J., González, G., and Parot, V. (2014). 3D imaging techniques for improved colonoscopy.

Eigen, D., Puhrsch, C., and Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. Adv. neural Inf. Process. Syst. 27. doi:10.5555/2969033.2969091

Formosa, G. A., Prendergast, J. M., Edmundowicz, S. A., and Rentschler, M. E. (2019). Novel optimization-based design and surgical evaluation of a treaded robotic capsule colonoscope. IEEE Trans. Robotics 36, 545–552. doi:10.1109/tro.2019.2949466

Freedman, D., Blau, Y., Katzir, L., Aides, A., Shimshoni, I., Veikherman, D., et al. (2020). Detecting deficient coverage in colonoscopies. IEEE Trans. Med. Imaging 39, 3451–3462. doi:10.1109/tmi.2020.2994221

Fryer, J. G., and Brown, D. C. (1986). Lens distortion for close-range photogrammetry. Photogrammetric Eng. remote Sens. 52, 51–58.

Geyer, C., and Daniilidis, K. (2000). “A unifying theory for central panoramic systems and practical implications,” in Computer vision — ECCV 2000. ECCV 2000. Lecture notes in computer science (Berlin, Heidelberg: Springer), Vol. 1843. doi:10.1007/3-540-45053-X_29

Godard, C., Aodha, O. M., and Brostow, G. J. (2017). “Unsupervised monocular depth estimation with left-right consistency,” in 2017 IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, 6602–6611. doi:10.1109/CVPR.2017.699

Godard, C., Aodha, O. M., Firman, M., and Brostow, G. (2019). “Digging into self-supervised monocular depth estimation,” in 2019 IEEE/CVF international conference on computer vision (ICCV), Seoul, Korea (South), 3827–3837. doi:10.1109/ICCV.2019.00393

He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, 770–778. doi:10.1109/CVPR.2016.90

Hwang, S.-J., Park, S.-J., Kim, G.-M., and Baek, J.-H. (2021). Unsupervised monocular depth estimation for colonoscope system using feedback network. Sensors 21, 2691. doi:10.3390/s21082691

Itoh, H., Roth, H. R., Lu, L., Oda, M., Misawa, M., Mori, Y., et al. (2018). “Towards automated colonoscopy diagnosis: Binary polyp size estimation via unsupervised depth learning,” in International conference on medical image computing and computer-assisted intervention, 26 September 2018 (Springer), 611–619. doi:10.1007/978-3-030-00934-2_68

Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. Adv. neural Inf. Process. Syst. 28.

Kang, M., Joe, S., An, T., Jang, H., and Kim, B. (2021). A novel robotic colonoscopy system integrating feeding and steering mechanisms with self-propelled paddling locomotion: A pilot study. Mechatronics 73, 102478. doi:10.1016/j.mechatronics.2020.102478

Kannala, J., and Brandt, S. S. (2006). A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE Trans. pattern analysis Mach. Intell. 28, 1335–1340. doi:10.1109/tpami.2006.153

Khomutenko, B., Garcia, G., and Martinet, P. (2015). An enhanced unified camera model. IEEE Robotics Automation Lett. 1, 137–144. doi:10.1109/lra.2015.2502921

Kingma, D. P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint.arXiv:1412.6980.

le Clercq, C. M., Bouwens, M. W., Rondagh, E. J., Bakker, C. M., Keulen, E. T., de Ridder, R. J., et al. (2014). Postcolonoscopy colorectal cancers are preventable: A population-based study. Gut 63, 957–963. doi:10.1136/gutjnl-2013-304880

Lee, J. Y., Jeong, J., Song, E. M., Ha, C., Lee, H. J., Koo, J. E., et al. (2020). Real-time detection of colon polyps during colonoscopy using deep learning: Systematic validation with four independent datasets. Sci. Rep. 10, 8379–9. doi:10.1038/s41598-020-65387-1

Lyu, X., Liu, L., Wang, M., Kong, X., Liu, L., Liu, Y., et al. (2021). “Hr-depth: High resolution self-supervised monocular depth estimation,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2294–2301. doi:10.1609/aaai.v35i3.16329

Ma, Y., Chen, X., Cheng, K., Li, Y., and Sun, B. (2021). “Ldpolypvideo benchmark: A large-scale colonoscopy video dataset of diverse polyps,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 21 September 2021 (Springer), 387–396. doi:10.1007/978-3-030-87240-3_37

Mahmood, F., Chen, R., and Durr, N. J. (2018). Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Trans. Med. imaging 37, 2572–2581. doi:10.1109/tmi.2018.2842767

Mahmood, F., and Durr, N. J. (2018). Deep learning and conditional random fields-based depth estimation and topographical reconstruction from conventional endoscopy. Med. image Anal. 48, 230–243. doi:10.1016/j.media.2018.06.005

Manfredi, L., Capoccia, E., Ciuti, G., and Cuschieri, A. (2019). A soft pneumatic inchworm double balloon (spid) for colonoscopy. Sci. Rep. 9, 11109–9. doi:10.1038/s41598-019-47320-3

Manfredi, L. (2022). Endorobotics: Design, R&D and future trends. Academic Press.

Manfredi, L. (2021). Endorobots for colonoscopy: Design challenges and available technologies. Front. Robotics AI 8, 705454. doi:10.3389/frobt.2021.705454

Nadeem, S., and Kaufman, A. (2016). Depth reconstruction and computer-aided polyp detection in optical colonoscopy video frames. arXiv preprint arXiv:1609.01329.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). “Automatic differentiation in pytorch,” in NIPS-W.

Pohl, J., Schneider, A., Vogell, H., Mayer, G., Kaiser, G., and Ell, C. (2011). Pancolonic chromoendoscopy with indigo carmine versus standard colonoscopy for detection of neoplastic lesions: A randomised two-centre trial. Gut 60, 485–490. doi:10.1136/gut.2010.229534

Rau, A., Edwards, P., Ahmad, O. F., Riordan, P., Janatka, M., Lovat, L. B., et al. (2019). Implicit domain adaptation with conditional generative adversarial networks for depth prediction in endoscopy. Int. J. Comput. assisted radiology Surg. 14, 1167–1176. doi:10.1007/s11548-019-01962-w

Scaramuzza, D., Martinelli, A., and Siegwart, R. (2006). “A flexible technique for accurate omnidirectional camera calibration and structure from motion,” in Fourth IEEE International Conference on Computer Vision Systems (ICVS’06), New York, NY, 45–45. doi:10.1109/ICVS.2006.3

Shao, S., Pei, Z., Chen, W., Zhang, B., Wu, X., Sun, D., et al. (2021). “Self-supervised learning for monocular depth estimation on minimally invasive surgery scenes,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi an, China, 30 May 2021 - 05 June 2021 (IEEE), 7159–7165. doi:10.1109/ICRA48506.2021.9561508

Shao, S., Pei, Z., Chen, W., Zhu, W., Wu, X., Sun, D., et al. (2022). Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue. Med. image Anal. 77, 102338. doi:10.1016/j.media.2021.102338

Usenko, V., Demmel, N., and Cremers, D. (2018). “The double sphere camera model,” in 2018 International Conference on 3D Vision (3DV) (IEEE), 552–560.

Wang, C., Buenaposada, J. M., Zhu, R., and Lucey, S. (2018). “Learning depth from monocular videos using direct methods,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2022–2030.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Trans. image Process. 13, 600–612. doi:10.1109/tip.2003.819861

Ying, X., and Hu, Z. (2004). “Can we consider central catadioptric cameras and fisheye cameras within a unified imaging model,” in Computer Vision – ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11–14, 2004, Proceedings, Part I (Springer), 442–455.

Zhou, L., Ye, J., Abello, M., Wang, S., and Kaess, M. (2018). Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss. arXiv preprint arXiv:1812.03368.

Zhou, T., Brown, M., Snavely, N., and Lowe, D. G. (2017). “Unsupervised learning of depth and ego-motion from video,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 1851–1858.

Keywords: colonoscopy, depth estimation, wide-angle camera, endorobot, navigation

Citation: Mathew A, Magerand L, Trucco E and Manfredi L (2023) Self-supervised monocular depth estimation for high field of view colonoscopy cameras. Front. Robot. AI 10:1212525. doi: 10.3389/frobt.2023.1212525

Received: 26 April 2023; Accepted: 26 June 2023;
Published: 25 July 2023.

Edited by:

Emanuele Lindo Secco, Liverpool Hope University, United Kingdom

Reviewed by:

Hang Su, Fondazione Politecnico di Milano, Italy
Chao Hu, Zhejiang University, China

Copyright © 2023 Mathew, Magerand, Trucco and Manfredi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Luigi Manfredi, mail@luigimanfredi.com
