Edited by: Inti Zlobec, University of Bern, Switzerland
Reviewed by: Pier Paolo Piccaluga, University of Bologna, Italy; Thomas Menter, University Hospital of Basel, Switzerland
This article was submitted to Pathology, a section of the journal Frontiers in Medicine
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The widespread adoption of whole slide imaging has increased the demand for effective and efficient gigapixel image analysis. Deep learning is at the forefront of computer vision, showcasing significant improvements over previous methodologies in visual understanding. However, whole slide images have billions of pixels and suffer from high morphological heterogeneity as well as from different types of artifacts. Collectively, these impede the conventional use of deep learning. For the clinical translation of deep learning solutions to become a reality, these challenges need to be addressed. In this paper, we review work on the interdisciplinary attempt of training deep neural networks using whole slide images, and highlight the different ideas underlying these methodologies.
The adoption of digital pathology into the clinic will arguably be one of the most disruptive technologies introduced into the routine working environment of pathologists. Digital pathology has emerged with the digitization of patient tissue samples and in particular the use of digital whole slide images (WSIs). These can be distributed globally for diagnostic, teaching, and research purposes. Validation studies have shown correlation between digital diagnosis and glass-based diagnosis (
Recently, NHS Greater Glasgow and Clyde, one of the largest single site pathology services in Europe, has begun proceedings to undergo full digitization. As the adoption of digital pathology becomes wider, automated image analysis of tissue morphology has the potential to further establish itself in pathology and ultimately decrease the workload of pathologists, reduce turnaround times for reporting, and standardize clinical practices. For example, known or novel biomarkers and histopathological features can be automatically quantified (
Successful application of deep learning to WSIs has the potential to create new clinical tools that surpass current clinical approaches in terms of accuracy, reproducibility, and objectivity while also providing new insights on various pathologies. However, WSIs are multi-gigabyte images with typical resolutions of 100,000 × 100,000 pixels, present high morphological variance, and often contain various types of artifacts. These conditions preclude the direct application of conventional deep learning techniques. Instead, practitioners are faced with two non-trivial challenges. On the one hand, there is the visual understanding of the images, impeded by the morphological variance, artifacts, and typically small data sets; on the other, the inability of current hardware to facilitate learning from images of such high resolution, thereby requiring some form of dimensionality reduction of the images. These two problems are sometimes referred to as the
The majority of WSIs are captured using brightfield illumination, such as for slides stained with clinically routine haematoxylin and eosin (H&E). The wider accessibility of H&E stained WSIs, compared to more bespoke labeling reagents, at present makes this modality more attractive for deep learning applications. H&E stained tissue is excellent for the characterization of morphology within a tissue sample, consistent with its long-standing use in clinical practice.
However, H&E stained slides lack
Unlike in numerous other fields which have adopted supervised deep learning techniques (
There are currently multiple whole slide scanners from different vendors available on the market with the capacity for both brightfield and fluorescence imaging. Each scanner captures images using different compression types and sizes, illumination, objectives, and resolution and also outputs the images in a different proprietary file format. The lack of a universal image format can delay the curation of large data sets. The field of radiology has overcome this issue with the adoption of the open DICOM file format, allowing large image data sets to be accessed and interrogated (
To be clinically translatable, deep learning algorithms must work across large patient populations and generalize over image artifacts and color variability in staining (
Examples of artifacts in both fluorescence and brightfield captured images.
Through training, the human brain can become adept at ignoring artifacts and staining variability, and honing in on the visual information necessary for an accurate diagnosis. To facilitate an analogous outcome in deep learning models, there are generally two approaches that can be followed. The first involves explicit removal of artifacts (e.g., using image filters), as well as the normalization of color variability (
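As a minimal illustration of color normalization, the sketch below matches the per-channel mean and standard deviation of a source image to those of a reference image. This is a simplified, RGB-space variant of Reinhard-style normalization (which is normally performed in a perceptual color space such as LAB); the function name `normalize_stain` is hypothetical.

```python
import numpy as np

def normalize_stain(source, target):
    """Match the per-channel mean and standard deviation of `source`
    to those of `target` -- a simplified, RGB-space sketch of
    Reinhard-style color normalization."""
    source = source.astype(np.float64)
    target = target.astype(np.float64)
    out = np.empty_like(source)
    for c in range(source.shape[-1]):
        s_mean, s_std = source[..., c].mean(), source[..., c].std()
        t_mean, t_std = target[..., c].mean(), target[..., c].std()
        # Re-center and re-scale each channel to the target statistics.
        out[..., c] = (source[..., c] - s_mean) / (s_std + 1e-8) * t_std + t_mean
    return np.clip(out, 0, 255).astype(np.uint8)
```

In practice, stain-specific methods that deconvolve the haematoxylin and eosin components tend to outperform such global statistics matching, but the underlying idea of mapping all slides to a common color distribution is the same.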
Most successful approaches to training deep learning models on WSIs do not use the whole image as input and instead extract and use only a small number of patches (
Patch level annotations enable strong supervision since all of the extracted training patches have class labels. Typically, patch based annotations are derived from pixel level annotations, which requires experts to annotate all pixels. For instance, given a WSI which contains cancerous tissue, a pathologist would need to localize and annotate all cancerous cells.
A simple approach to patch based learning would make use of all tiled (i.e., non-overlapping) patches. Nevertheless, this simplicity comes at the cost of excessive computational and memory overhead, along with a number of other issues, such as imbalanced classes and slow training. Randomly sampling patches may lead to an even higher class imbalance given that in most cases a patch is much smaller than the original WSI. It is therefore imperative that sampling is guided.
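The exhaustive tiling strategy above can be sketched in a few lines. In practice WSIs are read lazily through libraries such as OpenSlide rather than loaded whole into memory; the plain-NumPy function below (the name `tile_patches` is hypothetical) only illustrates the non-overlapping grid.

```python
import numpy as np

def tile_patches(image, patch_size):
    """Split an H x W (x C) array into non-overlapping patches of
    side `patch_size`; edge remainders that do not fill a whole
    patch are discarded."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches
```

For a 100,000 × 100,000 WSI and 256 × 256 patches, this grid already yields over 150,000 patches, which makes the computational and class-imbalance concerns discussed above concrete.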
One way to guide the sampling procedure is with the use of patch level annotations. For example, on breast cancer metastasis detection, in multiple papers, patches from normal and tumor regions were extracted based on pixel level labels that were provided by pathologists (
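Guided sampling from pixel-level annotations can be sketched as rejection sampling against the expert-provided mask: candidate patch locations are drawn at random and kept only if the underlying region is sufficiently pure. The function name `sample_labelled_patches` and the purity threshold are illustrative assumptions, not a specific published protocol.

```python
import numpy as np

def sample_labelled_patches(mask, patch_size, n, min_fraction=0.9, seed=0):
    """Sample `n` top-left patch coordinates whose binary annotation
    mask (1 = tumor) is at least `min_fraction` positive, mimicking
    guided sampling from pathologist pixel-level labels."""
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    coords = []
    while len(coords) < n:
        y = rng.integers(0, h - patch_size + 1)
        x = rng.integers(0, w - patch_size + 1)
        # Accept the patch only if it lies (almost) entirely in tumor.
        if mask[y:y + patch_size, x:x + patch_size].mean() >= min_fraction:
            coords.append((int(y), int(x)))
    return coords
```

The same routine run against the inverted mask yields normal-tissue patches, giving balanced positive and negative training sets.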
Due to practical limitations, in most cases ground truth labeling is done on the level of WSIs as opposed to individual patches. Despite this lower granularity of labeling, a number of deep learning based approaches have demonstrated highly promising results. Techniques vary and often take on the form of multiple instance learning, unsupervised learning, reinforcement learning, and transfer learning, or a combination thereof. Intuitively, the goal is usually to identify patches that can collectively or independently predict the whole slide label.
Preprocessing based on image filters can be employed to reduce the number of patches that need to be analyzed. Multiple studies also apply Otsu, hysteresis, or other thresholding methods as an automatic way of identifying tissue within the WSI. Other operations, such as contrast normalization, morphological operations, and problem-specific patch scoring, can also be employed to further reduce the number of candidate patches and even enable automatic patch localization. However, verifying that each patch indeed has the same label as the slide often requires domain-specific expertise, and even devising the best image filters requires at the very least some human intuition.
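Otsu's method, mentioned above for separating tissue from background, picks the grayscale cut point that maximizes the between-class variance of the intensity histogram. A self-contained sketch (library implementations such as scikit-image's `threshold_otsu` would normally be used instead):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold for a uint8 grayscale image by
    maximizing the between-class variance over all 256 cut points."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

On an H&E slide, stained tissue is darker than the bright glass background, so patches whose pixels fall mostly below the threshold are kept as tissue and the rest are discarded before any deep learning takes place.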
In order to avoid potential human bias, most approaches employ unsupervised or multi-instance learning, or a combination of both. Tellez et al. (
Several other ideas enable progressive improvement of patch localization and visual understanding, either by iteratively revising each process or by learning both in an end-to-end fashion. Hou et al. (
Instead of extracting features from all or most patches before selecting a few to learn on, recent work has employed attention models (
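Attention-based pooling of this kind can be sketched in plain NumPy: each patch embedding receives a weight from a small gating function, and the slide representation is the attention-weighted sum of the embeddings. The function name `attention_pool` is hypothetical, and `w` and `v` stand in for parameters that would be learned jointly with the rest of the network.

```python
import numpy as np

def attention_pool(features, w, v):
    """Attention-based pooling over patch embeddings: each patch k
    gets a score w^T tanh(V h_k); a softmax over patches turns the
    scores into weights, and the slide embedding is the weighted sum."""
    scores = w @ np.tanh(v @ features.T)   # one scalar score per patch
    a = np.exp(scores - scores.max())      # numerically stable softmax
    a = a / a.sum()
    return a @ features, a                 # slide embedding, attention weights
```

The attention weights also offer a degree of interpretability, since high-weight patches indicate which regions of the slide drove the prediction.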
Usually, multiple WSIs can be acquired for each patient since the initial tissue occupies a 3D space and therefore multiple cuts can be made. In this case, the available ground truth can be specific to the patient, but not to each individual WSI (
In many cases, training takes place at a lower level, e.g., patch level, but the end goal resides at a higher level, e.g., slide level. For example, in the case of cancer diagnosis, a CNN may be trained to identify the presence of cancerous cells within a patch. However, some type of aggregation is needed in order to infer whether a WSI contains cancerous cells. This may take the form of a maximum or average operation over some or all patch predictions. In other cases, traditional machine learning models, or recurrent neural networks may be employed and trained using features extracted by a CNN and the ground truth that is available at a higher level.
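The maximum and average aggregations described above are simple to state explicitly. The `max` rule encodes the multiple instance assumption that a single cancerous patch makes the whole slide positive, while `mean` averages evidence over all patches; the function name `aggregate` is illustrative.

```python
import numpy as np

def aggregate(patch_probs, method="max"):
    """Aggregate per-patch probabilities into a slide-level score.
    `max`: one positive patch suffices to call the slide positive.
    `mean`: average the evidence contributed by every patch."""
    patch_probs = np.asarray(patch_probs, dtype=float)
    if method == "max":
        return float(patch_probs.max())
    if method == "mean":
        return float(patch_probs.mean())
    raise ValueError(f"unknown aggregation method: {method}")
```

More elaborate aggregators, such as a recurrent network fed with CNN features from the top-scoring patches, replace these fixed operations with a learned one but serve the same purpose of bridging patch-level outputs and slide-level ground truth.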
A primary limitation of patch based analysis emerges as a consequence of analysing a large input image by means of independent analysis of smaller regions. In particular, such approaches are inherently unable to capture information distributed over scales greater than the patch size. For example, although cell characteristics can be extracted from individual patches, higher level structural information, such as the shape or extent of a tumor, can only be captured when analysing larger regions. Explicitly modeling spatial correlations between patches has been proposed as a potential solution (
The aim of computer vision is to create algorithmic solutions capable of visual understanding. Applications can range from object identification and detection to image captioning and scene decomposition. In the past decade most areas of computer vision have seen remarkable progress, much of it effected by advances in neural network based learning algorithms (
In the previous decade most approaches focused on finding ways to explicitly extract features from images for models subsequently to employ (
The analysis of multi-gigabyte images is a challenge for deep learning that has appeared only alongside the emergence of digital pathology and whole slide imaging, and building models capable of understanding WSIs presents novel problems to the field. When patch level labels are available, patch sampling coupled with hard negative mining can train deep learning models that in many cases match and even surpass the accuracy of pathologists (
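Hard negative mining, as referenced above, amounts to finding the negative patches the current model scores highest, i.e., its most confident false positives, and emphasizing them in the next training round. A minimal sketch of the selection step (the name `hard_negatives` is hypothetical):

```python
import numpy as np

def hard_negatives(neg_scores, k):
    """Given the model's predicted positive-class scores for patches
    whose true label is negative, return the indices of the k
    highest-scoring ones -- the model's worst mistakes -- so they can
    be over-sampled in the next training round."""
    neg_scores = np.asarray(neg_scores)
    return np.argsort(neg_scores)[::-1][:k]
```

Iterating this mine-and-retrain loop steers the model's capacity toward the ambiguous tissue regions that matter most for accuracy.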
To work with slide or patient level labels, current approaches focus on the
Deep learning is already demonstrating its potential across a wide range of medical problems associated with digital pathology. However, the need for detailed annotations limits the applicability of strongly supervised techniques. Other techniques from weakly supervised, unsupervised, reinforcement, and transfer learning are employed to counter the need for detailed annotations while dealing with massive, highly heterogeneous images and small data sets. This emerging direction away from strong supervision opens new opportunities in WSI analysis, such as addressing problems for which the ground truth is only known at a higher than patch level, e.g., patient survivability and recurrence prediction.
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.