Multi_Scale_Tools: A Python Library to Exploit Multi-Scale Whole Slide Images

Algorithms proposed in computational pathology can automatically analyze digitized tissue samples of histopathological images to help diagnose diseases. Tissue samples are scanned at high resolution and usually saved as images with several magnification levels, namely whole slide images (WSIs). Convolutional neural networks (CNNs) represent the state-of-the-art computer vision methods targeting the analysis of histopathology images, aiming for detection, classification and segmentation. However, the development of CNNs that work with multi-scale images such as WSIs is still an open challenge. The image characteristics and the CNN properties impose architecture designs that are not trivial. Therefore, single-scale CNN architectures are still often used. This paper presents Multi_Scale_Tools, a library aiming to facilitate exploiting the multi-scale structure of WSIs. Multi_Scale_Tools currently includes four components: a pre-processing component, a scale detector, a multi-scale CNN for classification and a multi-scale CNN for segmentation of the images. The pre-processing component includes methods to extract patches at several magnification levels. The scale detector allows identifying the magnification level of images that do not contain this information, such as images from the scientific literature. The multi-scale CNNs are trained by combining features and predictions that originate from different magnification levels. The components are developed using private datasets, including colon and breast cancer tissue samples. They are tested on private and public external data sources, such as The Cancer Genome Atlas (TCGA). The results of the library demonstrate its effectiveness and applicability. The scale detector accurately predicts multiple levels of image magnification and generalizes well to independent external data. The multi-scale CNNs outperform the single-magnification CNNs for both classification and segmentation tasks. The code is developed in Python and will be made publicly available upon publication. It aims to be easy to use and easy to extend with additional functions.


INTRODUCTION
The implicit multi-scale structure of digitized histopathological images represents an open challenge in computational pathology. Training machine learning algorithms that can simultaneously learn both microscopic and macroscopic tissue structures comes with technical and computational challenges that are not yet well studied.
As of 2021, histopathology represents the gold standard to diagnose many diseases, including cancer (Aeffner et al., 2017; Rorke, 1997). Histopathology images include several tissue structures, ranging from microscopic entities (such as single cell nuclei) to macroscopic components (such as tumor bulks). Whole Slide Images (WSIs) are digitized histopathology images that are scanned at high resolution and stored in a multi-scale (pyramidal) format. WSI resolution is related to the spatial resolution and the optical resolution used to scan the images (Wu et al., 2010). The spatial resolution is the minimum distance that the scanner can capture so that two objects are still distinguished, measured in terms of μm per pixel (Sellaro et al., 2013). The optical resolution (or magnification) is the magnification factor of the lens within the scanner (Sellaro et al., 2013). Currently, the de facto standard spatial resolutions adopted to scan tissue samples (for example in The Cancer Genome Atlas) are usually 0.23-0.25 μm (magnification 40x) or 0.46-0.50 μm (magnification 20x). Tissue samples such as surgical resection samples (or specimens) are often approximately 20 mm × 15 mm in size 1 , while samples such as biopsies are approximately 2 mm × 6 mm in size. The size of the samples combined with the spatial resolution of the scanners leads to gigapixel images: image size can reach 200 000 × 200 000 pixels, meaning gigabytes of pixel data. The multi-scale WSI format (Figure 1) includes several magnification levels (each with a different spatial resolution) of the sample, stored in a pyramid, usually varying between 1.25x and 40x. The baseline image of the pyramid is the one at the highest resolution. The multi-scale structure of the images allows pathologists to analyze the image from the lowest to the highest magnification level. Pathologists analyze the images by first identifying a few regions of interest and afterwards zooming into them to visualize different details of the tissue (Schmitz et al., 2019). Each magnification level includes different types of information (Molin et al., 2016), since tissue structures appear in different ways according to their magnification level. Therefore, it is essential not only to detect an abnormality, but to detect it within a specific range of levels. The characteristics of microscopes and scanners often lead to a scale-dependent analysis. For example, at middle magnification levels (such as 5-10x) it is possible to distinguish between glands, while at the highest ones (such as 20-40x) it is possible to better resolve cells. Figure 2 includes examples of tissues scanned at different magnification levels.
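To make these orders of magnitude concrete, the following back-of-the-envelope sketch (plain Python, using only the figures above) computes the pixel counts for a 20 mm × 15 mm resection scanned at 0.25 μm per pixel, together with the level sizes of a typical pyramid; the halving between levels is the usual convention, not a property guaranteed by every scanner.

```python
# Why WSIs are gigapixel images: a 20 mm x 15 mm resection at 0.25 um/pixel.
SPATIAL_RES_UM = 0.25                      # um per pixel at 40x
width_px = int(20_000 / SPATIAL_RES_UM)    # 20 mm -> 80,000 px
height_px = int(15_000 / SPATIAL_RES_UM)   # 15 mm -> 60,000 px
print(width_px * height_px)                # 4,800,000,000 pixels at the base level

# Each pyramid level conventionally halves the previous one (40x, 20x, 10x, ...).
for level, mag in enumerate([40, 20, 10, 5, 2.5, 1.25]):
    factor = 2 ** level
    print(f"{mag}x: {width_px // factor} x {height_px // factor} pixels")
```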
Computational pathology is the computational analysis of digital images obtained through scanning slides of cells and tissues (van der Laak et al., 2021). Currently, deep Convolutional Neural Networks (CNNs) are the state-of-the-art machine learning algorithms in computational pathology tasks, in particular for classification (Arvaniti and Claassen, 2018; Coudray et al., 2018; Komura and Ishikawa, 2018; Ren et al., 2018; Campanella et al., 2019; Roy et al., 2019; Iizuka et al., 2020) and segmentation (Ronneberger et al., 2015; Paramanandam et al., 2016; Naylor et al., 2017; Naylor et al., 2018; Wang et al., 2019) of images. Their success relies on automatically learning the relevant features from the input data. However, CNNs usually cannot easily handle the multi-scale structure of the images, since they are not scale-equivariant by design (Marcos et al., 2018; Zhu et al., 2019) and because of WSI size. The equivariance property of a transformation means that when the transformation is applied, it is possible to predict how the representation will change (Lenc and Vedaldi, 2015; Tensmeyer and Martinez, 2016). This is normally not true for CNNs: if a scale transformation is applied to the input data, it is usually not possible to predict its effect on the output of the CNN. Knowledge about the scale is essential for the model to identify diseases, since the same tissue structures, represented at different scales, include different information (Janowczyk and Madabhushi, 2016). CNNs can identify abnormalities in tissues, but the information and the features related to the abnormalities are not the same at each scale (Jimenez-del Toro et al., 2017). Therefore, the proper scale must be selected to train CNNs (Gecer et al., 2018; Otálora et al., 2018b). Unfortunately, scale information is not always available in images. This is the case, for instance, for pictures taken with standard cameras or altered in compression and resolution, such as images downloaded from the web or included in scientific articles. Furthermore, modern hardware (Graphics Processing Units, GPUs) cannot easily handle WSIs, due to their large pixel size and the limited video random access memory available to temporarily store them. The combination of different magnification levels leads to even larger inputs, making it even harder to analyze the images.

FIGURE 1 | An example of the WSI format including multiple magnification levels. The size of each image in the pyramid is reported under its magnification level, in pixels.
FIGURE 2 | An example of tissue represented at multiple magnification levels (5x, 10x, 20x, 40x). The tissues come from colon, prostate and lung cancer images.
The code library (called Multi_Scale_Tools) described in this paper contributes to alleviating the mentioned problems by presenting tools that allow handling and exploiting the multi-scale structure of histopathological images in end-to-end CNN architectures. The library includes pre-processing tools to extract multi-scale patches, a scale detector, a component to train a multi-scale CNN classifier and a component to train a multi-scale CNN for segmentation. The tools are platform-independent and developed in Python. The code is publicly available at https://github.com/sara-nl/multi-scale-tools. Multi_Scale_Tools is aimed at being easy to use and easy to extend with additional functions.

METHODS
The library includes four components: a pre-processing tool, a scale detector tool, a component to train a multi-scale CNN classifier and a component to train a multi-scale segmentation CNN. Each tool is described in a dedicated subsection as follows:

• Pre-processing component, Sub-section 2.1
• Scale detector, Sub-section 2.2
• Multi-scale CNN for classification, Sub-section 2.3
• Multi-scale CNN for segmentation, Sub-section 2.4

Pre-Processing Component
The pre-processing component allows researchers to generate multi-scale input data. The component includes two parametric and scalable methods to extract patches from the different magnification levels of a WSI: the grid extraction and the multi-center extraction method. Both methods need a WSI and the corresponding tissue mask as input, and they both produce images and metadata as output. The grid extraction methods (Patch_Extractor_Dense_Grid.py, Patch_Extractor_Dense_Grid_Strong_Labels.py) allow extracting patches from one magnification level (Figure 3). The tissue mask is split into a grid of patches according to the following parameters: magnification level, mask magnification, patch size, and stride between the patches. The output of the method is a set of patches selected according to the parameters. The multi-center extraction methods (Patch_Extractor_Dense_Centroids.py, Patch_Extractor_Dense_Centroids_Strong_Labels.py) allow extracting patches from multiple magnification levels. According to the highest magnification level chosen by the user, the tissue mask is split into a grid (as done in the functions previously described). The patches within this grid are called centroids. Each centroid is used to generate the coordinates for a patch at a lower magnification level, so that the latter includes the centroid (the patch at the highest magnification level) in its central section. The method's output is a set of tuples, each one including patches at different magnification levels (Figure 4). Compared with other patch extraction methods, such as the one presented in Lu et al. (2021), this pre-processing component has two main characteristics. The first one is that the component extracts patches from multiple magnification levels of the WSIs, pairing the patches coming from the same region of the image. The second one is that the component allows extracting patches from an arbitrary magnification level, even when that level is not stored in the WSI. Usually, patch extractor methods extract patches only from the magnification levels stored in the WSI format (M_a), such as 40x, 20x, 10x, 5x, 2.5x and 1.25x. This process is driven by the input parameters, which include both the wanted patch size (P_w) and the wanted magnification (M_w). The method extracts a patch of size P_a = P_w × M_a/M_w from a magnification M_a stored in the WSI, and afterwards the patch is resized to P_w.
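As an illustration of this resizing logic, the following is a minimal sketch, not the library's actual API, assuming an OpenSlide-compatible WSI whose base magnification is known; the function and parameter names are hypothetical.

```python
import openslide
from PIL import Image

def extract_patch(slide_path, x, y, wanted_mag, patch_size=224, base_mag=40.0):
    """Extract a `patch_size` patch at an arbitrary magnification `wanted_mag`.

    (x, y) are base-level coordinates of the patch's upper-left corner.
    For simplicity this reads from the base level; an efficient implementation
    would read from the closest stored pyramid level instead.
    """
    slide = openslide.OpenSlide(slide_path)
    # Region size at the base magnification so that, once resized, it covers
    # `patch_size` pixels at `wanted_mag`: P_a = P_w * (M_a / M_w)
    region_size = int(round(patch_size * base_mag / wanted_mag))
    region = slide.read_region((x, y), 0, (region_size, region_size)).convert("RGB")
    return region.resize((patch_size, patch_size), Image.LANCZOS)
```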
In both methods, only patches from tissue regions are extracted and saved using tissue masks, distinguishing between patches from tissue regions and patches from the background. The methods are developed to work with masks including tissue and, in case they are available, with pixel-wise annotated masks.
In the case of tissue masks, the masks are generated using the HistoQC tool (Janowczyk et al., 2019); the HistoQC configuration adopted is reported in the repository. In the case of pixel-wise annotations, the masks must first be converted to an RGB image. Besides the patches, the methods also save metadata files (CSV files). The metadata include the magnification level at which the patches are extracted and the x and y coordinates of the patches' upper-left corner. The scripts are developed to be multi-threaded, in order to exploit hardware architectures with multiple cores. In the Supplementary Materials section, the parameters for the scripts are described in more detail.
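Downstream code can consume this metadata as in the following sketch; the column names here are assumptions for illustration, as the actual CSV layout is documented in the repository.

```python
import csv

# Hypothetical column names; see the repository for the real CSV format.
with open("patches_metadata.csv") as f:
    patches = [(row["patch_path"],
                float(row["magnification"]),
                int(row["x"]), int(row["y"]))   # upper-left corner coordinates
               for row in csv.DictReader(f)]
```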

Scale Detector
The scale detector tool is a CNN trained to estimate the magnification level of a given patch or image. This task has been explored in the past (Otálora et al., 2018a; Otálora et al., 2018b) for prostate and breast tissue. Similar approaches have recently been extended to different organs in the TCGA repository (Zaveri et al., 2020). The tool involves the scripts related to the training of the models (the input data generation, the training and testing modules) and a module to use the detector as a standalone component that performs the magnification inference for new images. The models are trained in a fully-supervised fashion. Therefore, the scripts to train them need as input a set of patches and the corresponding magnification levels, provided as CSV files including the patch path and the corresponding magnification level. Two scripts are developed to generate the input files, assuming that the patches were previously generated with the pre-processing component described in the previous section. The first script (Create_csv_from_partitions.py) splits the WSIs into partitions, generating three files (the input data for the training, validation and testing partitions) starting from three files (previously prepared by the user) including the names of the WSIs. The second script (Create_csv.py) generates an input data CSV starting from a list of files. The model is trained (Train_regressor.py) and tested (Test_regressor.py) with several magnification levels that the user can choose (in this paper, 5x, 8x, 10x, 15x, 20x, 30x and 40x were used). Training the model with patches from a small, discrete set of scales can lead to regressors that are precise when estimating magnifications close to the input scales, and less precise for scales far from them. Therefore, a scale augmentation technique was applied to patches and labels during training (in addition to more standard augmentation techniques such as rotation, flipping and color augmentation). To perform scale augmentation, the image is randomly cropped by a factor and resized to the original patch size; the same factor is used to perturb the magnification label. The scale detector component also includes a module to import and use the model in code (regression.py). The component works as a standalone module (with the required parameters), but it is also possible to load the functions from the Python module. The Supplementary Materials section includes a more thorough description of the parameters for the scripts.
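The scale augmentation step can be sketched as follows. This is an illustrative implementation of the description above, not the library's actual code; it assumes square PIL patches and, for simplicity, only zooms in (a maximum zoom factor is a hypothetical parameter).

```python
import random
from PIL import Image

def scale_augment(patch, magnification, max_zoom=1.25):
    """Randomly zoom into a patch and perturb its magnification label accordingly."""
    factor = random.uniform(1.0, max_zoom)
    w, h = patch.size
    crop_w, crop_h = int(w / factor), int(h / factor)
    left = random.randint(0, w - crop_w)
    top = random.randint(0, h - crop_h)
    cropped = patch.crop((left, top, left + crop_w, top + crop_h))
    # Resizing the crop back to the original size zooms in by `factor`,
    # so the effective magnification grows by the same factor.
    return cropped.resize((w, h), Image.LANCZOS), magnification * factor
```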

Multi-Scale CNN for Classification
The component provides two multi-scale CNN architectures for fully-supervised classification: one combines the features extracted by the single-scale branches, the other combines their predictions. Both architectures are presented in two variants, optimizing respectively one and multiple loss functions. In the first variant (one loss function), the input is a set of tuples of patches from several magnification levels (one patch for each level), generated using the multi-center extraction tool (presented in Section 2.1). The input tuples are generated with a script (Generate_csv_multicenter.py) that exploits the coordinates of the patches (stored in the metadata) to generate the tuples (stored in a CSV file). The tuple label corresponds to the class of the centroid patch (the patch from the highest level within the tuple). Therefore, the model outputs only the class of the tuple. Only one loss function is minimized in this variant, i.e. the categorical cross-entropy between the CNN output and the patch ground truth. Figure 5 summarizes the CNN architecture. In the second variant (multiple loss functions), the input is a set of tuples of patches from several magnification levels (one patch for each level), previously generated using the grid extraction method (presented in Section 2.1). The input tuples are generated with a script (Generate_csv_upper.py) that exploits the coordinates of the patches (stored in the metadata) to generate the tuples (stored in a CSV file). The tuple labels correspond to the classes of the patches. The model has n + 1 outputs: the class for each of the n magnification levels and the class of the whole tuple. In this variant, n + 1 loss functions are minimized (n being the number of magnification levels considered). The n loss functions are the categorical cross-entropies between the output of each scale branch and the tuple labels. The remaining loss term is the categorical cross-entropy between the output of the network (after the combination of the features or the predictions of the single branches) and the tuple labels. Figure 6 summarizes the CNN architecture. The Supplementary Materials section includes a more thorough description of the parameters.
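To illustrate how the branches and losses fit together, here is a minimal PyTorch sketch of the feature-combination architecture with the n + 1 losses of the second variant; the backbone choice, layer sizes and names are assumptions for illustration, not the library's actual implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiScaleClassifier(nn.Module):
    def __init__(self, n_scales=2, n_classes=5):
        super().__init__()
        # One independent feature extractor per magnification level.
        self.branches = nn.ModuleList()
        for _ in range(n_scales):
            backbone = models.resnet18(weights=None)
            backbone.fc = nn.Identity()  # expose the 512-d feature vector
            self.branches.append(backbone)
        # Per-branch heads (used by the multi-loss variant).
        self.branch_heads = nn.ModuleList(
            [nn.Linear(512, n_classes) for _ in range(n_scales)])
        # Head over the concatenated multi-scale features.
        self.combined_head = nn.Linear(512 * n_scales, n_classes)

    def forward(self, patches):
        # `patches`: one tensor per magnification level, each (B, 3, H, W),
        # e.g. a concentric 5x/10x tuple from the multi-center extractor.
        features = [branch(p) for branch, p in zip(self.branches, patches)]
        branch_logits = [head(f) for head, f in zip(self.branch_heads, features)]
        combined_logits = self.combined_head(torch.cat(features, dim=1))
        return branch_logits, combined_logits

def multi_loss(branch_logits, combined_logits, labels):
    """n + 1 cross-entropy terms: one per branch plus one for the combined output."""
    criterion = nn.CrossEntropyLoss()
    loss = criterion(combined_logits, labels)
    for logits in branch_logits:
        loss = loss + criterion(logits, labels)
    return loss
```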

Multi-Scale CNN for Segmentation
This component includes HookNet (van Rijthoven et al., 2020), a multi-scale CNN for semantic segmentation. HookNet combines information from low-resolution patches (large field of view) and high-resolution patches (small field of view) to semantically segment the image, using multiple branches. The low-resolution patches come from lower magnification levels and include context information, while the high-resolution patches come from higher magnification levels and include more fine-grained information. The network is composed of two encoder-decoder branches, the context branch (fed with low-resolution patches) and the target branch (fed with high-resolution patches). The two branches are fed with concentric multi-field-of-view multi-resolution (MFMR) patches (284 × 284 pixels in size). Although the branches have the same architecture (an encoder-decoder CNN based on U-Net), they do not share their weights. HookNet is thoroughly described in a dedicated article (van Rijthoven et al., 2020).
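The core idea, "hooking" context features into the target branch, can be sketched as follows. This is a simplification for intuition, not HookNet's actual implementation; among other details, the real network hooks at a decoder depth where the spatial resolutions of the two branches match (see van Rijthoven et al., 2020).

```python
import torch

def hook(context_features, target_features):
    """Center-crop the context feature map and concatenate it to the target's.

    Both tensors are (B, C, H, W); the context map is assumed to be at a
    comparable spatial resolution but with a larger field of view.
    """
    _, _, th, tw = target_features.shape
    _, _, ch, cw = context_features.shape
    top, left = (ch - th) // 2, (cw - tw) // 2
    cropped = context_features[:, :, top:top + th, left:left + tw]
    return torch.cat([cropped, target_features], dim=1)
```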

Datasets
The following datasets are used to develop the Multi_Scale_Tools components:

• Colon dataset, Sub-section 2.5.1, used in the Pre-processing component, the Scale detector and the Multi-scale CNN for classification
• Breast dataset, Sub-section 2.5.2, used in the Multi-scale CNN for segmentation
• Prostate dataset, Sub-section 2.5.3, used in the Scale detector
• Lung dataset, Sub-section 2.5.4, used in the Scale detector and the Multi-scale CNN for segmentation

Colon Dataset
The colon dataset is a subset of the ExaMode colon dataset. This subset includes 148 WSIs (provided by the Department of Pathology of Cannizzaro Hospital, Catania, Italy), stained with Hematoxylin and Eosin (H&E). The images are digitized with an Aperio scanner: some of the images are scanned with a maximum spatial resolution of 0.50 μm per pixel (20x), while the others are scanned with a spatial resolution of 0.25 μm per pixel (40x). The images are pixel-wise annotated by a pathologist. The annotations include five classes: cancer, high-grade dysplasia, low-grade dysplasia, hyperplastic polyp and non-informative tissue.

Breast Dataset
The breast dataset (provided by the Department of Pathology of Radboud University Medical Center, Nijmegen, Netherlands) is a private dataset including 86 WSIs, stained with H&E. The images are digitized with a 3DHistech scanner, with a spatial resolution of 0.25 μm per pixel (40x). The images are pixel-wise annotated by a pathologist. 6,279 regions are annotated, with the following classes: ductal carcinoma in-situ (DCIS), invasive ductal carcinoma (IDC), invasive lobular carcinoma (ILC), benign epithelium (BE), other, and fat.

Prostate Dataset
The prostate dataset is a subset of the publicly available database offered by The Cancer Genome Atlas (TCGA-PRAD), which includes 20 WSIs stained with H&E. The images come from several sources and are digitized with different scanners, with a spatial resolution of 0.25 μm per pixel (40x). The images come without pixel-wise annotations.

Lung Dataset
The lung dataset is a subset of the publicly available database offered by The Cancer Genome Atlas Lung Squamous Cell Carcinoma dataset (TCGA-LUSC), including 27 WSIs stained with H&E. The images come from several sources and are digitized with different scanners, with a spatial resolution of 0.25 μm per pixel (40x). The images come without pixel-wise annotations from the repository, but a medical expert from Radboud University Medical Center pixel-wise annotated them with four classes: tertiary lymphoid structures (TLS), germinal centers (GC), tumor, and other.

EXPERIMENTS AND RESULTS
The section presents the assessment of the components of the Multi_Scale_Tools library in dedicated subsections as follows:

• Pre-processing component assessment, Sub-section 3.1
• Scale detector assessment, Sub-section 3.2
• Multi-scale CNN for classification assessment, Sub-section 3.3
• Multi-scale CNN for segmentation assessment, Sub-section 3.4
• Library organization, Sub-section 3.5

Pre-Processing Tool Assessment
The pre-processing component allows extracting a large number of patches from multiple magnification levels, guaranteeing scalable performance. The patch extractor components (grid and multi-center methods) are tested on WSIs scanned with Aperio (.svs), 3DHistech (.mrxs) and Hamamatsu (.ndpi) scanners, on data coming from different tissues (colon, prostate and lung) and datasets. Table 1 includes the number of patches extracted with the multi-center extraction method, considering two possible combinations of magnification levels (5x/10x, 5x/10x/20x). In both cases, patches are extracted with a patch size of 224 × 224 pixels without any stride. The methods' performance is evaluated in terms of scalability, since the methods are designed to work on multi-core hardware. Table 2 includes the time results obtained with the grid method (upper part) and with the multi-center method (lower part). The evaluation considers the amount of time needed to extract the patches from the colon dataset, using several threads. The results show that both methods benefit from multi-core hardware, reducing the time needed to pre-process data.
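The parallelization pattern behind these timings can be sketched as follows; this is a hedged illustration in the spirit of the multi-threaded design evaluated in Table 2, where `extract_grid_patches` is a placeholder standing in for the library's per-WSI grid extraction (Section 2.1), not its actual function.

```python
from concurrent.futures import ProcessPoolExecutor

def extract_grid_patches(wsi_path):
    # Placeholder: open the WSI, walk the tissue-mask grid, save patches
    # and metadata, and return the number of patches written.
    return 0

wsi_paths = ["slide_001.svs", "slide_002.mrxs", "slide_003.ndpi"]

if __name__ == "__main__":  # guard required for process pools on some platforms
    with ProcessPoolExecutor(max_workers=8) as pool:
        patch_counts = list(pool.map(extract_grid_patches, wsi_paths))
```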

Scale Detector Assessment
The scale detector shows high performance in estimating the magnification level of patches that come from different tissues. The detector is trained with patches from the colon dataset and tested with patches from three different tissues. The performance of the models is assessed with the coefficient of determination (R²), the Mean Squared Error (MSE), Cohen's κ-score (McHugh, 2012) and the balanced accuracy. While the experimental setup and the metric descriptions are presented in detail in the Supplementary Material, Table 3 summarizes the results. The highest performance is reached on the colon test partition, but the scale detector shows high performance on the other tissues as well. The scale detector makes almost perfect scale estimations on the colon dataset (the data come from the same medical source and include the same tissue type), in both the regression and the classification metrics. The scale detector also makes reasonably good scale estimations on the prostate data, in both the regression and the classification metrics, and on the lung dataset, where the performance is nonetheless the lowest. The fact that the regressor shows exceptionally high performance on colon data and good performance on other tissues means that it has learnt to distinguish colon morphology represented at different magnification levels very well, and that the learnt knowledge generalizes well to other tissues too. Even though tissues from different organs share similar structures (glands, stroma, etc.), the morphology of these structures differs between organs, such as prostate and colon glands. Training the regressor with patches from several organs may close this gap, yielding high performance across different tissue types.

Multi-Scale CNN for Classification Assessment
The multi-scale CNNs show higher performance in fully-supervised classification than the single-scale CNNs. Several configurations of the multi-scale CNN architectures are evaluated. They involve variations in the optimization strategy (one or multiple loss functions), in the magnification levels (combinations of 5x, 10x, 20x) and in how information from the scales is combined (combining the single-scale predictions or the single-scale features). Table 4 summarizes the results obtained. The CNNs are trained and tested with the colon dataset, which comes with pixel-wise annotations made by a pathologist. The performance of the models is assessed with Cohen's κ-score and the balanced accuracy. More detailed descriptions of the experimental setup and the metrics adopted are presented in the Supplementary Material. In the presented experiment, the best multi-scale CNN architecture is the one that combines features from the 5x/10x magnification levels and is trained optimizing n + 1 loss functions. It outperforms the best single-scale CNN (trained with patches acquired at 5x) in terms of balanced accuracy, while the κ-scores of the two architectures are comparable. The characteristics of the classes involved can explain why the CNNs trained combining patches from 5x/10x reach the best results. These classes show morphologies involving several alterations of the gland structure. Glands can usually be identified at low magnification levels, such as 5x/10x, while at 20x the cells are visible. For this reason, the CNNs show high performance with patches from magnifications 5x/10x, while including patches from 20x decreases the performance. The fact that the discriminant characteristics are identified within a range of scales may explain why the combination of the features shows higher performance than the combination of the predictions.

Multi-Scale CNN for Segmentation Assessment
The multi-scale CNN (HookNet) shows higher tissue segmentation performance than the single-scale CNNs (U-Net). The model is trained and tested with the breast and lung datasets, comparing it with models trained with images from a single magnification level. The performance of the models is assessed with the F1 score and the macro F1 score. More detailed descriptions of the experimental setup and the metrics adopted are presented in the Supplementary Material. Table 5 and Table 6 summarize the results obtained on the breast dataset and on the lung dataset, respectively. For both tissues, HookNet shows a higher overall performance, while some of the single-scale U-Nets perform better on specific segmentation tasks (such as breast DCIS or lung TLS). This result can be interpreted as a consequence of the characteristics of the task; therefore, the user should choose the proper magnification levels to combine, depending on the problem.

Library Organization
The source code for the library is available on GIT 2 , while the HookNet code is available here 3 . The library can be deployed as a Python package directly from the repository or as a Docker container that can be downloaded from 4 (the multiscale folder). Interaction with the library is done through a model class and an Inference class 5 . The model instantiation depends on the choice of algorithms. For a more detailed explanation of the hyperparameters and other options, please browse the Readme file 6 . An example can be found here 7 . The Python libraries used to develop Multi_Scale_Tools are reported in the Supplementary Materials.

The Multi_Scale_Tools library aims at facilitating the exploitation of the multi-scale structure of WSIs with code that is easy to use and easy to extend with additional functions. The library currently includes four components: a pre-processing tool to extract multi-scale patches, a scale detector, two multi-scale CNNs for classification and a multi-scale CNN for segmentation. The pre-processing component includes two methods to extract patches from several magnification levels; the methods are designed to be scalable on multi-core hardware. The scale detector component includes a CNN that regresses the magnification level of a patch. The CNN obtains high performance on patches that come from the colon (the tissue used to train it) and reaches good performance on other tissues, such as prostate and lung, too. Two multi-scale CNN architectures are developed for fully-supervised classification: the first one combines features from multi-scale branches, while the second one combines predictions from multi-scale branches. The first architecture obtains better performance and outperforms the model trained with patches from only one magnification level. The HookNet architecture for multi-scale segmentation is also included in the library, fostering its usage and making the library more complete. The tests show that HookNet outperforms the single-scale U-Net in the considered tasks.

The presented library allows exploiting the multi-scale structure of WSIs efficiently. In any case, the user remains a fundamental part of the system for several components, such as identifying the scale that can be more relevant for a specific problem. The comparison between the single-scale CNNs and the multi-scale CNN is an example of this. The CNN is trained to classify between cancer, dysplasia (both high-grade and low-grade), hyperplastic polyp and non-informative tissue. In the classification task, the highest performance is reached using patches at magnifications 5x and 10x, while patches from 20x lead to lower classification performance. This can likely be related to the fact that the main feature related to the considered classes is the structure of the glands; therefore, high magnifications (e.g. 20x) introduce limited helpful information into the models. The importance of the user selecting the proper magnification levels is highlighted even more in the segmentation results. Considering low magnifications, the models show good performance in ductal carcinoma in-situ and invasive ductal carcinoma segmentation, since these tasks need context about the duct structures in the breast use case. Considering higher magnifications, the models perform well in invasive lobular carcinoma and benign tissue segmentation, where the details are more important.
The methods identified to pair images from several magnification levels can pave the way to multi-modal combination of images too. The combination may increase the information included in the single modality, increasing the performance of the CNNs. Some possible applications are the combination of WSIs stained with different reagents, such as H&E and immunohistochemical (IHC) stainings; the use of Raman spectroscopy data, combining information about tissue morphologies and architectures with protein biomarkers; and the combination of patches from different focal planes.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
NM: design of the work, software, analysis, original draft SO: design of the work, revised the work DP: software, revised the