Prediction of Treatment Response in Triple Negative Breast Cancer From Whole Slide Images

The automatic analysis of stained histological sections is becoming increasingly popular. Deep Learning is today the method of choice for the computational analysis of such data, and has shown spectacular results for large datasets for a large variety of cancer types and prediction tasks. On the other hand, many scientific questions relate to small, highly specific cohorts. Such cohorts pose serious challenges for Deep Learning, typically trained on large datasets. In this article, we propose a modification of the standard nested cross-validation procedure for hyperparameter tuning and model selection, dedicated to the analysis of small cohorts. We also propose a new architecture for the particularly challenging question of treatment prediction, and apply this workflow to the prediction of response to neoadjuvant chemotherapy for Triple Negative Breast Cancer.


Context
Breast cancer (BC) is the most common cancer in women and the leading cause of cancer deaths with 18.2% of deaths among female cancer patients and 8% among all cancer patients (Institut National Du Cancer, 2019). Out of the four main breast cancer types, Triple Negative Breast Cancer (TNBC) represents 10% of all BC patients. This group has the worst prognostic with a five-year survival rate of around 77 percent versus 93 percent for the others. Currently no specialised treatments exists and the standard procedure consists in administrating neoadjuvant or adjuvant chemotherapy (Sakuma et al., 2011). TNBC research is still a very active field of study (Foulkes et al., 2010) and on the one hand, most works have focused on stratifying cohorts based on molecular and biological profiles (Lehmann et al., 2016). We, on the other hand, tackle the problem of predicting the response variable in a TNBC neoadjuvant chemotherapy (NACT) cohort from a histological needle-core biopsy section from the primary tumor prior to treatment. In contrast to most of the effort in cancer research, which is driven by the analysis of sequencing data, our study is based solely on the histological image data prior to treatment.
Each histological sample corresponds to tissue slides encompassing the tumor and its surrounding, stained with agents in order to highlight specific structures, such as cells, cell nuclei or collagen. The morphological properties of these elements and their spatial organisation have been linked to cancer subtypes, grades and prognosis. Even if pathologists have been trained to understand and report the evidence found in this type of data, the complexity, size and heterogeneity found in histological specimens make it highly unlikely that all relevant and exploitable patterns are known of today. Tissue images are informative about morphological and spatial patterns and are therefore inherently complementary to omics data.
Two major technological advances have triggered the emergence of the field of Computational Pathology: first, the arrival of new and powerful scanners replaced to some extent the use of conventional microscopes. Today, slides are scanned, stored and can be accessed rapidly at any moment (Huisman et al., 2010). This in turn has lead to the generation of large datasets that can also be analyzed computationally. The second element was the rise of new computer vision techniques.
Indeed, while the analysis of tissue slides has been of interest to the Computer Vision community for many years (Bartels et al., 1988), it is the advent of deep learning that has truly impacted the field. The advent of deep learning has stemmed a wide number of projects and investments. For visual systems, it is the combination of very large annotated dataset (Jia et al., 2009), hardware improvement and convolutional neural networks (CNN) that led to humanlike capabilities. These outbreaks in performance have led to the creation of many annotated datasets and to the application of CNN's to many tasks. However, for biomedical imaging in particular the application of CNN is not always straightforward: 1) The price for generating and annotating large biomedical dataset limits the progress of big data in this domain (Gurcan et al., 2009). 2) In histopathology in particular, each individual sample can be very large, one sample can be up to 60 GB uncompressed. This leads to multiple issues, again linked with the time and price needed for annotation, but also for the subsequent analysis where ad hoc methods have to be used as an entire image does not comfortably fit in RAM.
3) The nature of the data is inherently complex, each biological sample has its own individual patterns to be differentiated with relevant pathological evidence. For histopathology data, samples have a very large inter-slide, but also intra-slide variability that make the apparent signal harder to detect. In addition, the level of detail can be an additional difficulty: the relevant image features may be very fine grained, such as mitotic events, or very large such as the size of relevant image regions (necrotic, tumorous) (Janowczyk and Madabhushi, 2016).
In this paper, we set out to predict the response to NACT in TNBC from Whole Slide Images (WSI). Prediction of treatment response is one of the most difficult tasks in digital pathology (Echle et al., 2021). Unlike tasks like tumor detection and subtyping, for which high accuracies have been achieved in the past (Echle et al., 2021), there are no known morphological phenotypes that would allow for the prediction of treatment response. Moreover, datasets with treatment response are usually much smaller than datasets acquired in clinical routine for tasks such as automatic grading, metastasis detection or subtype prediction.
The contributions of this article are: 1) We benchmark several state-of-the-art architectures with respect to their performance in treatment response prediction. 2) We propose a new architecture for the prediction of treatment response and demonstrate its efficiency. 3) We propose a suitable model selection procedure that can cope with small datasets and avoids re-training, which is particularly important for deep learning. We prove its validity and efficiency on simulated data.
The paper is organised as follows: in the next Section 1.2 we describe related work. In Section 2, we describe the methodological developments. In Section 2.1, we present the limits of the current validation procedures and our alternative method used in this study. We then introduce our histopathology dataset in Section 2.3. Section 2.4 is devoted to introducing the DNN architectures which will be applied to the TNBC cohort. In Section 3, we show our results on the simulated data for our validation procedure and the application of our DNN to the TNBC cohort. Finally in Section 4 we discuss our methods and results.

Challenges in Computational Pathology
The field of research in computational pathology can be divided in three categories: 1) Preprocessing, in particular color normalisation which aims at reducing the bias introduced by staining protocols used in different centers (Ruifrok and Johnston, 2001;Bejnordi et al., 2016). 2. Detection, segmentation and classification of objects of interest, such as regions (Bejnordi et al., 2017;Chan et al., 2019;Barmpoutis et al., 2021) and nuclei (Naylor et al., 2018;Graham et al., 2019).
Of note, these different of Computer Vision tasks have varying degrees of difficulty. For instance, Deep Learning based tumor detection and subtyping can be achieved with high accuracies [AUC: 0.97-0.99 for detection, AUC: 0.85-0.98 for subtyping, (Echle et al., 2021)]. This is not surprising, as these tasks rely on well-known visual cues. In contrast, prediction of treatment response is deemed one of the most difficult tasks in digital pathology (Echle et al., 2021), because the related morphological patterns are unknown and could potentially be very complex. It is even not clear to which extent treatment response is actually predictable from WSI.
Pipelines for slide variable predictions are usually divided into several steps. Tiles are partitioned into smaller images, usually referred to as patches or tiles, which are then encoded by a DNN, often trained on ImageNet (Zhu et al., 2017;Courtiol et al., 2019), as depicted in Figure 2. The training on ImageNet might be surprising at first sight, as the nature of the images are very different. In addition, ImageNet samples usually have a natural orientation, where the main object of interest is usually centered and scaled to fit in the image (Raghu et al., 2019). In contrast, histopathology images have a rotationally invariant content with no prior regarding scale or positioning of the relevant structures. However, rotational invariance can be imposed Lafarge et al., 2020), and in practice ImageNet based encodings are widely used and tend to perform very well.
After encoding of all tiles, each WSI is converted into a P × n i matrix where P is the encoding size and n i the number of tiles. The last step consists in aggregating tile level encodings to perform unsupervised or supervised predictions at the slide level (Zhu et al., 2017;Courtiol et al., 2018;Campanella et al., 2019;Courtiol et al., 2019;. Computational Pathology as a field has benefited from the generation of large annotated data sets, mostly with pixel-level annotations (Kumar et al., 2017;Litjens et al., 2018;Naylor et al., 2018), or cell-level annotations (Veta et al., 2015) for cell classification. The major resource for WSI with slide level annotations are the Cancer Genome Atlas (TCGA) and the Camelyon Challenge (Litjens et al., 2018). These public repositories are paralleled by many in-house datasets (thus not accessible to the public), some of which can be very large, namely in a screening context, e.g., (Campanella et al., 2019). In most cases however, the datasets tend to be very small and fall therefore in the small n large p category. This is due to the fact that often the most interesting studies focus on particular molecularly defined cancer subtypes for which only small cohorts exist. In addition, collecting the output variable might be very challenging and time-consuming, if the project is not formulated in the context of Computer Aided Diagnosis. This is particularly true for treatment response prediction.

Challenges in Applying DNN to Small Datasets
For all supervised learning method it is custom to use a two step procedure for estimating the performance. After dividing your dataset into three categories: train, validation and test. The first step consists in performing model selection with the training and validation set. The second step simply involves evaluating the chosen model on the test set in order to assess an unbiased estimator of the performance (Pereira et al., 2009). This is however only possible if the three categories are large enough. When the validation set is too small and the discrepancy in the data too high, one could very easily over-fit or under-fit on the dataset (Varoquaux, 2018). When the number of samples n is small, which is usually the case for biomedical data, alternative validation methods have to be found such as cross-validation (CV) and nested cross-validation (NCV). CV is mostly used for model selection or assessing performance. NCV is used when the model needs a tuning based on an external dataset, such as hyperparameter tunning for Support Vector Machines. Even if these methods have been debated (Krstajic et al., 2014;Varoquaux, 2018;Wainer and Cawley, 2018), they are widely accepted. These methods are explained in more details in Section 2.1.

Prediction of the Response to Neoadjuvant Chemotherapy in Triple Negative Breast Cancer
Neoadjuvant chemotherapy (NACT) responses varies among patients in TNBC and no clear biological signal has been shown. Survival in these cohorts have been correlated to the Residual Cancer Burden (RCB) variable (Symmans et al., 2007) which can be used as a proxy for response. RCB is a pathological variable based on measurements of how much the primary tumor has shrunk and of the size of metastasis in axillary lymph nodes. Finding biological evidence to NACT response would allow for adequate and specific treatment, some histological variables have been found to be correlated with survival, such as the number Ki-67 positive cells (Elnemr et al., 2016), tumor infiltrating lymphocytes (Mao et al., 2016) and the Elston and Ellis grade (Elston and Ellis, 1991). Depending on the context, some alternative treatments have been found to help overall survival, such as those based on anthracycline and taxanes (Sakuma et al., 2011), carboplatin (Pandy et al., 2019) or with olaparib and talazoparib (Won and Spruck, 2020). Some treatments have emerged with targeted immunotherapy in combination with atezolizumab (anti-PD-L1 antibody) and nanoparticle albumin-bound (nab)paclitaxel (Won and Spruck, 2020). Most of the studies for NACT responses have been performed in clinical practices and based on pathological variables (Elnemr et al., 2016;Gass et al., 2018;Zhu et al., 2020). In addition, some studies have analysed sequencing and molecular profiles in order to better understand and stratify cohorts (Lehmann et al., 2016;García-Vazquez et al., 2017;Wang et al., 2019).
To the best of our knowledge, it remains unclear whether and to which extent NACT response can be predicted from biopsies taken prior to treatment, and only few works have addressed this question so far Ogier du Terrail et al., 2021).

Validation Procedure
Here, we present our procedures to replace NCV in order to train DNN in a context of small n large p. We first explain crossvalidation (CV), nested cross-validation (NCV) and their limitations. We propose two different procedures, better suited and show their effectiveness on a small case study.

Cross-Validation
CV is a common procedure for model evaluation or model selection, specifically in situations where the data set is relatively small. CV divides the initial data set into k f folds, denoted F cv 1 , F cv 2 , . . . , F cv k f and runs algorithms on the data sets with one fold left out. We define, for all j, the set F cv −j , which is the union of all folds expect for fold j:

.1 Cross-Validation for Model Selection
CV can be used for model selection or model tuning. The procedure that returns a tuned model M will be notated f cv .
We give the pseudo code in Algorithm 1, where H h 1 , . . . h i , ‥ is the set of hyperparameters (HP).

Cross-Validation for Model Evaluation
We can use CV to evaluate a given model and a HP set h. The procedure is similar to the pseudo code given in Algorithm 1, however we give in input a model and only one hyperparameter set and return t h . In this case, t h is an unbiased estimator of the performance, as no optimization of the hyperparameters took place. If however several sets of hyperparameters are tested as to minimize the accuracy measured by cross-validation, this accuracy is an overoptimistic estimation of the true accuracy. In order to get a realistic estimation of the accuracy, we therefore have to turn to nested cross-validation.

Nested Cross-Validation
NCV is a procedure that allows one to tune a model and effectively report an unbiased estimation of the performance of the tuned model.
Given sets of HP and a data set D, NCV corresponds to two nested loops of CV: The outer CV loop is for model evaluation, usually applied on test folds, sometimes referred to as outer folds. The inner CV loop is for model tuning, i.e., for each test fold, we perform a complete CV on the remaining data to correctly tune the model, and test the performance of the tuned model on data that has neither been used for training nor for HP tuning. We show the pseudo-code for NCV in Algorithm 2.
Another possible view is to see NCV as a simple CV for a model selection algorithm. For NCV, the model selection algorithm would be f cv .
It is important to note that as we are training DNN, we do not use a fixed hyperparameter set, H, but randomly generate the set as it has been shown that randomised search performs better (Bergstra and Bengio, 2012).

Limitations
DNN training suffers from inherent randomness, as the loss function is highly non-convex and possess many symmetries (Bishop, 2006). In addition, there are some stochastic differences between training runs, such as the random initialization of the weight parameters and the data shuffling, naturally leading to different solutions. Especially for small datasets, these stochastic variations lead to notable differences in performance when we repeat training with the same hyperparameters.
In the classical setting, CV provides us with a set of hyperparameters that lead to a model with optimal performance, as estimated in the inner loop. For DNN trained on small datasets, there is no guarantee that the same set of hyperparameters will lead to similar performance, and for this reason retraining is not guaranteed to lead to a very good solution.
Another problem with the retraining in line 10 in Algorithm 1 is the use of early stopping. Early stopping is a very powerful regularization procedures that choses experimentally the point between the under-and overfitting regime, but for this it requires a validation set. Early stopping would therefore not be applicable in the traditional CV-scheme with retraining.

Nested Ensemble Cross-Validation
Due to the incompatibility between NCV and early stopping we propose to modify the model selection procedure, i.e., function f cv shown in Algorithm 1. In particular we do not perform retraining and return an ensemble of the models used during CV. Similarly to NCV, we perform a CV where we propose to modify f cv into a better suited procedure, named f ecv (ensemble cross-validation), shown in Algorithm 3.
The main difference between f cv and f ecv is that we remove the final model retraining, i.e., line 10 of Algorithm 1 and give back the full set of k f models trained for all folds for the maximizing hyperparameters; the prediction is obtained by ensembling these models.
The advantage of this procedure is that we omit the retraining step which allows us to use early stopping for all individual models. In addition, we add another level of  regularization by the ensembling. Of note, f ecv can be used in an inner loop, too. This then leads to Nested Ensemble crossvalidation (NECV).

Nested Best Fold Ensemble Cross-Validation
Similarly to the procedure NECV, we can define another procedure that we name NBFCV. NBFCV is based on f bfcv   Frontiers in Signal Processing | www.frontiersin.org June 2022 | Volume 2 | Article 851809 6 defined in Algorithm 4. Compared to the preceding procedure, we inverted the two for-loops and retain the best model per fold. It then returns the ensemble of these selected models.

Simulations
In order to compare and validate our procedure presented in Section 2.1 we propose to conduct a series of simulation studies where the results will be given in Section 3.1. In particular, we wish to demonstrate that DNN training given a set of HP can lead to inconsistent models and that NCV therefore might provide under-performing models compared to NECV and NBFCV, the proposed validation procedures.

Data Set Simulation
We simulated a simple balanced binary data, of size N = 350 in a medium to high dimensional setting with p = 256. We have ∀i ∈ [(1, N)], Y i ∈ (−1, 1) and X i~N (p y i , σI p ) where p j is a cluster center, σ a given standard deviation and I p the identity matrix of size p. We set one cluster center at p 1 = (1, 1, . . ., 1) and the second cluster center at p −1 = (−1, −1, . . . , −1). In Figures 1A-D respectively, we show four plots of the simulated data reduced to two dimensions thanks to a UMAP (McInnes et al., 2018), with standard deviations set to 2, 6, 10 and 14 respectively. Naturally, when the standard deviation increases the data becomes less separable.

Model
We apply a simple DNN model, composed of two layers, a first hidden layer with 256 hidden nodes and a classification layer with two nodes, the model is depicted in Figure 1E.
In particular, we minimise a cross-entropy error with an Adam optimiser and a batch size of 16. The tunable parameters are the weight-decay, drop out and learning rate. As an extra regularisation we use batch normalisation.

Data Generation and Annotation
The data set used was generated at the Curie Institute and consists of annotated H&E stained histology needle core biopsy sections at 40× magnification sampled from a patient suffering from TNBC. In this paper, we evaluate the prediction of the response to treatment based solely on a biopsy sectioned prior to the treatment. As discussed in the introduction, not all patients respond to NACT, and we are therefore aiming at predicting the response to NACT based on the biopsy. In particular, each section was quality checked by expert histopathologists.
For each patient, we also collect WSI after surgery, allowing an expert pathologist to establish the residual cancer burden, as a proxy for treatment success. Out of the 336 samples that populate our data set, 167 were annotated as RCB-0, 27 as RCB-I, 113 as RCB-II and 29 as RCB-III. This data set is twice as large as the data set used in our previous study . Similarly to this study, we refine the number of classes in order to avoid the problem of under-represented class. We investigate two prediction settings: 1) pCR (no residuum) vs. RCB (some residuum) and 2) pCR-RCB-I vs. RCB-II-III, which is clinically more relevant, as it is informative of a patient's prognosis.

Data Encoding
As each biopsy section is relatively big, we wish to reduce the computational burden of feeding the entire biopsy to our algorithms. Instead, given a magnification factor, we divide each biopsy into tiles of equal sizes, 224 × 224 and project this tile into a lower dimensional space. We use a pre-trained DNN on ImageNet (Jia et al., 2009) such as ResNet (He et al., 2016) which produces a encoding of size 2048. This process is illustrated in Figure 2 where each biopsy section is converted into an encoded matrix of size n i × P where P is the size of the resulting encoding and n i the number of tiles tissue extracted from tissue i, i ∈ N.
In Table 1 we show the average number of tiles, n i and variance at different magnification factors: highest resolution i.e. no down-sampling (2 0 = 1), down-sampling by a factor 2 1 and by a factor 2 2 = 4.
The size of the data remains relatively large even after this reduction. We further reduce the size of the tile encoding with a PCA (Jolliffe, 2011), and project each tile encoding into a space approximately 10× smaller. By keeping 256 components, we keep 93.2% at 2 0 , 94.0% at 2 1 and 94.3% at a magnification factor of 2 2 of the explained variance.

Mathematical Framework
The data set will be denoted by D (X i , Y i ) i∈ [[1,N]] , where every item indexed by i in D is a joint variable (X i , Y i ), N is the size of the data set, X i is the input sample and Y i the corresponding label. As described in the previous section, each tissue is 2 | Model architecture and pooling strategy, as described in the text. CHOWDER has been published previously (Courtiol et al., 2018). represented by a bag of tiles of variable sizes, in particular ∀i ∈ [(1, N)], X i ∈ R ni×P and Y i ∈ (0, 1) for task (1) or (2). This is simply a multiple instance learning framework, and such a framework has already been implemented for histopathological data (Xu et al., 2012(Xu et al., , 2017Courtiol et al., 2018;Couture et al., 2018). We simplify this framework by setting ∀i ∈ i ∈ [(1; N)], n i = n MF 1 which is set accordingly to the chosen magnification factor. For a given sample i, if n i > n MF we down-sample X i , otherwise we up-sample X i to the correct size. We evaluate our models by using the Area Under the Curve of the Receiver Operating Characteristic for measuring the performances in our two binary settings.

Neural Network Architectures
Today, DNN models for WSI classification usually consist in three steps: starting from encodings that are usually provided by pretrained networks, a reduction layer might be applied, followed by an aggregation step that computes a slide level representation from the tile level representations and a final module that maps the slide level representation to the output variable.
In Figure 3, we show a basic example for such an architecture along these lines, with the three algorithmic blocks highlighted in gray. At the tile-level computation, we use 1D convolutions to transform the input encodings X i into a more compact representation X i ′. The tile representations X i ′ are then summarized by a pooling layer, providing us with the biopsy section profile Z i . For this, we can use a standard pooling layer such as an average pooling to quantify the abundance of specific tile patterns, or more complex, attention-based pooling, such as WELDON (Durand et al., 2016). Finally, from Z i , the slide variable is predicted.
In this article, we test several encoding and agglomeration strategies which are explained in the next sections.

Encoding Projection
In Figure 3, the baseline architecture is illustrated (OneOne), with a 1D convolution for the tile level encoding and one fully connected layer at the slide level. In Figure 4, we test a deeper architecture for the encoding projections, consisting in three consecutive 1D convolutions, including bottleneck layers (depicted in orange), according to best practice in deep learning .
Furthermore, we also experiment with skip connections by concatenating the first tile representations to the final representation X i ′ and by concatenating Z i prior to the final softmax. We name this structure ThreeTwoSkip and illustrate it in Figure 5.

Pooling Layers
In terms of pooling layers, we experiment with: average pooling shown in Figure 6A, WELDON (Durand et al., 2016) shown in Figure 6B, a modified version of WELDON shown in Figure 6C and the concatenation of the first and the third is named WELDON-C (for context). The DNN that uses WELDON-C will be named CONAN 2 .
The WELDON pooling is an attention-based layer which filters tiles based on a 1D convolution score. In particular, it retains the top and lowest R ∈ N* achieving scores as Z i . This architecture has shown excellent results for specific problems where the biological evidence lies in the detection of one type of specific tiles, like cancer regions (Courtiol et al., 2018). The method however suffers from identifiability issue, i.e., the model can not differentiate between two tiles achieving high or low score. In addition, the agglomeration strategy seems less promising in cases where the information resides in the percentage of tiles of a certain type. By providing a context in which a tile was selected, we allow the model to better differentiate between the selected tiles, thus allowing different tiles with different meanings to be selected, this can be particularly efficient when relevant information is based on different tile patterns.
We recap all the tested models in Table 2.

Baseline Approach
In addition to comparing our proposed architectures to CHOWDER, we also compare them to a much simpler approach where we propagate the slide label to the tile level. If a slide is positive, then we assume all extracted tiles from this slide are positive, if a slide is negative, we assume that all tiles are negative. This is the simplest form of MIL. The training set is therefore huge, and we also known that there will be many errors, as many tiles do not contain any useful information regarding treatment response. Nevertheless, such an approach can work if there is a large fraction of informative tiles.

Model Tuning
We perform a random grid search for most parameters and only in suitable ranges. For the learning rate and weight decay we perform a random log sampling for a random scale associated to a random digit. We range from a scale of 10 −6 to 10 −3 for the learning rate and from 10 −4 to 10 −1 for the weight decay. We randomly sample a drop out from a uniform U [0: 0.4] . We randomly sample a bottleneck layer size from the following list (8,32,64) and the size of the larger representations are randomly sampled from (64, 128).

High Performance Variability in DNN Training
In Figure 7A, we show the average variance of our model with increasing standard deviation σ. In particular, for each σ, we generate 100 simulated dataset with a standard deviation of σ and train 1000 DNN on the same data and with the same HP. As this is simulated data, we evaluate the performance of each training on a large independently simulated test set instead of using the outer CV loop (Varoquaux, 2018). We found that setting the learning rate to 1.10 −4 , the weight decay to 5.10 −3 and drop out to 0.4 tend to always return reasonable results for our simulation setting.
As the standard deviation σ of the simulated data increases, we expect more overlapping between our two classes and naturally, the classification accuracy decreases. For lower σ, regardless of using early stopping or not, the models reaches perfect scores.
In Figure 7, we observe that the more difficult the problem (larger σ), the lower the accuracy, but also the larger the variance: not only do we predict less well, but also does the performance variation increase, such that by retraining a model with the same hyperparameters is not guaranteed at all to provide a model with similar performance. We also see from Figure 7A that early stopping alleviates this problem and consistently reduces the variance in performance, in particular for higher σ.

Nested Cross-Validation Leads to Under-performing Models
We compare the performance of the proposed validation procedures to NCV with the number splits k f = 5 in Figure 7B with 95% confidence intervals around the estimator. On the x-axis we have the standard deviation σ of the simulated data and on the y-axis we have the difference between the Naive Bayes estimator with the corresponding performance, therefore the lower the curve the better the procedure. In addition to Figure 7B, we give the corresponding accuracy score for each σ and validation procedure explicitly in Supplementary Table S1. As the dataset distribution is known, the Naive Bayes is the best classification rule that can be implemented and can be viewwed as an upper bound of the performance. For each σ, we collect 100 estimators with Algorithm 1, Algorithm 3 and Algorithm 4. Next, we compare several strategies for NECV and NBFCV: NECV-1/NBFCV-1, where we keep the best scoring Comparison of models trained with NCV, NECV and NBFCV. On the y-axis we have the difference in accuracy between the Naive Bayes model and a given validation approach. As the data distribution is known, the Naive Bayes is a theoretical upper limit of the achievable performance. Curves are shown with 95% confidence intervals. A lower curve implies a better score. We first notice that the NCV curve is lower or equal to any of the NECV/NBFCV curves. The best performing model is NECV-i.e., the average of all selected models from the inner CV. In particular NECV has a higher Accuracy than NCV by at least 2%. NBFCV under-performs by a slight margin compared to NECV and seems to be equivalent to NECV-3 in terms of performance.
We conclude that retraining the model without an outer validation score leads to lower overall performance and early stopping is a very useful regularization technique for small sample size problems.

Prediction of Response to Neoadjuvant Chemotherapy
We next applied the different architectures detailed in Section 2.4 (also summarized in Table 2) to the problem of the prediction of response to neoadjuvant chemotherapy in TNBC. We tested 3 different image resolutions (0, 1, 2), 0 being the highest resolution. In order to get realistic estimations of the performance, while using early stopping, we perform the validation proposed in Section 2.1. In Figures 8A,B we show the average AUC ROC performance on the residual and prognostic prediction tasks, for all methods shown in Table 2 and for all resolution levels.
For the task of predicting the residual cancer, the best performing model would be the CONAN-c model at resolution 2 with an AUC of 0.654 ± 0.049. Others model's performance range in between 0.55 and 0.60 of AUC with higher standard deviations. Models at resolution 0 seem to generally achieve higher scores then those at lower resolutions. Model architecture c seem to be better suited for this task than the others. The Average concatenated to WELDON-C pooling seems to perform slightly better then the rest. The method CHOWDER which gave excellent results on CAMELYON for cancer detection (Courtiol et al., 2018) and which has been a state-of-the-art solution in the field underperforms on our dataset for response prediction.
For the task of predicting the patient prognostic, the best performing model would be the Avg-b model at resolution 0 with an AUC of 0.601 ± 0.019. CONAN-c at resolution 2 performs similarly but with a much higher standard deviation. Neither resolution, nor model architecture and pooling layer seem to unanimously be better then the others. However, CHOWDER under-performs compared to the other proposed methods.

DISCUSSION
In this study, we set out to predict the response to neoadjuvant chemotherapy in TNBC from biopsies taken before treatment. A system that would allow to predict this response with high accuracy could help identifying patients with no or little benefit of the treatment and therefore spare them the heavy burden of the therapy.
From a methodological point of view, this is particularly challenging for three reasons: first, we do not know to which extent the relevant information is actually present in the image data. In addition, even if the relevant information is contained in the slide, the complexity of the related patterns is unclear. Second, biopsies only capture a part of the relevant information, as they are only a localized sample of the tumor. Third, as this is a project regarding a specific subtype, the cohort is relatively small, unlike many pan-cancer cohorts used in large Computational Pathology projects (Campanella et al., 2019).
In order to solve this problem, we have developed the model CONAN, that combines the power of selecting K tiles (top and bottom), but keeps both the ranking scores and the full tile descriptions to build the slide representation. We have compared this model with a number of different architectures, and achieved an AUC of 0.65.
We also tackled an important problem of model selection with cross-validation, a crucial step in particular for small datasets. We found that the retraining step in classical Nested cross-validation can lead to lower performances for small N, because the training is highly variable, and a network retrained with the optimal set of hyperparameters is not guaranteed at all to be optimal itself. We therefore have proposed a new cross-validation procedure relying on ensembling rather than retraining, and thus allowing to use early stopping as a regularization method.
Nevertheless, we must conclude that the prediction of treatment response is probably one of the hardest problems in Computational Pathology, and that even though we see that there is some degree of predictability, the results still seem far from clinical applicability. Also, another problem that is not addressed in this study is the applicability of trained networks across different centers. Clearly, we need more data to tackle these challenging questions. But it is also likely that even with much more data, AUCs will not reach very high levels by looking at biopsies alone. A promising avenue would therefore be to use other kinds of data in addition to histopathology data.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because they are the property of Institut Curie. Requests to access the datasets should be directed to thomas.walter@mines-paristech.fr.

AUTHOR CONTRIBUTIONS
PN was main contributor, developed the methods and code and generated the results. TL helped with the code. FR and TW designed the project. GB, ML, AVC, A-SH, and FR generated the cohort. GB and ML provided interpretation of the biological results. TW supervised the project. PN and TW wrote the paper.

FUNDING
PN was funded by the Ligue Contre le Cancer. TL was supported by a Q-Life PhD fellowship (Q-life ANR-17-CONV-0005). Furthermore, this work was supported by the French government under man-agement of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).