Deep Learning Classification of Lake Zooplankton

Plankton are effective indicators of environmental change and ecosystem health in freshwater habitats, but collection of plankton data using manual microscopic methods is extremely labor-intensive and expensive. Automated plankton imaging offers a promising way forward to monitor plankton communities with high frequency and accuracy in real time. Yet, manual annotation of millions of images poses a serious challenge to taxonomists. Deep learning classifiers have been successfully applied in various fields and provided encouraging results when used to categorize marine plankton images. Here, we present a set of deep learning models developed for the identification of lake plankton, and study several strategies to obtain optimal performance, which lead to operational prescriptions for users. To this end, we annotated into 35 classes over 17900 images of zooplankton and large phytoplankton colonies, detected in Lake Greifensee (Switzerland) with the Dual Scripps Plankton Camera. Our best models were based on transfer learning and ensembling, and classified plankton images with 98% accuracy and 93% F1 score. When tested on freely available plankton datasets produced by other automated imaging tools (ZooScan, Imaging FlowCytobot, and ISIIS), our models performed better than previously used models. Our annotated data, code and classification models are freely available online.


I. INTRODUCTION
Plankton are a key component of the Earth's biosphere. They include all the aquatic organisms that drift along with the currents, from tiny bacteria and microalgae, to larvae of vertebrates and invertebrates. Photosynthetic phytoplankton are responsible for about half of the global primary production [1] and therefore play a central role in atmospheric carbon fixation and oxygen production. Zooplankton are a broad group of aquatic microorganisms, spanning over tens of thousands of species [2], and comprising both carnivores and herbivores, the latter feeding on phytoplankton. Plankton are a critical component of aquatic food-webs, producing organic matter that forms the ultimate source of mass and energy for higher trophic levels [3], and serve as food for fish larvae [4]. The death and excretion of planktonic organisms results in massive amounts of carbon being sequestered, regulating the biological carbon pump locally and globally [5]. Plankton biodiversity and dynamics therefore directly influence climate, fisheries and the sustenance of human populations near water bodies.
Planktonic organisms, being mostly small in size, have short lifespans and a strong sensitivity to environmental conditions, which makes their diversity and abundances very effective indicators of environmental change and ecosystem health, particularly in freshwater ecosystems that suffer from combined exposure to human local impacts and global change, such as warming and invasive species [6]. Information on individual plankton species is also critically important for the monitoring of harmful algal blooms, which can cause huge ecological and economic damage and have severe public health consequences [7]. The diversity and abundance of plankton is generally measured using labour-intensive sampling and microscopy, which suffer from a number of limitations, such as high costs, the need for specialised personnel, low throughput, long sample processing times, subjectivity of classification, and low traceability and reproducibility of data. These limitations have stimulated the development of a multitude of alternative and automated plankton monitoring tools [8], some of which were recently applied in freshwater systems [9][10][11].
Studying freshwater environments offers the opportunity to address several issues related to (i) automated recognition of plankton taxa in systems that are heavily monitored for water quality, and (ii) the creation of plankton population time series useful for both research and lake management. It also presents a series of practical advantages. The number of species present in a lake is on the order of a few hundred, community composition changes on the scale of decades [12], and virtually all lakes of the same geographic/climatic region tend to share the same species pool of plankton taxa [13]. This would allow us to process real lake data with a diminished need to account for species variability, to build rather quickly a database that comprises all seen taxa, and to easily use our models for more than one site. Moreover, lakes are usually characterised by lower levels of non-planktonic suspended solids (e.g. sand, debris) compared to coastal marine environments, so one can expect to work with cleaner images, with a relatively small number of non-biological or non-recognizable objects being detected.
These digital imaging systems can produce very large volumes of plankton images, especially if deployed in-situ for automated continuous monitoring [10,15]. While the extraction of image features that describe important plankton traits like size and shape is well established [10,15], classifying large volumes of objects into different plankton taxonomic categories is still an ongoing challenge, and represents the most important component for plankton monitoring [17]. Automated classification of imaged plankton objects may substitute the time-consuming job of taxonomists [17] and allow sampling and counting taxa at high temporal and spatial resolution. Automation of plankton monitoring could represent a key innovation in the assessment and management of water quality, aquatic biodiversity, invasive species affecting ecosystem services (e.g. parasites, invasive mussels), and early warning for harmful algal blooms.
Automated plankton classification is characterized by a set of features that make this task less straightforward than other similar problems. The data sets used for training, as well as the images analyzed after deployment, cover wide taxonomic ranges that are very unevenly distributed (some taxa are very common and others are rarely seen; this is called data imbalance or class imbalance) [18], and this distribution changes over time, with new taxa appearing from time to time [19]. Moreover, many images do not belong to any taxon (e.g. dirt), or they cannot be identified due to low resolution, their position, their focus, or being cropped. Furthermore, labeling these data sets requires a high effort, because they need to be annotated by expert taxonomists. Sampling images from videos, as is done e.g. for camera traps [20], is not helpful, because the alignment of the organisms with respect to the camera does not generally change throughout the exposure time.
Image classification models fall into several broad categories, including unsupervised models (which cluster and classify images without any manually assigned tags), supervised models (which use a training library of manually identified images to develop the classification model), and hybrid models (which combine aspects of supervised and unsupervised learning). Even though there is current research that relies on unsupervised learning [21,22] or on the development of specific kinds of data preprocessing [23,24], the current state of the art for classifying plankton data sets most often involves deep convolutional neural networks trained on manually classified images [25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41], 1 which allow for great flexibility across applications and have proven more satisfactory than relying on the manual extraction of features [43]. These applications very often resort to transfer learning [44], which consists of using models that were pretrained on a large image dataset (usually ImageNet [45]) and adapting them to the specific image recognition problem. Transfer learning was used in a two-step process to deal with data imbalance [28], but most commonly it is used because it allows for the training of very large models in reasonable times. The main differences among the various applications to plankton often lie in the kind of image preprocessing. For example, Ref. [27] filters the images in different ways and feeds both the original and the filtered images as input to the models, Ref. [31] applies logarithmic image enhancement on black and white images, and Ref. [36] tests different ways of resizing the pictures.
Furthermore, several models can be used in synergy in order to obtain better performance (be it to deal with data imbalance or to reach a higher weighted accuracy). Two main approaches to combining multiple models are collaborative models and ensembling. The former consists of training models together to produce a common output [27,37], while the latter trains the models separately and combines the outputs at a later stage. Collaborative models were used recently to counter data imbalance, yielding high performance on single-channel (i.e. black and white) images obtained at Station L4 in the Western English Channel [37]. However, this involves simultaneously deploying several very heavy models, resulting in very high memory usage, unless one uses smaller versions of the typically used models (thus not allowing for transfer learning). Ensembling allows fusing virtually any number of learners, and resulted in very satisfactory performance when joining different architectures (where DenseNets most often do best) or kinds of preprocessing [36]. The mentioned methods for automated plankton classification were principally deployed in salt-water coastal habitats. To our knowledge, the only previous work performing image classification on freshwater images is Ref. [46], where the data does not come from an automated system. That work studies a small balanced dataset sorted in four categories (daphnia, calanoid, female cyclopoid, male cyclopoid), and obtains a maximum classification accuracy of 93%.
In this paper, we study the classification of plankton organisms from lake ecosystems, on a novel dataset of lake plankton images that we make freely accessible, together with code that makes it easy to train and deploy our deep neural networks. We analyze plankton images from the Dual-magnification Scripps Plankton Camera (DSPC), which is a dark field imaging microscope currently deployed in Lake Greifensee (Switzerland) [10], and specifically the images from the 0.5x magnification, which targets zooplankton and large colony-forming phytoplankton taxa in the range of 100 µm to 1 cm. We manually annotated 17943 images consisting of $n_c = 35$ unevenly distributed categories, which were collected in-situ using the DSPC deployed at 3 m depth in Lake Greifensee. We propose a set of recent deep learning models that make use of transfer learning, and we combine them through versions of collaborative and ensemble learning. In particular, we explore several ways to ensemble our models based on recent findings in statistics [47,48]. We evaluate the performance of our models on publicly available datasets, obtaining a slight but systematic increase in performance with respect to the previous literature. The simplest of the presented models were used to analyze part of the data in Ref. [10].

A. Data Acquisition
We used images coming from the DSPC [10], deployed in Lake Greifensee, and acquired from wild plankton taxa across the years 2018 to 2020. 2 The DSPC takes images of the microscopic plankton taxa at user-defined frequencies and time intervals (for more details and camera settings see Ref. [10]). The original full frame images may contain from zero to several images of planktonic organisms, as well as non-organic matter. The full frames are segmented on site in real time, and regions of interest (ROIs), which contain e.g. plankton organisms, are saved and used for image feature extraction and classification. Images of objects at the boundary of the vision range of the camera are cropped, but we keep them anyway, as most of the time we are still able to identify them. The images have a black background, which favours the detection of ROIs. These have different sizes depending on the size of the detected object. For each ROI, we extracted 64 morphological and color features, and performed a series of graphical operations to make the image clearer. 3 In Fig. 1(a) we show some examples of what the final images look like. In App. A, we provide an extensive description of the dataset and all its classes, together with one sample image from each class in Fig. 6. In App. B we describe the aforementioned 64 morphological features.

B. Data Preparation
The DSPC can be run with two different magnifications [10], but in this paper we report only on the images taken at the lower magnification, which contain mostly zooplankton taxa and several large colonial phytoplankton. We manually annotated a dataset of 17943 images of single objects, into $n_c = 35$ classes. 4 In Fig. 1(b) we show the names of all the $n_c$ classes, along with the number of labeled images of each class. Note that there are 300 times more annotated images of the most common class (dinobryon) than of the rarest class (chaoborus).

C. Open-access availability of our dataset
We call ZooLake the described dataset of labeled plankton images. We give extensive details on ZooLake in Apps. A and B, and made the data openly available online at the following link: https://data.eawag.ch/dataset/eep-learningclassification-of-zooplankton-from-lakes .

D. Further data preparation
Since for most deep learning models it is not convenient to have images of different sizes, we resized our images in such a way that they all had the same size. The two simplest ways of doing this are either (i) resizing all the images to 128x128 pixels irrespective of their initial dimensions, thus not maintaining the original proportions, or (ii) shrinking them in such a way that the largest dimension is at most 128 pixels (no shrinking is done if the image is already smaller) and padding them with a black background in order to make them 128x128. The former method has the disadvantage of not maintaining proportions. The latter has the problem that in images with a very large aspect ratio there is a loss of information along the smallest dimension. 5 The two methods are compared in Ref. [36], where it is seen that procedure (i) gives slightly better performance in most datasets. Further, the information lost when reshaping the objects' aspect ratios can be recovered by using the initial aspect ratio (and similar quantities) as an extra input feature. For these reasons, the results we show in the main text are all obtained through method (i).
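The geometry of the two resizing procedures can be illustrated with a minimal sketch in plain Python. It only computes the resulting dimensions and does not touch pixel data; the function names are hypothetical and are not taken from the released code.

```python
def stretch_size(width, height, target=128):
    """Method (i): stretch to target x target, ignoring the proportions."""
    return target, target


def shrink_and_pad_size(width, height, target=128):
    """Method (ii): shrink so the largest side is at most `target`
    (small images are left untouched), then pad with black up to
    target x target. Returns ((new_w, new_h), (pad_w, pad_h))."""
    longest = max(width, height)
    if longest > target:
        # Scale both sides by the same factor to keep the proportions.
        new_w = max(1, round(width * target / longest))
        new_h = max(1, round(height * target / longest))
    else:
        new_w, new_h = width, height
    return (new_w, new_h), (target - new_w, target - new_h)


def aspect_ratio(width, height):
    """Quantity that method (i) discards and that can be fed back as a feature."""
    return max(width, height) / min(width, height)
```

For a 1280 x 50 image, `shrink_and_pad_size(1280, 50)` gives a 128 x 5 object padded by 123 rows of black, which shows the loss of information along the short side, while `stretch_size` always returns 128 x 128 and the original proportions survive only through `aspect_ratio`.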
In order to artificially increase the number of training images, we used data augmentation, consisting of applying random deformations to the training images. The transformations we applied include rotations up to 180°, flipping, zooming up to 20%, and shearing up to 10%. As for the morphological and color features, we calculated 44 additional ones (see App. C), and standardized the resulting 111 features to have zero mean and unit standard deviation.
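Feature standardization can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes that the mean and standard deviation are estimated on the training rows only, that the population standard deviation is used, and that constant columns are mapped to zero. None of these choices is stated in the text.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation, estimated on training data."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 so it maps to 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Shift and scale every feature to zero mean and unit standard deviation."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

The same `means` and `stds` would then be reused for the validation and test features, so that no information from those sets enters the preprocessing.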

E. Training, validation and test
We split our images into training, validation and test sets, with a ratio of 70:15:15. The exact same splits were used for all the models. The validation set was used to select the best model (hyper)parameters, while the test set was set aside throughout the whole process, and used only at the very end to assess and compare the performance of all the proposed models.
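A fixed 70:15:15 split that can be reused across models can be sketched as below. The seed, the function name and the use of a plain (non-stratified) shuffle are assumptions made for illustration; the text does not say how the split was drawn.

```python
import random


def split_indices(n_items, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle the indices 0..n_items-1 once and cut them into three parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed: same split every time
    n_train = int(fractions[0] * n_items)
    n_val = int(fractions[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

Calling `split_indices(17943)` once and storing the three index lists is one way to guarantee that every model sees exactly the same training, validation and test images.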

F. Deep learning models
A common challenge when choosing deep learning architectures is how to best jointly scale architecture depth, width and image resolution. A recent solution was given in Ref. [49], which proposes a scaling form for these three variables simultaneously, together with a baseline model, called EfficientNet-B0, for which this scaling is particularly efficient. This results in better performance than previous state of the art models, with a smaller investment in terms of model parameters and number of operations. The provided scaling form allows us to obtain efficiently scaled models according to how many computational resources we are willing to invest. These models, ordered by increasing size, are called EfficientNet-B1, EfficientNet-B2, EfficientNet-B3, EfficientNet-B4, EfficientNet-B5, EfficientNet-B6, and EfficientNet-B7. Given the aforementioned large efforts to apply deep learning models to plankton classification, we believe that it is worthwhile to assess the performance of these architectures on plankton recognition. Aside from those, we also test other deep neural network architectures, some of which were already used successfully for our kind of problem.
4 Throughout this text, we use the machine-learning connotation of the word "class", which indicates a category for classification, and not a taxonomic rank. In other words, our classes are not necessarily related to the taxonomic classification of the categories. For example, we call "class" categories like "diatom chain", "unknown" or also "dirt".
5 Imagine that an image is originally 1280 × 50 pixels. Re-scaling the largest dimension to 128 pixels, maintaining the proportions, implies that the resulting image is only 5 pixels high, which means that we almost completely lose the information contained in the image. Further, with method (ii), the large images are re-scaled, while the small ones are not, so even in this case the image size undergoes a non-linear transformation.
In the main text of this manuscript, we report on 12 different models. These are the EfficientNets B0 through B7 [49], InceptionV3 [50], DenseNet121 [51], MobileNet [52] and ResNet50 [53], trained with transfer learning (Sec. II G). Each individual model was trained four times, with different initial conditions from the same parameter distribution. 6 Additionally, we trained multi-layer perceptrons (MLPs) using as input the 111 morphological and color features mentioned in Sec. II D, and trained Mixed (collaborative) models that combine the MLPs with a larger model trained on images (Sec. III B). In Fig. 2 we sketch the structure of these Mixed models. Finally, we also trained 4-layer convolutional networks, to assess whether through specific kinds of ensembling we could reach performance that matches larger models (App. E).
FIG. 2. Diagram of the three main kinds of models that we mention in our paper. Image models are convolutional networks that receive only images as input, Feature models are MLPs that receive as input only features extracted from the image, but not the image per se, and Mixed models join and fine-tune Image and Feature models.

G. Transfer learning
Since training the mentioned models is a very demanding computational task, we used transfer learning, which consists of taking models that were already trained for image recognition on ImageNet, a very large dataset of nonplanktonic images [54]. 7 We loaded the pretrained model and froze all the layers. We then removed the final layer, and replaced it with a dense layer with $n_c$ outputs, preceded and followed by dropout. The new layers (dropout, dense, dropout, softmax with categorical cross-entropy loss) and learning rate were optimized with the help of the keras-tuner [55]. We ran the keras-tuner with Bayesian optimization search, 8 10 trials and 100 epochs, to find the best set of hyperparameters. Then, we trained for 200 epochs and used early stopping, i.e. interrupting the training if the validation loss did not improve for 50 epochs, and keeping the model parameters with the lowest validation loss. We then fine-tuned the model by unfreezing all the parameters and retraining with a very low learning rate, $\eta = 10^{-7}$, for 400 epochs.
6 All the initial conditions of all models were different realizations from the same distribution. We used a Glorot (or Xavier) uniform initializer, which is a uniform distribution within $[-a, a]$, where $a = \sqrt{6/(n_i + n_o)}$, and $n_i$ and $n_o$ are respectively the number of input and output units in the weight tensor. All the models were trained with the Adam optimizer, a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. We used, respectively, 0.9 and 0.999 as decay rates of the first and second moment estimates.
7 Transfer learning from models trained on plankton images was tried in Ref. [30], but it did not yield better results than using the models trained on ImageNet.
8 Bayesian optimization is a trial-and-error based scheme to find the optimal set of hyperparameters [56].
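The early-stopping rule used during training (stop when the validation loss has not improved for 50 epochs, keep the parameters with the lowest validation loss) can be sketched independently of any deep learning framework. The function below is a hypothetical helper written for illustration; it takes a precomputed list of validation losses instead of training a model.

```python
def early_stopping(val_losses, patience=50):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    best_epoch: epoch with the lowest validation loss seen so far; its
                parameters are the ones that would be kept.
    stop_epoch: epoch at which training is interrupted, or the last epoch
                if the patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1
```

With `patience=50` and a budget of 200 epochs, training stops early only if 50 consecutive epochs bring no new minimum; otherwise it runs to the end and the best epoch is still the one restored.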

H. Ensemble learning
Ensemble methods use multiple independent learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone, often yielding higher overall classification metrics and model robustness [57,58]. For our study we made use of two ensembling methods: averaging and stacking.

Averaging
For every image, the output of a single model is an $n_c$-dimensional confidence vector representing the probability that the model assigns to each class. The model's prediction is the class with the highest confidence. When doing average ensembling over n models, we take the average over the n confidence vectors, and only afterwards choose the class with the highest confidence. With this procedure, all the models contribute equally to the final prediction, irrespective of their performance. We performed average ensembling on the following choices of models:
1. Across different models, as was successfully done for plankton recognition in Refs. [36,38].
2. Across different instances of the same model, trained independently 4 times. This is inspired by the recent observation that this kind of averaging can lead to better generalization in models with sufficiently many (but not too many) parameters [48]. We provide a deeper discussion in App. E.
3. Manual selection of the six best individual models (on the validation set) over all the models. These best models turned out to be DenseNet121, EfficientNetB2, EfficientNetB5, EfficientNetB6, EfficientNetB7 and MobileNet. For each, we chose the initialization that gave the best validation performance. We call this the Best 6 avg ensemble model.
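Average ensembling for a single image can be sketched in a few lines of plain Python. The function name and the toy confidence vectors are illustrative and do not come from the released code.

```python
def average_ensemble(confidences):
    """Average the confidence vectors of several models for one image.

    confidences: list with one n_c-dimensional confidence vector per model.
    Returns (predicted_class_index, averaged_confidence_vector).
    """
    n_models = len(confidences)
    n_classes = len(confidences[0])
    mean = [sum(vec[k] for vec in confidences) / n_models
            for k in range(n_classes)]
    # The class is chosen only after averaging, not by majority vote.
    predicted = max(range(n_classes), key=mean.__getitem__)
    return predicted, mean
```

The order of operations matters. For the three vectors `[0.9, 0.1]`, `[0.4, 0.6]` and `[0.45, 0.55]`, a majority vote over the single-model predictions would pick class 1, whereas averaging the confidences first picks class 0, because one confident model outweighs two hesitant ones.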

Stacking
Stacking is similar to averaging, but each model has a different weight. The weights are decided by creating a meta-dataset consisting of the confidence vectors of each model, and training a multinomial logistic regression on this meta-dataset. We performed stacking both across initial conditions and across different architectures. We call Best 6 stack the ensemble model obtained by stacking the six individual best models (these are the same models that we used for the Best 6 avg model).
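One way to write the stacking step explicitly is given below. It assumes that the meta-dataset input for each image is the concatenation of the n confidence vectors; the symbols $\mathbf{c}^{(m)}$, $\mathbf{z}$, $\mathbf{w}_k$ and $b_k$ are notation introduced here for illustration and do not appear in the original text.

```latex
\mathbf{z} = \bigl(\mathbf{c}^{(1)}, \dots, \mathbf{c}^{(n)}\bigr) \in \mathbb{R}^{n\,n_c},
\qquad
P(y = k \mid \mathbf{z}) =
\frac{\exp\bigl(\mathbf{w}_k \cdot \mathbf{z} + b_k\bigr)}
     {\sum_{j=1}^{n_c} \exp\bigl(\mathbf{w}_j \cdot \mathbf{z} + b_j\bigr)},
\qquad k = 1, \dots, n_c .
```

Under this reading, the weights $\mathbf{w}_k$ and biases $b_k$ are fitted by the multinomial logistic regression, and the stacked prediction is the class $k$ with the largest probability. Average ensembling then resembles the special case in which every model's confidence for class $k$ enters with the same weight.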

A. Performances
In Tab. I we summarize the performance of the individual models, along with the various forms of ensembling described in Sec. II H. We report test accuracy and F1-score since, depending on the specific application, one can be interested in one metric or the other. The accuracy is the fraction of correctly classified images, so it is dominated by the most numerous classes. The F1-score is the harmonic mean of precision and recall. 9 We average the F1-score among the classes in such a way that each class receives the same weight regardless of how many images there were of that class. We categorize the models in three ways, according to the kind of data they take as input. Feature models take numerical features extracted from the images, image models take the processed image, and mixed models take both features and image.
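The difference between the two metrics can be made concrete with a small plain-Python sketch. The per-class F1-score is computed as the harmonic mean of precision and recall, 2PR/(P+R), and then averaged with equal weight per class (macro averaging). The function names are illustrative, and the convention of assigning F1 = 0 to a class with no correct predictions is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of images whose predicted class matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean over classes of the per-class F1-score."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # class never predicted correctly
    return sum(scores) / len(scores)
```

On a toy set with nine images of class "a" and one of class "b", a classifier that always answers "a" reaches 90% accuracy but a macro-averaged F1-score below 50%, which is why both numbers are informative on an imbalanced dataset.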

Individual model performance
First, we focus on the performance of the single models. Already the MLP, our simplest model, which does not take the images as input, had a best accuracy of 91.2%. However, the F1-score below 80% reveals that the accuracy is driven by the predominant classes. All the image models performed better than the MLP both in terms of accuracy and F1-score. The model with the best F1-score is the EfficientNet-B7 (F1 = 90.0%), followed by the EfficientNet-B2, which obtained almost the same value, but with a much smaller number of parameters ($8.4 \times 10^6$ parameters instead of $6.6 \times 10^7$ parameters for EfficientNet-B7). 11 The lightest of the models we present is the MobileNet, with around $3.5 \times 10^6$ parameters, with a maximum F1-score of 89.1%.
We tried to further improve the performance of EfficientNets by adopting basic methods for dealing with class imbalance. We reweighted the categories according to the number of examples of each class, in order to give an equal weight to all of them despite the class imbalance. We did not notice sizable improvements, so we restricted to only two models. We report on this in App. D.
FIG. 3. Performance of single and ensemble models (same data as Tab. I). The solid circles are the single model performances. The red triangles represent the performance of average ensemble models, whereas the blue diamonds represent the performance of stacking ensemble models. Top-left: The first four columns show the test accuracy of each initial condition across different models (this corresponds to going along a column in Tab. I). The fifth column shows the performances of all 48 image models, and the sixth is restricted to the models constituting Best 6 avg and Best 6 stack. In all cases we show the result of ensembling over these models. Top-right: Same, but for the F1-score. Bottom-left: For each model, we show the test accuracy of the four chosen initial conditions, and of ensembling through them (this corresponds to going along a row in Tab. I). Bottom-right: Same, for the F1-score. Readers can refer to Tab. I for the value of each point in this figure.

Ensembling across initial conditions
As we discuss in App. E, ensembling across initial conditions can help reduce the generalization gap (i.e. the difference between train and test performance). This was shown for average ensembling [47,48], but we also tested it for stacking. We see (rightmost columns of Tab. I) that, both for stacking and averaging, this kind of ensembling improves the overall result compared to each individual model's performance. We also show this in Fig. 3-bottom, where in each column we show the performances of all the repetitions of a single model, as well as the result of ensembling through initial conditions. Average ensembling over (only four) initial conditions is very successful for some specific models such as EfficientNet-B2 and DenseNet121.

Ensembling across models
We also ensembled across available models. For consistency, we first used only one initial condition per architecture (randomly picked, without repetitions). The results shown in Tab. I and Fig. 3-top (first four columns of each plot) display a clear improvement when performing this kind of ensembling, which in most cases seems more effective than ensembling over initial conditions.

Overall Ensembling
Finally, we ensembled over all models and initial conditions, obtaining a further small improvement. We obtained a slightly larger improvement when ensembling on the six best models of the validation set (Best 6 avg and Best 6 stack), which had the further advantage of requiring fewer resources than using all 48 models. Our final best image model, Best 6 stack, has an accuracy of 97.9%, and an F1-score of 92.7%.
For practical purposes, the performances of Best 6 avg and Best 6 stack are even better than they appear if we take into account the nature of our dataset: the dataset is imbalanced, and for the most numerous two thirds of the classes we have almost perfect classification, as shown in Fig. 4, where we show the per-class performances. For the remaining third, the minority classes, the performance is good, though less reliable due to the very low number of test images at hand. If we take into account the number of available images, the only three classes with a lower performance are the container (or junk) classes: unknown, dirt, unknown plankton. 12 This is not surprising, since these classes contain a wide variety of different objects, and it is less of a problem from the point of view of plankton monitoring, since misclassifications involving these classes are less relevant (we show the confusion matrices in Fig. 9). If we exclude the three mentioned junk classes, we reach F1-score = 97.3%. If we only consider the 23 classes for which the ZooLake dataset contains at least 200 examples (and keep the junk classes with ≥ 200 examples), the F1-score goes up to 98.0%. Finally, if we exclude both the classes with fewer than 200 examples and the junk classes, we obtain F1-score = 98.9%.
Moreover, even when making mistakes, our models are not completely off. We can see this in Tab. II, where we report the top-2 metrics of the Best 6 avg and Best 6 stack models. These represent how good the models' guesses are if a correct second choice of the classifier is also counted as a success. We see that the macro-averaged recall increases by 3%, and the total number of misclassified images is halved, with the top-2 accuracy exceeding 99%.
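The top-2 accuracy mentioned above counts a prediction as correct when the true class is among the two classes with the highest confidence. A minimal sketch follows, with a hypothetical function name and toy inputs.

```python
def top_k_accuracy(y_true, confidences, k=2):
    """Fraction of images whose true class is among the k most confident.

    y_true:      list of integer class indices.
    confidences: list of confidence vectors, one per image.
    """
    hits = 0
    for label, conf in zip(y_true, confidences):
        # Class indices sorted from most to least confident.
        ranked = sorted(range(len(conf)), key=conf.__getitem__, reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(y_true)
```

With `k=1` this reduces to the usual accuracy, so the gap between `k=1` and `k=2` measures how often the correct class is the classifier's second choice.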

B. Mixed models
Since our image preprocessing did not conserve information on the image sizes, we trained mixed models that took as input a combination of the image and 111 numerical features calculated from the image. The numerical features were fed into the MLP described in App. C, while the images were given as input to one of the image models described in Tab. I. The two models were then combined and fed into a dense layer, followed by a softmax with categorical cross-entropy loss.
With both features and images (and no image augmentation) as input, we trained with a low learning rate $\eta = 10^{-5}$ for 400 epochs. For each choice of the initial conditions, each single image model was combined with its corresponding feature (MLP) model. We trained 12 mixed models for 4 initial conditions each, so 48 mixed models in total.
Then, we ensembled through models and initial conditions in the same way as with the image models described in Sec. II H. The test performance of the mixed models is shown in Tab. III. The single-model performances are slightly better than those obtained through image-only models (Tab. I). However, after ensembling, the performance of mixed models becomes quite similar to that of image models. The best F1-score of the mixed models improves that of the image models by 0.3%, reaching 93.0%.
[...] I. The bottom four lines depict the performances when using the four kinds of ensembling described in the main text. The italics represent the six models that we chose for Best 6 avg and Best 6 stack based on the validation F1-score (therefore their performance on the test set is not always the best). In bold, we represent the overall best for each sector.

C. Comparisons with literature on public datasets of marine plankton images
To compare our approach with previous literature, we evaluated our models on the publicly available datasets indicated in Ref. [24], which reports classification benchmarks on subsets of the ZooScan [59], Kaggle [60] and WHOI [18] plankton datasets. The ZooScan subset [24,59] consists of 3771 greyscale images acquired using the ZooScan technology from the Bay of Villefranche-sur-mer. It consists of 20 classes with a variable number of samples for each class. The Kaggle subset [24,60] comprises 14,374 greyscale images from 38 classes, acquired by the In-situ Ichthyoplankton Imaging System (ISIIS) technology in the Straits of Florida and used for the National Data Science Bowl 2015 competition. The distribution among classes is not uniform, but each class has at least 100 samples. The WHOI subset [18,24] contains 6600 greyscale images of different sizes that were acquired by FlowCytobot [14] from Woods Hole Harbor water samples. The subset contains 22 manually categorized plankton classes with an equal number of samples for each class.
We compared the performance of our image models with the best models of Refs. [24,36,38]. For WHOI, we used the exact same train and test sets, since the dataset splitting was available. For ZooScan and Kaggle we used 2-fold and 5-fold cross-validation, respectively, as in Ref. [38]. We used our Best 6 avg and Best 6 stack models, and did transfer learning starting from the weight configurations trained on our ZooLake dataset.13 We finetuned each of the 6 selected models belonging to Best 6 avg and Best 6 stack with a learning rate η = 10^-5, and followed with average and stack ensembling.14 As we show in Fig. 5, our Best 6 avg and Best 6 stack models always performed slightly better than all the previous methods/studies. The improvement in terms of F1-score is consistent throughout the three datasets, with a 1.3% improvement on the previously best model for ZooScan, 1.0% on Kaggle, and 0.3% on WHOI. The same data of Fig. 5 is available in Tab. V.

FIG. 5. Performances (accuracy/F1-score) of our Best 6 avg and Best 6 stack models (blue points) on the publicly available datasets (ZooScan, Kaggle, WHOI), and comparison with previous results from literature. The yellow points indicate ensemble models from Ref. [38]: SFFS (Sequential Forward Floating Selection, a feature selection method used to select models) and WS (Weighed Selection, a stacking method that maximizes the performance while minimizing the number of classifiers). The red points are the Fus models from Ref. [36], which fuse diverse architectures and preprocessing. The green points stand for non-linear multi kernel learning (NLMKL), where an optimal non-linear combination of multiple kernels (Gaussian, Polynomial and Linear) is learnt to combine multiple extracted plankton features.

Note that these improvements come with a further advantage. Our results require ensembling over a smaller number of models, and of total parameters. The 6-model average ensemble consisted of around 1.58 × 10^8 parameters, compared to the 6.25 × 10^8 (4.0 times more) of the best model in Ref. [38] and the 1.36 × 10^9 (8.6 times more) of the best model in Ref. [36]. A major advantage of having lighter-weight models is that it allows for a simpler deployment and sharing with field scientists.
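As an illustration of the average ensembling used above, the following minimal sketch averages the confidence vectors of several models and takes the class with the highest mean confidence. It is not the code released with the paper: the function name `average_ensemble`, the nested-list layout and the toy confidence values are assumptions made for this example only.

```python
def average_ensemble(confidences):
    """Average per-model confidence vectors and return predicted class indices.

    confidences[m][i][c] is the confidence that model m assigns to class c
    for image i (hypothetical layout chosen for this sketch).
    """
    n_models = len(confidences)
    n_images = len(confidences[0])
    n_classes = len(confidences[0][0])
    predictions = []
    for i in range(n_images):
        # Mean confidence of each class across all models for image i.
        mean = [
            sum(confidences[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        # The ensemble prediction is the class with the highest mean confidence.
        predictions.append(max(range(n_classes), key=mean.__getitem__))
    return predictions


# Toy example (made-up numbers): 3 models, 2 images, 3 classes.
toy_confidences = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # model 0
    [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]],  # model 1
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],  # model 2
]
ensemble_predictions = average_ensemble(toy_confidences)
```

In this toy case the individual models disagree on both images, while the averaged confidences select class 0 for the first image and class 2 for the second. Stack ensembling would instead train a small meta-model on the concatenated confidence vectors, which is not shown here.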

IV. DISCUSSION
In this paper, we presented the first dataset, to our knowledge, of lake plankton camera images, and showed that through an appropriate procedure of preprocessing and training of deep neural networks we can develop machine learning models that classify them with high reliability, reaching 97.9% accuracy and 93.0% (macro-averaged) F1-score. These metrics improve to 98.7% accuracy and 96.5% F1-score if we exclude the few container classes (dirt, unknown, unknown plankton), which do not identify any specific taxon, with the F1-score reaching 98.9% if we further restrict to the two thirds of the categories with a sufficient number of examples. We made both the dataset and our code freely available.15
We trained several deep learning models. Our main novelties with respect to previous applications to plankton are the use of EfficientNet models, a careful yet simple ensemble model selection in the validation step, and the exploration of ensembling methods inspired by recent work in the theory of machine learning [48]. We checked the utility of using mixed models that include as input numerical features such as the size of the detected object (in addition to the image), and found that this increases the single-model performance, but the gain is flattened out once we ensemble across several models (though the best F1-score still improved from 92.7% to 93.0%). We also checked whether the performance of the EfficientNets improved by correcting for class imbalance through class reweighting, and found no relevant improvement. We compared the performances of our models with previous literature on salt-water datasets, obtaining an improvement that was steady across all datasets.
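The gap between accuracy and macro-averaged F1-score reported above is typical of imbalanced datasets, because the macro average gives every class the same weight regardless of its abundance. The sketch below uses one standard definition of the two metrics (per-class F1 = 2TP / (2TP + FP + FN), averaged over all classes that appear in the labels or predictions). The function name and the toy labels are hypothetical, and the paper's own evaluation code may differ in details.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equally long label lists."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    # Assumption of this sketch: average over every class seen in either list.
    classes = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denominator = 2 * tp + fp + fn
        f1_per_class.append(2 * tp / denominator if denominator else 0.0)

    # Macro average: unweighted mean of the per-class F1 scores.
    return accuracy, sum(f1_per_class) / len(f1_per_class)


# Toy imbalanced example (made-up labels): 8 "common" and 2 "rare" images,
# with one "rare" image misclassified as "common".
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]
accuracy, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
```

Here a single mistake on the rare class barely moves the accuracy (0.9) but halves the recall of that class, so the macro-averaged F1 drops to about 0.80.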
The best performing individual models were EfficientNets, MobileNets and DenseNets. Surprisingly, the performance of the EfficientNets did not scale monotonically with the number of model parameters, perhaps due to the class imbalance of our dataset. The EfficientNets B2 and B7 were the best performing, but B2 uses a smaller number of parameters. If we had to select a single architecture, our choice would lean towards MobileNet or EfficientNet-B2, given their favorable tradeoff between performance and model size. If we apply ensembling, averaging and stacking provide similar performances, so we prefer averaging because it is simpler. As for Mixed models, their narrow increase in performance after ensembling does not seem to justify their additional complexity in terms of deployment.
The Scripps Plankton Camera systems are a new technology that allows users to obtain large volumes of high-resolution color images, with virtually any temporal frequency. We noticed that the images that we obtained were clearer than those coming from marine environments (cf. Ref. [15]), which favoured the process of annotation and classification. Additionally, the taxonomic range is more stable during the seasonal progression compared to marine studies: fewer taxa are present in lake than in coastal marine environments, colonisation by new taxa is relatively rare at the inter-annual scale (new taxa do not appear often), and lakes of the same region share a large part of the plankton community composition. This makes the study of lake plankton dynamics an interesting and more controlled case study for method development due to its relative ecological simplicity and temporal stability, and implies that classifiers for lake taxa are more robust in these environments over space and time. This is particularly important from an application point of view, since the tools we developed in this paper are not only applicable for analyzing plankton population time series in Lake Greifensee, addressing problems such as inferring interactions between taxa and predicting algal blooms, but may also be transferable to other similar lakes. Lakes represent very important water resources for human society and require routine monitoring for water quality and provision of ecosystem services. The models developed in this study can be directly used in real-world monitoring (e.g. a preliminary version of our models was already used in Ref. [10]).

dirt: In this category we included pictures of inorganic and organic material that were clearly not plankton.
eudiaptomus: A zooplankton genus of the order Calanoida. The first antennae are very long, more than half the length of the body. The egg sacs are easily distinguished from those of cyclops.
filament: This folder includes different phytoplankton genera with an elongated, cylindrical shape, appearing as a single filament.
fish: This category includes all pictures where young fish were partially photographed.
fragilaria: This folder contains the colony-forming diatoms Fragilaria crotonensis and Fragilaria capucina. They both have cells in large ribbon-like colonies. With the low magnification camera it was not possible to differentiate between these two species.
hydra: A macroscopic organism of the phylum Cnidaria with a single body axis and tentacles. The body is a hollow tube with a gelatinous layer, with a texture that is easy to identify with the camera.
kellicottia: The genus Kellicottia is a rotifer that is easy to identify by its elongated body and its posterior and anterior spines.
keratella cochlearis: This species of rotifer has an oval-shaped body with a long posterior spine and short anterior spines. It is sometimes difficult to distinguish from the brachionus rotifer, depending on the angle of the picture.
keratella quadrata: In this category the taxonomic resolution of this rotifer is to species level because one can distinguish the 2 caudal spines at the base of the body.
leptodora: A top predator genus of zooplankton that can reach a size of almost 2 cm. Most of the pictures in this category show only parts of its antennae or body, but are still recognizable.
maybe cyano: This algal category comprises different genera of Cyanobacteria, all forming gelatinous colonies of different shapes with small cells inside. Due to the light and the darkfield background they look slightly blueish on the DSPC. This class possibly also contains non-organic material, like sand or debris, because it looks very similar to some cyanobacteria colonies (especially clathrate Microcystis colonies) and the two are hard to distinguish at this magnification. The gathering of "maybe cyano" pictures started at blooms of cyanobacteria that were confirmed with the higher magnification camera, where cyanobacteria are more easily recognizable. Then, looking at pictures from the lower magnification camera (the one made for zooplankton and discussed in this paper), the taxonomists could be more confident in tagging the images they thought were cyanobacteria colonies. This is why the folder is called maybe cyano. With the 0.5x magnification it is very difficult to be sure; but with expert knowledge, and by looking at the higher magnification pictures taken at the same time, we could confirm that there were many cyanobacteria colonies at that time. This taught us that pictures of cyanobacteria colonies look very similar to the ones in this class, and enabled the tagging.
nauplius: In this zooplankton category we classified the larval stage of all copepods. Nauplii were distinguishable because of their antennae and mandibles and the absence of a thorax.

paradileptus: The genus Paradileptus is a ciliate that has not been seen using traditional fixation methods with Lugol solution, which may have affected its preservation [61]. It has a conical body with a long tapering, spiraling neck region.

polyarthra: The most important feature of this rotifer is the well developed paddles (or blade-like projections) originating on the body below the head, and the absence of a foot, like in Brachionus.
rotifers: In this category we classified smaller rotifers and pictures of other rotifers that did not belong to any of the categories already containing rotifers (brachionus, conochilus, kellicottia, keratella, synchaeta or trichocerca), as well as pictures that were either not sharp or not taken from the right angle to show the features needed to classify them into another category. In some cases, identifying a rotifer to genus level requires seeing a trait that is only visible from a specific plane.
synchaeta: Synchaeta are rotifers with a conical body that, similarly to trichocerca, have a foot and toes, although here these are reduced.

trichocerca: The genus Trichocerca comprises rotifers with a lorica and 1 or 2 characteristic long toes emerging from the foot. Anterior spines can be present but are not seen with the current magnification.
unknown: Objects for which it was difficult for us to decide whether they were dirt, or part of zooplankton or algae, were placed into an unknown category.
unknown plankton: In this category we include all objects that we thought could be plankton but that, because of the sharpness, focus or angle at which they were photographed, we could not assign to a labeled folder.
uroglena: A Chrysophyceae algal genus with spherical colonies. They very often occur in blooms, that is, in very high densities.

Recent work in Ref. [48] on the double descent peak [63,64] showed that a major source of the generalization error is initialization variance. This variance can be attenuated by ensembling across different initializations of the same model. This was shown for simple balanced binary datasets in Refs. [47,48], and was especially useful near the interpolation threshold.17 In our case, we did not know where the interpolation threshold was, but we could assume that the models shown in Tab. 1 of the main text are highly overparameterized, and even then models such as Eff4 enjoyed over a 2% improvement thanks to ensembling over only 4 initial conditions.
We can then expect that smaller convolutional networks (thus, closer to the interpolation threshold) can benefit more from ensembling, and we might be able to reach accuracies similar to those of EfficientNets with much simpler models. Therefore, we trained convolutional networks that are close to the interpolation threshold, and investigated the improvement obtained through ensembling. In Fig. 8 we show results from a four-layer convolutional network (conv4) obtained through a Bayesian optimization search (25 trials, 30 epochs) to select the best convolutional network with between 2 and 5 convolutional layers, filter size between 32 and 128, kernel size between 8 and 32, between 64 and 256 dense units, and learning rate between 10^-5 and 10^-2. We used no early stopping here, since ensembling should replace its efficacy [47]. The average final training accuracy that we obtained was 99.24%. We ran the model with K = 17 different initial conditions. Fig. 8 shows the test metrics while ensembling over an increasing number of initial conditions. Blue circles represent the performance of individual conv4 models, whereas the orange dots represent the cumulative performance of average ensembles of the model. The benefit of ensembling is clearly visible, as already with five initial conditions we reach an accuracy (F1-score) around 0.94 (0.8). Ensembling over 17 initial conditions gives a 0.945 accuracy (0.82 F1-score). This is likely to increase if we increase K, but it does not seem that the performances of the EfficientNets can be reached.
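The cumulative curves described above (the orange dots of Fig. 8) can be sketched as follows: models trained from different initial conditions are added one at a time, and after each addition the test accuracy of the average ensemble of all models seen so far is recorded. This is a minimal illustration under assumed names and made-up confidence values, not the code used to produce Fig. 8.

```python
def cumulative_ensemble_accuracy(confidences, labels):
    """Accuracy of the average ensemble of the first k models, for k = 1..K.

    confidences[k][i][c] is the confidence of the k-th trained model
    (k-th initial condition) for class c on test image i; labels[i] is the
    true class index. Both layouts are assumptions of this sketch.
    """
    n_images = len(labels)
    n_classes = len(confidences[0][0])
    # Running sum of confidences; its argmax equals the argmax of the mean.
    running = [[0.0] * n_classes for _ in range(n_images)]
    accuracies = []
    for model_confidences in confidences:
        hits = 0
        for i, row in enumerate(model_confidences):
            for c, value in enumerate(row):
                running[i][c] += value
            predicted = max(range(n_classes), key=running[i].__getitem__)
            hits += predicted == labels[i]
        accuracies.append(hits / n_images)
    return accuracies


# Toy example (made-up numbers): 3 initial conditions, 3 images, 2 classes.
# Each individual model misclassifies a different image.
toy_labels = [0, 1, 0]
toy_runs = [
    [[0.9, 0.1], [0.6, 0.4], [0.7, 0.3]],
    [[0.4, 0.6], [0.1, 0.9], [0.8, 0.2]],
    [[0.7, 0.3], [0.3, 0.7], [0.45, 0.55]],
]
curve = cumulative_ensemble_accuracy(toy_runs, toy_labels)
```

In this toy case every individual model has accuracy 2/3, but because their errors fall on different images the ensemble of the first two (and of all three) classifies every image correctly, which is the variance-reduction effect that ensembling over initial conditions relies on.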

Tables
In Tab. V we compare the performance of our models on public datasets with previous literature. The numbers in the table correspond to the data shown in Fig. 5 [36,38]. The last line stands for non-linear multi kernel learning (NLMKL), where an optimal non-linear combination of multiple kernels (Gaussian, Polynomial and Linear) is learnt to combine multiple extracted plankton features.
In Tab. VI we show the amount of wallclock time required to train our models on an NVIDIA GeForce RTX 2080 Ti GPU.


FIG. 1. (a): Sample images from the DSPC in Lake Greifensee. (b): Abundance of each class in our dataset. The word class is intended in the classification sense, and does not indicate the taxonomic rank. Note that the y axis has a logarithmic scale.

FIG. 4. Per-class precision, recall and F1-score of the Best 6 avg (left) and Best 6 stack (right) models on the test set, sorted based on Fig. 1(b).

FIG. 6. Examples of images from all 35 plankton classes in their original proportion.

FIG. 8. Test accuracy (left) and F1-score (right) of four-layer convolutional networks. Blue dots represent the performance of individual models, whereas the orange dots represent the cumulative performance of average ensembles of the model.

TABLE II. Top-1 and top-2 recall and accuracy. Top-n scores treat true positives and false negatives based on the n highest values of the confidence vector. In other words, the top-2 scores are the model performances in the case that either of the top two guesses is correct.

TABLE V. Data of Fig. 5 of the main text. Performances (accuracy/F1-score) of our Best 6 avg and Best 6 stack models on the publicly available datasets (ZooScan, Kaggle, WHOI), and comparison with previous results from literature. Different ways of creating ensembles are identified with the keywords SFFS (Sequential Forward Floating Selection, a feature selection method used to select models), WS (Weighed Selection, a stacking method that maximizes the performance while minimizing the number of classifiers) and Fus (fusion of diverse architectures and preprocessing).

TABLE VI. Training times of our image models on an NVIDIA GeForce RTX 2080 Ti GPU. The second column (training time, in hours) represents the time to train a model across all phases of training, for a single choice of initial conditions and hyperparameters. The third column (Bayesian search, in hours) depicts the time required to choose the best hyperparameters.