Gear-Induced Concept Drift in Marine Images and Its Effect on Deep Learning Classification

In marine research, image data sets from the same area but collected at different times allow seafloor fauna communities to be monitored over time. However, ongoing technological developments have led to the use of different imaging systems and deployment strategies. Thus, instances of the same class exhibit slightly shifted visual features in images taken at slightly different locations or with different gear. These shifts are referred to as concept drift in the domains of computational image analysis and machine learning, as this phenomenon poses particular challenges for these fields. In this paper, we analyse four different data sets from an area in the Peru Basin and show how changes in imaging parameters affect the classification of 12 megafauna morphotypes with a 34-layer ResNet. Images were captured using the ocean floor observation system, a traditional sled-based system, or an autonomous underwater vehicle, which is used as an imaging platform capable of surveying larger regions. ResNet applied to the individual data sets separately, i.e., without concept drift, showed that changing object distance was less important than the amount of training data. The results for the image data acquired with the ocean floor observation system showed higher performance values than data collected with the autonomous underwater vehicle. The results from these concept drift studies indicate that collecting image data from many dives with slightly different gear may result in training data well-suited for learning taxonomic classification tasks and that data volume can compensate for light concept drift.

INTRODUCTION
Recent developments in machine learning-based classification and object detection in computer vision have been greatly influenced by deep learning algorithms (LeCun et al., 2015). In "classic" pattern recognition, engineering skills and experience were necessary to design a pipeline of algorithmic steps to map images to semantic categories using handcrafted feature descriptors and shallow learning architectures. In particular, the initial steps of pre-processing and feature computation required a considerable amount of experience and domain knowledge. At present, the available computation power and new concepts for improving the back-propagation learning algorithm allow the training of large multi-layer networks to learn the entire classification process, including signal transformation and feature representation, given the availability of sufficient training data. In parallel, marine research and environmental monitoring have advanced on the technological level, as new digital imaging hardware with increased storage capacities, higher resolution and improved image contrast, in combination with next-generation research platforms, became available. These platforms, such as AUV (autonomous underwater vehicle, Wynn et al., 2014), OFOS (ocean floor observation system, Purser et al., 2018), ROV (remotely operated vehicle, Christ and Wernli Sr, 2013), and FUO (fixed underwater observatory, Godø et al., 2014), enable researchers to collect large numbers of digital images from the field, orders of magnitude more than 20 years ago. For a more in-depth look at image-based monitoring solutions, see Bicknell et al. (2016), Mallet and Pelletier (2014), and Aguzzi et al. (2019). These image collections may provide valuable information on the taxonomic composition of habitat communities. Furthermore, changes in these communities across the spatial and/or temporal domains can be recorded.
First attempts to link these two developments have been successful and showed the potential of deep learning in, e.g., morphotype detection (Zurowietz et al., 2018), morphotype classification (Smith and Dunbabin, 2007; Gobi, 2010; Beijbom et al., 2012; Bewley et al., 2012; Kavasidis and Palazzo, 2012; Schoening et al., 2012; Langenkämper et al., 2018, 2019; Mahmood et al., 2019; Piechaud et al., 2019) or polyp behavior monitoring (Osterloff et al., 2019). However, all these studies have reported results obtained for data sets collected with the same gear, i.e., with one distinct camera system and one platform for the full analyzed data set. But in large-scale studies, for instance, those ranging over a series of cruises and/or years, gear often changes with each deployment or is operated in different ways. As a consequence, the same fauna morphotype may well be recorded with transformed features in the images of the different data sets. The colors may be shifted and the textural features or some morphotype characteristics may be more or less visible. In addition, some morphotypes might be of lesser abundance in some data sets. These discrepancies in the appearance of particular morphotypes in the different data sets are referred to as "concept drift." Concept drift can have a significant negative influence on the performance of machine learning classifiers that are trained on one data set and then applied to new "unseen" data (e.g., collected with different gear), for which the performance of the classifier decreases. As changes in gear and operation cannot be avoided in many studies, the extent to which marine imaging can benefit from computer vision research depends on the ability of computer vision systems to compensate for the effects of such concept drift.
Concept drift problems have been discussed for 20 years inside the machine learning community (see for example the influential early discussion in Schlimmer and Granger, 1986 and in Widmer and Kubat, 1996). A commonly accepted definition of the term "concept drift" from a survey (Gama et al., 2014) is: "In dynamically changing and non-stationary environments, the data distribution can change over time yielding the phenomenon of concept drift." More recent reviews can be found in Žliobaitė et al. (2016) or in Barros and Santos (2018). For this current study, the change in "environment" (referring to the term in the citation above) between different deployments is caused by changes in the gear and its mode of operation. In addition, the location and time of image collection also vary, potentially causing further changes to the visual appearance of the objects of interest in the recorded digital images. To compensate for the common concept drift problem, different methods have been proposed, including ensemble methods (Grachten and Chacón, 2017; Sun et al., 2018; Ren et al., 2019).
In this paper, we investigate the effect of concept drift on machine learning-based morphotype classification. We present four data sets collected from the same deep-sea seafloor area, with changing gear leading to concept drift with mild or strong effects. We carried out a series of machine learning studies, starting with a baseline study that applied machine learning classification to standard training/test splits of data from the same set. Next, the concept drift effect was studied by training on data from one or more sets and applying the classifier to complementary data from another data set. Finally, we tested approaches for compensating the concept drift effects. In addition, we repeated these experiments with a subset of the data, which was balanced to eliminate the effect of data imbalance on the results.

Image Data Sets
The image data sets considered within this study were collected on cruises SO242/1 and SO242/2 of the research vessel SONNE in the Peru Basin in the year 2015. During the cruises, four data sets of digital photos were collected from the sea bottom. An OFOS was used for three dives (Purser et al., 2018a,b,c) and an AUV for the fourth (Greinert et al., 2017). The original data are available at PANGAEA (https://pangaea.de/). The relevant parameters of the camera gear used to collect the four data sets are listed in Table 1.
While the OFOS data sets $D_{O_1}$, $D_{O_2}$, and $D_{O_3}$ were recorded with the same equipment, the fourth data set $D_A$ was recorded with a different camera carried by an AUV. So the strongest concept drift in feature representation would be expected between any of the $D_O$ data sets and the AUV data set $D_A$. The OFOS data set $D_{O_3}$ differs from the other sets $D_{O_1}$ and $D_{O_2}$ in that the OFOS was operating at a significantly higher altitude.
All four data sets were stored and shared for morphotype detection and annotation using BIIGLE 2.0 (Langenkämper et al., 2017). Eight marine biologists from eight different institutions annotated megafauna morphotypes using a pre-defined set of 23 classes (Schoening et al., 2019).
Each class is a morphotype, with the exception of the one non-biological class "litter". To annotate a morphotype, the users drew a circle around the object using the BIIGLE 2.0 annotation tools. After the annotation task was completed by all users, the 12 most abundant classes were chosen for this study. The other classes were identified as having numbers too low to use for machine learning applications. A minimum threshold of 15 samples was chosen so that at least 3 samples fall into the test set when 20% of all data are used for testing. For each of the selected 12 classes, one example extracted from data set $D_{O_3}$ is shown in Figure 1A. The same figure also shows the abundances of the classes in the four data sets. A square image patch, i.e., the bounding box of the circle annotation, was extracted for all instances of the 12 classes. These patches were then resized to a uniform size of 256 × 256. In Table 1, for each data set the total number of images, the image size, the distance to the ground (average altitude of the OFOS/AUV), the average speed of the OFOS/AUV, the number of annotations of the 12 classes, i.e., extracted patches, and the total number of differently annotated classes are given.

TABLE 1 | The four data sets were recorded with two kinds of gear (OFOS and AUV) and different camera set ups. The gears were operated with different speed and altitude (both given in estimated mean value). Image patches of the 12 most abundant classes were extracted and used in our study. The total number of extracted patches from these 12 classes is given in the sixth column (see text for details). The last column shows the total number of classes found in each data set (including the ones of lesser abundance). Recoverable column headers: ID, Original name, Gear, Camera, Lens. Recoverable lens entry: Canon EF 8-15 mm/4,0 L Fisheye USM.

FIGURE 1 | From all 23 detected classes, the 12 most abundant classes of morphotypes were selected for this study. One example of each class from data set $D_{O_3}$ is shown in (A). In (B) the abundance of each class in each data set is illustrated. In the upper left of the plot, one example of the morphotype "Sea cucumber" from each data set is shown for comparison.
In addition, we generated another data set containing only the four classes Crustacean, Sponge, Ophiuroid, and Polychaete (sessile), subsampled to the least common abundance in each of the four data sets to eliminate the influence of data imbalance on the problem.

Concept Drift Visualization With t-SNE Projections
To visualize the concept drift for each of the 12 classes, the patches of all annotated objects were projected to the two-dimensional space using dimension reduction. Different methods can be used for dimension reduction and projection, such as Principal Component Analysis (PCA) (Jolliffe, 2011), Self-Organizing Maps (SOM) (Kohonen, 2000), Local Linear Embedding (LLE) (Roweis and Saul, 2000), or Sammon Mapping (Sammon, 1969). Here we applied the t-distributed stochastic neighbor embedding (t-SNE) introduced by van der Maaten and Hinton (2008). The basic idea of t-SNE is that it minimizes the Kullback-Leibler divergence between two probability distributions. One distribution describes the high-dimensional data point distribution (here the $256^2$-dimensional image patches). The second distribution is defined in the low-dimensional space (here the 2D space used for scatter plot visualization). This method has shown good projection results, preserving the data topology and avoiding dense cluttering for the majority of points more successfully than other methods, such as PCA. These features render this approach a good choice for visualizing high-dimensional data.
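For reference, the objective described above can be written out as follows. This is the standard formulation from van der Maaten and Hinton (2008), not something specific to this study; the symbols $x_i$ (a high-dimensional patch), $y_i$ (its 2D projection), $p_{ij}$, $q_{ij}$, $\sigma_i$, and $n$ (number of patches) are notation introduced here for illustration.

```latex
% High-dimensional similarities (Gaussian kernel, symmetrized)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarities (Student-t kernel with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost minimized with respect to the projections y_i
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed Student-t kernel in the low-dimensional space is what lets moderately distant points spread out in the scatter plot, which relates to the reduced cluttering mentioned above.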

ResNet Deep Learning Architecture and Training
Deep residual networks, also referred to as ResNet, have been proposed recently (Kaiming et al., 2015) to overcome limitations of other deep learning frameworks which suffer from the "vanishing gradient problem" (Glorot and Bengio, 2010). This phenomenon limits the number of layers achievable within the network and thus the complexity of the mapping function that can be learnt by the network. The excellent results of ResNet architectures in computer vision tasks render them an appropriate choice for testing deep learning architectures in marine imaging.
In each experiment (see study design) a ResNet was trained with a training data set $D_t$ consisting of input-output pairs $(X, y)$, where $X$ is an image patch sample and $y$ is the corresponding class label. Thereafter the trained ResNet classifier was used to classify a disjoint test set $D_v$ for performance assessment. Different selections of samples for $D_t$ and $D_v$ were considered with the data collections listed above.
In all runs, the input dimension was $256^2$ (i.e., the number of image patch pixels). The architecture of the ResNet, built with TensorFlow (Abadi et al., 2016), was not changed. The network was trained for 250 epochs. As the number of training samples was limited (see Table 2), data augmentation was applied to increase the amount of training data. First, each patch was flipped left-right and up-down with a chance of 50% for each axis. Second, random brightness adjustments of at most 20% were applied after image-wise standardization (zero mean and unit variance). Third, random cropping to a size of 224 × 224 followed by zero padding back to 256 × 256 was applied.
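The three augmentation steps can be sketched in plain Python as follows. This is an illustrative sketch, not the TensorFlow pipeline used in the study. The function name `augment` is hypothetical, and so are several choices the text does not specify: a single-channel patch stored as a list of rows, an additive brightness offset drawn uniformly from [-0.2, 0.2], and centered zero padding after the crop.

```python
import math
import random


def augment(patch, crop=224, max_delta=0.2, rng=random):
    """Randomly augment one square, single-channel patch (list of rows of floats)."""
    size = len(patch)
    rows = [list(row) for row in patch]

    # Step 1: flip along each axis independently with probability 0.5.
    if rng.random() < 0.5:
        rows = [row[::-1] for row in rows]  # left-right flip
    if rng.random() < 0.5:
        rows = rows[::-1]  # up-down flip

    # Step 2: image-wise standardization (zero mean, unit variance),
    # followed by a random brightness offset (assumed additive here).
    values = [v for row in rows for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    delta = rng.uniform(-max_delta, max_delta)
    rows = [[(v - mean) / std + delta for v in row] for row in rows]

    # Step 3: random crop, then zero padding back to the original size
    # (centered padding is an assumption).
    top = rng.randint(0, size - crop)
    left = rng.randint(0, size - crop)
    cropped = [row[left:left + crop] for row in rows[top:top + crop]]
    pad = (size - crop) // 2
    out = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(cropped):
        out[pad + i][pad:pad + crop] = row
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which helps when an experiment is repeated several times and averaged.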
For the experiments with the balanced data sets, we used patches of size $224^2$. The ResNet34 of torchvision by PyTorch (Paszke et al., 2019) was used. Data augmentation was omitted to get a bias-free result. However, the network was initialized with ImageNet weights.

Performance Assessment
To measure the accuracy of a classifier, we report the $F_1$-score as well as the macro $F_1$-score. For this purpose, we define recall $R$ and precision $P$ on the test sets as follows:
$$R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP} \tag{1}$$
with $TP$ being the number of true positive classifications, $FP$ the number of false positive classifications and $FN$ the number of false negative classifications (Fawcett, 2006). Recall and precision were computed for each morphotype class $\omega_j$ and are referred to as $R_{\omega_j}$ and $P_{\omega_j}$, respectively. Since the data sets show imbalanced class distributions, i.e., the 12 different classes are represented with significantly different amounts of samples (see Figure 1), two different accuracy measures are used to assess the accuracy of the trained classifiers in the different studies. The $F_1$ measure represents the average accuracy for the entire set of morphotype classes, and the accuracy values of all classes are weighted by the class abundances:
$$F_1 = \frac{1}{\sum_j N_{\omega_j}} \sum_j N_{\omega_j} \, \frac{2 \, P_{\omega_j} R_{\omega_j}}{P_{\omega_j} + R_{\omega_j}} \tag{2}$$
where $N_{\omega_j}$ is the number of elements in class $\omega_j$. The idea is that each sample is of the same importance independent of the class it belongs to. Thus, the performance for more abundant classes may dominate the overall $F_1$ value. Such a strong impact of the highly abundant classes can be criticized, as the accuracy of low-abundance classes is largely ignored, which may hide bad results for such classes. This motivates us to consider the second accuracy value as well (see below). However, in practice, a computational class assignment is often followed by some manual posterior visual inspection for low-abundance and difficult classes, so the $F_1$ measure above is well motivated.
The macro $F_1$-score $\bar{F}_1$ is motivated by the assumption that all of the classes are of the same interest to the marine biologists and of the same relevance, so the accuracy assessment shall not be biased by a small number of most abundant classes. We therefore compute an average accuracy from the class-wise accuracy values, so that every class has the same impact on the overall accuracy assessment independent of its abundance:
$$\bar{F}_1 = \frac{1}{|\omega|} \sum_j \frac{2 \, P_{\omega_j} R_{\omega_j}}{P_{\omega_j} + R_{\omega_j}} \tag{3}$$
with $|\omega|$ as the number of different classes (Zheng et al., 2020). The performance measures used were computed using scikit-learn (Pedregosa et al., 2011).
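To make the difference between the two measures concrete, the following minimal sketch computes the abundance-weighted and the macro $F_1$-score from lists of true and predicted labels in plain Python. It is not the scikit-learn code used in the study; the function name `f1_scores` and the example labels are hypothetical, and classes are taken from the true labels only (an assumption). In scikit-learn, `f1_score` with `average="weighted"` and `average="macro"` provides the corresponding averages.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-class F1, abundance-weighted F1, macro F1)."""
    classes = sorted(set(y_true))
    support = Counter(y_true)  # number of test samples per class
    per_class = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_class[c] = 2 * precision * recall / denom if denom else 0.0
    # Weighted F1: every sample counts equally, so abundant classes dominate.
    weighted = sum(support[c] * per_class[c] for c in classes) / len(y_true)
    # Macro F1: every class counts equally, regardless of abundance.
    macro = sum(per_class.values()) / len(classes)
    return per_class, weighted, macro


# A classifier that always predicts the abundant class scores well on the
# weighted measure but poorly on the macro measure.
y_true = ["Sponge"] * 4 + ["Crustacean"]
y_pred = ["Sponge"] * 5
per_class, weighted, macro = f1_scores(y_true, y_pred)
```

In this toy case the rare class is never detected, which lowers the macro value far more than the weighted one; this is the same direction of effect as the lower macro values reported for the imbalanced data sets.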

Study Design
As outlined in the introduction, a number of experiments were conducted to analyze the quality of the data with regard to machine learning-based classification. A visual depiction of the experimental design is shown in Figure 2. ResNet34 was used throughout all experiments for comparability. Each of the Experiments A-E described below was carried out three times and the average accuracy was determined. The accuracy of the classifier was assessed using the standard metrics explained above.

Experiment A: Intra-set Study (No Concept Drift)
For this experiment, each data set was investigated separately. Each individual data set was split into 80% training data and 20% test data using stratified sampling, i.e., data were sampled in such a way that each class is represented with the same percentage. Splitting one data set into training and test sets had to be carried out with special care. Sometimes an object in an image was annotated by two different experts, so it appeared twice in the data set, potentially with a slightly different position and circle radius. In such cases, both annotations were placed together in either the training set or the test set to guarantee that these two sets were truly disjoint. This experiment simulates the case that a part of a big data set is annotated and a classifier is trained to annotate the remaining part. So in these experiments we do not observe any serious concept drift despite small variations in the gear speed or altitude along the transect. Note that most of the studies about machine learning applications to marine image data are carried out this way.
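A split of this kind can be sketched as follows. The sketch is only an illustration under stated assumptions, not the procedure used in the study: each annotation is represented as a hypothetical `(object_id, label)` pair, duplicate annotations of the same object share an `object_id`, and an object is assigned to the class of its first annotation.

```python
import random
from collections import defaultdict


def stratified_group_split(annotations, test_fraction=0.2, seed=0):
    """Split (object_id, label) annotations into train and test lists.

    Sampling is stratified per class, and all annotations of one object
    end up on the same side of the split.
    """
    rng = random.Random(seed)
    objects_by_class = defaultdict(list)
    seen = set()
    for obj, label in annotations:
        if obj not in seen:  # count each physical object once
            seen.add(obj)
            objects_by_class[label].append(obj)

    test_objects = set()
    for label in sorted(objects_by_class):
        objs = objects_by_class[label]
        rng.shuffle(objs)
        n_test = round(test_fraction * len(objs))
        test_objects.update(objs[:n_test])

    train = [a for a in annotations if a[0] not in test_objects]
    test = [a for a in annotations if a[0] in test_objects]
    return train, test


# 15 objects of one class give 3 test objects at a 20% test fraction,
# which matches the minimum class size used for class selection.
annotations = [(f"sponge-{i}", "Sponge") for i in range(15)]
annotations.append(("sponge-0", "Sponge"))  # second expert, same object
annotations += [(f"oph-{i}", "Ophiuroid") for i in range(30)]
train, test = stratified_group_split(annotations)
```

Grouping by object before sampling is what keeps a doubly annotated object from leaking into both sets.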

Experiment B: 1 vs. 1 Inter-set Study
For this experiment, the network was trained with all data from one data set. The trained network was then used to classify all data from another data set. All in all, four different classifiers were obtained from the four different training data sets. Each classifier was tested on the other three data sets, resulting in 12 sub-experiments. This experiment simulates the case that one previous data set is already annotated and a similar data set should now be automatically annotated with the help of a neural network.

Experiment C: Leave-One-Set-Out Inter-set Study
In this experiment, a classifier was trained with all data from three data sets. Afterwards, the classifier was used to classify the fourth data set. This experiment simulates the case that a series of image collections are already annotated and that a similar data set should now be annotated. All available data with different shifted concepts is used to train the classifier, in order to prepare the network for another concept drift.
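The train/test combinations of Experiments B and C can be enumerated with a few lines of Python. This is a small illustrative sketch; the string identifiers and function names are hypothetical stand-ins for the four data sets and are not taken from the study's code.

```python
from itertools import permutations

DATA_SETS = ["D_O1", "D_O2", "D_O3", "D_A"]


def one_vs_one_pairs(names):
    """Experiment B: every ordered (training set, test set) pair."""
    return list(permutations(names, 2))


def leave_one_set_out(names):
    """Experiment C: train on three sets, test on the held-out fourth."""
    return [([n for n in names if n != held_out], held_out) for held_out in names]


pairs = one_vs_one_pairs(DATA_SETS)
folds = leave_one_set_out(DATA_SETS)
```

Four data sets yield 12 ordered pairs for Experiment B and four held-out configurations for Experiment C.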

Experiment D: Ensemble Classification Heuristic
We implemented a straightforward ensemble classification heuristic that was driven by the results obtained in Experiment B. To classify the data from one set $D_v$, an ensemble classifier $F(X)$ (with $X$ as one image patch) was constructed from the three classifiers $f_i(X)$, $i = 1, 2, 3$, that were trained with the three other data sets $D_t$ ($D_t \cap D_v = \emptyset$), taken from Experiment B (see above). Each patch $X$ from $D_v$ was classified to one of the 12 classes $\omega_j$ according to the following rule:
$$F(X) = \begin{cases} \omega_j & \text{if } f_i(X) = f_k(X) = \omega_j \text{ for some } i \neq k, \\ f_{i^*}(X) \text{ with } i^* = \arg\max_i |D_t^{(i)}| & \text{otherwise,} \end{cases} \tag{4}$$
where $D_t^{(i)}$ denotes the training set of classifier $f_i$. To summarize Equation (4), if at least two ensemble members agree, their classification output is chosen. If all three classifiers disagree, the output of the classifier that was trained with the largest amount of data is chosen. Since ensemble methods are reported to work for concept drift problems in other domains, we set up this experiment to either validate or falsify this hypothesis for the domain of marine image informatics.
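The voting rule summarized above can be sketched in a few lines of Python. The function name `ensemble_predict`, the class labels, and the training-set sizes in the usage lines are hypothetical and serve only as an illustration of the rule.

```python
from collections import Counter


def ensemble_predict(predictions, train_sizes):
    """Combine the class labels predicted by three classifiers.

    predictions: one predicted label per classifier for a single patch.
    train_sizes: number of training samples of each classifier.
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes >= 2:
        # At least two ensemble members agree: take their output.
        return label
    # All three disagree: trust the classifier with the most training data.
    largest = max(range(len(predictions)), key=lambda i: train_sizes[i])
    return predictions[largest]


agreed = ensemble_predict(["Sponge", "Sponge", "Ophiuroid"], [100, 50, 10])
fallback = ensemble_predict(["Sponge", "Ophiuroid", "Crustacean"], [10, 300, 20])
```

With three ensemble members, a tie can only occur when all three disagree, so the fallback branch covers every case the majority vote leaves open.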

Experiment E: Leave-One-Set-Out With Adaption
This experiment was carried out in a similar fashion to Experiment C. The training data consisted of the union of three data sets. In addition, we added 20% of all annotations of the fourth data set to the training data to compensate for the concept drift. The trained classifier was applied to the remaining 80% of the fourth data set. When splitting the data of the fourth data set, we used stratified sampling. This experiment simulated the situation where a large quantity of data is available from past dives and very little data from the new dive has already been annotated. We aimed to investigate how well the classifier can compensate for the concept drift using the added data.
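The composition of the training and test data in this experiment can be sketched as follows. The sketch assumes samples are hypothetical `(patch_id, label)` pairs, and the function name `adaptation_split` is invented for illustration; it is not the code used in the study.

```python
import random
from collections import defaultdict


def adaptation_split(other_sets, new_set, adapt_fraction=0.2, seed=0):
    """Build the Experiment E split.

    other_sets: list of three data sets (lists of (patch_id, label) pairs).
    new_set: the fourth data set, of which a stratified fraction is moved
             into the training data.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in new_set:
        by_class[sample[1]].append(sample)

    adapt, test = [], []
    for label in sorted(by_class):
        samples = by_class[label]
        rng.shuffle(samples)
        k = round(adapt_fraction * len(samples))
        adapt.extend(samples[:k])  # 20% per class join the training data
        test.extend(samples[k:])   # remaining 80% per class are classified

    train = [s for data_set in other_sets for s in data_set] + adapt
    return train, test


others = [[(f"set{d}-{i}", "Sponge") for i in range(10)] for d in range(3)]
new = [(f"new-s{i}", "Sponge") for i in range(10)]
new += [(f"new-o{i}", "Ophiuroid") for i in range(5)]
train, test = adaptation_split(others, new)
```

Because the fraction is taken per class, even a rare class in the new dive contributes a few examples to the training data whenever its rounded share is at least one sample.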

Balanced Data Set Experiments
In addition to the study design presented above, we conducted Experiments A, B, C, and E with a balanced data set (cf. the end of section Image Data Sets). This way we can analyze the impact of the concept drift without that of data imbalance.

RESULTS
The original four image data sets showed a good overlap in the morphotype composition, especially regarding the 12 most abundant classes chosen for this study. This is important for analyzing the impact of concept drift, as otherwise using a classifier trained on one data set to classify a different one would not be possible. Direct visual browsing and inspection of the images suggests that there might be a concept drift caused by different equipment. In Figure 1B it can be seen that the example of the category "Sea cucumber" from the AUV data set $D_A$ features the lowest contrast and the lowest resolution of details. The number of annotations varied considerably between the data sets. This can be explained by the different locations of the recorded tracks in the habitat, as well as the different altitudes of the AUV/OFOS: at a higher altitude, more objects can be found, but they are harder to spot. In the following, we show the results of the t-SNE projections to get a visual representation of the concept drift. This allows us to validate whether a concept drift exists in the data space. Thereafter, we present the results of the experiments as described above.

Concept Drift Visualization With t-SNE Projections
In Figure 3, we show the 12 scatter plots that were obtained for the 12 different classes using all data sets $D_{O_1}$, $D_{O_2}$, $D_{O_3}$, and $D_A$. All data sets together were subject to a t-SNE projection into a 2D space. The color of each plotted icon encodes the data set source, following the same scheme as in Table 1. Looking at the plots, we observe the trend that the $D_A$ patches overlap most with the $D_{O_3}$ patches. Let us call this pair $C_1$. In addition, the $D_{O_1}$ data seem to overlap with the $D_{O_2}$ data; we call this pair $C_2$. This indicates that the members of $C_1$ and of $C_2$ share some visual qualities within their respective pair, and that the concept drift between $C_1$ and $C_2$ is stronger than the drifts between their members.

Experiment A: Intra-set Study
The results of this experiment are shown in Table 3. Best results were obtained for the OFOS data set with the higher altitude ($D_{O_3}$). The $F_1$ values for $D_A$ and $D_{O_1}$ are on a similar level. In most cases observed in Experiment A, the classification performance increases with the amount of data available, i.e., the total number of annotations available (see Table 2). For all given data sets, the $\bar{F}_1$ accuracy values were lower than the $F_1$ accuracy values. The accuracy values for the 12 single classes are provided in Supplementary Material. When looking at the results for the balanced data set $F_1^{\mathrm{bal}}$ ( [...]

Experiment B: 1 vs. 1 Inter-set Study
The results of the 12 accuracy measurements are shown in Table 4. As expected, the 1 vs. 1 inter-set study produced inferior results compared to those in Experiment A, due to the concept drift. Best results were obtained when $D_{O_1}$ or $D_{O_3}$ was used for training (see numbers given in boldface). The lowest accuracy values were achieved when the AUV data set $D_A$ was used for training. Interestingly, a similar altitude helped to compensate a little for the concept drift between AUV and OFOS data, as can be seen when training on $D_{O_3}$ and testing on $D_A$. This combination yielded superior results to the others where $D_A$ was used as a test set. In addition, this is supported by the t-SNE projection plots, where $D_A$ and $D_{O_3}$ overlap most. Again, the $\bar{F}_1$ results were inferior to the $F_1$ accuracy values. The results for the balanced data set are shown in Table S4. The general observations that the AUV data set $D_A$ produced the worst results and that a similar altitude helped to compensate for concept drift still hold true; however, the results for $D_{O_1}$ and $D_{O_3}$ as test sets are ranked differently.

Experiment C: Leave-One-Set-Out Inter-set Study
The four results from this experiment are shown in Table 5. In addition to the two accuracy measures, we also list the number of annotations in the union of the three data sets used for training. Comparing the results with those obtained in Experiment A, we make the following observations. For three out of four data sets, the performance drops for the $F_1$ accuracy. However, the results obtained for the $D_{O_2}$ data set increase by nine percentage points compared to the intra-set study A. Comparing the $\bar{F}_1$ values in the intra-set Experiment A and the leave-one-set-out inter-set Experiment C, we notice an increase in the leave-one-set-out inter-set study for data sets $D_{O_1}$, $D_{O_2}$, and $D_A$, with the $\bar{F}_1$ accuracy only dropping for $D_{O_3}$. Altogether we see that in Experiment C the $\bar{F}_1$ values are lower than the $F_1$ values, but the difference here is much smaller than observed in Experiments A and B. Another observation is that the performance values in Experiment C are always significantly higher than those obtained from Experiment B. The results using the balanced data support the findings stated above.

Experiment D: Ensemble Classification Heuristic
Looking at the results in Table 6, we can only see small improvements over the results we obtained when we trained the network with just the largest data set in the 1 vs. 1 inter-set study B (see the third row in Table 4). For instance, the performance values for the test sets $D_A$ and $D_{O_2}$ do improve a little. However, all the resultant performance values are inferior when compared with the results obtained in Experiment C, when three data sets were joined into one training data set.

FIGURE 3 | Each scatter plot shows the result of a projection of the image patches of one class for all data sets. The icon color encodes the data set following the same scheme as in Table 1.

Table notes: The results for the balanced data set are shown in the last row. Please note that only the $F_1$-score is shown there, as $F_1$ and $\bar{F}_1$ are the same for the case of a balanced data set. Column headers: Training set, Test set. The last two columns refer to the same experiment on the balanced data set. The best result is printed in boldface.

Experiment E: Leave-One-Set-Out Inter-set Study With Adaption
The results in Table 7 show a small improvement compared to the results of the leave-one-set-out inter-set study in Table 5. However, we even see a small decrease in performance for data set $D_{O_2}$, which is the smallest data set. In contrast to the imbalanced data, for the balanced data set (Table 7, last row) we see larger improvements of up to 13 percentage points ($D_A$) and no decreases in performance.
A summary of the best results for each experiment for each test set is shown in Table S3.
Table notes: The best result is printed in boldface. The last row shows the results using the balanced data set.

DISCUSSION
The results of our experiments show the importance of carrying out such computer vision experiments using data from different dives. The results show limitations in the generalization power of a chosen up-to-date deep learning classification approach when training and testing on data from different dives. We also observe a number of interesting trends. Looking at the results of Experiment A, we see that images recorded at a higher altitude seem to achieve higher performance values if enough training data is available. Although data sets $D_{O_1}$ and $D_{O_3}$ have almost the same number of training patches, the performance for the higher altitude data is significantly higher. The same holds for the two small data sets $D_{O_2}$ and $D_A$. The reason for this improved performance may be the reduced motion blur in the higher altitude images. The AUV data is also classified with significantly lower performance than the OFOS data. The AUV is usually operated at a higher speed than the OFOS, so the motion blur is more severe for the AUV-acquired data.
When it comes to the concept drift problem, it seems some compensation is possible. The results of Experiment B can be seen as a benchmark for the problems introduced by concept drift. For the given data sets, training a classifier on one of them and classifying one of the remaining data sets did not yield results > 90%, not even for balanced data sets from small subsets of classes. Thus, we must assume that posterior quality assessments and error corrections must be applied by human observers to increase the accuracy. The second main factor may be the training set size. The best results were obtained when the ResNet was trained with one of the two largest data sets, $D_{O_1}$ or $D_{O_3}$. If three data sets are combined (Experiment C), the results for the complementary left-out test data set increase significantly (see Table 5) when compared to Experiment B. Interestingly, the strongest increase was reported for the small data set $D_{O_2}$, even outperforming the intra-set (i.e., concept drift-free) Experiment A. The same is true for three of the data sets regarding the $\bar{F}_1$ measure. So in three cases, the performance values for some less abundant classes benefit from introducing more examples from other data sets.
In our experiment, an ensemble of three separately trained classifiers did not produce improved results. The heuristic approach performed much worse than the leave-one-set-out approach. The leave-one-set-out with adaption approach (Experiment E) produced mixed results compared to the "simple" approach evaluated in Experiment C.
The results from the balanced data set experiments generally support these findings. Nevertheless, for the balanced data the leave-one-set-out with adaption approach (Experiment E) produces significantly better results than the leave-one-set-out approach. This might be due to the more homogeneous data and the eliminated data imbalance.
All of the average classification performance values presented here fall short when compared to results from computer vision applications in other domains or on public benchmark data (like COCO; Lin et al., 2014). However, we were interested in analyzing concept drift problems induced by image acquisition with different gear in the same area for a mixture of classes. This is the main reason why the results were inferior to other related classification results, as the amount of training data was rather low, at least for a subset of the classes tested.
In Tables S1 and S2, we show class-wise $F_1$ and $\bar{F}_1$ values, with acceptable numbers for the classes Sponge, Small encrusting, Sea cucumber, and Ophiuroid, for example. From these numbers, one could conclude that a minimum number of around 500-1,000 examples per class should be collected in the future to gain satisfactory performance values.

CONCLUSION
Combining all our observations, we draw the following conclusions. Firstly, the concept drift between the four dives is considerable (cf. Figure 3) but can be explained by the differences in survey gear and operation parameters. In this study, the concept drift is mainly determined by the object distance to the camera and the speed of the imaging platform. Secondly (and not surprisingly), a high number of examples per class in the training set is beneficial for attaining more satisfactory results in the future. Thirdly, combining data from several dives has the clear potential to reduce the impact of concept drift, at least for some low-abundance classes. Fourthly, in the context of this particular study, a higher altitude of around 4 m for data collection was found to be preferable to lower altitudes, where motion blur has a greater impact on image quality. Finally, the simplest way of combining data from several dives, i.e., training one network with all the data, achieved the most convincing results. This statement should not be considered absolute in the context of other potentially interesting ensemble strategies.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.