Edited by: Emanuele Di Lorenzo, Georgia Institute of Technology, United States
Reviewed by: Lida Teneva, Independent Researcher, Sacramento, United States; Erin L. Meyer-Gutbrod, University of California, Santa Barbara, United States
This article was submitted to Ocean Solutions, a section of the journal Frontiers in Marine Science
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Aquatic ecologists routinely count animals to provide critical information for conservation and management. Increased access to underwater recording equipment such as action cameras and unmanned underwater devices has allowed footage to be captured efficiently and safely, without the logistical difficulties manual data collection often presents. It has, however, led to immense volumes of data being collected that require manual processing, and thus significant time, labor, and money. The use of deep learning to automate image processing has substantial benefits but has rarely been adopted within the field of aquatic ecology. To test its efficacy and utility, we compared the accuracy and speed of deep learning techniques against human counterparts for quantifying fish abundance in underwater images and video footage. We collected footage of fish assemblages in seagrass meadows in Queensland, Australia. We produced three models using an object detection framework to detect the target species, an ecologically important fish, luderick (Girella tricuspidata).
All key questions in animal ecology are founded on the abundance, distribution, and behavior of animals. Collecting robust, accurate, and unbiased information is therefore vital to both ecological theory and its applications. Several data collection methods commonly used in animal ecology, such as tagging, manual visual surveys, netting, and trawling, can be replaced or supplemented with camera data.
In particular circumstances, these devices can provide a more accurate and cheaper way to collect data, with reduced risk to the operator.
Much like the physical collection of data, manual processing of data is often labor-intensive, time-consuming, and extremely costly.
Fortunately, recent advances in machine learning have provided one tool to help combat this problem: deep learning. Deep learning is a subset of machine learning consisting of a number of computational layers that process data that are difficult to model analytically, such as raw images and video footage.
Although deep learning techniques are being implemented enthusiastically in terrestrial ecology, they remain under-exploited in aquatic environments.
Efforts to use deep learning methods in marine environments currently revolve around the automated classification of species.
Although classification enables the determination of species, its usefulness for answering broad ecological questions is rather limited. Object detection allows us to classify both what an object is and where it occurs within the frame, so individuals can be located and counted as well as identified.
Here, we use fish inhabiting subtropical seagrass meadows as a case study to explore the viability of computer vision and deep learning as a suitable, non-invasive technique using remotely collected data in a variable marine environment. Seagrass meadows provide critical ecosystem services such as carbon sequestration, nutrient cycling, shoreline stabilization, and enhanced biodiversity.
We used submerged action cameras (Haldex Sports Action Cam HD 1080p) to collect video footage of luderick in the Tweed River Estuary in southeast Queensland (−28.169438, 153.547594) between February and July 2019. On each sampling day, six cameras were deployed for 1 h over a variety of seagrass patches; the angle and placement of the cameras were varied among deployments to ensure a variety of backgrounds and fish angles. For training, videos were trimmed to contain only footage of luderick and split into frames at five frames per second.
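As an illustration of this preprocessing step, frame extraction at a fixed rate can be scripted with OpenCV. This is a minimal sketch under our own assumptions (the function name, output naming, and JPEG format are ours), not the exact pipeline used in the study:

```python
import cv2  # OpenCV

def extract_frames(video_path, out_dir, target_fps=5):
    """Save frames from a trimmed video at roughly `target_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)  # keep every Nth frame
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```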
The object detection framework we used is an implementation of Mask R-CNN (He et al., 2017).
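The study's own implementation is not reproduced here, but as a rough equivalent, an off-the-shelf Mask R-CNN can be instantiated with torchvision and adapted to a two-class problem (background plus luderick); the class count and function name below are our assumptions for illustration:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_luderick_model(num_classes=2):  # background + luderick (assumed)
    # Start from a Mask R-CNN pre-trained on COCO, then replace both heads
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Box (classification + regression) head for the new class count
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Segmentation mask head for the new class count
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model
```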
Figure: training dataset image demonstrating a manual segmentation mask (white dashed line around the fish) denoting the region of interest (RoI).
The utility of the model depends on how accurately the computer identifies the presence of luderick, which we quantified in two ways based on the interaction between precision (P) and recall (R). Precision is the proportion of the model's detections that are truly luderick, and recall is the proportion of all luderick present that the model detected.
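Formally, with TP, FP, and FN denoting true positive, false positive, and false negative detections, these take their standard forms:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$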
Firstly, the computer's ability to fit a segmentation mask around the RoI was determined by the mean average precision value (mAP).
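The overlap between a predicted mask A and the ground-truth mask B that underlies mAP is conventionally measured by intersection over union (IoU):

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$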
We used the mAP50 value in this study, under which a predicted mask is counted as correct when it overlaps the ground-truth outline of the fish with an IoU of at least 0.5; the higher the value, the more accurately the model's masks matched the ground truth. Secondly, the success of our model in answering ecological questions on abundance was determined by an F1 score. The F1 score is calculated from the maximum number of fish counted per video (MaxN) and balances false positives against false negatives to assess each method's ability to correctly estimate abundance:

$$F1 = \frac{2 \times P \times R}{P + R}$$
We used the F1 score and mAP50 values to assess the performance of the computer model.
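As a sketch of how these measures combine in practice (the function and variable names are ours, not the paper's): MaxN is the maximum per-frame count within a video, and F1 follows directly from the error counts:

```python
def max_n(per_frame_counts):
    """MaxN: the maximum number of individuals detected in any single frame."""
    return max(per_frame_counts, default=0)

def f1_score(tp, fp, fn):
    """Standard F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(max_n([0, 2, 3, 1]))                     # -> 3
print(round(f1_score(tp=90, fp=5, fn=10), 3))  # -> 0.923
```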
All predictions were made with a confidence threshold of 90%; that is, a detection was counted only if the algorithm was at least 90% sure it was identifying a luderick. This threshold was chosen because it typically maximized F1 performance by filtering out false positives.
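Applying such a threshold is a simple filter over the detector's outputs. This sketch assumes the torchvision detection output format (a dict of 'boxes', 'labels', and 'scores' tensors), which may differ from the framework used in the study:

```python
import torch

def filter_detections(output, threshold=0.90):
    """Keep only detections scored at or above the confidence threshold."""
    keep = output["scores"] >= threshold
    return {key: value[keep] for key, value in output.items()}

# Toy example (boxes and scores are illustrative)
output = {
    "boxes": torch.tensor([[0.0, 0.0, 50.0, 50.0], [10.0, 10.0, 60.0, 60.0]]),
    "labels": torch.tensor([1, 1]),
    "scores": torch.tensor([0.97, 0.42]),
}
print(filter_detections(output)["scores"])  # tensor([0.9700])
```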
Models were trained using a random 80% sample of the annotated dataset, with the remaining 20% used to form a validation dataset.
The same computer algorithm was used to train three different models on three different randomized 80/20 subsets of the whole training dataset, to account for variation in the training and validation split. These models were subsequently evaluated on the unseen and novel test datasets and used in the human versus computer test.
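Producing the three randomized splits is straightforward; a minimal sketch (the seeds and the placeholder annotation pool are our assumptions):

```python
import random

# Placeholder standing in for the study's pool of ~6,080 annotations
annotated_frames = list(range(6080))

def split_80_20(items, seed):
    """Shuffle with a fixed seed, then split 80% training / 20% validation."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# One split per model; the seed values themselves are arbitrary
splits = [split_80_20(annotated_frames, seed) for seed in (1, 2, 3)]
for train, val in splits:
    print(len(train), len(val))  # 4864 1216
```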
We generated a performance curve to confirm that variation among models was sufficiently low to ensure consistency in performance across the three models. Random subsets of still images were selected from the training dataset, with subset sizes increasing to determine how model performance changes as training data accumulate. Because the risk of overfitting decreased as the volume of training data increased, the number of training iterations was adjusted to maintain optimum performance.
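Conceptually, the performance curve loops over increasing training-set sizes, trains a model at each size, and records validation performance. The skeleton below is illustrative only; `train_and_evaluate` stands in for the full training pipeline, and the iteration schedule is an assumption rather than the study's:

```python
import random

def performance_curve(training_pool, subset_sizes, train_and_evaluate):
    """Train on increasingly large random subsets and record validation scores."""
    curve = []
    for size in subset_sizes:
        subset = random.sample(training_pool, size)
        # Larger subsets carry less overfitting risk, so the iteration
        # budget scales with subset size (illustrative schedule only)
        iterations = 500 + 2 * size
        curve.append((size, train_and_evaluate(subset, iterations)))
    return curve
```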
Manual annotation cost can be a significant factor to consider when training convolutional neural networks (CNNs) and can also be monitored using the performance curve. Time stamps were added to the training software to record the speed at which training data were annotated, allowing us to infer the total time humans spent annotating the training data. We used these data to determine how much training this model requires to produce high accuracy, and thus the effort needed to produce a consistent and reliable ecological tool.
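Logging annotation effort reduces to recording a timestamp per annotation and summing the intervals; a minimal sketch (the class and its usage are hypothetical):

```python
import time

class AnnotationTimer:
    """Accumulate annotation effort by logging a timestamp per annotation."""

    def __init__(self):
        self.last = None
        self.total_seconds = 0.0

    def mark(self):
        """Call once per completed annotation."""
        now = time.monotonic()
        if self.last is not None:
            self.total_seconds += now - self.last
        self.last = now

timer = AnnotationTimer()
timer.mark()  # first call starts the clock
timer.mark()  # each later call adds the elapsed interval
print(f"{timer.total_seconds:.1f} s of annotation effort logged")
```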
The 80/20 validation test is an established method in machine learning to assess the expected performance of the final model.
Creating an automated data analysis system aims to lessen the manual workload of humans by providing a faster, yet still accurate, alternative. It is therefore crucial not only to know how well the model performs in absolute terms but also to compare its speed and accuracy against current human methods. This "human versus computer" analysis compared citizen scientists and experts against the computer: (1) citizen scientists were undergraduate marine science students and interested members of the public; (2) experts were researchers experienced in identifying the target species.
Based on the computer algorithm's performance curve, F1 performance began to plateau earlier than mAP50.
Performance curve showing the computer’s ability to fit a segmentation mask around the luderick (performance scored by mAP50) and in accurately identifying abundance (performance scored by F1).
Performance was high for both the unseen and novel test sets (mAP and F1 both >92%). Based on F1 scores, the computer performed equally well on the unseen and novel test footage.
F1 and mAP50 scores (mean, SE) of the three models for the unseen test footage from the same location and for the novel footage (unseen: 32 videos; novel: 32 videos).
The computer algorithm achieved the highest mean F1 score in both the image test (95.4%) and the video test (86.8%), compared with the experts and citizen scientists. The computer also had fewer false positives (incorrectly identifying another species as luderick) and false negatives (incorrectly ignoring a luderick) in the image test. In the video test, the computer models had the lowest rate of false positives of all groups but the highest rate of false negatives. The computer performed the task far faster than both human groups. Experts on average performed better (F1) than the citizen scientists in both tests and had higher accuracy scores.
Summary of performance measures comparing averaged scores from computer versus humans (citizen scientists and experts).
| Analysis method | False negatives | False positives | Accuracy (prop. ±) | F1 % (SE) | Speed (SE) |
| --- | --- | --- | --- | --- | --- |
| Image test (speed in s) | | | | | |
| Citizen scientist | 28.6 | 7.2 | −0.14 | 82.0 (2.8) | 12.6 (1.4) |
| Expert | 18.1 | 5.6 | −0.08 | 88.3 (8.4) | 14.3 (4.0) |
| Computer | 11.7 | 4.7 | −0.12 | 95.4 (0.9) | 0.4 (0.0) |
| Video test (speed in min) | | | | | |
| Citizen scientist | 20.9 | 12.6 | −0.10 | 79.0 (2.4) | 2.4 (2.4) |
| Expert | 12.1 | 11.9 | +0.06 | 85.3 (6.9) | 2.8 (4.4) |
| Computer | 24.3 | 2.7 | −0.10 | 86.8 (1.6) | 1.2 (0.3) |
F1 scores were most variable for the citizen scientist group, with the difference between the lowest and the highest score for the image and video tests being 40.1 and 35.1%, respectively. The computer achieved the lowest variance, with these values only 3.1% for the video test and 1.7% for the image test.
Overall test performance in determining abundance (F1) by computer versus humans (citizen scientists and experts) based on identical tests using 50 images and 31 videos. The citizen scientist group had the highest variance and lowest performance, while the computer had the lowest variance and highest performance. Solid line denotes median, and dashed line denotes mean.
Our object detection models achieved high performance on a previously unseen dataset and maintained this performance on footage collected in a novel location. They outperformed both classes of humans (citizen scientists and experts) in speed and performance, with high consistency (i.e., low variability).
We clearly show that our model is capable of performing equally accurately on novel footage from locations beyond those represented in the training data. Few previous demonstrations of the utility of deep learning have tested algorithms under such novel conditions, yet this factor should be considered when determining how transferable a model is for environmental scientists. While our results suggest the algorithm is robust and flexible under different environmental conditions, which vary with tides, water clarity, ambient light, and differences in non-target fish species and backgrounds, further work to quantify these differences is needed before conclusive statements can be made.
The computer's high performance, speed, and low variance compared to humans suggest that it is a suitable model to replace manual efforts to determine MaxN in marine environments. Deep learning may be the solution for researchers seeking to avoid analytical bottlenecks.
Previous studies comparing humans against computers have predominantly used images rather than videos. When analyzing video footage, there is an assumption that humans have the comparative advantage in addressing uncertainty and ambiguity.
Quantifying population trends is critical to understanding ecosystem health, so ecologists need measurements of population size that are consistent. We found less variation in computer measurements than in those of human observers. Human errors in observation are attributable to individual observer biases.
Although recent advances in deep learning can make image analysis for animal ecology more efficient, some ecological and environmental limitations remain. Ecological limitations include the difficulty of detecting small, rare, or elusive species, whose abundance therefore may not be estimated accurately. Environmental limitations include conditions that degrade image quality, such as poor water clarity, low ambient light, and visually complex backgrounds.
The performance curves for our models suggest that fish abundance could be determined just as well with fewer annotations than our full training set of 6,080. Less training time is therefore needed when the goal is abundance alone, because high accuracy in predicting the whole fish outline (mAP50) is not required to determine abundance. Our full model took approximately 60 h to train; the performance curve indicates that near-optimal performance could have been reached in roughly 20 h, a two-thirds reduction. Creating a performance curve is a useful step when calculating the cost-benefit of implementing a high-performing model, as well as for monitoring algorithm issues such as overfitting. However, this does not account for the time required to train humans on which species to annotate. Fish identification experts may not need additional training, whereas citizen scientists may. Studies have shown, however, that citizen-scientist-annotated data for deep learning can be as reliable as expertly annotated data.
Deep learning methodologies provide a useful tool for consistent monitoring and estimation of abundance in marine environments, surpassing the overall performance of manual, human efforts in a fraction of the time. As this field advances, future ecological applications could include automation in estimating fish size.
The datasets generated for this study are available on request to the corresponding author.
The studies involving human participants were reviewed and approved by the Griffith University Human Research Ethics Review. Participants provided their written informed consent to participate in this study.
ED and RC designed the study. ED and SL-M conducted the fieldwork. ED and EJ developed the deep learning architecture and user interface. ED drafted the manuscript. All authors commented on the manuscript and provided intellectual input throughout the project.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We thank the many fish experts and citizen scientists who participated in the study. This manuscript has been released as a pre-print at bioRxiv.