Cross-sensor vision system for maritime object detection

Accurate and automated detection of maritime vessels in aerial images is a considerable challenge. While significant progress has been made in recent years by adopting neural network architectures in detection and classification systems, these systems are usually designed for a specific sensor, dataset or location. In this paper, we present a system which uses multiple sensors and a convolutional neural network (CNN) architecture to test cross-sensor object detection resiliency. The system is composed of five main subsystems: Image Capture, Image Processing, Model Creation, Object-of-Interest Detection and System Evaluation. We show that the system has a high degree of cross-sensor vessel detection accuracy, paving the way for the design of similar systems which could prove robust across applications, sensors, ship types and ship sizes.


Introduction
From the advent of passenger ships in the late 19th century to container-revolutionized maritime transport in the 1970s, there has been increasing interest in monitoring, tracking and identifying vessels at sea. Before the first artificial earth satellite was placed into orbit in the late 1950s, vessels were primarily tracked using either primitive cooperative systems such as inter-ship radio transmission or rudimentary non-cooperative systems such as coastal or on-board RADAR. Human interests at this time, revolving around safety & rescue, fishing and passenger transport, were largely satisfied by these systems.
More recently, effective understanding of the global maritime domain, or Maritime Domain Awareness (MDA), has exploded in importance around the world with a significant number of commercial, defense and other government applications. There has been increasing attention given to exclusive economic zones (EEZ) and governance of a country's natural resources, with state interests including maritime security, monitoring of marine traffic, illegal fishing, smuggling and maritime search & rescue. Commercial interests have expanded to include drilling and exploration of ocean floors, the management of fisheries, maritime piracy and cargo transportation. Private entities and NGOs have interests ranging from forecasting weather to the protection of ecology and sea health.
A number of these applications use knowledge of the position and behavior of vessels as their cornerstone, with MDA being enabled by information from land, sea, air and/or space systems and, in some cases, vessel information repositories (Dekker et al., 2013). These systems can broadly be classified into one of two types, cooperative and non-cooperative systems, based on whether the system is employed by vessels to communicate information about themselves or whether it is an observation system which functions independently of vessel cooperation (Table 1). Information captured usually includes the vessel type, cargo, position, velocity and route as well as other identifying and tracking data.
Cooperative systems are rarely used for comprehensive MDA. Most small (<300 ton) vessels are not required to carry either an Automatic Identification System (AIS) or a Long-Range Identification and Tracking System (LRIT), while fishing vessels, regardless of size, are not required to carry a Vessel Monitoring System (VMS). Additionally, illegally operating vessels rarely carry their systems or operate them accurately. Some vessels turn off their systems while others spoof their mandatory position reports. Operations such as search and rescue cannot be carried out effectively by relying solely on cooperative reports either. These reasons make non-cooperative systems among the most beneficial sources of information for a number of the MDA applications outlined above. In particular, Synthetic Aperture Radar (SAR) and Optical Imaging Satellite Systems have several advantages such as their remote access, global reach, frequency of information updates and the high amount of data they can collect and process.
While the last century saw incremental progress made in computer-generated detection and classification of objects in images, the creation of the first convolutional neural network in the 1980s and GPU-accelerated training in the 2000s enabled significant strides in machine learning approaches to detection, segmentation and localization of objects in images.
However, a number of distinct challenges exist which prevent the robust detection of vessels at sea. Sea surfaces can be complex, and variations in weather and vessel reflectivity can lead to a loss of system precision. Small, densely packed and blurry vessels, as well as vessels very close to land, have all proven challenging for detection systems. While traditional systems are inefficient and generally have lower accuracy, modern systems have been time-consuming to build and require large amounts of labeled data.
Lastly, no single technique has proved robust across sensors, leading to piecemeal solutions for various sensors, datasets and locations. This paper proposes a vision system which can provide robust target detection across disparate sensor types. The system comprises the following subsystems: Image Capture Subsystem, Image Processing Subsystem, Model Creation Subsystem, Object-of-Interest Detection Subsystem and System Evaluation Subsystem. It provides functionality for object detection using distinct, independent data sources for model creation and object detection.

Related work
Landsat-1, launched in 1972, was the first civil optical satellite. Since then, hundreds of optical satellites with varying resolutions have been launched, with many continuing to orbit our planet. Recent VHR additions like the WorldView and GeoEye series have expanded spectral and spatial resolutions while others like QuickBird and IKONOS have a higher radiometric resolution as well. An increasing number of optical satellite sensors now also provide more frequent coverage of Earth. At the turn of the century, there was a significant increase in the availability of commercial VHR sensor data and with it an explosion in the number of publications exploring the viability of maritime vessel detection using satellite systems.
Some of the earliest systems for maritime vessel detection used a number of pre-processing steps prior to target detection. Sea-land separation was considered crucial for accurate detection of vessels in harbors (Willhauck et al., 2005) as well as for reducing the high number of false positives generated when vessel detection systems were applied to land (Corbane et al., 2008). Consequently, coastline data was either incorporated from existing GIS data (Lavalle et al., 2011) or land masks were created from the images themselves (Dong et al., 2013). Similarly, key environmental effects (cloud coverage, waves and sunlight) were usually minimized using cloud masks (ESA, 2015), texture discrimination (Yang et al., 2014) or Fourier transform algorithms (Buck et al., 2007; Jin and Zhang, 2015).
Vessel detection and classification methods ranged from simple geometrical feature detection (Lin et al., 2012; Heiselberg, 2016) to machine learning techniques. Prior to the advancements made in object detection systems which used neural networks, support vector machines (SVM), a supervised classification technique, were widely used (Bi et al., 2010; Bi et al., 2012; Kumar and Selvi, 2011; Li and Itti, 2011; Xia et al., 2011; Guo and Zhu, 2012; Satyanarayana and Aparna, 2012; Song et al., 2014). Other classifiers for vessel detection include the Bayesian classifier (Antelo et al., 2009), random forest models (Johansson, 2011) and Fisher classification (Zhang et al., 2012). More recently, neural network-based systems have taken the world of image recognition and object recognition by storm. AlexNet, a convolutional neural network architecture designed by Alex Krizhevsky in 2012, achieved a top-5 error of 15.3%, a full 10 percentage points lower than that of the runner-up, and paved the way for significant strides in image classification, segmentation and object detection.
In Ramani et al. (2019), the authors build a vessel detection system for real-time maritime applications. The system employs a Mask R-CNN architecture to segment and classify 30 images every 30 seconds.
In Gallego et al. (2018), results from a Convolutional Neural Network are passed to a k-NN model to improve detection performance.
In Zhang et al. (2019), pre-processing of satellite images is performed using a support vector machine framework following which variations of the Faster R-CNN neural network architecture are applied to measure each system's performance on different sizes and types of vessels. The authors are able to identify a framework which performs reasonably well for both offshore and inland vessel detection.
Chen et al. (2020) also used CNNs to create an end-to-end detection system capable of detecting both inshore and offshore ships with an accuracy >90%. Their detection speed was 72 fps and their system intentionally balanced accuracy against speed of detection on the SAR Ship Detection Dataset (SSDD). Li et al. (2017) used a CNN architecture-based detection system on a custom dataset consisting of ships of various sizes as well as a variety of environmental and sea conditions. Their paper established a higher precision with the custom framework than an equivalent Faster R-CNN system applied on the same dataset.
In contrast to the common SAR and optical satellite systems used in other publications, Yang et al. (2018) use a remote sensing system which captured and segmented Google Earth images which were then used for vessel detection. The authors also used a custom neural network framework with a Feature Pyramid Network (FPN) to minimize false positives in images consisting of densely packed ships.
When we examine a collection of approaches used to build vessel detection systems, we observe a number of underlying trends:
-Neural networks have gained popularity in recent publications due to the largely scripted/automated approach to building highly accurate detection systems.
-Classification of vessels by vessel type has proven very challenging regardless of the type and resolution of the sensor(s) used.
-Most publications have built and tested their systems using a homogeneous dataset of images collected from either a single sensor or a set of sensors, thereby failing to establish robustness of their system across sensor types.
In this paper, we tackle the challenge of building a system robust enough to collect and use images from one sensor to detect objects in images collected from a second, disparate sensor. Such a system would be tunable, adaptable and re-purposable across applications. In the Object-of-Interest Detection subsystem, we evaluate several state-of-the-art algorithms as well as create a custom model architecture from scratch.
While we do not intend to recommend a winning algorithm to solve cross-sensor vessel detection, we show that the designed system has a high degree of cross-sensor vessel detection accuracy, paving the way for future research in tunable, adaptable and re-purposable systems which could prove robust across applications, sensors, ship types and ship sizes.

System overview

Image capture subsystem
The Image Capture Subsystem (Figure 1) uses two satellite sensor feeds along with two XML file feeds to obtain and provide data to the subsequent subsystems. The XML files contain annotated image information for the corresponding image feeds.
The first input is an optical aerial image feed of maritime scenes in the visible spectrum. The images are sourced in the RGB color scheme, and can contain zero, one or multiple maritime vessels in varying weather and lighting conditions. The images contain scenes from different regions of the world including Africa, Europe and Asia, and different water bodies including the Mediterranean Sea as well as the Atlantic and Pacific Oceans. While the images are of different sizes, the average image measures 512 x 512 pixels.
The second input is a synthetic aperture radar (SAR) generated feed of maritime scenes (sea waves, shallow sea topography, coastal zones, maritime vessels etc.) with a spatial resolution between 1 m and 500 m. This feed provides images of 256 pixels in both range and azimuth, and the vessels in these images have distinct scales and backgrounds. A given image can contain a single vessel, multiple vessels or none.
For each image feed, annotations are provided in the Pascal VOC format. The Pascal VOC format is a common annotation format for images which stores annotations in the XML file format with a separate XML annotation file for each image. Optionally, bounding box information is included in the [x-top-left, y-top-left, x-bottom-right, y-bottom-right] format.
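As an illustration of how one such annotation file could be read, the following is a minimal sketch using only the Python standard library. The sample XML, the file name and the "ship" label are hypothetical and are not taken from the datasets used in this paper; the tag names (object, bndbox, xmin, ymin, xmax, ymax) follow the usual Pascal VOC layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one image; the file name and the "ship" label
# are illustrative only.
SAMPLE_XML = """<annotation>
  <filename>scene_0001.png</filename>
  <size><width>512</width><height>512</height><depth>3</depth></size>
  <object>
    <name>ship</name>
    <bndbox><xmin>120</xmin><ymin>200</ymin><xmax>180</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return (filename, has_vessel, boxes) for one Pascal VOC annotation.

    boxes is a list of [x-top-left, y-top-left, x-bottom-right, y-bottom-right].
    """
    root = ET.fromstring(xml_text)
    objects = root.findall("object")
    boxes = []
    for obj in objects:
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # bounding box information is optional
        boxes.append(
            [int(float(bndbox.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        )
    # An image is labeled positive when at least one object is annotated.
    has_vessel = len(objects) > 0
    return root.findtext("filename"), has_vessel, boxes


name, has_vessel, boxes = parse_voc(SAMPLE_XML)
```

Reducing each annotation to a single presence/absence flag matches the binary task described later in the paper, while the boxes remain available where they are provided.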
The image feeds are tagged with their source before being merged together into a single stream and sent to the Image Processing Subsystem. The two streams of XML annotations comprise the other outputs of this system.

Image processing subsystem
The inputs to the Image Processing Subsystem (Figure 2) are the merged, source-tagged image stream and the two streams of XML annotations produced by the Image Capture Subsystem. First, the images are re-sized for uniformity across input streams and to match the dimensions of the input layer in the Model Creation Subsystem. The pixels in the image stream are then converted to the float datatype, following which each image is normalized. Normalization scales the pixel values down from a range of [0, 255] to a range of [0, 1]. Lastly, the image streams are split based on their source, annotations are appended and the output of the subsystem consists of two tagged and annotated image streams. Convolutional neural network architectures like AlexNet and GoogLeNet perform various image chopping and feature extraction steps which, in conjunction with pooling layers, make them translation (and to a large degree, rotation) invariant model architectures.

FIGURE 1
Image Capture Subsystem.

FIGURE 2
Image Processing Subsystem.
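
The re-sizing, float conversion and normalization steps can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: it operates on a single-channel image stored as a nested list, uses nearest-neighbour sampling (the interpolation method is not specified in this paper), and the function names and the 2 x 2 target size are hypothetical.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a 2-D grid of pixel values using nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image):
    """Convert 8-bit pixel values to floats and scale [0, 255] to [0, 1]."""
    return [[float(p) / 255.0 for p in row] for row in image]


def preprocess(image, size=(2, 2)):
    """Re-size first, then convert to float and normalize."""
    return normalize(resize_nearest(image, size[0], size[1]))


# A tiny 4 x 4 single-channel image with 8-bit pixel values.
raw = [
    [0, 10, 20, 30],
    [40, 50, 60, 70],
    [80, 90, 100, 110],
    [120, 130, 140, 255],
]
out = preprocess(raw, size=(2, 2))
```

In practice an RGB image would carry a third (channel) axis and an array library would replace the nested lists, but the order of operations stays the same.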

Model creation subsystem
The input to the Model Creation Subsystem (Figure 3) is a single annotated image feed. This feed is used to train a binary classification model to detect the presence of a vessel in an image using a combination of pre-defined model frameworks and hyperparameters. The models employed are (a) a custom convolutional neural network architecture, defined and trained from scratch, and (b) four common computer vision model architectures, used for transfer learning and benchmarking. For the latter, we re-define and fine-tune the last layers for our specific task while leaving the architecture and weights of the other layers as is. The outputs of the Model Creation Subsystem are the fitted parameters of each model and each model's predictions on the input image feed.

Object-of-interest detection subsystem
The inputs to the Object-of-Interest Detection Subsystem (OOIDS) (Figure 4) are the fitted model parameters and the second image feed on which OOI detection is to be performed. The fitted model parameters can either be the hyperparameters of the model (in which case the model will need to be re-fit on the original dataset) or a fit model, as we have assumed here. The model is applied ('scored') on the second image stream, producing predictions indicating the presence or absence of maritime vessels. The output of this subsystem is the set of model results on the second image stream. As a reminder, this is an image feed the model itself has not been exposed to, and is an attempt to measure the model's power on a disparate and independent data source.

FIGURE 3
Model Creation Subsystem.

FIGURE 4
Object-of-Interest Detection Subsystem.

System evaluation subsystem
The System Evaluation Subsystem (Figure 5) calculates and produces model metrics which measure the performance of the model on the dependent ('training') and independent ('test') data sources. The inputs to this subsystem are the model results on both the training (image feed #1) and test (image feed #2) datasets. Using these model results, model metrics such as accuracy, precision and recall can be calculated, dependent on the number and frequency of classes in each dataset. The metrics indicate the overall performance of the system at performing object detection using different sensors and sensor types, i.e. SAR and optical satellite sensors.
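For reference, the metrics named above can be computed from binary ground-truth labels and predictions as in the sketch below. The function name and the toy labels are hypothetical (1 denotes an image containing a vessel, 0 an image without one), and the F-score shown is the standard F1 score, which we assume is the variant intended.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = vessel)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # vessels found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # empty scenes kept empty
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed vessels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example: 3 images with vessels, 5 without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Because accuracy alone can be misleading when the classes are unbalanced, reporting precision and recall alongside it gives a fuller picture of cross-sensor performance.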

Methodology
The System Block Diagram is shown in Figure 6. To test system functionality and gauge performance, we use the MASATI (Maritime Satellite Imagery Dataset) and Sentinel datasets as inputs to the Image Capture Subsystem. These datasets contain maritime scenes captured in the visible spectrum by optical aerial cameras and by SAR, respectively. They mimic and satisfy the earlier outlined assumptions regarding the two satellite image feeds (III.A) and are accompanied by annotations indicating the presence/absence of maritime vessels, which are treated as ground truth in the subsequent model design and evaluation subsystems.
The datasets are tagged with their source name, re-sized to standardized dimensions and normalized, with the pixel values converted to the float datatype. The Model Creation Subsystem uses the Keras Deep Learning API with the TensorFlow backend to fit four pre-defined convolutional neural network architectures and one custom architecture on the MASATI dataset. The MASATI dataset consists of 1027 (48%) images containing one or more maritime vessels and 1132 (52%) images with none. In addition to a custom model trained from scratch, the four pre-defined architectures include VGG-16, proposed by Karen Simonyan and Andrew Zisserman of Oxford University in 2014 (the '16' refers to its number of weight layers), together with the ResNet50, InceptionV3 and Xception architectures compared in the System Evaluation Subsystem. Since (a) one of the primary goals behind most maritime object detection systems is real-time processing, and (b) our primary goal is to develop a system capable of using data from one sensor to detect objects in incoming data from a second sensor, each of the 5 models is trained for <=5 epochs. There is no minimum stopping threshold or other optimization criterion, since we want flexible models which are not overfit or optimized on the MASATI dataset alone.
While the custom model is trained from scratch, each predefined architecture has the following changes:
-The input layer is altered to match the dimensions of the incoming data stream,
-The output layer is altered to a softmax function with two classes of interest, and
-While the fitted weights of most layers stay the same, the last five layers are re-trained for the purpose of optimizing detection of our classes of interest.
Each model is trained using specific values of hyperparameters, following which the fitted parameter values are saved and transferred as outputs to the Object-of-Interest Detection Subsystem. The OOI Detection Subsystem loads the fitted models and uses them to predict the presence or absence of a maritime vessel in the second (SAR) input stream. The results of the 5 models on the SAR image stream are an input to the System Evaluation Subsystem, which calculates, compares and displays metrics for the system's user(s).
Model Configuration and hyperparameters for each model are shown in Table 2.

Results and discussion
The System Evaluation Subsystem calculates each model's accuracy in detecting ships on the two datasets, MASATI and Sentinel, referred to as the training data and test data, respectively. The number of images in each dataset as well as the time to train and score each model on the respective datasets are also calculated. These results are shown in Table 3.
As we can see, the results vary considerably across architectures.
-While the custom model, trained from scratch, has low accuracy, recall and F-Score, it has high precision and beats larger architectures like ResNet50 across the board when trained for only a few epochs.
-ResNet50, as we can see in Table 2, also has the highest number of parameters of all the architectures, indicating that the extra learning potential of this network likely requires additional parameter tuning and, in its current form, results in overfitting on the training data.
-Most pre-trained models performed better than the custom model, indicating that the extra layers, learning capacity and learned features in these models aided in our binary classification task, despite being designed for larger and more complex image classification and object detection tasks. In addition to the higher F-Scores, InceptionV3 and Xception have much lower training times than the custom architecture.
-Despite the datasets being collected from different sensors and sensor types, many of the models successfully identify maritime vessels in one dataset using training data solely from the other, with both high precision and recall, despite the unbalanced Sentinel dataset.
-While most modern systems built on underlying neural network architectures require sufficiently large (a) computing power, (b) time, and/or (c) data to perform well, a dataset of ~2K images was sufficient to capture between 62% and 94% of vessels in a dataset 10x as large (21,682 images) with training and testing times of <=10 minutes.
-While many systems require significant tuning and selection of hyperparameters to optimize object detection, limited fine-tuning resulted in respectable vessel detection results.
While we have not examined incorrect classification results further to discover potential underlying trends, future research in cross-sensor vessel detection could prove robustness across ship and sensor types with longer training times, other model architectures and/or further hyperparameter tuning.
We propose the following guidelines which similar studies could consider:
-Using multiple sensors for both system training and testing
-Verification of algorithms on varied maritime scenes
-Validation of accuracy and false detection rates across different ship sizes and difficult conditions
-Introduction of ship classification algorithms for specific applications
Given that earth observation is a rapidly growing field with an increasing availability of open data and new satellite technology, cross-sensor vessel detection systems would be more adept than traditional systems and could prove less cost-sensitive for new applications. Future research using small datasets and low system processing times may also lead to rapid detection rates, thereby aiding real-time maritime applications including safety, logistics and transportation.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.