Small Object Detection and Tracking in Satellite Videos With Motion Informed-CNN and GM-PHD Filter

Small object tracking in low-resolution remote sensing images presents numerous challenges. Targets are relatively small compared to the field of view, do not present distinct features, and are often lost in cluttered environments. In this paper, we propose a track-by-detection approach to detect and track small moving targets by using a convolutional neural network and a Bayesian tracker. Our object detection consists of a two-step process based on motion and a patch-based convolutional neural network (CNN). The first stage performs a lightweight motion detection operator to obtain rough target locations. The second stage uses this information combined with a CNN to refine the detection results. In addition, we adopt an online track-by-detection approach by using the Probability Hypothesis Density (PHD) filter to convert detections into tracks. The PHD filter offers a robust multi-object Bayesian data-association framework that performs well in cluttered environments, keeps track of missed detections, and presents remarkable computational advantages over different Bayesian filters. We test our method across various cases of a challenging dataset: a low-resolution satellite video comprising numerous small moving objects. We demonstrate the proposed method outperforms competing approaches across different scenarios with both object detection and object tracking metrics.


INTRODUCTION
In recent years, object detection and tracking in remote sensing videos have become an active area of research. Novel satellite and Wide Area Motion Imagery (WAMI) technologies have created an unprecedented demand for fast and automatic information retrieval. For example, Airbus' Zephyr high-altitude drones can cover up to 20 × 30 km² of continuous video surveillance, and the Chinese Jilin-1 satellite captures ground images spanning several kilometers at a 1-m spatial resolution and 20 Hz.
The generated images contain essential information for civilian and military domains when ground sensors are not locally available. Sample civilian applications include urban planning (Wijnands et al., 2021), automatic traffic monitoring (Kaack et al., 2019), driving behavioral research (Chen et al., 2021), and commerce management with ship monitoring (Cao et al., 2019). Similarly, object detection and tracking contribute to military applications such as border protection or abnormal activity monitoring. For example, Kirubarajan et al. (2000) present an approach to detect and track convoys in different scenarios such as road networks or open fields.
While object tracking has improved dramatically in recent years, a significant number of approaches address problems with large training datasets and feature-rich targets, such as pedestrian tracking in surveillance cameras or city landscapes. Novel methods, however, need to tackle application-specific challenges such as small object tracking in remote sensing images, and must cope with datasets that have scarce and incomplete annotations.
In particular, targets in satellite and high-altitude drone images present notable challenges to common detectors and trackers. First, objects of interest are very small compared to the field of view. For instance, Figure 1 shows a ground image with a resolution of 1 m/pixel where vehicles span on average 5 × 6 pixels and resemble white moving blobs. In fact, numerous small objects, such as motorcycles, appear at subpixel levels and are not detectable by common appearance-based object detectors. Additionally, images show diverse noise sources such as illumination changes, clouds, shadows, and environmental phenomena such as wind or rain. These noise sources generate numerous false positives when using motion as the main feature for object detection. Moreover, the orbital motion of satellites and drones introduces parallax noise for object detectors and motion prediction noise for object trackers.
In this paper, we present improvements and further results of our work presented by Aguilar et al. (2021), where we detect small objects using motion and appearance information. We use three consecutive frames to estimate moving object locations and we refine the detections using a patch-based Faster RCNN (Ren et al., 2015). Specifically, in this paper we improve the patch-based detection by adding the motion response to the Faster RCNN input. The combination of motion and appearance information on extracted patches significantly improves Faster RCNN's object detection.
Once we obtain object measurements, we feed the extracted data to the probability hypothesis density (PHD) filter, proposed by Mahler (2003). This filter models multi-object states under a Markovian framework, where the state of each tracked object is conditionally independent of all but the previous step. This assumption simplifies the filter and allows it to be computationally efficient in comparison to other related filters, at the cost of tracking single state instances instead of full target trajectories. In this paper, we propose an enhanced version of the PHD filter to propagate labels in time without compromising the filter's performance and also to discriminate surviving and appearing objects in each frame.
This paper is divided into five sections. We discuss popular object detection and tracking approaches used in satellite images in Section 2. We discuss the proposed method in Section 3, where we present the object detection and object tracking approaches. We show results for a challenging dataset in Section 4, and we discuss the conclusion and future work in Section 5.

RELATED WORK
While object detection and tracking are related, for the sake of simplicity we divide our literature review into two categories: object detection and object tracking applied to satellite images.

Static Image Object Detection
Static image object detection methods rely on spatial information to extract features and obtain object segmentation masks or bounding boxes. Popular approaches include Faster-RCNN, proposed by Ren et al. (2015), YOLO, proposed by Redmon et al. (2016), and Retina-Net, proposed by Lin et al. (2017). Although these works obtain remarkable results across several benchmarks, their performance decreases significantly when tested with small objects or weakly labeled datasets such as remote sensing images. In fact, Acatay et al. (2018) presented a comprehensive review of the drawbacks of using the base Faster-RCNN, YOLO, and Single Shot Detectors (SSD) on aerial images. Several researchers have approached satellite object detection with modified appearance-based object detectors for remote sensing images. For example, Ren et al. (2018) proposed a modified Faster-RCNN to detect small objects in satellite images by modifying the anchor boxes, adding skip connections, and including contextual information. However, this method focuses on capturing relatively large objects such as planes and large ships. Similarly, Qian et al. (2020) proposed a modified version of Faster-RCNN with a new architecture, new metric, and loss to optimize the training of small object bounding boxes that do not overlap.

Motion-Only Object Detection
Motion-based detection consists principally of background subtraction and frame differencing. A popular approach is to model backgrounds with Gaussian distributions and parameters derived from observations. This model has been extensively expanded, for example by the method of Stauffer and Grimson (2000), which uses Gaussian mixture models (GMM) instead of a single Gaussian distribution, or the work of Han and Davis (2012), which uses kernel density estimators (KDE) to estimate background distributions and support vector machines (SVM) to discriminate objects. Yang et al. (2016) proposed ViBe, an approach that updates the background estimation persistently and locally by using random selection. However, background subtraction methods generate noisy results when dealing with long sequences of images from a moving imaging system such as a satellite or drone.
Similarly, frame differencing has shown robustness across several methods. For example, Teutsch and Grinberg (2016) proposed to use frame differencing together with numerous post-processing filters to perform object detection in WAMI images. Also, Ao et al. (2020) proposed to use frame differencing together with noise estimation and shape-based filters to extract objects. These approaches obtain reasonable results, but they rely on complex hand-crafted post-processing steps that are hard to adapt to different noise sources.
Motion models are often robust and computationally lightweight; however, their performance relies heavily on frame registration. Small errors in frame registration or illumination changes often lead to large errors in motion-based object detection.

Spatio-Temporal Convolutional Neural Networks
State-of-the-art methods aim to combine appearance and motion to improve object detection. Generally, these methods use CNNs that take into account both motion and appearance information to extract object locations. For instance, LaLonde et al. (2018) proposed ClusterNet and FoveaNet, a two-stage approach for exploiting spatial and temporal data in small object detection. They use five consecutive frames as input to an under-sampling network to create clusters of object locations (ClusterNet), and then they use a region-specialized network (FoveaNet) to refine the outputs of the first network. Also, Canepa et al. (2021) proposed T-Rex Net, a network that takes frame differences as inputs to improve small object detection performance. Sommer et al. (2021) proposed an appearance-based and motion-based object detector by combining two networks, one to estimate moving object locations and one to extract image features. These methods showed promising results for ultra-high-resolution datasets such as the WPAFB 2009 dataset (AFRL, 2009), which contains a resolution of up to 0.25 cms/pixel; however, these approaches cannot be directly applied to lower-resolution data such as 1 m/pixel, as the target features are lost and undersampling could miss the small targets.

Feature Tracking
Common tracking approaches for satellite images include correlation filters and their extensions. Correlation filters locate a target by correlating each frame with learned filters and matching the coordinates of the resulting responses across frames. For example, Du et al. (2017) employed a correlation filter combined with a three-frame difference to track objects in satellite images, and Xuan et al. (2020) used correlation filters together with linear equations to track objects even under occlusions. While these methods are robust for object tracking, they rely on initialization and are normally adapted to track single objects.

Joint Tracking and Detection
Numerous state-of-the-art tracking methods are deep learning-based and learn to jointly detect and track objects. For instance, Bergmann et al. (2019) proposed Tracktor++, which uses a CNN to perform both object detection and tracking. Similarly, Feichtenhofer et al. (2017) proposed Track to Detect and Detect to Track, which regresses bounding boxes for both the object dimensions and the object temporal displacement. Robust CNN tracking approaches also include attention-based methods such as Patchwork, proposed by Chai (2019), which uses an attention mechanism to predict the location of an object in future frames. Jiao et al. (2021) surveyed new-generation deep learning-based techniques for object tracking, where methods mostly depend on correlating learned features in time.

Track by Detection
Tracking-by-detection approaches include SORT, proposed by Bewley et al. (2016), and its extension DeepSORT, proposed by Wojke et al. (2017). SORT is an online multiple object tracker (MOT) that uses multiple Kalman filters for tracking and the Hungarian algorithm (Kuhn and Yaw, 1955) for data association, and DeepSORT is an extension that uses object feature similarity to modify the data association step. These approaches obtain state-of-the-art results with remarkably low computational times; however, due to their pragmatic approach, they do not maintain a unified multi-object uncertainty model that can represent ambiguous target paths. Reid (1979) proposed a Bayesian framework named multiple hypothesis tracking (MHT), and Fortmann et al. (1980) proposed the joint probabilistic data association (JPDA) filter. These approaches consider unified probabilistic models and propagate the data association combinatorics over time. However, these filters are often slow due to the complicated data association process and the exponential increase of complexity with time.
Finally, the random finite set (RFS) framework and random finite set statistics proposed by Mahler (2007) offer an attractive track-by-detection paradigm without compromising the computational time. Popular trackers in this family include the PHD filter, proposed by Mahler (2003), the cardinalized PHD filter, presented by Vo et al. (2006), and more recent methods such as the Labelled Multi-Bernoulli Filter, developed by Vo and Vo (2013), and its computationally efficient version (Vo et al., 2017). In our case, we propose an extended version of the PHD filter due to its robust results and significant computational advantages.

PROPOSED APPROACH
In this paper, we extend the work proposed by Aguilar et al. (2021), which employs a 3-frame difference algorithm to approximate target locations and a patch-based CNN to refine detections. We extend this work by 1) concatenating the frame difference response to the input of the neural network, and 2) performing a tile-based patch selection rather than a coordinate-based patch selection. Finally, we use an extended version of the PHD filter, a Bayesian multi-object tracker, to convert frame-wise object detections into track hypotheses.

Motion Detector
We estimate object motion by finding differences between consecutive frames and adding their responses to create a likelihood $3FD_k(i, j)$, where $(i, j) \in \mathbb{R}^2$ are the pixel coordinates and $k \in \mathbb{N}$ is the time index. This process is summarized in the equations:
Subsequently, we binarize the $3FD_k(i, j)$ response with a frame-adaptive threshold to obtain rough object location estimates by applying the formulas:
where $c \in (0, 1)$ is a percentage-based threshold hyperparameter used to remove noisy 3-frame difference responses. We chose c by performing a grid search and selecting values that favor higher detection rates; in particular, we set c = 15% for all the experiments shown in Section 4. The 3-frame difference approach yields good object location estimates, but it fails to perform shape regularization, detect low-contrast objects, and detect slow-moving targets. Therefore, we complement the frame difference response with Faster RCNN (Ren et al., 2015). This addition helps to filter false positives, discriminate nearby objects, and increase the detection rate.
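The 3-frame difference and its binarization can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the frames are assumed to be registered grayscale arrays, and the threshold rule (a fraction c of the peak response in the frame) is an assumed reading of the "frame-adaptive, percentage-based" threshold described above.

```python
import numpy as np

def three_frame_difference(prev_frame, frame, next_frame):
    """Sum of absolute differences between consecutive grayscale frames."""
    # Cast to float first so that uint8 subtraction cannot wrap around.
    prev_f = prev_frame.astype(np.float64)
    cur_f = frame.astype(np.float64)
    next_f = next_frame.astype(np.float64)
    return np.abs(cur_f - prev_f) + np.abs(next_f - cur_f)

def binarize(response, c=0.15):
    """Assumed rule: keep pixels above a fraction c of the frame's peak response."""
    peak = response.max()
    if peak == 0:
        return np.zeros(response.shape, dtype=bool)
    return response > c * peak

# Toy example: a single bright pixel moving one column per frame.
frames = [np.zeros((5, 5), dtype=np.uint8) for _ in range(3)]
for step, img in enumerate(frames):
    img[2, 1 + step] = 255
response = three_frame_difference(*frames)
mask = binarize(response, c=0.15)
```

In this toy case the response is nonzero only at the three positions visited by the moving pixel, and the mask marks exactly those positions.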
We use the frame difference for two objectives: to reduce the target search space and to feed this information to the neural network. We begin by tiling the image starting at the origin and using the response $G(i, j)$ to find patches with moving objects. The patch-based approach presents significant advantages over a full-image approach: it focuses the detector on relevant areas rather than the whole image space, and it helps train a network with scarce data because one image can yield several training patches. We extract patches that contain object hypotheses (given by the frame difference response) and refine the detections using Faster RCNN.
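The tile-based patch selection can be sketched as below. This is only an illustration under stated assumptions: the function name is hypothetical, the tiles are assumed to be non-overlapping and aligned with the image origin, and a tile is kept when it contains at least one nonzero pixel of the binarized response.

```python
import numpy as np

def select_motion_tiles(mask, tile=128):
    """Return (top, left) origins of tiles that contain at least one motion pixel."""
    height, width = mask.shape
    origins = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            # Slicing clips automatically at the image border.
            if mask[top:top + tile, left:left + tile].any():
                origins.append((top, left))
    return origins

# Toy example: two motion pixels in a 256 x 256 binary response.
motion_mask = np.zeros((256, 256), dtype=bool)
motion_mask[10, 200] = True
motion_mask[130, 5] = True
tiles = select_motion_tiles(motion_mask, tile=128)
```

Only the tiles listed in `tiles` would be passed to the detector, which is how the search space shrinks to regions with motion.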
We modify the inputs to the traditional Faster RCNN by including three consecutive frames (shown in Figure 2B) and by concatenating these images with the frame difference response (shown in Figure 2C). This step differs from our previous approach (Aguilar et al., 2021), where we used only one patch as input for the CNN. Using three frames together with the frame difference response provides an additional cue for the network to detect moving objects (denoted by cyan and yellow colors in the concatenated inputs in Figure 2D). Figure 2E shows that our approach detects very small moving objects such as motorcycles that would have been missed by using only one frame as input. The addition of motion information improves detection rates for small moving objects and also reduces false positives from vehicle-looking static objects. Section 4.3 provides further details on the effect of using three frames and the frame difference as opposed to one frame.
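One way to assemble such an input is sketched below. The channel layout is an assumption: each frame is treated as a single-channel 8-bit patch and the frame difference is appended as a fourth channel, in channel-first order. The function name and the normalization are illustrative and not taken from the paper.

```python
import numpy as np

def build_network_input(prev_patch, patch, next_patch, fd_patch):
    """Stack three grayscale patches and the frame-difference patch as channels."""
    # Frames are assumed to be 8-bit, so dividing by 255 maps them to [0, 1].
    frames = np.stack([prev_patch, patch, next_patch]).astype(np.float32) / 255.0
    # The frame-difference response is scaled by its own peak (assumed choice).
    fd = fd_patch.astype(np.float32)
    peak = fd.max()
    if peak > 0:
        fd = fd / peak
    return np.concatenate([frames, fd[None, :, :]], axis=0)

# Toy example with 128 x 128 patches.
gray = np.full((128, 128), 255, dtype=np.uint8)
fd_response = np.zeros((128, 128))
fd_response[64, 64] = 510.0
net_input = build_network_input(gray, gray, gray, fd_response)
```

A detector consuming this tensor would need its first convolution adapted to four input channels; that adaptation is outside the scope of this sketch.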
Finally, we merge the patch results by applying the respective offset to the patch-based detections and performing global non-maximum suppression. The whole object detection process is summarized in Figure 3.
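The merging step can be illustrated with a plain NumPy sketch: patch-local boxes are shifted by their patch origin and then filtered with greedy non-maximum suppression. The function names, the IoU threshold of 0.5, and the example boxes are assumptions for illustration; boxes are assumed to have nonzero area.

```python
import numpy as np

def to_global(boxes, origin):
    """Shift patch-local [x1, y1, x2, y2] boxes by the patch origin (top, left)."""
    top, left = origin
    return boxes + np.array([left, top, left, top], dtype=np.float64)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_threshold]
    return keep

# Two illustrative patches whose detections partly coincide in image coordinates.
patch_a = to_global(np.array([[100.0, 50.0, 106.0, 56.0]]), origin=(0, 0))
patch_b = to_global(np.array([[5.0, 50.0, 11.0, 56.0], [20.0, 20.0, 26.0, 26.0]]), origin=(0, 96))
all_boxes = np.vstack([patch_a, patch_b])
all_scores = np.array([0.9, 0.8, 0.7])
kept = nms(all_boxes, all_scores)
```

Here the second box overlaps the first after the offset is applied, so only the higher-scoring one survives, while the third box is kept.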

Motion and Measurement Modeling
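We define the state vector for the jth target at time k as $x_k^j = [p_x, p_y, v_x, v_y, w, h]^T$, where $p_x, p_y \in \mathbb{R}$ denote the target x and y position, $v_x, v_y \in \mathbb{R}$ denote the target velocity components, and $w, h$ denote the target width and height, respectively. We assume the target motion is linear and adopt the constant velocity (CV) model with Gaussian noise. Hence we assume the targets evolve according to the equation:

$$p(x_k \mid x_{k-1}) = \mathcal{N}(x_k; F_k x_{k-1}, Q_k)$$

where $Q_k$ is the motion covariance and $F_k$ is the transition matrix defined as:

$$F_k = \begin{bmatrix} 1 & 0 & \tau & 0 & 0 & 0 \\ 0 & 1 & 0 & \tau & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

where $\tau$ is a hyperparameter related to the sampling frequency. Similarly, we define the ith measurement at time k as $z_k^i = [p_x, p_y, w, h]^T$, where $p_x, p_y, w, h \in \mathbb{R}$ denote the x and y coordinates, width, and height, respectively. We assume noisy Gaussian measurements of the form $p(z_k \mid x_k) = \mathcal{N}(z_k; H_k x_k, R_k)$, where $R_k$ is the measurement noise covariance and $H_k$ denotes the measurement matrix defined as:

$$H_k = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$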
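As a small worked example, the constant velocity model for the state $[p_x, p_y, v_x, v_y, w, h]^T$ and the measurement $[p_x, p_y, w, h]^T$ can be written in NumPy as below. The matrices follow the standard CV construction (positions advance by $\tau$ times the velocity, width and height stay constant); the function name and the numeric values are illustrative assumptions.

```python
import numpy as np

def cv_matrices(tau=1.0):
    """Transition matrix F and measurement matrix H for a constant-velocity model."""
    F = np.eye(6)
    F[0, 2] = tau  # p_x advances by tau * v_x
    F[1, 3] = tau  # p_y advances by tau * v_y
    H = np.zeros((4, 6))
    H[0, 0] = 1.0  # observe p_x
    H[1, 1] = 1.0  # observe p_y
    H[2, 4] = 1.0  # observe w
    H[3, 5] = 1.0  # observe h
    return F, H

F, H = cv_matrices(tau=1.0)
state = np.array([10.0, 20.0, 2.0, -1.0, 5.0, 6.0])
predicted_state = F @ state            # noise-free one-step prediction
predicted_measurement = H @ predicted_state
```

With $\tau = 1$, a target at (10, 20) moving at (2, -1) pixels per frame is predicted at (12, 19), and the predicted measurement keeps only position and size.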

PHD Filter
We aim to estimate the multi-target states from a sequence of possibly noisy or cluttered measurements. We approach this task by using the random finite set (RFS) statistics defined by Mahler (2007). This setup provides a Bayesian formulation for modeling objects and observations as set-valued random variables. Specifically, the collection of target states at time k is defined by the finite set $X_k$, and the collection of measurements received at time k by the finite set $Z_k$. The PHD filter provides an approximation to the optimal multi-target filter by modeling the posterior $p_{k|1:k}(X_k \mid Z_{1:k})$ as a Poisson random finite set and by recursively propagating its first-order statistical moment, called the probability hypothesis density (PHD) function. The PHD filter achieves this task by iteratively performing a two-step process: the prediction step and the update step.
The prediction step consists of estimating the PHD function $D_{k|1:k-1}(X_k \mid Z_{1:k-1})$ at time k given only previous measurements, abbreviated as $D_{k|k-1}(x)$. The update step consists of estimating the posterior PHD $D_{k|1:k}(X_k \mid Z_{1:k})$ using the predicted information and the new measurement obtained at time k, abbreviated as $D_{k|k}(x)$.

The GM-PHD Filter
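The Gaussian Mixture PHD filter (GM-PHD), proposed by Vo and Ma (2006), is a closed-form solution to the PHD recursion, and its convergence properties are analyzed by Clark and Vo (2007). The GM-PHD relies on the assumptions of linear Gaussian motion and measurement models explained in Section 3.2.1. Additionally, the GM-PHD assumes that the posterior at the previous time frame, $D_{k-1|k-1}(x)$, has the form of a Gaussian mixture given by:

$$D_{k-1|k-1}(x) = \sum_{j=1}^{J_{k-1|k-1}} \omega_{k-1|k-1}^{j} \, \mathcal{N}\left(x; m_{k-1|k-1}^{j}, P_{k-1|k-1}^{j}\right)$$

where $J_{k-1|k-1}$ is the number of Gaussian components and $\omega_{k-1|k-1}^{j}$, $m_{k-1|k-1}^{j}$, and $P_{k-1|k-1}^{j}$ are the weight, mean, and covariance of the jth component. The GM-PHD filter estimates the predicted $D_{k|k-1}(x)$ and updated $D_{k|k}(x)$ PHDs with Gaussian mixtures. The closed-form solution for the GM-PHD prediction step is given by the equation:

$$D_{k|k-1}(x) = \lambda(x) + p_s \sum_{j=1}^{J_{k-1|k-1}} \omega_{k-1|k-1}^{j} \, \mathcal{N}\left(x; F_k m_{k-1|k-1}^{j}, Q + F_k P_{k-1|k-1}^{j} F_k^T\right)$$

where $F_k$ and $Q$ are respectively the transition and motion covariance matrices defined in Section 3.2.1, $p_s$ is the survival probability, and $\lambda(x)$ is the birth RFS intensity, which will be described in Section 3.2.4. Finally, we update the GM-PHD posterior following the equation:

$$D_{k|k}(x) = (1 - p_D) \, D_{k|k-1}(x) + \sum_{z \in Z_k} \sum_{j=1}^{J_{k|k-1}} \omega_{k|k}^{j}(z) \, \mathcal{N}\left(x; m_{k|k}^{j}(z), P_{k|k}^{j}\right)$$

where $D_{k|k-1}(x)$ denotes the predicted GM components and $p_D$ is the probability of detection. The terms $m_{k|k}^{j}(z)$ and $P_{k|k}^{j}$ represent the updated component mean and covariance and are defined, with the Kalman gain $K_k^{j}$, as:

$$m_{k|k}^{j}(z) = m_{k|k-1}^{j} + K_k^{j}\left(z - H_k m_{k|k-1}^{j}\right), \qquad P_{k|k}^{j} = \left(I - K_k^{j} H_k\right) P_{k|k-1}^{j}, \qquad K_k^{j} = P_{k|k-1}^{j} H_k^T \left(H_k P_{k|k-1}^{j} H_k^T + R_k\right)^{-1}$$

The updated component weight $\omega_{k|k}^{j}(z)$ is defined as:

$$\omega_{k|k}^{j}(z) = \frac{p_D \, \omega_{k|k-1}^{j} \, l_k^{j}(z)}{\kappa_k(z) + p_D \sum_{i=1}^{J_{k|k-1}} \omega_{k|k-1}^{i} \, l_k^{i}(z)}$$

where $\kappa_k(z)$ denotes the clutter process intensity (modeled with a Poisson random finite set) and $l_k^{j}(z)$ denotes the target-measurement association likelihood defined as:

$$l_k^{j}(z) = \mathcal{N}\left(z; H_k m_{k|k-1}^{j}, H_k P_{k|k-1}^{j} H_k^T + R_k\right)$$

We estimate the number of targets (the cardinality) by summing all the weights in the posterior PHD, and we apply merging and pruning to components with very small weights in order to preserve the computational advantages of the PHD filter.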
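The prediction and update steps can be sketched in NumPy as below. This is a minimal illustration of the standard GM-PHD recursion of Vo and Ma (2006), not the authors' implementation: components are plain (weight, mean, covariance) tuples, the clutter intensity is assumed constant, merging is omitted, and all function names and numeric values are hypothetical.

```python
import numpy as np

def gaussian_pdf(z, mean, cov):
    """Multivariate normal density N(z; mean, cov)."""
    diff = z - mean
    norm = np.sqrt((2.0 * np.pi) ** len(z) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def predict(components, F, Q, p_s, births):
    """Propagate surviving components and append the birth components."""
    survived = [(p_s * w, F @ m, F @ P @ F.T + Q) for w, m, P in components]
    return survived + list(births)

def update(predicted, measurements, H, R, p_d, clutter):
    """GM-PHD update with a constant clutter intensity (assumed)."""
    # Missed-detection terms keep the predicted mean and covariance.
    updated = [((1.0 - p_d) * w, m, P) for w, m, P in predicted]
    for z in measurements:
        terms = []
        for w, m, P in predicted:
            S = H @ P @ H.T + R              # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
            likelihood = gaussian_pdf(z, H @ m, S)
            new_mean = m + K @ (z - H @ m)
            new_cov = (np.eye(len(m)) - K @ H) @ P
            terms.append((p_d * w * likelihood, new_mean, new_cov))
        norm = clutter + sum(t[0] for t in terms)
        updated.extend((w / norm, m, P) for w, m, P in terms)
    return updated

def prune(components, threshold=1e-5):
    """Drop components with negligible weight."""
    return [c for c in components if c[0] > threshold]

def cardinality(components):
    """Expected number of targets: the sum of the mixture weights."""
    return sum(w for w, _, _ in components)

# Toy 1-D example with state [position, velocity] and a position measurement.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.1]])
prior = [(1.0, np.array([0.0, 1.0]), np.eye(2))]
pred = predict(prior, F, Q, p_s=0.99, births=[])
post = prune(update(pred, [np.array([1.5])], H, R, p_d=0.9, clutter=1e-3))
estimated_targets = cardinality(post)
```

The tuple-based representation keeps the sketch short; a practical implementation would also merge nearby components, as described above, to bound the number of Gaussians over time.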

PHD Filter Enhancements
We use a measurement-driven approach to estimate the birth intensity $\lambda(x)$. Specifically, we use an adapted measurement classification similar to Fu et al. (2018) to divide measurements into surviving measurements, $Z_k^s$, and birth measurements, $Z_k^b$. During each iteration, we use the Hungarian algorithm to find the optimal matching between the new measurement set, $Z_k$, and the set of spatial components of the predicted GM-PHD: $\{H m_{k|k-1}^{j}\}_{j=1,2,\ldots,J_{k|k-1}}$. If the distance between a measurement and its matched predicted component mean is less than a threshold, we classify the measurement as a surviving measurement; otherwise, the measurement, like all unassigned measurements, is classified as a birth proposal.
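The measurement classification can be sketched with SciPy's `linear_sum_assignment`, which solves the same optimal assignment problem as the Hungarian algorithm. The function name, the use of Euclidean distance on positions only, and the gate of 5 pixels are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def classify_measurements(measurements, predicted_positions, gate=5.0):
    """Split measurement indices into surviving and birth sets."""
    measurements = np.asarray(measurements, dtype=float)
    predicted_positions = np.asarray(predicted_positions, dtype=float)
    if len(measurements) == 0 or len(predicted_positions) == 0:
        return [], list(range(len(measurements)))
    # Pairwise Euclidean distances: rows are measurements, columns are components.
    diff = measurements[:, None, :] - predicted_positions[None, :, :]
    cost = np.linalg.norm(diff, axis=2)
    rows, cols = linear_sum_assignment(cost)
    # A matched measurement survives only if it falls inside the gate.
    surviving = [int(r) for r, c in zip(rows, cols) if cost[r, c] < gate]
    birth = [i for i in range(len(measurements)) if i not in surviving]
    return surviving, birth

# Toy example: two predicted components and three measurements.
z_k = [[10.0, 10.0], [50.0, 50.0], [100.0, 100.0]]
predicted = [[11.0, 10.0], [48.0, 51.0]]
surviving_idx, birth_idx = classify_measurements(z_k, predicted)
```

The third measurement has no nearby predicted component, so it becomes a birth proposal and would seed a new Gaussian component.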
We implement the label preserving structure proposed by Panta et al. (2009) as the original GM-PHD filter does not account for target labels or past trajectories.This extension initializes a label for every Gaussian mixture component and propagates the label in time without affecting the filter performance.Each birth step initializes new labels for each birth component and the labels are tracked during the prediction and the data association step.These advantages contribute to keeping track of possible target trajectories without compromising the filter computational load.

Evaluation Metrics
We evaluate our methods by using object detection and object tracking metrics. We use ground truth annotations in the form of a set of annotated objects for each frame k, where k is the frame number and $o_i = (p_x, p_y, l)$ is a single annotated object at coordinates $(p_x, p_y)$ with associated label $l$. We let an estimated target be $\hat{o}_i = (\hat{p}_x, \hat{p}_y, \hat{l})$, where $\hat{p}_x, \hat{p}_y$ are the location components of the object state inferred by the GM-PHD filter, and $\hat{l}$ is the inferred associated label. At every frame, we match the set of detected targets with the set of ground truth objects. We label an estimated target $\hat{o}_i$ as a true positive (TP) if it is within five pixels of an unmatched ground truth object; otherwise, we label it as a false positive (FP). Similarly, we label any ground truth target that has not been matched to an estimated target as a false negative (FN). Finally, we count an identity switch (IDS) for a track if its object track hypothesis is associated with more than one ground truth label $l$.
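The per-frame labeling can be sketched as below. The matching strategy is an assumption: the text does not specify how estimates are paired with ground truth, so this sketch uses a greedy nearest-neighbor match in the order the estimates are given. The function name is hypothetical.

```python
import math

def match_frame(estimates, ground_truth, max_dist=5.0):
    """Greedy matching of (x, y, label) estimates to ground truth objects.

    Returns (matches, false_positives, false_negatives), where matches is a
    list of (estimate_index, ground_truth_index, distance).
    """
    unmatched_gt = set(range(len(ground_truth)))
    matches = []
    false_positives = 0
    for est_idx, (ex, ey, _) in enumerate(estimates):
        best, best_dist = None, max_dist
        for gt_idx in unmatched_gt:
            gx, gy, _ = ground_truth[gt_idx]
            dist = math.hypot(ex - gx, ey - gy)
            if dist <= best_dist:
                best, best_dist = gt_idx, dist
        if best is None:
            false_positives += 1      # no free ground truth object within range
        else:
            unmatched_gt.remove(best)
            matches.append((est_idx, best, best_dist))
    return matches, false_positives, len(unmatched_gt)

# Toy frame: three estimates, two annotated objects.
est = [(10.0, 10.0, 1), (40.0, 40.0, 2), (90.0, 90.0, 3)]
gt = [(12.0, 11.0, "a"), (40.0, 47.0, "b")]
tp_matches, fp_count, fn_count = match_frame(est, gt)
```

In this example only the first estimate lies within five pixels of an annotation, giving one TP, two FPs, and one FN.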

Object Detection Metrics
For object detection, we report the F1 score, which is a widely accepted metric for evaluating the quality of a detector. The F1 score is defined as:

$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where precision denotes the ratio of relevant hypotheses proposed by the object detector and is defined as:

$$\text{precision} = \frac{TP}{TP + FP}$$

Recall denotes the percentage of correctly detected objects in comparison to the total number of available objects and is defined as:

$$\text{recall} = \frac{TP}{TP + FN}$$

We report these metrics as percentages, where the best score is 100 and the worst score is 0. Additionally, we present a precision-recall curve to show the robustness of the proposed approach over the possible parameter ranges and to show its improved performance over competing approaches. We use these tests to choose the parameters used to compute the F1 score for each listed method.
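For reference, the three detection scores can be computed from the TP, FP, and FN counts as in the short sketch below. The function name and the example counts are illustrative; the scores are returned as percentages to match the reporting convention above.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f1) as percentages from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return 100.0 * precision, 100.0 * recall, 100.0 * f1

# Example: 6 true positives, 2 false positives, 6 false negatives.
precision, recall, f1 = detection_scores(tp=6, fp=2, fn=6)
```

With these counts, precision is 75%, recall is 50%, and the F1 score is 60%, which shows how F1 penalizes an imbalance between the two.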

Object Tracking Metrics
We also report the CLEAR MOT tracking metrics, proposed by Bernardin and Stiefelhagen (2008), as they have become popular and robust metrics for tracking algorithms. We report the multiple object tracking accuracy (MOTA), which evaluates the quality of the recovered tracks. It considers FPs, FNs, and identity switches (IDSs). The MOTA score is defined as:

$$MOTA = 1 - \frac{\sum_{k=1}^{N} \left(FN_k + FP_k + IDS_k\right)}{\sum_{k=1}^{N} GT_k}$$

where N refers to the number of frames, and $FN_k$, $FP_k$, $IDS_k$, and $GT_k$ refer to the false negatives, false positives, identity switches, and number of ground truth objects at frame k, respectively. The MOTA score has a range of $(-\infty, 1]$, where negative values indicate poor performance and one is the best possible score. In this work, we report the scores as percentages for consistency with the literature. We also report the multiple object tracking precision (MOTP), which considers the average distance error between the detected objects and the ground truth objects. The MOTP is defined as:

$$MOTP = \frac{\sum_{i,k} d_{i,k}}{\sum_{k} c_k}$$

where $c_k$ refers to the number of correctly detected objects at frame k and $d_{i,k}$ denotes the distance between a ground truth object and the detected hypothesis. The MOTP score is in the range $[0, \infty)$, where 0 denotes the perfect score and larger values denote worse performance.
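A minimal sketch of the two tracking scores, computed from per-frame counts and per-frame lists of matching distances, is shown below. Function names and example values are illustrative assumptions.

```python
def mota(fn_per_frame, fp_per_frame, ids_per_frame, gt_per_frame):
    """Multiple object tracking accuracy from per-frame counts."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(ids_per_frame)
    return 1.0 - errors / sum(gt_per_frame)

def motp(distances_per_frame):
    """Average distance over all correctly matched objects in all frames."""
    total = sum(sum(frame) for frame in distances_per_frame)
    count = sum(len(frame) for frame in distances_per_frame)
    return total / count

# Two frames with five ground truth objects each.
accuracy = mota([1, 0], [0, 1], [0, 0], [5, 5])
precision_px = motp([[1.0, 2.0], [3.0]])
```

Note that MOTA can become negative when the number of errors exceeds the number of ground truth objects, which is why its range is unbounded below.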
Finally, we report track quality measures in a format similar to Dendorfer et al. (2021). We call a trajectory mostly tracked (MT) if we can persistently track at least 80% of its path. Similarly, we call a trajectory mostly lost (ML) if we can track 20% or less of its ground truth trajectory. We report these scores as percentages, where larger MT percentages denote better performance and larger ML percentages denote worse performance.
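The two thresholds can be applied per ground truth trajectory as sketched below. The intermediate label "PT" (partially tracked) is not used in the text above and is included only so that every trajectory receives a label; the function names and example values are assumptions.

```python
def track_quality(tracked_frames, total_frames):
    """Classify one ground truth trajectory by its tracked fraction."""
    ratio = tracked_frames / total_frames
    if ratio >= 0.8:
        return "MT"   # mostly tracked
    if ratio <= 0.2:
        return "ML"   # mostly lost
    return "PT"       # partially tracked (label assumed for completeness)

def summarize(tracks):
    """tracks: list of (tracked_frames, total_frames); returns MT/ML percentages."""
    labels = [track_quality(tracked, total) for tracked, total in tracks]
    return {name: 100.0 * labels.count(name) / len(labels) for name in ("MT", "ML")}

# Four trajectories observed over 58 labeled frames each.
quality = summarize([(58, 58), (50, 58), (10, 58), (30, 58)])
```

In this example two of the four trajectories are mostly tracked and one is mostly lost.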

Experiment Set up
For evaluation purposes, we use the CGSTL dataset, available at https://mall.charmingglobe.com. This dataset contains a video of the city of Valencia, Spain, recorded on 7 March 2017 by the Jilin-1 satellite. Its spatial resolution is 1 m/pixel and the video spans 12 km², with a size of 3,071 × 4,096 pixels. The video contains 580 frames and represents 29 s of video imaged at 20 frames per second. The labels were provided by Ao et al. (2020) and contain the (x, y) object center coordinates and the width and height of the object bounding boxes. The provided ground truth contains strong labeling for only moving targets in three areas of interest (AoI) of size 500 × 500 pixels (shown in Figure 4). The approximate coordinate locations of the areas are AoI 1 [520, 1616], AoI 2 [1074, 1895], and AoI 3 [450, 2810] with respect to the first frame. Additionally, we performed image stabilization using ORB (Rublee et al., 2011) to compensate for the satellite motion during the recorded video. Finally, only one in every ten frames is labeled (58 labeled frames in total); hence, we used the stabilization procedure and linear interpolation between frames to fill in the subsampled labels. The stabilization procedure has a significant impact on object detection, object tracking, and score evaluation across all 580 frames, as these methods depend on linear object motion and a static background. It is worth mentioning that we improve the stabilization procedure over our previous work (Aguilar et al., 2021) by using the Python OpenCV implementation of ORB (Rublee et al., 2011); hence our "true positive" distance criterion is set to five pixels rather than 20 pixels as in our previous work.
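The label interpolation between two labeled frames can be sketched as follows. The function name and the (x, y, w, h) tuple layout are assumptions, and the sketch presumes that the same object has already been identified in both labeled frames and that the frames are stabilized.

```python
def interpolate_labels(frame_a, box_a, frame_b, box_b):
    """Linearly interpolate (x, y, w, h) for frames strictly between two labels."""
    span = frame_b - frame_a
    interpolated = {}
    for frame in range(frame_a + 1, frame_b):
        t = (frame - frame_a) / span
        interpolated[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    return interpolated

# An object labeled at frames 0 and 10 (one in every ten frames is labeled).
filled = interpolate_labels(0, (100.0, 200.0, 5.0, 6.0), 10, (110.0, 220.0, 5.0, 6.0))
```

The nine intermediate frames receive evenly spaced positions, which is only reasonable if the object moves approximately linearly between labeled frames.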
All of the AoIs contain highways and vehicles moving at high speed. AoI 1 contains a roundabout, where objects reduce their velocity and travel in clusters. AoI 2 contains a highway next to farming structures that create numerous false positives for both motion-based and appearance-based object detectors. AoI 3 contains a highway with objects moving at high speeds. It is worth mentioning that all AoIs contain numerous motorcycles and very small objects that are often missed in the ground truth annotations due to the difficulty of labeling such objects at such low image resolution. For each AoI, we trained the network using the other two AoIs as training data due to the scarcity of ground truth data. We trained the networks using extracted patches of size 128 × 128 centered at ground truth objects, and we augmented the data with vertical and horizontal patch flips and random translations. We used the PyTorch implementation of Faster-RCNN with a pre-trained ResNet50, proposed by He et al. (2016), as the backbone for feature extraction. The networks were trained on an NVIDIA Quadro using stochastic gradient descent as the optimizer with a learning rate of lr = 0.005 and a weight decay of 0.0005.

Ablation Studies
We perform ablation studies to investigate the impact of using patch-based inference and the impact of including motion information on object detection quality. We report the F1 scores for our method using patch selection only, motion information only, and patch selection and motion information combined. We evaluate these scores across all AoIs and report the average precision, recall, and F1 scores for each combination.

[Table fragment (method: precision, recall, F1): Yang et al. (2016): 74.4, 56.9, 64.47; GMM, Wren et al. (1997): 35.9, 54.9, 43.43; Faster-RCNN, Ren et al. (2015): 62]

Patch-Based Inference
We test the effect of using a patch-based method by comparing a full-image and patch-based inference with Faster RCNN.
Table 1 shows that a full-image Faster RCNN obtains an F1 score of 61.69, while a patch-based Faster RCNN increases the F1 score to 69.22. The patch-based approach outperforms the full-image Faster RCNN in the precision score because it reduces the search space to areas with moving objects and decreases the ratio of FPs. This result is expected, as satellite images contain numerous blob-looking objects that Faster RCNN alone would detect as vehicles, yielding false positives. These results are developed further and shown numerically and visually in Section 4.4. Additionally, we test the effect of varying the patch size by evaluating average object detection metrics using patch sizes of 32, 64, 128, 256, and 512 (full image).
The effects of the patch size are shown in Table 2, where the highest F1 score is obtained for the patch size of 128 × 128 pixels. During our experiments, we concluded that the patch size of 128 × 128 focuses the CNN on smaller regions while preserving contextual information. In fact, a patch size of 64 × 64 yielded numerous false positives from static objects with a white-blob appearance. In contrast, large patch sizes such as 256 × 256 and 512 × 512 produced large numbers of missed detections due to the small object size in comparison with the field of view.

Motion-Based Inference
We investigate the effect of including motion information by testing the full-image Faster RCNN combined with motion information. We achieve this task by feeding three consecutive frames concatenated with the 3-frame difference response to Faster RCNN. Table 1 shows that including motion information for the full-image Faster RCNN improves the F1 score from 61.69 to 70.05. This improvement occurs due to the increase in the precision score, from 56.73 to 69.46. Our results show that including motion information also helps Faster RCNN to filter non-moving objects in a similar fashion to using a patch-based approach.

Motion and Patch-Based Inference
Finally, we test the effects of adding both motion information and a patch-based approach to the original Faster RCNN. Table 1 shows that adding both motion information and patch-based inference increased the F1 score of the original Faster RCNN by 6 and 7% respectively. The combined effect of using patch inference and including motion information reduced the false-positive ratios further, thus increasing the precision score from 69.46 and 69.06 to 78.13. It is worth noting that neither the addition of motion nor the patch-based approach contributed to increasing the recall score. In fact, the full-image Faster RCNN obtains higher recall values than the proposed approach at the cost of increasing the number of false detections. These results suggest further development, explained in Section 5.

Object Detection Evaluation
We evaluate the proposed object detector using the F1 metric mentioned in Section 4.1.1, and we compare its performance with five competing object detectors: the custom 3-frame difference proposed by Ao et al. (2020), background subtraction using Gaussian mixture models proposed by Wren et al. (1997), ViBe, proposed by Yang et al. (2016), Faster RCNN, proposed by Ren et al. (2015), and the patch-based object detector presented by Aguilar et al. (2021). We calibrate each method's parameters by computing a precision-recall curve on AoI 1, shown in Figure 5. We also show visual and numerical results for each AoI by reporting the precision, recall, and F1 scores for each competing method in Table 3 and by showing sample object detection results in Figure 6 and Figure 7.
We varied the threshold and confidence parameters over 11 points in the range (0, 1) for the following methods: 3-frame difference, GMM, Faster RCNN, patch-based RCNN, and the proposed approach. For ViBe, we varied the neighbor radius parameter R over 11 points in the range (5, 50). Figure 5 shows that our method is robust to parameter variations: it obtains better F1 scores across a wide parameter range, because the combination of appearance and temporal information increases true positives and decreases false negatives.
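The calibration sweep can be sketched as below. This is an illustrative outline only, with several assumptions: the 11 points are taken as evenly spaced thresholds from 0.0 to 1.0, each detection has already been matched (or not) to a distinct ground-truth object, and the helper names `precision_recall_f1` and `sweep_thresholds` are hypothetical.

```python
from typing import List, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from detection counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


def sweep_thresholds(
    scores: List[float],
    is_true: List[bool],
    num_gt: int,
    num_points: int = 11,
) -> List[Tuple[float, float, float, float]]:
    """Return (threshold, precision, recall, F1) for evenly spaced thresholds.

    `is_true[i]` says whether detection i matches a ground-truth object;
    `num_gt` is the total number of ground-truth objects.
    """
    results = []
    for i in range(num_points):
        threshold = i / (num_points - 1)
        kept = [ok for score, ok in zip(scores, is_true) if score >= threshold]
        tp = sum(kept)          # kept detections that match the ground truth
        fp = len(kept) - tp     # kept detections that do not
        fn = num_gt - tp        # ground-truth objects left undetected
        results.append((threshold, *precision_recall_f1(tp, fp, fn)))
    return results


# Toy example: four detections, three of them correct, four labeled objects.
curve = sweep_thresholds([0.9, 0.8, 0.3, 0.2], [True, True, False, True], num_gt=4)
best_threshold = max(curve, key=lambda row: row[3])[0]
```

Selecting the threshold with the highest F1 on one area and reusing it on the others mirrors the calibration on AoI 1 described above.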
Figure 6 shows sample results for AoI 1. This area contains clusters of small moving objects at a roundabout and also presents numerous small vehicles such as motorcycles or bicycles. Figure 6 shows that ViBe and GMM struggle to detect small and low-contrast targets; hence, their recall values in Table 3 are the lowest for AoI 1.
Similarly, the 3-frame difference approach merges and splits nearby targets. On the other hand, Figure 6 shows that the supervised approaches detect a large number of relevant objects; thus, the recall scores of all these methods exceed 75%. However, both Faster RCNN and patch-based RCNN suffer from false positives, such as objects detected in farms or buildings. These artifacts reduce the overall F1 score of the detectors.
Figure 7 shows AoI 2, which contains two high-speed highways next to buildings with rich textures that generate false positives. For example, Figure 7 shows clusters of moving objects. Figure 7 shows that both Faster RCNN and the patch-based RCNN detect false positives in the static background, while our approach discriminates only moving objects. Table 3 shows that the proposed approach obtains better F1 scores than all the competing methods, thanks to a better combination of precision and recall. It detects more relevant objects while reducing the overall ratio of false positives.

Object Tracking Evaluation
We compare object tracking performance using the MOTA, MOTP, MT, ML, and F1 scores shown in Tables 4, 5, and 6. We compare the proposed GM-PHD tracker with the SORT tracker, developed by Bewley et al. (2016), and with the Generalized Labeled Multi-Bernoulli (GLMB) filter, developed by Vo et al. (2017). We test the tracking output of each object detector listed in Table 3 combined with each of the three filters. The rows marked with an asterisk (*) in Tables 4, 5, and 6 show tracking metrics using ground truth object detections as filter inputs. These measurements simulate ideal object detectors and help calibrate the filters' parameters. Tables 5 and 6 show robust performance for all three trackers across AoI 2 and AoI 3 (high-speed highways): all three filters obtain MOTA scores close to 99%. However, Table 4 shows a case where SORT outperforms the GM-PHD and the GLMB filters when tracking with ground truth labels. SORT obtains a MOTA score of 99.4%, while the GLMB filter obtains 85.1% and the GM-PHD filter obtains 89.7%. The performance of the GM-PHD and GLMB filters decreases mostly due to the increased uncertainty and label switches for nearby slow-moving targets inside the roundabout of AoI 1.
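For reference, the tracking scores can be computed along the following lines. The sketch assumes the standard CLEAR MOT definition of MOTA (one minus the ratio of misses, false positives, and identity switches to the number of ground-truth objects) and the commonly used 80% and 20% coverage limits for mostly tracked (MT) and mostly lost (ML) trajectories; the exact definitions used in Section 4.1 may differ in detail, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def mota(per_frame: Iterable[Tuple[int, int, int, int]]) -> float:
    """MOTA from per-frame (misses, false positives, id switches, num_gt)."""
    errors = 0
    total_gt = 0
    for misses, false_positives, id_switches, num_gt in per_frame:
        errors += misses + false_positives + id_switches
        total_gt += num_gt
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - errors / total_gt


def mostly_tracked_and_lost(
    coverage: List[float], mt_limit: float = 0.8, ml_limit: float = 0.2
) -> Tuple[int, int]:
    """Count trajectories tracked for at least 80% / at most 20% of their life.

    `coverage[i]` is the fraction of frames in which ground-truth
    trajectory i was successfully tracked.
    """
    mt = sum(1 for c in coverage if c >= mt_limit)
    ml = sum(1 for c in coverage if c <= ml_limit)
    return mt, ml


# Two frames with 10 ground-truth objects each and three errors in total.
score = mota([(1, 0, 0, 10), (0, 1, 1, 10)])
mt, ml = mostly_tracked_and_lost([0.9, 0.5, 0.1])
```

Because identity switches enter the numerator directly, label switches among nearby slow-moving targets lower MOTA even when every object is detected in every frame.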
The second to seventh rows of Tables 4, 5, and 6 show tracking metrics for each object detector output. These detectors present considerable challenges for trackers due to clutter measurements and numerous misdetections. Tables 4, 5, and 6 show that both the GLMB and GM-PHD filters outperform the SORT filter for object detectors with a high detection rate. For instance, the GM-PHD filter obtains higher MOTA scores for 3-frame difference, Faster RCNN, patch-based Faster RCNN, and the proposed method. These results are reflected in Figure 8, where the GM-PHD filter recovers most of the objects moving in the roundabout. On the other hand, SORT outperforms the GM-PHD and GLMB filters for object detectors with a low detection rate, such as ViBe and GMM. There, SORT obtains higher MOTA scores than the GM-PHD filter, but these scores remain lower than those of the proposed object detector combined with the GM-PHD filter.
During our experiments, we determined that SORT performs better in tracking cases with linear constant motions, such as in AoI 1 and AoI 2. In fact, SORT obtained better results than the GM-PHD and GLMB filters for AoI 2 when applied to the output of our proposed detector. However, SORT had difficulty adapting to high-speed tracks, as in AoI 3. Figure 8 shows the incomplete track trajectories obtained by applying SORT to the outputs of Faster RCNN.
Finally, our modified GM-PHD filter achieves tracking performance similar to that of the GLMB filter. The GLMB tracker slightly outperforms the modified GM-PHD filter in most tracking scores in all three AoIs. This is an expected result, as the GLMB tracker shares the RFS framework with the GM-PHD filter but has been extended to jointly estimate object states and tracks. Nevertheless, the GLMB filter retrieves tracks at the cost of a high computational burden. In fact, the efficient implementation of the GLMB filter (Vo et al., 2017) relies on a pre-processing PHD filter lookup step and a Gibbs sampler step to perform joint prediction and update. Vo et al. (2017) explain that the efficient GLMB filter has a complexity of O(P^2 M), where P denotes the number of hypotheses and M the number of measurements. On the other hand, our proposed GM-PHD filter has a linear complexity of O(PM). Additionally, we present sample computational times using the default GM-PHD (O(PM)) filter and the default GLMB (O(P^2 M)) filter implemented in Matlab by Vo et al. (2017). Table 7 shows that the default GLMB filter is on average 4.77 times slower than the default GM-PHD filter. While our implementation of the GM-PHD filter obtains slightly lower tracking scores, it presents a considerable advantage in terms of computational demands. This advantage is particularly important for onboard applications, where robust online tracking algorithms are preferred.
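The O(PM) cost can be seen in the structure of the Gaussian-mixture PHD measurement update, sketched here for a one-dimensional state. This is a simplified illustration rather than the modified filter proposed in this paper: it assumes a scalar state observed directly with additive Gaussian noise, a constant detection probability, and a constant clutter intensity, and it omits prediction, birth, pruning, and label propagation. All names and default values are hypothetical.

```python
import math
from typing import List, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance)


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a scalar Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gm_phd_update(
    predicted: List[Component],
    measurements: List[float],
    p_detect: float = 0.9,   # assumed constant detection probability
    meas_var: float = 1.0,   # assumed measurement noise variance
    clutter: float = 1e-3,   # assumed constant clutter intensity (> 0)
) -> List[Component]:
    """One GM-PHD measurement update; returns P + P*M components."""
    # Missed-detection terms: every predicted component survives, down-weighted.
    updated = [((1.0 - p_detect) * w, m, v) for w, m, v in predicted]
    # Detection terms: one Kalman update per (measurement, component) pair,
    # which is the P*M double loop behind the O(PM) complexity.
    for z in measurements:
        candidates = []
        for w, m, v in predicted:
            innovation_var = v + meas_var
            gain = v / innovation_var
            likelihood = gaussian_pdf(z, m, innovation_var)
            candidates.append(
                (p_detect * w * likelihood, m + gain * (z - m), (1.0 - gain) * v)
            )
        # Normalize by clutter plus the total likelihood mass of this measurement.
        norm = clutter + sum(c[0] for c in candidates)
        updated.extend((w / norm, m, v) for w, m, v in candidates)
    return updated


predicted = [(1.0, 0.0, 1.0), (1.0, 10.0, 1.0)]
posterior = gm_phd_update(predicted, [0.1, 9.8, 5.0])
expected_targets = sum(w for w, _, _ in posterior)
```

The sum of the posterior weights estimates the number of targets, and each measurement contributes at most one unit of weight, which is what keeps clutter measurements from creating full-weight tracks.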

CONCLUSION AND FUTURE WORK
In this paper, we presented an improved track-by-detection approach in which we use motion information together with neural networks to detect small moving objects in satellite images. Additionally, we perform tracking by using a modified version of the GM-PHD filter. Our version of the GM-PHD filter uses a measurement-driven birth intensity approximation and label propagation in time. We present results for three AoIs in a challenging dataset, where our approach not only outperforms competing detection and tracking algorithms but also detects objects not labeled by the ground truth annotations.
While our method performs detection and tracking, it still requires several improvements. For example, our approach still misses several objects at the sub-pixel level that appear and disappear. This drawback could be addressed by feeding the tracking information back into the object detection in order to obtain a unified track-and-detection approach.

FIGURE 1 | Jilin-1 satellite image with provided annotations. Each colored box represents a target instance.
where $x_k^j$ denotes the $j$th target state vector at time $k$, and $N_k$ denotes the cardinality of $X_k$. Similarly, the measurements at frame $k$ are defined by the RFS $Z_k = \{z_k^1, z_k^2, \ldots, z_k^{M_k}\}$, where $M_k$ denotes the cardinality of the measurement RFS at time $k$. Our objective is to model the multitarget state posterior of $X_k$ given all the previous measurements $Z_{1:k}$; namely, we aim to find $p_{k|1:k}(X_k \mid Z_{1:k})$.
−1 are the weight, mean, and covariance for each GM component in the posterior distribution at time k − 1.

TABLE 2 | Average F1 scores for different patch sizes.

TABLE 4 | Tracking Metrics for AoI 1. *Denotes ground truth measurements used for calibration and filter-only testing.

TABLE 5 | Tracking Metrics for AoI 2. *Denotes ground truth measurements used for calibration and filter-only testing.
Sample Object Tracking. The square denotes the object's current location and the line the object's past locations. First column: ground truth marks. Second column: Faster RCNN (Ren et al., 2015) and SORT (Bewley et al., 2016). Third column: proposed tracking algorithm.

TABLE 7 | Computing times for modified GM-PHD and GLMB filters.