A Deep Learning Approach to Maritime
Vessel Detection
University of Turku
Department of Computing
Master of Science Thesis
Computer Science
May 2024
Aleksi Kangas
Supervisors:
Luca Zelioli
The originality of this thesis has been checked in accordance with the University of Turku quality assurance
system using the Turnitin OriginalityCheck service.
UNIVERSITY OF TURKU
Department of Computing
Aleksi Kangas: A Deep Learning Approach to Maritime Vessel Detection
Master of Science Thesis, 63 p.
Computer Science
May 2024
The detection of maritime vessels is a fundamental task in maritime surveillance,
and it is essential for various applications such as maritime traffic monitoring, search
and rescue, and maritime security. Modern maritime surveillance systems rely on
computer vision and deep learning techniques to detect and track maritime vessels
in real-time. Maritime vessel detection from images is a challenging task due to
various factors such as occlusions, varying illumination conditions, and long-range
distances.
The focus of this thesis is on researching and experimenting with modern object
detector architectures, backbone networks and different maritime datasets in order
to understand the effects of different factors on the performance of maritime vessel
detection. Using transfer learning, in total 6 different object detectors are trained
and evaluated on 2 different maritime datasets. Used architectures include one-stage
and two-stage object detectors. Experimentation is performed on consumer grade
hardware.
The results of the experiment show that it is viable to develop a maritime vessel
detection system using transfer learning and modern object detector architectures,
even on consumer grade hardware. Quantitatively, the chosen one-stage architecture
outperformed the chosen two-stage architecture with equivalent backbone networks,
although the performance with both architectures were satisfactory. The qualitative
results show that the major challenges in maritime vessel detection are related to
environmental factors such as varying illumination conditions, long-range distances,
and occlusions. Thus, additional research is needed to develop more robust maritime
vessel detection systems. Techniques such as sensor fusion have been shown and
could be used to improve the performance of maritime vessel detection systems,
especially in challenging environmental conditions.
Keywords: vessel detection, object detection, maritime, transfer learning, computer
vision
TURUN YLIOPISTO
Tietotekniikan laitos
Aleksi Kangas: A Deep Learning Approach to Maritime Vessel Detection
Pro gradu -tutkielma, 63 s.
Tietojenkäsittelytieteet
Toukokuu 2024
Meriliikenteen laivojen ja alusten havaitseminen on keskeinen haaste merivalvonnas-
sa. Luotettava alusten havaitseminen on erityisen tärkeää erilaisissa sovelluskohteis-
sa, kuten meriliikenteen seurannassa ja turvallisuudessa sekä etsintä- ja pelastus-
toimissa. Nykyaikaiset automaattiset merivalvontajärjestelmät hyödyntävät kone-
näköön ja syväoppimiseen perustuvia menetelmiä alusten havaitsemiseksi ja seuraa-
miseksi reaaliajassa. Meriliikenteen havaitseminen kuvista on haastavaa erilaisten
tekijöiden, kuten vaihtelevien valaistusolosuhteiden, pitkien etäisyyksien ja esteiden
vuoksi.
Tämän opinnäytetyön painopisteenä on nykyaikaisten kohteenhavaitsemisarkkiteh-
tuurien, runkoneuroverkkojen ja erilaisten meriliikenteen tietoaineistojen tutkiminen
ja kokeilu, ymmärtääksemme eri tekijöiden vaikutukset meriliikenteen alusten auto-
maattiseen havaitsemiskykyyn. Siirto-oppimista hyödyntäen, yhteensä kuusi erilais-
ta kohteenhavaitsemismallia koulutetaan ja arvioidaan kahdella erilaisella merilii-
kenteen tietoaineistolla. Käytetyt kohteenhavaitsemisarkkitehtuurit sisältävät yksi-
ja kaksivaiheisia kohteenhavaitsemismalleja. Opinnäytetyön keskeinen koeasetelma
suoritetaan kuluttajatason laitteistolla.
Kokeen tulokset osoittavat, että meriliikenteen alusten havaitsemisjärjestelmän ke-
hittäminen on toteuttamiskelpoista, jopa kuluttajatason laitteistolla, hyödyntäen
siirto-oppimista ja nykyaikaisia kohteenhavaitsemisarkkitehtuureja. Määrälliset tu-
lokset osoittavat, että valittu yksivaiheinen arkkitehtuuri suoriutui paremmin kuin
valittu kaksivaiheinen arkkitehtuuri vastaavilla runkoneuroverkoilla, vaikka suoritus-
kyky molemmilla arkkitehtuureilla oli tyydyttävä. Laadulliset tulokset osoittavat,
että suurimmat haasteet meriliikenteen alusten havaitsemisessa liittyvät ympäristö-
tekijöihin, kuten vaihteleviin valaistusolosuhteisiin, pitkiin etäisyyksiin ja esteisiin.
Näiden haasteiden vuoksi tarvitaan lisätutkimuksia kehittämään tehokkaita meri-
liikenteen laivojen ja alusten havaitsemisjärjestelmiä. Lisäksi erilaisia tekniikoita,
kuten sensorifuusiota, on kirjallisuudessa ehdotettu hyödynnettäväksi havaitsemis-
järjestelmien suorituskyvyn parantamiseksi, erityisesti haastavien ympäristöolosuh-
teiden vallitessa.
Asiasanat: aluksen havaitseminen, kohteen havaitseminen, merenkulku, siirto-oppiminen,
konenäkö
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Literature Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Maritime Situational Awareness . . . . . . . . . . . . . . . . . 3
1.2.2 Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Transfer Learning in Object Detection . . . . . . . . . . . . . 6
1.2.5 Vessel Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 COCO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 ABOships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 SeaShips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Neural Networks 13
2.1 Neural Network Architecture . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Feed-Forward Neural Networks . . . . . . . . . . . . . . . . . 16
i
2.1.3 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Training Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Optimization Problem . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Object Detection 28
3.1 Two-Stage Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Faster R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 One-Stage Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Single-Shot Multibox Detector . . . . . . . . . . . . . . . . . . 32
3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Feature Pyramid Network . . . . . . . . . . . . . . . . . . . . 35
3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Intersection over Union . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 COCO Detection Evaluation Metrics . . . . . . . . . . . . . . 39
4 Experiment 40
4.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 COCO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.2 ABOships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.3 SeaShips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
ii
4.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 Experiment Results 49
5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1.1 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1.2 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2.1 Detector Architecture . . . . . . . . . . . . . . . . . . . . . . . 53
5.2.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2.3 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.1 Vessel Size and Distance . . . . . . . . . . . . . . . . . . . . . 55
5.3.2 Occlusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.3 Environmental Conditions . . . . . . . . . . . . . . . . . . . . 58
5.4 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 59
6 Conclusion 62
References 64
iii
1 Introduction
1.1 Motivation
A deep understanding of the maritime environment is essential for a wide range of
applications, including maritime traffic monitoring, surveillance and military oper-
ations. The maritime environment is complex and challenging, as it is affected by
weather, lighting conditions and occlusions. Detecting vessels, i.e. what kind of
ships or boats are present and where they are, is a crucial part of developing Situa-
tional Awareness (SA) in the maritime context. SA has been traditionally defined as
a three-level process: perception of the environment, comprehension of the situation
and prediction of the future state [1].
There are many ways to achieve perception of the environment in the maritime
context, such as using radars, cameras, Automatic Identification System (AIS) and
other sensors. For example, in [2] the authors have developed a novel detector based
on Convolutional Neural Networks (CNNs) for multiscale Synthetic Aperture Radar
(SAR) ship detection. In this thesis, the focus is on using traditional RGB cameras,
i.e. using Computer Vision (CV) techniques to detect vessels from images.
Once the perception of the environment has been achieved, the next challenge is
in understanding the situational context, which is crucial for the safety and efficiency
of the operations, especially in port areas. Comprehension of the situation can be
viewed as a traditional Pattern Recognition (PR) problem, where the goal is to
1.1 MOTIVATION 2
process the sensor information and uncover the patterns and regularities in the
data. Techniques like object detection [3], image classification [4] and sensor fusion
[5] are generally used to achieve comprehension of the situation. This thesis focuses
on the comprehension of the situation by developing a vessel detection system using
object detection techniques.
In practice, Artificial Intelligence (AI) and Machine Learning (ML) methods are
the standard tools for developing predictive models for modern object detection.
In vessel detection, Neural Networks (NN) and Deep Learning (DL) methods are
widely used. For example, in [6], the authors have published a collection of papers on
the topic of remote sensing in vessel detection and navigation, containing papers on
subtopics such as "Convolutional Neural Networks for Detection and Classification
of Vessels" and "Object Detections by Different Sensors". Later maritime vessel
detection research has also focused on Sensor Fusion (SF) techniques, such as fusing
RGB and IR images [7] for better detections in challenging situations.
1.1.1 Problem Statement
For maritime traffic monitoring and safety purposes, it is essential to develop a robust
vessel detection system that can detect vessels from images automatically. Such
an automatic detection system would be very useful for port authorities, military
organizations and other operators in monitoring and estimating the traffic flow both
in ports and waterways and in open waters. This thesis aims to investigate the
feasibility of developing a robust vessel detection system using publicly available data
and models, and to expand the knowledge of the factors affecting the performance
of the object detection in the maritime context.
1.2 LITERATURE OVERVIEW 3
1.1.2 Research Questions
The research questions of this thesis are centered around the feasibility of developing
a robust vessel detection system using publicly available maritime datasets and pre-
trained object detection models, which are run on consumer-grade hardware. More
specifically, the thesis seeks to answer the following research questions:
1. How does the vessel detection performance differ between the two object de-
tector architectures, one-stage SSD FPN (3.2.1) and two-stage Faster R-CNN
(3.1.1)?
2. How does the performance of the object detectors differ when fine-tuned on
the maritime datasets, ABOships (1.3.2 & 4.2.2) and SeaShips (1.3.3 & 4.2.3)?
3. How do different ResNet (3.3.1) feature extractors sizes affect the performance
of SSD FPN and Faster R-CNN object detectors?
1.2 Literature Overview
This section briefly introduces the basic concepts of maritime situational awareness,
pattern recognition, machine learning, transfer learning and vessel detection.
1.2.1 Maritime Situational Awareness
Situational Awareness (SA), defined as "being aware of your surroundings" [8], is
a key concept in contexts like aviation, military and maritime. It is extremely
important in military contexts to be able to identify and track vessels, but also
in civilian contexts, such as maritime traffic monitoring and surveillance. Systems
like Vessel Traffic Services (VTS) and River Information Services (RIS) serve as
information providers for maritime SA in ports and waterways [9]. Van den Broek
et al. [10] have presented a framework for maritime SA by combining sensor data
1.2 LITERATURE OVERVIEW 4
with intelligence and context information. Vessel detection based on object detection
techniques can be used as part of the observable processing chain.
1.2.2 Pattern Recognition
Large amounts of data are being collected all around us, purchasing habits of cus-
tomers, medical records of patients, images and videos from surveillance cameras,
and much more. Pattern Recognition (PR) refers to a process of finding regular-
ity structures and patterns in data. Moreover, Bishop [11, p. 1] defines PR as a
field "concerned with the automatic discovery of regularities in data through the use
of computer algorithms and with the use of these regularities to take actions such
as classifying the data into different categories". Figure 1.1 displays a simplified
overview of the PR process.
Figure 1.1: A simplified overview of the Pattern Recognition process. Notably,
automated feature extraction via machine learning methods or even end-to-end deep
learning have become extremely useful in practice, adapted from [12].
In Computer Vision (CV), PR tasks traditionally include image classification,
object detection, image segmentation, image captioning, etc. The Convolutional
Neural Network (CNN) architecture has been successfully used in many CV tasks,
like image classification, semantic segmentation and importantly object detection
[12], which is in the focus of this thesis.
1.2 LITERATURE OVERVIEW 5
Different PR problems could be solved by using hand-crafted rules and heuristics,
i.e. by manual feature extraction, but such an approach is often infeasible and results
in poor performance in practice [11]. Instead, Machine Learning provides a more
general and automated approach to solving PR problems.
1.2.3 Machine Learning
Machine Learning (ML) allows computers to learn patterns from data without hand-
crafted feature extraction. Deep Learning (DL) is a subfield of ML. Many common
tasks in computing can be solved by using ML, such as regression (predicting a
continuous value), classification (predicting a discrete value), machine translation,
anomaly detection and much more. [13, p. 98-103]
ML methods can be grouped into unsupervised and supervised learning algo-
rithms. In unsupervised learning, the algorithm learns patterns from unlabelled
data. For example, clustering of data into groups of similar data points or learning
the entire probability distribution of the data are unsupervised learning tasks. [13,
p. 98-105]
In supervised ML algorithms, the algorithm learns patterns from labelled data,
i.e. data where each sample is also associated with a label [13, p. 105-106]. Object
(vessel) detection is a supervised learning task, where the algorithm learns to detect
objects from images, using a set of images with known objects and their locations
in the images.
A key challenge in ML is to develop a model that generalizes well on unseen data.
Generally, we evaluate the performance of a model by measuring its performance on
a dataset not used during training. The choice of the performance measure depends
on the task at hand. [13, p. 104]
A common technique for assessing the performance of a model is with Cross-
Validation (CV). CV is a "data resampling method to assess the generalization abil-
1.2 LITERATURE OVERVIEW 6
ity of predictive models and to prevent overfitting" [14]. In general, CV generates
multiple training and testing sets from the original dataset, using random subsam-
pling methods. Different variations of CV exist, such as the simple leave-one-out
and (stratified) k-fold being the most commonly known variations. For example,
in leave-one-out CV, only one sample is used for testing and the rest for training,
and the splitting is applied for all samples, yielding a performance measure for each
sample, which can be averaged to get the final performance measure. [14]
1.2.4 Transfer Learning in Object Detection
DL requires large amounts of labelled data for training, which is often not available
in many domains. This is a problem in many applications, including maritime vessel
detection. Additionally, training a DL model from scratch is computationally expen-
sive. Transfer Learning (TL) is a ML technique that aims to improve performance
in target domain by using the knowledge from a related source domain [15].
TL drastically reduces the amount of data needed for training a model. This
is achieved by using a pre-trained model on a source domain, and then fine-tuning
the model on the target task. [15] Practically, this means that the target model is
initialized with the weights from an existing model, and then the model is trained on
the target domain. Puttemans et al. [16] have shown that TL can be used to build
robust industrial applicable object detection systems even with affordable hardware
and modest amount of target domain data.
TL has been shown to be useful in vessel detection. The main challenge being
the lack of huge vessel datasets, which is a problem shared by many other domain-
specific object detection tasks. TL is an efficient way to use a trained network on
another task, i.e. across domains [17].
Farahnakian et al. [17] have explored the performance of DL based vessel detec-
tion models using TL. Their results show that TL is important for training CNN-
1.2 LITERATURE OVERVIEW 7
based vessel detection models capable of maritime vessel detection.
1.2.5 Vessel Detection
It is challenging to develop a universal, automated and efficient CV-based vessel
detection system due to the complexity of the maritime environment [18]. Different
types of vessels, changing weather conditions and occasional occlusions hinder the
development of such a system.
Dataset diversity is a key factor in developing a robust vessel detection system.
For example, the SeaShips dataset (1.3.3) has been explicitly designed to contain
images with different backgrounds, lighting environments, visible proportions of ves-
sels, and occlusions. Even a large vessel can appear as a tiny object in the image if
it is far away from the camera. Vessels can occlude each other, especially in crowded
areas such as ports.
Wawrzyniak et al. [9] have proposed a detection method using video streams
of existing monitoring system cameras. Even with the reasonably good results, the
authors state that the incorrect detections are mainly due to unfavourable lighting
conditions.
Another obvious challenge is night-time detection, where RGB images are not
sufficient. Infra-Red (IR) cameras, and Sensor Fusion (SF) in general, allow for
better detection in night-time conditions. For example, Farahnakian and Heikko-
nen [7] have explored different multi-modal fusion architectures utilizing RGB and
IR images in vessel detection. Such techniques have been shown to improve the
performance of vessel detection systems.
1.3 DATASETS 8
1.3 Datasets
Building a precise vessel detection model requires a large amount of maritime images.
Images should contain vessels of different sizes and categories, depicted from various
angles and distances, as well as in different weather and lighting conditions. A
dataset is a collection of information (e.g. images, bounding boxes, labels) used for
training and evaluating machine learning models.
1.3.1 COCO
The generic ‘Microsoft Common Objects in Context‘ (COCO) dataset [19] contains
3 146 images of vessels. Originally, the COCO dataset was developed to advance
the state-of-the-art in object recognition by gathering and annotating many images
of common objects in their natural context. In total, the COCO dataset contains
328 000 images, 2.5 million labelled object instances and 91 object categories from
everyday scenes. Examples of COCO images containing vessels are shown in Figure
1.2.
The COCO dataset has 3 146 images and 11 189 instances in the ‘boat‘ category.
It contains images of vessels of all sizes, from small rowboats to large ships. See
Table 1.1 for the amount of images and objects, and the amount of vessel images
and instances in the datasets. According to Iancu et al. [18], the COCO dataset has
more vessel instances than other well-known general object detection datasets, like
‘ImageNet‘ [20]. For maritime object detection, the COCO dataset is a reasonably
good choice as the pre-training dataset, but for precise vessel detection, we need a
proper maritime dataset. However, some specialized maritime datasets have been
published in recent years, such as ‘ABOships‘ and ‘SeaShips‘.
1.3 DATASETS 9
Figure 1.2: Examples of COCO images containing vessels of different sizes and types
in the ‘boat‘ category.
1.3.2 ABOships
The ‘ABOships‘ dataset [18] is a new maritime dataset from 2021 consisting of
inshore and offshore images of vessels from a watercraft traversing in the Turku
(Åbo) region and the Finnish Archipelago. The dataset contains 9880 images of
maritime objects (41967 in total) extracted from a collection of 720p 15 FPS videos
at every 15 seconds (i.e. every 225 frames). Each annotated maritime object belongs
to one of the 11 categories: ‘boat‘, ‘cargoship‘, ‘cruiseship‘, ‘ferry‘, ‘militaryship‘,
‘miscboat‘, ‘miscellaneous‘, ‘motorboat‘, ‘passengership‘, ‘sailboat‘ and ‘seamark‘.
An example image is shown in Figure 1.3. The table 1.1 shows that ABOships
dataset has the most categories, 11 in total, but two of them, ‘miscellaneous‘ and
‘seamark‘, are not vessels of any kind or have not been identified as such.
One of the advantages of the ABOships dataset is the amount of vessel objects
as well as the information about the categories of vessels. The authors have focused
on creating a maritime dataset by taking into account the challenges of maritime
1.3 DATASETS 10
Figure 1.3: An example image from the ABOships dataset. The dataset contains
images of maritime objects from the Finnish Archipelago and the Turku (Åbo)
region.
object detection, such as background variation, atmospheric conditions, occlusions
and vessel scale variations [18]. Thus, the dataset is suitable for the purposes of
investigating transfer learning capabilities of object detection models in maritime
object detection.
1.3.3 SeaShips
The ‘SeaShips‘ dataset [21] is a large-scale maritime dataset from 2018. It con-
sists of images of six common types of vessels: ‘ore carrier‘, ‘bulk cargo carrier‘,
‘general cargo ship‘, ‘container ship‘, ‘fishing boat‘ and ‘passenger ship‘ extracted
from cameras of a coastline video surveillance system. The authors have focused
on selecting images with different occlusions, backgrounds, vessel scales and other
variations. The dataset contains 31 455 images of vessels (40 077 in total), which is
more than the vessel images in the COCO dataset, and ABOships dataset. However,
the publicly available subset [22] of the dataset contains only 7 000 images of the
1.3 DATASETS 11
total 31 455 images. An example image is shown in Figure 1.4. The table 1.1 shows
that SeaShips dataset has the most vessel instances, but less vessel categories and
in general less vessel instances per image than the ABOships dataset.
Figure 1.4: An example image from the SeaShips dataset. The dataset contains
images of six common types of vessels from a coastline video surveillance system.
Compared to the ABOships dataset, the SeaShips dataset tends to have less
vessel instances per image. This is because the SeaShips dataset has been collected
from narrow port areas, whereas the ABOships dataset has been collected from both
the Aura-river and the open waters of the Finnish Archipelago. On the other hand,
the SeaShips dataset has vessel categories focused more on larger vessels, such as ‘ore
carrier‘ and ‘bulk cargo carrier‘. It should be taken into account when discussing
about real world applications of the object detection models. The SeaShips dataset
is suitable for the purposes of investigating the performance of the vessel detection
models.
1.4 THESIS OUTLINE 12
Dataset COCO ABOships SeaShips
Images 328 000 9 880 31 455
Instances 2 500 000 41 967 40 077
Categories 91 11 6
Vessel Images 3 146 7 992 31 455
Vessel Instances 11 189 34 100 40 077
Vessel Categories 1 9 6
Table 1.1: Summary of the amount of images, instances and categories of the
datasets.
1.4 Thesis Outline
This thesis is structured in a way, where the theoretical background is presented
first, followed by the experiment and results. Chapter 1 has introduced the motiva-
tion and problem statement of the thesis, as well a brief literature overview of the
central concepts, and the datasets used in the experiment. Chapter 2 introduces
the basic concepts of (convolutional) neural networks and deep learning necessary
for understanding the feature extractor (e.g. ResNet) and the basis of the object
detection models. Chapter 3 introduces the theory of object detection, transfer
learning and the two object detection models, SSD and Faster R-CNN, used in the
experiment. Chapter 4 states the experiment setup in which the two pre-trained
object detection models are trained and evaluated on the two maritime datasets,
‘ABOships‘ and ‘SeaShips‘. Chapter 5 presents the results of the experiment and
discussion about the results and future work. Chapter 6 summarizes and concludes
the thesis.
2 Neural Networks
Neural Networks (NNs), inspired by the biological neurons of the brain, belong to
the set of ML models. Artificial NNs consist of a network of many simple processing
units, called neurons. Furthermore, the learning capabilities of NNs resemble the
human brain by acquiring knowledge through a learning algorithm and storing the
acquired knowledge into interneuron connection weights. NNs are described as mas-
sively parallel distributed processors and occasionally referred to as neurocomputers.
[23, p. 24]
Humans have an inherent ability to transfer knowledge from one task to another.
In contrast, traditional ML models are usually task-specific, meaning that they are
only capable of solving the task they were trained for. Transfer Learning (TL) is
a ML technique that aims to transfer learned knowledge from one task to another
[24]. NNs include a capability to adapt the weights to the task at hand. A NN
trained for one task environment can be easily adapted to another task environment
by retraining it with a new set of training data [23, p. 25]. The focus of the thesis
is on the TL capabilities of NNs in the context of vessel detection.
2.1 NEURAL NETWORK ARCHITECTURE 14
2.1 Neural Network Architecture
2.1.1 Neuron
Neuron is a fundamental unit of a NN. Highly inspired by the biological neuron of the
human brain, it is a simple processing unit that takes a number of inputs, performs
a computation and outputs a result. Mathematically, a neuron can be described
as a function that takes a vector of inputs x = [x1, x2, ..., xn] and outputs a scalar
value y. The inputs x are both multiplied by learnable weights w = [w1, w2, ..., wn]
and added together to form the logit of the neuron, z =
∑︁n
i=1wixi. Frequently,
a constant bias term b is added to the logit. The output of the neuron is then
computed by applying a function f to the logit, yielding [25, p. 45-47]:
y = f(z) = f(x ·w + b). (2.1)
The vector representation is crucial for the implementation of NNs, as it allows for
efficient computation using matrix operations [25, p. 45-47]. A neuron is illustrated
in Figure 2.1.
Figure 2.1: Illustration of a neuron, with inputs x1, x2, ..., xn, weights w1, w2, ..., wn,
activation function f and output y.
The function f in Equation 2.1 is referred to as the activation function of the
neuron. The original perceptron model developed by Rosenblatt in 1958 used a step
function as the activation function [26]. Several other activation functions have been
proposed since then, and the choice of activation function is an important design
decision when building a NN. Three commonly used non-linear activation functions
2.1 NEURAL NETWORK ARCHITECTURE 15
are sigmoid, softmax and rectified linear unit (ReLU) [25, p. 51-54].
Sigmoid
The sigmoid function is defined as f(z) = 1
1+e−z . The S-shaped curvature of the
function ensures that the output for a small logit is close to 0 and for a large logit
close to 1. [25, p. 51] The output of the sigmoid is saturated for small and large logit
values, meaning that the gradient of the function is close to 0. This introduces the
vanishing gradient problem, where the gradient descent algorithm converges slowly
or not at all. [27]
Softmax
The softmax function is defined as f(zi) = exp(zi)∑︁
j exp(zj)
. Softmax is essentially a gener-
alization of the sigmoid function to multiple dimensions. When we have a discrete
variable with n possible values and we want to represent a probability distribution
over it, we can use the softmax function. The softmax function is commonly used
as the output of a classifier. It is also susceptible to the saturation problem, caused
by the input being extremely negative or positive. [13, p. 184-186]
ReLU
The ReLU (Rectified Linear Unit) [25, p. 52] function is defined as f(z) = max(0, z).
It became a popular activation function after it was successfully used in the AlexNet
architecture by Krizhevsky et al. at the ImageNet Large Scale Visual Recognition
Challenge in 2012. They achieved the winning top-5 test error rate of 15.3% beating
the second-best result 26,2% by using a deep convolutional NN with ReLU activation
functions. The advantage of the ReLU is that it is non-saturating, which means that
the vanishing gradient problem is alleviated. [28]
2.1 NEURAL NETWORK ARCHITECTURE 16
2.1.2 Feed-Forward Neural Networks
The idea of a NN is to connect multiple neurons together in a way that allows com-
plex computations. The first NNs date back to the 1800s, but the basis of modern
NNs was laid roughly in the 1950s and 1960s [29]. Rosenblatt proposed in 1958
[26] and 1962 [30] the concept of perceptron model and the general architecture of a
multi-layer perceptron. Fundamentally, a single neuron has more expressive power
than a linear perceptron, as it is capable of performing non-linear computations,
provided that the activation function is non-linear. However, even a single neuron
with a non-linear activation function is not capable of solving complex problems.
[25, p. 47-51]
A modern NN is a collection of neurons organized into layers. In a NN, each
neuron is connected to at least one other neuron, with the connections between the
neurons being expressed with a numerical weight coefficient. A multi-layer feed-
forward NN contains an input layer, an output layer, and one or more hidden layers
in between. In a fully-connected feed-forward NN, every neuron within a layer is
linked to every neuron in the subsequent layer of the network. [31]
Thus, the neurons within a layer of a fully-connected feed-forward neural net-
work can be rearranged arbitrarily without affecting the final output of the network
[25, p. 97]. Feed-forward NNs only contain forward connections, i.e. no recurrent
connections or loops [25, p. 49, 208]. A typical fully-connected feed-forward NN is
illustrated in Figure 2.2.
The input layer is the first layer of the NN, and in charge of taking the input of
the NN in a vectorized form. For example, the input values could be the pixel values
of an image, in which case the number of input neurons would equal the pixels in
the image [25, p. 50].
The last layer of the NN is the output layer, which is responsible for outputting
the result in a form that is suitable for the task at hand. [25, p. 50] For instance, a
2.1 NEURAL NETWORK ARCHITECTURE 17
Figure 2.2: The structure of a typical fully-connected feed-forward neural network.
Illustrated is only a single hidden layer, but the network may contain any number
of hidden layers.
classification task with k classes would have k output neurons, where each neuron
represents the probability of the input being of the corresponding class.
The rest of the layers between the input and output layers are commonly referred
to as hidden layers [25]. The hidden layers are the basis of the expressive power of
NNs, since they allow the NN to learn complex non-linear relationships between the
input and output. Theoretical results suggest that a single hidden layer NN can
represent any function, given a sufficiently large number of neurons [32].
In theory, hidden layers may contain any number of neurons. However, the
number of neurons in each hidden layer is usually lower than the number of input
neurons, allowing the NN to learn a compressed representation of the input data
[25, p. 50]. For the NN to be able to solve non-linear problems (e.g. the simple
exclusive-or (XOR) problem), it must contain at least one hidden layer using non-
linear activation functions [13, p. 172]. Activation functions have been discussed in
more detail in Section 2.1.1.
2.2 TRAINING NEURAL NETWORKS 18
2.1.3 Deep Neural Networks
Formally, feed-forward NNs can be described as collections of functions f connected
in a chain-like manner. For example, we can represent a NN of 3 layers with functions
f (1), f (2), f (3) (where f (i) denotes the ith layer) chained together to yield f(x) =
f (3)(f (2)(f (1)(x))). The length of the function chain, i.e. the amount of layers, is
referred to as the depth of the network. [13, p. 168] Figure 2.2 illustrates a feed-
forward NN with 3 layers.
A Deep Neural Networks (DNNs) stacks many hidden layers on top of each other
[32, p. 275]. Deep Learning (DL) refers to the class of ML models that utilize many-
layered neural networks. In contrast to traditional ML methods, the benefit of DL is
end-to-end training, which means that the entire model is trained at once, including
automatic feature extraction and engineering, which are usually done manually in
traditional ML methods. [32, p. 27-28]
DL has become popular during the recent decades, achieving breakthrough per-
formance in many tasks, such as speech recognition, face detection and image clas-
sification. Traditionally, object detection has been performed with hand-crafted
features and shallow trainable architectures. DL methods enable learning of seman-
tic, high-level and deeper features from the data. These abilities have been shown
to improve the performance in object detection. DL object detection methods have
been and are being actively researched. [33]
2.2 Training Neural Networks
NNs are usually trained with gradient-based optimization algorithms, which itera-
tively update the weights of the network to minimize a loss function. The general
approach to training of NNs does not differ much from the training of traditional
ML models. [13, p. 177] The basic concepts of training NNs and relevant intricacies
2.2 TRAINING NEURAL NETWORKS 19
are discussed in this section.
2.2.1 Optimization Problem
Optimization means either minimizing or maximizing a function f(x) by choosing
a suitable value for x. Learning from data, i.e. fitting a model to the data, is a
fundamental optimization problem in ML. Most commonly, we tend to minimize a
loss function. [13, p. 82]
Let w be the weights of the NN and b be the bias. The goal of the optimization
problem is to find parameters θ = (w∗, b∗) which yield the minimal total loss,
formally [32, p. 85]:
w∗, b∗ = argmin
w,b
L(w, b). (2.2)
Loss Function
For measuring how the NN performs, a loss function is used. Traditionally, the loss
is a non-negative real value where smaller values indicate a better fit [32, p. 84].
Different loss functions are used for different tasks [34].
A common loss function for classification tasks is the cross-entropy loss. For any
given example y and prediction yˆ, the cross-entropy loss is defined as [32, p. 131]:
L(y, yˆ) = −
q∑︂
j=1
yj log yˆj (2.3)
Cross-entropy loss can be thought to both maximize the likelihood of the ob-
served data and to minimize the information (number of bits) required for commu-
nicating the labels [32, p. 134]. The derivation of cross-entropy loss is based on
information theory and the proof is omitted here. More about relevant information
theory can be found in for example [35].
A common loss function for regression problems is the smooth L1 loss, defined
2.2 TRAINING NEURAL NETWORKS 20
as [36]:
L(y, yˆ) =
∑︂
i
⎧⎪⎪⎨⎪⎪⎩
0.5(yˆ(i) − y(i))2 if |yˆ(i) − y(i)| < 1
|yˆ(i) − y(i)| − 0.5 otherwise
(2.4)
where yˆ(i) is the prediction and y(i) is the true value for example i.
Back-Propagation
Optimization problems are commonly tackled with gradient descent algorithms.
These algorithms rely on the gradient of the loss function ∇w,bL(w, b) to estab-
lish how to change the parameters to minimize the loss function L, i.e. traverse in
the direction of the steepest descent. Mathematically, the gradient of a function f
is defined as a vector of partial derivatives ∂
∂xi
f(x) which measure the change in f
with respect to each of the input variables xi. [13, p. 82-86]
Feed-forward NNs take an input x and compute an output yˆ. The input x is
transferred forward in the network until the output yˆ is computed along with a
scalar loss value L(w, b). This step is called forward propagation. [13, p. 204]
In contrast, back-propagation allows "the information from the cost to then flow
backwards through the network, in order to compute the gradient" with respect to the
parameters of the network. The optimization algorithm uses the computed gradient
value to update the parameters of the network. [13, p. 204]
The back-propagation algorithm relies on the chain rule of calculus. Formally, let
y = f(g(x)). Additionally, let the functions y = f(u) and u = g(x) be differentiable.
The chain rule is [32, p. 59]:
dy
dx
=
dy
du
· du
dx
. (2.5)
The chain rule allows to mathematically determine the gradient of the loss func-
tion, according to each parameter of the network. Suppose a chain x = f(w), y =
2.2 TRAINING NEURAL NETWORKS 21
f(x), z = f(y) of functions, where w is the parameter of interest. To compute ∂z
∂w
,
we can use the chain rule as follows [13, p. 211]:
∂z
∂w
=
∂z
∂y
· ∂y
∂x
· ∂x
∂w
(2.6)
2.2.2 Optimizers
When the gradient of the loss function is zero, the algorithm has reached a critical
point, which is either a local minimum, local maximum or a saddle point. A critical
point which yields the lowest value of the loss function is called the global minimum.
Optimization in the field of DL is challenging, because the loss functions may have
many local minima. [13, p. 82-86] The challenges in gradient-based optimization
are illustrated in Figure 2.3.
Figure 2.3: Challenges in gradient based optimization are many. Saddle points and
poor local minima are obstacles when trying to find the global minimum.
Instead of finding the global minimum, we settle to finding a local minimum
with a sufficiently low loss value [13, p. 82-86]. Many different optimizers based
on gradient descent have been proposed ([37], [38], [39], [40], . . . ), and a few are
presented in this section.
2.2 TRAINING NEURAL NETWORKS 22
Stochastic Gradient Descent
Traditionally, the whole dataset is used to compute the gradient value of the loss
function. This obviously does not scale well to large datasets. Since the gradient is
an expectation, we can approximate it by using a small subset of the training data,
a minibatch B = {x(1), ...,x(m′)}. The estimate of the gradient is then [13, p. 152]:
g =
1
m′
∇θ
m′∑︂
i=1
L(x(i), y(i),θ). (2.7)
And the stochastic gradient descent (SGD) update rule is [13, p. 152]:
θ ← θ − ϵg (2.8)
where ϵ is the learning rate.
SGD is a popular optimization algorithm for neural networks, because it scales
well to large datasets [13, p. 153]. Despite the popularity of SGD, it has some
drawbacks. Training with SGD may be slow because of noisy and small gradients
introducing variance to the learning process [13, p. 296-297].
Momentum Optimization
Momentum optimization, originally presented by Polyak in 1964 [37], is a modifica-
tion to the SGD algorithm that accelerates the convergence. It relies on the concept
of momentum, which in physics is defined as the product of mass and velocity. In
the context of optimization, a unit mass is assumed, and the momentum is defined
as the velocity that accumulates a moving average of the previous gradients, which
is decayed exponentially. This alleviates the problem of variance caused by noisy
gradients in SGD. [13, p. 296]
Formally, let velocity be v and α to be a hyperparameter for controlling exponen-
tial decay of the previous gradients. Then the momentum optimization algorithm
2.3 CONVOLUTIONAL NEURAL NETWORKS 23
is defined as: [13, p. 296]
v← αv − ϵg
θ ← θ + v
(2.9)
2.3 Convolutional Neural Networks
In computer vision tasks, feature extraction has always been an important research
topic. Traditionally, feature extraction has been based on pre-designed features
derived from statistical regularities or prior knowledge [41]. While DL allows for
automatic feature extraction, it is not feasible to use fully-connected NNs for image
data, because of the huge amount of parameters needed.
A neuron in the hidden layer of a fully-connected NN would have 784 parameters
when a 28×28 black-white image is used as input. With larger images, e.g. 200×200
image with 3 colour channels, the number of parameters in the input layer alone
would be 200× 200× 3 = 120, 000. [25, p. 121-122]
Clearly, this is not a feasible approach. Inspired by human vision, Convolutional
Neural Networks (CNNs) reduce the number of parameters drastically with an archi-
tecture suited for image data. [25, p. 121-122] CNNs provide an end-to-end learning
framework for feature extraction [41].
2.3.1 Convolution
The core of CNNs is the convolution operation, where a filter or kernel is multiplied
over the entire area of an input image [25, p. 123]. Given two real-valued functions
x and w, the convolution operation is defined as s(t) = (x ∗ w)(t) [13, p. 331].
Since images are two-dimensional, two-dimensional convolution kernel is used and
2.3 CONVOLUTIONAL NEURAL NETWORKS 24
the convolution operation is defined as [13, p. 332]:
S(i, j) = (I ∗K)(i, j) =
∑︂
m
∑︂
n
I(m,n)K(i−m, j − n) (2.10)
where I is the input image, K is the kernel and S is the output. An example of a
convolution operation is shown in Figure 2.4.
Figure 2.4: Simplified example of a convolution operation. The input image is 3× 3
and the kernel is 2 × 2. The kernel is applied to only the valid positions without
padding.
Convolution being commutative, the kernel can be flipped, and the convolution
operation can be written as [13, p. 332]:
S(i, j) = (I ∗K)(i, j) =
∑︂
m
∑︂
n
I(i+m, j + n)K(m,n) (2.11)
which is referred to as cross-correlation, and is the operation used in practice for
implementation reasons [13, p. 333].
If the input image shape is nh × nw and the kernel is kh × kw, the shape of the
convolution output is (nh − kh + 1)× (nw − kw + 1) [32, p. 254]. Thus, the output
shape is smaller than the input shape, assuming that the kernel is larger than 1× 1.
Two techniques, padding and stride, are used to control the output shape of the
convolution.
2.3 CONVOLUTIONAL NEURAL NETWORKS 25
Padding
When applying convolution, by default the kernel is centred at each pixel of the input
image [32, p. 247]. Pixels at the edges of the input image are used less than the
pixels at the centre of the input image. Successive convolution operations obliterate
the information at the edges of the image due to the reduction in the image size. To
preserve the output shape, the effective size of the input image can be increased by
padding the input image with zeros. In many cases, using a padding of ph = kh − 1
and pw = kw − 1 is sufficient, which will result in an output shape matching the
input shape. [32, p. 254-255]
Stride
The stride refers to the number of rows and columns that the kernel moves at each
step. In some scenarios, we may want to downsample the input image for e.g.
computational efficiency reasons. This can be achieved by using a stride larger than
1, which results in skipping intermediate locations yielding a smaller output shape.
A stride of sh and sw will result in an output shape of ⌊(nh − kh + ph + sh)/sh⌋ ×
⌊(nw − kw + pw + sw)/sw⌋. [32, p. 256-257]
2.3.2 Pooling
Another core concept of CNNs is pooling. Goodfellow et al. [13, p. 339] define
pooling as a function which "replaces the output of the net at a certain location with
a summary statistic of the nearby outputs". The main purpose of pooling is to make
the "representation approximately invariant to small translations of the input" [13,
p. 342]. A small translation in the input image should not result in a large change
in the output of the pooling layer. Deciding if an object (e.g. face) is present in an
image should not depend on the location of the object in the image. [13, p. 342]
Commonly used pooling functions are maximum pooling and average pooling.
2.3 CONVOLUTIONAL NEURAL NETWORKS 26
Maximum pooling obtains the maximum value of the input in the pooling area,
while average pooling obtains the average value of the input in the pooling area.
[32, p. 264]
In addition to the local translation invariance property, pooling also reduces the
computational burden of the network. This is because pooling region summarizes the
values in the region. Spacing these pooling regions k pixels apart reduces the inputs
to the next layer by a factor of k. This results in reduced memory consumption and
improved statistical efficiency. Pooling also allows to handle inputs of varying sizes,
which is a useful property for image data. [13, p. 342]
2.3.3 Architecture
A ’convolutional layer’ consists of a sequence of operations – convolution, nonlin-
earity and pooling [13, p. 341]. The input image is processed by k different kernels,
representing the weights and connections in the CNN. These convolution (2.3.1)
kernels produce a feature map for each kernel. Finally, the feature maps are pooled
(2.3.2) to reduce the dimensionality of the feature maps. [25, p. 122-132]
CNNs are typically composed of stacks of relatively complex convolutional layers
[13, p. 341]. An example of a CNN architecture is shown in Figure 2.5. Because
pooling is inherently a destructive operation, we tend to stack multiple convolutional
layers before pooling in deeper CNNs [25, p. 132]. Convolutional layers are typically
followed by fully-connected layers, which are used to map the high-level features to
the desired output. Fully-connected layers can compress the flattened feature maps
into a vector of desired length. [25, p. 135]
In a traditional fully-connected neural-network, each output unit interacts with
all the input units. However, CNNs use sparse interactions, i.e. each output unit
interacts with only a subset of the input units, which is achieved with a tiny kernel
compared to the input size. Fewer connections result in fewer parameters, which
2.3 CONVOLUTIONAL NEURAL NETWORKS 27
Figure 2.5: An example of a generic CNN architecture, adapted from [25, p. 133].
reduces the memory requirements and improves statistical efficiency. In deeper
CNNs, the deeper layers can ’see’ a large area of the input image indirectly through
the layers, allowing it to efficiently learn complex representations with relatively few
parameters. [13, p. 335]
Another key property of CNNs is parameter sharing. Traditionally, each weight
in a fully-connected neural-network is used only once when computing the output.
However, in CNNs, each kernel weight is applied to all positions of the input im-
age (except for the edges). Thus, the same kernel parameter is used for multiple
computations, and only a set of parameters is learned. [13, p. 335-338]
3 Object Detection
Vision is a task that is natural for humans, yet difficult for computers and machines.
Computer Vision (CV) is and has been an active research field, and lately especially
important with the rise of DL. As a field, CV is very broad as it includes diverse
ways of processing images and widely different application possibilities. [13, p. 452]
In traditional image classification, the task is to classify an image with a category,
based on the key or major object in the image. However, in many applications there
are multiple objects in an image, requiring the computer to know both the categories
and the locations of the objects in the image. This task is called object detection.
Spatial location of an object is usually represented with a bounding box, a rectangle
commonly defined by the x and y coordinates of the top-left corner and the bottom-
right corner. [32, p. 629-630]
Modern Object Detection
Object detection pipeline consists of three stages, informative region selection, fea-
ture extraction, and classification [33]. Applying traditional recognition algorithms
in image recognition can be too slow and inaccurate to be practical. Constructing
special-purpose detectors, that are able to find likely locations of objects in an image
quickly, is a more effective approach for object detection. [42, p. 295] Modern ob-
ject detectors rely on CNNs, which considerably reduce the amount of computation
needed in comparison to traditional recognition algorithms [17]. In the DL era of
3.1 TWO-STAGE DETECTORS 29
object detection, the detectors are roughly categorized into two groups: two-stage
(Section 3.1) and one-stage (Section 3.2) detectors [3]. Modern object detectors use
a backbone network, e.g. ResNet (Section 3.3.1), for feature extraction.
3.1 Two-Stage Detectors
Two-stage detectors frame the object detection as a two-stage "coarse-to-fine" pro-
cess [3]. The first stage of a two-stage detector is to propose a set of plausible
rectangular regions in the image, called region proposals. A classifier is then applied
to each region proposal for determination of object presence, and for refinement of
the bounding box of the detected object. [42, p. 304].
R-CNN
As deep CNNs proved out to be very effective in learning high-level feature repre-
sentations of an image [3], in 2014 Gircshick et al. a proposed a two-stage object
detector called R-CNN (Region-based CNN) [43]. R-CNN extracts region proposals
from an image using a selective search [44] algorithm, which are then fed into a
CNN to extract features [3]. At the end, one linear SVM (Support Vector Machine)
is trained for each object category for classification. At the time, R-CNN yielded
a substantial 30% relative improvement over the existing models on the PASCAL
VOC 2012 object detection challenge. [43]
Fast R-CNN
Fast R-CNN [36], proposed a year later by Girshick, is an improved version of R-
CNN [43], which includes innovative improvements yielding a 9× improvement in
training time of VGG16 network and a 213× improvement in inference time over
R-CNN, while also improving the detection accuracy. The key innovation of Fast
R-CNN is the Region of Interest (RoI) pooling layer. First, a CNN processes the
3.1 TWO-STAGE DETECTORS 30
entire image to produce a convolutional feature map. Using the RoI pooling layer,
the feature map allows the extraction of a fixed-length feature vector for each region
proposal. [36]
Max-pooling is applied to each valid RoI to convert the features into a small
feature map with H×W (e.g. 7×7) shape. The feature extraction can be done in a
single forward pass of the CNN, allowing the RoIs to share memory and computation
burden in both forward and backward passes. [36]
3.1.1 Faster R-CNN
Faster R-CNN [45] is an object detection model that builds on the R-CNN [43]
and Fast R-CNN [36] detectors. Fast R-CNN successfully made the detection net-
work faster, but the region proposal step remained as a bottleneck [45]. Faster R-
CNN alleviates this bottleneck by introducing a Region Proposal Network (RPN), a
Fully Convolutional Network (FCN) predicting object bounding boxes and object-
ness scores at each location simultaneously [45].
Region Proposal Network
The RPN receives an image as input and outputs a collection of region proposals
with objectness scores. Generation of the region proposals is done by applying a
smaller network on the convolutional feature map. This small network receives a
n×n spatial window of the convolutional feature map (originally n = 3). Projection
to a lower-dimensional feature map is performed, and two fully-connected layers are
applied to the projected feature map. These fully-connected layers are box-regression
and box-classification layers. All spatial locations share these fully-connected layers,
making the RPN efficient. [45]
For achieving translational invariance and for tackling the issue of different scales
of objects, the RPN uses multi-scale anchors as regression references. For each
3.1 TWO-STAGE DETECTORS 31
sliding window position, maximum k proposals are generated. Reference boxes,
called anchors, of various aspect ratios and scales are defined at each sliding window
position. The region proposals of the RPN are parameterized relative to these
k anchors. Originally, Ren et al. used k = 9 anchors at each sliding position,
consisting of 3 scales and 3 aspect ratios. The anchor-based approach allows sharing
of computation between the different scales and aspect ratios, as the convolutional
feature map is computed only once for the entire image. [45]
The loss function of the RPN is defined as a sum of classification and regression
loss [45]:
L({pi}, {ti}) = 1
Ncls
∑︂
i
Lcls(pi, p
∗
i ) + λ
1
Nreg
∑︂
i
p∗iLreg(ti, t
∗
i ) (3.1)
where Lcls is the binary cross-entropy loss (object or not), Lreg is the smooth L1
loss for bounding box regression. Terms pi and p∗i are the predicted probability and
ground-truth label of anchor i and terms ti and t∗i are the predicted and ground-
truth bounding box coordinates of anchor i. Additionally, the loss function includes
normalization terms Ncls and Nreg, and a balancing term λ. The RPN can be
trained with stochastic gradient descent (SGD), and Ren et al. used momentum
optimization with a momentum of 0.9. [45]
Architecture and Training
Faster R-CNN consists of two ‘modules‘, the first being the RPN and the second be-
ing the Fast R-CNN [36], which utilizes the region proposals generated by the RPN.
The architecture of Faster R-CNN is a unified network such that the convolutional
layers are shared between the RPN and the Fast R-CNN detector. [45]
The overall loss function of Faster R-CNN is the sum of the RPN loss and the
Fast R-CNN loss. The Fast-RCNN uses a multi-task loss function [36]:
3.2 ONE-STAGE DETECTORS 32
L(p, u, tu, v) = Lcls(p, u) + λ[u ≥ 1]Lreg(tu, v) (3.2)
where Lcls is the categorical cross-entropy loss and Lreg is the smooth L1 loss.
Terms p and u are the predicted probability and ground-truth label of a RoI, and
terms tu and v are the predicted and ground-truth bounding box coordinates of a
RoI. Additionally, the loss function includes a balancing term λ and an indicator
function [u ≥ 1] for handling the catch-all background class (i.e. u = 0).
Joint training of the RPN and the Fast R-CNN detector requires a specific tech-
nique the authors used, called alternating training. In alternating training, the RPN
is trained first and then the Fast R-CNN detector is trained with the proposals of the
RPN. The RPN is then initialized with the network tuned by Fast R-CNN detector
and the alternating training continues. [45]
3.2 One-Stage Detectors
Two-stage detectors can achieve high precision, but the complexity of the two-stage
pipeline results in high requirements for computational resources. In contrast, one-
stage detectors treat object detection as a "complete in one step" problem. Gen-
erally, one-stage detectors are faster than two-state detectors, which makes them
more suitable for applications with real-time requirements. However, their precision
is usually worse especially when detecting smaller objects. [3]
3.2.1 Single-Shot Multibox Detector
The Single-Shot Multibox Detector (SSD) [46], proposed by Liu et al. in 2016, is
a one-stage object detector that achieves high average precision with high detec-
tion speed. SSD is relatively simple compared to two-stage approaches, because
no explicit region proposal step is used. The idea of SSD is to use a collection of
3.2 ONE-STAGE DETECTORS 33
predefined bounding boxes to predict category scores and bounding box offsets. [46]
Architecture
The SSD architecture builds on a generic backbone network (originally VGG16 [47])
while including additional convolutional layers. The convolutional layers progres-
sively decrease in size, due to pooling, yielding a set of feature maps of different
sizes, allowing detection of objects of different scales. A small convolutional kernel
size of 3 × 3 is the basic element for predicting category scores and bounding box
offsets. The bounding box offsets are predicted relative to the default bounding
boxes. [46]
Default Boxes
A collection of default bounding boxes, often called default boxes, are defined and
anchrored relative to each feature map location. For each box out of k default boxes,
the SSD predicts offsets (4) to the original default box shape and c class scores (where
c is object category count). The authors used k = 6 default boxes with different
aspect ratios (1, 2, 3, 1
2
, 1
3
) and scales. The idea of default boxes is similar to the
anchors used in Faster R-CNN. The key difference is that the predefined boxes are
applied to feature maps of different resolutions. [46]
Training
Training of SSD requires matching of ground truth information to the chosen default
boxes. To determine the assignment, the authors used a matching strategy, where
the predefined boxes and ground truth boxes are matches based on the jaccard
overlap, i.e. the Intersection-over-Union (see Section 3.4.1). [46]
The loss function of SSD is a weighted sum [46]:
3.3 FEATURE EXTRACTION 34
L(x, c, l, g) =
1
N
(Lconf (x, c) + αLloc(x, l, g)) (3.3)
where Lconf is the confidence loss (softmax over class confidences), Lloc is the
localization loss (smooth L1 loss), x is the input data, c is the class predictions,
l is the predicted bounding box, g is the ground truth bounding box, and α is a
balancing parameter.
3.3 Feature Extraction
3.3.1 ResNet
There exists many different deep CNN based feature extraction (commonly known
as backbone) architectures for object detection, such as ResNet [48], VGG [47] and
others. In this thesis, the ResNet architecture is used as the backbone for the object
detection models.
While deeper NNs have more expressive power, they are more difficult to train
because of vanishing or exploding gradients. The more layers the network has, the
more multiplications the gradient has to go through, which can result in the gradient
vanishing, hurting the convergence.
Stacking many layers on top of each other has introduced a degradation problem,
where the NN accuracy is saturated and degrades rapidly. He et al. [48] propose
a solution to this problem with the Residual Network (ResNet) architecture. The
idea of ResNet is to add skip connections between layers, which allows the network
to learn residual mappings instead of the direct mappings. [48]
A residual block is a fundamental core of ResNet. It consists of a series of weight
(e.g. convolutional) layers and a skip connection that adds the input of the block to
the output of the block, bypassing the weight layers. Let x be the input and f(x) be
the direct mapping to learn. In a residual block, the weight layers learn the residual
3.3 FEATURE EXTRACTION 35
g(x) = f(x) − x instead of the desired f(x). Thus, the output is f(x) = g(x) + x,
where x is from the skip connection. [32, p. 312]
The original ResNet architecture was based on plain 34-layer CNN, to which the
skip connections were added. The ResNet34 architecture had better performance
than the baseline plain 34 layer CNN, and the degradation problem was addressed
well. [48] More commonly known variations of the ResNet architecture are ResNet50,
ResNet101 and ResNet152, which are based on 50, 101 and 152 layers respectively.
3.3.2 Feature Pyramid Network
Detection of objects at different scales is a common problem not only in maritime
vessel detection, but in object detection in general. The Feature Pyramid Network
(FPN) [49] is a feature extraction architecture that aims to allow the network to
better detect objects at different scales. FPN is based on an idea of a pyramid of
feature maps of different scales. The FPN has two pathways – a bottom-up pathway
and a top-down pathway. Each of the stages of the bottom-up pathway is connected
to the top-down pathway via lateral connections, allowing the top-down pathway to
use the feature maps of different scales from the bottom-up pathway. [49]
The bottom-up pathway is the standard feature extraction backbone, such as
ResNet, producing feature maps of different scales. For ResNets, the stages (or
levels) of the pyramid are produced by the last residual block of each stage. [49]
The top-down pathway essentially hallucinates higher-resolution feature maps
from the coarser feature maps of the bottom-up pathway. The top-down pathway
produces semantically strong, but spatially coarse feature maps, later refined with
the information from the lateral connections. [49]
In the context of this thesis, the original SSD inherently uses a pyramidal feature
hierarchy, as it uses feature maps of different scales for detection. The difference is
that SSD builds the feature pyramid high-up in the network, i.e. not reusing the
3.4 EVALUATION METRICS 36
higher-resolution feature maps of the hierarchy. The FPN is a more general and
flexible architecture, showing to improve the performance, especially in small object
detection. [49]
3.4 Evaluation Metrics
3.4.1 Intersection over Union
In order to assess the performance of an object detection model, a metric for "correct
detection" is needed. Intersection over Union (IoU) measures the overlap between
two bounding boxes. Mathematically, IoU is defined as the ratio of the intersection
area and the union area of two bounding boxes [50]:
IoU =
Area of Intersection
Area of Union
=
area(Bp ∩Bgt)
area(Bp ∪Bgt) (3.4)
where Bp is the predicted bounding box and Bgt is the ground truth bounding box.
A visual example of the IoU is shown in Figure 3.1.
IoU allows to establish definitions for "correct detection" and "incorrect detec-
tion". Given a threshold t and an IoU value of Bp and Bgt, a detection is considered
correct if the IoU is greater than or equal to the threshold t [50]:
IoU ≥ t ⇐⇒ Correct Detection (3.5)
3.4.2 Precision and Recall
Confusion Matrix (CM), is used to describe the outputs of a classification model [51].
Viewing object detection as a binary classification problem between "correct" and
"incorrect" detections, we can extrapolate from the CM the following information
[50]:
3.4 EVALUATION METRICS 37
Figure 3.1: IoU (0.80) of a vessel detection, image from SeaShips and overlays by
the author – ground truth bounding box in green, predicted bounding box in red.
• True Positive (TP): A correct detection of Bgt
• False Positive (FP): An incorrect detection of Bgt
• False Negative (FN): No detection of Bgt
• True Negative (TN): Not applicable in object detection (infinite number of B
that should not be detected)
CM and the definitions above are the basis of many different metrics [51]. In
object detection, the assessment of performance is largely based on the precision
and recall metrics [50].
3.4 EVALUATION METRICS 38
Precision is defined mathematically as the ratio of the correct detections (TP)
to the total number of detections (TP + FP) [50]:
Precision =
TP
TP+ FP
(3.6)
Recall is defined as the ratio of the correct detections (TP) to the total number
of ground truth objects (TP + FN) [50]:
Recall =
TP
TP+ FN
(3.7)
Intuitively, precision measures the ability to identify only relevant objects, while
recall measures the ability to find all relevant objects [50].
Precision-Recall Curve
Precision and recall usually have an inverse relationship, i.e. increasing one decreases
the other. Generally, if FP is low, precision will be higher, but recall is lower.
Likewise, if FN is low, recall will be higher, but precision is lower. An ideal object
detector would have both high precision and high recall (FN = 0 and FP = 0).
Precision-recall curve is a graph that shows the relationship between precision and
recall for different confidence thresholds t. The area under the precision-recall curve
(AUC) is a metric summarizing the overall performance of the model, where higher
AUC indicates better performance. [50]
Average Precision
In practical applications of object detection, the precision-recall curve has a zigzag-
like shape, which makes it difficult to accurately measure the AUC [50]. The AUC
can be summarized with the average precision (AP) metric, which is computed using
interpolation. For example, in 11-point interpolation the precision is measured at
3.4 EVALUATION METRICS 39
11 equally spaced recall levels R ∈ {0, 0.1, ..., 0.9.1}, yielding the AP [50]:
AP11 =
1
11
∑︂
R∈{0,0.1,...,0.9.1}
Pinterp(R) (3.8)
where Pinterp(R) is the maximum precision of all recall levels greater than R.
Mean Average Precision
Mean Average Precision (mAP) is an extension of the AP metric, which summarizes
the AP over multiple object categories. Let C be the set of object classes, then the
mAP is defined as [50]:
mAP =
1
|C|
∑︂
c∈C
APc (3.9)
3.4.3 COCO Detection Evaluation Metrics
This thesis uses the COCO Detection Evaluation [52] metrics for evaluating the
performance of the object detection models. In total, 12 different metrics in 4
groups are used, as shown in Table 3.1.
Metric Group Metric Description
Average Precision (AP)
AP AP at IoU = 0.50:0.05:0.95
APIoU=.75 AP at IoU = 0.50
APIoU=.50 AP at IoU = 0.75
AP Across Scales
APsmall AP small objects (area < 322)
APmedium AP medium objects (322 < area < 962)
APlarge AP large objects (area > 962)
Average Recall (AR)
ARmax=1 AR given 1 detection per image
ARmax=10 AR given 10 detections per image
ARmax=100 AR given 100 detections per image
AR Across Scales
ARsmall AR small objects (area < 322)
ARmedium AR medium objects (322 < area < 962)
ARlarge AR large objects (area > 962)
Table 3.1: The evaluation metrics of the COCO detection challenge [52]. The au-
thors use the terms mAP and AP (as well as mAR and AR) interchangeably.
4 Experiment
In order to understand and measure the capabilities of Transfer Learning in the
context of vessel detection from waterborne images, this thesis presents an experi-
ment in which two pre-trained object detectors, SSD and Faster R-CNN, are trained
and evaluated on two datasets, ABOships and SeaShips. The experiment outline is
presented in Section 4.1. The datasets are described in Section 4.2, the object de-
tectors in Section 4.3, the training process in Section 4.4 and the evaluation process
in Section 4.5. The experiment is conducted on a single consumer-grade computer
with the specifications presented in Section 4.6.
4.1 Outline
The main outline of the experiment is to fine-tune two pre-trained object detectors,
SSD and Faster R-CNN, on two maritime datasets, ‘ABOships‘ and ‘SeaShips‘.
Different ResNet feature extractors are used to study their effect on the transfer
learning performance. The research questions of the thesis have been presented in
Section 1.1.2, and for convenience are repeated here:
Research Questions
1. How does the vessel detection performance differ between the two object de-
tector architectures, one-stage SSD FPN (3.2.1) and two-stage Faster R-CNN
(3.1.1)?
4.2 DATASETS 41
2. How does the performance of the object detectors differ when fine-tuned on
the maritime datasets, ABOships (1.3.2 & 4.2.2) and SeaShips (1.3.3 & 4.2.3)?
3. How do different ResNet (3.3.1) feature extractors sizes affect the performance
of SSD FPN and Faster R-CNN object detectors?
4.2 Datasets
The experiment is conducted on two maritime datasets, ABOships [18] and Sea-
Ships [21]. The base models have been trained on the generic COCO [19] dataset,
introduced in Section 1.3. In order to evaluate the unbiased performance of the
vessel detectors, the datasets are split into three subsets: training, validation and
test. The training set is used to train the object detectors, the validation set is used
to evaluate the performance of the object detectors during training and the test set
is used to produce an unbiased estimate of the performance of each vessel detector.
4.2.1 COCO
For the purposes of this thesis, the COCO dataset contains 3 146 images of vessels
and has been discussed in Section 1.3.1. The drawbacks of COCO in the context
of vessel detection are the small amount of images containing vessels and the lack
of information about the categories of vessels. Examples of the ‘boat‘ category are
shown in Figure 1.2. Yet, it is suitable for pre-training the object detectors, as it
contains a large amount of images and object instances. All models (4.3) used in
the thesis have been pre-trained on the COCO dataset 2017 edition and are publicly
available in the TensorFlow 2 Detection Model Zoo [53].
4.2 DATASETS 42
4.2.2 ABOships
Each annotated maritime object (41968 in total) in the ABOships dataset belongs
to one of the 11 classes: ‘boat‘, ‘cargoship‘, ‘cruiseship‘, ‘ferry‘, ‘militaryship‘, ‘mis-
cboat‘, ‘miscellaneous‘, ‘motorboat‘, ‘passengership‘, ‘sailboat‘ and ‘seamark‘. For
the purposes of this thesis, the categories ‘miscellaneous‘ and ‘seamark‘ are excluded
from the dataset, as they are not vessels of any kind or have not been identified as
such. After the exclusions, the dataset contains 7 992 images of vessels, 34 100
instances in total. The category distribution of the dataset is shown in Figure 4.1
and an example image in Figure 1.3.
Figure 4.1: The category distribution of the ABOships dataset, where the categories
‘miscellaneous‘ and ‘seamark‘ have been excluded.
The extraction process described in the original article yielded a dataset with
images of vessels from different angles and distances, as well as in different weather
and lighting conditions. Variety is a desirable characteristic for the purposes of this
thesis, as the object detectors are expected to perform well in different conditions.
One consideration is the dependency between the video frames. However, the ex-
traction interval being 15 seconds, the dependency between the frames should be
minimal and not expected to have a significant effect on the results.
4.2 DATASETS 43
Data Split
The ABOships dataset is not distributed with a predefined data split, so for the
purposes of the experiment the dataset needs to be split into three subsets: training,
validation and test. The dataset is split into three subsets with a ratio of 70% (5
333 images, 22 907 vessels) for training, 15% (1 329 images, 5 594 vessels) for
validation and 15% (1 330 images, 5 599 vessels) for testing. In order to preserve
the distribution of the classes in the subsets, special care is taken when splitting
the dataset. First the labels are grouped by image filename, then the groups are
split into the three subsets using stratified random sampling. Grouping by filename
ensures that an image cannot be split into multiple subsets, which would cause data
leakage, invalidating the results. Stratified sampling ensures that the distribution
of the classes is preserved also in the subsets. The class distributions of the subsets
are shown in Figure 4.2 and of the original dataset in Figure 4.1.
Figure 4.2: The class distributions of the subsets of the ABOships dataset using
stratified random sampling.
4.2 DATASETS 44
4.2.3 SeaShips
The publicly available subset [22] of the SeaShips dataset is the other maritime
dataset used in the experiment. The subset contains 7 000 images of vessels from
the original dataset. The vessels are divided into six categories: ‘ore carrier‘, ‘bulk
cargo carrier‘, ‘general cargo ship‘, ‘container ship‘, ‘fishing boat‘ and ‘passenger
ship‘. The category distribution of the dataset is shown in Figure 4.3 and an example
image in Figure 1.4.
Figure 4.3: The category distribution of the SeaShips (7000) dataset.
Data Split
While the publicly available subset of the SeaShips dataset is distributed with a
predefined data split, an explicit data split is performed for the purposes of the
experiment. The predefined data split consists of 25% (1 750 images) for training,
25% (1 750 images) for validation and 50% (3 500 images) for testing. In order
to achieve comparable results with the ABOships dataset, the dataset is split into
three subsets with a ratio of 70% (4 668 images, 6 134 vessels) for training, 15% (1
166 images, 1 550 vessels) for validation and 15% (1 166 images, 1 537 vessels) for
testing. The split procedure is the same as with the ABOships dataset, grouping by
filename and using stratified random sampling to ensure that the distribution of the
4.3 MODELS 45
classes is preserved in the subsets. The class distributions of the subsets are shown
in Figure 4.4 and of the original dataset in Figure 4.3.
Figure 4.4: The class distributions of the subsets of the SeaShips dataset using
stratified random sampling.
4.3 Models
The experiment is conducted using two pre-trained object detectors, SSD (Section
3.2.1) and Faster R-CNN (Section 3.1.1). The SSD models use FPN (Section 3.3.2)
with the backbone feature extractors. These pre-trained object detectors are avail-
able in the TensorFlow 2 Detection Model Zoo [53] with various feature extractors,
of which ResNet50, ResNet101 and ResNet152 are used in the experiment. The
ResNet feature extractors are named according to the number of layers in the net-
work, e.g. ResNet50 has 50 layers. All object detectors have been pre-trained on
the COCO dataset and are fine-tuned separately on the ABOships and SeaShips
datasets. The configurations used in the experiment are presented in Table 4.1.
The pre-trained models are distributed with a configuration file, which contains
the definition of the model architecture and the hyperparameters used in training.
The configuration files are kept largely unchanged, with the exception of slight
4.4 TRAINING 46
Detector Feature Extractor Input Size
SSD ResNet50 FPN 640x640
SSD ResNet101 FPN 640x640
SSD ResNet152 FPN 640x640
Faster R-CNN ResNet50 640x640
Faster R-CNN ResNet101 640x640
Faster R-CNN ResNet152 640x640
Table 4.1: The configurations of the pre-trained object detectors used in the exper-
iment.
modifications to the hyperparameters. The training process is described in more
detail in Section 4.4.
4.4 Training
Preparation
Before training, all datasets are converted into the TensorFlow’s TFRecord [54]
format. The images and annotations of the datasets are split into multiple TFRecord
files, i.e. sharded, for performance reasons.
Configuration
For all models, the minibatch size is reduced to 4 from the pre-configured 64, in
order to not exceed the memory capacity of the GPU. A drastic change in batch
size should be taken into account when configuring the learning rate, according to
Krizhevsky [55, p. 5]. As a heuristic rule, Kirzhevsky suggests to multiply the
learning rate by k when multiplying the batch size by k. In this case, the learning
rate is multiplied by 4
64
= 1
16
= 0.0625, as the batch size is reduced from 64 to
4. Both the warmup phase and the decay phase of the learning rate schedule are
adjusted accordingly. Experimentation in the context of this thesis showed that the
learning rate change was necessary in order to achieve stable training.
All models in this thesis are optimized using the Momentum (SGD) optimizer
4.5 EVALUATION 47
with a momentum value of 0.9, which is the pre-configured default value in the
model configuration files. The default cosine decay learning rate schedule, which
consists of a warmup phase and a decay phase, is used for all models. The warmup
phase is used to gradually increase the learning rate to the base learning rate, and
lasts for 2000 steps. The decay phase is used to gradually decrease the learning rate
and lasts for 23 000 steps, making the total training of a model to last for 25 000
steps.
Data augmentation is disabled for all models, as the aim of the experiment
is to study the raw performance in the context of vessel detection. The chosen
object detectors are fine-tuned using the training scripts of the TensorFlow 2 Object
Detection API [56] and the training process is monitored using TensorBoard [57].
4.5 Evaluation
The evaluation is performed using the COCO Detection Evaluation metrics (see
Section 3.4.3). This is achieved by modifying the configuration file of the object
detectors to use said metrics. The evaluation is executed using the evaluation scripts
of the TensorFlow 2 Object Detection API, similarly to the training process.
In the experiment, continuous evaluation with the validation set is used to mon-
itor the performance of the vessel detectors during training. The validation is per-
formed at every 1 000 steps.
The out-of-sample performance of the vessel detectors is evaluated using the
test set. The results of the out-of-sample evaluation are presented in Chapter 5.
Cross-validation is not used in the experiment. This is because of the lack of the
computational resources required to perform cross-validation, as well as the test set
being large enough to produce a reliable point estimate of the performance of the
vessel detectors.
4.6 ENVIRONMENT 48
4.6 Environment
The experiment is conducted on a single consumer-grade computer with the following
specifications:
CPU AMD Ryzen 7 7800X3D
GPU NVIDIA GeForce RTX 4080 16 GB GDDR6X
RAM 64 GB DDR5
The following operating system, programming language and libraries are used:
Version
Ubuntu (Windows 11 Pro WSL2) 23.04
Python 3.10.13
TensorFlow & Object Detection API [56] 2.6.0
CUDA Toolkit 11.8
cuDNN 8.9.5.29
TensorRT 8.6.1
5 Experiment Results
Chapter 5 presents the results of the experiment described in Chapter 4, attempts
to answer the research questions posed in Section 1.1.2 and presents the challenges
in maritime vessel detection. Section 5.4 reflects on the results, compares the results
to other work, and presents discussion about potential future work. Figures 5.1 and
5.2 show examples of the vessel detections from the ABOships and SeaShips test
sets, respectively.
5.1 Results
Section 5.1 presents the numerical results of the experiment. Both precision and
recall metrics are shown for each of the detectors and datasets. Out of sample
performance is measured using a test set that was not used during training, as
described in Section 4.2.
Overall, the experiment’s results are mostly inline what was expected from the
experiment setting, albeit some of the results are surprising.
5.1.1 Precision
COCO precision metrics (see Section 3.4.3) for ABOships dataset are presented in
Table 5.1 and for SeaShips in Table 5.2.
5.1 RESULTS 50
Figure 5.1: Vessel detection example from ABOships test set using SSD FPN with
ResNet 101 backbone.
ABOships
ABOships precision metrics are relatively low, which is expected since the dataset
is challenging in terms of small vessel sizes and occlusions. All detectors perform
similarly, although SSD tends to perform better than Faster RCNN, regardless of
the ResNet backbone size. This is likely because the SSD with FPN is designed to
extract multiple feature maps at different scales at different levels of the network,
whereas Faster RCNN only extracts features at the backbone network. In the case
of Faster RCNN, there is some evidence that the larger ResNet backbones increase
5.1 RESULTS 51
Figure 5.2: Vessel detection example from SeaShips test set using SSD FPN with
ResNet 101 backbone.
the performance of the detector. For SSD, such evidence is not clearly present.
SeaShips
In SeaShips, the precision metrics are much higher than in ABOships, which indi-
cates that the dataset is easier for the detectors. The results are consistent across
the detectors, with SSD performing better than Faster RCNN. The APsmall metric
shows larger variance than the equivalent metric in ABOships.
5.1 RESULTS 52
Detector AP APIoU=.75 APIoU=.50 APlarge APmedium APsmall
Faster RCNN 50 0.2470 0.1807 0.5510 0.4000 0.2496 0.0980
Faster RCNN 101 0.2537 0.1951 0.5615 0.3944 0.2675 0.0991
Faster RCNN 152 0.2612 0.2051 0.5647 0.4184 0.2732 0.1142
SSD 50 FPN 0.2654 0.2155 0.5720 0.4131 0.2841 0.1003
SSD 101 FPN 0.2772 0.2367 0.5750 0.4198 0.2976 0.0995
SSD 152 FPN 0.2725 0.2243 0.5764 0.4073 0.2898 0.1003
Table 5.1: COCO precision metrics of each of the detectors on the ABOships test
set.
Detector AP APIoU=.75 APIoU=.50 APlarge APmedium APsmall
Faster RCNN 50 0.7517 0.8956 0.9851 0.7685 0.4861 0.1837
Faster RCNN 101 0.7620 0.8944 0.9826 0.7786 0.5150 0.1584
Faster RCNN 152 0.7630 0.8949 0.9834 0.7775 0.6007 0.0673
SSD 50 FPN 0.7688 0.9063 0.9694 0.7798 0.6702 0.1005
SSD 101 FPN 0.7740 0.9020 0.9729 0.7850 0.6591 0.2337
SSD 152 FPN 0.7697 0.9032 0.9720 0.7809 0.5794 0.0950
Table 5.2: COCO precision metrics of each of the detectors on the SeaShips test set.
5.1.2 Recall
COCO recall metrics (see Section 3.4.3) for ABOships dataset are presented in Table
5.3 and for SeaShips in Table 5.4.
ABOships
Detector ARmax=1 ARmax=10 ARmax=100 ARlarge ARmedium ARsmall
Faster RCNN 50 0.2823 0.3972 0.4193 0.5874 0.4332 0.2624
Faster RCNN 101 0.2837 0.3996 0.4218 0.5990 0.4440 0.2708
Faster RCNN 152 0.2963 0.4127 0.4332 0.6255 0.4527 0.2935
SSD 50 FPN 0.2956 0.4386 0.4585 0.5853 0.4769 0.3133
SSD 101 FPN 0.3000 0.4458 0.4635 0.5727 0.4863 0.3305
SSD 152 FPN 0.3013 0.4484 0.4679 0.5892 0.4931 0.3058
Table 5.3: COCO recall metrics of each of the detectors on the ABOships test set.
Recall results for ABOships further confirm the difficulty of the dataset, with
all detectors having a modest recall. Moreover, the recall results validate the con-
5.2 RESEARCH QUESTIONS 53
clusions from the precision results, with SSD performing better than Faster RCNN.
Similar effect of the ResNet backbone size is observed for Faster RCNN, where de-
tectors with larger backbones tend to perform better. For SSD, the effect is not as
clear.
SeaShips
Detector ARmax=1 ARmax=10 ARmax=100 ARlarge ARmedium ARsmall
Faster RCNN 50 0.7363 0.8064 0.8079 0.8245 0.5789 0.2333
Faster RCNN 101 0.7473 0.8161 0.8172 0.8319 0.5959 0.2333
Faster RCNN 152 0.7518 0.8187 0.8196 0.8331 0.6568 0.0666
SSD 50 FPN 0.7536 0.8268 0.8280 0.8376 0.7186 0.2667
SSD 101 FPN 0.7580 0.8298 0.8308 0.8417 0.6997 0.3333
SSD 152 FPN 0.7530 0.8261 0.8272 0.8368 0.7151 0.3667
Table 5.4: COCO recall metrics of each of the detectors on the SeaShips test set.
Recall results for SeaShips across the detectors are high. Faster RCNN tends to
benefit from a larger ResNet backbone. SSD has a higher recall than Faster RCNN.
The ARsmall metric is very low for the Faster RCNN with ResNet 152 backbone,
indicating issues with small vessel detection.
5.2 Research Questions
Section 5.2 attempts to answer the research questions posed in Section 4.1. These
answers are based on the numerical results presented in Section 5.1.
5.2.1 Detector Architecture
How does the vessel detection performance differ between the two ob-
ject detector architectures, one-stage SSD FPN and two-stage Faster R-
CNN?
5.2 RESEARCH QUESTIONS 54
Overall, the results from both datasets indicate that SSD performs better than
Faster RCNN. This is surprising, as Faster RCNN is generally considered to be a
more accurate detector than SSD, while SSD provides faster inference times. The
difference in performance is not large, but it is consistent across the datasets and
the ResNet backbones. Both detector architectures are suitable for vessel detection.
5.2.2 Datasets
How does the performance of the object detectors differ when fine-tuned
on the maritime datasets, ABOships and SeaShips?
Both datasets are challenging for all of the detectors used in the experiment.
ABOships is a more challenging dataset with many small vessels and occlusions,
while SeaShips remains easier with larger vessels. Notably, the setting of the datasets
is very different, one being images mostly from port areas with large commercial
vessels (SeaShips), and the other one being images from the Finnish Archipelago
with smaller vessels (ABOships).
For building a robust and general vessel detection system, a combination of both
datasets would be beneficial. This way the detector would be applicable to a wider
range of maritime scenarios, provided that the detector is able to generalize well.
Both datasets include images with varying environmental conditions, such as low-
light conditions, which is important for building a robust vessel detection system.
For proper night-time vessel detection, a different sensor, such as thermal camera is
needed. For challenges in vessel detection using these datasets, see Section 5.3.
5.2.3 ResNet
How do different ResNet (3.3.1) feature extractors sizes affect the per-
formance of SSD FPN and Faster R-CNN object detectors?
5.3 CHALLENGES 55
The numerical results in Section 5.1 indicate that the ResNet backbone size has
an effect on the performance of the detectors, however it is not consistent across the
detectors. With Faster RCNN, the larger ResNet backbones perform better. Both
precision and recall metrics show increased numerical values with larger ResNet
backbones. For example, in ABOships using the Faster RCNN detector architecture,
the AP metric increases from 0.2470 to 0.2612 when the ResNet backbone is changed
from 50 to 152. Similarly in SeaShips, the AP metric increases from 0.7517 to 0.7630
when the ResNet backbone is changed from 50 to 152.
The effect of the ResNet backbone size with the SSD architecture varies, showing
no clear evidence on improvement in precision. The recall metrics show similar
results, where the effect of the ResNet backbone size is not as clear as with Faster
RCNN.
Altogether, the numerical results indicate that a larger ResNet backbone is ben-
eficial for the detectors. The feature extraction in the Faster RCNN architecture is
performed only at the backbone network (cf. SSD with FPN), which is likely the
reason for the effect of the ResNet backbone size. However, since SSD does not show
as strong evidence of the effect of the ResNet backbone size, simply using a larger
ResNet backbone is not a guaranteed way to improve the performance of the de-
tector. A larger ResNet backbone increases the computational cost of the detector,
which means both training and inference are more computationally expensive.
5.3 Challenges
5.3.1 Vessel Size and Distance
In vessel detection, the size and distance of the vessels play an important role in
detections. A large vessel (such as a cargo ship) can appear to be of similar size
as a small vessel (such as a fishing boat) in the image, as seen in Figure 5.3. The
5.3 CHALLENGES 56
Figure 5.3: An example from ABOships test set using SSD FPN with ResNet 101
backbone. Vessels of different real world sizes appear to be of similar size in the
image.
detection of the second military ship from the left (score 0.96) is almost the same
size as the passenger boat at the foreground (score 0.52), even though the military
ship is much larger in reality. At the open sea, where distances are large, the issue
is much more problematic than in port areas. Thus, future research should focus on
improving the detection of vessels at larger distances, i.e. improving the detection
performance of perceived ’small’ instances.
5.3 CHALLENGES 57
5.3.2 Occlusions
Figure 5.4: An occlusion example from SeaShips test set using SSD FPN with
ResNet 101 backbone. An incorrect and inaccurate detection of ’ore carrier’ as
’bulk cargo carrier’ due to the occluding ’fishing boat’.
Occlusions are a challenge not only in vessel detection, but in object detection
in general. In maritime vessel detection, especially in port areas where the traffic
is dense, the vessels are often occluded by other vessels. In the experiment of the
thesis, occlusions were a challenge, as seen in Figure 5.4. The ’ore carrier’ is occluded
by the ’fishing boat’, which causes the detector to incorrectly detect the ’ore carrier’
as a ’bulk cargo carrier’ with an incorrect bounding box.
5.3 CHALLENGES 58
5.3.3 Environmental Conditions
Figure 5.5: A late evening image from SeaShips test set, with detection score thresh-
old of 0.3. Left: Faster RCNN ResNet 152 with a detection (score ≈ 1.0). Right:
SSD FPN ResNet 152 without a detection (score ≈ 0.2).
Environmental conditions proved out to have a significant impact on the perfor-
mance of the detectors. Especially low-light conditions, such as late evening and
night-time, were problematic in terms of consistent detections. This is especially
evident in Figure 5.5. Faster RCNN with ResNet 152 backbone is able to detect
the ’ore carrier’ with high confidence, while SSD FPN with ResNet 152 backbone
is not. This one example already shows the issue with low-light conditions in vessel
detection. The ’ore carrier’ occupies a large portion of the image, and is clearly
visible to the human eye, but is problematic for the detectors.
Low-light vessel detection performance could be improved by using a different
sensor, such as thermal camera and sensor fusion. Farahnakian and Heikkonen [7]
have explored different sensor fusion architectures (RGB + IR) for vessel detection
especially in low-light conditions.
5.4 DISCUSSION AND FUTURE WORK 59
5.4 Discussion and Future Work
The results of this thesis are inline with the baseline results of both ABOships and
SeaShips datasets. The research findings of this thesis provide additional value in
understanding the performance of detectors and different size ResNet backbones in
maritime vessel detection.
ABOships Baseline Comparison
The authors of ABOships [18] have presented baseline detection results with using
various detector architectures and backbones pre-trained on the COCO dataset.
The results of this thesis are in line with the baseline results, although it is unclear
what input size the baseline results refer to (640 × 640 in this thesis). Notably,
the baseline results use a modified definition of the small vessel class (162 < area
< 322) versus the COCO definition of small objects (area < 322). In other words,
the authors have excluded the vessels with area smaller than 162 pixels, whereas
this thesis uses all vessels in the dataset. This difference makes the APsmall and AP
metrics incomparable between the baseline and this thesis, yet APmedium and APlarge
remain comparable. Comparison of results of this thesis and the baseline results,
where applicable, are shown in Table 5.5. The authors have achieved the highest
AP (0.3518) with Faster RCNN and Inception ResNet V2 backbone.
Detector Feature Extractor APmedium APlarge
Faster RCNN ResNet 101 (640× 640) 0.2675 0.3944ResNet 101 0.2507 0.3817
SSD ResNet 101 FPN (640× 640) 0.2976 0.4198ResNet 101 FPN 0.3118 0.4207
Table 5.5: Comparison of the results of this thesis and the baseline results of ABO-
ships [18]. Thesis results are in bold.
5.4 DISCUSSION AND FUTURE WORK 60
SeaShips Baseline Comparison
The authors of SeaShips [21] have similarly presented baseline detection results with
different architectures. The results of this thesis are mostly in line with the base
detection results from the SeaShips paper, although there are some differences. This
thesis uses the publicly available subset of the SeaShips dataset [22], with only 7 000
images out of the total 31 455 images. As with ABOships, it is unclear what input
sizes some of the baseline results use. For example, SSD with MobileNet backbone
and 608×608 input size achieves an AP of 0.7950, which is close to the AP of 0.7740
achieved in this thesis. Faster RCNN shows a larger difference in the AP metric,
with the baseline result being 0.9240 and this thesis achieving 0.7620. This is likely
due to the different input sizes used in the baseline and this thesis.
Detector Feature Extractor AP
Faster RCNN ResNet 50 (640× 640) 0.7517ResNet 50 0.9165
Faster RCNN ResNet 101 (640× 640) 0.7620ResNet 101 0.9240
SSD ResNet 101 FPN (640× 640) 0.7740MobileNet (608× 608) 0.7950
Table 5.6: Comparison of the results of this thesis and the baseline results of Sea-
Ships [21]. Thesis results are in bold.
Future Work
Work following this thesis could focus on improving the performance of the detectors
in challenging maritime scenarios. Sensor Fusion is a simple yet effective way to
improve the performance of the detectors, especially in low-light conditions, as shown
in [7]. The use of RGB and IR cameras together can result in a more robust vessel
detection system.
Elevating the number of sensors of different types, such as RGB, IR, and radar,
can quickly become computationally expensive. Haghbayan et al. [58] have proposed
5.4 DISCUSSION AND FUTURE WORK 61
an efficient sensor fusion architecture for object detection in maritime environments.
They have fused radar, LiDAR, RGB and IR sensors together using a probabilistic
data association method, achieving reliable object detection in the maritime context.
Another interesting area of future work is to use an inertial sensor in order to
alleviate the problem of waves and vessel movement, if using an on-board camera.
As an example, Bertozzi et al. [59] have used an inertial sensor to reduce the problem
of miscalibrations in obstacle detection and classification in the automotive context.
Similar approach could be useful in maritime vessel detection, when using cameras
on-board a vessel.
Finally, the development and use of even larger and more diverse maritime
datasets would be beneficial for building a robust and general vessel detection sys-
tem.
6 Conclusion
This thesis has presented relevant theory (Chapters 1, 2 & 3) on the topic of mar-
itime vessel detection using deep neural networks, and implemented an experimental
study (Chapter 4) with the aim of understanding the effect of different object detec-
tion architectures, the choice of backbone networks, and the feasibility of different
maritime datasets for developing a vessel detection system. The results of the ex-
periment (Chapter 5) have shown that Transfer Learning is a viable approach for
developing a well-performing vessel detection system, even on consumer-grade hard-
ware.
Both one-stage (SSD FPN) and two-stage (Faster RCNN) detector architectures
were used in the experiment, and shown to perform well in maritime vessel detection
and to produce similar results. In the experiment, SSD FPN architecture was shown
to perform slightly better than Faster RCNN, but the difference was not large enough
to deem Faster RCNN as unsuitable for the task.
The choice of the backbone network size (ResNet) was shown to have an effect
when using the Faster RCNN architecture, but no significant effect was observed
when using the SSD FPN architecture. The results suggest that using a larger
backbone network size is beneficial when using the Faster RCNN architecture, but
not as much when using the SSD FPN architecture. Thus, the choice of the backbone
network is to be considered when developing a vessel detection system.
The feasibility of different maritime datasets was also studied. The experiment
CHAPTER 6. CONCLUSION 63
used two different maritime datasets, ABOships and SeaShips, with different char-
acteristics. The performance of the vessel detection system was vastly different on
each of the datasets, where SeaShips dataset produced much higher performance
metrics than ABOships. This suggests that ABOships dataset is more difficult for
the detectors, and as such should be considered in future studies to improve the
understanding of the performance of vessel detection systems in maritime environ-
ment.
The results of the thesis are inline with the baseline results from the literature.
The thesis has provided additional insight into the effect of different object detec-
tion architectures, the choice of backbone networks, and the challenges in vessel
detection in the maritime context. These results support the implementation of
a vessel detection system in real-world maritime applications, and provide further
understanding of the challenges in the maritime context.
Future research could focus on improving the performance of vessel detection
systems in maritime environment, by handling the challenging aspects of the mar-
itime context. Sensor fusion and multi-modal sensor data could be used to improve
the performance of vessel detection systems, especially in low visibility conditions.
The results of this thesis have shown that development of a vessel detection system
for real-world maritime applications is feasible, and that the performance of the
system can be improved with further research and experimentation.
References
[1] N. A. Stanton, P. R. Chambers, and J. Piggott, ”Situational awareness and
safety”, Safety science, vol. 39, no. 3, pp. 189–204, 2001.
[2] W. Dai, Y. Mao, R. Yuan, Y. Liu, X. Pu, and C. Li, ”A novel detector based
on convolution neural networks for multiscale sar ship detection in complex
background”, Sensors, vol. 20, no. 9, p. 2547, 2020.
[3] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, ”Object detection in 20 years: A
survey”, Proceedings of the IEEE, 2023.
[4] D. Lu and Q. Weng, ”A survey of image classification methods and tech-
niques for improving classification performance”, International journal of Re-
mote sensing, vol. 28, no. 5, pp. 823–870, 2007.
[5] W. Elmenreich, ”An introduction to sensor fusion”, Vienna University of Tech-
nology, Austria, vol. 502, pp. 1–28, 2002.
[6] H. Heiselberg and A. Stateczny, Remote sensing in vessel detection and navi-
gation, 2020.
[7] F. Farahnakian and J. Heikkonen, ”Deep learning based multi-modal fusion
architectures for maritime vessel detection”, Remote Sensing, vol. 12, no. 16,
p. 2509, 2020.
[8] M. R. Endsley, ”Measurement of situation awareness in dynamic systems”,
Human factors, vol. 37, no. 1, pp. 65–84, 1995.
REFERENCES 65
[9] N. Wawrzyniak, T. Hyla, and A. Popik, ”Vessel detection and tracking method
based on video surveillance”, Sensors, vol. 19, no. 23, p. 5230, 2019.
[10] A. Van den Broek, R. Neef, P. Hanckmann, S. P. van Gosliga, and D. Van
Halsema, ”Improving maritime situational awareness by fusing sensor informa-
tion and intelligence”, in 14th International Conference on Information Fusion,
IEEE, 2011, pp. 1–8.
[11] C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine learning.
Springer, 2006, vol. 4.
[12] X.-Y. Zhang, C.-L. Liu, and C. Y. Suen, ”Towards robust pattern recognition:
A review”, Proceedings of the IEEE, vol. 108, no. 6, pp. 894–922, 2020.
[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[14] D. Berrar et al., Cross-validation. 2019.
[15] F. Zhuang, Z. Qi, K. Duan, et al., ”A comprehensive survey on transfer learn-
ing”, Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[16] S. Puttemans, T. Callemein, and T. Goedemé, ”Building robust industrial
applicable object detection models using transfer learning and single pass deep
learning architectures”, arXiv preprint arXiv:2007.04666, 2020.
[17] F. Farahnakian, L. Zelioli, and J. Heikkonen, ”Transfer learning for maritime
vessel detection using deep neural networks”, in 2021 IEEE International In-
telligent Transportation Systems Conference (ITSC), IEEE, 2021, pp. 1–6.
[18] B. Iancu, V. Soloviev, L. Zelioli, and J. Lilius, ”Aboships—an inshore and
offshore maritime vessel detection dataset with precise annotations”, Remote
Sensing, vol. 13, no. 5, p. 988, 2021.
REFERENCES 66
[19] T.-Y. Lin, M. Maire, S. Belongie, et al., ”Microsoft coco: Common objects in
context”, in Computer Vision–ECCV 2014: 13th European Conference, Zurich,
Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014,
pp. 740–755.
[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ”Imagenet: A
large-scale hierarchical image database”, in 2009 IEEE conference on computer
vision and pattern recognition, Ieee, 2009, pp. 248–255.
[21] Z. Shao, W. Wu, Z. Wang, W. Du, and C. Li, ”Seaships: A large-scale pre-
cisely annotated dataset for ship detection”, IEEE transactions on multimedia,
vol. 20, no. 10, pp. 2593–2604, 2018.
[22] Z. Shao, W. Wu, Z. Wang, W. Du, and C. Li, SeaShips (7000), http://www.
lmars.whu.edu.cn/prof_web/shaozhenfeng/datasets/SeaShips(7000)
.zip, Accessed: 2023-11-07.
[23] S. Haykin, Neural networks: a comprehensive foundation. Prentice Hall PTR,
1998.
[24] L. Torrey and J. Shavlik, ”Transfer learning”, in Handbook of research on ma-
chine learning applications and trends: algorithms, methods, and techniques,
IGI global, 2010, pp. 242–264.
[25] N. Buduma, N. Buduma, and J. Papa, Fundamentals of deep learning. O’Reilly
Media, Inc., 2022.
[26] F. Rosenblatt, ”The perceptron: A probabilistic model for information storage
and organization in the brain.”, Psychological review, vol. 65, no. 6, p. 386,
1958.
[27] S. R. Dubey, S. K. Singh, and B. B. Chaudhuri, ”Activation functions in deep
learning: A comprehensive survey and benchmark”, Neurocomputing, 2022.
REFERENCES 67
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ”Imagenet classification with
deep convolutional neural networks”, Advances in neural information process-
ing systems, vol. 25, 2012.
[29] J. Schmidhuber, ”Annotated history of modern ai and deep learning”, arXiv
preprint arXiv:2212.11279, 2022.
[30] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of
Brain Mechanisms (Cornell Aeronautical Laboratory. Report no. VG-1196-G-
8). Spartan Books, 1962.
[31] D. Svozil, V. Kvasnicka, and J. Pospichal, ”Introduction to multi-layer feed-
forward neural networks”, Chemometrics and intelligent laboratory systems,
vol. 39, no. 1, pp. 43–62, 1997.
[32] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, ”Dive into deep learning”,
arXiv preprint arXiv:2106.11342, 2021.
[33] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, ”Object detection with deep learn-
ing: A review”, IEEE transactions on neural networks and learning systems,
vol. 30, no. 11, pp. 3212–3232, 2019.
[34] A. Dhillon and G. K. Verma, ”Convolutional neural network: A review of mod-
els, methodologies and applications to object detection”, Progress in Artificial
Intelligence, vol. 9, no. 2, pp. 85–112, 2020.
[35] D. J. MacKay, Information theory, inference and learning algorithms. Cam-
bridge university press, 2003.
[36] R. Girshick, ”Fast r-cnn”, in Proceedings of the IEEE international conference
on computer vision, 2015, pp. 1440–1448.
[37] B. T. Polyak, ”Some methods of speeding up the convergence of iteration
methods”, Ussr computational mathematics and mathematical physics, vol. 4,
no. 5, pp. 1–17, 1964.
REFERENCES 68
[38] D. P. Kingma and J. Ba, ”Adam: A method for stochastic optimization”, arXiv
preprint arXiv:1412.6980, 2014.
[39] J. Duchi, E. Hazan, and Y. Singer, ”Adaptive subgradient methods for online
learning and stochastic optimization.”, Journal of machine learning research,
vol. 12, no. 7, 2011.
[40] T. Tieleman, G. Hinton, et al., ”Lecture 6.5-rmsprop: Divide the gradient by
a running average of its recent magnitude”, COURSERA: Neural networks for
machine learning, vol. 4, no. 2, pp. 26–31, 2012.
[41] W. Zhiqiang and L. Jun, ”A review of object detection based on convolutional
neural network”, in 2017 36th Chinese control conference (CCC), IEEE, 2017,
pp. 11 104–11 109.
[42] R. Szeliski, Computer vision: algorithms and applications. Springer Nature,
2022.
[43] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ”Rich feature hierarchies for
accurate object detection and semantic segmentation”, in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2014, pp. 580–
587.
[44] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, ”Selec-
tive search for object recognition”, International journal of computer vision,
vol. 104, pp. 154–171, 2013.
[45] S. Ren, K. He, R. Girshick, and J. Sun, ”Faster r-cnn: Towards real-time ob-
ject detection with region proposal networks”, Advances in neural information
processing systems, vol. 28, 2015.
[46] W. Liu, D. Anguelov, D. Erhan, et al., ”Ssd: Single shot multibox detector”, in
Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The
REFERENCES 69
Netherlands, October 11–14, 2016, Proceedings, Part I 14, Springer, 2016,
pp. 21–37.
[47] K. Simonyan and A. Zisserman, ”Very deep convolutional networks for large-
scale image recognition”, arXiv preprint arXiv:1409.1556, 2014.
[48] K. He, X. Zhang, S. Ren, and J. Sun, ”Deep residual learning for image recog-
nition”, in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 770–778.
[49] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, ”Fea-
ture pyramid networks for object detection”, in Proceedings of the IEEE con-
ference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[50] R. Padilla, S. L. Netto, and E. A. Da Silva, ”A survey on performance metrics
for object-detection algorithms”, in 2020 international conference on systems,
signals and image processing (IWSSIP), IEEE, 2020, pp. 237–242.
[51] T. Fawcett, ”An introduction to roc analysis”, Pattern recognition letters,
vol. 27, no. 8, pp. 861–874, 2006.
[52] COCO Detection Evaluation, https://cocodataset.org, Accessed: 2023-12-14.
[53] TensorFlow 2 Detection Model Zoo, https://github.com/tensorflow/
models/blob/master/research/object_detection/g3doc/tf2_detection_
zoo.md, Accessed: 2023-10-31.
[54] TFRecord, https://www.tensorflow.org/tutorials/load_data/tfrecord,
Accessed: 2023-11-15.
[55] A. Krizhevsky, ”One weird trick for parallelizing convolutional neural net-
works”, arXiv preprint arXiv:1404.5997, 2014.
[56] J. Huang, V. Rathod, C. Sun, et al., ”Speed/accuracy trade-offs for mod-
ern convolutional object detectors”, in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2017, pp. 7310–7311.
REFERENCES 70
[57] TensorBoard, https://github.com/tensorflow/tensorboard, Accessed:
2023-11-15.
[58] M.-H. Haghbayan, F. Farahnakian, J. Poikonen, et al., ”An efficient multi-
sensor fusion approach for object detection in maritime environments”, in 2018
21st International Conference on Intelligent Transportation Systems (ITSC),
IEEE, 2018, pp. 2163–2170.
[59] M. Bertozzi, L. Bombini, P. Cerri, P. Medici, P. C. Antonello, and M. Miglietta,
”Obstacle detection and classification fusing radar and vision”, in 2008 IEEE
Intelligent Vehicles Symposium, IEEE, 2008, pp. 608–613.