Camera + LiDAR Sensor Fusion Methods for Semantic Segmentation in Autonomous Driving A Literature Review Department of Mechanical and Materials Engineering Bachelor's thesis Author: Joonas Sallmén Supervisors: M.Sc. Carlos Roberto Cueto Zumaya Prof. Wallace Moreira Bessa 16.5.2025 Turku The originality of this thesis has been checked in accordance with the University of Turku quality assurance system using the Turnitin Originality Check service. Bachelor's thesis Subject: Mechanical Engineering Author: Joonas Sallmén Title: Camera + LiDAR Sensor Fusion Methods for Semantic Segmentation in Autonomous Driving Supervisor(s): M.Sc. Carlos Roberto Cueto Zumaya, Prof. Wallace Moreira Bessa Number of pages: 28 pages Date: 16.5.2025 Autonomous vehicles require perception systems that are highly accurate, robust and capable of processing data in real-time to ensure reliable operation. To fulfill these requirements, autonomous systems can benefit from sensor fusion for semantic segmentation. The focus of this thesis is on the discussion about the fusion of a common sensor combination in autonomous vehicles; cameras and LiDAR. This thesis starts by giving an overview of semantic segmentation and sensor fusion in autonomous vehicles. Then follows explanations on fusion approaches (early, mid, late and asymmetric fusion) and common architecture types that serve as a basis for most of the fusion methods. The thesis ends with the literature review focused on deep learning-based fusion methods for camera and LiDAR, followed by a discussion on limitations of approaches and future directions. Key words: semantic segmentation, sensor fusion, deep learning, autonomous driving, camera, LiDAR Table of contents 1 Introduction 4 2 Fundamentals and Background 5 2.1 Semantic Segmentation 5 2.2 Sensor Fusion in Autonomous Driving 6 2.3 Sensor Fusion Approaches 7 2.3.1 Early Fusion 7 2.3.2 Mid-Fusion 8 2.3.3 Late Fusion 8 2.3.4 Asymmetric Fusion 9 3 Literature Review 10 3.1 Architecture Types 10 3.1.1 Convolutional Neural Network (CNN) 10 3.1.2 Common CNN-based Architectures 11 3.1.3 Vision Transformer (ViT) 13 3.2 Review of Recent Sensor Fusion Methods for Semantic Segmentation 15 3.2.1 Methods for Early Fusion 15 3.2.2 Methods for Mid-Fusion 17 3.2.3 Methods for Late Fusion 18 3.2.4 Methods for Asymmetric Fusion 19 4 Discussion 21 4.1 Limitations of Existing Approaches 21 4.2 Trends and Potential Future Directions 22 4.3 Conclusion 23 5 References 24 4 1 Introduction Semantic segmentation is a task in computer vision, where an image is divided into segments or areas of interest by pixel-wise labeling [1]. It has multiple different use cases ranging from satellite image analysis, medical industry and the automotive industry, where computer vision is getting increasingly vital with the rise of autonomy in traffic. Autonomous vehicles rely on sensor systems, through which they can acquire accurate, robust and real-time information about their surroundings [2]. However, individual sensors have their strengths and weaknesses. Cameras provide RGB data but lack depth information and fail in low light conditions. LiDAR, on the other hand, contains the depth information needed, but point clouds can get sparse, and they are missing the RGB information that a camera has [3]. This is where the importance of sensor fusion in intelligent perception becomes evident. Sensor fusion is the act of fusing complementary data from multiple sources together for improved perception [4]. Currently, the field is developing swiftly, and new methods used for sensor fusion are coming out at a steady pace. Especially with the rise of methods based on the vision transformer (ViT), a new and updated literature review on the topic is needed. The goal of this literature review is to review recent sensor fusion methods that could be used in autonomous driving systems that are equipped with cameras and LiDAR. This composition was chosen because the two sensors are widely used in autonomous vehicles [3]. The focus is on gathering new deep learning -based methods and grouping them by fusion approach (early, mid, late and asymmetric fusion). It was considered important to include approaches that use recent methods like self-supervised learning and ViTs. This thesis is structured in the following way: Section 2 gives an overview of the fundamental concepts that the thesis is relies on; Section 3 explains the main architecture types used in the fusion methods and contains the literature review that is organized by fusion approach; Section 4 discusses the strengths and limitations of existing approaches, highlights trends in the field, speculates future directions and ends with a conclusion of the thesis. 5 2 Fundamentals and Background 2.1 Semantic Segmentation Semantic segmentation is a core topic in the field of computer vision and serves as the basis for many complex visual tasks [5]. Compared to traditional image classification where a single label is assigned to an image or object detection which identifies objects within bounding boxes, in semantic segmentation each pixel in an image is labeled. It is used to get a detailed understanding of visual scenes in a wide range of application areas, such as autonomous driving, medical imaging and satellite image analysis. In autonomous driving, semantic segmentation is used to identify surfaces and objects from the surrounding environment. Ensuring safe navigation is crucial when autonomous vehicles operate alongside other road users and pedestrians [5]. Figure 1 shows an example of a segmented 2D image. Figure 1: Segmented image [6], licensed under CC BY 4.0 Earlier approaches to semantic segmentation depended on machine learning methods such as Support Vector Machines (SVM) and Conditional Random Fields (CRF) which required extensive feature engineering and pre-processing [5]. SVMs work by representing pixels as feature vectors and separating these vectors into classes by using a hyperplane. For example, this hyperplane could separate vector representations of pixels that represent the road and pixels that do not. CRFs are used to create probabilistic models for finding connections between pixels after they have been individually labeled by another method like an SVM, therefore leading to a better segmentation result. While CRFs can be used as a post-processing layer, on their own 6 these traditional methods tend to be inefficient for complex problems such as segmenting multiple objects in an image [5]. Due to recent developments in deep learning, convolutional neural networks (CNN) took over and have become the most common method for semantic segmentation, offering significant improvements in segmentation accuracy while simultaneously decreasing the need for human involvement [7]. There are various deep learning architectures used for semantic segmentation, but many of them are designed around the working principle of the CNN. A CNN consists of a convolutional layer that filters input data (pixels) and detects patterns, an activation layer that learns relevant features automatically, a pooling layer reduces the size of the image while keeping all the most important information and a fully connected layer makes the decision on the final segmentation result [7]. New methods like the ViT have shown very promising results against CNN-based methods and are a key focus in ongoing research. 2.2 Sensor Fusion in Autonomous Driving Sensor fusion is the process of combining data from multiple sensors. It is used to reduce uncertainty in the obtained information by leveraging strengths of different sensors [3]. In this thesis, we are focusing on the fusion of cameras and Light Detection and Ranging sensors (LiDAR) because they are widely used in autonomous vehicles [3]. Figure 2 demonstrates the information we can acquire from a 2D camera and LiDAR. Figure 2: Camera image (left) and LiDAR point cloud (right), modified from [4] © IEEE (2024) Perception systems must meet the following criteria; high accuracy, high robustness and rapid real-time processing. There is little room for uncertainty, and sensor fusion is a crucial technology to overcome this problem [3]. Cameras provide a 2D image with color and texture information, but they are vulnerable to low light conditions and lack depth information. Sensors like LiDAR, RADAR and Ultrasonic sensors are good complementary sensors with cameras 7 since they provide the information about depth and surface reflectivity that cameras do not capture [3]. LiDAR uses pulses of laser light to sense the distance between the sensor and surfaces around it. Based on this information gathered, it creates a 3D point cloud (3DPC) where each point has its position in the world coordinate system and an intensity value that represents the reflectivity of the surface where the laser made contact [3]. LiDAR provides accurate measurements and works in low-light conditions where normal cameras struggle. However, LiDAR data is increasingly sparse depending on the distance to the points measured. 2.3 Sensor Fusion Approaches Sensor fusion approaches are typically divided into three categories based on the point in time when the information is fused. These three categories are early fusion, mid-fusion and late fusion [3], [4]. We are going to discuss each one of these more in depth and finally go over asymmetric fusion between modalities. Figure 3 contains the fusion approaches. Figure 3: Traditional Fusion Approaches 2.3.1 Early Fusion Early fusion combines sensor data before any features have been extracted. It can also be called data-data fusion because all modalities are fused at the data level [3]. This is achieved by making all inputs follow the same coordinate system, merging them into a unified tensor, and extracting features from it [4], [8]. For camera images and 3DPC this could mean turning the point cloud into a 2D representation of the points before merging or vice versa. Two common ways to achieve this are spherical projection and perspective projection. In spherical projection, LiDAR points are projected to a spherical map, where each point contains values for horizontal angle, vertical angle, depth and reflectivity, thus allowing direct point to pixel correspondences. In perspective projection, LiDAR points are projected to the camera image, but the downside of this method is that usually point clouds are sparse and require interpolation to get a corresponding point to every pixel [9]. These issues and growth of computational costs when 8 data gets more complex, are the main challenges with early fusion. Fusing information at an early stage can be advantageous since the fusion happens before pre-processing, causing only minimal information losses [3]. 2.3.2 Mid-Fusion Mid-fusion happens after features have been extracted, which is why it is commonly called feature-feature fusion [3]. Data from each sensor is processed independently and after one or more intermediate layers features are extracted. After feature extraction, feature maps from each sensor are merged using concatenation (combining feature maps into a single longer list), element-wise addition (summing corresponding elements), or more advanced strategies [3], [4]. Mid-fusion is advantageous, especially when sensors capture different types of information. Extracting features before fusion allows better performance, since the data is used in its original form [3]. The time for fusion is trivial because it can happen anywhere in between the data layer and final predictions. However, determining the right time to fuse for best performance can be challenging [4]. 2.3.3 Late Fusion Late fusion combines sensor data after all modalities have been processed independently, and they have made their own final decision outputs. Outputs of each modality are then merged using weighted averaging, voting schemes or additional fusion networks to make the final prediction [3], [4]. Late fusion simplifies changing the sensor composition since the predictions are combined directly without requiring complex intermodality networks [4]. Moreover, compared to early fusion, late fusion is less sensitive to minor data misalignments since each modality is processed independently. However, challenges with late fusion include not benefiting from synergies between different sensors, which means that mid- and early fusion can make better predictions in some situations. Additionally, if different sensors make contradicting predictions, it can lead to an inconsistent or inaccurate final prediction after the fusion [3], [8]. 9 2.3.4 Asymmetric Fusion Asymmetric fusion refers to data being fused when modalities are in different stages of processing. In this case, one of the modalities acts as the primary source of information (e.g., LiDAR), while other sources (e.g., cameras) are there to support segmentation by providing context [3], [4]. Fusing data asymmetrically is beneficial since the fusion process can happen using lightweight methods, therefore reducing computational load [3] . Figure 4: Convolutional Neural Network [10], licensed under CC BY 4.0 10 3 Literature Review 3.1 Architecture Types The methods we will discuss in the literature review are in most situations based on some form of a CNN or a ViT. It is typically a modification or combination of either one, but in the next three subsections, we will discuss these architecture types in detail. 3.1.1 Convolutional Neural Network (CNN) CNN is a type of neural network that is widely used in tasks related to computer vision [5]. Figure 4 shows the structure of a CNN. A CNN consists of multiple layers, but the foundation of this kind of network is based on three types of layers: The Convolutional layer is the first layer of a CNN. It takes a tensor representation of an image as input, typically consisting of three matrices, one for each RGB channel. The matrices consisting of pixel values is then filtered using a convolutional kernel, which is a basic function of a CNN. A convolutional kernel is commonly a 3x3 matrix that stores learnable weights used for pattern detection (e.g., edges and textures). As the kernel slides over an image, it calculates dot products between pixel values and kernel values inside each local patch, resulting in a feature map. Moreover, multiple kernels can be applied in the convolutional layer to generate multiple feature maps. By adding several convolutional layers, the network can progressively detect features of increasing complexity, from small edges to larger structures [7]. Pooling layer is used for reducing dimensions in the image. It uses a filter to go over the image and either selecting the maximum value inside the filter window as the output value (max pooling) or calculating an average of the values inside the filter to produce an output value (average pooling). Fully connected layer is the final layer of the network. It connects all the nodes from the final pooling layer to produce a final output, giving a classification result using an activation function like the softmax function. Unlike traditional CNNs used for classification, semantic segmentation requires each pixel to be individually predicted instead of a single class prediction [5]. Instead of fully connected layers, segmentation networks use upsampling layers to restore spatial resolution and assign 11 class labels to each pixel [5]. Many architectures used for segmentation replace the fully connected layer with deconvolution layers, bilinear upsampling, and skip connections to further improve the segmentation output. Next, we are going to describe a few common architectures based on the CNN used for segmentation. 3.1.2 Common CNN-based Architectures CNN-based encoder-decoder [5], [7] - This architecture consists of two parts: an encoder and a decoder. During the encoding phase, the network extracts features while simultaneously reducing spatial dimensions of the feature map. These objectives are achieved through multiple convolutional and pooling layers with the former handling feature extraction and the latter responsible for downsampling. The decoder upsamples the feature map back to its original dimensions and uses transposed convolutions, skip connections or upsampling layers to refine the result. Figure 5 presents an example of the encoder-decoder, SegNet [11]. Figure 5: Example of the encoder-decoder architecture, SegNet [11], licensed under CC BY 4.0 Dilated convolution-based [7] - In this architecture the spatial resolution remains constant throughout the convolution process and feature extraction happens by changing the dilation rate of the convolution kernel. The dilation rate refers to the amount of empty space in between the pixels in the convolution kernel. Having a higher dilation rate allows the method to capture broader structures, while a lower dilation rate can be used for finer details. Figure 6 shows the effect of changing the dilation rate of the convolutional kernel. 12 Figure 6: 3x3 convolutional kernel with dilation rates a=0, b=1 and c=2 [12], licensed under CC BY 4.0 Multi-scale feature fusion [7] - There are two main strategies to multi-scale feature fusion: parallel multibranch networks and skip connections. In parallel multibranch networks, input features are processed in multiple different scales concurrently for detecting features of varying sizes. Outputs of each branch are then merged to create a comprehensive feature representation. In skip connections, early layers of the network are fused with deeper layers using a connection that skips over the layers in between. Figure 7 demonstrates the CLFusion method [13] utilizing both parallel networks and skip connections. The fusion module gradually combines features from two parallel encoders, while skip connections link corresponding layers between the encoder and decoder. Figure 7: CLFusion 3D segmentation method [13] © (2024) IEEE 13 3.1.3 Vision Transformer (ViT) The first large use case for transformers was natural language processing, where they proved to be a great success [14]. CNN-based image segmentation methods suffer from low resolution in the final output due to many pooling and convolution layers, which led to the development of the ViT [15]. Compared to CNN-based methods that learn local features, ViTs can map long- range dependencies, leading to more accurate segmentation results [15]. Transformers extract features via self-attention, whereas CNNs do so using convolutional kernels. Although ViTs have superior performance to CNNs, their downside is that they require large amounts of data to function properly [16]. A ViT has three stages: patch embedding, transformer encoder and classification. Figure 8 contains a visual representation of the architecture. Patch embedding is the first stage in a ViT. During embedding, images are split into fixed size patches, typically 16x16 pixels. These patches are flattened into a one-dimensional vector form, and through linear projection they are mapped into fixed-size feature space that can be processed by the transformer encoder [16]. Before feeding patches into the encoder, their position in the image is memorized through positional embedding, which means we have a matrix containing all the positions of different patches [16]. Figure 8: Vision Transformer Architecture [17], licensed under CC BY-NC-ND 4.0 14 The Transformer encoder receives the patches and processes them through multiple blocks containing three layers: • Multi-Head Self-Attention (MHSA) captures long-range dependencies between patches. It takes the patch embedding and creates 3 separate vectors based on it: query, key and value. They help the model to decide which patches should focus on each other. Because it is multi-headed, each input can be split into multiple heads, each creating their own query, key and value vectors [14]. • Multilayer Perceptron (MLP) expands the MHSA output dimensionality, applies a non- linear activation function and projects the image back to its original size. This process helps keep the network effective and preserve information [14]. • Normalization layer stabilizes inputs and improves convergence. Classification happens in the final stage of a typical ViT [16]. Similarly to the CNN structure, this last section needs to be replaced with a decoder that handles the upsampling for segmentation instead of a single class output. 15 3.2 Review of Recent Sensor Fusion Methods for Semantic Segmentation This section extends on the concepts introduced in Section 2.3 by categorizing and reviewing sensor fusion methods for semantic segmentation in autonomous driving. Methods are grouped by fusion approach (early, mid, late and asymmetric) and each subsection focuses on how LiDAR and camera data are integrated within the methods. It is followed by a discussion, where strengths and limitations of early, mid, late and asymmetric fusion are discussed. A summary of the methods grouped by fusion approach is introduced in Table 1. This section follows a similar structure to Section 2.3, starting with early fusion and finalizing it with asymmetric fusion methods. 3.2.1 Methods for Early Fusion The Object-based Inverse Projection Algorithm (OIPA) [18] is a method that can handle fusion of camera images with either 3D or 2D point cloud, and it uses algorithmic approach to achieve this without deep learning-based methods. It consists of two different parts: calibration and segmentation. During the calibration part, this method uses algorithms like Line Segment Detection (LSD) [19] to extract lines and ellipse-shaped features (road signs) from the 2D image. From the 2D LiDAR point cloud projection, the location of edges of the detected features are estimated with geometric calculations. As a final step, a projection matrix is created to encode the alignment of camera and LiDAR. In the segmentation part, objects are first detected with bounding boxes from 2D image using the object detection method YOLO-v10 [20]. This detection result is then inversely projected back into the LiDAR point cloud to corresponding 3D regions. Bounding boxes are used to focus on segments of the cloud where objects are located, and points outside the focused regions can be removed. This allows object-focused point cloud segmentation. XYZDIRGB [21] is a fusion method that is built on top of the SqueezeSeg [22], a CNN-based method used for point cloud segmentation. In this method, the LiDAR point cloud is converted into a polar grid map, which is a tensor representation of the cloud. In the tensor each row corresponds to a vertical LiDAR layer, each column is a horizontal angular step over a 360- degree field of view and the third dimension contains depth and reflectance values. The third- dimension values are concatenated to RGB values from the camera image during the fusion process. After concatenation this tensor is fed to SqueezeSeg which performs the feature extraction and segmentation. 16 Table 1: Sensor Fusion Methods Method Name Year Architecture Output Domain Distinctive Feature Ea rly F us io n OIPA [18] 2025 Non-DL 2D/3D Inverse projection of bounding boxes UNRLF [23] 2023 CNN 2D Used for road segmentation XYZDIRGB [21] 2019 CNN 3D Uses polar grid mapping M id -F us io n CLFusion [13] 2024 CNN + ViT 3D Self-supervised training MFSA-Net [24] 2024 CNN + Attention 3D Dual-distance attention feature aggregation CLFT [25] 2024 ViT 2D First open-source transformer-based method for camera + LiDAR CMX [26] 2023 ViT 2D Cross-modal feature rectification and can be used with various sensor configurations PMF [27] 2021 CNN 3D Perspective projection FuseSeg [28] 2019 CNN 3D Feature warping fusion XYZDI+DIRGB [21] 2019 CNN 3D Uses polar grid mapping La te F us io n PCR6+RF [29] 2024 CNN 2D Designed for multimodality EDF [29] 2024 CNN 2D Shannon entropy for decision-making SSCLF [30] 2021 CNN 3D Semi-supervised learning As ym m et ric F us io n PFN [31] 2024 CNN 3D First fusion method for 3D panoptic segmentation LIF-Seg [32] 2024 CNN 3D Coarse, offset and refinement structure CMDFusion [33] 2024 CNN 3D 2D to 3D and 3D to 2D (bidirectional) fusion scheme for 3D feature enhancement MPFN [34] 2023 CNN + Attention 3D Weakly supervised training for pixel- wise labels 17 U-Net-based RGB and LiDAR Fusion (UNRLF) [23] is a method used for road segmentation. Three variations of the same CNN-based method were created, and early fusion was found to be the most effective. Depth information from LiDAR is concatenated with RGB values of image pixels, and this 4-channel input is processed through the U-Net [35]. 3.2.2 Methods for Mid-Fusion XYZDI+DIRGB [21] is a mid-fusion version of the method XYZDIRGB discussed earlier in this review, and they were introduced in the same paper. In this method, features are extracted from both sensors in their own encoders, the resulting feature maps are concatenated in the third dimension of the tensor and segmentation happens in a shared decoder. This method was considered to be a worse option to the early fusion method, since it is computationally more complex and offers only a slightly improved accuracy. Perception-aware Multi-sensor Fusion (PMF) [27] is a fusion method, where a 3DPC is projected into a camera coordinate system and features are extracted from modalities individually using a two-stream network. These streams are connected through residual-based fusion networks that are used for fusing features from both modalities together. Camera-LiDAR Fusion Transformer (CLFT) [25] is a ViT-based fusion method designed for semantic segmentation applications in autonomous driving. It works by processing camera and LiDAR inputs separately in the ViT encoder stage, and features are fused using a cross-fusion strategy. This means that they are not fused at a single point, and the fusion is an ongoing process during the whole decoder stage. Cross-Modal Fusion for RGB-X (CMX) [26] is a transformer-based fusion method, where RGB- X refers to the fact that this method can be used for fusing camera RGB data to various other modalities besides LiDAR. It works through a similar two-stream network used in PMF [27]. The difference in the CMX comes from the ability of features to rectify each other using Cross- Modal Feature Rectification units if one of the modalities is providing noisy information. After feature rectification, same level features are fused, and a similar structure continues throughout the network until the decoder. FuseSeg [36] is a fusion method that is built as an extension to SqueezeSeg [22], the same method that was used for point cloud segmentation in the XYZDIRGB [21]. SqueezeSeg obtains information about reflectance, range and three-dimensional coordinates by using spherical projection. The camera image is processed through a CNN, features are extracted in 18 multiple layers, and they are concatenated with LiDAR representation using point correspondences between the two. CLFusion [13] is a two-stream encoder-decoder method used for point cloud segmentation. It consists of 4 different parts: camera, LiDAR, fusion and supervision. Camera and LiDAR are processed by separate CNNs, in which features are gradually fused in the encoder and each layer uses skip connections to the corresponding decoder layers. Following the fusion of corresponding layers in the camera and LiDAR pipelines, these fused features are fed into the Swin Transformer [37] to obtain attention-based features. Then convolutional and attention- based features are injected back into the camera and LiDAR networks using sensor-specific weights. Finally, the combined features are upsampled in the decoder to produce the output. A visual representation of the CLFusion architecture is shown in Figure 7. CLFusion can be trained self-supervised on unlabeled or partially labeled datasets using a pretrained network for creating pseudo-labels. The network used is a modified version PIDNet [38], and it works by performing semantic segmentation in real-time with a layer that assigns confidence values to predictions. These confidence-based calculations are then used for filtering, allowing only high confidence labels to end up in the training set. MFSA-Net [24] is a mid-fusion method that contains three different modules: DDSA3D, CATSR and perception-aware loss module. The method projects 3DPC to 2-dimensional space, and the nearest neighbors to each point are located using DDSA3D. The same method is used for utilizing feature- and Euclidian distances to gain information about local contextual feature information among the points. The extracted features are injected into the LiDAR stream in a two-stream network that is based on cross-attention. After both modalities have been processed through the network, the method uses a perceptual-aware loss module to calculate confidence values for each branch in the network, considering that camera pixels in the middle versus the edges of an object have a large difference in semantic relevance. 3.2.3 Methods for Late Fusion PCR6+ Rule-based fusion (PCR6+RF) [29] uses 2 identical CNN-based encoder-decoder architectures for segmentation to produce softmax probability distributions for both camera and LiDAR inputs. Both softmax outputs are converted to basic belief assignments (BBA), which are a more flexible alternative to traditional probabilities. They allow belief to be assigned to single outcomes or combinations of outcomes. They are then merged using the PCR6+ fusion rule, which is designed to handle conflicting information from multiple sources. It identifies 19 areas where the methods disagree and shares the uncertainty depending on how confident each method was in its prediction. The result is a belief distribution for each pixel, and classes with the highest belief can be selected as the final prediction for that pixel. Entropy-weighted decision fusion (EDF) [29] uses similar CNN-based encoder-decoder architectures as PCR6+ method for generating softmax probability distributions on camera and LiDAR inputs. Instead of converting outputs to belief assignments like PCR6+ Rule-based fusion [29], it makes early decisions based on the class with the highest softmax value for each input source. Each decision is then assigned a confidence weight calculated using Shannon entropy (lower entropy indicates higher confidence). The final class for each pixel is determined by averaging the two decisions, weighted by their respective entropies. Semi-Supervised LiDAR-Camera Fusion (SSLCF) [30] is a fusion method based on the fully convolutional network FCN-ResNet50 [39]. The method contains 3 separate networks: camera, LiDAR and fusion. Features are extracted from the camera and LiDAR branch during the first 4 stages of the FCN, and then they are concatenated in the fusion network, which will proceed with the final prediction. This method takes advantage of semi-supervised learning by training the fusion network branch on labeled data and using the trained network to create predictions or proxy labels for unlabeled data. These labels are then considered as the ground truth when training the single modality networks, reducing the need for manual annotation. 3.2.4 Methods for Asymmetric Fusion Panoptic FusionNet (PFN) [31] is a fusion method used for panoptic segmentation. It means that rather than focusing solely on semantic segmentation, the method can also detect individual instances of an object from the same class (instance segmentation) and create a combined result. It works by first processing camera and LiDAR in separate networks and extracting features from both sensors. Voxel features from LiDAR and 2D features from the camera are fused using a correspondence table, which associates corresponding sections from each sensor together. After fusion, the method has 3 parts: semantic head, instance head and a panoptic processing module. Semantic segmentation happens in the semantic head by concatenating voxel-wise global features with point-wise features and passing them through multiple MLPs. Multi-Phase Fusion Network (MPFN) [34] is a fusion method that combines mid and late fusion. In this method, two separate networks are used to process LiDAR and camera data, the one used for LiDAR being the main network and the camera providing complementary 20 information. Before injecting the extracted image features to the point cloud, this method uses an attention-based feature fusion module to filter off irrelevant features. Finally, late fusion happens to both branches, where each pixel is given a confidence value and merged, resulting in the final output. The specialty of this method is that it takes advantage of weak supervision to address the lack of pixel-wise labels in the datasets used for training these networks. This is achieved through projecting LiDAR labels into the image space and using weak supervision to generate more labels. LIF-Seg [32] is an asymmetric fusion method used for point cloud segmentation. It is a coarse- to-fine framework that involves 3 stages called coarse feature extraction, offset learning and refinement. In the first stage, LiDAR points are projected onto camera images and concatenated with the corresponding image context information. A LiDAR segmentation network is then used, leading to coarse features. The second stage takes care of alignment by first segmenting the camera image on a separate segmentation network and in the end projecting the segmentation result back into 3D space where it is concatenated with the coarse features from the previous stage. In the last stage, the fusion result from the second stage is fed into a similar segmentation network that was used in the first stage, creating the final prediction. CMDFusion [33] is a bidirectional fusion method designed for 3D semantic segmentation. Bidirectional refers to the fact that this method uses a Bidirectional Fusion Block (BFB) to benefit from projecting a 2D knowledge branch into a 3DPC and vice versa. This way, features in the three-dimensional space can be enhanced directly and indirectly. The 2D-to-3D projection is used for the main 3D feature extraction (direct) and features from 3D-to-2D projection are used to enhance the main features (indirect). It also introduces a Cross-Modality Distillation (CMD) that allows the LiDAR network to be trained in a way that it can remember information from the camera network (2D knowledge branch). When camera images are not available in a certain direction, CMD can use the 3DPC to generate the information. 21 4 Discussion In this section, we will discuss approaches to sensor fusion and potential future directions where the field is heading. Table 2 summarizes the advantages and disadvantages of sensor fusion approaches. The trends in the field that we will discuss, are mainly trying to address the limitations of current approaches. 4.1 Limitations of Existing Approaches When carrying out the research for this literature review, it became obvious that mid-fusion is a popular approach in the current fusion methods. In the methods designed for complex scenarios, mid-fusion or a hybrid version of this approach appears to be a valid option. Mid- fusion offers the perfect balance, it is not as sensitive to misalignment as early fusion and still takes advantage of information available by the multiple modalities [3]. Table 2: Fusion Approaches Summary Advantages Disadvantages E a rl y F u s io n Fusion happens on raw data, leveraging all the available information Requirements for memory and computing power are low since modalities are processed simultaneously Changes to sensor configuration are difficult to execute, because it requires retraining the network Prone to data misalignment that can be caused by faulty calibration, sensor malfunction and sampling rate mismatch M id -F u s io n Can have a better perceptual understanding, because fusion happens at the feature-level, highly flexible Choosing the optimal time to fuse is difficult L a te F u s io n Configuration can be changed easily, because every sensor is trained individually High costs in computation and memory Does not take full advantage of synergy between sensors 22 The majority of the more complex methods included in the literature review have found unique ways of aligning multimodal data to minimize information losses and errors in the segmentation result. Still, sensor misalignment is a big issue, and it is difficult to do it perfectly. CNN and ViT hybrid mehtods were popular because the former is better at small details, while the latter captures larger features. The methods require a lot of data for training, and many of the methods in the literature review have already implemented ways to address this problem, like CLFusion [13], that implements self-supervised learning, or SSCLF [30] and MPFN [34], that have developed their own ways of using weakly supervised learning. There is still a lot of work to be done, and the more labeled datasets for training will lead to better overall segmentation performance. 4.2 Trends and Potential Future Directions There has been a clear trajectory in semantic segmentation from traditional machine learning methods to CNNs and recently the ViT. CNNs gained popularity due to automatic feature extraction, and now ViTs have become interesting because of their ability to understand the features globally. Driven by high success rate and strong performance of the ViTs, they are one of the main research focuses for visual tasks [15]. The downside of ViTs is that they require large, labeled datasets for training and especially in semantic segmentation where the focus is on pixel-wise predictions. A future direction in the field is to investigate ways of how to overcome this, and there are several potential solutions. Weakly supervised and self-supervised learning methods are one of the promising solutions to the dataset problem [9], [40]. Weakly supervised refers to datasets that are only partially labeled or contain noisy data, while self-supervised means that the network can work on the data even when the data is completely unlabeled [1]. These methods are still in the development phase and have not yet been able to surpass fully supervised methods [40], but as seen in the literature review, methods like [13], [30], [34] are starting to implement these methods. Another potential future direction to address the challenge of the methods requiring a large amount of labeled data is open vocabulary segmentation, which leverages visual language models (VLMs). CLIP-based [41] VLMs can perform segmentation using natural language and visual feature pairs without the need for pixel-wise labels. Examples of methods already doing this include LSeg [42], ZegFormer [43] and MaskCLIP [44]. There are examples of visual language models being used to create pseudo-labels for data that is used to train other 23 segmentation methods in a weakly supervised way (MaskCLIP+ [45], OVSeg [44]), similar to what CLFusion [13] proposed. The future potential of visual language models has a lot to offer. Finally, researchers have begun exploring state space models (SSMs) for visuals tasks as an alternative to the computationally challenging ViT. Implementing the Mamba architecture for vision [46] has shown promising results, by offering notable improvements in terms of efficiency and segmentation accuracy, while still capturing long-range dependencies in the data. However, this is a very recent development, gaining momentum only after the release of the original Mamba architecture [47], and research into its application on visual tasks is still in its early stages. 4.3 Conclusion This thesis provided an overview of sensor fusion approaches (early, mid, late and asymmetric fusion) that could be used in autonomous driving systems equipped with cameras and LiDAR for semantic segmentation. The focus was on gathering new deep learning -based methods and organizing them by fusion approach, showcasing the working principles of each method. During the review, it was found that although ViTs have gained a substantial amount of research interest and are a key research focus in the field, many methods still depend on CNN-based fusion methods. Hybrid architectures combining the CNN and ViT were common, due to CNNs still being the dominant method for detecting smaller local features, while ViTs excel at capturing long-range dependencies. In the future, open vocabulary segmentation [48], weakly supervised and self-supervised methods have a lot of potential to drive this highly data dependent field forward. 24 5 References [1] S. Minaee, Y. Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image Segmentation Using Deep Learning: A Survey,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–1, 2021, doi: 10.1109/TPAMI.2021.3059968. [2] D. Feng et al., “Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges,” IEEE Trans. Intell. Transport. Syst., vol. 22, no. 3, pp. 1341–1360, Mar. 2021, doi: 10.1109/TITS.2020.2972974. [3] C. Xiang et al., “Multi-Sensor Fusion and Cooperative Perception for Autonomous Driving: A Review,” IEEE Intell. Transport. Syst. Mag., vol. 15, no. 5, pp. 36–58, Sep. 2023, doi: 10.1109/MITS.2023.3283864. [4] K. Huang, B. Shi, X. Li, X. Li, S. Huang, and Y. Li, “Multi-modal Sensor Fusion for Auto Driving Perception: A Survey,” Dec. 16, 2024, arXiv: arXiv:2202.02703. doi: 10.48550/arXiv.2202.02703. [5] Y. Guo, G. Nie, W. Gao, and M. Liao, “2D Semantic Segmentation: Recent Developments and Future Directions,” Future Internet, vol. 15, no. 6, p. 205, Jun. 2023, doi: 10.3390/fi15060205. [6] M. Cordts et al., “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 3213–3223. doi: 10.1109/CVPR.2016.350. [7] Z. Xiao et al., “Research Advances in Deep Learning for Image Semantic Segmentation Techniques,” IEEE Access, vol. 12, pp. 175715–175741, 2024, doi: 10.1109/ACCESS.2024.3496723. [8] B. Marsh, A. H. Sadka, and H. Bahai, “A Critical Review of Deep Learning-Based Multi-Sensor Fusion Techniques,” Sensors, vol. 22, no. 23, p. 9364, Dec. 2022, doi: 10.3390/s22239364. [9] G. Rizzoli, F. Barbato, and P. Zanuttigh, “Multimodal Semantic Segmentation in Autonomous Driving: A Review of Current Approaches and Future Perspectives,” Technologies, vol. 10, no. 4, p. 90, Jul. 2022, doi: 10.3390/technologies10040090. [10] N. Aherwadi, U. Mittal, J. Singla, N. Z. Jhanjhi, A. Yassine, and M. S. Hossain, “Prediction of Fruit Maturity, Quality, and Its Life Using Deep Learning Algorithms,” Electronics, vol. 11, no. 24, p. 4100, Dec. 2022, doi: 10.3390/electronics11244100. 25 [11] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017, doi: 10.1109/TPAMI.2016.2644615. [12] Y. Duan, W. Zhang, P. Huang, G. He, and H. Guo, “A New Lightweight Convolutional Neural Network for Multi-Scale Land Surface Water Extraction from GaoFen-1D Satellite Images,” Remote Sensing, vol. 13, no. 22, p. 4576, Nov. 2021, doi: 10.3390/rs13224576. [13] T. Wang, R. Song, Z. Xiao, B. Yan, H. Qin, and D. He, “CLFusion:3D Semantic Segmentation Based on Camera and Lidar Fusion,” in 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, Singapore: IEEE, May 2024, pp. 1–5. doi: 10.1109/ISCAS58744.2024.10558356. [14] K. Han et al., “A Survey on Visual Transformer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87–110, Jan. 2023, doi: 10.1109/TPAMI.2022.3152247. [15] H. Thisanke, C. Deshan, K. Chamith, S. Seneviratne, R. Vidanaarachchi, and D. Herath, “Semantic segmentation using Vision Transformers: A survey,” Engineering Applications of Artificial Intelligence, vol. 126, p. 106669, Nov. 2023, doi: 10.1016/j.engappai.2023.106669. [16] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” Jun. 03, 2021, arXiv: arXiv:2010.11929. doi: 10.48550/arXiv.2010.11929. [17] J.-H. Bang et al., “CA-CMT: Coordinate Attention for Optimizing CMT Networks,” IEEE Access, vol. 11, pp. 76691–76702, 2023, doi: 10.1109/ACCESS.2023.3297206. [18] X. Yuan, S. Wang, Y. Xie, S. Q. Xie, C. Wang, and T. Xiong, “Object-Based Semantic Fusion Algorithm of Lidar and Camera via Inverse Projection,” IEEE Trans. Instrum. Meas., pp. 1–1, 2025, doi: 10.1109/TIM.2025.3548241. [19] R. G. Von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “LSD: A Fast Line Segment Detector with a False Detection Control,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 4, pp. 722–732, Apr. 2010, doi: 10.1109/TPAMI.2008.300. [20] A. Wang et al., “YOLOv10: Real-Time End-to-End Object Detection,” Oct. 30, 2024, arXiv: arXiv:2405.14458. doi: 10.48550/arXiv.2405.14458. [21] K. E. Madawy, H. Rashed, A. E. Sallab, O. Nasr, H. Kamel, and S. Yogamani, “RGB and LiDAR fusion based 3D Semantic Segmentation for Autonomous Driving,” Jul. 17, 2019, arXiv: arXiv:1906.00208. doi: 10.48550/arXiv.1906.00208. 26 [22] B. Wu, A. Wan, X. Yue, and K. Keutzer, “SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD: IEEE, May 2018, pp. 1887–1893. doi: 10.1109/ICRA.2018.8462926. [23] A. T. Candan and H. Kalkan, “U-Net-based RGB and LiDAR image fusion for road segmentation,” SIViP, vol. 17, no. 6, pp. 2837–2843, Sep. 2023, doi: 10.1007/s11760- 023-02502-5. [24] Y. Duan et al., “MFSA-Net: Semantic Segmentation With Camera-LiDAR Cross- Attention Fusion Based on Fast Neighbor Feature Aggregation,” IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing, vol. 17, pp. 19627–19639, 2024, doi: 10.1109/JSTARS.2024.3472751. [25] J. Gu, M. Bellone, T. Pivoňka, and R. Sell, “CLFT: Camera-LiDAR Fusion Transformer for Semantic Segmentation in Autonomous Driving,” IEEE Trans. Intell. Veh., pp. 1–12, 2024, doi: 10.1109/TIV.2024.3454971. [26] J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers,” IEEE Trans. Intell. Transport. Syst., vol. 24, no. 12, pp. 14679–14694, Dec. 2023, doi: 10.1109/TITS.2023.3300537. [27] Z. Zhuang, R. Li, K. Jia, Q. Wang, Y. Li, and M. Tan, “Perception-Aware Multi- Sensor Fusion for 3D LiDAR Semantic Segmentation,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada: IEEE, Oct. 2021, pp. 16260–16270. doi: 10.1109/ICCV48922.2021.01597. [28] G. Krispel, M. Opitz, G. Waltner, H. Possegger, and H. Bischof, “FuseSeg: LiDAR Point Cloud Segmentation Fusing Multi-Modal Data,” Dec. 19, 2019, arXiv: arXiv:1912.08487. doi: 10.48550/arXiv.1912.08487. [29] D.-V. Giurgi, J. Dezert, T. Josso-Laurain, M. Devanne, and J.-P. Lauffenburger, “Fusion of Semantic Segmentation Models for Vehicle Perception Tasks,” in 2024 27th International Conference on Information Fusion (FUSION), Venice, Italy: IEEE, Jul. 2024, pp. 1–8. doi: 10.23919/FUSION59988.2024.10706336. [30] L. Caltagirone, M. Bellone, L. Svensson, M. Wahde, and R. Sell, “Lidar–Camera Semi-Supervised Learning for Semantic Segmentation,” Sensors, vol. 21, no. 14, p. 4813, Jul. 2021, doi: 10.3390/s21144813. [31] H. Song, J. Cho, J. Ha, J. Park, and K. Jo, “Panoptic-FusionNet: Camera-LiDAR fusion-based point cloud panoptic segmentation for autonomous driving,” Expert 27 Systems with Applications, vol. 251, p. 123950, Oct. 2024, doi: 10.1016/j.eswa.2024.123950. [32] L. Zhao, H. Zhou, X. Zhu, X. Song, H. Li, and W. Tao, “LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation,” IEEE Trans. Multimedia, vol. 26, pp. 1158–1168, 2024, doi: 10.1109/TMM.2023.3277281. [33] J. Cen et al., “CMDFusion: Bidirectional Fusion Network With Cross-Modality Knowledge Distillation for LiDAR Semantic Segmentation,” IEEE Robot. Autom. Lett., vol. 9, no. 1, pp. 771–778, Jan. 2024, doi: 10.1109/LRA.2023.3335771. [34] X. Chang, H. Pan, W. Sun, and H. Gao, “A Multi-Phase Camera-LiDAR Fusion Network for 3D Semantic Segmentation With Weak Supervision,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 3737–3746, Aug. 2023, doi: 10.1109/TCSVT.2023.3241641. [35] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” May 18, 2015, arXiv: arXiv:1505.04597. doi: 10.48550/arXiv.1505.04597. [36] G. Krispel, M. Opitz, G. Waltner, H. Possegger, and H. Bischof, “FuseSeg: LiDAR Point Cloud Segmentation Fusing Multi-Modal Data,” in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA: IEEE, Mar. 2020, pp. 1863–1872. doi: 10.1109/WACV45572.2020.9093584. [37] Z. Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” Aug. 17, 2021, arXiv: arXiv:2103.14030. doi: 10.48550/arXiv.2103.14030. [38] J. Xu, Z. Xiong, and S. P. Bhattacharyya, “PIDNet: A Real-time Semantic Segmentation Network Inspired by PID Controllers,” Apr. 07, 2023, arXiv: arXiv:2206.02066. doi: 10.48550/arXiv.2206.02066. [39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90. [40] Y. Mo, Y. Wu, X. Yang, F. Liu, and Y. Liao, “Review the state-of-the-art technologies of semantic segmentation based on deep learning,” Neurocomputing, vol. 493, pp. 626–646, Jul. 2022, doi: 10.1016/j.neucom.2022.01.005. [41] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”. 28 [42] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven Semantic Segmentation,” Apr. 03, 2022, arXiv: arXiv:2201.03546. doi: 10.48550/arXiv.2201.03546. [43] J. Ding, N. Xue, G.-S. Xia, and D. Dai, “Decoupling Zero-Shot Semantic Segmentation,” Apr. 15, 2022, arXiv: arXiv:2112.07910. doi: 10.48550/arXiv.2112.07910. [44] Z. Ding, J. Wang, and Z. Tu, “Open-Vocabulary Universal Image Segmentation with MaskCLIP,” Jun. 08, 2023, arXiv: arXiv:2208.08984. doi: 10.48550/arXiv.2208.08984. [45] C. Zhou, C. C. Loy, and B. Dai, “Extract Free Dense Labels from CLIP,” Jul. 27, 2022, arXiv: arXiv:2112.01071. doi: 10.48550/arXiv.2112.01071. [46] H. Zhang et al., “A Survey on Visual Mamba,” Applied Sciences, vol. 14, no. 13, p. 5683, Jun. 2024, doi: 10.3390/app14135683. [47] A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” May 31, 2024, arXiv: arXiv:2312.00752. doi: 10.48550/arXiv.2312.00752. [48] C. Zhu and L. Chen, “A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 8954–8975, Dec. 2024, doi: 10.1109/TPAMI.2024.3413013.