Enhancing Explainability and Performance in Intrusion Detection Systems using Deep Learning Models and LLMs

University of Turku
Department of Computing
Master of Science (Tech) Thesis
Cyber Security Engineering
July 2025
Mohd Junaid Ahmed

Supervisors:
Tahir Mohammad (University of Turku)
Ismayil Hasanov (University of Turku)

The originality of this thesis has been checked in accordance with the University of Turku quality assurance system using the Turnitin OriginalityCheck service.

UNIVERSITY OF TURKU
Department of Computing

Mohd Junaid Ahmed: Enhancing Explainability and Performance in Intrusion Detection Systems using Deep Learning Models and LLMs
Master of Science (Tech) Thesis, 80 p., 14 app. p.
Cyber Security Engineering
July 2025

The evolving landscape of cyber security highlights the importance of the effectiveness and transparency of Intrusion Detection Systems (IDS), which play a critical role in protecting computer networks from malicious activities. However, many advanced Machine Learning (ML) models used in IDS are difficult to interpret, which constrains their reliability and practical deployment. This research aims to improve the detection performance and explainability of IDS by combining powerful tabular Deep Learning (DL) models with open-source Large Language Models (LLMs).

In this study, the CSE-CIC-IDS2018 dataset is used as a benchmark for training and evaluating several ML models. The models include TabNet, a DL model specifically designed for tabular data, and various AutoGluon-based models such as a Neural Network implemented with PyTorch (NN_TORCH), Gradient Boosting Machine (GBM), Categorical Boosting (CaTBoost), and Extreme Gradient Boosting (XGBoost). These models are evaluated on their ability to detect different kinds of network intrusions reliably.

After making predictions, the outputs obtained from the models are passed to open-source LLMs, which generate natural language explanations. This step is intended to make the decision-making process of the models more understandable to humans, including security analysts and system administrators. By integrating LLMs into the IDS pipeline, the system not only identifies threats effectively but also explains the reasons behind each prediction in a human-readable format.

The experimental findings indicate that the proposed method delivers high detection accuracy. AutoGluon ensembles attained up to 98.1% accuracy, while TabNet achieved 97.8%. Additionally, the approach provides clear and useful explanations through LLMs. Although performance was not consistent across all minority attack classes, integrating DL with LLMs significantly increased the system's transparency and utility for analysts. This improvement in understandability enhances the system's practical applicability in cyber security contexts.

Keywords: Intrusion Detection Systems, Tabular Deep Learning, TabNet, AutoGluon, Neural Networks, Gradient Boosting, Explainable AI, Large Language Models, CSE-CIC-IDS2018, Cyber Security

Contents

1 Introduction 1
1.1 Background 1
1.2 Problem Statement 2
1.3 Research and Development Need 3
1.4 Research Questions and Objectives 4
1.4.1 Research Questions 4
1.4.2 Objectives 4
1.5 Contributions 5
1.6 Structure of the Thesis 5
1.7 Statement of AI Usage 6

2 Background and Related Work 8
2.0.1 Key AI Concepts 8
2.0.2 Additional Concepts and Evaluation Metrics in AI-driven IDS 9
2.1 Intrusion Detection Systems (IDS) 11
2.1.1 Evolution of IDS Approaches: From Rule-Based to LLM-Integrated Systems 12
2.2 Explainability in Machine Learning 13
2.2.1 Importance in Cyber Security 13
2.2.2 SHAP and LIME 14
2.2.3 LLM-Based Explanations 14
2.2.4 Operational Integration in SOCs 15
2.3 Tabular Deep Learning Models 16
2.3.1 TabNet 17
2.3.2 AutoGluon: Automated ML for Tabular Data 17
2.4 Large Language Models (LLMs) for Explanation 18
2.4.1 Open-Source LLMs 18
2.5 Relevance to Security Decision Support 20
2.6 Related Work 21
2.6.1 Deep Learning in IDS 21
2.6.2 XAI and LLM Integration 21
2.6.3 LLMs in IDS Pipelines 22
2.6.4 Comparative Summary of Related Studies 22
2.6.5 Gaps in Existing Research 22
2.6.6 Challenges in LLM-Based IDS Research 24
2.6.7 Summary of Research 25

3 Description of Dataset and Preprocessing 26
3.1 Dataset Description 26
3.1.1 Overview 26
3.1.2 Feature Structure 27
3.1.3 Attack Types 28
3.1.4 Original Data Source and Crediting 29
3.2 Data Cleaning and Feature Engineering 29
3.2.1 Handling Missing Values 29
3.2.2 Feature Selection 30
3.2.3 Encoding Categorical Features 31
3.2.4 Normalization and Scaling 31
3.3 Data Splitting 31
3.3.1 Splitting Strategy 32
3.3.2 Rationale for Stratification 32
3.3.3 Temporal Decorrelation 33
3.4 Justification of Dataset 33
3.4.1 Realistic and Diverse Attack Scenarios 33
3.4.2 Rich Feature Set for Deep Learning Models 33
3.4.3 Explainability Relevant Characteristics 34
3.4.4 Standard Benchmark for Reproducibility 34
4 Design and Specification of the Proposed System 35
4.1 System Architecture Overview 35
4.2 Model Selection Rationale 36
4.2.1 Why TabNet? 36
4.2.2 Why AutoGluon? 37
4.2.3 Comparative Strengths 37
4.3 Design Considerations 38
4.3.1 Trade-offs Between Performance and Interpretability 38
4.3.2 Real-Time Usability Considerations 38
4.3.3 Usability for Security Analysts 39
4.4 Tools and Platforms 40
4.4.1 Software Frameworks and Libraries 40
4.4.2 Hardware and Execution Environments 41
4.4.3 Training Efficiency and Resource Considerations 41
4.4.4 Reproducibility and Open Source Tools 42
4.5 LLM Integration for Explainability 43
4.5.1 System Design for LLM Integration 43
4.6 Choice of Open-Source LLMs 43
4.6.1 Prompt Engineering Strategies 44
4.6.2 Handling Failures and Output Storage 44
4.7 Evaluation Criteria 45
4.7.1 Performance Evaluation Metrics 45
4.7.2 Explainability Evaluation Metrics 46

5 Implementation and Verification 47
5.1 Model Training and Tuning 47
5.1.1 TabNet Training Configuration 47
5.1.2 AutoGluon Training Configuration 48
5.2 Performance Evaluation 49
5.2.1 AutoGluon Model Results 50
5.2.2 TabNet Results 51
5.2.3 Comparative Analysis: TabNet vs AutoGluon 52
5.2.4 Per-Class Performance Analysis 53
5.3 LLM Explanation Outputs 54
5.3.1 Purpose of LLM Integration 55
5.3.2 Models Used and Prompting Strategy 55
5.3.3 Sample Explanation Output 55
5.4 Validation Techniques 56
5.4.1 Cross-Validation 57
5.4.2 Confusion Matrix 57
5.5 Key Findings 59
5.6 Results in Relation to Research Questions 59
5.7 Error Analysis 60

6 Discussion 62
6.1 Performance Analysis 62
6.1.1 Insights from Model Performance Results 63
6.1.2 Trade-offs Between Models 63
6.2 Explainability Evaluation 64
6.2.1 Qualitative Evaluation of LLM Output 64
6.3 Analyst Usability Discussion 66
6.4 Limitations 68
6.4.1 Dataset Coverage and Scalability 68
6.4.2 Limits of LLM-Generated Explanations 69
6.5 Implications for Cyber security 70
6.5.1 Operational Integration into SOC Workflows 70
6.5.2 Human-in-the-Loop Threat Analysis 71
6.5.3 Compliance and Reporting Support 71
6.5.4 Future Directions in Cyber Defense 71

7 Conclusion and Future Work 73
7.1 Summary of Contributions 73
7.2 Main Findings 74
7.2.1 Performance Uplift via Tabular DL Models 74
7.2.2 Enhanced Interpretability via LLMs 75
7.2.3 Robust Evaluation Metrics Support Findings 75
7.2.4 Class-wise Observation 76
7.3 Impact Reflection 76
7.3.1 Bridging the Performance–Explainability Gap in IDS 77
7.3.2 Elevating Human-in-the-Loop Analysis 77
7.4 Future Work 77

References 81

Appendices

A Sample Code Snippets A-1
A.1 TabNet Configuration A-1
A.2 AutoGluon Training Setup A-1

B LLM Prompt-Responses B-3
B.1 Prompt Template B-3
B.2 Response from Mistral-7B-Instruct B-5
B.3 Response from DeepSeek-R1 B-6
B.4 Response from LLaMA3-8B-Instruct B-9
B.5 Response from Mixtral-8x7B-Instruct B-11
B.6 Response from Qwen2.5 B-12
B.7 Response from Qwen3 B-13

List of Figures

4.1 System architecture for IDS with LLM explanation. Created using draw.io. 36
5.1 Per-class performance metrics showing Precision, Recall, and F1-Score for each network traffic class. The plot highlights strong performance for most classes but low scores for Infiltration and Web Attack. 52
5.2 Comparison of key metrics (Accuracy, Precision, F1-Score) between TabNet and AutoGluon. 54
5.3 LLM Explanation Word Counts across Models 57
5.4 TabNet Confusion Matrix 58
5.5 AutoGluon Confusion Matrix 58
6.1 Flesch-Kincaid Grade Level (Lower is Easier) 67
6.2 Flesch Reading Ease Score (Higher is Easier) 68
7.1 System architecture sketch showing real-time streaming IDS with async LLM explanation flow. 78
7.2 Flowchart showing SHAP–LLM hybrid explanation pipeline. 79

List of Tables

2.1 Comparison of studies using DL models and LLM/XAI methods for IDS interpretability 23
3.1 Distribution of Network Traffic by Label 28
4.1 Software frameworks for model training and evaluation 40
4.2 LLM-related libraries used for inference and backend integration 40
4.3 Visualization libraries used in the IDS and LLM explanation pipeline 41
4.4 Execution hardware and runtime environment on Google Colab 41
4.5 Summary of platforms, models, frameworks, and system specifications 42
4.6 Comparison of Instruction-Tuned Language Models 45
4.7 Performance Metrics Used 45
4.8 Explainability Metrics 46
5.1 Summary of TabNet training configuration including core hyperparameters and training settings. 48
5.2 Summary of AutoGluon training configuration including model types, ensemble strategy, and best model selection. 49
5.3 Other Notable Models 51
5.4 Per-Class Performance of TabNet 52
5.5 Comparison of TabNet and AutoGluon (Best Model) 53
5.6 Overview of Selected LLMs Used 55
6.1 Flesch-Kincaid Grade Level interpretation. 66
6.2 Readability Metrics for LLM Outputs (Measured via Flesch-Kincaid Calculator) 67
7.1 Comparison of TabNet and AutoGluon (Best Model) [4], [5] 75
7.2 Per-Class Performance of TabNet 76
7.3 Dataset comparison matrix. 79

List of acronyms

API - Application Programming Interface
APIs - Application Programming Interfaces
AutoGluon - AutoML toolkit developed by AWS for structured/tabular data
CaTBoost - Categorical Boosting
CIC - Canadian Institute for Cyber security
CNN - Convolutional Neural Network
CSE-CIC-IDS2018 - Canadian Institute for Cyber Security – Intrusion Detection System 2018 Dataset
DL - deep learning
GBM - Gradient Boosting Machine
IAT - Inter-Arrival Time
IDS - Intrusion Detection Systems
LIME - Local Interpretable Model-Agnostic Explanations
LLMs - Large Language Models
ML - machine learning
NLP - Natural Language Processing
NN_TORCH - Neural Network implemented with PyTorch (within AutoGluon)
RNN - Recurrent Neural Network
SHAP - SHapley Additive exPlanations
SOCs - Security Operations Centers
TabNet - Tabular Network, a deep learning architecture for tabular data
XAI - Explainable AI
XGBoost - Extreme Gradient Boosting

1 Introduction

The complexity of modern cyber threats necessitates a more advanced approach than traditional signature-based methods, which have become inadequate against adaptive and evolving attacks. In this thesis, a hybrid framework is proposed to improve the efficiency of IDS.

Deep Learning (DL) has shown promise in enhancing detection capabilities, but its decision-making processes can be opaque, reducing trust and potentially delaying incident responses. To address this issue, the proposed framework combines advanced tabular DL models with explanations generated by LLMs. The goal of this approach is to improve both the detection accuracy of IDS and the explainability of their decisions, making it easier for humans to understand and trust the system's outputs.
By integrating the strengths of both DL and LLMs, the hybrid framework aims to provide a more robust and transparent solution for IDS.

1.1 Background

Cyber security has become vital for digital infrastructure as networks encounter increasingly complex threats. Intrusion Detection Systems (IDS) play a key role in defending against malicious activities and unauthorized access. IDS monitor network traffic, identify potential threats, and alert security professionals in real time. IDS approaches fall into two broad categories: signature-based and anomaly-based systems. Anomaly-based systems utilize ML to identify deviations from normal behavior, while signature-based systems rely on predefined attack patterns [1].

Advances in Machine Learning (ML), particularly DL, have resulted in the creation of precise models for anomaly detection. However, these advanced models, including deep neural networks and ensemble methods, often function as black boxes, making their underlying decision-making processes unclear. This lack of transparency poses substantial issues in critical domains such as cyber security, where understanding the rationale behind a detection is crucial for trust and swift response [2].

The structure of cyber security datasets, like CSE-CIC-IDS2018 [3], is fundamentally tabular and includes both categorical and numerical features with imbalanced classes. This poses difficulties for standard DL architectures, which are typically designed to process unstructured data such as images and text. Advancements in tabular DL models, such as TabNet [4], and automated ML frameworks like AutoGluon [5] have shown potential for handling tabular data with competitive performance. However, their application in IDS requires solutions that address the dual challenges of high performance and explainability.

1.2 Problem Statement

DL models have greatly enhanced the precision of IDS. However, their lack of transparency poses a challenge for their implementation in practical scenarios. Cyber security analysts need not just accurate alerts but also an understanding of why a particular traffic pattern was identified as malicious. Without interpretability, models cannot be fully trusted, which can delay incident responses or lead to misinterpretation of benign events as threats (false positives) [6].

Traditional explainability methods such as SHAP [7] or LIME provide some insights into model behavior, but they often produce technical explanations that may not be understandable to all stakeholders. Moreover, they do not generate context-aware, natural language explanations that could facilitate quicker decision-making.

Additionally, there is a lack of approaches that integrate both high-performance detection models and human-interpretable explanations, especially ones tailored to tabular cyber security datasets. This creates a performance-explainability tradeoff, where enhancing one often degrades the other. The gap between predictive performance and interpretability in IDS remains a critical shortcoming in current research and applications.

1.3 Research and Development Need

To address these challenges, there is a clear need for a hybrid IDS framework that not only achieves high detection accuracy using tabular DL models but also integrates explainability mechanisms capable of producing understandable and actionable insights.
The recent progress in open-source LLMs, such as GPT-J, Falcon, and LLaMA, presents an opportunity to translate complex model outputs into natural language explanations, which are more accessible to human users [8].

Combining TabNet and AutoGluon-based models, including NN_TORCH, GBM, CaTBoost, and XGBoost, with LLM-driven explanation modules can provide a robust and interpretable IDS pipeline. Such a system can potentially support analysts by making DL models' reasoning transparent and improving response effectiveness. Moreover, using open-source LLMs ensures transparency, reproducibility, and adaptability to various environments without depending on proprietary APIs.

1.4 Research Questions and Objectives

The focus of this thesis is on the discrepancy between advanced IDS models and their explainability. While models such as TabNet and AutoGluon exhibit high precision in detecting threats, their decision-making processes typically lack the transparency that is crucial for trust and efficient real-time security operations. Consequently, this research aims to determine if natural language explanations generated by LLMs can help bridge this gap.

1.4.1 Research Questions

• RQ1: How do tabular deep learning models like TabNet and AutoGluon perform on a cyber security dataset like CSE-CIC-IDS2018?
• RQ2: Can LLMs effectively generate accurate and understandable explanations for predictions made by tabular models?
• RQ3: What is the trade-off between performance and explainability when integrating LLMs in the IDS pipeline?

1.4.2 Objectives

• To evaluate the predictive performance of TabNet and AutoGluon (including NN_TORCH, GBM, CaTBoost, and XGBoost) on the CSE-CIC-IDS2018 dataset.
• To design and implement a modular IDS pipeline that integrates model outputs with open-source LLMs for explanation.
• To develop a reproducible, open-source solution for explainable IDS frameworks.

1.5 Contributions

In this thesis, a novel IDS pipeline is proposed that integrates high-performance tabular DL models, namely TabNet and AutoGluon ensembles (NN_TORCH, GBM, CaTBoost, XGBoost), with open-source LLMs to generate human-readable explanations of detection outcomes. This approach addresses the dual challenges of explainability and performance in IDS, using the CSE-CIC-IDS2018 dataset as a benchmark. The core contributions of this thesis include:

• A performance evaluation of various tabular DL models on the CSE-CIC-IDS2018 dataset.
• A modular framework for integrating LLMs to interpret model outputs in natural language.
• A qualitative and quantitative analysis of the generated explanations to assess trustworthiness and interpretability.
• An open-source implementation pipeline that supports reproducible research in explainable cyber security systems.

This contribution supports cyber security analysts in understanding model decisions more clearly, potentially increasing their trust in automated alerts, reducing investigation time, and enabling more informed incident response actions.

1.6 Structure of the Thesis

The rest of the thesis is organized as follows:

Chapter 2 provides an overview of the theoretical background, discussing the principles of IDS, the significance of explainability in ML, and the utilization of tabular DL models. Particular attention is given to LLMs for generating understandable explanations.
The chapter ends with a review of relevant literature and its connection to security decision-making.

Chapter 3 explains the dataset and data preparation methods, such as data cleaning, feature engineering, encoding, normalization, and data-splitting strategies. The selection of the dataset is justified in accordance with the research objectives.

Chapter 4 outlines the design of the proposed IDS framework, including the system architecture, reasons for model selection, and design considerations. It also includes information on the integration of LLMs for explainability and evaluation criteria for both performance and interpretability.

Chapter 5 describes the implementation and verification process, encompassing model training and tuning, the experimental setup, performance evaluation, and LLM-generated explanations. Techniques used to guarantee result dependability are also detailed.

Chapter 6 discusses the results, examining performance, explainability, and usability. The limitations of the approach and its implications for cyber security operations are analyzed.

Chapter 7 summarizes the contributions and findings, assesses the research impact, and suggests future work to enhance the framework's applicability in real-world security contexts.

1.7 Statement of AI Usage

AI tools were used for searching and gathering research papers. Tools such as Elicit and Google Scholar were used to discover academic literature. These tools were helpful because they allowed me to manage citations, create interactive mind maps of research topics, and group papers into collections. While AI was employed to refine the written content and eliminate grammatical errors and writing flaws, it is important to note that the content was exclusively studied, analysed, researched, and written by me.

2 Background and Related Work

In the context of escalating cyber threats, there is an increasing need for IDS to be more precise, understandable, and swift. Conventional IDS methods frequently lack the adaptability needed to manage new attacks. DL has proven beneficial in enhancing IDS precision, but its non-transparent decision-making process poses a challenge in critical applications. A promising solution is the combination of tabular DL models with LLMs, which offers both high performance and interpretable explanations. This chapter discusses fundamental concepts in IDS, explainability in ML, tabular DL techniques, and recent advancements in IDS systems integrating LLMs.

2.0.1 Key AI Concepts

This section provides an overview of fundamental AI and ML principles that are crucial for this study. It explains the basic techniques, models, and strategies that form the basis of the proposed hybrid IDS. These concepts are vital for understanding how the system identifies threats, processes tabular data, and generates explanations that are easily understandable to humans.

• Machine Learning (ML): Algorithms that learn patterns from data, enabling them to make predictions or decisions without being explicitly programmed.
• Deep Learning (DL): A subset of ML that uses multi-layered neural networks to model complex relationships in the data.
• Transfer Learning: The use of a pre-trained model for a new task, saving time and resources by leveraging existing knowledge.
• Automated Machine Learning (AutoML): Automates the process of selecting and tuning models for improved efficiency.
• Explainable AI (XAI): Methods that make AI model decisions comprehensible to humans, ensuring transparency and trust.
• SHAP: A model-agnostic method used to explain predictions by attributing outputs to specific input features, based on Shapley values from game theory.
• LIME: A model-agnostic method that generates interpretable local approximations of model predictions.
• TabNet: A deep learning architecture designed specifically for tabular data, using attention mechanisms to prioritize relevant features.
• AutoGluon: An open-source AutoML toolkit focused on structured/tabular data, providing pre-trained models and algorithms for easy implementation.
• Large Language Models (LLMs): AI systems trained to process and generate human language, enabling them to interact with users in a natural, conversational manner.

2.0.2 Additional Concepts and Evaluation Metrics in AI-driven IDS

While core AI concepts form the theoretical basis, understanding additional technical and operational notions is crucial for designing, evaluating, and ensuring the reliable performance of AI-driven IDS in real-world scenarios. These supplementary concepts include model transparency, evaluation metrics, and cyber security-specific terms. This section outlines key ideas related to interpreting results, avoiding common mistakes, and maintaining the IDS's effectiveness in actual operational settings. A worked example of the core evaluation metrics follows the list below.

• Black Box Model: A model whose inner workings are unclear and hard for humans to understand, making it challenging to explain why specific predictions are made.
• White Box Model: A model whose internal mechanisms can be understood, enabling analysts to comprehend how inputs are transformed into outputs.
• Feature Engineering: The process of selecting, altering, or creating input variables (features) to improve the performance of ML models.
• Class Imbalance: A frequent problem in cyber security datasets where one class (e.g., regular traffic) has significantly more occurrences than another (e.g., rare attack types), which can bias models toward the majority classes.
• Precision: The proportion of true positives among all predicted positives.
• Recall: The proportion of actual positives that are correctly identified by the model.
• F1-Score: The harmonic mean of precision and recall, balancing both metrics.
• Confusion Matrix: A table used to evaluate classification model performance, displaying true positives, false positives, true negatives, and false negatives.
• Overfitting: When a model learns patterns specific to the training data (including noise) but fails to generalize to unseen data.
• Underfitting: When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and testing datasets.
• Data Leakage: When information from outside the training dataset (e.g., the test set) is accidentally used during model training, causing overly optimistic performance estimates.
• Adversarial Example: Input data intentionally designed to cause a machine learning model to make an incorrect prediction.
• Latency: In IDS, the time taken for the system to detect and respond to a potential threat after the data is observed.
• Security Operations Center (SOC): A centralized team responsible for monitoring, detecting, investigating, and responding to cyber security threats in real time.
• Zero-Day Attack: A cyber attack that targets an unknown vulnerability, making it difficult for traditional signature-based IDS to detect.
• Benchmark Dataset: A standard dataset used for comparing model performance across various research studies.
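As a concrete illustration of these metrics, the short sketch below computes precision, recall, and the F1-score from raw prediction counts for a single attack class. The counts are invented purely for illustration and do not correspond to any result reported in this thesis:

# Toy counts for one attack class (illustrative only, not thesis results)
tp, fp, fn = 90, 10, 30   # true positives, false positives, false negatives

precision = tp / (tp + fp)                           # 90/100 = 0.900
recall = tp / (tp + fn)                              # 90/120 = 0.750
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean, about 0.818

print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")

Note how the harmonic mean pulls the F1-score toward the weaker of the two components, which is why it is preferred over a simple average on imbalanced IDS data.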
2.1 Intrusion Detection Systems (IDS)

IDS play a vital role in protecting network infrastructure from unauthorized access and malicious activities. These systems monitor and analyze network traffic to identify potential intrusions or unusual patterns, alerting administrators for prompt action. IDS techniques can be divided into two main categories: signature-based and anomaly-based detection systems [9].

In simpler terms, signature-based IDS utilize predefined patterns of malicious actions, similar to antivirus databases, effectively identifying known security threats. However, they struggle to detect novel or zero-day attacks because these attacks lack familiar patterns. In contrast, anomaly-based IDS create a baseline of normal network behavior and flag deviations from this norm as potential threats. Anomaly-based systems provide greater adaptability to new types of attacks; however, they frequently suffer from high false-positive rates and difficulties in interpretation.

The escalating complexity of cyber attacks necessitates the use of ML and DL methods in anomaly-based IDS. However, many DL-based IDS are still opaque, making it hard for cyber security experts to understand their decisions. Recent studies have emphasized this challenge and suggest combining XAI techniques with LLMs to make these systems more transparent and useful for cyber security professionals [10], [11].

2.1.1 Evolution of IDS Approaches: From Rule-Based to LLM-Integrated Systems

In the early stages of IDS, rule-based and signature-based methods such as Snort and Bro/Zeek were predominantly used. These systems compared network activity against known attack patterns, performing well for established threats but failing to detect new or evolving attacks [12], [13].

In the 2000s, traditional ML methods such as Decision Trees, Support Vector Machines (SVMs), and Random Forests were applied to anomaly detection tasks. While these models improved generalization, they still required manually engineered features and lacked interpretability for non-expert users [14], [15].

From the mid-2010s, DL approaches such as CNNs, RNNs, and autoencoders became more effective than traditional models in terms of accuracy, particularly for large and complex datasets like NSL-KDD and CSE-CIC-IDS2018 [16]. However, their black-box nature led to trust issues in operational settings [17].

More recently, XAI techniques such as SHAP and LIME have emerged to help security analysts understand the reasons behind triggered alerts. These tools enhance transparency but often provide explanations that are too technical or disconnected from real-world analyst workflows [12].

The latest trend is the integration of LLMs with IDS. LLMs can generate natural language justifications for model predictions, bridging the gap between high accuracy and human interpretability. This represents a significant stage in IDS evolution, where automated detection is coupled with human-understandable reasoning [18].

2.2 Explainability in Machine Learning

In cyber security applications, explainability is not a luxury; it is a necessity.
Analysts must understand why an IDS flags certain behaviors as threats, especially in high-stakes environments such as banking, defense, or healthcare. Without interpretability, even high-performing models can lead to operational friction, mistrust, or misinformed decisions. As argued by Doshi-Velez and Kim [2], explainability is essential for human trust, safety, and compliance.

2.2.1 Importance in Cyber Security

Cyber security analysts often rely on IDS outputs for real-time decision-making. If a system raises an alert but cannot explain why, it may be ignored or misinterpreted, especially in cases of false positives. Explainability enables the identification of model biases, improves the debugging of false alarms, and enhances collaboration between human analysts and automated systems. Moreover, with regulatory frameworks like the EU AI Act [19] pushing for algorithmic transparency, explainability is becoming a compliance requirement.

2.2.2 SHAP and LIME

Two prominent model-agnostic explainability tools are SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanations).

• SHAP provides consistent and theoretically grounded explanations by assigning feature importance scores based on cooperative game theory. It has been widely used in cyber security contexts to explain anomaly detections in ML models [7].
• LIME perturbs the input data and fits local surrogate models to approximate the decision boundary of the black-box classifier. Although faster, LIME can be unstable and sensitive to sampling variations [20].

Both SHAP and LIME, however, generate technical outputs like bar graphs or feature attributions, which are not always intuitive to non-ML experts. Furthermore, they do not offer contextualized explanations or rationale in natural language.

2.2.3 LLM-Based Explanations

With the advancement of LLMs such as LLaMA, Mistral, and Qwen3, there has been a significant shift in the approach to explainability. Rather than using complex diagrams or numeric feature weights, these models can convert IDS outputs into understandable justifications that resemble analyst notes or reports. For example, an LLM could interpret a prediction labeled as "DDoS Attack" and provide an explanation such as: "This traffic pattern is similar to known DDoS behavior due to its high packet rate, frequent targeting of the same destination, and low entropy."

Recent studies, such as Houssel et al. [21] and Ali [22], indicate that integrating LLMs into IDS can be advantageous. This integration offers improved interpretability without compromising performance. By doing so, these systems can offer real-time explanations in a narrative format for both accurate and inaccurate predictions, which in turn boosts the effectiveness of incident response.

Researchers have begun to combine SHAP outputs with LLM prompts. This integration allows for the delivery of both numerical and descriptive explanations, aiming to overcome the limitations of traditional tools [23].
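To make this SHAP-to-LLM hand-off concrete, the minimal sketch below extracts the top feature attributions for one flagged flow and embeds them in a natural language prompt. It assumes a fitted tree-based binary classifier (model), a feature DataFrame (X_test), and the shap package; the prompt wording is illustrative and is not the template used later in this thesis:

import shap

# Explain one flagged flow from a fitted tree-based classifier.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test.iloc[[0]])  # binary case: 2-D array

# Rank features by the magnitude of their contribution to this prediction.
top = sorted(zip(X_test.columns, shap_values[0]),
             key=lambda pair: abs(pair[1]), reverse=True)[:5]

# Convert numeric attributions into an LLM-ready prompt.
feature_text = ", ".join(f"{name} ({value:+.2f})" for name, value in top)
prompt = (f"The IDS classified this network flow as 'DDoS'. "
          f"The most influential features were: {feature_text}. "
          f"Explain the decision in plain language for a SOC analyst.")

The same prompt string can then be passed to any of the open-source LLMs discussed in Section 2.4.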
2.2.4 Operational Integration in SOCs

Integrating LLM-based explanations into SOC workflows can significantly improve incident-handling speed and accuracy:

• Alert Prioritization: LLM-generated explanations help analysts swiftly determine whether a flagged event is a high-severity incident or a false positive, thereby reducing time spent on low-priority alerts. As demonstrated by Houssel et al. [21], severity-ranked explanations enhance SOC triage efficiency.
• Context-Enriched Incident Reports: Instead of numerical outputs, LLMs can generate easy-to-understand summaries like "The model classified this traffic as an SQL injection attempt due to abnormal query patterns, frequent malformed SQL keywords, and repeated connection resets to the database server." This is in line with findings from Yang et al. [23], where narrative explanations improved communication between SOC analysts and non-technical stakeholders.
• Training and Knowledge Retention: SOCs can use LLM-generated explanations as a growing knowledge base. By storing explanations alongside alerts, analysts can create historical case libraries that speed up onboarding for new staff, as discussed in Wali et al. [24].

In essence, LLM-based explanations are not just academic tools but also operational boosters for real-time cyber defense. This supports the argument for hybrid IDS frameworks that combine high-performance DL models with advanced, human-readable explainability mechanisms.

2.3 Tabular Deep Learning Models

DL has demonstrated exceptional results in domains handling unstructured data like images and text. However, in cyber security applications such as IDS, data is usually structured in a tabular format, consisting of numerical and categorical features that describe various characteristics of network traffic. Traditional DL architectures are not optimized for this structure, often necessitating extensive feature engineering or transformations that can affect the model's transparency and performance.

To overcome these challenges, recent developments have introduced specialized tabular DL models capable of directly processing structured data. This thesis focuses on two notable methods: TabNet, which utilizes attention mechanisms for feature selection and interpretability, and AutoGluon, an automated machine learning (AutoML) framework that incorporates a wide range of high-performing models such as NN_TORCH, GBM, CaTBoost, and XGBoost. These models have displayed impressive performance on cyber security datasets and serve as a base for integrating explainability techniques through model-agnostic tools and LLMs.

2.3.1 TabNet

IDS often rely on tabular data structured as records containing numerical and categorical features like packet size, protocol, source/destination IPs, and flow duration. While traditional ML models (e.g., Random Forest, SVM) perform well on such data, tabular DL has shown competitive or superior performance when properly optimized. However, handling the complexity of tabular structures, such as feature interactions, sparsity, and imbalance, remains a challenge.

TabNet, introduced by Arik and Pfister [4], is a DL model tailored for tabular datasets. It employs sequential attention mechanisms to select and focus on relevant features during each decision step. This provides both high predictive performance and built-in interpretability by showing which features were used at each step of the decision process. TabNet's ability to select important features dynamically is crucial, especially when handling high-dimensional network traffic data with mixed-type attributes.
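The minimal sketch below shows how TabNet's sequential attention can be trained and queried in practice. It assumes the pytorch-tabnet package; the hyperparameter values are illustrative placeholders rather than the configuration used in this thesis (reported in Chapter 5):

from pytorch_tabnet.tab_model import TabNetClassifier

# n_d/n_a set the decision and attention widths; n_steps is the number
# of sequential attention steps described above. Values are illustrative.
clf = TabNetClassifier(n_d=16, n_a=16, n_steps=5, gamma=1.5, seed=42)

# X_* are numpy arrays of flow features; y_* are integer-encoded labels.
clf.fit(X_train, y_train,
        eval_set=[(X_val, y_val)],
        max_epochs=100, patience=15, batch_size=1024)

# Built-in interpretability: per-sample aggregate attributions plus the
# attention mask applied at each decision step.
explain_matrix, step_masks = clf.explain(X_test)
print(clf.feature_importances_)  # global feature importance scores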
2.3.2 AutoGluon: Automated ML for Tabular Data

AutoGluon is an automated ML (AutoML) framework created by AWS that optimally combines various models to achieve the best performance on tabular datasets [5]. It is particularly beneficial for IDS, as feature engineering and hyperparameter tuning can be intricate and time-consuming. The primary models utilized within AutoGluon are:

• NN_TORCH: A neural network built using PyTorch, which can be customized with different layers and activations. It is suitable for handling large, even noisy, datasets and leverages GPU acceleration to boost processing speed.
• GBM: A popular ensemble learning technique that creates models sequentially to minimize residual errors. It performs well on moderately imbalanced data, particularly in cyber security datasets.
• CaTBoost: Developed by Yandex, this algorithm can handle categorical features directly and prevents overfitting using ordered boosting. Its fast convergence and high accuracy make it ideal for structured cyber security data.
• XGBoost: A powerful ML tool recognized for its scalability and regularization. It is widely used in IDS due to its exceptional performance across binary and multiclass classification tasks.

The latest studies on IDS suggest that ensemble methods like AutoGluon generally outperform individual models, because ensembles provide model diversity and optimized pipelines. Furthermore, AutoGluon can easily be combined with explainability tools, making it suitable for hybrid explainable IDS pipelines when used alongside SHAP or LLMs [24].
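As a brief illustration of the AutoML workflow just described, the sketch below trains and ranks these model families with AutoGluon's tabular API; the preset and time limit are illustrative, and the configuration actually used is given in Chapter 5:

from autogluon.tabular import TabularPredictor

# train_df / test_df are pandas DataFrames holding the feature columns
# plus the 'Label' target column.
predictor = TabularPredictor(label='Label', eval_metric='f1_macro')
predictor.fit(train_df, presets='medium_quality', time_limit=3600)

# The leaderboard compares the trained families (NN_TORCH, GBM, CaTBoost,
# XGBoost) and the weighted ensemble built on top of them.
print(predictor.leaderboard(test_df))
predictions = predictor.predict(test_df)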
2.4 Large Language Models (LLMs) for Explanation

The advancement of LLMs marks substantial progress in the capacity of ML systems to explain their reasoning to human users. In the domain of cyber security, where clear and swift explanations are vital, LLMs offer a revolutionary approach for translating complex model outcomes into easy-to-understand narratives.

2.4.1 Open-Source LLMs

In the field of cyber security, advancements have been made in the utilization of open-source LLMs within IDS. These models can generate contextually relevant, easy-to-understand explanations for predictions made by the model, converting complex outputs into digestible insights for analysts. This research evaluates several state-of-the-art open-source LLMs to ascertain their potential for improving IDS interpretability and user-friendliness, with a focus on real-time applications.

• Mistral-7B, developed by Mistral AI, is an optimized transformer model designed for real-time applications such as intrusion alert summarization. It has been fine-tuned to provide context-aware justifications, making it useful for analysts [22].
• DeepSeek-R1 is a language model that integrates language and coding capabilities. It supports deep packet inspection and offers coherent explanations for alerts in critical infrastructure systems [25].
• LLaMA3-8B-Instruct, released by Meta, performs well at a moderate scale. It is particularly beneficial for multilingual IDS interpretation and threat sequence summarization [26].
• Mixtral-8x7B is a mixture-of-experts model optimized for efficiency and designed for large-scale explanation tasks in Security Operations Center (SOC) environments. It integrates seamlessly with IDS dashboards [10].
• Qwen2.5-14B-Instruct is recognized for its ability to follow instructions precisely and its multilingual support. It converts complex feature importance, such as SHAP outputs, into straightforward natural language explanations [27].
• Qwen3-8B-Base is a base model that supports advanced prompt engineering for log analysis and threat chain explanation. This model can be fine-tuned to enhance its cyber reasoning skills [28].

All of these models can be fine-tuned or prompted to provide context-aware justifications for IDS decisions, such as why a packet was classified as malicious or what features triggered an anomaly alert. For example, Houssel et al. [21] demonstrated that LLMs, when paired with model outputs from classifiers, can generate natural language justifications with high coherence and utility for human analysts.

2.5 Relevance to Security Decision Support

The key advantage of LLMs in IDS is their ability to act as explanation generators that are both technically informative and linguistically accessible. For instance, instead of showing a feature attribution chart, an LLM can generate a sentence like: "The model identified this traffic as part of a brute-force attack due to high-frequency login attempts, unusual access times, and repeated password failures."

This form of explanation aligns better with the needs of security operations teams, who often need to act quickly and may not be trained in data science. LLMs also allow for interactive querying, meaning analysts can ask questions like "Why was this IP flagged?" or "What type of attack does this resemble?", creating a conversational interface with machine intelligence.

Recent studies, such as the one by Houssel et al. [21], demonstrate that integrating LLMs with traditional IDS enhances analyst response time and clarifies incidents. Additionally, Yang et al. [23] evaluated several LLMs using intrusion datasets and found that the models consistently provided high-quality explanations that were fluent and relevant across different categories of cyber attacks.
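To ground this, the sketch below generates such a justification locally with an open-source instruction-tuned model through the Hugging Face transformers pipeline; the model identifier and prompt wording are illustrative, and the prompt template actually used in this work appears in Appendix B:

from transformers import pipeline

# Any locally hosted instruction-tuned model can be substituted here.
generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2",
                     device_map="auto")

prompt = ("You assist SOC analysts. An IDS flagged a network flow as "
          "'Brute-force'. Key indicators: high-frequency login attempts, "
          "repeated password failures, unusual access times. Explain the "
          "alert in two plain-language sentences.")

result = generator(prompt, max_new_tokens=120, do_sample=False)
print(result[0]["generated_text"])

Running the model on-premise in this way also addresses the privacy concern, raised later in Section 2.6.6, of sending sensitive network data to cloud-hosted APIs.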
2.6 Related Work

Research is broadening to enhance IDS by integrating DL and explainable AI (XAI) techniques, such as LLMs. This synergistic approach aims to accomplish three main goals: improving detection accuracy, guaranteeing transparency in decision-making procedures, and enabling quick decisions in high-risk situations.

2.6.1 Deep Learning in IDS

In the realm of IDS, conventional approaches like Decision Trees and Random Forests are being progressively augmented or even replaced by DL models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), TabNet, and transformers. Compared with traditional methods, DL models exhibit superior pattern recognition abilities, especially on complex datasets like CSE-CIC-IDS2018 or NSL-KDD.

In their work, Alnahdi & Narain [29] introduced a coherence-based DL framework that utilizes LLMs to verify the predictions made by IDS. Compared to SHAP or LIME used separately, this model showed improved interpretability and consistency.

Similarly, T. Ali [22] presented a real-time IDS system that utilizes DL models for anomaly detection and LLMs for generating adaptive data summaries. Their evaluation on the CICIDS2017 and CICIDS2018 datasets demonstrated state-of-the-art performance in terms of accuracy and interpretability.

2.6.2 XAI and LLM Integration

In contemporary IDS research, there is a rising emphasis on integrating SHAP or LIME for feature attribution along with natural language explanations. This combined approach enhances the credibility of alerts and enables analysts to verify and contest model decisions more efficiently.

A comparative study by Yang et al. [23] on various LLMs (LLaMA, GPT-J, Falcon) for IDS showed that while all models boosted interpretability, domain-tuned LLMs provided better factual accuracy and fluency.

2.6.3 LLMs in IDS Pipelines

An innovative approach involves incorporating LLMs fully within IDS pipelines, not just for post-hoc explanations but as active reasoning components. Houssel et al. [21] illustrated this with a framework in which LLMs interpret IDS alerts and rank them based on severity and confidence, thereby assisting triage and response prioritization in Security Operations Centers (SOCs).

Similarly, Wali et al. [24] proposed an explainable IDS utilizing Random Forests supplemented by GPT-J. This system not only improved interpretability but also helped automate analyst reports during intrusion investigations.

2.6.4 Comparative Summary of Related Studies

Table 2.1 below compares key studies that integrate DL, XAI, and LLMs in IDS. Each study's primary contributions, limitations, and differences compared to the current study are outlined.

Table 2.1: Comparison of studies using DL models and LLM/XAI methods for IDS interpretability

Study | Dataset(s) | DL Model(s) | XAI/LLM Method | Key Strengths | Limitations
Alnahdi & Narain [29] | CIC-IDS2017, CIC-IDS2018 | CNN, RNN | LLM verification of IDS outputs | Improved interpretability using LLMs | Limited scalability; no real-time evaluation
Ali [22] | CIC-IDS2017, CIC-IDS2018 | LSTM, DNN | Real-time LLM summaries | Adaptive explanations | Focus on streaming data, not rare attack detection
Yang et al. [23] | CIC-IDS2018 | Various DL models | SHAP + LLMs | High factual accuracy of explanations | No standard metric for interpretability
Wali et al. [24] | CIC-IDS2018 | Random Forest | GPT-J explanations | Automated analyst reports | Not optimized for DL-based IDS
Houssel et al. [21] | NSL-KDD, CIC-IDS2018 | CNN, DNN | LLM-based alert ranking | Prioritizes critical alerts effectively | No integration with tabular-specific DL models
This Study | CSE-CIC-IDS2018 | TabNet, AutoGluon | Open-source LLM explanations | Balances performance & interpretability; targets rare attack classes | Requires further work on multilingual support

2.6.5 Gaps in Existing Research

To advance the use of DL and XAI in IDS, it is essential to address several ongoing challenges:

1. Limited Integration of LLMs: Most IDS frameworks utilizing XAI currently employ post-hoc feature attribution tools such as SHAP or LIME [7], [20]. Few studies have, however, explored the full integration of LLMs into IDS for generating context-aware, natural language explanations [21].

2. Lack of Standardized Evaluation of Explanations: Although accuracy metrics for IDS are well established, there is no widely accepted benchmark for measuring the quality of LLM-generated explanations, making it difficult to compare the interpretability of different systems [24].

3. Performance–Explainability Trade-off: Several studies have shown that improving model interpretability can reduce predictive accuracy, and vice versa [2], [6]. Balancing both remains an open challenge in IDS research.
4. Underrepresentation of Rare Attack Classes: Datasets like CSE-CIC-IDS2018 are heavily imbalanced [30], [31], leading to poor model performance on rare but critical attack types such as Infiltration and Web Attacks. Few works have explored how LLM explanations can assist analysts in low-sample scenarios.

To address these gaps, an IDS framework should integrate high-performance tabular DL models with LLM-generated explanations while providing clear evaluation metrics for both accuracy and interpretability. This thesis aims to tackle these shortcomings.

2.6.6 Challenges in LLM-Based IDS Research

The integration of LLMs into IDS holds potential benefits, but several obstacles persist:

• Computational Overhead: Large-scale LLMs necessitate substantial GPU memory and inference time. This can pose challenges in real-time SOC environments where latency is crucial [21], [32].
• Risk of Fabricated Explanations ("Hallucinations"): Sometimes, LLMs generate convincing but factually incorrect justifications. In a security context, this could lead analysts to make inappropriate operational decisions [23].
• Bias and Context Sensitivity: If the training data for LLMs contains biases, these may be carried over into explanations, potentially distorting an analyst's understanding of the severity or nature of a threat [22].
• Limited Domain Specialization: General-purpose LLMs may lack extensive cyber threat intelligence knowledge, affecting the accuracy of their explanations unless fine-tuned on domain-specific datasets [24].
• Security and Privacy Concerns: The use of cloud-hosted LLM APIs might spark concerns about sensitive network data being transmitted outside the organization. Open-source, on-premise deployment can help alleviate this concern but may increase hardware demands [10], [27].

Overcoming these challenges is vital for producing trustworthy, deployable LLM-enhanced IDS. Future research should focus on lightweight, domain-tuned LLMs, robust factuality checks, and hybrid pipelines that combine statistical and narrative explanations.

2.6.7 Summary of Research

While integrating DL models and LLMs in IDS has shown positive results for improving both accuracy and interpretability, several significant issues persist in current studies. Firstly, existing methods often fail to account for practical constraints such as limited computing resources in real-time environments. Secondly, the need for models capable of functioning effectively across various languages and cyber security domains remains unmet. Lastly, there is a lack of standardized evaluation benchmarks to assess the effectiveness of IDS explanations generated by LLMs. These deficiencies indicate that while advancements have been made, additional research is necessary to develop scalable, adaptable, and universally comparable explainable IDS.

3 Description of Dataset and Preprocessing

This chapter outlines the structure and characteristics of the CSE-CIC-IDS2018 dataset and details the preprocessing steps employed to clean, transform, and prepare the data for use with tabular DL models. These preprocessing steps are crucial to guarantee the quality of the data, the performance of the models, and a fair evaluation.

3.1 Dataset Description

The dataset utilized in this study is CSE-CIC-IDS2018. It was developed by the Canadian Institute for Cyber security (CIC) at the University of New Brunswick and serves as a successor to the CIC-IDS2017 dataset.
This dataset is one of the most extensive publicly available datasets for evaluating IDS performance in contemporary network environments.

3.1.1 Overview

The CSE-CIC-IDS2018 simulation generates network traffic resembling a real enterprise network, utilizing the AWS cloud infrastructure. This simulation encompasses both benign and malicious traffic, representing various contemporary attack types. Over a 10-day period, 50 machines generated this traffic, which was recorded using the CICFlowMeter tool.

The dataset contains over 80 features, categorized into several areas such as packet statistics, flow durations, Inter-Arrival Time (IAT) metrics, flag counts, and header information. The comprehensive nature of the features makes the dataset highly suitable for ML-based IDS.

3.1.2 Feature Structure

Each record in the dataset corresponds to a network flow and includes the following categories of features:

• Basic flow features: e.g., Flow Duration, Total Fwd/Bwd Packets, Bytes, IAT metrics
• Content features: e.g., PSH, URG, FIN flag counts
• Time-based features: e.g., Fwd IAT Std, Active Min, Idle Std
• Statistical features: e.g., Packet Length Mean, Flow Bytes/s, Packet Length Variance
• Application-layer indicators: e.g., Destination Port, Protocol

The dataset includes 85 columns in total, with 84 features and one label column named "Label" indicating the class (e.g., Benign, DDoS, Botnet). The distribution of the classes is presented in Table 3.1:

Table 3.1: Distribution of Network Traffic by Label

Label | Count
Benign | 53,287
DDoS attack | 7,726
DoS attack | 1,907
Botnet | 1,425
Infiltration | 1,204
Brute-force | 970
Web attack | 9

This distribution indicates a significant class imbalance, which is common in cyber security datasets and must be considered in model evaluation and sampling strategies.

3.1.3 Attack Types

In network security, it is essential to understand the various types and complexities of cyber attacks in order to develop robust detection and defense strategies. This study covers key categories of network attacks, each modeled with traffic patterns and payload structures mimicking real-world activities, so that the research findings are both valid and applicable.

• Botnet Activity refers to coordinated actions carried out by multiple compromised computers. These activities often involve spamming, DDoS attacks, or distributing malware.
• Web-based Attacks (e.g., SQL Injection, Cross-Site Scripting) target web applications by exploiting vulnerabilities through malicious user input. This can result in manipulating backend databases or injecting client-side scripts.
• Infiltration Attacks (Internal Threats) encompass threats originating from within a network. These may include malicious insiders or compromised internal hosts that pose a security risk to the system.
• Brute-force Login Attempts involve automated attempts to access a system using numerous password combinations, typically with the intention of gaining unauthorized access.

Each attack is mapped to realistic traffic patterns and payload structures, reflecting real-world behavior.

3.1.4 Original Data Source and Crediting

The dataset was curated by Iman Sharafaldin et al. [30] and published through the Canadian Institute for Cyber Security. It is publicly available and widely adopted for academic benchmarking.
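For concreteness, the short snippet below loads the dataset and reproduces the label distribution of Table 3.1; the file name is a placeholder for the pre-cleaned Kaggle release actually used (see Section 3.2.1 and [33]):

import pandas as pd

# Placeholder path for the community-cleaned Kaggle release [33].
df = pd.read_csv("cse-cic-ids2018-cleaned.csv")

print(df.shape)                    # expected: (66528, 85)
print(df["Label"].value_counts())  # class counts as reported in Table 3.1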
3.2 Data Cleaning and Feature Engineering

The CSE-CIC-IDS2018 dataset, although comprehensive, requires careful preprocessing to ensure optimal performance and fairness in ML pipelines. This section outlines the key steps undertaken in data cleaning, handling missing values, feature selection, encoding, and normalization.

3.2.1 Handling Missing Values

The raw dataset, particularly when collected from AWS S3 or CICFlowMeter outputs, can contain inconsistencies such as missing or infinite values due to logging errors or feature extraction bugs. In this work, however, a cleaned version of the dataset prepared by the community on Kaggle [33] was used, which ensured:

• Zero missing values
• No infinite or NaN entries
• Correctly parsed data types
• No duplicated records

Such pre-cleaned data reduces the preprocessing overhead and minimizes data integrity issues that often affect anomaly detection accuracy.

3.2.2 Feature Selection

From the original 85 columns, features were selected based on relevance, diversity across statistical, time-based, and flag features, and compatibility with tabular DL models. The chosen feature set of 40 columns (Listing 1) excludes redundant identifiers (e.g., IP addresses, timestamps) and emphasizes those that capture flow behavior and anomaly patterns:

Listing 1: Feature column definition.

feature_cols = [
    'Protocol', 'Fwd Packet Length Mean', 'Fwd Packet Length Std',
    'Bwd Packet Length Min', 'Flow Bytes/s', 'Flow IAT Std',
    'Fwd IAT Total', 'Fwd IAT Std', 'Fwd IAT Max', 'Fwd IAT Min',
    'Bwd IAT Total', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Max',
    'Bwd IAT Min', 'Fwd PSH Flags', 'Fwd URG Flags', 'Bwd Header Length',
    'Fwd Packets/s', 'Bwd Packets/s', 'Packet Length Min',
    'Packet Length Std', 'Packet Length Variance', 'FIN Flag Count',
    'PSH Flag Count', 'ACK Flag Count', 'URG Flag Count', 'ECE Flag Count',
    'Down/Up Ratio', 'Avg Packet Size', 'Init Fwd Win Bytes',
    'Init Bwd Win Bytes', 'Fwd Act Data Packets', 'Fwd Seg Size Min',
    'Active Std', 'Active Max', 'Active Min', 'Idle Std', 'Idle Min',
    'kernel_feature_0'
]

This selection is guided by prior studies, such as that by Ahmad et al. [6], which emphasize the predictive power of flow-based features in anomaly detection.

3.2.3 Encoding Categorical Features

The dataset includes a small number of categorical features, such as "Protocol" and "Label". These were handled as follows:

• Label Encoding was applied to the "Label" column to convert attack types into integers for classification tasks.
• One-Hot Encoding was avoided to prevent high dimensionality and to maintain compatibility with DL architectures like TabNet, which can natively process integer categorical features.

3.2.4 Normalization and Scaling

The DL models utilize feature scaling during training to promote convergence and minimize bias caused by variations in feature magnitudes. Two techniques were employed:

• Min-Max Scaling was applied to features such as packet lengths and Inter-Arrival Times (IATs) to scale them between 0 and 1.
• Z-score Normalization was used for features with heavy tails or significant outliers, such as Flow Bytes per second.

The normalization process took place after the data split to avoid data leakage, a vital concern for maintaining the integrity of the training process, as emphasized by Liu et al. [31].
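The following sketch illustrates the encoding and scaling steps of Sections 3.2.3 and 3.2.4. It assumes train_df and test_df DataFrames from the split described in Section 3.3; the column groupings are illustrative rather than the thesis's exact assignment:

from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Encode the target: attack names -> integer class ids (Section 3.2.3).
label_enc = LabelEncoder()
y_train = label_enc.fit_transform(train_df["Label"])
y_test = label_enc.transform(test_df["Label"])

# Scale features only after splitting, fitting on the training set alone
# to avoid leakage (Section 3.2.4). Column groups here are illustrative.
minmax_cols = ["Fwd Packet Length Mean", "Flow IAT Std"]  # bounded-range features
zscore_cols = ["Flow Bytes/s"]                            # heavy-tailed features

mm = MinMaxScaler().fit(train_df[minmax_cols])
zs = StandardScaler().fit(train_df[zscore_cols])
for frame in (train_df, test_df):
    frame[minmax_cols] = mm.transform(frame[minmax_cols])
    frame[zscore_cols] = zs.transform(frame[zscore_cols])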
3.3 Data Splitting

To successfully train, verify, and assess ML models, it is essential to partition the dataset into distinct subsets. This process helps the models generalize effectively to new, unseen data and prevents them from excessively fitting to training patterns.

3.3.1 Splitting Strategy

In this research, the CSE-CIC-IDS2018 dataset, consisting of 66,528 rows, was separated using a stratified method to ensure that the class distribution remains consistent across each subset. The data split ratios utilized are as follows:

• Training set (70%): used for model training and internal optimization.
• Validation set (15%): used to track performance during training and for early stopping, which prevents overfitting by halting training when the model begins to perform poorly on held-out data.
• Test set (15%): left untouched throughout the training process and used exclusively to evaluate the final model's performance.

This approach follows best practices in ML pipeline development, as described by Erickson et al. [5] in AutoML contexts, where maintaining class ratios across subsets is critical for imbalanced datasets like those used in IDS.

3.3.2 Rationale for Stratification

The dataset exhibits a significant class imbalance, with benign traffic comprising over 80% of all records. Stratified splitting ensures that minority attack classes such as Web attack, Brute-force, and Infiltration are present in all subsets, enabling the models to learn their patterns and be evaluated on them fairly. Without stratification, random splits may completely exclude certain rare classes from the training or validation set, leading to poor generalization and skewed performance metrics.

3.3.3 Temporal Decorrelation

Although CSE-CIC-IDS2018 contains timestamps, they were not directly used in this study's modeling pipeline. In future work or real-time deployment scenarios, however, time-aware splitting could be employed to simulate data drift or concept evolution in IDS tasks, as suggested by Yang et al. [23].
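A minimal sketch of the stratified 70/15/15 split described in Section 3.3.1, using scikit-learn. X and y denote the feature matrix and encoded labels; the fixed random_state is an assumption for reproducibility:

from sklearn.model_selection import train_test_split

# Two-stage stratified split: 70% train, then the remaining 30% halved
# into 15% validation and 15% test, preserving class ratios throughout.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)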
3.4 Justification of Dataset

The selection of the CSE-CIC-IDS2018 dataset is strategically aligned with the objectives of this research, which seeks to enhance both the performance and explainability of IDS using tabular DL models and open-source LLMs.

3.4.1 Realistic and Diverse Attack Scenarios

CSE-CIC-IDS2018 is one of the most comprehensive publicly available IDS datasets. It simulates real-world network traffic in an enterprise environment using legitimate applications (e.g., email, web browsing, file transfers) along with multiple categories of attacks, as detailed in Section 3.1.3. This diversity enables the evaluation of model robustness across a wide range of attack types and severities. As demonstrated in Sharafaldin et al. [30], the dataset closely mirrors actual threat behaviors observed in production networks, increasing the external validity of experimental results.

3.4.2 Rich Feature Set for Deep Learning Models

The dataset contains 85 features, including statistical flow metrics, temporal properties, flag counts, and packet-level summaries. The structured data format is ideal for DL architectures such as TabNet and AutoGluon ensembles, which are effective at learning complex interactions and hidden patterns. Unlike the unstructured data used in natural language processing (NLP) or vision tasks, tabular datasets such as CSE-CIC-IDS2018 pose challenges due to their intricate interdependencies across numerical and categorical fields. This makes them suitable for testing advanced tabular learning algorithms such as TabNet [4] and AutoGluon [5].

3.4.3 Explainability-Relevant Characteristics

The dataset's distinct features, such as "Fwd Packet Length Mean," "Bwd IAT Std," and "Flow Bytes/s," make it suitable for various feature attribution methods, including SHAP, LIME, and LLM-based natural language explanations. For example, an LLM can analyze model outputs using understandable domain-specific patterns (such as high IAT variance or repeated packet flags) to explain why a particular flow was classified as an attack. This explanation layer is more practical and significant in structured datasets with interpretable features, as highlighted by Houssel et al. [21].

3.4.4 Standard Benchmark for Reproducibility

The CSE-CIC-IDS2018 dataset is frequently used in research related to IDS, especially in studies that integrate DL and XAI. This choice of dataset ensures consistency, reproducibility, and adherence to community standards, which are crucial for both academic and practical impact. Recent works utilizing this dataset include:

• Ali (2024) [22]: Deep IDS with real-time Language Model feedback.
• Yang et al. [23]: Evaluation of Language Models on cyber security datasets.
• Wali et al. [24]: Random forest IDS with GPT-J explanations.

4 Design and Specification of the Proposed System

This chapter details the design of an IDS that combines tabular DL models with LLMs to enhance both detection precision and comprehensibility. The chapter initially provides an overview of the system's architecture before discussing the reasons for choosing TabNet and AutoGluon as models. These stages are illustrated in Figure 4.1, which visually represents the end-to-end pipeline of the proposed system. Subsequently, it addresses design aspects such as balancing performance and explainability, handling real-time inference issues, and ensuring user-friendliness for security analysts. The chapter concludes by explaining how LLMs are incorporated to produce natural language explanations for the model predictions, along with the criteria used to assess both performance and comprehensibility.

4.1 System Architecture Overview

The proposed system aims to enhance both performance and interpretability in IDS. It specifically concentrates on analyzing tabular cyber security data. The system employs a hybrid approach that combines traditional DL with LLMs to provide high predictive accuracy as well as meaningful explanations. The architecture can be broken down into the following key stages:

Input (CSE-CIC-IDS2018 Dataset) → Tabular Models (TabNet, AutoGluon) → Predicted Labels → LLM → Natural Language Explanation

Figure 4.1: System architecture for IDS with LLM explanation. Created using draw.io.

This pipeline serves two primary functions: firstly, it identifies suspicious activities that could potentially be malicious within network monitoring contexts; secondly, it assists a security analyst by explaining the rationale behind critical decisions made during real-world network surveillance.
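The end-to-end flow of Figure 4.1 can be expressed as a thin glue function. This is a hedged sketch rather than the thesis's actual code: llm_fn stands in for whichever LLM backend is used, flow_row is assumed to be a pandas Series of selected features, and the prompt format merely mirrors Listing 2:

def detect_and_explain(flow_row, tabnet_clf, autogluon_predictor, llm_fn):
    """Sketch of the Figure 4.1 pipeline: predict with both tabular
    models, then ask an LLM to narrate the result."""
    # TabNet expects a 2-D numpy array; AutoGluon expects a DataFrame row.
    tabnet_pred = tabnet_clf.predict(flow_row.values.reshape(1, -1))[0]
    ag_pred = autogluon_predictor.predict(flow_row.to_frame().T).iloc[0]
    prompt = (
        "You are a cybersecurity analyst.\n"
        f"Input features: {flow_row.to_dict()}\n"
        f"TabNet predicted: {tabnet_pred}\n"
        f"AutoGluon predicted: {ag_pred}\n"
        "Explain what this output means."
    )
    return tabnet_pred, ag_pred, llm_fn(prompt)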
4.2 Model Selection Rationale

An IDS working with tabular data should aim for a balance between high predictive accuracy, efficient training, and transparent decision-making. For this thesis, two main modeling frameworks were utilized: TabNet and AutoGluon, the latter of which incorporates several submodels, including NN_TORCH, GBM, CatBoost, and XGBoost. The selection was based on comparative advantages in both performance metrics and explainability affordances.

4.2.1 Why TabNet?

TabNet, developed by Google Cloud, is a DL model specifically tailored for tabular data. Unlike gradient boosting methods, TabNet learns feature importance dynamically via attention mechanisms during training, promoting both sparsity and interpretability. Its built-in explainability allows users to visualize feature attributions without external post-hoc tools like SHAP or LIME. TabNet has been successfully applied to IDS tasks with competitive accuracy and a significantly better interpretability profile than most black-box models [34]. Additionally, TabNet's architecture performs well across diverse tabular benchmarks while enabling native feature masking, which is crucial for understanding model decisions in cyber security.

4.2.2 Why AutoGluon?

AutoGluon is an AutoML framework from Amazon that enables automated model selection, hyperparameter tuning, and ensemble stacking. It integrates powerful base learners like XGBoost, CatBoost, and neural networks (NN_TORCH) to optimize performance without requiring extensive manual configuration. The benefit of AutoGluon in this context lies in its strong baseline performance, its ability to automatically ensemble diverse models, and its compatibility with bagging and stacking, which improve generalization [35]. Moreover, through AutoGluon's built-in diagnostics and support for SHAP explainability, analysts can still interpret the decisions made by ensemble models with reasonable clarity. This makes it a suitable complement to TabNet, whose internal mechanisms are more inherently interpretable.

4.2.3 Comparative Strengths

Research confirms that while gradient boosting models like XGBoost and CatBoost generally offer higher precision, TabNet is more interpretable by design and robust on feature-rich data [36]. In a cyber security context, especially where human review of decisions is required, TabNet's transparent structure can assist analysts in understanding anomaly patterns, whereas AutoGluon ensures performance-driven predictions under time constraints.

4.3 Design Considerations

The design of the proposed IDS is guided by key considerations involving the trade-off between performance and explainability, as well as requirements for real-time inference and analyst usability. These choices reflect the constraints often present in real-world cyber security environments, where both accuracy and trust are critical.

4.3.1 Trade-offs Between Performance and Interpretability

Modern tabular models tend to prioritize either accuracy or interpretability, but rarely both. A major design challenge was selecting models that offer competitive performance while remaining explainable enough to support security analysts.

• TabNet supports interpretability natively through its feature-masking attention mechanism, offering intuitive insights into feature importance for each prediction. However, its training time is typically higher than that of decision-tree-based models.

• AutoGluon, especially in its ensemble mode, achieves top-tier predictive accuracy by combining models like XGBoost, CatBoost, and NN_TORCH. These models are known for their strong generalization, but at the cost of opacity. The stacked ensemble makes it harder to trace the source of a prediction unless post-hoc explainers like SHAP are applied (a minimal sketch of such post-hoc attribution follows this section).

This trade-off was balanced by pairing AutoGluon with LLMs for explanation purposes, while using TabNet to visualize internal decision processes.
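One common pattern for post-hoc attribution over a stacked ensemble is model-agnostic SHAP, wrapping the predictor's probability output in a plain function. This is a sketch under stated assumptions, not the thesis's implementation: predictor, feature_cols, a small background sample, and X_test are assumed from the training stage, and KernelExplainer can be slow on large samples:

import shap
import pandas as pd

# Wrap AutoGluon's predict_proba so the model-agnostic KernelExplainer
# can probe the ensemble as a black-box function over numpy arrays.
def ag_proba(data):
    return predictor.predict_proba(
        pd.DataFrame(data, columns=feature_cols)
    ).values

explainer = shap.KernelExplainer(ag_proba, background[feature_cols].values)
shap_values = explainer.shap_values(X_test[feature_cols].values[:10])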
4.3.2 Real-Time Usability Considerations

Real-time intrusion detection requires low latency in both inference and explanation generation. In this project, several measures were adopted to meet these constraints:

• Model optimization: Training settings in AutoGluon, such as limiting the number of bagging folds and stacking levels, were tuned to prioritize lower inference times without significantly sacrificing performance.

• Batch prediction with LLMs: Outputs from TabNet and AutoGluon were collected and batch-fed to LLMs to generate explanations more efficiently. While LLM inference remains a bottleneck, using lighter models like Mistral-7B or LLaMA3-8B balances clarity and speed.

• Colab-based deployment: All training and inference were done using the free version of Google Colab. Despite hardware constraints, inference remained feasible with optimizations (e.g., TabNet batch size, use of virtual batch mode).

4.3.3 Usability for Security Analysts

Another critical factor was designing explanations in natural language that are interpretable by security personnel with limited ML expertise. This is achieved through:

• Natural-language LLM prompts that contextualize the model prediction based on raw feature values and model outputs.

• Use of models like Qwen2.5, Mixtral, and Mistral-7B-Instruct, which are instruction-tuned for providing reasoned outputs.

• Structured prompts that follow an analyst-friendly template: "Here is the traffic. Here's what the model predicted. Here's why that might be the case."

By aligning the system's outputs with the cognitive workflow of an analyst, the system becomes not just a black-box predictor, but an intelligent assistant.

4.4 Tools and Platforms

This section details the software libraries, platforms, hardware, and runtime environments used in the development and deployment of the proposed IDS and LLM-based explanation pipeline.

4.4.1 Software Frameworks and Libraries

The implementation leveraged a diverse set of open-source frameworks, selected for their compatibility with tabular DL, AutoML, and LLM inference.

Model Training and Evaluation

Table 4.1: Software frameworks for model training and evaluation

Framework          Purpose
AutoGluon          Automated model ensembling and tuning
TabNet (PyTorch)   Attention-based tabular DL model
scikit-learn       Metrics, splitting, and preprocessing
pandas, numpy      Data handling and numeric transformations

LLM Integration

Table 4.2: LLM-related libraries used for inference and backend integration

Library               Purpose
openai, transformers  Inference with instruction-tuned LLMs via HuggingFace and APIs
accelerate            Efficient inference on GPU or CPU
torch                 Backend for TabNet and NN_TORCH models

Visualization

Table 4.3: Visualization libraries used in the IDS and LLM explanation pipeline

Library      Purpose
matplotlib   Confusion matrix, bar charts, ROC curves
seaborn      Heatmaps and styled plots

4.4.2 Hardware and Execution Environments

Training and inference were conducted primarily on Google Colab, leveraging GPU acceleration when available.
Table 4.4: Execution hardware and runtime environment on Google Colab

Component            Details
CPU                  Intel Xeon (Colab VM)
GPU                  NVIDIA Tesla T4 / K80
RAM                  ~13 GB
OS                   Ubuntu 22.04 (Colab)
Python Version       3.10+
Runtime Environment  Google Colab (free tier)

For model runs requiring larger memory (e.g., Qwen2.5 [32], Mixtral [37]), inference was limited to one sample at a time to stay within memory constraints.

4.4.3 Training Efficiency and Resource Considerations

TabNet was sensitive to batch size and GPU memory. Use of virtual_batch_size and early stopping allowed training for up to 1000 epochs without exceeding memory limits. AutoGluon's best_quality preset, which enables extensive bagging and stacking, significantly increased training time (approximately 4 hours) but yielded improved model robustness. For LLMs such as Qwen2.5 [32] and Mixtral-8x7B-Instruct [37], the substantial VRAM requirements posed a challenge. This was mitigated through the use of quantized model variants and, where feasible, API-based offloading for inference.

4.4.4 Reproducibility and Open-Source Tools

The use of open-source frameworks ensures that the entire pipeline can easily be replicated and extended by other researchers. Specifically, AutoGluon and TabNet models were carefully saved and labeled to preserve training consistency. LLMs hosted by HuggingFace were employed using public identifiers, such as Mistral [38], DeepSeek [39], LLaMA 3 [40], and Qwen 3 [41]. This setup allows for reproducible inference. The code was organized in modular notebooks, which makes it portable to local setups or cloud platforms like AWS SageMaker and Azure ML with minimal configuration effort.

Summary Table

Table 4.5: Summary of platforms, models, frameworks, and system specifications

Category        Tools and Specs
ML Frameworks   AutoGluon, TabNet, PyTorch, scikit-learn
LLMs            Mistral [38], Qwen [32], DeepSeek [39], LLaMA3 [40], Mixtral [37]
Visualizations  Matplotlib, Seaborn
Hardware        Google Colab (free GPU), 13 GB RAM
OS & Runtime    Ubuntu (Colab), Python 3.10+

4.5 LLM Integration for Explainability

To address the interpretability gap in black-box models like ensemble learners or neural networks, this work integrates open-source LLMs as natural-language explainers. These models are used after the tabular classifiers make predictions, to explain what the decision likely means in the context of network traffic. This hybrid approach combines statistical learning with instruction-tuned LLMs, making model outputs accessible to security analysts and facilitating real-time monitoring, debugging, and trust-building.

4.5.1 System Design for LLM Integration

Once the prediction phase is complete, the system constructs an LLM prompt based on three components:

• Input Feature Snapshot: A row of tabular data (e.g., flow duration, packet size, flags).
• Predictions from TabNet and AutoGluon: These help triangulate decisions.
• True Label: The actual class for the given record.

An example prompt follows the structure shown in Listing 2. This context-rich, structured prompt ensures consistency across different LLM APIs.

Listing 2: Prompt for LLM explanation.

You are a cybersecurity analyst.
The following network traffic input was analyzed by two
Intrusion Detection models.

Input features: {'Flow Duration': 98342, 'Fwd Packet Length Max': 131,
..., 'Active Mean': 211.2}

TabNet predicted: DDoS
AutoGluon predicted: DDoS
True label: DDoS

Explain what this output means. Why might the models have made
this decision?

4.6 Choice of Open-Source LLMs

The models chosen represent a blend of performance, cost-efficiency, and instruction-tuned clarity. The chosen models, along with their strengths, are shown in Table 4.6.

Table 4.6: Comparison of Instruction-Tuned Language Models

Model Name               Developer    Size     Strength
Mistral-7B-Instruct      Mistral AI   7B       Compact, fluent, good at concise justifications
DeepSeek-R1              DeepSeek AI  R1       Broad reasoning, trained on technical datasets
Meta-LLaMA3-8B-Instruct  Meta         8B       Balanced reasoning and factuality
Mixtral-8x7B-Instruct    Mistral AI   Mixture  High-quality outputs from Mixture-of-Experts
Qwen2.5-14B-Instruct     Alibaba      14B      Strong multilingual and analytical reasoning
Qwen3-8B-Base            Alibaba      8B       Lightweight, still strong at tabular tasks

4.6.1 Prompt Engineering Strategies

To elicit high-quality explanations from LLMs, the system uses structured prompts with:

• Defined roles: e.g., "You are a cybersecurity analyst".
• Specific instructions: asking for reasons, feature focus, or behavior interpretations.
• Consistency: the same phrasing across models to enable performance comparison.

No fine-tuning was required; prompt design alone ensured alignment with the task.

4.6.2 Handling Failures and Output Storage

Each LLM query was wrapped in a try/except block to capture errors and ensure a fallback. Successful explanations were stored in a dictionary, explanations, keyed by model name, to facilitate later analysis (a minimal sketch is given at the end of this chapter).

4.7 Evaluation Criteria

To rigorously assess the effectiveness of the proposed IDS, both performance metrics and explainability measures are defined. These criteria ensure that the models not only detect intrusions accurately but also produce outputs that are interpretable and actionable for security professionals.

4.7.1 Performance Evaluation Metrics

Standard classification metrics are employed to evaluate model effectiveness on the CSE-CIC-IDS2018 dataset, as presented in Table 4.7.

Table 4.7: Performance Metrics Used

Metric     Description
Accuracy   Proportion of total correctly classified samples
Precision  Ability to correctly identify only the relevant instances (low false positives)
Recall     Ability to detect all relevant instances (low false negatives)
F1 Score   Harmonic mean of precision and recall, used when data is imbalanced

4.7.2 Explainability Evaluation Metrics

The integration of LLMs into the explanation pipeline necessitates new evaluation metrics, as they produce qualitative outputs. These models can be assessed using a combination of structured human judgment and quantitative criteria that mirror their practical value for cyber security analysts. The following metrics were employed to evaluate the quality of explanations generated by LLMs, as presented in Table 4.8.

Table 4.8: Explainability Metrics

Metric                  Description
Clarity                 Does the explanation use understandable language and terminology?
Relevance               Does the explanation refer to the most important input features from the model?
Interpretability Score  Expert-rated score from 1–5 based on how helpful the explanation is to a human analyst.
Factual Consistency     Does the explanation contradict the prediction or misrepresent the input features?
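To make the failure handling of Section 4.6.2 concrete, the sketch below wraps a HuggingFace text-generation pipeline in the try/except-plus-dictionary pattern described there. The helper name and generation settings are assumptions; the actual pipeline mixed local and API-based backends:

from transformers import pipeline

explanations = {}  # model name -> generated explanation (Section 4.6.2)

def query_llm(model_id, prompt):
    """Hedged sketch of the fallback-wrapped query described in 4.6.2."""
    try:
        generator = pipeline("text-generation", model=model_id)
        out = generator(prompt, max_new_tokens=256, do_sample=False)
        explanations[model_id] = out[0]["generated_text"]
    except Exception as err:
        # Store the error so failed queries remain visible in later analysis.
        explanations[model_id] = f"GENERATION FAILED: {err}"

prompt = "You are a cybersecurity analyst. Explain this IDS prediction: ..."
query_llm("mistralai/Mistral-7B-Instruct-v0.2", prompt)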
5 Implementation and Verification

This chapter details the implementation of the proposed IDS pipeline, which generates understandable explanations. The CSE-CIC-IDS2018 dataset is utilized to discuss the specifics of model training, performance evaluation, and explanation strategies. The validation framework is designed to ensure that the system meets both detection accuracy and explainability requirements, which are crucial for practical application in cyber security settings.

5.1 Model Training and Tuning

This section describes the practical training configurations used for both the TabNet and AutoGluon models. Training was conducted on the CSE-CIC-IDS2018 [3] dataset, after the preprocessing steps outlined in Chapter 3.

5.1.1 TabNet Training Configuration

TabNet is a DL architecture that uses sequential attention mechanisms to process tabular data, enabling it to selectively focus on the most important features during training. This method improves both predictive performance and model interpretability. The training pipeline utilized in our implementation is shown in Listing 3, which outlines the primary settings and model parameters. A brief overview of these configurations can be found in Table 5.1, emphasizing critical hyperparameters and training techniques.

Listing 3: TabNet training pipeline with attention-based feature selection and learning parameters.

import torch
from pytorch_tabnet.tab_model import TabNetClassifier

clf = TabNetClassifier(
    n_d=64, n_a=64,              # width of decision and attention layers
    n_steps=5,                   # number of sequential attention steps
    gamma=1.5,                   # feature reusage relaxation
    lambda_sparse=1e-4,          # sparsity regularization on the masks
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax',          # sparse attention masks
    scheduler_params={"step_size": 10, "gamma": 0.95},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    verbose=10,
    seed=42,
    device_name='cuda' if torch.cuda.is_available() else 'cpu'
)

Table 5.1: Summary of TabNet training configuration including core hyperparameters and training settings.

Parameter                Value
Batch Size               4096
Max Epochs               1000
Early Stopping Patience  20
Evaluation Sets          Train/Test split with metrics on accuracy

5.1.2 AutoGluon Training Configuration

AutoGluon automates the process of selecting, training, and ensembling predictive models with minimal user intervention. The training process employed in our study is outlined in Listing 4, which details the specific parameters and models utilized. A concise summary of these settings, including presets, ensemble configurations, and the top-performing model, can be found in Table 5.2. Hyperparameter tuning was restricted to learning-rate schedules and maximum iterations to control training cost, given the Colab hardware constraints described in Section 4.4.2.

Listing 4: AutoGluon training pipeline with model-specific hyperparameters and ensembling configurations.

from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label=label, path=save_path).fit(
    train_data=train_df,
    presets='best_quality',
    hyperparameters={
        'NN_TORCH': {'num_epochs': 500},
        'GBM': {'extra_trees': True, 'ag_args': {'name_suffix': 'ET'}},
        'CAT': {'iterations': 5000, 'learning_rate': 0.03},
        'XGB': {'n_estimators': 3000, 'learning_rate': 0.02},
    },
    feature_generator='auto',
    num_bag_folds=5,      # 5-fold bagging
    num_bag_sets=3,       # repeated 3 times
    num_stack_levels=2,   # two stacking levels
    time_limit=3600 * 4,
    verbosity=3
)

Table 5.2: Summary of AutoGluon training configuration including model types, ensemble strategy, and best model selection.

Parameter           Value
Presets             best_quality
Models              NN_TORCH, GBM (Extra Trees), CAT, XGB
Bagging & Stacking  Enabled to improve generalization
Bag Folds           5
Bag Sets            3
Best Model (L1)     LightGBMET_BAG_L1
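Closing this section, a sketch of the fit call that realizes the Table 5.1 settings for the Listing 3 classifier. The virtual_batch_size value is an assumption, as the thesis code for this step is not reproduced here:

# Companion to Listing 3: the fit call realizing the Table 5.1 settings.
# virtual_batch_size keeps ghost batch normalization within Colab memory.
clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric=["accuracy"],
    max_epochs=1000,
    patience=20,             # early stopping on validation performance
    batch_size=4096,
    virtual_batch_size=256,  # assumed value; thesis code not shown
)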
5.2 Performance Evaluation

This section presents the results of evaluating the two modeling approaches, TabNet and AutoGluon, on the CSE-CIC-IDS2018 dataset. The evaluation encompasses both predictive performance and explanation quality. The AutoGluon ensemble demonstrated the highest detection accuracy at 98.1%, closely followed by TabNet with 97.8%. Both models generally performed well across various attack categories, although performance differences were observed for certain minority classes. Moreover, the integration of an LLM provided clear and understandable explanations for predictions, enhancing the interpretability of results for security analysts. Detailed results for each model and explanation method follow in the subsequent subsections.

5.2.1 AutoGluon Model Results

The AutoGluon framework's strong predictive performance can be attributed to its ensemble learning capabilities, specifically its stacked ensemble strategy. This strategy allows for the combination of multiple base and meta-models, which enhances overall accuracy and robustness.

At Level 1, the most effective model was LightGBMET_BAG_L1, achieving an accuracy of 0.9796, a precision of 0.9669, and an F1-score of 0.9711. This model was the strongest base learner before further ensembling. LightGBM, known for its speed and efficiency on large datasets, performed well when combined with extra-trees bagging in this configuration.

Among the stacked ensembles, NeuralNetTorch_BAG_L3 emerged as the top performer. As a Level-3 model, it benefited from multiple layers of ensembling, capturing deeper interactions and nonlinear patterns in the data. It achieved an accuracy of 0.9809, a precision of 0.9634, and an F1-score of 0.9720, making it the best overall performer in the AutoGluon stack.

In addition to the best-performing model, several other models yielded competitive results and are summarized in Table 5.3. XGBoost_BAG_L2, LightGBMET_BAG_L3, CatBoost_BAG_L3, and NeuralNetTorch_BAG_L2 all showed competitive results. Among these, XGBoost_BAG_L2 had the highest F1-score at 0.9722, followed closely by LightGBMET_BAG_L3 with an F1-score of 0.9721. These models demonstrate the robustness and reliability of gradient boosting methods and deeper ensemble layers in capturing complex feature interactions. CatBoost_BAG_L3, known for its ability to handle categorical features, had a slightly lower F1-score but remained highly competitive. NeuralNetTorch_BAG_L2, a neural network model, also performed well, showing the effectiveness of DL in mid-tier ensemble layers.

Table 5.3: Other Notable Models

Model                  Accuracy  Precision  F1-Score
NeuralNetTorch_BAG_L3  0.9809    0.9634     0.9720
XGBoost_BAG_L2         0.9805    0.9672     0.9722
LightGBMET_BAG_L3      0.9805    0.9675     0.9721
CatBoost_BAG_L3        0.9801    0.9624     0.9711
NeuralNetTorch_BAG_L2  0.9798    0.9669     0.9711

5.2.2 TabNet Results

The TabNet model performed well on the classification task but did not outperform the top models within the AutoGluon ensemble. Specifically, it achieved an accuracy of 0.9776 and a weighted average F1-score of 0.97, which are competitive but lower than the best scores in the AutoGluon ensemble. Closer examination of the per-class performance revealed that TabNet had difficulty recognizing certain minority classes, particularly Infiltration and Web Attack.
These classes showed very low recall and F1-scores, suggesting that the model struggled to correctly identify instances of these attack types. This issue may be due to the class imbalance present in the CSE-CIC-IDS2018 dataset: some classes are underrepresented during training, which may have hindered the model's ability to generalize to them. The detailed class-wise performance of TabNet is presented in Table 5.4 and visualized in Figure 5.1.

Table 5.4: Per-Class Performance of TabNet

Class         Precision  Recall  F1-Score
Benign        0.97       1.00    0.99
DDoS          1.00       0.98    0.99
DoS           1.00       1.00    1.00
Botnet        0.99       0.98    0.99
Brute-Force   0.98       0.98    0.98
Infiltration  0.25       0.01    0.01
Web Attack    0.00       0.00    0.00

Figure 5.1: Per-class performance metrics showing Precision, Recall, and F1-Score for each network traffic class. The plot highlights strong performance for most classes but low scores for Infiltration and Web Attack.

The architecture of TabNet provides several distinct benefits despite these limitations. Its feature selection mechanism, which utilizes sparse attention and sequential feature masking, offers a significant degree of interpretability for its predictions. TabNet is therefore an appropriate choice in scenarios where model transparency and an understanding of feature importance are paramount.

5.2.3 Comparative Analysis: TabNet vs AutoGluon

A comparison between TabNet and the best-performing models from AutoGluon is presented in Table 5.5 and visualized in Figure 5.2. Although TabNet performs competitively, AutoGluon surpasses it across all evaluation measures, albeit by small margins. The highest accuracy (0.9809) and F1-score (0.9722) were obtained by AutoGluon using a stacked ensemble approach, in which XGBoost_BAG_L2 contributed the highest precision at 0.9672.

TabNet's advantages include improved interpretability due to built-in feature masking and enhanced transparency in XAI contexts. However, its performance is affected by class imbalance, particularly for rare attacks like Infiltration and Web Attacks. On the other hand, AutoGluon's ensemble design offers better generalization and robustness, although it necessitates external tools such as SHAP or LLM-based methods to provide transparency.

Table 5.5: Comparison of TabNet and AutoGluon (Best Model)

Metric      TabNet                             AutoGluon (Best)
Accuracy    0.9776                             0.9809
Precision   0.9641                             0.9672 (XGBoost_L2)
F1-Score    0.9691                             0.9722
Strengths   Interpretability, feature masking  Performance, ensemble generalization
Weaknesses  Class imbalance sensitivity        Limited transparency without SHAP or LLM

5.2.4 Per-Class Performance Analysis

The evaluation of each class showed that both TabNet and AutoGluon performed exceptionally well in detecting high-volume attacks like DDoS and DoS, with F1-scores exceeding 0.99. This performance can be attributed to the distinctive traffic patterns of these attacks, which differ markedly from normal traffic. However, for less frequent classes such as Web Attacks and Infiltration, the detection rates were lower. For instance, AutoGluon had an F1-score of 0.72 for Infiltration, while TabNet achieved a score of 0.69. The main reasons for these lower scores are the small amount of data available for these classes and their resemblance to normal traffic patterns.

Figure 5.2: Comparison of key metrics (Accuracy, Precision, F1-Score) between TabNet and AutoGluon.
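Per-class tables such as Table 5.4 can be produced in one call once predictions are available. A minimal sketch, assuming the fitted clf, the held-out X_test/y_test, and the label_enc encoder from preprocessing:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 as in Table 5.4; target_names maps the
# encoded integer labels back to the attack names from the LabelEncoder.
print(classification_report(
    y_test, clf.predict(X_test), target_names=label_enc.classes_, digits=2
))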
Comparing the two models, AutoGluon consistently outperformed TabNet on minority classes. This may be due to AutoGluon's ensemble learning strategy, which helps minimize overfitting on small class sizes. On majority classes, by contrast, both models performed comparably. From an operational standpoint, these results indicate that the system is highly dependable for detecting large-scale, high-volume attacks; however, further optimization is necessary to enhance the detection of subtle or rare intrusions.

5.3 LLM Explanation Outputs

After obtaining predictions from TabNet and AutoGluon on the validation dataset, the results were passed to several instruction-tuned open-source LLMs. These models were prompted to explain the outputs in natural language, helping bridge the gap between complex model behavior and human-understandable decision-making.

5.3.1 Purpose of LLM Integration

While traditional explainability methods such as SHAP provide technical insight, they can be difficult for non-experts to interpret. Open-source LLMs such as LLaMA3, Mistral, and Qwen2.5 can contextualize model decisions by generating detailed narratives, much as a security analyst might. This strategy follows the approach introduced by Houssel et al. (2024) [21], who showed that LLMs enhance the interpretability of IDS pipelines when used for post-hoc explanations.

5.3.2 Models Used and Prompting Strategy

The open-source LLMs queried are listed in Table 5.6.

Table 5.6: Overview of Selected LLMs Used

Model Name                  HuggingFace ID                        Size
Mistral-7B-Instruct [38]    mistralai/Mistral-7B-Instruct-v0.2    7B
DeepSeek-R1 [39]            deepseek-ai/DeepSeek-R1               R1
LLaMA3-8B-Instruct [40]     meta-llama/Meta-Llama-3-8B-Instruct   8B
Mixtral-8x7B-Instruct [37]  mistralai/Mixtral-8x7B-Instruct-v0.1  8x7B
Qwen2.5-14B-Instruct [32]   Qwen/Qwen2.5-14B-Instruct             14B
Qwen3-8B-Base [41]          Qwen/Qwen3-8B-Base                    8B

5.3.3 Sample Explanation Output

For a benign instance, both models predicted Benign, matching the true label. A summary of the LLM responses follows:

• Mistral-7B-Instruct: "The models classified this flow as benign due to low packet volume, symmetric forward/backward flow, and lack of suspicious flags or anomalies. No indicators suggest malicious activity."

• LLaMA3-8B-Instruct: "Typical network traffic: minimal packet exchange, regular durations, no use of TCP flags or bursty behavior. A legitimate UDP communication like DNS is likely."

• Qwen2.5-14B-Instruct: "Both models likely identified the short duration, equal down/up packet ratio, and absence of attack signatures as indicators of a safe, benign session."

• Mixtral-8x7B: "No signs of unusual packet structures or timing anomalies. Packet lengths and flows are in line with normal behavior."

These narratives demonstrate how LLMs transform statistical predictions into interpretable justifications, making them useful for security analysts with limited ML knowledge. The length of the generated explanations varied considerably across models, as shown in Figure 5.3.

Figure 5.3: LLM Explanation Word Counts across Models

5.4 Validation Techniques

Validation is essential to ensure that the developed IDS performs reliably not only on seen data but also generalizes well to unseen attack patterns. This section outlines the validation methodologies employed to assess the performance stability and error profiles of the TabNet and AutoGluon models.
5.4.1 Cross-Validation

To mitigate overfitting and assess generalization performance, k-fold cross-validation was employed within the AutoGluon training framework:

• Number of Folds: 5
• Repeated Sets: 3 bagging sets
• Stacking Levels: 2

This approach aligns with best practices in automated ML pipelines as described by Erickson et al. [5], which show that bagged cross-validation reduces variance and improves ensemble stability. For TabNet, a single holdout validation set was used due to memory and time constraints; however, early stopping based on validation loss was applied to prevent overfitting.

5.4.2 Confusion Matrix

Confusion matrices were used to analyze class-level prediction strengths and weaknesses.

Figure 5.4: TabNet Confusion Matrix

Figure 5.5: AutoGluon Confusion Matrix

Key Observations: The classification models showed excellent results in correctly categorizing network traffic as Benign, DDoS, or DoS. However, they struggled to identify instances of Infiltration and Web Attacks due to the imbalance in the dataset. The minority classes, specifically Infiltration and Web Attack, were not well represented, causing biased learning and decreased predictive accuracy for these categories. This issue is consistent with findings in the IDS literature, where underrepresented classes are often misclassified, as noted by Sharafaldin et al. [30].
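A minimal sketch of how matrices such as Figures 5.4 and 5.5 can be rendered with the visualization stack from Table 4.3; clf, X_test, y_test, and label_enc are assumed from earlier steps, and the row normalization is optional:

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Row-normalized confusion matrix: each row sums to 1, exposing the
# per-class recall that raw counts hide for rare attack types.
cm = confusion_matrix(y_test, clf.predict(X_test), normalize="true")
sns.heatmap(cm, annot=True, fmt=".2f",
            xticklabels=label_enc.classes_, yticklabels=label_enc.classes_)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()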
5.5 Key Findings

The study shows that the proposed IDS pipeline achieved a high level of accuracy and clear interpretability. The AutoGluon ensemble had the best accuracy at 98.1%, with TabNet following closely behind at 97.8%. Both models were particularly effective in identifying high-volume attacks like DDoS and DoS, attaining F1-scores exceeding 0.99. However, their performance dropped for minority attack classes such as Infiltration and Web Attacks, where the F1-scores were below 0.75.

The integration of LLMs significantly improved the interpretability of the results. The explanations generated by these models were generally clear, relevant, and consistent with the models' predictions, although occasional over-generalization was observed. Qwen2.5 and Mixtral-8x7B-Instruct produced the highest-quality explanations, while Mistral-7B-Instruct delivered the most concise and clear narratives.

In summary, these findings indicate that combining high-performance tabular DL models with LLM-generated explanations can bridge the gap between performance and explainability in IDS, resulting in a robust and analyst-friendly detection system.

5.6 Results in Relation to Research Questions

The experimental results directly address the three research questions as follows:

RQ1: Tabular DL models like TabNet and AutoGluon perform well on a cyber security dataset such as CSE-CIC-IDS2018. Both demonstrated high accuracy, with AutoGluon outperforming TabNet (98.1% vs. 97.8%). F1-scores for major attack types were over 0.99, indicating that tabular DL models are highly effective for IDS tasks on structured datasets.

RQ2: LLMs can generate accurate and understandable explanations for predictions made by tabular models. All tested LLMs produced coherent and relevant explanations: Qwen2.5 and Mixtral-8x7B-Instruct provided detailed narratives, while Mistral-7B-Instruct produced the clearest, most concise outputs. Readability analysis confirmed that the explanations were accessible to non-machine-learning experts.

RQ3: When integrating LLMs into the IDS pipeline, there is a trade-off between performance and explainability. While the integration did not significantly reduce detection performance, it did introduce additional processing time for explanation generation. This trade-off may be advantageous in analyst-facing scenarios where interpretability is crucial, but it may necessitate optimization for real-time environments.

5.7 Error Analysis

Examination of the confusion matrices reveals that most mistakes were made in identifying minority attack classes such as Infiltration, Web Attack, and Brute-force. For instance, Infiltration flows were frequently misclassified as Benign traffic due to their similarity in flow-level features and the low number of training samples for this class. In one example, a Brute-force attack instance was predicted as Benign by TabNet, and the explanation system faithfully narrated the model's (incorrect) reasoning rather than flagging the misclassification. LLMs sometimes repeat such erroneous classification reasoning without identifying the underlying confusion, which indicates that LLM explanations can reinforce model errors rather than challenge them. This suggests that improving the detection of low-prevalence attack types requires a combination of strategies, such as enhanced sampling, feature engineering, and potentially LLM-driven error-flagging mechanisms that highlight uncertainty or contradictory data indicators.

6 Discussion

This chapter offers a thorough assessment of the TabNet and AutoGluon models discussed previously. The evaluation focuses on two aspects: performance and explainability. These models are applied to the CSE-CIC-IDS2018 dataset to determine their effectiveness. The objective is to investigate the balance between accuracy, interpretability, and practical usability in real-world scenarios. Additionally, this chapter examines the role of LLMs in providing post-hoc explanations for predictions, which helps enhance the transparency of complex ML systems, particularly in intrusion detection applications.

6.1 Performance Analysis

In this section, we evaluate the performance of TabNet and AutoGluon for intrusion detection on the CSE-CIC-IDS2018 dataset. We assess their abilities using metrics such as accuracy, precision, and F1-score to compare their predictive strengths and weaknesses. The discussion focuses on the comparison between ensemble techniques and attention-based feature selection methods, and it investigates how class imbalance impacts model behavior in this context. Ultimately, this analysis aims to determine which model is more suitable for practical deployment in cyber security settings.

6.1.1 Insights from Model Performance Results

The comparative performance of TabNet and the AutoGluon-based models on the CSE-CIC-IDS2018 dataset reveals important insights into the trade-off between accuracy, robustness, and interpretability in tabular DL models for IDS. AutoGluon achieved superior predictive performance across most metrics. The best-performing AutoGluon model, NeuralNetTorch_BAG_L3, reached an accuracy of 0.9809 and an F1-score of 0.9720, outperforming TabNet's accuracy of 0.9776 and F1-score of 0.9691.
These results indicate that ensemble approaches, particularly those incorporating stacking (e.g., LightGBM, CatBoost, and XGBoost), offer better generalization on the imbalanced, high-dimensional data typical of IDS.

Per-class evaluation further reinforces AutoGluon's strength in detecting rare attack types. For instance, TabNet significantly underperformed in identifying minority classes such as Infiltration (F1-score: 0.01) and Web Attack (F1-score: 0.00), which AutoGluon handled more gracefully owing to its bagging and stacking mechanisms.

TabNet's internal attention-based feature selection, while valuable for explainability, sometimes compromises performance on skewed data distributions. This trade-off between feature interpretability and detection robustness is consistent with findings in comparative benchmarks, where AutoML frameworks like AutoGluon outperform single DL models on structured tabular data when precision is paramount [5].

6.1.2 Trade-offs Between Models

TabNet offers built-in interpretability through its sparse attention masks and decision steps. This allows for more transparent model behavior, a key requirement in cyber security decision-making workflows. However, its performance is notably sensitive to hyperparameters and class imbalance, limiting its effectiveness in real-world IDS applications without additional balancing techniques. AutoGluon, on the other hand, treats performance as the priority, using automatic hyperparameter tuning and ensemble stacking across diverse models. The trade-off here is opacity: individual decisions in an ensemble are difficult to trace unless post-hoc methods like SHAP or LLM-generated explanations are applied.

6.2 Explainability Evaluation

This section investigates the effectiveness of the system in explaining its predictions, utilizing both inherent model features and additional post-hoc techniques. The focus is on integrating open-source LLMs to produce natural language explanations of intrusion detection results. The aim is to determine how these explanations enhance human comprehension, trust, and usability in practical security applications. Factors such as clarity, truthfulness, and readability are taken into account when evaluating the quality of LLM outputs.

6.2.1 Qualitative Evaluation of LLM Output

To bridge the explainability gap introduced by black-box models, this work incorporated open-source LLMs, including Mistral-7B-Instruct, LLaMA3-8B-Instruct, Qwen2.5-14B, Qwen3, Mixtral-8x7B-Instruct, and DeepSeek-R1, to generate natural language explanations for the predictions made by TabNet and AutoGluon. Structured prompts containing input features, classification labels, and contextual notes were used to elicit explanatory responses from each LLM. A representative explanation generated by Mistral-7B-Instruct for a benign network flow was:

"The models classified this flow as benign due to low packet volume, symmetric forward/backward flow, and lack of suspicious flags or anomalies."

Such natural language rationales closely align with how cyber security analysts narrate incident justifications in technical documentation. Unlike traditional explainability tools such as SHAP or LIME, which provide abstract feature attributions or plots, LLM-generated explanations are inherently interpretable and contextually rich.
A qualitative evaluation of the outputs from each model revealed several key findings:

Clarity: All six LLMs produced fluent and grammatically sound outputs. Smaller models like Mistral-7B and Qwen3 tended to be more concise and direct, while larger models such as Qwen2.5 and Mixtral-8x7B produced longer, more elaborate explanations. DeepSeek-R1's output was notably simple and easy to read, potentially due to its significantly lower Flesch-Kincaid Grade Level (3.30), making it suitable for accessibility-focused environments.

Relevance: Most models consistently referenced relevant cyber security features such as Flow Duration, Fwd Packet Length Mean, and TCP flag indicators like PSH and URG. These were appropriately tied to the decision context, improving the traceability of predictions.

Factuality: Outputs from LLaMA3-8B and Mistral-7B remained mostly factual and closely aligned with model predictions. However, longer responses from Mixtral-8x7B occasionally included fabricated metric values or speculative language, which may impact trust in high-stakes settings.

Readability Metrics: To quantitatively assess the interpretability of LLM responses, the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES) were computed for each model using an online readability tool [42]. The results are summarized in Table 6.2 and visually represented in Figures 6.1 and 6.2.

This multi-metric evaluation reinforces the hypothesis that LLMs, particularly when guided by domain-specific prompts, can function as intelligible surrogates for black-box explanations in cyber security applications.

6.3 Analyst Usability Discussion

From a cyber security operations perspective, the use of LLMs dramatically enhances the accessibility of model decisions. Rather than forcing analysts to interpret numerical confidence scores or abstract feature importances, the system provides contextual narratives similar to those in incident response documentation.

While user testing was not formally conducted, output readability was evaluated using the Flesch-Kincaid readability metrics, which quantify how easy a text is to understand. The key readability measures include the Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease Score (FRES), average words per sentence (WPS), and average syllables per word (SPW). These metrics were calculated using the online Flesch-Kincaid Calculator tool [42]; a scripted alternative is sketched after Table 6.1. The Flesch Reading Ease scale is interpreted as presented in Table 6.1.

Table 6.1: Flesch Reading Ease score interpretation.

Score Range  Description       Grade Level
90–100       Very Easy         5th grade
80–90        Easy              6th grade
70–80        Fairly Easy       7th grade
60–70        Standard          8th–9th grade
50–60        Fairly Difficult  10th–12th grade
30–50        Difficult         College
0–30         Very Difficult    College Graduate
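As a scripted alternative to the online calculator used in the thesis, the same indices can be computed with the textstat package; this is a swapped-in tool, not the thesis's method, and it assumes the explanations dictionary populated earlier:

import textstat  # alternative to the online calculator cited as [42]

# FKGL and FRES for each stored explanation (compare Table 6.2; exact
# values depend on the generated text and may differ from the online tool).
for model_name, text in explanations.items():
    fkgl = textstat.flesch_kincaid_grade(text)
    fres = textstat.flesch_reading_ease(text)
    print(f"{model_name}: FKGL={fkgl:.2f}, FRES={fres:.2f}")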
Table 6.2 presents the detailed readability metrics of the selected LLMs. Figures 6.1 and 6.2 visualize the Flesch-Kincaid Grade Level and the Reading Ease Score, respectively.

Table 6.2: Readability Metrics for LLM Outputs (Measured via Flesch-Kincaid Calculator)

Model                  FKGL   FRES   Reading Level     WPS    SPW   Word Count
Mistral-7B-Instruct    14.05  29.79  College Graduate  19.91  1.85  219
DeepSeek-R1            3.30   84.48  6th Grade         7.36   1.36  1324
LLaMA3-8B-Instruct     13.81  31.27  College           19.79  1.84  376
Mixtral-8x7B-Instruct  14.63  32.91  College           24.00  1.77  168
Qwen2.5                12.75  39.81  College           20.32  1.73  386
Qwen3                  9.57   49.15  College           12.76  1.71  370

Figure 6.1: Flesch-Kincaid Grade Level (Lower is Easier)

Figure 6.2: Flesch Reading Ease Score (Higher is Easier)

6.4 Limitations

Despite promising results, the proposed system has several limitations that affect its generalizability, scalability, and trustworthiness in real-world settings. These limitations fall into two main categories: those related to data and system performance, and those related to LLM-generated explanations.

6.4.1 Dataset Coverage and Scalability

Imbalanced Class Distribution: The CSE-CIC-IDS2018 dataset [30] contains a highly skewed distribution of traffic types. For example, benign traffic accounts for over 80% of all records, while the Web Attack and Infiltration classes are severely underrepresented (9 and 1,204 samples, respectively). This skewed distribution hampers minority-class detection, particularly for TabNet, which relies on attention-based feature learning. Consequently, F1-scores for minority attacks were below 0.1 in several cases.

Synthetic Nature of the Dataset: Though realistic, the dataset simulates attacks in controlled environments using predefined scenarios. Real-world traffic includes noise, incomplete flows, and encrypted packets, factors that may confuse DL models and reduce generalizability.

Scalability: The current implementation is optimized for Google Colab, which limits scalability in real-time deployments. Batch inference using large models like Qwen2.5 or Mixtral [43] requires substantial VRAM and time per explanation (up to 4–5 seconds per instance), which may not meet the latency requirements of production SOC environments.

6.4.2 Limits of LLM-Generated Explanations

• Risk of Hallucination: A key risk in using LLMs for post-hoc explanation is factual inconsistency or hallucination, where the model makes incorrect or overly general statements that do not match the input features. For instance, in a few samples, LLMs inferred suspicious "login attempts" even though the input row had no such feature.

• Bias and Oversimplification: Instruction-tuned LLMs may default to template-like phrases (e.g., "low entropy," "symmetric flow") that sound coherent but lack specificity. This can mislead analysts by masking true anomaly patterns.

• Lack of Ground-Truth Explanations: There are no labeled ground-truth explanations in the dataset, which makes it hard to objectively assess the accuracy of LLM-generated narratives. Unlike classification metrics (e.g., F1-score), explanation quality relies on subjective evaluations, which vary across users.

• Security and Privacy Risks: Although not used in this study, deploying LLMs via external APIs (such as OpenAI or HuggingFace-hosted models) introduces data leakage risks, especially when sending sensitive network traffic features to cloud endpoints. This can be mitigated through local inference, but at the cost of infrastructure complexity.
6.5 Implications for Cyber Security

The integration of high-performance tabular models with open-source LLMs in an intrusion detection pipeline introduces several transformative possibilities for real-world cyber security operations, especially within SOCs.

6.5.1 Operational Integration into SOC Workflows

Modern SOCs rely on a combination of signature-based alerts (from tools like Snort or Suricata) and ML systems to flag suspicious activities. However, alert fatigue and lack of interpretability are persistent challenges. This hybrid system addresses both issues:

• Model accuracy ensures high-quality alerts, reducing false positives.
• LLM-generated explanations provide justifications in natural language, enhancing analyst trust and supporting rapid triage.

A typical workflow might include:

1. A network flow is processed through TabNet and AutoGluon.
2. If an attack is predicted, the output and relevant features are passed to an LLM.
3. The LLM returns a short, human-readable explanation, appended to the alert in the SOC dashboard.

This mirrors recent efforts by Oniagbi et al. [44] to embed LLMs in real-time triage systems, improving response speed and confidence in alert decisions.

6.5.2 Human-in-the-Loop Threat Analysis

The proposed hybrid approach enhances human-machine collaboration. Rather than replacing analysts, LLM explanations serve as assistants, offering contextual reasoning that complements human intuition. As observed in the frameworks proposed by Ghazal et al. [11], LLM-enhanced IDS enable analysts to:

• Ask "why" behind a prediction.
• Compare model outputs across different systems (e.g., TabNet vs AutoGluon).
• Understand the model's focus (e.g., flag usage, packet rates).

This leads to improved transparency, auditability, and adaptability through analyst feedback.

6.5.3 Compliance and Reporting Support

With emerging regulations such as the EU AI Act and NIST's XAI guidelines, security systems are increasingly expected to justify automated decisions. LLMs help fulfill this requirement by producing structured, interpretable narratives, which can be reused in incident reports, audit logs, and case files, saving time while improving clarity [45].

6.5.4 Future Directions in Cyber Defense

As cyber attacks grow in complexity, explainable ML systems are becoming essential for:

• Zero-day detection, where traditional signatures fail.
• Collaborative learning, enabling analysts to fine-tune models based on explanations.
• Multilingual and cross-domain defense, by adapting LLMs to process threat intelligence in various languages.

The framework outlined here lays the foundation for a semi-autonomous analyst assistant, capable of handling low-risk alerts, drafting incident summaries, and explaining anomalies to non-technical stakeholders.

7 Conclusion and Future Work

In this thesis, a hybrid framework was proposed that combines DL models with LLMs to enhance the performance and interpretability of IDS. By utilizing sophisticated tabular learners like TabNet and AutoGluon, and by generating natural language explanations with LLMs, the research aimed to address a significant issue in cyber security: improving the balance between predictive capability and human comprehension. The findings indicate promising advancements in detection accuracy and in the utility of model outputs for security analysts' understanding.
6.5.2 Human-in-the-Loop Threat Analysis

The proposed hybrid approach enhances human-machine collaboration. Rather than replacing analysts, LLM explanations serve as assistants, offering contextual reasoning that complements human intuition. As observed in frameworks proposed by Ghazal et al. [11], LLM-enhanced IDS enable analysts to:

• Ask “why” behind a prediction.
• Compare model outputs across different systems (e.g., TabNet vs AutoGluon).
• Understand the model’s focus (e.g., flag usage, packet rates).

This leads to improved transparency, auditability, and adaptability through analyst feedback.

6.5.3 Compliance and Reporting Support

With emerging regulations such as the EU AI Act and NIST’s XAI guidelines, security systems are increasingly expected to justify automated decisions. LLMs help fulfill this requirement by producing structured, interpretable narratives that can be reused in incident reports, audit logs, and case files, saving time while improving clarity [45].

6.5.4 Future Directions in Cyber Defense

As cyber attacks grow in complexity, explainable ML systems are becoming essential for:

• Zero-day detection, where traditional signatures fail.
• Collaborative learning, enabling analysts to fine-tune models based on explanations.
• Multilingual and cross-domain defense, by adapting LLMs to process threat intelligence in various languages.

The framework outlined here lays the foundation for a semi-autonomous analyst assistant, capable of handling low-risk alerts, drafting incident summaries, and explaining anomalies to non-technical stakeholders.

7 Conclusion and Future Work

In this thesis, a hybrid framework was proposed that combines DL models with LLMs to enhance the performance and interpretability of IDS. By utilizing sophisticated tabular learners like TabNet and AutoGluon, and by generating natural language explanations using LLMs, the research aimed to address a significant issue in cyber security: improving the balance between predictive capability and human comprehension. The findings indicate promising advances in detection accuracy and in the utility of model outputs for security analysts.

7.1 Summary of Contributions

The proposed framework was implemented and tested on the CSE-CIC-IDS2018 dataset, a collection of high-dimensional tabular data representative of real-world cyber security environments with significant class imbalance. Among the models evaluated, AutoGluon’s ensemble NeuralNetTorch_BAG_L3 demonstrated the highest accuracy (98.09%) and F1-score (97.20%), surpassing TabNet and other individual learners. These findings underscore the advantages of ensemble-based tabular DL in IDS tasks.

To enhance explainability, the model’s predictions were analyzed using several open-source LLMs, such as Mistral-7B [46], Qwen2.5-14B [47], Mixtral-8x7B [48], and LLaMA3 [49]. These models produced contextual justifications in natural language that were more comprehensible for security analysts and non-expert users. The quality of the explanations was assessed using both subjective and objective criteria, such as clarity, factual consistency, and interpretability scores. Models like Qwen and LLaMA3 showed consistent performance, providing explanations that were technically accurate and easy to comprehend. Readability analysis using indices such as the Flesch Reading Ease confirmed that LLM-generated explanations were broadly understandable and user-friendly, with per-model scores reported in Table 6.2.

A key strength of this work is its focus on open-source tools and reproducibility. The entire pipeline was constructed using AutoGluon, TabNet, and HuggingFace Transformers, organized in a modular manner to facilitate independent use and simple experimentation. All components are compatible with Google Colab, making the pipeline accessible to the wider research community and promoting transparency and collaboration.

7.2 Main Findings

This research demonstrates that performance and interpretability, traditionally viewed as a trade-off, can be jointly optimized in modern IDS through a hybrid approach combining tabular DL models with open-source LLM-based explanations. The following insights emerged from the empirical analysis.

7.2.1 Performance Uplift via Tabular DL Models

The use of advanced tabular models, particularly the AutoGluon ensemble stack, significantly enhanced detection accuracy (see Table 7.1). Despite hardware constraints (e.g., the limited Colab GPU), AutoGluon’s ensembling strategy yielded top-tier performance. TabNet, while slightly lower in accuracy, remained competitive and offered built-in interpretability through attention-based feature masking.

Table 7.1: Comparison of TabNet and AutoGluon (Best Model) [4], [5]

Model                               Accuracy   Precision   F1-Score
AutoGluon (NeuralNetTorch_BAG_L3)   0.9809     0.9634      0.9720
XGBoost_BAG_L2                      0.9805     0.9672      0.9722
TabNet                              0.9776     0.9641      0.9691

7.2.2 Enhanced Interpretability via LLMs

Post-hoc explanations generated by instruction-tuned LLMs provided contextual justifications that were more accessible and readable than traditional methods like SHAP or LIME.

Example (Mixtral-8x7B on benign traffic): “No signs of unusual packet structures or timing anomalies. Packet lengths and flows are in line with normal behavior.”

LLMs such as Mistral, Qwen2.5, and LLaMA3 consistently produced fluent, relevant explanations. The word count distribution across LLMs was within a practical range (40–70 words per explanation), as shown in Figure 5.3.
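The readability scores in Table 6.2 were obtained with an online calculator [42]; as a rough cross-check, the same Flesch indices can be computed locally, for example with the open-source textstat package (an assumption of this sketch, not the tool used in the study):

# Sketch: scoring an LLM explanation with textstat, which implements the
# standard Flesch-Kincaid Grade Level and Flesch Reading Ease formulas.
import textstat

explanation = (
    "No signs of unusual packet structures or timing anomalies. "
    "Packet lengths and flows are in line with normal behavior."
)

fkgl = textstat.flesch_kincaid_grade(explanation)  # grade level, lower is easier
fres = textstat.flesch_reading_ease(explanation)   # reading ease, higher is easier
words = textstat.lexicon_count(explanation)        # word count

print(f"FKGL = {fkgl:.2f}, FRES = {fres:.2f}, words = {words}")

Automating this scoring would also make it easy to track explanation readability per model over a whole evaluation run rather than for single samples.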
7.2.3 Robust Evaluation Metrics Support Findings

The proposed framework was assessed using both performance and interpretability metrics to determine its practical applicability in real-world cyber security scenarios. In terms of detection, AutoGluon achieved a balanced accuracy of 0.7113, suggesting that it retains meaningful discriminative power on minority classes as well as the majority classes, although performance on the rarest attack types remains limited (see Section 7.2.4). Furthermore, a high Matthews Correlation Coefficient (MCC) of 0.9437 was achieved, indicating strong agreement between predicted and true labels despite skewed class distributions. Key performance and interpretability metrics thus confirmed the system’s effectiveness.

Regarding explainability, the readability of the LLM outputs was assessed using standard indices. The Flesch-Kincaid Grade Level and Flesch Reading Ease scores varied considerably across models, as depicted in Figures 6.1 and 6.2: DeepSeek-R1 produced the most readable outputs (FKGL 3.30, FRES 84.48), while the other models wrote at roughly a college reading level. These findings indicate that, with an appropriate choice of model, the generated explanations can be both linguistically accessible and technically meaningful for a wide audience, including non-expert users and security analysts.

7.2.4 Class-wise Observation

While the overall system performance was strong, minority classes such as Infiltration and Web Attack were under-detected, especially by TabNet (see Table 5.4). This reveals ongoing challenges related to class imbalance.

Table 7.2: Per-Class Performance of TabNet

Class          Precision   Recall   F1-Score
Infiltration   0.25        0.01     0.01
Web Attack     0.00        0.00     0.00

Future work may address this using cost-sensitive learning or synthetic resampling methods; both options are sketched below.
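As an illustration of these two directions, the following sketch oversamples rare classes with SMOTE from the imbalanced-learn package and, alternatively, passes per-class weights to pytorch-tabnet’s weighted batch sampler. The synthetic data is a stand-in for the encoded CSE-CIC-IDS2018 feature matrix; none of this is part of the original pipeline.

# Sketch: two mitigations for the minority-class failure shown in Table 7.2.
# Assumes imbalanced-learn and pytorch-tabnet are installed.
from collections import Counter
from imblearn.over_sampling import SMOTE
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in, skewed roughly like the IDS data (90/8/2 percent).
X_train, y_train = make_classification(
    n_samples=5000, n_classes=3, n_informative=6,
    weights=[0.90, 0.08, 0.02], random_state=42)

# Option 1: synthetic resampling, creating new minority-class points.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Option 2: cost-sensitive sampling; pytorch-tabnet's fit() accepts a
# class->weight dict (or weights=1 for automatic inverse-frequency balancing)
# that biases its batch sampler toward rare classes.
counts = Counter(y_train)
class_weights = {c: len(y_train) / (len(counts) * n) for c, n in counts.items()}

clf = TabNetClassifier(seed=42)
clf.fit(X_train, y_train, weights=class_weights, max_epochs=10)
# (To use option 1 instead, fit on X_res, y_res without the weights dict.)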
7.3 Impact Reflection

This thesis contributes to the evolving landscape of cyber security by advancing IDS that are not only accurate but also transparent, explainable, and trustworthy. The proposed hybrid IDS framework integrates high-performing tabular DL models with post-hoc explanations from open-source LLMs, offering two key improvements.

7.3.1 Bridging the Performance–Explainability Gap in IDS

Historically, cyber security models have had to trade off predictive performance against interpretability. This work shows that these objectives can coexist:

• AutoGluon ensembles achieved state-of-the-art detection accuracy on the CSE-CIC-IDS2018 dataset, with the best model reaching an accuracy of 98.09% and an F1-score of 0.972.
• Post-hoc LLM explanations generated by models like Mistral-7B, LLaMA3-8B, and Qwen2.5 provided intuitive natural language justifications, helping analysts comprehend the reasoning behind predictions (Figure 5.3).

This bridge between “black-box accuracy” and “white-box transparency” improves operational trust in IDS pipelines.

7.3.2 Elevating Human-in-the-Loop Analysis

The integration of LLMs into the IDS pipeline empowers analysts to:

• Quickly understand why network traffic was flagged as malicious.
• Validate predictions and challenge misclassifications.
• Incorporate LLM-generated insights in incident response or compliance documentation.

This aligns with modern Security Operations Center (SOC) workflows, where explainability supports auditing and timely incident response [21], [22].

7.4 Future Work

The thesis combines tabular DL and LLM explanations for IDS, but there is room for improvement in several areas. Firstly, the current system processes data in batch mode, which limits its use in live environments. To address this, future work should focus on enabling real-time streaming pipelines by integrating tools like Apache Kafka or Flink for live data ingestion and by optimizing the TabNet and AutoGluon models with ONNX or TensorRT for faster inference. Asynchronous LLM pipelines can also be used for parallel explanation generation. Figure 7.1 illustrates a possible system design for real-time deployment.

Figure 7.1: System architecture sketch showing real-time streaming IDS with async LLM explanation flow.

Secondly, while general LLMs like Qwen2.5 and Mixtral produced useful explanations, their performance could be improved by fine-tuning on cyber security texts such as MITRE ATT&CK descriptions or CVE reports. This could help generate more concise and accurate outputs.

Thirdly, the study used a dated and imbalanced dataset (CSE-CIC-IDS2018), so future research should include newer datasets like CIC-IDS-2023, TON_IoT, or UNSW-NB15 to ensure generalizability. Table 7.3 compares these datasets in terms of attack variety and suitability for real-time testing.

Table 7.3: Dataset comparison matrix.

Dataset           Year   Attack Types   IoT Traffic   Real-Time Suitability
CSE-CIC-IDS2018   2018   7              No            Moderate
CIC-IDS-2023      2023   15+            Partial       High
TON_IoT           2020   20+            Yes           High
UNSW-NB15         2015   9              No            Low

Fourthly, combining SHAP with LLMs could provide both feature-level detail and readable summaries. SHAP can identify the top-K features influencing a decision, which are then passed to an LLM to explain the decision in natural language. Figure 7.2 shows how this hybrid pipeline might work, and a code sketch is given at the end of this section.

Figure 7.2: Flowchart showing SHAP–LLM hybrid explanation pipeline.

Finally, explanations should be evaluated in real analyst workflows through usability studies with SOC teams and integration into SIEM dashboards or Extended Detection and Response (XDR) systems to support adoption in practice. By exploring these areas, the proposed system can evolve into a practical and explainable IDS that supports both accurate detection and analyst trust.
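To make the SHAP–LLM hand-off of Figure 7.2 concrete, a minimal sketch follows. It assumes a fitted tree-based classifier (such as the bagged XGBoost learner) and a pandas DataFrame of flows; the prompt wording and function name are illustrative only, and the binary-classification case is assumed.

# Sketch of the Figure 7.2 pipeline: take the top-K SHAP attributions for
# one flow and fold them into an LLM prompt for narration.
import numpy as np
import shap

def top_k_shap_prompt(model, X, row_idx, prediction, k=5):
    explainer = shap.TreeExplainer(model)
    values = explainer.shap_values(X.iloc[[row_idx]])
    if isinstance(values, list):      # some shap versions return one array per class
        values = values[-1]
    contrib = np.asarray(values)[0]
    top = np.argsort(np.abs(contrib))[::-1][:k]   # indices of the top-K features
    lines = [
        f"- {X.columns[i]} = {X.iloc[row_idx, i]} (SHAP {contrib[i]:+.3f})"
        for i in top
    ]
    return (
        "You are a cybersecurity analyst. The IDS predicted "
        f"'{prediction}'. The most influential features were:\n"
        + "\n".join(lines)
        + "\nExplain this decision in plain language for a SOC analyst."
    )

Grounding the prompt in explicit attributions constrains the LLM to the features that actually drove the prediction, which should also reduce the hallucination risk discussed in Section 6.4.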
References

[1] R. Sommer and V. Paxson, “Outside the Closed World: On Using Machine Learning for Network Intrusion Detection”, in 2010 IEEE Symposium on Security and Privacy, Oakland, California, USA, 2010, pp. 305–316. doi: 10.1109/SP.2010.25.

[2] F. Doshi-Velez and B. Kim, “Towards A Rigorous Science of Interpretable Machine Learning”, 2017. doi: 10.48550/arXiv.1702.08608. arXiv: 1702.08608 [stat.ML]. [Online]. Available: https://arxiv.org/abs/1702.08608.

[3] Communications Security Establishment and Canadian Institute for Cybersecurity, CSE-CIC-IDS2018 Dataset, https://www.unb.ca/cic/datasets/ids-2018.html, created by CSE and CIC at the University of New Brunswick, 2018.

[4] S. Ö. Arik and T. Pfister, “TabNet: Attentive Interpretable Tabular Learning”, Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, pp. 6679–6687, May 2021. doi: 10.1609/aaai.v35i8.16826. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16826.

[5] N. Erickson et al., “AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data”, Mar. 2020. doi: 10.48550/arXiv.2003.06505.

[6] I. Ahmad, M. Basheri, J. Iqbal, and A. Raheem, “Performance Comparison of Support Vector Machine, Random Forest, and Extreme Learning Machine for Intrusion Detection”, IEEE Access, vol. PP, pp. 1–1, May 2018. doi: 10.1109/ACCESS.2018.2841987.

[7] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions”, in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 4765–4774. doi: 10.48550/arXiv.1705.07874.

[8] R. Bommasani et al., “On the Opportunities and Risks of Foundation Models”, ArXiv, Aug. 2021. doi: 10.48550/arXiv.2108.07258. [Online]. Available: https://arxiv.org/abs/2108.07258.

[9] H.-J. Liao, C.-H. Richard Lin, Y.-C. Lin, and K.-Y. Tung, “Intrusion detection system: A comprehensive review”, Journal of Network and Computer Applications, vol. 36, no. 1, pp. 16–24, 2013, issn: 1084-8045. doi: 10.1016/j.jnca.2012.09.004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1084804512001944.

[10] T. Ali and V. Kostakos, “HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs)”, arXiv preprint arXiv:2309.16021, Oct. 2023. doi: 10.48550/arXiv.2309.16021. [Online]. Available: https://arxiv.org/abs/2309.16021.

[11] T. M. Ghazal, J. I. Janjua, W. Abushiba, M. Ahmad, A. Ihsan, and N. A. Al-Dmour, “Cybersecurity Revolution via Large Language Models and Explainable AI”, in 2024 17th International Conference on Security of Information and Networks (SIN), Sydney, Australia, 2024, pp. 1–6. doi: 10.1109/SIN63213.2024.10871324.

[12] W. Kasri et al., “From vulnerability to defense: The role of large language models in enhancing cybersecurity”, Computation, vol. 13, no. 2, 2025, issn: 2079-3197. doi: 10.3390/computation13020030. [Online]. Available: https://www.mdpi.com/2079-3197/13/2/30.

[13] J. Zhang et al., “When LLMs meet cybersecurity: A systematic literature review”, Cybersecurity, 2025. [Online]. Available: https://link.springer.com/article/10.1186/s42400-025-00361-w.

[14] M. Rahman, T. Mohammad, and S. Virtanen, “Leveraging Large Language Models for Network Traffic Analysis: Design, Implementation, and Evaluation of an LLM-Powered System for Cyber Incident Detection”, thesis, University of Turku, 2024. [Online]. Available: https://www.utupub.fi/bitstream/handle/10024/179397/Rahman_Naeemur_Thesis.pdf.

[15] H. Alturkistani and S. Chuprat, “Artificial Intelligence and Large Language Models in Advancing Cyber Threat Intelligence: A Systematic Literature Review”, ResearchSquare, Nov. 2024. doi: 10.21203/rs.3.rs-5423193/v1. [Online]. Available: https://www.researchsquare.com/article/rs-5423193/latest.pdf.

[16] H. Xu et al., “Large Language Models for Cyber Security: A Systematic Literature Review”, arXiv preprint arXiv:2405.04760, 2025. arXiv: 2405.04760 [cs.CR]. [Online]. Available: https://arxiv.org/abs/2405.04760.

[17] M.-T. Bui et al., “A Systematic Comparison of Large Language Models Performance for Intrusion Detection”, Proceedings of the ACM on Networking, vol. 2, pp. 1–23, Nov. 2024. doi: 10.1145/3696379.

[18] C. Luca, “Automated Threat Detection and Mitigation Strategies Using Large Language Models (LLMs) in Secure Software Development”, ResearchGate, 2024. [Online]. Available: https://www.researchgate.net/publication/392695328.

[19] Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending certain Union legislative acts (Artificial Intelligence Act), https://eur-lex.europa.eu/eli/reg/2024/1689/oj, Official Journal of the European Union, 12 July 2024.

[20] M. Ribeiro, S. Singh, and C. Guestrin, ““Why Should I Trust You?”: Explaining the Predictions of Any Classifier”, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, J. DeNero, M. Finlayson, and S. Reddy, Eds., San Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 97–101. doi: 10.18653/v1/N16-3020. [Online]. Available: https://aclanthology.org/N16-3020/.
[21] P. R. B. Houssel, P. Singh, S. Layeghy, and M. Portmann, “Towards Explainable Network Intrusion Detection using Large Language Models”, in 2024 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT), Los Alamitos, CA, USA: IEEE Computer Society, Dec. 2024, pp. 67–72. doi: 10.1109/BDCAT63179.2024.00021. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/BDCAT63179.2024.00021.

[22] T. Ali, “Next-generation intrusion detection systems with LLMs: real-time anomaly detection, explainable AI, and adaptive data generation”, University of Oulu, 2024. [Online]. Available: https://oulurepo.oulu.fi/handle/10024/51067.

[23] S. Yang et al., “Large Language Models for Network Intrusion Detection Systems: Foundations, Implementations, and Future Directions”, arXiv preprint arXiv:2507.04752, 2025. [Online]. Available: https://arxiv.org/abs/2507.04752.

[24] S. Wali, Y. A. Farrukh, and I. Khan, “Explainable AI and Random Forest based reliable intrusion detection system”, Computers & Security, vol. 157, p. 104542, 2025, issn: 0167-4048. doi: 10.1016/j.cose.2025.104542. [Online]. Available: https://doi.org/10.1016/j.cose.2025.104542.

[25] A. Ghimire, G. Ghajari, K. Gurung, and L. Sah, “Enhancing cybersecurity in critical infrastructure with LLM-assisted explainable IoT systems”, arXiv preprint arXiv:2503.03180, 2025. [Online]. Available: https://arxiv.org/abs/2503.03180.

[26] M. A. Ferrag et al., “Generative AI in cybersecurity: A comprehensive review of LLM applications and vulnerabilities”, Internet of Things and Cyber-Physical Systems, vol. 5, pp. 1–46, 2025, issn: 2667-3452. doi: 10.1016/j.iotcps.2025.01.001. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2667345225000082.

[27] A. Khediri, H. Slimi, A. Yahiaoui, M. Derdour, H. Bendjenna, and C. E. Ghenai, “Enhancing Machine Learning Model Interpretability in Intrusion Detection Systems through SHAP Explanations and LLM-Generated Descriptions”, in 2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS), El Oued, Algeria, 2024, pp. 1–6. doi: 10.1109/PAIS62114.2024.10541168.

[28] M. Hassanin and N. Moustafa, “A comprehensive overview of large language models (LLMs) for cyber defences: Opportunities and directions”, arXiv preprint arXiv:2405.14487, 2024. [Online]. Available: https://arxiv.org/abs/2405.14487.

[29] A. Alnahdi and S. Narain, “Towards Transparent Intrusion Detection: A Coherence Based Framework in Explainable AI Integrating Large Language Models”, in 2024 IEEE 6th International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA), Washington, DC, USA, 2024, pp. 87–96. doi: 10.1109/TPS-ISA62245.2024.00020.

[30] I. Sharafaldin, A. Habibi Lashkari, and A. A. Ghorbani, “Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization”, in Proceedings of the 4th International Conference on Information Systems Security and Privacy - ICISSP, INSTICC, Funchal, Madeira, Portugal: SciTePress, 2018, pp. 108–116, isbn: 978-989-758-282-0. doi: 10.5220/0006639801080116.
[31] G. Liu, D. Caragea, X. Ou, and S. Roy, The Impact of Train-Test Leakage on Machine Learning-based Android Malware Detection, 2024. doi: 10.48550/arXiv.2410.19364. arXiv: 2410.19364 [cs.CR]. [Online]. Available: https://arxiv.org/abs/2410.19364.

[32] Qwen Team, Qwen2.5-14B-Instruct, 14B instruction-tuned model with long-context support; Hugging Face: Qwen/Qwen2.5-14B-Instruct; blog: “Qwen2.5: A Party of Foundation Models”, 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/.

[33] Dhoogla, CSE-CIC-IDS2018 - Data Cleaning Notebook, https://www.kaggle.com/code/dhoogla/cse-cic-ids2018-00-cleaning, 2022.

[34] A. Alabbadi and F. Bajaber, “An Intrusion Detection System over the IoT Data Streams Using eXplainable Artificial Intelligence (XAI)”, Sensors, vol. 25, no. 3, 2025, issn: 1424-8220. doi: 10.3390/s25030847. [Online]. Available: https://www.mdpi.com/1424-8220/25/3/847.

[35] V. Mohale and I. Obagbuwa, “Evaluating machine learning-based intrusion detection systems with explainable AI: enhancing transparency and interpretability”, Frontiers in Computer Science, vol. 7, May 2025. doi: 10.3389/fcomp.2025.1520741.

[36] P. Hermosilla, M. Díaz, S. Berríos, and H. Allende-Cid, “Use of Explainable Artificial Intelligence for Analyzing and Explaining Intrusion Detection Systems”, Computers, vol. 14, no. 5, 2025, issn: 2073-431X. doi: 10.3390/computers14050160. [Online]. Available: https://www.mdpi.com/2073-431X/14/5/160.

[37] Mistral AI Team, Mixtral 8x7B-Instruct, sparse Mixture-of-Experts model (8 experts × 7B); available as mistralai/Mixtral-8x7B-Instruct-v0.1 on Hugging Face, 2023. [Online]. Available: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1.

[38] Mistral AI Team, Mistral 7B-Instruct, arXiv preprint arXiv:2310.06825, 7B instruction-tuned model; outperforms LLaMA 2 13B, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.06825.

[39] DeepSeek AI Team, DeepSeek-R1, reasoning model; hosted on Hugging Face, 2025. [Online]. Available: https://huggingface.co/deepseek-ai/DeepSeek-R1.

[40] Meta AI Team, LLaMA 3 8B-Instruct, 8B instruction-tuned LLaMA 3 model; hosted on Hugging Face, 2024.

[41] Qwen Team, Qwen3-8B-Base, 8B base model of the Qwen3 series; released Apr. 28, 2025; Hugging Face: Qwen/Qwen3-8B-Base, 2025. [Online]. Available: https://huggingface.co/Qwen/Qwen3-8B-Base.

[42] SERP Ninja, Flesch-Kincaid Calculator, Online Tool, https://serpninja.io/tools/flesch-kincaid-calculator/, 2024.

[43] A. Yang et al., “Qwen2 technical report”, arXiv preprint arXiv:2407.10671, 2024. doi: 10.48550/arXiv.2407.10671. arXiv: 2407.10671 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2407.10671.

[44] O. Oniagbi, A. Hakkala, and I. Hasanov, Evaluation of LLM Agents for the SOC Tier 1 Analyst Triage Process, https://urn.fi/URN:NBN:fi-fe2024062457864, University of Turku Master’s Thesis, 2024.
[45] European Commission, Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act), https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206, 2023.

[46] Mistral AI, Mistral 7B: A Fast and Open Language Model, https://mistral.ai, 2023.

[47] Alibaba DAMO, Qwen 2.5: Large Language Models by Alibaba DAMO, https://huggingface.co/Qwen, 2024.

[48] Mistral AI, Mixtral of Experts: Sparse Mixture of Experts Language Model, https://huggingface.co/mistralai, 2024.

[49] Meta AI, LLaMA 3: Meta’s Next Generation of Open LLMs, https://ai.meta.com, 2024.

Appendix A Sample Code Snippets

A.1 TabNet Configuration

import torch
from pytorch_tabnet.tab_model import TabNetClassifier

clf = TabNetClassifier(
    n_d=64,                          # width of the decision prediction layer
    n_a=64,                          # width of the attention embedding
    n_steps=5,                       # number of sequential decision steps
    gamma=1.5,                       # feature re-use coefficient in the masks
    lambda_sparse=1e-4,              # sparsity regularization strength
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax',              # sparse attention mask type
    scheduler_params={"step_size": 10, "gamma": 0.95},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    verbose=10,
    seed=42,
    device_name='cuda' if torch.cuda.is_available() else 'cpu'
)

A.2 AutoGluon Training Setup

from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label=label, path=save_path).fit(
    train_data=train_df,
    presets='best_quality',
    hyperparameters={
        'NN_TORCH': {'num_epochs': 500},
        'GBM': {'extra_trees': True, 'ag_args': {'name_suffix': 'ET'}},
        'CAT': {'iterations': 5000, 'learning_rate': 0.03},
        'XGB': {'n_estimators': 3000, 'learning_rate': 0.02},
    },
    feature_generator='auto',
    num_bag_folds=5,                 # 5-fold bagging per model
    num_bag_sets=3,                  # repeated bagging rounds
    num_stack_levels=2,              # two-level stack ensembling
    time_limit=14400,                # 4-hour training budget (seconds)
    verbosity=3
)

Appendix B LLM Prompt-Responses

B.1 Prompt Template

You are a cybersecurity analyst. The following network traffic input was analyzed by two Intrusion Detection models.

Input features:
{'Protocol': 17.0, 'Flow Duration': 1239.0, 'Total Fwd Packets': 1.0,
'Total Backward Packets': 1.0, 'Fwd Packets Length Total': 60.0,
'Bwd Packets Length Total': 132.0, 'Fwd Packet Length Max': 60.0,
'Fwd Packet Length Min': 60.0, 'Fwd Packet Length Mean': 60.0,
'Fwd Packet Length Std': 0.0, 'Bwd Packet Length Max': 132.0,
'Bwd Packet Length Min': 132.0, 'Bwd Packet Length Mean': 132.0,
'Bwd Packet Length Std': 0.0, 'Flow Bytes/s': 154963.680387409,
'Flow Packets/s': 1614.2050040355, 'Flow IAT Mean': 1239.0,
'Flow IAT Std': 0.0, 'Flow IAT Max': 1239.0, 'Flow IAT Min': 1239.0,
'Fwd IAT Total': 0.0, 'Fwd IAT Mean': 0.0, 'Fwd IAT Std': 0.0,
'Fwd IAT Max': 0.0, 'Fwd IAT Min': 0.0, 'Bwd IAT Total': 0.0,
'Bwd IAT Mean': 0.0, 'Bwd IAT Std': 0.0, 'Bwd IAT Max': 0.0,
'Bwd IAT Min': 0.0, 'Fwd PSH Flags': 0.0, 'Bwd PSH Flags': 0.0,
'Fwd URG Flags': 0.0, 'Bwd URG Flags': 0.0, 'Fwd Header Length': 8.0,
'Bwd Header Length': 8.0, 'Fwd Packets/s': 807.1024780273438,
'Bwd Packets/s': 807.1024780273438, 'Packet Length Min': 60.0,
'Packet Length Max': 132.0, 'Packet Length Mean': 84.0,
'Packet Length Std': 41.569217681884766, 'Packet Length Variance': 1728.0,
'FIN Flag Count': 0.0, 'SYN Flag Count': 0.0, 'RST Flag Count': 0.0,
'PSH Flag Count': 0.0, 'ACK Flag Count': 0.0, 'URG Flag Count': 0.0,
'CWE Flag Count': 0.0, 'ECE Flag Count': 0.0, 'Down/Up Ratio': 1.0,
'Avg Packet Size': 126.0, 'Avg Fwd Segment Size': 60.0,
'Avg Bwd Segment Size': 132.0, 'Fwd Avg Bytes/Bulk': 0.0,
'Fwd Avg Packets/Bulk': 0.0, 'Fwd Avg Bulk Rate': 0.0,
'Bwd Avg Bytes/Bulk': 0.0, 'Bwd Avg Packets/Bulk': 0.0,
'Bwd Avg Bulk Rate': 0.0, 'Subflow Fwd Packets': 1.0,
'Subflow Fwd Bytes': 60.0, 'Subflow Bwd Packets': 1.0,
'Subflow Bwd Bytes': 132.0, 'Init Fwd Win Bytes': -1.0,
'Init Bwd Win Bytes': -1.0, 'Fwd Act Data Packets': 0.0,
'Fwd Seg Size Min': 8.0, 'Active Mean': 0.0, 'Active Std': 0.0,
'Active Max': 0.0, 'Active Min': 0.0, 'Idle Mean': 0.0, 'Idle Std': 0.0,
'Idle Max': 0.0, 'Idle Min': 0.0,
'kernel_feature_0': -0.2140459716320037,
'kernel_feature_1': -0.997931957244873,
'kernel_feature_2': -1.4372159242630005,
'kernel_feature_3': 0.0863630548119545,
'kernel_feature_4': -1.2609885931015017,
'kernel_feature_5': -0.1411018222570419,
'kernel_feature_6': -1.057768702507019}

TabNet predicted: Benign
AutoGluon predicted: Benign
True label: Benign

Explain what this output means. Why might the models have made this decision?

B.2 Response from Mistral-7B-Instruct

Based on the input features provided and the predicted and true labels from the Intrusion Detection models TabNet and AutoGluon, it appears that there is no evidence of a network intrusion or malicious activity in this specific network traffic session. Both models have predicted and identified the traffic as "Benign," which means normal or expected network behavior.

The models might have made this decision based on several factors. First, the flow duration is relatively short, which could suggest a typical, normal exchange of data between devices. Additionally, the number of packets forwarded and backward, and their sizes, are similar. Furthermore, the number of flags indicating special packet treatments like push (PSH) or urgent (URG) is zero. The absence of SYN, FIN, RST, and other flags that might indicate the beginning or termination of a connection also supports the benign classification.

The average packet size and segment sizes are also within reasonable bounds, indicating normal network traffic. The absence of any significant fluctuations or variances in packet sizes or inter-arrival times (IAT) further supports the benign classification.

Overall, the output of the models suggests that the analyzed network traffic conforms to expected network behavior, and there are no signs of malicious or anomalous activity. This could be due to normal communication between devices or standard network protocols in operation.

B.3 Response from DeepSeek-R1

Okay, so I need to figure out why both TabNet and AutoGluon predicted this network traffic as Benign, and the true label is also Benign. Let me start by looking at the input features provided. Hmm... Let's break down the features to understand the traffic characteristics.

First, the Protocol is 17.0. Protocols are usually numbers like TCP (6), UDP (17), so this is UDP. That's a connectionless protocol, which is commonly used for DNS, streaming, etc. Maybe less prone to certain attacks compared to TCP. Flow Duration is 1239.0 milliseconds? That's a bit over a second. No idea if that's normal, but maybe combined with other features it's relevant.

Total Fwd and Backward Packets are both 1.0. So only one packet in each direction. That's a very short exchange. Maybe a single request and response? Like a DNS query? Because DNS uses UDP often. Let's check the bytes. Fwd Packets Length Total is 60, and Bwd is 132. Let's see, UDP header is 8 bytes. So forward packet: 60 total length. IP header is 20 bytes, so data would be 60 - 20 (IP) - 8 (UDP) = 32 bytes. The backward is 132 total. So data would be 132-20-8=104 bytes. That might correspond to a DNS query and response. Query is smaller, response larger if it includes more data. DNS usually uses port 53, but that's not in the features here.
Fwd and Bwd Packet Length Max, Min, Mean, Std all are the same (60 for forward, 132 for backward) and Std dev 0.0. That makes sense because there's only one packet each way, so no variation.

Flow Bytes/s and Flow Packets/s: Maybe calculated based on duration and total bytes. Bytes per second: Total bytes is 60+132=192. Duration 1.239 sec. So 192/1.239 ≈ 154,966 bytes/s. That's what the feature shows, so matches. Packets/s would be (1+1)/1.239 ≈ 1.614 packets/s, but the numbers show 1614.2. Wait, that's packets per second? Wait, no—if the duration is in milliseconds, 1239ms is 1.239 seconds, total packets 2 (1 each way). So 2/1.239 ≈ 1.614 packets per second. The value in the features is 1614.205. Wait, but 1614 is way higher. Wait, maybe that's a typo? Wait, looking back: "Flow Bytes/s": 154963.680387409, "Flow Packets/s": 1614.205. So 154k bytes/s and 1614 packets/s. Wait, that's inconsistent with 2 packets in 1.239 seconds. Maybe my understanding is wrong. Wait, perhaps the flow duration is in milliseconds? Then duration would be 1239 ms = 1.239 seconds. Total packets 2, so 2/1.239 = 1.614 packets per second, but the value here is 1614.2. That's three orders of magnitude higher. Hmmm, maybe it's miscalculated? Wait, perhaps the field is 'Flow Packets/s' that's calculated as (Total Fwd + Total Backward) / (Flow Duration in seconds). Here, 1+1=2 packets. Flow Duration 1239 ms = 1.239 seconds. 2 / 1.239 ≈ 1.614 packets/s. But the value here is 1614.2. So that's different. Wait, but that's data from the input, so perhaps they used microsecond for duration? Wait, 1239.0 Flow Duration units are not specified. Maybe the time is in milliseconds. So 1239 milliseconds. But if we convert Flow Bytes/s, total bytes is 192 bytes (60 + 132). So if the duration is 1239 milliseconds (1.239 seconds), then bytes per second would be 192 / 1.239 ≈ 154.8k bytes/s. Which matches the value 154963.68 as approximately 154,963 bytes/s. Then packets per second is 2 packets / 1.239 ≈ 1.614, which is 1.614 packets/s. But the value given in the features is 1614.205. That's way higher. Wait that discrepancy is suspicious. Unless the Flow Duration is actually measured in microseconds. If the Duration is 1239 microseconds (1.239 milliseconds), then 192 bytes / 0.001239 seconds ≈ 154,963,680 bytes/s ≈ 154 MB/s. That would match the given Flow Bytes/s of 154k (but 154k is 154,000, which would be 0.154 MB/s). Hmm, perhaps there's a decimal error here? Alternatively, maybe the units for Flow Duration are in milliseconds but the Packets per second is calculated in a different way, or perhaps these features are per-direction. Wait, looking back at the input features, Flow Packets/s is listed as 1614.205, and in the explanation of features, maybe there's a mistake in the data. However, as this is user-provided data, perhaps there's an error, or maybe the Flow Duration was calculated differently, perhaps including other aspects.

Looking at Flow IAT Mean and Max, Min, all are 1239. So IAT is inter-arrival time. Since there are two packets (forward and backward), the IAT is computed as the time between them. But since there are only two packets, total IAT would be the time between them. But maybe in this case, the entire flow is just the duration, so the IAT between the two packets is 1239ms, and the statistics (mean, max, min, etc.) are all 1239. For Flow IAT, maybe computed as the time between first and last packet.
Similarly, for Fwd IAT and Bwd IAT: the forward direction has one packet, so there's no IAT there (hence all zeros), same for the backward. FPSH flags and so on are all zero, so no PUSH flags set. That's typical for UDP as well since it doesn't use flags like TCP. Also, the headers: Fwd Header Length is 8.0, same for Bwd. Which aligns with UDP headers being 8 bytes. TCP headers vary, but UDP is fixed. So that's another point suggesting this is a UDP flow.

Packet length statistics: Min 60, Max 132, Mean 84, Std of 41.57. Since there are two packets (60 and 132), their mean is (60 + 132)/2 = 96? Wait, but the mean given is 84. Wait, that doesn't make sense. Total packets are two: forward (60), backward (132). So average: (60+132)/2=96. Wait, but the Packet Length Mean is 84. So that's wrong. Wait, maybe it's calculated as (total bytes)/(total packets). Total bytes are 60 + 132 = 192. Divided by 2 packets is 96. But the given Packet Length Mean is 84. Hmm, that's a discrepancy. There's something wrong here, maybe a typo in the input data? That's a problem. Alternatively, perhaps there are more packets, but input says Total Fwd and Backward packets are both 1.0 each. So total packets 2. Unless there's some miscalculation. Wait, but in the features provided, the mean is 84. So why is that? Let me compute 60 and 132: their average is 96. Wait, maybe it's calculated differently. Wait, maybe they're including the headers? But length fields usually represent total IP packet size. Well, maybe it's an error in the data. But given that, the model's decision is based on this data, which may be incorrect. Or perhaps my calculations are wrong, but this seems odd. Let me check. Maybe other features.

The flags are all zero: FIN, SYN, RST, etc. So no TCP flags set. That's consistent with UDP. Down/Up Ratio is 1.0 (uplink and downlink are the same in terms of number of packets? Since one forward and one backward. So that matches. The Avg Packet Size is 126.0. Wait, total bytes 60+132=192, divided by 2 packets is 96. So 192/2=96, but here Avg Packet Size is 126. That's confusing. Perhaps there's a mistake here too. Either way, maybe the data is wrong, but assuming the data as given, the models probably are making their decision based on the given numbers, regardless of possible data issues.

Now, looking at features like 'Active Mean', 'Idle Mean' - those are zeros. Maybe indicating that there's no activity after the initial exchange, which makes sense with a single packet each way. Now, some kernel_features are present, which might be transformed or engineered features from the original dataset, like PCA or some other feature extraction. For example, kernel_feature_0 to 6, which have values like -0.21, -0.99, etc. but without context, it's hard to know what these represent.

So, putting this all together: this traffic is a UDP flow, with a short duration, one packet each way, no TCP flags, UDP headers, which might represent a DNS

B.4 Response from LLaMA3-8B-Instruct

**Network Traffic Analysis Output**

The provided output is the result of analyzing network traffic data using two Intrusion Detection systems (IDS): TabNet and AutoGluon. Both models have predicted that the network traffic is benign, which means they have classified it as normal and non-malicious.

**Model Output Breakdown**

* **TabNet Prediction:** Benign
  + TabNet is a type of deep learning model that uses attention-based mechanisms to focus on relevant features.
Its prediction of "benign" suggests that the model has identified patterns in the network traffic that are consistent with normal behavior.

* **AutoGluon Prediction:** Benign
  + AutoGluon is a meta-learning platform that can automatically select and combine multiple machine learning models. Its prediction of "benign" indicates that the model has identified similar patterns in the network traffic that are also consistent with normal behavior.

* **True Label:** Benign
  + The true label is the actual classification of the network traffic, which in this case is also "benign". This means that both models have correctly identified the network traffic as normal and non-malicious.

**Reasons for the Models' Decision**

There are several reasons why the models might have made this decision:

1. **Lack of Anomalous Patterns:** The network traffic data does not exhibit any anomalous patterns or features that are commonly associated with malicious activity. The models may have identified that the data is consistent with normal network behavior.
2. **Low Risk Indicators:** The data does not contain any indicators of high risk, such as a high number of suspicious packets, unusual packet sizes, or other malicious activity.
3. **Typical Network Traffic Characteristics:** The network traffic data exhibits typical characteristics of normal network traffic, such as a relatively low packet rate, small packet sizes, and no evidence of malicious activity.
4. **Model Training Data:** The models may have been trained on a dataset that is heavily biased towards benign network traffic, which could have led them to be more conservative in their predictions.

**Conclusion**

In conclusion, the output of the two Intrusion Detection systems suggests that the network traffic is benign, and both models have correctly identified it as normal and non-malicious. The reasons for this decision are likely due to the lack of anomalous patterns, low risk indicators, typical network traffic characteristics, and the model training data.

B.5 Response from Mixtral-8x7B-Instruct

This output means that both the TabNet and AutoGluon intrusion detection models have analyzed the given network traffic input features and predicted that the traffic is "Benign", which means it is normal, everyday traffic and does not exhibit any characteristics of a cyber threat or attack.

The models have made this decision based on the input features, which include various measurements related to the network traffic flow, packet length, header information, and flag counts. After analyzing these features, the models have determined that the traffic is benign and poses no threat.

The models might have made this decision because the input features do not show any significant deviations from what is typically observed in benign network traffic. For example, the traffic consists of only one packet in each direction, with similar packet lengths and no unusual flags set. Additionally, the flow duration is relatively short, and the flow and packet rates are within normal ranges. These and other factors suggest that the traffic is benign and safe.

B.6 Response from Qwen2.5

This output represents the analysis of network traffic data by two different Intrusion Detection Systems (IDS) models—TabNet and AutoGluon—and their predictions compared to the true label for that particular network traffic instance.
Both models predicted the traffic as "Benign," which aligns with the true label, indicating that the network traffic is not malicious or anomalous according to the models' criteria.

### Explanation of the Decision:

#### Network Traffic Characteristics:
- **Protocol**: UDP (Protocol 17)
- **Flow Duration**: 1239 milliseconds
- **Packet Details**: There is one forward packet and one backward packet. The forward packet has a total length of 60 bytes, while the backward packet has a total length of 132 bytes.
- **Flow Rates**: The flow generates around 1614 packets per second and transfers data at approximately 154,964 bytes per second.
- **Flag Counts**: No flags such as FIN, SYN, RST, PSH, ACK, URG, CWE, or ECE are observed in the packets.
- **Down/Up Ratio**: This is equal to 1, meaning the ratio of packets sent from the source to destination versus back from the destination to the source is balanced.

#### Model Predictions:
Both TabNet and AutoGluon classified this traffic as "Benign." Given the lack of unusual activity indicators such as multiple flag settings, high variance in packet sizes, irregular patterns in timing intervals, or any other anomalies, the models likely did not detect any suspicious behavior in the traffic.

#### Why the Models Made This Decision:
1. **Feature Analysis**: The models likely analyzed various features of the traffic, including the protocol type, packet sizes, flow duration, and rates. The fact that there are only two packets (one in each direction), and they have consistent sizes and no flagged segments, suggests normal communication.
2. **Lack of Anomalies**: There is no indication of any anomalies or suspicious activities like sudden spikes in traffic, unusual packet sizes, or irregular flag usage. All metrics fall within expected ranges for normal traffic.
3. **Consistency with True Label**: The true label being "Benign" supports the models' decisions. If the traffic had been malicious, it would likely show irregularities or deviations from typical traffic patterns, which are not present here.

In summary, the models correctly identified the network traffic as benign based on the absence of any suspicious characteristics and the consistency of the traffic with expected normal behavior.

B.7 Response from Qwen3

You are a cybersecurity expert. The output indicates that both the TabNet and AutoGluon models have classified the given network traffic as "Benign," and the true label is also "Benign." This means that the models correctly identified the traffic as normal and not malicious.

Explanation of the Output:
1. **Benign Classification**: The models have determined that the network traffic does not exhibit characteristics of a cyber attack or malicious activity. This is a positive outcome, as it means the traffic is likely legitimate.
2. **True Label**: The true label confirms that the traffic is indeed benign, which aligns with the models' predictions. This indicates that the models are performing well on this particular instance.

Why the Models Might Have Made This Decision:
1. **Normal Traffic Characteristics**: The input features suggest that the traffic is consistent with normal network behavior. For example:
   - The flow duration is 1239 seconds, which is not unusually long or short.
   - The packet lengths (60 and 132 bytes) are within typical ranges for benign traffic.
   - The flow bytes/s and packets/s are not excessively high, indicating no sudden bursts of traffic.
   - The protocol (Protocol: 17.0) is likely UDP, which is commonly used for benign applications like DNS or streaming.
2. **Lack of Malicious Indicators**: The features do not show signs of malicious activity, such as:
   - No SYN, RST, or FIN flags set, which are often associated with attack patterns.
   - No unusual header lengths or packet sizes.
   - No high variance in packet lengths or flow rates.
3. **Model Training**: The models (TabNet and AutoGluon) were likely trained on datasets that include examples of both benign and malicious traffic. If this traffic matches the patterns of benign traffic in the training data, the models would classify it as benign.
4. **Kernel Features**: The kernel features (e.g., 'kernel_feature_0', 'kernel_feature_1') are likely derived from complex transformations of the input data. These features might have captured subtle patterns that the models use to distinguish benign from malicious traffic.

### Conclusion:
The models' decision to classify the traffic as benign is consistent with the true label, indicating that the traffic is likely normal. The absence of suspicious patterns and the alignment with typical benign traffic characteristics support this conclusion.