Design and Development of a Human-Centered Explainable Malware Classification System Using XAI and LLMs
Open access
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
Abstract
This thesis explores the interpretability challenges of AI-based cybersecurity systems. Artificial intelligence (AI) has significantly improved malware detection compared to traditional signature-based approaches. However, these AI-based systems often operate as “black boxes”: they give no rationale for their outputs, which makes the results difficult to trust. Security professionals require clear reasoning to make informed decisions, while non-technical users need simple explanations to understand the outcomes. To address this gap, this research proposes a human-centered explainable AI (XAI) framework that combines a classification layer with traditional XAI techniques such as LIME and SHAP. In the final layer, a large language model (LLM) generates clear and interpretable explanations for human users.
For this research, a balanced subset of the EMBER-2018 dataset containing Windows Portable Executable (PE) files in JSONL format was used. In the data extraction phase, 618 interpretable static features were extracted. In the classification layer, six models were implemented; XGBoost performed best, with 97.0% accuracy and an ROC-AUC score of 0.997. In the XAI layer, LIME and SHAP were applied, and both identified the compilation timestamp and high entropy as among the most important features. The LLM-based explanation layer uses lightweight local models (llama3.2:3b and deepseek-r1:1.5b), which take the top XAI features and a structured knowledge base as input. The LLM then converts these technical features into clear, human-understandable explanations for security analysts, security managers, and end users. Because the entire system operates locally without reliance on external cloud services, it enhances data security and eliminates the cost associated with API usage.
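The hand-off from the XAI layer to the LLM layer described above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the feature names, the knowledge-base entries, the audience instructions, the sign convention (positive attribution pushes towards "malicious"), and the helper names `top_features` and `build_prompt` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: feature name -> plain-language meaning.
# The real structured knowledge base is not described in the abstract.
KNOWLEDGE_BASE: Dict[str, str] = {
    "timestamp": "PE header compilation time; implausible values can indicate tampering",
    "section_entropy_max": "high entropy often suggests packed or encrypted content",
}

# Hypothetical per-audience instructions for the three user groups.
AUDIENCE_STYLE: Dict[str, str] = {
    "analyst": "Give technical detail and refer to the feature names.",
    "manager": "Summarise the risk and a recommended action in business terms.",
    "end_user": "Use simple, non-technical language.",
}


def top_features(attributions: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k features with the largest absolute attribution."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]


def build_prompt(label: str, probability: float,
                 attributions: Dict[str, float], audience: str, k: int = 3) -> str:
    """Assemble a text prompt from the prediction, top features and knowledge base."""
    if audience not in AUDIENCE_STYLE:
        raise ValueError(f"unknown audience: {audience}")
    lines = []
    for name, weight in top_features(attributions, k):
        # Assumed convention: positive weights push towards the malicious class.
        direction = "towards malicious" if weight > 0 else "towards benign"
        meaning = KNOWLEDGE_BASE.get(name, "no knowledge-base entry")
        lines.append(f"- {name} (contribution {weight:+.3f}, pushes {direction}): {meaning}")
    return "\n".join([
        f"The classifier labelled this file as {label} (confidence {probability:.1%}).",
        "Most influential features:",
        *lines,
        f"Explain this result. {AUDIENCE_STYLE[audience]}",
    ])


if __name__ == "__main__":
    # Made-up attribution values, purely for illustration.
    example = {"timestamp": 0.42, "section_entropy_max": 0.31,
               "imports_count": -0.05, "file_size": 0.01}
    print(build_prompt("malicious", 0.97, example, "end_user"))
```

The resulting string would then be passed to whichever local runtime serves the model; the abstract does not name one, so that step is left out here. Keeping prompt assembly separate from the model call means the same top features can be rephrased for analysts, managers, and end users without rerunning the classifier or the XAI step.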