Insurance Fraud Detection Using Supervised Machine Learning and Explainable Artificial Intelligence

This publication is subject to copyright. It may be read and printed for personal use; commercial use is prohibited.

Online publication


Abstract

Insurance fraud is a significant problem that causes financial losses for insurance companies and higher prices for honest customers. Machine learning has been widely used in insurance fraud detection systems. However, fraudulent claims typically make up only a small fraction of the data, and standard machine learning models struggle with this extreme class imbalance. Additionally, the most advanced models are often black boxes, meaning their predictions are not interpretable to outside observers. This inability to explain decisions is problematic given the highly regulated nature of the insurance industry.

This thesis develops an insurance fraud detection system using machine learning models and tools from explainable artificial intelligence (XAI) research. A publicly available, labeled dataset of vehicle insurance fraud is used to train and evaluate the models. Based on the literature, four commonly used machine learning models are selected, trained, and evaluated: a logistic regression model and a decision tree serve as transparent baselines, and they are compared with two more advanced black-box ensemble models, random forest and eXtreme Gradient Boosting (XGBoost). The class imbalance problem is addressed with cost-sensitive learning. To make the black-box models interpretable, the thesis applies Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP). The goal is not only to detect insurance fraud effectively, but also to explain why specific claims are flagged as suspicious.

The trained models were evaluated using metrics that account for the imbalanced nature of the data. XGBoost was the highest-performing model, and random forest also outperformed the logistic regression and decision tree baselines. Both SHAP and LIME were successfully applied to the XGBoost model to generate explanations for its predictions, with SHAP proving the more robust and reliable of the two methods.
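The cost-sensitive learning mentioned above can be illustrated with a minimal sketch (not the thesis code): a logistic regression trained with a weighted log-loss in pure Python, where the hypothetical `pos_weight` parameter makes errors on the rare fraud class cost more than errors on legitimate claims.

```python
import math

def train_cost_sensitive_logreg(X, y, pos_weight=5.0, lr=0.1, epochs=200):
    """Logistic regression via SGD on a weighted log-loss: mistakes on
    the rare positive (fraud) class cost `pos_weight` times more than
    mistakes on the majority (legitimate) class."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            # Up-weight the gradient for fraud cases (yi == 1).
            grad = (pos_weight if yi == 1 else 1.0) * (p - yi)
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b

def predict(w, b, xi, threshold=0.5):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0

# Toy imbalanced data: ten legitimate claims, two fraudulent ones.
X = [[0.0]] * 5 + [[0.1]] * 5 + [[1.0], [1.1]]
y = [0] * 10 + [1, 1]
w, b = train_cost_sensitive_logreg(X, y)
```

In practice the thesis uses library implementations (gradient-boosted and ensemble models expose the same idea through class weighting); the sketch only shows how up-weighting the minority class shifts the decision boundary toward catching fraud.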
