ABSTRACT
Practical insurance fraud detection requires sufficient high-quality data from insurers to build effective models. However, insurance data is generally proprietary to individual insurance companies and is therefore not publicly available. Insurance datasets are also often imbalanced, which makes it challenging to develop fraud detection models that are not biased. Data privacy and class imbalance are thus two significant challenges when developing artificial intelligence applications in the insurance setting. In this study, we tackle both challenges and propose a decentralized, privacy-preserving federated approach based on an adjusted random forest model. The method applies asynchronous federated learning to the traditional adjusted random forest classifier and achieves higher performance and accuracy than the traditional centralized learning approach. With it, we achieve secure collaborative machine learning that allows quality federated fraud detection models to be trained on imbalanced data without sharing that data. Experiments on the Kaggle and Oracle insurance datasets demonstrate that the federated adjusted random forest classifier is more accurate and efficient than its non-federated counterpart. Our model is shown to be practical, efficient and scalable for real-life insurance fraud detection tasks.
Keywords: Fraud Detection, Federated Learning, Adjusted Random Forests, Feature Selection, Ensemble Methods.
TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENT
DEDICATION
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1 BACKGROUND
1.2 PROBLEM STATEMENT
1.3 RESEARCH OBJECTIVES
1.3.1 General Objective
1.3.2 Research Questions
1.4 JUSTIFICATIONS
1.5 CONTRIBUTIONS OF THE RESEARCH
1.6 SCOPE OF THE STUDY
CHAPTER 2: LITERATURE REVIEW
2.1 INTRODUCTION
2.2 INSURANCE FRAUD DETECTION METHODS
2.3 PRIVACY-PRESERVING MACHINE LEARNING METHODS
2.4 FEATURE ENGINEERING METHODS IN INSURANCE FRAUD DETECTION
2.5 RESEARCH GAP
2.6 CONCEPTUAL FRAMEWORK
CHAPTER 3: METHODOLOGY
3.1 INTRODUCTION
3.2 STUDY POPULATION
3.3 DATA COLLECTION
3.4 DATA ANALYSIS
3.4.1 Data Cleaning
3.4.2 Data Transformation
3.5 FEATURE ENGINEERING
3.5.1 Feature Engineering
3.5.2 Feature Selection
3.6 FEDERATED MODEL DESIGNS
3.6.1 Feature Alignment
3.6.2 Federated Adjusted Random Forests
3.6.3 Horizontal Federated Learning
3.6.4 Decentralized Architecture Design
3.6.5 Decentralized Algorithm Design
3.7 IMPLEMENTATION AND PROTOTYPING
3.8 MODEL EVALUATION
3.9 ETHICAL CONSIDERATIONS
CHAPTER 4: RESULTS AND DISCUSSIONS
4.1 INTRODUCTION
4.2 EVALUATION RESULTS
4.2.1 Classification Accuracy
4.2.2 Confusion Matrix
4.2.3 Classification Report
4.3 DISCUSSION
4.4 MODEL VERDICT
CHAPTER 5: CONCLUSION AND RECOMMENDATIONS
5.1 CONCLUSION
5.2 RECOMMENDATION
5.3 FUTURE RESEARCH
REFERENCES
APPENDICES
IRA STATISTICS
EMAIL CORRESPONDENCE TO REQUEST FOR INSURANCE DATA
LIST OF FIGURES
Figure 1 Conceptual Framework
Figure 2 Research Process
Figure 3 Features of various datasets
Figure 4 Features Correlation Heat-Map
Figure 5 Features correlating with fraud state
Figure 6 Data Analysis
Figure 7 Filling Missing Values
Figure 8 Data Transformation
Figure 9 Kaggle Dataset Selected Features
Figure 10 Oracle Dataset Selected Features
Figure 11 Feature Alignment
Figure 12 Bagging Using Adjusted Random Forest
Figure 13 Random Federated Forests
Figure 14 Decentralized Design
Figure 15 Federated Algorithm
Figure 16 Federated Random Forest Before Balancing-Kaggle Dataset
Figure 17 Federated Random Forest Before Balancing-Oracle Dataset
Figure 18 Balanced Federated Random Forest-Kaggle Dataset
Figure 19 Balanced Federated Random Forest-Oracle Dataset
Figure 20 Email to Head of Innovations IRA
Figure 21 Email to Head of Research and Innovation IRA
Figure 22 Email to the C.E.O and Commissioner for Insurance IRA
LIST OF TABLES
Table 1 Accuracy Score for Kaggle Dataset
Table 2 Accuracy for Oracle Dataset
Table 3 Classification Report Before Feature Selection-Kaggle Dataset
Table 4 Classification Report Before Feature Selection-Oracle Dataset
Table 5 Classification Report After Feature Selection-Kaggle Dataset
Table 6 Classification Report After Feature Selection-Oracle Dataset
Table 7 Models Average Performance
Table 10 IRA Statistics
LIST OF ABBREVIATIONS
ML Machine Learning
ANN Artificial Neural Network
FL Federated Learning
B2B Business to business
IRA Insurance Regulatory Authority
RFM Recency, frequency, and monetary
HOBA Homogeneity-oriented behavior analysis
CART Classification and Regression Trees
CHAPTER ONE
INTRODUCTION
1.1 Background
Insurance claims fraud (the filing of illegitimate claims) is recorded as the most commonly practised fraud globally, second only to tax fraud. The significant accumulation of liquid financial assets makes insurance companies susceptible to loot schemes and takeovers (Association of Certified Fraud Examiners, 2019). Insurance claims fraud occurs when the insured attempts to profit from premiums paid without complying with the terms of the insurance agreement (Association of Certified Fraud Examiners, 2019). Detecting fraud manually has always been costly for insurance companies. Small incidents that go undetected contribute immensely to the claims ratio. For example, the industry average incurred claims ratio (loss ratio) is 64.34%, with motor insurance accounting for 24.6% of total industry paid claims under general insurance business (Insurance Regulatory Authority, 2020).
The research community has focused on insurance fraud detection methods that require centralized datasets from specific insurers. There is a vast body of literature on fraud detection methods in the insurance industry. These methods, however, use data from specific insurers that might not be representative of the industry-wide fraud problem. Few attempts have been made to look at fraud detection methods from an industry perspective. The quality of the data needed to train predictive models is as important as the quantity required. Datasets must be representative and balanced to provide a better picture and avoid bias (Burri et al., 2019). Recent studies on claim analysis using machine learning have recorded data security challenges in implementing machine learning. The vast amounts of data required for machine learning create additional risk for insurance companies, because the increase in data collection and in connectivity among applications can lead to data leaks and security breaches. As a result, insurers struggle to provide relevant data for training machine learning models (Burri et al., 2019).
Significant studies have been conducted to explore the detection and prevention of insurance fraud. For example, Phua et al. (2010) have explored holistic and scientific approaches to fraud management. They observe that studies involving quantitative methods report limitations due to the lack of insurance data (Phua et al., 2010). Because of the evolving nature of insurance fraud, two challenges persist and have attracted the attention of researchers: the lack of sufficient insurance data and the class imbalance problem in claim datasets. Insurance fraud detection models are often biased because they minimize the overall error rate instead of attending to the minority class (Johannes & Rajasvaran, 2020). Studies have shown that the lack of primary insurance data and imbalanced datasets are challenges when developing machine learning models in insurance. Imbalanced datasets often produce biased models that cannot make correct predictions (Johannes & Rajasvaran, 2020). Insurance companies that adopt a centralized approach to claim fraud detection face a class imbalance problem, in which fraud incidents are far fewer than genuine claims (Guha et al., 2017).
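The bias described above can be illustrated with a small worked example. The sketch below uses made-up labels (a hypothetical 5% fraud rate, not a figure from this study) and a hypothetical helper, accuracy_and_recall, to show that a classifier which labels every claim as genuine scores a high overall accuracy while detecting no fraud at all.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return overall accuracy and recall on the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(pairs)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Illustrative labels only: 1 = fraud, 0 = genuine (assumed 5% fraud rate).
y_true = [1] * 50 + [0] * 950

# A degenerate model that predicts "genuine" for every claim.
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")
```

Because accuracy is dominated by the majority class, recall on the fraud class (or a full classification report) is the more informative measure for imbalanced claim data.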
1.2 Problem Statement
The research community has focused on insurance fraud detection methods that require centralized datasets to train models. Centralized machine learning methods often produce biased models that are not effective in detecting insurance fraud. The bias in insurance models is primarily attributed to two issues: the class imbalance problem in datasets, where fraud incidents are far fewer than genuine claims, and insufficient insurance data to train the models. For example, Johannes and Rajasvaran (2020) present a behavioural feature engineering approach for motor insurance fraud detection. However, they observe that insurance claim data is often imbalanced, with at least one class forming a tiny minority of the data. The work by Burri et al. (2019) provides an in-depth claim analysis using machine learning. The study, however, reports challenges in finding suitable data sources and in ensuring data security when implementing machine learning (Burri et al., 2019). There have been few attempts to look at fraud detection methods that benefit all insurance players instead of individual insurers.
Previous studies present a need for a quality fraud detection system that can tackle the class imbalance problem in the datasets. In such a system, participants who do not have sufficient data of their own can collaborate in building a quality model. Studies also present a need to implement privacy-preserving methods that can be used to train machine learning models without sharing data. The methods used in previous studies suffer drawbacks due to the quality of the data used to train fraud detection models; the studies have also reported challenges in accessing quality datasets from insurers (Burri et al., 2019).
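As a minimal sketch of the collaboration described above, the following Python example shows three hypothetical insurers each training a small ensemble on their own data and contributing only the learned model parameters to a shared forest. It is illustrative only and is not the implementation presented in Chapter 3. Several simplifications are assumptions made for this sketch: one-feature decision stumps stand in for full decision trees, the "adjustment" for class imbalance is interpreted as a class-balanced bootstrap sample per stump, aggregation is a simple synchronous concatenation rather than an asynchronous protocol, and all function names and data values are made up.

```python
import random


def balanced_bootstrap(rows, rng):
    """Sample equal numbers of fraud (1) and genuine (0) rows, with replacement."""
    fraud = [r for r in rows if r[1] == 1]
    genuine = [r for r in rows if r[1] == 0]
    n = min(len(fraud), len(genuine))
    return ([rng.choice(fraud) for _ in range(n)]
            + [rng.choice(genuine) for _ in range(n)])


def fit_stump(sample):
    """Return the threshold t minimising errors for the rule 'fraud if x >= t'."""
    best_errors, best_threshold = None, None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x >= threshold) != (y == 1) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold


def train_local(rows, n_trees, seed):
    """Runs inside one insurer; only the learned thresholds are shared."""
    rng = random.Random(seed)
    return [fit_stump(balanced_bootstrap(rows, rng)) for _ in range(n_trees)]


def predict(forest, x):
    """Majority vote over every stump in the combined forest."""
    votes = sum(x >= t for t in forest)
    return int(2 * votes > len(forest))


# Toy, made-up data per insurer: (feature value, label), label 1 = fraud.
insurers = {
    "A": [(x, 0) for x in range(1, 41)] + [(70, 1), (85, 1)],
    "B": [(x, 0) for x in range(5, 41)] + [(75, 1), (90, 1)],
    "C": [(x, 0) for x in range(10, 41)] + [(80, 1), (95, 1)],
}

# The shared model is the union of locally trained stumps; raw rows never move.
global_forest = []
for seed, rows in enumerate(insurers.values()):
    global_forest.extend(train_local(rows, n_trees=5, seed=seed))

print(len(global_forest), predict(global_forest, 99), predict(global_forest, 10))
```

The design point is that each participant exposes only model parameters (here, thresholds), so an insurer with very few fraud cases still benefits from patterns learned on the other participants' data.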
1.3 Research Objectives
1.3.1 General Objective
To implement a privacy-preserving federated machine learning framework for the insurance setting that can be used to train fraud detection models while preserving the privacy of data. The resulting model will be used to detect insurance claims fraud. We will evaluate the effectiveness of this framework in improving the prediction accuracy of insurance fraud detection models. The prediction accuracy of the model will be assessed against past insurance claims data.
1.3.2 Research Questions
1. What practical technique can be used to build quality insurance fraud detection models while preserving policyholders’ privacy?
2. How can quality fraud detection models be built from imbalanced insurance datasets?
3. What is the prediction performance of federated insurance fraud detection models?
4. Which feature engineering and selection methods are best suited to high-dimensional datasets?
1.4 Justifications
This research aims to implement a privacy-preserving machine learning architecture that will be used to train insurance fraud detection models. While machine learning methods such as classification and regression algorithms have been identified and studied in previous research (Burri et al., 2019), the studies do not show that such algorithms can train machine learning models while preserving the privacy of insurance data. In addition, little research has been done to show that decentralized methods can be used with imbalanced datasets to produce quality insurance fraud detection models. The broad topic of insurance fraud detection has received attention, including from insurers and government regulators. Still, decentralized, collaborative and privacy-preserving machine learning methods have not been the focus of that attention. Instead, while acknowledging the challenges in finding quality insurance data and the class imbalance problem in datasets, the research community currently focuses on centralized machine learning methods that are biased towards individual insurance companies (Dhieb et al., 2019).
The insurance industry, with hundreds of years of history, is characterized by fierce competition. Data has become a valuable resource that insurers have to protect, which hinders the development of solutions that benefit all industry players. Insurers struggle to release relevant data for training AI models (Burri et al., 2019). Ideas that add value to the industry, such as automated underwriting and automated claims fraud detection, require privacy-preserving machine learning methods. There is a need to introduce collaborative privacy-preserving approaches to machine learning and data science in insurance. This study will provide insights into insurance claims management practice by exploring privacy-safe methods that can be used to detect insurance claims fraud accurately.
1.5 Contributions of the Research
Research has reported problems in implementing machine learning in the insurance industry, including a lack of suitable data sources, data security concerns, and imbalanced datasets (Burri et al., 2019). Balanced datasets give a better picture and avoid bias in prediction. Imbalanced insurance data makes it challenging to produce quality models that can be shared across the industry. This research will draw recommendations on the performance of models built using imbalanced datasets, with the aim of improving the prediction accuracy of fraud detection models. The research seeks to demonstrate that using privacy-preserving methods to train a model on decentralized data preserves data integrity, improves prediction accuracy, and is therefore a practical approach to claims fraud prediction. The study will contribute to insurance claims management practice and claims fraud detection. In addition, it will contribute to the knowledge of privacy-preserving machine learning in insurance.
1.6 Scope of the Study
The study will be limited to the general category of insurance. This area of the insurance business is selected because it accounts for the insurance industry's highest-paid claims. The Insurance Regulatory Authority regulates thirty-six insurance companies offering the general category of insurance, any of which could be used in this research. However, to keep the project feasible, we select three major insurance companies. The companies under study need to be actively engaged in the motor class of insurance.