ABSTRACT
Fraudulent activities have caused great losses in the healthcare industry worldwide. Fraud takes many forms, including upcoding and billing for services not rendered. Traditional fraud detection methods such as auditing and rule-based programming are no longer efficient because of the growing volume of data and the complexity of the medical claims billing process. New methods are needed to assist in fraud detection, and data mining and machine learning are well suited to the task.
The objectives of the research were to train a machine learning-based model that detects fraudulent health insurance claims, identify features in insurance claims that can be used for fraud detection, identify appropriate machine learning algorithms for fraud detection, compare the performance of different machine learning algorithms, and implement a prototype for detecting fraudulent health insurance claims.
The research explored the use of different machine learning methods to detect fraud, following the CRISP-DM process. The data went through four stages: data collection, in which data was obtained from an insurance company based in Kenya; data preprocessing and transformation, to ensure the data was clean; training, in which different models were trained on the data; and evaluation, in which the models were compared on their performance.
The benchmark and performance evaluation showed that the gradient boosting classifier performed best, with an accuracy of 90.0% and an AUC of 95.0%. Other models that performed well were the random forest, with an accuracy of 90%, and the ANN, with an accuracy of 88.0%. The models that performed poorly were logistic regression, with an accuracy of 59%, and Naive Bayes, with an accuracy of 47%. The gradient boosting tree classifier was then used to develop a prototype.
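For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and AUC can be computed from true labels and model scores. It is not the code used in the study: the function names, the 0.5 decision threshold, and the labels and scores are all hypothetical values chosen for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of claims whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen fraudulent claim is
    scored higher than a randomly chosen legitimate one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one claim of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = fraudulent) and model scores, for illustration only.
labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.3, 0.8]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(labels, predicted):.3f}")
print(f"AUC      = {roc_auc(labels, scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes how well the scores rank fraudulent claims above legitimate ones across all thresholds, which is why the two are reported side by side.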
TABLE OF CONTENTS
CHAPTER ONE
INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Objectives
1.3.1 Main Objective
1.3.2 Specific Objectives
1.4 Research Questions
1.5 Significance
1.6 Scope
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
2.2 History of Health Insurance
2.3 Health Insurance Categories
2.3.1 Public Health Insurance
2.3.2 Private Health Insurance
2.4 How Health Insurance Works
2.4.1 Health Insurance Companies
2.4.2 Health Insurance Subscriber
2.4.3 Health Service Provider
2.4.4 Insurance Clearing Houses
2.5 Health Insurance in Kenya
2.6 Health Insurance Fraud
2.7 Types of Fraud in Healthcare Systems
2.7.1 Healthcare Service Providers Fraud
2.7.2 Health Subscriber Fraud
2.8 Impact of Healthcare Fraud
2.9 Machine Learning Algorithms
2.9.1 Logistic Regression
2.9.2 Naive Bayes
2.9.3 Random Forests
2.9.4 Gradient Boosting Tree Classifier
2.9.5 Support Vector Machine
2.9.6 Artificial Neural Network
2.10 Related Work
2.11 Research Gap
2.12 The Process Model
CHAPTER THREE
RESEARCH METHODOLOGY
3.1 Introduction
3.2 Research Design
3.3 Overview of CRISP-DM
3.3.1 Business Understanding
3.3.2 Data Collection
3.3.3 Data Understanding
3.3.4 Data Preparation
3.3.5 Model Development
3.3.6 Evaluation
3.3.6.1 Confusion matrix
3.3.6.2 Classification Report
3.3.6.3 ROC - Area under the curve
3.3.7 Deployment
3.4 Prototyping
3.4.1 System Architecture
3.4.1.1 Users
3.4.1.2 Frontend
3.4.1.2 Backend
CHAPTER FOUR
RESULTS AND DISCUSSION
4.1 Introduction
4.2 Performance Metrics
4.2.1 Logistic Regression
4.2.2 Naive Bayes Classifier
4.2.3 Support Vector Machine
4.2.3 Random Forest Classifier
4.2.4 Artificial Neural Network
4.2.5 Gradient Boosted Tree Classifier
4.3 Comparison of Algorithms Used
4.4 The Prototype
4.5 Conclusion
CHAPTER FIVE
CONCLUSION
5.1 Introduction
5.2 Achievements
5.3 Limitations of the Study
5.4 Recommendation for future work
REFERENCES
List of Figures
Figure 2.1: Health Insurance Ecosystem
Figure 2.2: Healthcare Facilities Distribution in Kenya
Figure 2.3: The Conceptual Model
Figure 3.1: Research Model Architecture
Figure 3.2: Confusion Matrix Representation
Figure 4.1: AUC Logistic Regression
Figure 4.2: Confusion Matrix Logistic Regression
Figure 4.3: AUC Naive Bayesian Classifier
Figure 4.4: Confusion Matrix Naive Bayesian Classifier
Figure 4.5: AUC Support Vector Machine
Figure 4.6: Confusion Matrix Support Vector Machine
Figure 4.7: Hyperparameter Tuning Random Forest
Figure 4.8: ROC Random Forest
Figure 4.9: Confusion Matrix Tuned Random Forest
Figure 4.10: ROC Neural Network
Figure 4.11: Confusion Matrix Neural Network
Figure 4.12: ROC Gradient Boosting Classifier
Figure 4.13: Confusion Matrix Gradient Boosting Classifier
Figure 4.14: Graphical Comparison Analysis
Figure 4.15: Launch Page Prototype
Figure 4.16: Prototype Fraud Detection
List of Tables
Table 4.1: Hyperparameters used in Random Forest
Table 4.2: Hyperparameters used in Sequential Neural Network
Table 4.3: Hyperparameters used in Gradient Boosting Classifier
Table 4.4: Comparison Analysis
Acronyms
NHIF - National Health Insurance Fund
UHC - Universal Health Care
KDD - Knowledge Discovery in Databases
LEIE - List of Excluded Individuals/Entities database
IRA - Insurance Regulatory Authority
HIS - Health Information Systems
NN - Neural Network
SVM - Support Vector Machine
ANN - Artificial Neural Network
MLP - Multilayer Perceptron Neural Network
GBC - Gradient Boosting Classifier
CHAPTER ONE
INTRODUCTION
1.1 Background
Healthcare fraud is a type of white-collar crime that occurs when healthcare claims are filed dishonestly for profit. Many organizations across the world have lost large sums of money to healthcare fraud and corruption. Healthcare expenditure increases rapidly every year in many countries. Globally, approximately 10% of gross domestic product is spent on healthcare, and because of sources of inefficiency such as fraud and abuse, up to 10% of that money is wasted (Joudaki et al., 2016).
In the US, health insurance fraud is estimated to cost 80 billion US dollars annually, approximately 3% of national healthcare expenditure (Yao et al., 2014). In 2020, about 330 fraud offenders were charged in court. With losses in the sector increasing, the National Healthcare Anti-Fraud Association was formed to research healthcare fraud.
In China, the Chinese Insurance Regulatory Commission estimates that fraud cases lead to losses of about 10-30% of total income (Yao et al., 2014). In response to the escalating problem, the government proposed building an anti-fraud system in 2012. In 2021, the government introduced new regulations on health insurance fraud. The new law increased the penalties for fraudulent acts, setting a fine of five times the amount defrauded. Offenders may also be subject to a 3 to 12 month suspension of compensation by the fund.
In South Africa, the investigation firm Qhubeka Forensic Services indicated that the country's health system loses between half a billion and 1 billion US dollars to healthcare fraud. In Kenya, according to the NHIF report (“Strides Towards Universal Health Coverage For All Kenyans,” 2018), the NHIF loses about 10 billion Kenyan shillings every year to false medical claims. Between January and February 2021, the NHIF almost lost 27 million through only 15 service providers.
Many methods are used to defraud healthcare systems in both the private and public sectors. These include billing for services that were not performed, upcoding claims, exaggerating medical illness, receiving kickbacks, and phantom billing. Fraud in healthcare is now perceived as a serious social concern. It is a problem for insurance companies and governments alike, and there is therefore a great need for more effective detection and prevention methods.
1.2 Problem Statement
Traditional methods of detecting healthcare fraud, such as whistleblowing, planned audits, statistical methods, and pattern matching, are time-consuming and ineffective. As a result, organizations lose as much as 4 to 5% of their revenue to fraud. This has become a limiting factor in delivering affordable insurance premiums and quality healthcare to insurance subscribers. There is therefore a great need for automatic fraud detection systems.
1.3 Objectives
1.3.1 Main Objective
To develop and test a machine learning-based model for the detection of fraudulent health insurance claims.
1.3.2 Specific Objectives
1. To identify features in insurance claims that can be used for fraud detection.
2. To identify appropriate machine learning algorithms that can be used for fraud detection.
3. To implement a prototype for detecting fraudulent health insurance claims.
1.4 Research Questions
1. What features in health insurance claims can be used for fraud detection?
2. Which machine learning algorithms are most appropriate for the identification of fraudulent claims?
3. What is the performance of the machine learning algorithm selected for use in the detection of fraudulent medical claims?
1.5 Significance
This research will contribute to the domain of insurance fraud, and specifically healthcare fraud. It will give recommendations on which machine learning algorithms are appropriate for developing a system to detect fraudulent claims. When losses are minimized, patients will experience better and more affordable healthcare. When the prototype is implemented, it will enable insurers to detect fraudulent claims before payment is made, so that the perpetrators can be charged.
1.6 Scope
The study will use data from a local insurance company based in Kenya. The data will contain maternity claim records, including both fraudulent and non-fraudulent claims.