ABSTRACT
Cardiovascular disease (CVD) has rapidly become a global health threat in the past few years. The disease causes 18.6 million deaths annually, and this figure is projected to rise to more than 23 million deaths annually by 2030. This research aimed to develop a CVD predictive model that is more accurate and robust than conventional models. Stacking is an efficient ensemble method for machine learning classification tasks and has been widely applied to CVD. By combining several algorithms, stacking offers a good trade-off between variance and bias and can therefore give more accurate and robust results. The study compared seven conventional algorithms with seven stacked algorithms and evaluated their performance with four evaluation metrics: accuracy, precision, recall, and F1 measure. The better-performing stacked algorithm was then validated with 10-fold cross-validation. The proposed model was built in the following stages: data description, retrieval, pre-processing, partitioning, normalization, feature selection, stacking, and model evaluation. The stacked algorithms outperformed the conventional algorithms in classification accuracy (73.62%), recall (71.24%), and F1 measure (72.86%). In precision, however, Decision Tree performed better, with 77.41%. Cross-validating the stacked model with K-fold improved its accuracy from 72.76% to 74.71%. The proposed model can be used in the primordial prevention strategy of the World Heart Federation. Medical practitioners can also use it as a CVD diagnostic tool. Future work could retrieve more data and investigate multi-level stacking and deep learning to examine the strengths and weaknesses of the proposed model.
TABLE OF CONTENTS
DECLARATION
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF EQUATIONS
LIST OF ABBREVIATIONS AND ACRONYMS
ABSTRACT
CHAPTER ONE
INTRODUCTION
1.1 Background Information
1.2 Statement of the Problem
1.3 Research Aim
1.4 Objectives of the Research
1.5 Research Question
1.6 Expected contribution
CHAPTER TWO
LITERATURE REVIEW
2.1 Research Gap
CHAPTER THREE
METHODOLOGY
3.1 Implementation flow of the Proposed Model
3.1.1 Data Description and Retrieval
3.1.2 Data Preprocessing
3.1.3 Data Partitioning and Normalization
3.1.4 Feature Selection
3.1.5 Stacking
3.1.6 Model Evaluation
3.1.7 Final Prediction
3.2 Data Analysis
3.3 Prototype Design
CHAPTER FOUR
RESULTS AND DISCUSSION
4.1 Data Description and Retrieval
4.2 Data Preprocessing
4.3 Data Partitioning and Normalization
4.4 Feature Selection
4.5 Machine Learning Algorithms Evaluation
4.6 Final Prediction and Experimental results
4.7 Comparative Analysis of Stacked Prototype Model
CHAPTER FIVE
SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
5.1 Summary of Findings
5.2 Conclusions
5.3 Limitations
5.4 Future Work
References
LIST OF TABLES
Table 3.1 Attributes of the Kaggle CVD dataset
Table 3.2 Attributes of the performance metrics
Table 4.1 Accuracy of the 7 Stacked and Conventional ML Algorithms
Table 4.2 Precision of the 7 Stacked and Conventional ML Algorithms
Table 4.3 Recall of the 7 Stacked and Conventional ML Algorithms
Table 4.4 F1 Score of the 7 Stacked and Conventional ML Algorithms
Table 4.5 Benchmark comparison of the proposed model with previous study
LIST OF FIGURES
Figure 2.1 Implementation flow of the Conceptual design of the proposed model
Figure 3.1 Implementation flow of the proposed model
Figure 3.2 K-fold cross-validation
Figure 3.3 Proposed stacked model
Figure 3.4 Linear Discriminant Analysis
Figure 3.5 Support Vector Machine
Figure 3.6 Random Forest
Figure 3.7 Decision Tree structure
Figure 4a The CVD Head Data Sample
Figure 4b The CVD Tail Data Sample
Figure 4.2a Null Values
Figure 4.2b Negative Values
Figure 4.3 Statistical computation
Figure 4.4 CVD Dataset partition
Figure 4.5 Data Normalization
Figure 4.6 The Pearson Correlation
Figure 4.7a Accuracy of the 7 conventional algorithms
Figure 4.7b Accuracy of the 7 Stack algorithms
Figure 4.8a Precision of the 7 conventional algorithms
Figure 4.8b Precision of the 7 Stack algorithms
Figure 4.9a Recall of the 7 conventional algorithms
Figure 4.9b Recall of the 7 Stacked algorithms
Figure 5a F1 measure of the 7 conventional algorithms
Figure 5b F1 measure of the 7 Stacked algorithms
Figure 5.1 All 4 performance metrics of the 7 conventional and Stacked algorithms
Figure 5.2 10 KFold Cross-Validation
Figure 5.3a Sample of Experimental results
Figure 5.3b Results of the sample experiment
LIST OF EQUATIONS
Equation 3.1 Min-Max Scaler
Equation 3.2 Pearson Correlation Formula
Equation 3.3 Linear Discriminant Analysis
Equation 3.4 Support Vector Classifier
Equation 3.5 Naive Bayes
Equation 3.6 Logistic Regression
Equation 3.7 Similarity Score of Errors
Equation 3.8 Random Forest Time Complexity
Equation 3.9 Random Forest Majority Voting
Equation 4.0 Decision Tree Gini Index
Equation 4.1 Decision Tree Information Gain
Equation 4.2 Accuracy
Equation 4.3 Precision
Equation 4.4 Recall
Equation 4.5 F1 Score
LIST OF ABBREVIATIONS AND ACRONYMS
CVD Cardiovascular Disease
DT Decision Tree
LDA Linear Discriminant Analysis
LR Logistic Regression
ML Machine Learning
NB Naive Bayes
RF Random Forest
SVC Support Vector Classifier
XGB Extreme Gradient Boosting
CHAPTER ONE
INTRODUCTION
1.1 Background Information
The World Health Organization defines CVD as a group of disorders of the heart and blood vessels that includes coronary heart disease, cerebrovascular disease, peripheral arterial disease, rheumatic heart disease, congenital heart disease, deep vein thrombosis, and pulmonary embolism. Heart attacks and strokes are usually acute events caused mainly by a blockage that prevents blood from flowing to the heart or the brain. The most common cause of the blockage is a build-up of fatty deposits on the inner walls of the blood vessels that supply the heart or the brain. Strokes can also be caused by bleeding from a blood vessel in the brain or by blood clots (WHO, 2021).
Cardiovascular disease has rapidly become a global health problem in the past few years. The World Heart Federation estimates that the disease causes 18.6 million deaths per year. Cardiovascular diseases account for 31% of global deaths, making them the world's leading cause of death. The federation estimates that 520 million people live with the disease (Aryal et al., 2020). Deaths from cardiovascular disease are projected to rise to more than 23 million per year by 2030 (Liu et al., 2022).
Computational intelligence and machine learning techniques have been widely used against CVD, a global health threat (Liu et al., 2022). Stacking ensemble learning is an efficient method for machine learning classification and regression tasks (Pavlyshenko, 2018). The stacking technique offers a good trade-off between variance and bias and thereby improves a model's overall performance. Combining the advantages of various algorithms is crucial for obtaining more accurate and robust results (Rajagopal, 2020).
The stacked ensemble learning approach is of significant interest in inpatient management, prediction, medical diagnosis, and related healthcare administration issues (Hu et al., 2020).
The disease's severe social and economic impacts make the stacking learning approach one of the major priorities in healthcare research.
CVD is on an upward trend because of risk factors such as low physical activity, smoking, and alcohol consumption (Hu et al., 2020). These factors were used as the independent variables to predict the presence or absence of CVD. The research therefore applied the stacked machine learning approach to CVD diagnosis by analyzing the disease's risk factors (Johnson et al., 2018; Cao et al., 2018). The study proposes a more accurate and robust stacked learning model that is intended to provide credible results in disease prediction.
1.2 Statement of the Problem
Datasets are currently growing explosively. As a result, they have become an integral part of scientific research in the biomedical discipline. The biomedical field includes datasets in genomics, medical imaging, and heart disease. To handle biomedical data effectively in healthcare, data processing and computational tools must interpret, transform, and analyze the data (Asgari & Mofrad, 2015; Sharma, 2016). Such datasets greatly aid research advances in CVD when stacked algorithms are applied to heart disease data. These advances will ultimately improve the prognosis, prediction, diagnosis, and early treatment of CVD, a life-threatening disease (Alizadehsani et al., 2019).
Machine learning is the principal approach to extracting knowledge from heart disease datasets in CVD research (Cao et al., 2018; Alizadehsani et al., 2019). A review of related work shows that accuracy and robustness remain a challenge in healthcare applications for disease prognosis, prediction, diagnosis, and early treatment (Sharma, 2016; Abdelaziz et al., 2018; Chowdhury et al., 2021; Ghosh et al., 2021). To overcome the global CVD threat, the challenges faced by conventional models are of crucial concern to scholars (Wang, 2019). Integrating multiple predictors increases the accuracy and robustness of a model (Wang, 2019). Another challenge is that less accurate single algorithms can lead to the misdiagnosis of patients (Narain, 2018). For this reason, the research proposes a predictive model of machine learning algorithms that addresses these challenges (Cao et al., 2018; Pandey et al., 2019).
A conventional model does not capture all the properties of a dataset (Wang, 2019). The stacking ensemble addresses this problem by combining different single algorithms, as indicated in Figure 4.5. Combining different level 0 (base) algorithms reduces the generalization error. The level 1 (meta) classifiers correct the prediction errors of the base classifiers to obtain better results for disease prediction. Stacking improves the generalization capability of the algorithms by limiting overfitting and thus gives higher prediction precision (Rajagopal, 2020). When single algorithms are combined through a meta-classification approach, the performance of the individual algorithms is enhanced and prediction accuracy improves (Rajagopal, 2020). Combining conventional algorithms therefore has the potential to yield more accurate predictive models (Hu et al., 2020). The research proposes a stacking ensemble predictive model that uses seven supervised learning techniques on a CVD dataset (Rajagopal, 2020).
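The two-level idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the study's implementation: the level 0 learners are simple one-feature threshold rules rather than the seven algorithms of the study, the level 1 learner is a lookup table over base predictions, the data are a hypothetical toy grid rather than the CVD dataset, and the helper names (fit_stump, fit_meta, fit_stack) are invented here. The level 1 learner is trained on out-of-fold base predictions, which is the usual way to keep it from learning the base learners' training-set optimism.

```python
from collections import Counter, defaultdict


def fit_stump(X, y, feature):
    """Level 0 learner: one-feature threshold rule with the fewest training errors."""
    best = None
    for t in sorted({row[feature] for row in X}):
        for sign in (1, -1):  # sign=1 predicts 1 when x >= t, sign=-1 when x <= t
            preds = [int(sign * (row[feature] - t) >= 0) for row in X]
            errors = sum(p != label for p, label in zip(preds, y))
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda row: int(sign * (row[feature] - t) >= 0)


def fit_meta(Z, y):
    """Level 1 learner: most common true label for each pattern of base predictions."""
    votes = defaultdict(Counter)
    for z, label in zip(Z, y):
        votes[tuple(z)][label] += 1
    fallback = Counter(y).most_common(1)[0][0]  # used for unseen patterns
    table = {z: c.most_common(1)[0][0] for z, c in votes.items()}
    return lambda z: table.get(tuple(z), fallback)


def fit_stack(X, y, features, k=5):
    """Train base learners per fold, then train the meta-learner on out-of-fold predictions."""
    n = len(X)
    Z = [[0] * len(features) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        for j, f in enumerate(features):
            stump = fit_stump(X_tr, y_tr, f)
            for i in held_out:
                Z[i][j] = stump(X[i])
    meta = fit_meta(Z, y)
    base = [fit_stump(X, y, f) for f in features]  # refit base learners on all rows
    return lambda row: meta([b(row) for b in base])


def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


# Hypothetical toy data (not the CVD dataset): the label is 1 only when
# both features are at least 5, which no single threshold rule can express.
X = [[a, b] for a in range(10) for b in range(10)]
y = [int(a >= 5 and b >= 5) for a, b in X]

single = fit_stump(X, y, 0)
stacked = fit_stack(X, y, features=[0, 1])
print("single rule accuracy:", accuracy(single, X, y))
print("stacked accuracy:", accuracy(stacked, X, y))
```

The sketch scores both models on the rows they were trained on, purely to keep it short; a real comparison would score them on a held-out partition. In practice a library implementation of stacking would normally be preferred over hand-written learners.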
The purpose of the research is to compare the performance of conventional supervised learning algorithms and stacked algorithms in CVD diagnosis. The research goal is to perform predictive analysis with seven stacked supervised learning classifiers for CVD diagnosis. The diagnosis approach is based on stacked machine learning techniques and an HTML graphical user interface for prediction.
1.3 Research Aim
The main aim of the study is to develop a predictive model that is more accurate and robust than conventional models. The research will identify the classifiers with the highest accuracy in predicting heart disease.
1.4 Objectives of the Research
The general objective of the study is to obtain an accurate and robust CVD prediction model using the stacking technique.
The specific objectives are:
i. To stack seven conventional machine learning algorithms with seven different meta-classifiers.
ii. To evaluate the performance of the seven conventional and seven stacked machine learning algorithms with four evaluation metrics: classification accuracy, precision, recall, and F1 measure.
iii. To develop a stacked model prototype for CVD diagnosis from the best-performing machine learning algorithms.
iv. To perform 10-fold cross-validation to fine-tune the stacked prototype model.
v. To compare the CVD prediction capability of the stacked prototype model with that of a previous study.
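As a hedged illustration of objectives ii and iv, the sketch below computes the four evaluation metrics from confusion-matrix counts and generates train/test index splits for K-fold cross-validation. The function names and the sample labels are hypothetical, the splits are contiguous and unshuffled for simplicity, and undefined ratios (for example precision when nothing is predicted positive) are returned as 0.0 by assumption.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = disease present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))

for train_idx, test_idx in k_fold_indices(25, k=10):
    print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

In a cross-validated evaluation, the model would be refit on each training split and the metrics averaged over the ten test splits.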
1.5 Research Question
Following from Section 1.2, the research addresses one comparative question.
Question: Which ML algorithms perform best for CVD prediction modeling: conventional algorithms or stacked ensemble algorithms?
1.6 Expected contribution
The research is expected to contribute a robust machine learning model that predicts CVD and supports medical practitioners as a diagnostic tool.