ABSTRACT
The academic performance and
retention of students are critical concerns for higher education institutions.
Identifying students at risk of failure early enough for effective intervention
remains a significant challenge, particularly in institutions like The
Polytechnic Ibadan, which rely on reactive, manual systems. This study
addresses this problem by designing and implementing a machine learning-based
system to predict student academic performance proactively. The research
leverages educational data mining techniques, applying the Logistic Regression algorithm to a dataset
compiled from academic records. The methodology encompasses data collection,
preprocessing, exploratory data analysis, model training, and evaluation. A
comprehensive system analysis of the existing processes at The Polytechnic
Ibadan was conducted, identifying key limitations such as delayed
identification of at-risk students and inefficient use of available data. The
proposed system aims to shift the paradigm from reactive to proactive by
providing early warnings, enabling educators and advisors to implement timely
support measures. By comparing multiple machine learning models, this research
seeks to identify the most effective and generalizable predictor for student
performance, thereby contributing to improved educational outcomes, resource
allocation, and institutional decision-making.
TABLE OF CONTENTS
i TITLE PAGE / COVER PAGE
ii CERTIFICATION
iii DEDICATION
iv ACKNOWLEDGEMENT
v ABSTRACT
CHAPTER ONE INTRODUCTION
1.1 INTRODUCTION
1.2 STATEMENT OF THE PROBLEM
1.3 JUSTIFICATION OF STUDY
1.4 AIM AND OBJECTIVES
1.5 SCOPE OF STUDY
1.6 METHODOLOGY
1.7 SIGNIFICANCE OF THE STUDY
1.8 DEFINITION OF TERMS
CHAPTER TWO LITERATURE REVIEW
2.1 BACKGROUND THEORY OF STUDY
2.1.1 MACHINE LEARNING
2.1.1.1 DECISION TREE
2.1.1.2 SUPPORT VECTOR MACHINE (SVM)
2.1.1.3 RANDOM FOREST
2.2 RELATED WORKS
2.3 CURRENT METHODS IN USE
2.4 APPROACH TO BE USED IN THIS STUDY
CHAPTER THREE SYSTEM INVESTIGATION AND ANALYSIS
3.1 BACKGROUND INFORMATION ON CASE STUDY
3.2 OPERATION OF EXISTING SYSTEM
3.3 ANALYSIS OF FINDINGS
(a) OUTPUT FROM THE SYSTEM
(b) INPUTS TO THE SYSTEM
(c) PROCESSING ACTIVITIES CARRIED OUT BY THE SYSTEM
(d) ADMINISTRATION/MANAGEMENT OF THE SYSTEM
(e) CONTROLS USED BY THE SYSTEM
(f) HOW DATA AND INFORMATION ARE BEING STORED BY THE SYSTEM
(g) MISCELLANEOUS
3.4 PROBLEMS IDENTIFIED FROM ANALYSIS
3.5 SUGGESTED SOLUTIONS TO PROBLEMS IDENTIFIED
CHAPTER FOUR SYSTEM DEVELOPMENT
4.1 SYSTEM DESIGN
4.1.1 OUTPUT DESIGNS
(a) REPORTS TO BE GENERATED
(b) SCREEN FORMS OF REPORTS
(c) FILES USED TO PRODUCE REPORTS
4.1.2 INPUT DESIGN
(a) LIST OF INPUT ITEMS REQUIRED
(b) DATA CAPTURE SCREEN FORMS FOR INPUT
4.1.3 PROCESS DESIGN
(a) LIST ALL PROGRAMMING ACTIVITIES NECESSARY
(b) PROGRAM MODULES TO BE DEVELOPED
(c) VTOC
4.1.4 STORAGE DESIGN
(a) DESCRIPTION OF DATABASE USED
4.1.5 DESIGN SUMMARY
(a) SYSTEM FLOWCHART
(b) HIPO CHART
4.2 SYSTEM IMPLEMENTATION
4.2.1 PROGRAM DEVELOPMENT ACTIVITIES
(a) PROGRAMMING LANGUAGE USED
(b) ENVIRONMENT USED FOR DEVELOPMENT
(c) SOURCE CODE
4.2.2 PROGRAM TESTING
(a) CODING PROBLEMS ENCOUNTERED
(b) USE OF SAMPLE DATA
4.2.3 SYSTEM DEPLOYMENT
(a) SYSTEM REQUIREMENTS
(b) TASKS PRIOR TO DEPLOYMENT
(i) HARDWARE/SOFTWARE ACQUISITION
(ii) PROGRAM INSTALLATION
(c) USER TRAINING
4.3 SYSTEM DOCUMENTATION
4.3.1 FUNCTION OF PROGRAM MODULES
4.3.2 USER MANUAL
CHAPTER FIVE SUMMARY, CONCLUSION AND RECOMMENDATION
5.1 SUMMARY
5.2 CONCLUSION
5.3 RECOMMENDATION
REFERENCES
APPENDICES
(a) PROGRAM FLOWCHART
(b) PROGRAM LISTING
(c) TEST DATA
(d) SAMPLE OUTPUT
CHAPTER ONE
INTRODUCTION
1.1 Introduction
The use of machine learning (ML)
algorithms in academia has gained significant attention in recent years due to the
increasing availability of educational data and advancements in ML techniques
(Yağcı, 2022). Using ML algorithms to predict students’ academic performance
can give valuable insights to educators, allowing them to identify at-risk
students who may need additional support, modify instructional techniques,
boost learning outcomes, tailor teaching approaches to specific students’
requirements, and increase student retention rates (Adnan et al., 2021). This
procedure promotes the growth of the educational system at higher institutions
because educators and policymakers can intervene early to prevent students from
falling behind and increase their chances of success. Applying ML algorithms to
predict student academic achievement can dramatically enhance educational
results and give valuable insights into the aspects contributing to academic
success (Alyahyan and Düştegör, 2020). Therefore, it is critical to carefully
assess these algorithms’ possible benefits and limitations and ensure they are
appropriately utilized. Aspects such as the type of ML algorithm employed,
the variables analyzed, and the assessment metrics used to determine prediction
accuracy were included among our investigation criteria. Applying ML
algorithms in education can transform how we approach teaching and learning,
whether through qualitative analysis, quantitative analysis, or a mix of the two, offering an
overall assessment of the results (Zhai, 2021). The potential
benefits of using ML algorithms to predict academic performance extend beyond
individual students and can positively impact society (Waheed et al., 2020). By
improving education outcomes, individuals are better equipped to contribute to
the workforce and society, leading to economic growth and social development. A
vast majority of the work in educational data analysis has been devoted to
developing machine learning models capable of accurately predicting students’
performance in specific contexts. However, the existing body of literature
often overlooks the crucial aspect of evaluating whether models can extend
beyond their original training settings and generalize robustly
across diverse student populations and learning environments.
This oversight raises concerns about the potential biases introduced by relying
solely on ‘best-performing’ models and neglecting the search for models that
exhibit superior generalization capabilities. Consequently, a pressing need
emerges to address this research gap by identifying and investigating the optimal
machine learning model that can be a predictive tool for assessing students’
performance. This pursuit emphasizes ensuring that the identified model
achieves accurate predictions and avoids any inherent bias stemming from
feature selection, thereby ensuring its applicability and effectiveness across
various educational contexts. This study contributes to advancing educational
data analysis practices by addressing these challenges and encouraging a
paradigm shift towards holistic and unbiased model evaluation and selection.
With the increased availability of
data from numerous sources, including learning management systems, online
platforms, and student records, ML algorithms can give significant insights
into student behavior, performance, and learning patterns (Yu et al., 2020). A
thorough examination of the literature on ML algorithms for forecasting
students’ academic achievement can offer a comprehensive understanding of the various ML
approaches employed, the parameters examined, and the prediction accuracy achieved
(Rastrollo et al., 2020). Institutions can profit from properly anticipating
student performance by concentrating on students who are more likely to perform
poorly and helping them improve their performance (Batool et al., 2023). ML
algorithms used to predict students’ academic achievement can give significant
insights to academics, instructors, and educational policymakers (Waheed et
al., 2020; Alyahyan and Düştegör, 2020). ML algorithms may effectively predict
students’ academic achievement by analyzing different academic and non-academic
criteria such as previous grades, attendance records, socioeconomic background,
and student behavior (Batool et al., 2023).
There has been growing interest in
using ML algorithms to predict students’ academic performance, and several
studies have explored the use of ML in this area, with promising results.
Several schools of thought regarding using ML to predict academic performance
have emerged. One school of thought focuses on using traditional statistical
methods, such as regression analysis and logistic regression. These methods
assume a linear relationship between the predictors and the outcome (for
logistic regression, between the predictors and the log-odds of the outcome). For
example, studies by Yaacob et al. (2019) and Waheed et al. (2020) used logistic
regression while El Aissaoui et al. (2020) used linear regression to predict
students’ academic performance. Another school of thought revolves around using
decision trees and random forests. Decision trees are hierarchical models that
predict the outcome variable through binary decisions. On the other hand,
random forests are an ensemble learning approach combining numerous decision
trees. Vijayalakshmi and Venkatachalapathy (2019), Altabrawee et al. (2019),
and Zhang et al. (2022), for example, employed decision trees and random
forests to predict students’ academic performance. A third school of thought
focuses on using neural networks (Baashar et al., 2022; Liu et al., 2022),
which are computational models inspired by the structure and function of the
human brain, to predict students’ academic performance. Neural networks are
beneficial when dealing with complex, non-linear relationships between
variables. Hybrid approaches also combine multiple ML methods to improve
prediction accuracy. For instance, a study by Francis and Babu (2019) used a
hybrid approach that combined logistic regression, decision trees, and neural
networks to predict academic performance based on students’ demographic
information, prior academic performance, and study habits. Each approach has
its strengths and weaknesses, and the choice of method depends on the research
question and the nature of the data.
Numerous factors can impact a
student’s academic performance, including individual factors, such as
motivation and self-regulation, and environmental factors, such as
socioeconomic status and school resources (de la Fuente et al., 2021).
Motivation is one of the most significant factors impacting students’ academic
performance. Deci et al. (2020) have
shown that students who are intrinsically motivated to learn, meaning that they
are motivated by their interest and enjoyment of the material, are more likely
to perform well academically. On the other hand, extrinsically motivated
students, meaning that they are motivated by external rewards such as grades or
praise, may be less likely to perform well if these rewards are not provided.
Another individual factor that can impact academic performance is
self-regulation, which refers to the ability to manage one’s learning and
behaviors. Students who can effectively regulate their learning by setting
goals, monitoring their progress, and seeking help when needed are likelier to
perform well academically (Feeney et al., 2023). Environmental factors can also
have a significant impact on academic performance (Asvio et al., 2022). For
example, students from lower socio-economic backgrounds may have less access to
resources such as high-quality schools, educational materials, and
extracurricular activities, which can negatively impact their academic
performance. Additionally, students who attend schools with fewer resources,
such as low-income schools, may be less likely to have access to experienced
teachers or advanced coursework, which can also impact academic performance.
Family support and involvement in education can also have an impact on student
performance (Garbacz et al., 2021).
Research has shown that students whose families are involved in their
education, such as providing support and encouragement, attending
parent-teacher conferences, and monitoring homework, are more likely to perform
well academically. To improve student concentration and collaboration in learning
in response to the COVID-19 pandemic, Nyarko et al. (2023) used a Discrete
Choice Experiment to investigate university instructors’ preferences for
current teaching strategies.
1.2 Statement of the Problem
Student success is of utmost importance for
educational institutions and society. Dropouts, failures, and declining
standards of education in higher institutions are on the rise. Identifying
students who are at risk of dropping out and implementing timely interventions
can greatly contribute to improving graduation rates and ensuring academic
success.
1.3 Justification of Study
The justification for this study is multi-faceted, stemming from critical gaps in
current educational practices and the transformative potential of machine
learning:
- Addressing Systemic Inefficiencies: The current system at The Polytechnic
Ibadan for monitoring student performance is reactive and manual. It
identifies problems only after examinations, when opportunities for
meaningful intervention are limited. This study is justified by the urgent
need to computerize and automate this process, making it proactive and
data-driven. This shift will allow the institution to support students
before they fail, rather than afterward.
- Improving Student Outcomes and Retention: A primary goal of any educational
institution is to ensure student success. By accurately predicting at-risk
students, the proposed system enables early and targeted interventions,
such as additional tutoring, counseling, or academic advising. This
proactive approach is expected to directly contribute to reducing dropout
rates, improving graduation rates, and enhancing overall academic standards.
- Optimizing Resource Allocation: Academic advisors and lecturers have limited
time and resources. This system will help them focus their efforts on the
students who need help the most, thereby increasing the efficiency and
effectiveness of student support services.
- Contributing to Educational Data Science: While many studies have developed
predictive models, a significant research gap exists concerning the
generalizability of these models across different contexts. This study aims
not only to build an accurate model but also to investigate which algorithm
(e.g., Random Forest vs. Logistic Regression) offers the most robust and
reliable predictions, contributing valuable insights to the field of
educational data mining.
- Enhancing Institutional Decision-Making: The predictive analytics generated
by the system can provide departmental and institutional leaders with
insights into course difficulty, curriculum effectiveness, and trends in
student performance. This data-driven intelligence can inform strategic
planning, curriculum review, and quality assurance processes, ultimately
strengthening the institution's academic offerings.
In summary, this study is justified
by its potential to transform the educational experience at The Polytechnic
Ibadan by leveraging technology to foster student success, optimize
institutional resources, and advance the application of machine learning in
education.
1.4 Aim and Objectives of the Study
Aim
The aim of this project is to predict students’ academic performance using random
forest and logistic regression algorithms.
Objectives
i. To collect and compile relevant academic datasets, including students’ grades,
demographic information, and behavioral records, from kaggle.com, in order to
help identify the various factors that affect students’ academic performance.
ii. To preprocess the collected data into a format suitable for machine learning by
handling missing values, normalizing variables, and encoding categorical
features.
iii. To explore and analyze the data to identify key academic and non-academic factors
that influence student performance.
iv. To evaluate multiple machine learning models (e.g., logistic regression and random
forest) to determine the most suitable for predicting students’ academic
outcomes.
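The preprocessing steps named in objective ii (handling missing values, normalizing variables, and encoding categorical features) can be sketched as follows. This is a minimal illustration using scikit-learn and pandas, not the study's actual pipeline: the column names and toy records are hypothetical, and the choice of median and most-frequent imputation is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy records; the column names are illustrative only
# and do not come from the study's Kaggle dataset.
records = pd.DataFrame({
    "prior_gpa": [3.2, np.nan, 2.1, 3.8],
    "attendance_rate": [0.90, 0.75, np.nan, 0.95],
    "gender": ["F", "M", "M", np.nan],
    "parent_education": ["secondary", "tertiary", np.nan, "tertiary"],
})

numeric_cols = ["prior_gpa", "attendance_rate"]
categorical_cols = ["gender", "parent_education"]

# Numeric features: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    [("num", numeric_steps, numeric_cols),
     ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,
)

X = preprocessor.fit_transform(records)
print(X.shape)
```

Fitting the preprocessor on training data only, and then reusing it on test data, avoids leaking test-set statistics into the model.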
1.5 Scope of the study
The goal is to design and implement
an ML model that predicts student academic performance based on historical
data, demographics, and behavioral factors. The data used for this research was
obtained from Kaggle, an online data repository where hundreds of datasets
are available for users to explore, analyze, and use in various
data science and machine learning applications.
The model will assist educators in identifying struggling students and
improving retention rates.
1.6 Methodology
1.6.1 Objective 1: To predict students’ final GPA
classified as pass or fail, given the grades of all the mandatory courses
· Collect academic records, including students’ grades
in all mandatory courses and their final GPA.
· Preprocess the data: normalize grade values, remove
missing values, encode the GPA classification into binary labels (e.g., Pass =
1, Fail = 0).
· Use classification algorithms (Logistic Regression,
Decision Tree, and Random Forest) to model the data.
· Train the models on the preprocessed dataset.
· Evaluate using classification metrics like Accuracy,
Precision, Recall, and F1-Score.
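The steps above can be sketched as a short scikit-learn script. This is an illustration under stated assumptions, not the study's implementation: the course grades are synthetic, the pass rule (mean score of at least 60) is invented for the demo, and the hyperparameters are arbitrary defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for mandatory-course grades (assumption: 6 courses, scores 30-100).
rng = np.random.default_rng(42)
grades = rng.uniform(30, 100, size=(400, 6))
# Assumed labelling rule for the demo only: Pass = 1 if the mean score is >= 60, else Fail = 0.
passed = (grades.mean(axis=1) >= 60).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    grades, passed, test_size=0.25, stratify=passed, random_state=42
)

models = {
    # Logistic regression benefits from normalized inputs, so it is wrapped with a scaler.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, {metric: round(float(value), 3) for metric, value in results[name].items()})
```

Reporting precision and recall alongside accuracy matters here because failing students are usually the minority class, and a model can score high accuracy while missing most of them.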
1.6.2 Objective 2: To identify the various factors that
affect students’ academic performance
· Gather additional academic and non-academic data
such as demographic data, attendance, family background, motivation levels,
study habits, etc.
· Perform exploratory data analysis (EDA) to identify
patterns and correlations.
· Use feature importance techniques (e.g., Gini
importance for Decision Tree/Random Forest, coefficients in Logistic
Regression) to rank the impact of different variables.
· Optionally apply dimensionality reduction techniques (e.g., PCA) to visualize key
influencing factors.
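As a rough illustration of the feature-importance step, the sketch below ranks variables by Random Forest Gini importance and by the absolute value of Logistic Regression coefficients. The feature names and the data-generating rule are hypothetical, chosen only so that some features matter and one is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names; not taken from the study's dataset.
feature_names = ["attendance", "study_hours", "prior_gpa", "family_size"]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Assumed ground truth for the demo: attendance and prior_gpa dominate,
# study_hours has a small effect, and family_size has none.
signal = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 1.5 * X[:, 2]
y = (signal + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_scaled, y)
logreg = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# Gini (impurity-based) importances sum to 1 across features.
gini_rank = sorted(
    zip(feature_names, forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
# Absolute coefficient size as a simple importance proxy for the linear model.
coef_rank = sorted(
    zip(feature_names, np.abs(logreg.coef_[0])), key=lambda pair: pair[1], reverse=True
)

print("Random Forest ranking:", [name for name, _ in gini_rank])
print("Logistic Regression ranking:", [name for name, _ in coef_rank])
```

Agreement between the two rankings is a useful sanity check; impurity-based importances can be inflated for high-cardinality features, so they are best read alongside another measure.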
1.6.3 Objective 3: To implement a model to predict academic
performance using Decision Tree classifiers
· Use preprocessed training data (features such as
grades, attendance, demographics).
· Implement a Decision Tree classifier using
scikit-learn or a similar ML library.
· Tune hyperparameters such as maximum depth and minimum samples
per split to avoid overfitting.
· Train the model and validate its performance using techniques such as k-fold
cross-validation.
· Use a confusion matrix and tree visualization to
interpret the results and support explainability.
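A minimal sketch of these steps with scikit-learn is given below. It assumes synthetic data with two hypothetical features and an invented pass rule; the hyperparameter grid and the choice of 5 folds are illustrative, not values from the study.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features and synthetic data for illustration only.
feature_names = ["course_avg", "attendance_pct"]
rng = np.random.default_rng(7)
X = np.column_stack([rng.uniform(30, 100, 300), rng.uniform(40, 100, 300)])
# Assumed pass rule for the demo: course average >= 55 and attendance >= 60.
y = ((X[:, 0] >= 55) & (X[:, 1] >= 60)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

# Candidate settings that limit tree growth to reduce overfitting.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 10, 20]}

# 5-fold stratified cross-validation over every grid combination.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
search = GridSearchCV(DecisionTreeClassifier(random_state=7), param_grid, cv=cv, scoring="f1")
search.fit(X_train, y_train)

best_tree = search.best_estimator_
# Rows are true classes (0 = fail, 1 = pass); columns are predicted classes.
cm = confusion_matrix(y_test, best_tree.predict(X_test), labels=[0, 1])

print("Best parameters:", search.best_params_)
print("Confusion matrix:\n", cm)
# Text rendering of the learned decision rules.
print(export_text(best_tree, feature_names=feature_names))
```

The printed rules give advisors a readable account of why a student was flagged, which is the main reason to prefer a shallow tree when two settings score similarly.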
1.7 Significance of the Study
The system offers enormous benefits to the following users:
1. Lecturers/Academic Advisors: The prediction model will help lecturers and tutors identify
weak and strong students so that they can place more emphasis on instruction
and procedures when dealing with weak students, enhancing overall academic
performance. An academic advisor can refer to the prediction results when
advising students who perform poorly in their studies so that preventive
measures can be taken much earlier. In addition, a lecturer can further improve
his/her teaching and learning approach, as well as plan interventions and
support services for weak students.
2. The Institution: Academic performance is an important factor people consider before applying to any
tertiary institution. An institution that is known for producing
low-performing students is at risk of low intakes. A performance
prediction system addresses this need by supporting the early
identification of weak students and helping them focus on their weak areas.
3. Parents/Guardians/Partners: Findings have shown that parents, guardians, and partners affect the
academic performance of students. The study helps to analyze the influence of
family background on students’ performance predictions. Attributes such as the
size of the family, encouragement or motivation from parents, spouses, or siblings, the
highest qualification of the sponsor, and other factors will help determine
which factors affect performance.
1.8 Definition of Terms
- Academic Performance: A measure of a student’s achievement in educational
activities, often represented by grades, GPA, exam scores, or course
completion rates.
- Classification Algorithm: A type of machine learning algorithm used to
categorize student outcomes, such as predicting whether a student will
pass, fail, or drop out.
- Decision Trees: A machine learning algorithm that predicts student
performance by learning decision rules based on input features such as
attendance, scores, and study behavior, structured like a flowchart.
- Dropout Prediction: The process of identifying students at risk of
discontinuing their studies before completion, using predictive models
trained on academic, behavioral, and demographic data.
- Feature Engineering: The process of creating, transforming, or selecting
relevant variables (features) from raw student data to enhance the
performance of predictive models.
- Grade Point Average (GPA): A standardized measure of academic achievement
calculated by averaging the grades received across courses, commonly used
as a performance indicator.
- K-Nearest Neighbors (KNN): A machine learning algorithm that classifies a
student’s likely performance by comparing it to the 'k' most similar
students in the training dataset.
- Learning Management System (LMS): A software platform used by educational
institutions to deliver, track, and manage student learning activities and
data, which can be used as input for predictive models.
- Linear Regression: A statistical technique used to predict continuous
academic outcomes (e.g., GPA or final exam scores) based on one or more
independent variables like study hours or attendance.
- Machine Learning (ML): A subset of artificial intelligence that allows
systems to learn from educational data and make predictions about student
outcomes without being explicitly programmed for each scenario.
- Predictive Analytics: The use of data analysis, statistical models, and
machine learning to make informed predictions about students’ future
academic performance or likelihood of success.
- Student Dataset: A structured collection of data related to students,
including demographic information, attendance, grades, and behavioral
logs, used for training and testing machine learning models.
- Support Vector Machine (SVM): A supervised learning algorithm that
classifies student performance into distinct categories by finding the
optimal boundary that separates different outcome groups.