ABSTRACT
Human beings need to communicate globally, and because languages differ, translation is vital for the effective use of digital services. International marketers, for example, use about ten languages, and Kiswahili is not among them, despite being a national language in most Sub-Saharan African countries. According to The Cambridge Encyclopedia of the English Language, an estimated 9% of Kenyans speak English, a major international language. The other 91% speak either Swahili or their tribal languages and are therefore excluded digitally. Although the vernacular languages are spoken by approximately fifty million people in Kenya, they are resource-scarce from a language-technology point of view. Machine translation models for these low-resourced African languages are scarce, which leaves many Africans without digital inclusion. An English-to-indigenous-language translator should therefore be designed for the digital inclusion of these non-English-speaking communities. In this study, an exploratory methodology, an approach that investigates research questions that have not previously been studied in depth, was used to develop an English-to-Luhya machine translation prototype. The model was developed using an Encoder-Decoder architecture with a hidden layer size of 128 and embeddings of 256 units. Training ran for 50 epochs in batches of 100. The Adam optimizer was used with a constant learning rate of 0.0005 to update the model weights. The model was evaluated using the BLEU score as the main metric, with WER, SER, and TER complementing the results. The model achieved a highest BLEU score of 0.55, just 0.05 short of the median range of 0.6 to 0.7 achieved by other researchers. Compared with similar research on low-resourced languages, it scored modestly but outperformed English-to-Kiswahili translation using Statistical Machine Translation (0.20).
For future work, the key initiative is the creation of a publicly available corpus, which will serve as a catalyst for research in this area. The work could also benefit from audio resources obtained through speech recognition and speech-to-text, since Bukusu is primarily spoken and the lack of standardized writing complicates the creation of clean reference sets and consistent evaluation. The study could further benefit from reference models for Bukusu Named Entity Recognition and Part-of-Speech tagging to improve translation accuracy. Since the structure of Bukusu is similar to that of Swahili, the first focus should be on developing open-source NLP tools for Swahili. Researchers working on Bukusu and other low-resourced East African Bantu languages would then be able to transfer the Swahili models and annotations to their respective languages.
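Two of the complementary metrics named in the abstract can be illustrated with a minimal sketch. The Python below computes word error rate (WER) as the word-level edit distance divided by the reference length, and sentence error rate (SER) as the share of hypotheses that do not exactly match their references. The function names and the placeholder tokens are hypothetical and are not taken from the study's prototype, which may compute these scores differently.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of step i, prev[j] holds the edit distance between the
    # first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def sentence_error_rate(references: List[str], hypotheses: List[str]) -> float:
    """Fraction of hypotheses that differ from their reference in any word."""
    if len(references) != len(hypotheses) or not references:
        raise ValueError("need equally sized, non-empty lists")
    wrong = sum(r.split() != h.split() for r, h in zip(references, hypotheses))
    return wrong / len(references)


# Placeholder tokens stand in for real English-Luhya sentence pairs.
refs = ["a b c d", "e f g"]
hyps = ["a x c", "e f g"]
print(word_error_rate(refs[0], hyps[0]))  # one substitution and one deletion over four words
print(sentence_error_rate(refs, hyps))    # one of the two sentences differs
```

Lower values are better for both metrics, which is the opposite direction from BLEU, so the two kinds of score are best reported side by side rather than averaged.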
TABLE OF CONTENTS
DECLARATION
TABLE OF CONTENTS
TABLE OF FIGURES
TABLE OF TABLES
ABSTRACT
CHAPTER ONE
INTRODUCTION
1.1 Background Information
1.2 Problem Statement
1.3 Research Objectives
1.4 Significance of the Research
1.5 Assumptions and Scope of the Research
CHAPTER TWO
LITERATURE REVIEW
2.1 Definition of Machine Translation
2.2 Machine Translation Approaches
2.3 Issues and Challenges faced with the different Machine Translation Approaches
2.4 Natural Language Processing for Low-resourced Languages
2.5 Existing Translation Models for Low-Resourced Languages
2.6 Evaluation of Machine Translation
2.7 Existing Machine Translation Tools
2.8 Other Authors Related Work and Findings
2.9 Research Gap
2.10 Conceptual Framework
CHAPTER THREE
RESEARCH DESIGN AND METHODOLOGY
3.1 Introduction
3.2 Quantitative approach
3.3 Research Data
3.4 Exploration (Modelling)
3.5 Prototype Design
3.6 Prototype Development
3.7 Evaluation (Testing Methodology)
3.8 Ethical Considerations
CHAPTER FOUR
RESULTS AND DISCUSSIONS
4.1 Exploratory Data Analysis Results
4.2 Model Performance Results
4.3 Prototype Evaluation Results
4.4 Discussions
CHAPTER FIVE
CONCLUSION
5.1 Discussion
5.2 Research Contribution
5.3 Limitations in Research
5.4 Recommendation for Future Work
REFERENCES
TABLE OF FIGURES
Figure 1: Figure showing Rule-based Machine Translation
Figure 2: Figure showing the architecture of Recurrent Neural Networks
Figure 3: Figure showing Bidirectional Neural Network
Figure 4: Figure showing a simple architecture of an Encoder-Decoder Model
Figure 5: Figure showing Google Translate System
Figure 6: Figure showing the Kamusi Project Homepage
Figure 7: Figure showing Unbabel's Integration
Figure 8: Figure showing IBM Watson Language Translator Demo
Figure 9: Figure showing Microsoft Translator for Bing
Figure 10: Figure showing the conceptual design of the research
Figure 11: Figure showing the Exploratory Phase of the Research
Figure 12: Figure showing a mockup of the expected interface
Figure 13: Figure showing the architecture of the prototype
Figure 14: Figure showing the data preprocessing done on the Jupyter Notebook
Figure 15: Figure showing the implementation of Embedded RNN of the Translation system
Figure 16: Figure showing the translation page of the prototype
Figure 17: Figure showing the evaluation page of the prototype
Figure 18: Figure showing the metrics of the Training Phase of the Encoder-Decoder model
Figure 19: Figure showing the metrics of the Training Phase of the Bidirectional RNN model
TABLE OF TABLES
Table 1: BLEU and NIST scores for Bidirectional Machine Translation Task
Table 2: Table showing the requirements for prototype design and implementation
Table 3: Table showing the summary of the research data
Table 4: Table showing the evaluation scores for the first run of the research Exploration phase
Table 5: Table showing the evaluation scores for the final run of the research Exploration phase
Table 6: Table showing sample translation results of the evaluation set
Table 7: Table showing the BLEU score comparison of different translations
CHAPTER ONE
INTRODUCTION
1.1 Background Information
Language is the essential form of human communication in the Information Age and acts as a carrier of information. Nevertheless, language differences have been viewed as an impediment to intercultural contact, particularly in urban and marginalized communities. Translating texts from one language into another quickly and accurately has become difficult (Sirbu, 2015).
With 1.2 billion people, Africa is both the second largest and second-most-populous continent in the world. With between 1500 and 2000 different African languages, it has a diverse language population (Doochin, What Are The Languages Spoken In Africa?, 2019). Due to commerce and intermarriage between various linguistic groups, the continent has a lengthy multilingual past. The Afroasiatic, Nilo-Saharan, Niger-Congo, Khoe, Austronesian, and Indo-European language families all include African languages. Following their independence, many African nations made their colonizers' language their official tongue for use in business, government, and education. However, most African nations continue to support multilingualism through local language promotion and appreciation programs.
About 40 major ethnic groups make up our multicultural nation of Kenya. With so many different languages spoken inside its boundaries, the nation is multilingual due to its unique ethnic makeup. Swahili and English are the two official languages among these (Sawe, 2017). Large corporations, universities, and the government are the main English-speaking environments. For instance, most of the legislation submitted to the National Assembly is written in English. Due to its broad use in trade, commerce, communications, and education, Swahili is regarded as the lingua franca of southeastern Africa. Kiswahili is used primarily in small-scale trade, the media, and educational institutions, with significant ties to urban life and certain vocations (Doochin, What Languages Are Spoken In Kenya?, 2019). Nearly 50 indigenous languages, including Kikuyu, Luhya, Kamba, Somali, Dholuo, Kalenjin, Arabic, Hindustani, and Punjabi, are spoken as Kenya’s vernaculars.
Overcoming the language barrier has become a widespread issue in the global community due to the web's rapid progress and the integration of the world economy. With the democratization of ICT, there is a need for inclusion in providing information, work, and leisure opportunities on the internet. As the web becomes continuously entrenched in the lives of individuals, communities, and commerce, it is more vital than ever to ensure digital literacy for everybody and bridge this digital divide (Sanders, 2020). Language barriers are the most common reason an individual would not be an internet user, especially in rural towns. The need for localization of internet products cannot be met by human translation alone, so the use of machine translation to assist users in finding information has become a fundamental trend. Machine translation is the process of using a computer to translate a natural source language into a different natural target language (Peng, 2013).
African languages that are frequently spoken have been the subject of machine translation efforts. However, most African languages are regarded as low-resource due to the difficulty in obtaining data, the lack of sufficient labeled audio speech, or the lack of parallel translations across the various languages (Cracking the Language Barrier for a Multilingual Africa, 2021). Most translation research for African languages has been done by the Masakhane project, a continent-wide, distributed, online, open-source Natural Language Processing research effort whose work covers machine translation, text-to-speech, document classification, keyword spotting, and sentiment analysis datasets (Masakhane, 2021). Most machine translation efforts in Kenya have been made for Swahili, Kikuyu, and Dholuo through document translation, language interpreting, transcription, and subtitling solutions. These translations have connected businesses to the African market. In the globalized marketplace, translation services have played a vital role across all sectors, such as business, financial, medical, legal, and marketing, thus breaking communication barriers (African Translation Solutions by Translate 4 Africa, 2021).
In this information age, Machine Translation can translate internet content into local languages to facilitate social inclusion and help all people contribute to and benefit from the digital economy and society (Government of Kenya, 2014). This will significantly improve access to digital services for those with limited access: those in rural areas, the elderly, and users of minority languages.
1.2 Problem Statement
Human beings need to communicate globally, and because languages differ, translation is vital for the effective use of digital services. International marketers, for example, use about ten languages, and Kiswahili is not among them, despite being a national language in most Sub-Saharan African countries. According to The Cambridge Encyclopedia of the English Language, 9% of Kenyans speak English, a major international language. The other 91% of Kenyans speak either Swahili or their tribal languages and are excluded digitally. Even though approximately fifty million people speak the vernacular languages in Kenya, the languages are resource-scarce from a language-technology point of view. Machine translation models for these low-resourced African languages are scarce, causing many Africans to lack digital inclusion. An English-to-indigenous-language translator should thus be designed to include these non-English-speaking communities digitally.
1.3 Research Objectives
1.3.1 Overall Objective
To build a machine translation model for translating English text to Luhya text.
1.3.2 Specific Objectives
1. To investigate the machine translation techniques currently applied in translating low-resourced African languages.
2. To investigate the factors affecting the implementation of automatic machine translation of low-resourced African languages.
3. To collect and analyze data for Natural Language Processing in the automatic machine translation model.
4. To develop and validate a model for automatic machine translation from English to Luhya.
1.3.3 Research Questions
1. What are the machine translation techniques being used for low-resourced African languages?
2. What are the factors affecting the implementation of machine translation for African languages?
3. What NLP tasks need to be performed on the language data to build the translation engine?
4. How can machine translation be implemented for English to Luhya translation?
5. What evaluation techniques should be applied to examine the performance of the English to Luhya translation engine?
1.4 Significance of the Research
With global digitization, products and services are getting to the market faster. Language learning, on the other hand, cannot keep up with this pace. As such, it is far easier to label products in the target market's language than to teach the entire market region a new language whenever a new product launches. Using local languages means that users can relate better to a product, because it makes them feel that they have been adequately considered. This project aims to improve localization and internationalization work, and the focus is on Luhya since most popular machine translation engines have focused on European languages and left African languages relatively underrepresented (Okpor, 2014).
Machine translation of local languages serves as a testbed for developing NLP technologies that perform reasonably well despite the low-resource constraint. By creating guidelines and providing training through open educational resources in collaboration with national institutions, such work improves the capacity for the development of open language datasets and language technology applications, and raises the number of digitally accessible vernacular projects that other researchers can use in corpus-based research in African Language Technology (Cracking the Language Barrier for a Multilingual Africa, 2021).
Research and development in Machine Translation to a local Kenyan language enables the digital inclusion of marginalized communities, in line with Kenya's Digital Blueprint towards achieving Vision 2030 under the Social and Economic Pillar (Government of Kenya, 2014).
In addition, the Communications Authority of Kenya requires that at least 60% of the content aired by Kenyan media companies (television and radio) be local (Regulation By The Communications Authority of Kenya, 2018). Implementation of this project will be in line with this directive.
1.5 Assumptions and Scope of the Research
1.5.1 Assumptions
All data to be used in training the model is available and accessible for use in open research. The data to be collected is not proprietary to the source site.
1.5.2 Scope
This research is limited to translation into the Bukusu dialect. All data is textual, and none will be sourced directly from the population. The data collected will not be limited to a specific topic, in order to ensure a good quantity of data and improve translation accuracy. The translation model built will then be applied to translate Government documents to Luhya text.