ABSTRACT
The World Wide Web is a rapidly growing and changing information source, and it is gradually replacing the traditional ways users obtain news. Traditionally, individuals get their news from print media such as newspapers and magazines. The advent of the internet has made things much easier by making digitized news accessible from anywhere in the world, either through news websites or dedicated applications. However, the Web's growth and change rates make the task of finding relevant and recent information harder: users still face the challenge of visiting numerous websites, and memorizing their different URLs, just to stay informed on a specific type of news. The need to develop an intelligent web-based dynamic news aggregator that provides a digital platform for individuals to easily find news on a particular topic in real time therefore becomes imperative. This aim was achieved by developing an algorithm for the news aggregator that serves as the output for crawled syndicated web pages in different categories. The InfoSpiders and incremental web crawling methods were used to develop an algorithm for the web crawler to download syndicated web pages covering different categories of news from different Nigerian news websites. The code was realized in PHP, a scripting language suited to web development that can be embedded into HTML. The web crawler frontier was interfaced with seed URLs so as to identify all the hyperlinks in each page and add them to the list of URLs to visit; this was made possible by applying a stochastic selector and incremental web crawling technology to crawl the entire set of seed URLs. The crawler traversed the web, searched news agencies and returned specific news of interest to the user. The system was deployed and tested using the Apache web server, with a personal computer as the testing machine. To ascertain the accuracy and performance of the developed system, a comparative analysis against existing news aggregators was carried out, and the developed system yielded an accuracy of 93%.
TABLE OF CONTENTS
Cover Page
Title Page i
Declaration ii
Certification iii
Dedication iv
Acknowledgments v
Table of Contents vi
List of Tables xi
List of Figures xii
Abstract xiv
CHAPTER 1: INTRODUCTION 1
1.1 Background of Study 1
1.2 Problem Statement 2
1.3 Aim and Objectives of the Project 3
1.4 Justification 4
1.5 Scope of the Study 5
CHAPTER 2: LITERATURE REVIEW 7
2.1 Historical Background 7
2.2 Web Crawling 8
2.3 Structure of a Web Crawler 9
2.4 Mercator’s Architecture 10
2.4.1 The URL frontier 11
2.4.2 The Http protocol module 12
2.4.3 Rewind input stream 13
2.4.4 Content seen test 14
2.4.5 URL filters 15
2.4.6 Domain name resolution 15
2.4.7 URL seen test 16
2.4.8 Efficient duplicate URL eliminators (DUE) 16
2.5 Crawling Algorithms 17
2.5.1 Naive best-first crawling method 17
2.5.2 Shark-search crawling method 18
2.5.3 Focused crawling method 19
2.5.4 Context focused crawling method 20
2.6 Review of News Aggregation Websites 22
2.6.1 Google news aggregator 22
2.6.1.1 Googlebot web crawler 23
2.6.1.2 Blocking Google from content on your site 24
2.6.1.3 PageRank: Googlebot crawling algorithm 25
2.6.1.4 PageRank crawling algorithm 26
2.6.1.5 Disadvantage of PageRank crawling algorithm 29
2.6.2 Yahoo news aggregator 30
2.6.2.1 Yahoo Slurp web crawler 31
2.6.3 Drudge Report news aggregator 33
2.6.4 HuffPost news aggregator 34
2.6.5 Fark news aggregator 36
2.6.6 Zero Hedge news aggregator 38
2.6.7 Newslookup news aggregator 38
2.6.8 The Daily Beast news aggregator 39
2.6.9 World News (WN) Network news aggregator 41
2.6.10 Newsvine news aggregator 42
2.6.10.1 Seeding 42
2.6.10.2 Articles 43
2.6.10.3 Voting Not Crawling 43
2.6.10.4 Commenting 44
2.6.10.5 Conversation tracker 44
2.7 Aggregator Applications 44
2.8 Literature Gap 48
2.8.1 Specification of the Envisaged System 52
CHAPTER 3: MATERIALS AND METHODS 53
3.1 Materials 53
3.1.1 Personal computer 53
3.1.2 Aptana Studio 53
3.1.3 Bootstrap 3 54
3.1.4 jQuery 55
3.1.5 XAMPP 55
3.1.6 PHP 56
3.1.7 PHP-Crawler 57
3.1.8 Apache 58
3.1.9 System requirements 58
3.2 Design Methodology 58
3.2.1 Waterfall model 59
3.3 Methods of Data Collection 60
3.3.1 Examination of existing algorithms 61
3.3.2 Website research 61
3.3.3 Telephonic interview 61
3.3.4 Library research 61
3.3.5 Survey-based interview 62
3.4 Expectations 62
3.4.1 Analysis of existing system 62
3.4.1.1 Problems of existing systems 63
3.4.2 Analysis of the new system 64
3.4.3 Algorithm for the new system 66
3.4.4 Flow chart of the new system 67
3.4.5 Block diagram of the new system 68
3.5 System Design 69
3.5.1 Initialize the URL frontier 69
3.5.2 Frontier implementation details 70
3.5.3 Making the web crawler an intelligent agent 72
3.5.4 Dynamic News Aggregator 73
3.6 Benefit of the New System 73
CHAPTER 4: RESULTS AND DISCUSSION 74
4.1 Results 74
4.2 Comparative Analysis of Results 84
4.2.1 Throughput optimization 84
4.2.2 Accuracy measure for web crawling algorithms 86
CHAPTER 5: CONCLUSION AND RECOMMENDATIONS 88
5.1 Conclusion 88
5.2 Recommendations 88
5.3 Contribution to Knowledge 89
References 90
Appendices 94
LIST OF TABLES
PAGE
2.1: Result of PageRank of Figure 2.7 28
4.1: Retrieval Time of Various News Aggregators 84
4.2: Number of Relevant pages visited 86
LIST OF FIGURES
PAGE
2.1: Updated Web Crawling picture 9
2.2: Basic Architecture of a Web Crawler 10
2.3: Mercator’s main components. 11
2.4: A context graph 21
2.5: Google news 23
2.6: PageRank crawling algorithm 25
2.7: Six pages, each pointing to each other 27
2.8: Snapshot of Yahoo News. 30
2.9: Snapshot of Drudge Report 33
2.10: Snapshot of HuffPost 35
2.11: Snapshot of Fark news aggregator 37
2.12: Snapshot of Zero Hedge 38
2.13: Snapshot of Newslookup 39
2.14: Snapshot of the Daily Beast 40
2.15: Snapshot of World News (WN) Network 41
2.16: Snapshot of Newsvine aggregator 42
2.17: Module to address problems exhibited by existing crawling technology 49
3.1: Snapshot of XAMPP panel 56
3.2: Sample PHP Syntax 57
3.3: Waterfall model. 60
3.4: Flow of an Existing Web Crawling Architecture 63
3.5: New system flow chart 67
3.6: New System Block Diagram 68
3.7: Frontier Implementation 71
4.1: The Aggregators Home Page 74
4.2: Aggregated Sport News from Vanguard Newspaper 75
4.3: Aggregated Sport News from The Sun Newspaper 76
4.4: Aggregated Sport News from The Nations Newspaper 76
4.5: Aggregated Politics News from Vanguard Newspaper 77
4.6: Aggregated Politics News from The Sun Newspaper 78
4.7: Aggregated Politics News from The Nations Newspaper 78
4.8: Aggregated Entertainment News from Vanguard Newspaper 79
4.9: Aggregated Entertainment News from The Sun Newspaper 80
4.10: Aggregated Entertainment News from The Nations Newspaper 80
4.11: Aggregated Business News from Vanguard Newspaper 81
4.12: Aggregated Business News from The Sun Newspaper 81
4.13: Aggregated Business News from The Nation Newspaper 82
4.14: Detailed News of the Selected News Headline 83
4.15: Percentage Retrieval Time of Various News Aggregators 85
4.16: Number of Relevant pages visited 87
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND OF THE STUDY
In computing, an aggregator is client software or a web application that aggregates syndicated web content in one location for easy viewing. A news aggregator typically uses the Extensible Markup Language (XML) to structure the pieces of information to be aggregated and displays them in a user-friendly interface. Developing an intelligent news aggregator that integrates InfoSpiders and incremental web crawling technology will help gather and distribute content after the appropriate organizing and processing to suit users' needs. It will collect and filter information according to clients' criteria, sparing users the need to consult print media or visit numerous websites for a specific type of news (Alisha, 2009).
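To make this XML-based structuring concrete, the short PHP sketch below reads one syndicated RSS feed with PHP's built-in SimpleXML extension and lists each item's headline and link. The feed URL is a placeholder for illustration only, not an actual source used by the system.

<?php
// Minimal sketch: read one syndicated RSS feed and list its headlines.
// The feed URL is a placeholder, not a real source used by this project.
$feedUrl = 'https://example.com/rss/politics.xml';

$xml = simplexml_load_file($feedUrl);   // parse the XML feed
if ($xml === false) {
    die('Unable to load feed' . PHP_EOL);
}

// Each <item> in the <channel> carries one syndicated story.
foreach ($xml->channel->item as $item) {
    echo $item->title, ' - ', $item->link, PHP_EOL;
}
?>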
The need to provide and update information from different sources in a systematic way, so that a user is regularly kept abreast of the latest news on a chosen topic, be it sports, politics, entertainment or business, is pressing. This process of aggregation will be entirely automatic, using algorithms that carry out contextual analysis and group similar stories together. According to Saul (2002), news aggregation websites began with content selected and entered by humans, while automated selection algorithms were eventually developed to fill the content from a range of automatically selected or manually added sources. This work presents an intelligent aggregator whose platform is user friendly, flexible and fast at scraping news, owing to the integration of the InfoSpiders and incremental web crawling technologies.
According to Rahi (2013), the World Wide Web is a rapidly growing and changing information source, and its growth and change rates make the task of finding relevant and recent information harder. Developing an intelligent news aggregator capable of scraping information from the World Wide Web based on users' interests is therefore a primary concern. According to Amudha (2017), a web crawler starts with a list of Uniform Resource Locators (URLs) to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. In this project, the crawler will archive website headlines on the news aggregator as it crawls the World Wide Web, starting from the seed URLs. The archives are stored in such a way that they can be viewed, read and navigated as they were on the live web.
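As an illustration of this seed-and-frontier cycle, the following PHP sketch fetches pages starting from seed URLs, extracts hyperlinks with a simplified regular expression, and queues unseen links. The helper logic is an assumption for illustration, not the project's actual crawler code.

<?php
// Sketch of a crawl frontier: visit seed URLs, extract hyperlinks,
// and enqueue unseen links. A real crawler would bound the crawl by
// depth or page count and obey politeness policies (robots.txt, delays).
$frontier = ['https://www.vanguardngr.com/', 'https://www.sunnewsonline.com/']; // seeds
$seen     = array_flip($frontier);            // URL-seen test

while (!empty($frontier)) {
    $url  = array_shift($frontier);           // next URL from the frontier
    $html = @file_get_contents($url);         // download the page
    if ($html === false) {
        continue;                             // skip unreachable pages
    }
    // Extract absolute hyperlinks (simplified pattern for illustration).
    preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches);
    foreach ($matches[1] as $link) {
        if (!isset($seen[$link])) {           // add only unseen URLs
            $seen[$link] = true;
            $frontier[]  = $link;
        }
    }
}
?>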
It is becoming essential to crawl the Web in a way that is not only scalable but also efficient. If some reasonable measure of quality or freshness is to be maintained, a crawler must carefully choose at each step which pages to visit next. Producing categorized crawled content and returning high-quality data, fully structured and aligned with users' requirements, is therefore paramount. In this research, data will be fetched via automated web scraping and displayed on the news aggregator. To achieve this, a sound crawling strategy will be implemented. Rather than adopting a single existing crawling method (the naive best-first, shark-search, focused, context-focused, incremental or InfoSpiders crawling methods), two of these methods, InfoSpiders and incremental crawling, will be combined so as to achieve a high-performance, highly optimized web crawling system architecture for information extraction, thereby making the news aggregator distinctive.
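To indicate how the incremental half of this combination behaves, the sketch below keeps a per-URL history record (a content hash and last-visit time) and flags a page for reprocessing only when its content has changed since the last crawl. The function name and storage layout are assumptions for illustration.

<?php
// Sketch of incremental crawling: keep a history record per resource
// and reprocess a page only when its downloaded content has changed.
$history = [];  // url => ['hash' => string, 'lastVisit' => int]

function needsRefresh(string $url, array &$history): bool
{
    $html = @file_get_contents($url);
    if ($html === false) {
        return false;                              // unreachable: leave record as-is
    }
    $hash = md5($html);
    if (isset($history[$url]) && $history[$url]['hash'] === $hash) {
        return false;                              // unchanged since last visit
    }
    $history[$url] = ['hash' => $hash, 'lastVisit' => time()];
    return true;                                   // new or changed: reprocess
}
?>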
1.2 PROBLEM STATEMENT
Traditionally, individuals get their news or information from print media such as newspapers and magazines. The advent of the internet has made things a lot easier, as most of this content has been digitized and can be viewed from anywhere in the world, either through a dedicated application or a website.
Even with this digital improvement, users still face certain issues, such as the number of websites they must visit for a specific type of news. For instance, someone interested in politics would have to visit Vanguardngr.com, Sunnewsonline.com, thenationsonlineng.net and other news-related websites just to view political news.
This creates a problem, as users have to memorize different URLs and visit numerous websites just to view a specific type of news.
To solve the problem enumerated above, this work creates an intelligent dynamic content aggregator integrating InfoSpiders and incremental web crawling technology that houses different categories of news from different news websites, properly ordering and grouping them into their specific categories, which are subsequently served to users on one platform.
1.3 AIM AND OBJECTIVES OF THE PROJECT
The aim of this project is to develop an
Intelligent Web Based Dynamic News Aggregator Integrating InfoSpiders and
Incremental Web Crawling Technology. In order to realize the aim, the following
research objectives are to be achieved:
1. To develop an algorithm for the news aggregator that will serve as the output for crawled syndicated web pages based on different categories.
2. To develop and realize the code in the PHP scripting language.
3. To interface the web crawler frontier with seed URLs, identify all the hyperlinks in each page and add them to the list of URLs to visit.
4. To deploy the developed system on an Apache web server for automatic web search.
5. To analyze the effectiveness of the developed system by carrying out a comparative analysis between the developed system and existing news aggregators.
1.4 JUSTIFICATION
According to Isbell (2010), during the past decade the Internet has become an important news source, and its rise has coincided with declining usage of traditional print media. Though the internet has made news available in an easily accessible digital form, there remains a need to aggregate news from multiple sources and display it on a single platform.
Although other news aggregators exist, their functions, modes of operation and methods still have limitations, creating and highlighting avenues for future work. This suggests that much work remains to be done in the area of news scraping via an improved web crawling algorithm. Existing news aggregators include Google News, Yahoo News, Drudge Report, Huffington Post, Fark, Zero Hedge, Newslookup, The Daily Beast, World News (WN) Network and Newsvine (Belinda, 2009). Some of these, such as Drudge Report, The Daily Beast, Newsvine, World News Network, Zero Hedge and Fark, do not integrate web crawling technologies.
Automated news aggregators with integrated web crawling technology include Google News and Yahoo News; both crawlers use the same web crawling method, the PageRank crawling algorithm. The new system improves upon existing web crawling technology by combining two web crawling algorithms in an intelligent web-based news aggregator, a platform that uses web crawling technology to expand the aggregation of news beyond a particular area of interest. This project will not only remove the need to memorize and visit numerous websites but also implement a new web crawling algorithm.
Furthermore, considering the shortcomings of most existing news aggregators, such as failing to give genuine information, summarizing news poorly, and merely copying and pasting news, it becomes necessary to embark on research such as this to improve on existing news aggregators.
1.5 SCOPE OF THE STUDY
This work covers the development of an intelligent web-based news aggregator using the InfoSpiders and incremental web crawling technologies. It is important to note that this project is not a news site; news sites include punchng.com, naija.ng, Vanguardngr.com, Sunnewsonline.com and thenationsonlineng.net. Instead, this research aggregates news from different authoritative news websites (Vanguardngr.com, Sunnewsonline.com, thenationsonlineng.net). Materials such as a personal computer, Aptana Studio, Bootstrap 3, jQuery, XAMPP, PHP, PHP-Crawler and the Apache web server were deployed; this combination fosters ease, simplicity and compatibility during implementation. The design methodology used in this work is the waterfall model, chosen to implement the project in a chronological flow; it is a flow-based model that suits the step-by-step implementation of the new system.
Based on this, the rigidity of the waterfall model makes it the right method to employ, saving a significant amount of time. Unlike the PageRank crawling method used in Google News and the link aggregation used in Drudge Report, Huffington Post, Fark, Zero Hedge, Newslookup, The Daily Beast, World News (WN) Network and Newsvine, the new system employs a stochastic selector, via the InfoSpiders crawling method, to pick links in the frontier by category (sports, politics, entertainment or business) and to learn link estimates via a neural network, while keeping an intelligently controlled history record for each resource using the incremental crawling method.
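The stochastic selector can be pictured as a weighted random draw over the frontier: each queued link carries a relevance estimate for the requested category, and a link is chosen with probability proportional to that estimate. The PHP sketch below uses fixed illustrative scores; in InfoSpiders these estimates would be learned adaptively, so this is a sketch of the selection step only, not the project's code.

<?php
// Sketch of a stochastic selector: draw one frontier link with
// probability proportional to its estimated category relevance.
$frontier = [                                   // illustrative scores only
    'https://example.com/sports/a'   => 0.6,
    'https://example.com/politics/b' => 0.1,
    'https://example.com/sports/c'   => 0.3,
];

function pickLink(array $frontier): string
{
    $total = array_sum($frontier);              // normalizing constant
    $r     = mt_rand() / mt_getrandmax() * $total;
    foreach ($frontier as $url => $score) {
        $r -= $score;
        if ($r <= 0) {
            return $url;                        // chosen with probability score/total
        }
    }
    return array_key_last($frontier);           // floating-point fallback
}

echo pickLink($frontier), PHP_EOL;
?>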
This project was developed with web-based technologies and can be accessed online using any internet-enabled device. Public users will be able to access the web server to search for news over the internet.