DEVELOPMENT OF AN INTELLIGENT WEB BASED DYNAMIC NEWS AGGREGATOR INTEGRATING INFOSPIDERS AND INCREMENTAL WEB CRAWLING TECHNOLOGY


ABSTRACT

The World Wide Web is a rapidly growing and changing information source, and it is gradually replacing the traditional ways users obtain news. Traditionally, individuals got their news from print media such as newspapers and magazines. The advent of the internet has made access far easier by making digitized news available from anywhere in the world, through news websites or dedicated applications. However, the Web's growth and rate of change make the task of finding relevant and recent information harder: users must memorize different URLs and visit numerous websites just to view a specific type of news. The need therefore arises to develop an intelligent web-based dynamic news aggregator that provides a single digital platform on which individuals can easily find news on a particular topic in real time. This aim was achieved by developing an algorithm for the news aggregator that serves as the output for crawled syndicated web pages in different categories. The InfoSpiders and incremental web crawling methods were combined to develop an algorithm for the web crawler, which downloads syndicated web pages covering different categories of news from different Nigerian news websites. The code was realized in PHP, a scripting language suited to web development that can be embedded into HTML. The web crawler frontier was interfaced with seed URLs to identify all the hyperlinks in each page and add them to the list of URLs to visit; this was made possible by applying a stochastic selector and incremental web crawling technology to crawl the entire set of seed URLs. The crawler traversed the web, searched news agencies and returned the specific news of interest to the user. The system was deployed and tested using the Apache web server, with a personal computer as the testing machine. To ascertain the accuracy and performance of the developed system, a comparative analysis against existing news aggregators was carried out, and the developed system yielded an accuracy of 93%.





TABLE OF CONTENTS

Cover Page
Title Page
Declaration
Certification
Dedication
Acknowledgments
Table of Contents
List of Tables
List of Figures
Abstract

CHAPTER 1: INTRODUCTION
1.1       Background of the Study
1.2       Problem Statement
1.3       Aim and Objectives of the Project
1.4       Justification
1.5       Scope of the Study

CHAPTER 2: LITERATURE REVIEW
2.1       Historical Background
2.2       Web Crawling
2.3       Structure of a Web Crawler
2.4       Mercator's Architecture
2.4.1    The URL frontier
2.4.2    The HTTP protocol module
2.4.3    Rewind input stream
2.4.4    Content-seen test
2.4.5    URL filters
2.4.6    Domain name resolution
2.4.7    URL-seen test
2.4.8    Efficient duplicate URL eliminators (DUE)
2.5       Crawling Algorithms
2.5.1    Naive best-first crawling method
2.5.2    Shark-search crawling method
2.5.3    Focused crawling method
2.5.4    Context focused crawling method
2.6       Review of News Aggregation Websites
2.6.1    Google News aggregator
2.6.1.1 Googlebot web crawler
2.6.1.2 Blocking Google from content on your site
2.6.1.3 PageRank: Googlebot's crawling algorithm
2.6.1.4 PageRank crawling algorithm
2.6.1.5 Disadvantages of the PageRank crawling algorithm
2.6.2    Yahoo News aggregator
2.6.2.1 Yahoo Slurp web crawler
2.6.3    Drudge Report news aggregator
2.6.4    HuffPost news aggregator
2.6.5    Fark news aggregator
2.6.6    Zero Hedge news aggregator
2.6.7    Newslookup news aggregator
2.6.8    The Daily Beast news aggregator
2.6.9    World News (WN) Network news aggregator
2.6.10  Newsvine news aggregator
2.6.10.1 Seeding
2.6.10.2 Articles
2.6.10.3 Voting, not crawling
2.6.10.4 Commenting
2.6.10.5 Conversation tracker
2.7       Aggregator Applications
2.8       Literature Gap
2.8.1    Specification of the Envisaged System

CHAPTER 3: MATERIALS AND METHODS
3.1       Materials
3.1.1    Personal computer
3.1.2    Aptana Studio
3.1.3    Bootstrap 3
3.1.4    jQuery
3.1.5    XAMPP
3.1.6    PHP
3.1.7    PHP-Crawler
3.1.8    Apache
3.1.9    System requirements
3.2       Design Methodology
3.2.1    Waterfall model
3.3       Methods of Data Collection
3.3.1    Examination of existing algorithms
3.3.2    Website research
3.3.3    Telephone interviews
3.3.4    Library research
3.3.5    Survey-based interviews
3.4       Expectations
3.4.1    Analysis of the existing system
3.4.1.1 Problems of existing systems
3.4.2    Analysis of the new system
3.4.3    Algorithm for the new system
3.4.4    Flow chart of the new system
3.4.5    Block diagram of the new system
3.5       System Design
3.5.1    Initializing the URL frontier
3.5.2    Frontier implementation details
3.5.3    Making the web crawler an intelligent agent
3.5.4    Dynamic news aggregator
3.6       Benefits of the New System

CHAPTER 4: RESULTS AND DISCUSSION
4.1       Results
4.2       Comparative Analysis of Results
4.2.1    Throughput optimization
4.2.2    Accuracy measure for web crawling algorithms

CHAPTER 5: CONCLUSION AND RECOMMENDATIONS
5.1       Conclusion
5.2       Recommendations
5.3       Contribution to Knowledge

References
Appendices


 

LIST OF TABLES

2.1:      Result of PageRank of Figure 2.7
4.1:      Retrieval Time of Various News Aggregators
4.2:      Number of Relevant Pages Visited

LIST OF FIGURES

2.1:      Updated Web Crawling Picture
2.2:      Basic Architecture of a Web Crawler
2.3:      Mercator's Main Components
2.4:      A Context Graph
2.5:      Google News
2.6:      PageRank Crawling Algorithm
2.7:      Six Pages, Each Pointing to Each Other
2.8:      Snapshot of Yahoo News
2.9:      Snapshot of Drudge Report
2.10:    Snapshot of HuffPost
2.11:    Snapshot of Fark News Aggregator
2.12:    Snapshot of Zero Hedge
2.13:    Snapshot of Newslookup
2.14:    Snapshot of The Daily Beast
2.15:    Snapshot of World News (WN) Network
2.16:    Snapshot of Newsvine Aggregator
2.17:    Module to Address Problems Exhibited by Existing Crawling Technology
3.1:      Snapshot of XAMPP Panel
3.2:      Sample PHP Syntax
3.3:      Waterfall Model
3.4:      Flow of an Existing Web Crawling Architecture
3.5:      New System Flow Chart
3.6:      New System Block Diagram
3.7:      Frontier Implementation
4.1:      The Aggregator's Home Page
4.2:      Aggregated Sport News from Vanguard Newspaper
4.3:      Aggregated Sport News from The Sun Newspaper
4.4:      Aggregated Sport News from The Nation Newspaper
4.5:      Aggregated Politics News from Vanguard Newspaper
4.6:      Aggregated Politics News from The Sun Newspaper
4.7:      Aggregated Politics News from The Nation Newspaper
4.8:      Aggregated Entertainment News from Vanguard Newspaper
4.9:      Aggregated Entertainment News from The Sun Newspaper
4.10:    Aggregated Entertainment News from The Nation Newspaper
4.11:    Aggregated Business News from Vanguard Newspaper
4.12:    Aggregated Business News from The Sun Newspaper
4.13:    Aggregated Business News from The Nation Newspaper
4.14:    Detailed News of the Selected News Headline
4.15:    Percentage Retrieval Time of Various News Aggregators
4.16:    Number of Relevant Pages Visited

CHAPTER 1

INTRODUCTION


1.1       BACKGROUND OF THE STUDY

In computing, an aggregator is client software or a web application that gathers syndicated web content in one location for easy viewing. Basically, a news aggregator uses the Extensible Markup Language (XML) to structure the pieces of information to be aggregated and displays them in a user-friendly interface. Developing an intelligent news aggregator integrating InfoSpiders and incremental web crawling technology will help gather and distribute content after the appropriate organizing and processing to suit users' needs. It will collect and curate information according to clients' criteria, sparing users the print media and the numerous websites they would otherwise have to visit for a specific type of news (Alisha, 2009).
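As a hedged illustration of this structuring step, the PHP sketch below reads a syndicated RSS (XML) feed with the standard SimpleXML extension and prints its items; the feed URL is a placeholder, not one of the sources actually crawled in this work.

<?php
// Minimal sketch: read a syndicated RSS (XML) feed and extract its items.
// The feed URL below is a placeholder, not one used by the actual system.
$xml = simplexml_load_file('https://example.com/news/rss.xml');
if ($xml === false) {
    die('Could not load or parse the feed.');
}
foreach ($xml->channel->item as $item) {
    // Each <item> carries the headline, link and publication date.
    echo $item->title, "\n";
    echo $item->link, "\n";
    echo $item->pubDate, "\n\n";
}
?>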

The need to provide and update information from different sources in a systematic way, so that a user is regularly updated with the latest news on a chosen topic, be it sports, politics, entertainment or business, is very important. This process of aggregation will be entirely automatic, using algorithms that carry out contextual analysis and group similar stories together. According to Saul (2002), news aggregation websites began with content selected and entered by humans, while automated selection algorithms were eventually developed to fill the content from a range of automatically selected or manually added sources. This work presents an intelligent aggregator whose platform is user friendly, flexible and fast at scraping news, as a result of the integration of InfoSpiders and incremental web crawling technology.
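One simple way such grouping could work is to measure word overlap between headlines. The PHP sketch below uses a Jaccard similarity with an illustrative threshold of 0.4; both the measure and the threshold are assumptions for illustration, not the exact contextual analysis implemented in this work.

<?php
// Sketch: group headlines whose word sets overlap strongly (Jaccard similarity).
// The threshold and sample headlines are illustrative choices.
function jaccard(string $a, string $b): float {
    $wa = array_unique(str_word_count(strtolower($a), 1));
    $wb = array_unique(str_word_count(strtolower($b), 1));
    $inter = count(array_intersect($wa, $wb));
    $union = count(array_unique(array_merge($wa, $wb)));
    return $union > 0 ? $inter / $union : 0.0;
}

$headlines = [
    'Super Eagles win AFCON qualifier',
    'AFCON qualifier: Super Eagles record a win',
    'Senate debates new budget proposal',
];
$groups = [];
foreach ($headlines as $h) {
    $placed = false;
    foreach ($groups as $i => $g) {
        // Attach the headline to the first group it resembles closely enough.
        if (jaccard($h, $g[0]) > 0.4) {
            $groups[$i][] = $h;
            $placed = true;
            break;
        }
    }
    if (!$placed) {
        $groups[] = [$h];  // start a new story group
    }
}
print_r($groups);
?>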

According to Rahi (2013), the World Wide Web is a rapidly growing and changing information source, and its growth and change rates make the task of finding relevant and recent information harder. Developing an intelligent news aggregator with the capacity to scrape information from the World Wide Web based on users' interests is therefore a primary concern. According to Amudha (2017), a web crawler starts with a list of Uniform Resource Locators (URLs) to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. In this project, the crawler archives website headlines on the news aggregator as it crawls the World Wide Web, starting from the seed URLs. The archives are stored in such a way that they can be viewed, read and navigated as they appeared on the live web.
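The seed-and-frontier cycle just described can be sketched in a few lines of PHP using the standard DOM extension. The seed URL, the page limit and the omission of relative-URL resolution are simplifications for illustration; the actual frontier details are given in Chapter 3.

<?php
// Sketch of the seed/frontier cycle: visit a URL, collect its hyperlinks,
// and append them to the frontier for later visits.
$frontier = ['https://example.com/'];   // seed URLs (placeholder)
$visited  = [];
$limit    = 50;                         // stop after this many pages

while ($frontier && count($visited) < $limit) {
    $url = array_shift($frontier);
    if (isset($visited[$url])) continue;   // URL-seen test
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) continue;

    $doc = new DOMDocument();
    @$doc->loadHTML($html);                // suppress warnings on messy HTML
    foreach ($doc->getElementsByTagName('a') as $a) {
        // Every hyperlink found on the page joins the crawl frontier.
        // (Relative URLs would need resolving against $url; omitted here.)
        $href = $a->getAttribute('href');
        if ($href && !isset($visited[$href])) {
            $frontier[] = $href;
        }
    }
}
?>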

It is becoming essential to crawl the Web in a way that is not only scalable but also efficient. If some reasonable measure of quality or freshness is to be maintained, a crawler must carefully choose at each step which pages to visit next.

Producing categorized crawled content and returning high-quality data that is fully structured and aligned with users' requirements is therefore paramount. In this research, data will be fetched via automated web scraping and displayed on the news aggregator. To achieve this, a good crawling strategy will be implemented. Rather than relying on a single existing crawling method, such as the naive best-first, shark-search, focused, context focused, incremental or InfoSpiders crawling methods, two of these methods (InfoSpiders and incremental crawling) will be combined, so as to obtain a high-performance, highly optimized web crawling architecture for information extraction, thereby making the news aggregator unique.
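To make the planned combination concrete, the sketch below shows a stochastic link selector of the kind InfoSpiders employs: each candidate link carries a relevance score, and the next link is drawn with probability proportional to exp(score / T). The URLs, scores and temperature T are illustrative assumptions, not values from the final system.

<?php
// Sketch of a stochastic link selector in the spirit of InfoSpiders:
// links are picked with probability proportional to exp(score / T),
// so better-scoring links are favoured without being chosen greedily.
function pickLink(array $scoredLinks, float $T = 0.5): string {
    $weights = array_map(function ($s) use ($T) {
        return exp($s / $T);
    }, $scoredLinks);
    $total = array_sum($weights);
    $r = mt_rand() / mt_getrandmax() * $total;  // random point in [0, total]
    foreach ($weights as $url => $w) {
        $r -= $w;
        if ($r <= 0) return $url;
    }
    $keys = array_keys($weights);
    return end($keys);  // fallback for floating-point edge cases
}

// Illustrative candidates with hypothetical relevance scores.
$candidates = [
    'https://example.com/politics/story1' => 0.9,
    'https://example.com/sports/story2'   => 0.6,
    'https://example.com/ads/banner'      => 0.1,
];
echo pickLink($candidates), "\n";
?>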


1.2       PROBLEM STATEMENT

Traditionally, individuals get their news or information from print media such as newspapers and magazines. The advent of the internet has made things a lot easier, as most of this content has been digitized and can be viewed from anywhere in the world, either through a dedicated application or a website.

Even with this digital improvement, users still face the problem of having numerous websites to visit for a specific type of news. For instance, someone interested in politics would have to visit Vanguardngr.com, Sunnewsonline.com, thenationsonlineng.net and other news-related websites just to view news on politics.

This creates a problem, as users have to memorize different URLs and visit numerous websites just to view a specific type of news.

To solve the problem stated above, this work sets out to create an intelligent dynamic content aggregator integrating InfoSpiders and incremental web crawling technology that houses different categories of news from different news websites, properly ordering and grouping them into their specific categories and serving them to users on one platform.

1.3       AIM AND OBJECTIVES OF THE PROJECT

The aim of this project is to develop an Intelligent Web Based Dynamic News Aggregator Integrating InfoSpiders and Incremental Web Crawling Technology. In order to realize the aim, the following research objectives are to be achieved:

1.        To develop an algorithm for the news aggregator that will serve as the output for crawled syndicated web pages based on different categories.

2.        To develop and realize the code in PHP scripting language.

3.        To interface the web crawler frontier with seed URLs, identify all the hyperlinks in the page and add them to the list of URLs to visit.

4.        To deploy the developed system on an Apache web server for automatic web search.

5.        To analyze the effectiveness of the system developed by carrying out a comparative analysis between the developed system and existing news aggregators.

1.4       JUSTIFICATION

According to Isbell (2010), during the past decade the internet has become an important news source, and its rise has coincided with declining usage of traditional print media. Though the internet has made news available in a digital form that can easily be accessed, there is a need to aggregate news from multiple sources and display it on a single platform.

Although other forms of news aggregators exist, each with its own functions, mode of operation and methods, they still have limitations, creating and highlighting avenues for future work. This implies that much work remains to be done in the area of news scraping via an improved web crawling algorithm. Existing news aggregators include Google News, Yahoo News, Drudge Report, Huffington Post, Fark, Zero Hedge, Newslookup, The Daily Beast, World News (WN) Network and Newsvine (Belinda, 2009). Some of these, such as Drudge Report, The Daily Beast, Newsvine, World News Network, Zero Hedge and Fark, do not integrate web crawling technologies.

Other automated news aggregators with integrated web crawling technology include Google News and Yahoo News. Both use the same web crawling method, the PageRank crawling algorithm. The new system improves upon existing web crawling technology by combining two web crawling algorithms into an intelligent web-based news aggregator, a platform that uses web crawling technology to expand the aggregation of news beyond a particular area of interest. This project will not only remove the need to memorize and visit numerous websites, but also implement a new web crawling algorithm.

Furthermore, considering the shortcomings of most existing news aggregators, such as failure to guarantee genuine information, poor summarization of news, and outright copying and pasting of news, it becomes necessary to embark on a research work such as this to improve on existing news aggregators.

1.5       SCOPE OF THE STUDY

This work covers the development of an intelligent web-based news aggregator using InfoSpiders and incremental web crawling technology. It is important to note that this project is not a news site; news sites include punchng.com, naija.ng, Vanguardngr.com, Sunnewsonline.com and thenationsonlineng.net. Instead, this research aggregates news from different authoritative news websites (Vanguardngr.com, Sunnewsonline.com, thenationsonlineng.net). The materials deployed in this research include a personal computer, Aptana Studio, Bootstrap 3, jQuery, XAMPP, PHP, PHP-Crawler and the Apache web server; this combination fosters ease, simplicity and compatibility during implementation. The design methodology used in this work is the waterfall model, a flow-based model that implements the project in a chronological sequence of stages and thus matches the conceptualized requirements of the project.

Therefore, the waterfall model is the right method to employ here, because its rigidity enforces that chronology and saves a significant amount of time. Unlike the PageRank crawling method used in Google News and the link aggregation used in Drudge Report, Huffington Post, Fark, Zero Hedge, Newslookup, The Daily Beast, World News (WN) Network and Newsvine, the new system employs a stochastic selector, via the InfoSpiders crawling method, to pick links in the frontier by category (sports, politics, entertainment or business) and to learn link estimates via a neural network, while keeping an intelligently controlled record of each resource's history using the incremental crawling method.
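A hedged sketch of the incremental side follows: a record is kept for each resource (last fetch time, content hash and revisit interval), a page is refetched only when its interval has elapsed, and the interval is stretched whenever the content proves unchanged. The interval values are illustrative assumptions, not those of the actual system.

<?php
// Sketch of incremental crawling: a per-URL record stores the last fetch time,
// a content hash and a revisit interval. Pages are refetched only when due,
// and the interval doubles when the content proves unchanged.
$records = [];  // url => ['last' => int, 'hash' => string, 'interval' => int]

function dueForRevisit(array $records, string $url, int $now): bool {
    if (!isset($records[$url])) return true;  // never fetched before
    $r = $records[$url];
    return $now - $r['last'] >= $r['interval'];
}

function updateRecord(array &$records, string $url, string $body, int $now): void {
    $hash = sha1($body);
    if (isset($records[$url]) && $records[$url]['hash'] === $hash) {
        // Unchanged since last visit: back off, revisit less often (cap: 1 day).
        $records[$url]['interval'] = min($records[$url]['interval'] * 2, 86400);
    } else {
        // New or changed content: revisit soon (illustrative 15 minutes).
        $records[$url] = ['hash' => $hash, 'interval' => 900];
    }
    $records[$url]['last'] = $now;
    $records[$url]['hash'] = $hash;
}
?>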

This project was developed with web-based technologies and can be accessed online over the internet using any internet-enabled device. Public users will be able to access the web server and search through it over the internet.

