ABSTRACT
This work is concerned with the development of an information retrieval system: a system that searches a large collection of documents for those relevant to the user’s demand (query). In the end, the results are indexed according to the query and the documents retrieved. Several retrieval models have been researched over time, including the vector space, Boolean and probabilistic models.
This research work, “A fuzzy logic system for archiving purposes”, was developed using the concept of fuzzy logic introduced by Professor Lotfi A. Zadeh, with the aim of enhancing the precision of information retrieval in archives. Most information retrieval systems rely on retrieval methods such as weight matching or probability; this work instead uses the concepts of membership functions and fuzzy set theory.
This method was selected because archives usually contain large numbers of documents, and the user may not know exactly what he or she wants to retrieve. A tool was therefore designed that allows the user to enter only part of the name of the file or document being sought.
The results were then compared with those of other search and matching systems: the Lucene App, which is developed in Java and does not use fuzzy logic; the Rubens App, which uses fuzzy logic; and Doc Fetcher, a tool retrieved from the internet for searching files and documents. The model proposed in this work proved far more effective than these systems, some of which produced congested results or none at all.
TABLE OF CONTENTS
CHAPTER ONE
1.0 INTRODUCTION
1.1 PROBLEM DEFINITION
1.2 AIMS & OBJECTIVES
1.3 RESEARCH METHODOLOGY & DESIGN
1.4 SCOPE OF STUDY
1.5 SIGNIFICANCE OF STUDY
1.6 DEFINITION OF TERMS
CHAPTER TWO
2.0 LITERATURE REVIEW
2.1 OVERVIEW
2.2 BACKGROUND OF FUZZY LOGIC
2.3 EARLIER MODELS AND PREVIOUS PROPOSITIONS
2.4 EASTERN vs. WESTERN PERSPECTIVE
2.5 CONCLUSION
CHAPTER THREE
3.0 RESEARCH METHODOLOGY
3.1 ABSTRACT
3.2 OTHER INFORMATION RETRIEVAL MODELS
3.3 ANALYSIS OF THE PROPOSED MODEL
3.4 MATHEMATICAL REPRESENTATION OF THE MODEL
3.5 ALGORITHM
3.6 FLOWCHART
CHAPTER FOUR
4.0 DISCUSSION AND FINDINGS
4.1 FINDING THE BEST MEMBERSHIP FUNCTION
4.2 INDEXING DOCUMENTS ACCORDING TO USER QUERY
4.3 FINDING RELEVANCE LEVEL OF DOCUMENTS
4.4 SELECTING THE BEST DEFUZZIFICATION METHOD
4.5 LANGUAGE CHOICE FOR THE EXPERIMENT
4.6 SYSTEM PROCESS AND CONFIGURATION
CHAPTER FIVE
5.0 SUMMARY AND CONCLUSION
5.1 FUTURE WORK
REFERENCES
CHAPTER ONE
1.0 INTRODUCTION
Logic, in its literal sense, is the ability of a system to make a rational decision; it can be regarded as the theory of reasoning in decision making. Mathematically, classical logic generates one of two results, which can be represented as TRUE or FALSE, 0 or 1, ON or OFF, or any other applicable pair. This concept is referred to as Boolean logic.
Unfortunately, Boolean logic has its limitations. Because its values are restricted to the set {0, 1}, a condition can only be either true or false; in this sense Boolean logic is too precise. For example, Boolean logic cannot differentiate between something that is “good” and something that is “very good”. Fuzzy logic addresses this limitation.
Fuzzy logic is a branch of logical systems and artificial intelligence. Although it has been studied since the 1920s as infinite-valued logic, notably by Łukasiewicz and Tarski, the concept was fully developed in 1965 by Lotfi A. Zadeh in his seminal work on fuzzy set theory. Fuzzy logic is a kind of logic that allows for imprecise or ambiguous answers to questions, forming the basis of computer programming designed to mimic human intelligence (Microsoft Encarta Encyclopedia, 2009). Unlike Boolean logic, fuzzy logic extends the set of truth values to the interval [0.0, 1.0] and applies a membership function to each of the elements contained in a set.
From the above, it can be seen that fuzzy logic is more complex and less rigid than Boolean logic, giving a wider range of results for a condition. Rather than merely producing true or false, fuzzy logic can produce very true, true, false or very false. This concept is regarded as the degree of truth, where 0.0 represents absolute falseness and 1.0 represents absolute truth.
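To make the contrast concrete, the following minimal sketch compares a crisp (Boolean) test with a fuzzy membership function for the notion of "good". The 0 to 100 score scale, the crisp cut-off of 50 and the linear ramp between 40 and 90 are illustrative assumptions, not values taken from this work.

```python
def crisp_is_good(score: float) -> int:
    """Boolean view: a score is either good (1) or not good (0)."""
    # Hypothetical cut-off of 50 on an assumed 0-100 scale.
    return 1 if score >= 50 else 0


def fuzzy_good(score: float) -> float:
    """Fuzzy view: degree of membership in the set 'good', in [0.0, 1.0]."""
    # Hypothetical ramp: 0.0 at or below 40, 1.0 at or above 90, linear between.
    if score <= 40:
        return 0.0
    if score >= 90:
        return 1.0
    return (score - 40) / 50


for score in (60, 85):
    print(score, crisp_is_good(score), fuzzy_good(score))
```

Under these assumptions both scores are simply "good" in the Boolean view, while the fuzzy view assigns the higher score a larger degree of truth, which is how "good" and "very good" can be told apart.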
Before going deeper into fuzzy logic, the concept of defuzzification should be introduced. Defuzzification is the process of producing a quantifiable result in fuzzy logic, given fuzzy sets and corresponding membership degrees. It is typically needed in fuzzy control systems, which have a number of rules that transform a number of variables into a fuzzy result; that is, the result is described in terms of membership in fuzzy sets. For example, rules designed to decide how much pressure to apply might result in "Decrease Pressure (15%), Maintain Pressure (34%), and Increase Pressure (72%)". Defuzzification interprets the membership degrees of the fuzzy sets as a specific decision or real value.
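As an illustration, the pressure example can be defuzzified with a weighted average of representative output values, which is one common method among several and is not necessarily the one adopted in this work. The representative pressure changes of -10, 0 and +10 units are hypothetical and chosen only for this sketch.

```python
from typing import Dict


def defuzzify_weighted_average(
    memberships: Dict[str, float], representative_values: Dict[str, float]
) -> float:
    """Combine membership degrees into one crisp value by weighted average."""
    total = sum(memberships.values())
    if total == 0:
        raise ValueError("at least one membership degree must be non-zero")
    weighted = sum(
        degree * representative_values[label]
        for label, degree in memberships.items()
    )
    return weighted / total


# Membership degrees from the pressure example in the text.
memberships = {"decrease": 0.15, "maintain": 0.34, "increase": 0.72}
# Hypothetical crisp output associated with each fuzzy set (pressure change).
pressure_change = {"decrease": -10.0, "maintain": 0.0, "increase": 10.0}

crisp_change = defuzzify_weighted_average(memberships, pressure_change)
print(round(crisp_change, 2))
```

With these assumed values the result is positive, so the crisp decision is a moderate increase in pressure, reflecting that "Increase Pressure" has the largest membership degree.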
Fuzzy set theory defines fuzzy operations on fuzzy sets. It models human decision making by using levels of possibility across a number of uncertain (fuzzy) categories. Fuzzy logic therefore uses IF-THEN-ELSE constructs of the form:
IF variable IS property THEN action
The Boolean operators AND, OR and NOT also exist in fuzzy logic, where they are usually defined as MINIMUM, MAXIMUM and COMPLEMENT respectively. They are also referred to as the Zadeh operators. These operators are defined as:
- AND:
If $X_a$ is the degree of membership in set a for one measurable variable, and $X_b$ is the degree of membership in set b for another measurable variable, then the fuzzy AND is:
$X_{a \text{ and } b} = X_a \wedge X_b = \min(X_a, X_b)$
- OR:
If $X_a$ is the degree of membership in set a for one measurable variable, and $X_b$ is the degree of membership in set b for another measurable variable, then the fuzzy OR is:
$X_{a \text{ or } b} = X_a \vee X_b = \max(X_a, X_b)$
- NOT:
For the degree of membership $X_a$ in set a, the fuzzy NOT is:
$X_{\text{not } a} = \neg X_a = 1 - X_a$
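The three Zadeh operators translate directly into code. The sketch below is a minimal illustration; the function names are chosen here for readability and do not come from this work.

```python
def fuzzy_and(xa: float, xb: float) -> float:
    """Zadeh AND: the minimum of the two membership degrees."""
    return min(xa, xb)


def fuzzy_or(xa: float, xb: float) -> float:
    """Zadeh OR: the maximum of the two membership degrees."""
    return max(xa, xb)


def fuzzy_not(xa: float) -> float:
    """Zadeh NOT: the complement of the membership degree."""
    return 1.0 - xa


xa, xb = 0.3, 0.8
print(fuzzy_and(xa, xb), fuzzy_or(xa, xb), fuzzy_not(xa))
```

When the inputs are restricted to 0 and 1, these functions reproduce the ordinary Boolean truth tables, so Boolean logic is a special case of the fuzzy operators.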
Fuzzy logic has been applied in many areas, including medicine, engineering equipment, databases and archives. The application of fuzzy logic in archives is a branch of information retrieval.
Archiving is a process of compressing large files or data for long-term storage. Archived data usually consists of compressed files with extensions such as .zip or .rar. Archives mostly contain very old files that are not needed for daily processing but only for reference purposes. An archive is a collection of records containing primary source documents accumulated over the lifetime of an individual or organization.
Archiving has many advantages, such as improved performance, freed storage space and reduced maintenance costs. Despite these advantages, organizations cannot archive as they please: data may need to remain in the online database for a certain period before it is archived in order to meet legal and government requirements. Further advantages of archiving include:
- An efficient data archiving process can be far more cost effective than the traditional method of simply adding more storage (disks) and servers.
- Data archives can be used to retrieve information at a later stage if a misdemeanor or criminal act is suspected. This has become particularly important in recent years due to incidents of criminal activity, such as drug dealing using companies’ computer resources, and even issues around terrorist activities.
- Data archiving systems can compress information, thereby reducing the storage requirements of an organization.
- Data or content archiving systems may automatically ensure that documents or records are not duplicated. Replication of the same information can be a massive overhead on an organization’s resources.
- Mitigation of regulatory breaches: implementing a data archiving system minimizes the risk of being in breach of key codes of practice and other legislation.
Archived data can be made available upon request. Traditionally, archived data has to be reloaded into the online database before it can be accessed. With NetWeaver 2004s, however, a new method of archiving called NearLine Storage came into existence. NearLine Storage acts as an intermediate solution between traditional archiving and an online database: it allows access to the archived data without reloading it into the online database. There are two types of archives:
- Online archiving: the archive system is physically attached to an organization’s network at all times. It is efficient, allows fast access to archived material, and allows the archiving process to be automated.
- Offline archiving: an IT manager has to archive information from a computer network and then physically move that information to a separate system for retention. The drawback is the time and labor required to complete this task; moreover, if someone needs to access archived data, the whole procedure has to be repeated in reverse.
1.1 PROBLEM DEFINITION
As stated earlier, fuzzy logic for archiving purposes is a branch of information retrieval. In general, we are faced with the problem of selecting documentary information from storage in response to search questions (G. Klir et al, 1995). Since archiving is a very compact way of storing data that reduces the problem of disk and space management, we shall be concerned with the storage, representation, organization and access of information items. The following points elaborate on the problems likely to be encountered with fuzzy logic in archives:
- Although memory wastage is not a major concern in an archiving system, archival storage capacity always is, since archived data is generally immutable and cannot be deleted until its retention period expires. This requires careful capacity management to ensure that the archive does not run out of space.
- Archives can literally contain hundreds of gigabytes of unique data, making the location of files tedious and time consuming. A powerful indexing and search capability is therefore required.
- Data duplication can be a serious obstacle in archives: redundant data may remain in the archive longer than expected and can lead to data inconsistency.
- The retrieved documents have to be ranked in order of their significance with respect to the user query.
- There is no means of determining the degree of usefulness of a document in an archive.
1.2 AIMS & OBJECTIVES
Given the difficulties encountered in maintaining archives, and the inability to classify documents properly by their level of significance and membership, the aims of this research are:
- To soften the matching mechanism into partial matching, which computes the degree of relevance of each document to the user query on the basis of the membership values of the query terms in the document representations.
- To represent data properly so that it is clear which data belongs to which set (archive), using fuzzy logic operations to determine whether an item is a member, a partial member or not a member.
- To implement archives with well-defined data retention and deletion policies. Archived data must often be available for retrieval over years, even decades, so retention is important for meeting compliance and legal obligations. Retention periods can vary by file type, may be set in metadata during the archiving process, and generally cannot be changed until deletion.
- To implement a proper index and search mechanism that speeds up file search and retrieval, since operations naturally slow down as archives grow large.
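The partial matching described in the first aim can be sketched as follows. This is a minimal illustration under stated assumptions, not the system built in this work: the index contents, the file names and the function `rank_documents` are hypothetical, and the membership values are assumed to have been computed beforehand. AND queries are scored with the fuzzy minimum and OR queries with the fuzzy maximum.

```python
from typing import Dict, List, Tuple

# Hypothetical index: document name -> {term: membership value in [0, 1]}.
index: Dict[str, Dict[str, float]] = {
    "budget_2009.doc": {"budget": 0.9, "report": 0.2},
    "annual_report.doc": {"budget": 0.4, "report": 0.8},
    "memo.txt": {"report": 0.1},
}


def rank_documents(
    index: Dict[str, Dict[str, float]], terms: List[str], mode: str = "and"
) -> List[Tuple[str, float]]:
    """Score each document against the query terms and rank by relevance."""
    if not terms:
        raise ValueError("the query must contain at least one term")
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    combine = min if mode == "and" else max
    scored = []
    for doc, weights in index.items():
        # A term missing from a document has membership 0.0.
        score = combine(weights.get(term, 0.0) for term in terms)
        if score > 0.0:
            scored.append((doc, score))
    # Highest degree of relevance first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


and_ranked = rank_documents(index, ["budget", "report"], "and")
or_ranked = rank_documents(index, ["budget", "report"], "or")
print(and_ranked)
print(or_ranked)
```

Each document receives a degree of relevance rather than a yes/no verdict, so the results can be ordered by significance and weakly matching documents still appear, lower in the list, for OR queries.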
1.3 RESEARCH METHODOLOGY & DESIGN
There are three basic groups of retrieval models:
- Set-theoretic
  o Standard Boolean model, e.g. OPACs (Online Public Access Catalogs)
  o Fuzzy logic model, e.g. Inquiry Assistant at Bielefeld University (www.ub.uni-bielefeld.de/databases/rechercheassistent/)
- Algebraic
  o Vector space model, e.g. SMART (Salton et al, 1971)
- Probabilistic (Van Rijsbergen et al, 1979)
  o Probability theory-based model, e.g. OKAPI (Robertson and Sparck Jones et al, 1976)
The methodology used is the set-theoretic model, implemented with fuzzy logic, with the following components:
- User interface for query and result: allows the user to input a query and view the result set.
- Query interpreter: processes the query in a manner similar to the documents.
- Indexer module: creates the index, which enables faster searching.
- Matching mechanism: determines whether a document is relevant or not.
- Documents and document representations: the actual pieces of information and their logical view.
An information retrieval model is a quadruple [D, Q, F, R], where:
- D is a set of representations for the documents in the collection.
- Q is a set of representations for the user information needs (queries).
- F is a framework for modeling document representations, queries, and their relationships.
- $R: Q \times D \to \mathbb{R}$ is a ranking function which associates a real number with a query $q_i \in Q$ and a document representation $d_j \in D$.
Traditional Fuzzy Document Representation (Salton and McGill et al, 1989)
The function F is defined as $F: D \times T \to [0, 1]$, where D is the set of documents and T is the set of index terms. F(d,t) changes from a crisp set value (either 0 or 1) to a continuous membership value in the range [0, 1].
Index term weight: the degree of “aboutness” of a document with respect to a term, expressed by the value F(d,t), also interpreted as the significance of the term in representing the document content.
$F(d,t) = tf_{dt} \times idf_t$
$tf_{dt}$: frequency of term t in document d:
$tf_{dt} = \dfrac{\text{number of occurrences of term } t \text{ in document } d}{\text{number of occurrences of the most frequent term in document } d}$
$idf_t$: inverse document frequency of term t:
$idf_t = \log\left(\dfrac{\text{total number of documents in the collection}}{\text{number of documents containing term } t}\right)$
F(d,t) increases:
- with the number of occurrences of the term within a document
- with the rarity of the term across the whole collection
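The index term weight can be computed directly from the two definitions above. The sketch below is illustrative only: the three-document toy collection and the function `term_weights` are hypothetical, documents are assumed to be already split into terms, and a base-10 logarithm is assumed because the text does not state the base.

```python
import math
from collections import Counter
from typing import Dict, List


def term_weights(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Compute F(d, t) = tf_dt * idf_t for every term of every document."""
    n_docs = len(docs)
    # Number of documents containing each term.
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    weights: Dict[str, Dict[str, float]] = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        if not counts:
            weights[name] = {}
            continue
        max_count = max(counts.values())  # count of the most frequent term
        weights[name] = {
            term: (count / max_count) * math.log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical toy collection.
docs = {
    "d1": ["fuzzy", "logic", "fuzzy"],
    "d2": ["boolean", "logic"],
    "d3": ["archive", "logic", "archive", "fuzzy"],
}
w = term_weights(docs)
print(w["d1"])
print(w["d3"])
```

In this toy collection "logic" occurs in every document, so its inverse document frequency and therefore its weight is zero, while "archive", which occurs in only one document, receives the highest weight in d3. Depending on the collection size and the logarithm base, the raw product may need to be normalized to stay within [0, 1].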
1.4 SCOPE OF STUDY
This project covers only file arrangement, search optimization, the assignment of membership functions to elements, and other fuzzy operations on archives. It is therefore limited to the manipulation of archives using fuzzy operations and to determining the membership level and significance of documents in an archive. It does not cover how the archives are originally created.
1.5 SIGNIFICANCE OF STUDY
The purpose of this research is to improve file retrieval in archives and to eliminate the “too precise” results produced by Boolean logic. The project should produce less rigid results and greater flexibility. Using fuzzy logic is intended to eliminate the following limitations:
- Oversimplified representation of the information items (documents).
- The lack of a formal means of qualifying the role and degree of the terms in characterizing document contents.
- A matching mechanism based only on the presence of a given search term in the document representation.
- The lack of any way of establishing the degree of usefulness of each single document.
- Problems with Boolean operators:
  o Disjunctive (OR) queries lead to information overload through too many results.
  o Conjunctive (AND) queries lead to reduced, and commonly zero, results.
  o Conjunctive queries imply a reduction in recall.
- A query language that gives users only a crisp way of specifying their information needs: a term is either definitely significant or completely useless.
- The lack of discriminating power in ordinary logic.
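The problems with the Boolean operators listed above can be reproduced with ordinary set operations. The postings and document names in this sketch are hypothetical and serve only to show the behavior.

```python
# Hypothetical crisp postings: term -> set of documents containing it.
postings = {
    "fuzzy": {"d1", "d3"},
    "archive": {"d3", "d4"},
    "retrieval": {"d2"},
}
query = ["fuzzy", "archive", "retrieval"]

# Conjunctive (AND) query: a document must contain every term.
and_hits = set.intersection(*(postings[term] for term in query))

# Disjunctive (OR) query: a document may contain any of the terms.
or_hits = set.union(*(postings[term] for term in query))

print(sorted(and_hits))
print(sorted(or_hits))
```

With these postings the AND query returns nothing and the OR query returns every document, and in both cases the result is an unordered set that gives no indication of which document is the most useful.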
1.6 DEFINITION OF TERMS
- Logic: the ability of a system to make a rational decision; it can be regarded as the theory of reasoning in decision making.
- Fuzzy logic: a branch of logic that applies degrees of truth to ordinary Boolean logic by assigning membership functions to the elements of a set.
- Degree of truth: the significance assigned to a member of a set, ranging from 0.0 (absolute falseness) to 1.0 (absolute truth).
- Membership function: a function that specifies the level of membership, which may be partial, of an element in a set.
- Artificial intelligence: a branch of computer science that develops programs to allow machines to perform functions normally requiring human intelligence.
- Defuzzification: the process of producing a quantifiable result in fuzzy logic, given fuzzy sets and corresponding membership degrees.
- Archive: a collection of documents (e.g. letters, photographs) kept in long-term storage for future reference.
- Query: a formal representation of a request for information from a storage system such as a database or an archive.
- Information retrieval: a field concerned with the structure, analysis, organization, storage, searching and retrieval of information (G. Salton et al, Fuzzy Information Retrieval, 1968).
- Index: a data structure that improves the speed of data retrieval operations.
- Search: the operation of locating required data using specific search queries or conditions; the ability to retrieve results based on a specified criterion.
- Redundancy: the duplication of data, which can make the data inconsistent, thereby rendering it useless.
- Set: a collection of well-defined and distinct objects, considered as an object in its own right.
- Fuzzy sets: functions that map a value, which might be a member of a set, to a number between zero and one, indicating its actual degree of membership.
- GUI: a graphical user interface, which allows users to interact with a program, application or system through graphics (a combination of images, text, video, Flash, etc.).
- Model: a representation or an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact.
- Optimization: the enhancement of the effectiveness and performance of search activities in information retrieval.
- Ranking: assigning priority to search results.
- Index term weight: the degree of “aboutness” of a document with respect to a term, expressed by the value F(d,t), also interpreted as the significance of the term in representing the document content.