
Sunday, August 8, 2010

Information Retrieval Software

Have you ever benefited from Google? Google is one example of what is called information retrieval software.
Information retrieval software is a crucial topic in knowledge management. This article gives you an overview of what information retrieval software is.

This article aims to provide readers with a very basic overview of information retrieval. Understanding these principles can help you optimize your site content for search engines and analyze changes to search engine algorithms. However, the details in this article are not intended to describe how modern search engines work, because they use additional factors, including link analysis.

Information retrieval (IR) is the science of searching for documents and for information within documents. Information retrieval techniques form some of the most fundamental elements of web search engine technology. This article discusses information retrieval in the context of search engines.


It is unrealistic to access remote documents in real time when performing a search, because this would be very slow and unreliable. Therefore, a local index is created for the search engine by a crawler (also known as a spider). So when you perform a search you are not really searching the web itself, but a version of the web as seen and stored by the crawler at some point in the past.

The index will not usually contain the full documents (although these may be stored separately as cached copies). Instead, it stores a representation of the terms relevant to each document in a form that can be searched quickly and easily. Building it involves several stages (not all systems include every stage):

   1. Document: The document in a standard format, with all of its text, structure and formatting.
   2. Structure Analysis: Identifying the title, paragraphs, headings, bold text, lists, etc.
   3. Lexical Analysis: Converting the characters in the document into a list of words. This process may include analyzing numbers, hyphens, punctuation and letter case. Analysis of letter case and of word or phrase format can also be used to identify important information such as names, places, dates and organizations.
   4. Stopword Removal: Stopwords are words that are very common and so provide no ability to distinguish between documents. For example: "the", "no", "is". However, some search engines appear to leave these words in the index and remove them at the user query level instead. This allows queries that include those words, such as exact phrase queries, to be performed.
   5. Stemming: A conflation procedure that reduces the variants of a word to a single root. For example, both "working" and "worked" can be reduced to "work". The Porter Stemming Algorithm can be used to perform stemming.

Once this process has been carried out we have a list of index terms for a particular document.
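
The lexical analysis, stopword removal and stemming stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, and crude_stem is a hypothetical suffix stripper standing in for a real stemmer such as the Porter algorithm, which has many more rules.

```python
import re

# Assumption: a tiny illustrative stopword list; real systems use longer ones.
STOPWORDS = {"a", "an", "and", "is", "no", "of", "on", "the", "this"}

# Assumption: a crude suffix list, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop very common words that do not distinguish documents."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Run the stages in order and return the list of index terms."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]


print(index_terms("Working on indexed files"))
# prints ['work', 'index', 'file']
```

A suffix stripper this simple will mis-stem many words, so a real system would normally use an established stemming implementation instead.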

Index Term Weighting

We now need to calculate how relevant each term is to a particular document. The following are examples of weighting schemes:

    * Term Frequency: The frequency of the term in the document. The frequency is usually normalized within a particular document: TermFrequency (term, document) = (no. of occurrences of the term in the document) / (no. of occurrences of the most frequent term in the document)
    * Inverse Document Frequency: The inverse of the frequency of the term across all documents in the collection. Terms that appear in many documents are not very useful because they do not allow us to distinguish between documents. IDF (term) = log ((no. of documents in the collection) / (no. of documents in the collection that contain the term))
    * Index Term Weight: The actual weight of a specific term in a particular document: Weight (term, document) = TermFrequency (term, document) * IDF (term)
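
As a rough sketch, the three formulas above translate directly into Python. The function names and the sample collection are illustrative, and the natural logarithm is an assumption, since the formula only says "log".

```python
import math
from collections import Counter


def term_frequency(term, doc_terms):
    """Occurrences of term divided by occurrences of the most frequent term."""
    counts = Counter(doc_terms)
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())


def inverse_document_frequency(term, collection):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc_terms in collection if term in doc_terms)
    if containing == 0:
        return 0.0  # assumption: a term seen nowhere gets zero weight
    return math.log(len(collection) / containing)  # assumption: natural log


def weight(term, doc_terms, collection):
    """Weight(term, document) = TermFrequency * IDF."""
    return term_frequency(term, doc_terms) * inverse_document_frequency(term, collection)


# Illustrative collection: each document is already a list of index terms.
collection = [
    ["file", "website", "search", "engine", "optimization"],
    ["website", "design", "tutorial", "file"],
    ["file", "bespoke", "software", "design", "development"],
]

for term in ("file", "website", "search"):
    print(term, round(weight(term, collection[0], collection), 3))
```

In this sample, "file" appears in every document, so its IDF is log(1) = 0 and it carries no weight, while "search" appears in only one document and scores highest.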

Other factors can also contribute to the weight, such as: the position of the term in the document, whether it is in the title, whether it is bold, whether it is in a list, etc.

Inverted Index

We now have a list of terms (with their weight) for a particular document. However, the list of documents containing certain words will be far more useful than a list of words for a particular document. This is called an inverted index.

For example, if we have the following three documents:

   1. This is a file on web site search engine optimization
   2. A web site design tutorial file
   3. A file on the design and development of bespoke software

Then the index terms for each document may be as follows (weight will be in parentheses):

   1. file (?), website (?), search (?), engine (?), optimization (?)
   2. website (?), design (?), tutorial (?), file (?)
   3. file (?), bespoke (?), software (?), design (?), development (?)

The inverted index, however, would be:

file: document1 (?), document2 (?), document3 (?)
website: document1 (?), document2 (?)
search: document1 (?)
engine: document1 (?)
optimization: document1 (?)
design: document2 (?), document3 (?)
tutorial: document2 (?)
bespoke: document3 (?)
software: document3 (?)
development: document3 (?)

The inverted index then allows us to easily find the relevant documents for particular words.
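
A minimal sketch of building such an index in Python is shown below. It assumes each document has already been reduced to a list of index terms, and it leaves out the weights for brevity. The function name build_inverted_index and the document identifiers are illustrative.

```python
from collections import defaultdict


def build_inverted_index(doc_terms_by_id):
    """Map each term to the list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, terms in doc_terms_by_id.items():
        # dict.fromkeys removes repeated terms while keeping their order.
        for term in dict.fromkeys(terms):
            index[term].append(doc_id)
    return dict(index)


# Index terms for the three example documents (weights omitted).
documents = {
    "document1": ["file", "website", "search", "engine", "optimization"],
    "document2": ["website", "design", "tutorial", "file"],
    "document3": ["file", "bespoke", "software", "design", "development"],
}

index = build_inverted_index(documents)
print(index["file"])    # prints ['document1', 'document2', 'document3']
print(index["design"])  # prints ['document2', 'document3']
```

Looking up a query word is then a single dictionary access, instead of a scan over every document's term list.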

Similarity matching

This is the process of calculating how relevant a document is to a particular search query. It can consist of:

    * Query Term Weighting: Applies a weight to each query term. For example, terms at the beginning of the query may be weighted more heavily.
    * Similarity Coefficients: Uses the query term weights and the document term weights to calculate the similarity between the query and each document. The similarity can be computed using the vector space model by calculating the cosine coefficient (this is not discussed in detail here).
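
For readers who want a concrete picture of the cosine coefficient mentioned above, here is a small Python sketch. It treats the query and each document as a dictionary that maps terms to weights. The sample weights are invented purely for illustration.

```python
import math


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # an empty or all-zero vector matches nothing
    return dot / (q_norm * d_norm)


# Assumption: made-up weights, not values computed from real documents.
query = {"search": 1.0, "engine": 1.0}
doc_a = {"search": 0.8, "engine": 0.6}
doc_b = {"design": 0.9, "tutorial": 0.4}

print(round(cosine_similarity(query, doc_a), 3))  # prints 0.99
print(cosine_similarity(query, doc_b))            # prints 0.0
```

A document that shares no terms with the query scores 0, and a document whose weights point in the same direction as the query scores close to 1, so sorting by this value gives a simple ranking.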

Refreshing the Index

Documents can change constantly, so the index must be continually refreshed. Crawlers need to determine how often to reindex a specific document, based on how often it is updated. If a document is not updated very often, reindexing it frequently would be a waste of resources. However, a document that changes all the time needs to be reindexed regularly, because it may no longer be relevant to the terms it was indexed under.

Measuring the Accuracy of an IR System

Two simple ways to assess the accuracy of a basic information retrieval system are precision and recall. They are calculated using the number of relevant documents, the number of retrieved documents (the documents deemed relevant by the system), and the number of relevant documents actually returned to the user, which is where the two sets of documents overlap.

    * Precision: The ratio of the no. of relevant documents returned to the number of documents retrieved, i.e. the proportion of retrieved documents that are relevant.
    * Recall: The ratio of the no. of relevant documents returned to the number of relevant documents, i.e. the proportion of relevant documents that are returned.
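
Both measures are easy to compute once the retrieved and relevant documents are treated as sets. The following Python sketch uses invented document ids, and the function name precision_recall is illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Assumption: made-up ids; "d5" is relevant but was never retrieved.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]

precision, recall = precision_recall(retrieved, relevant)
print(precision)         # prints 0.5
print(round(recall, 3))  # prints 0.667
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall of about 0.667).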

The documents actually returned from the retrieved set will be determined by some form of ranking mechanism (this discussion is beyond the scope of this article).

In general, there is a trade-off between precision and recall: increasing the number of retrieved documents tends to increase the number of relevant documents in the retrieved set, which raises recall, but it also tends to bring in more irrelevant documents, which lowers precision.

Web Search Engines

Web search engines (like Google, Yahoo! and MSN) usually combine information retrieval techniques with analysis of link structure, as well as many other techniques that are not publicly known. Clearly, the techniques above are very easily spammed, so any useful search engine needs to try to filter out spam where possible.
