Onix Text Retrieval Toolkit
API Reference

Relevancy Ranking

Most modern text indexing and retrieval systems offer some form of relevancy ranking. Onix provides relevancy ranking using current algorithms from the research community. Relevancy ranking orders the results list so that the records most likely to interest the user appear first. This makes searching easier, because users spend less time looking through records for the information they want.

Different applications frequently have different ranking needs. An algorithm that suits one project will not necessarily produce the best results for another, so Onix lets the developer select from a variety of relevancy ranking algorithms and use the one that works best for a particular application and its data. Additional relevancy ranking algorithms will likely be added to Onix in the future to give developers more flexibility. Any of the algorithms will dramatically improve search results from a user's perspective, but each one slightly biases one type of data over another, so it is sometimes worthwhile to try several. That way you can find the algorithm that most closely reflects the needs of your application as well as your own and your users' expectations.

How do relevancy ranking algorithms work? There are a number of ways to calculate how a given record ranks, and the factors considered vary with each technique. A few of the factors commonly taken into account are as follows:

  • The number of times the search term occurs within a given record.
  • The number of times the search term occurs across the collection of records.
  • The number of words within a record.
  • The frequencies of words within a record.
  • The number of records in the index.

More (or fewer) factors may be considered depending on the relevancy ranking algorithm selected, but this list should give a conceptual idea of some of the issues involved.
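
To make the factors listed above concrete, the following is a minimal sketch of a simplified term-frequency times inverse-document-frequency weight. It is a generic illustration and is not one of the ranking methods Onix implements. The `TermStats` structure and `simple_term_weight()` function are hypothetical names, "occurs across the collection" is approximated by the number of records containing the term, and the logarithmic damping found in most textbook formulas is left out to keep the arithmetic easy to follow.

```c
/* Hypothetical statistics for one search term and one record.
 * None of these names come from the Onix API. */
typedef struct {
    double term_count_in_record; /* times the term occurs in this record     */
    double records_with_term;    /* records in the index containing the term */
    double words_in_record;      /* length of this record in words           */
    double records_in_index;     /* total number of records in the index     */
} TermStats;

/* Simplified weight: (term frequency within the record) multiplied by
 * (inverse of the fraction of records that contain the term).
 * Returns 0.0 when the term is absent or the statistics are unusable. */
double simple_term_weight(const TermStats *s)
{
    if (s == 0 ||
        s->term_count_in_record <= 0.0 || s->records_with_term <= 0.0 ||
        s->words_in_record <= 0.0 || s->records_in_index <= 0.0) {
        return 0.0;
    }

    /* Frequent use inside a short record raises the weight. */
    double term_frequency = s->term_count_in_record / s->words_in_record;

    /* A term found in few records is more discriminating than a common one. */
    double inverse_record_frequency = s->records_in_index / s->records_with_term;

    return term_frequency * inverse_record_frequency;
}
```

Under this sketch, a term that appears in only a handful of records outweighs one that appears in nearly every record, and the same number of occurrences counts for less in a long record than in a short one.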

Typically, relevancy ranking algorithms rank records in relation to each other. The weight assigned to a given record reflects its standing relative to other records within the same database and for the same query. In other words, the weight of a given record for one query will most likely differ from the weight of the same record for a different query. Furthermore, the same record in a different database will typically receive a different weight for the same query. This means there are typically no hard numbers for how relevant a record is, only for how relevant it is in relation to the other records in the result set for that particular query and database.

Index Modes

When you create an index, Onix lets you specify one of several index modes. RecordMode, for example, builds an index that stores which records contain each term. IDFMode stores the same information plus additional information used for relevancy ranking. WordMode stores which records each word appears in as well as the word's offsets within the record, which aids phrase and word proximity searching. Both IDFMode and WordMode indexes can be used with relevancy ranking methods. RecordMode indexes do not contain enough information to provide relevancy ranking judgments.
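
The capabilities described above can be summarized as a small lookup. This is only a sketch: the enum and function names below are hypothetical stand-ins that mirror the mode names in the text, not constants or calls from the Onix headers.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the three index modes named in the text. */
typedef enum {
    MODE_RECORD, /* which records contain each term                */
    MODE_IDF,    /* RecordMode data plus information for ranking   */
    MODE_WORD    /* records per word plus word offsets             */
} IndexMode;

/* Per the text, IDFMode and WordMode indexes can be ranked;
 * RecordMode indexes cannot. */
bool mode_supports_ranking(IndexMode mode)
{
    return mode == MODE_IDF || mode == MODE_WORD;
}

/* The text describes word offsets (used for phrase and proximity
 * searching) only for WordMode. */
bool mode_stores_offsets(IndexMode mode)
{
    return mode == MODE_WORD;
}
```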

Selecting a Ranking Method

When you create an index, you need to specify the method used for ranking. If you use the function ixCreateIndex(), the method is set for you to Ranking Method 2, which is probably the most accurate ranking method for most purposes. ixCreateIndexEx() lets you select any of the other available ranking methods.

 

Relevancy Feedback Values

Some developers like to display their search results with a number that shows how relevant each record is. They want to be able to say that a record is 100% relevant, 87% relevant, and so on. On closer consideration, there are real problems in saying that a given search result is 100% relevant, or even 50% relevant, to a user. Users typically search for one or two words at a time and, due to the nature of language, there are real ambiguities as to what a word means, not only within the text but within the user's query itself. For example, a search for "red car" could return records about the red cars on a train. It could also miss records about red automobiles, because the text might use "rouge" for red or "automobile" instead of "car". The more one analyzes what is really relevant, the more complex the problem becomes. And since no retrieval system with relevancy ranking can read a user's mind, one cannot readily judge that a given record has a precise degree of relevance to the user.

Many developers will still want to provide some sort of relevance judgment to the user. These numbers, while they have no hard meaning, can benefit users by indicating how far a given search result is from the most relevant one.

The algorithm typically applied for assigning relevance values to the results displayed in an application's user interface is as follows:

  • Determine the relevancy weight of the most relevant document.
  • Determine the relevancy weight of the least relevant document.
  • Assign the most relevant document a relevancy judgment of 100%.
  • Assign the least relevant document a relevancy judgment of 50% or less.
  • Scale the intermediate search results within the range set by the most relevant document and the least relevant document.

This is the algorithm used by almost every application or web search engine when assigning relevancy values for user interface purposes. The same method can be used to drive a series of stars, dots, moons, or a graph that displays the relevancy judgments to users.
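
The scaling steps listed above can be sketched as a linear mapping from raw weights to display percentages. The function name `scale_relevancy()` and its parameters are hypothetical and not part of the Onix API. The percentage given to the least relevant result is left to the caller, and treating a result set in which every weight is equal as 100% is an assumption of this sketch.

```c
#include <stddef.h>

/* Map the raw relevancy weights of one result set onto percentages.
 *   weights       raw weights returned for a single query
 *   percents      output array with room for n values
 *   n             number of results
 *   floor_percent percentage shown for the least relevant result,
 *                 for example 50.0
 * The highest weight maps to 100, the lowest to floor_percent, and
 * everything in between is scaled linearly. */
void scale_relevancy(const double *weights, double *percents,
                     size_t n, double floor_percent)
{
    if (n == 0 || weights == NULL || percents == NULL) {
        return;
    }

    /* Find the weights of the least and most relevant results. */
    double lowest = weights[0];
    double highest = weights[0];
    for (size_t i = 1; i < n; i++) {
        if (weights[i] < lowest)  lowest = weights[i];
        if (weights[i] > highest) highest = weights[i];
    }

    double range = highest - lowest;
    for (size_t i = 0; i < n; i++) {
        if (range == 0.0) {
            /* Every result has the same weight (or there is only one). */
            percents[i] = 100.0;
        } else {
            percents[i] = floor_percent
                        + (weights[i] - lowest) / range * (100.0 - floor_percent);
        }
    }
}
```

For example, raw weights of 8.0, 2.0, and 5.0 with a floor of 50.0 yield 100%, 50%, and 75%. Because the mapping depends on the lowest and highest weights in the result set, the same raw weight can display as a different percentage for a different query or database, which matches the relative nature of the weights.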

See Also