Onix Text Retrieval Toolkit
API Reference

INDEXING

Indexing is the most important step in making your text searchable. Onix assigns a record number and a word number to each word you index. This information is compiled into an index much like the one you might find in the back of a textbook. Looking a word up in this index is far faster than scanning every word of the text for matches.
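To make the idea of record numbers and word numbers concrete, here is a minimal sketch of a toy index. It is not Onix code: the function name build_index, the whitespace tokenizing, and the choice to start both numbers at 1 are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(records):
    """Map each word to a list of (record_number, word_number) pairs.

    Hypothetical helper for illustration; numbering from 1 is an assumption.
    """
    index = defaultdict(list)
    for record_number, text in enumerate(records, start=1):
        for word_number, word in enumerate(text.lower().split(), start=1):
            index[word].append((record_number, word_number))
    return index


records = ["the quick brown fox", "the lazy dog", "quick thinking"]
index = build_index(records)

# Looking up a word is a single dictionary access, not a scan of the text.
print(index["quick"])  # [(1, 2), (3, 1)]
```

Each entry says where a word occurs, just as a page number in a book index does, which is why a lookup does not need to read the text again.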

To begin indexing, first start an indexing session by calling ixStartIndexingSession. Once the session is started, indexing is simple: for each word in a document or record, call ixIndexWord. When you reach the end of a record or document, call ixIncrementRecord. You may then continue calling ixIndexWord for the words in your next document or record.

Note: Call ixIncrementRecord only when there is more data to index, never immediately before ixEndIndexingSession. This is a frequent error among people just starting with Onix. ixEndIndexingSession automatically closes the current record, so there is no need to increment the record first. If you do, you create an empty record at the end of the index, which can throw off the calculations of users who treat the record number as significant. For this reason, Onix flags a call to ixEndIndexingSession with an empty record as an error.

The most important unit of indexing for Onix is the record. A record is simply a chunk of text. It functions much like a record in a more traditional database, in that each search looks for individual records. Thus when you use a query operator such as and (&), you are looking for each record in which all the query terms appear. You can think of a record as the basic unit you search for.

An analogy might help explain this notion of a record. In the index in the back of a textbook, each word has page numbers associated with it. You use those page numbers to go to a page of text. For the book's index, each page is a record. With computers, most people don't use pages as their index unit. Instead they use whole documents or paragraphs. However, what text you decide belongs "together" as a record is up to you. You end one record and begin the next by calling ixIncrementRecord. You can make the boundaries of a record anything you wish, depending upon your application's needs.
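The choice of record boundaries changes what an and query returns. The sketch below is a toy model, not Onix code: records_matching_all is a hypothetical helper that tests each record directly rather than using an index, and the sample text is made up.

```python
def records_matching_all(records, terms):
    """Return the numbers of the records that contain every term (an AND query)."""
    wanted = {term.lower() for term in terms}
    return [
        number
        for number, text in enumerate(records, start=1)
        if wanted <= set(text.lower().split())
    ]


document = "onix indexes words\n\nsearch returns records"

# Choice 1: the whole document is one record.
whole = [document]
# Choice 2: each paragraph is its own record.
paragraphs = document.split("\n\n")

print(records_matching_all(whole, ["onix", "search"]))         # [1]
print(records_matching_all(paragraphs, ["onix", "search"]))    # []
print(records_matching_all(paragraphs, ["search", "records"])) # [2]
```

With document-sized records, "onix" and "search" match because they share a document. With paragraph-sized records, the same query finds nothing because the two words never share a paragraph.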

The following pseudocode shows how you would index a document. Note that it assumes you've already opened an index.

IndexingEngine = ixStartIndexingSession();
while( NotDone ) {
    for( EveryWordInTheCurrentDocument ) {
        ixIndexWord(Word);
    }
    if( MoreDataToIndex ) {
        ixIncrementRecord();    /* only when another record follows */
    }
}
ixEndIndexingSession();         /* closes the final record itself */
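The calling sequence in the pseudocode, including the rule about not incrementing the record just before ending the session, can be modelled with a small stand-in class. This is a sketch only: ToyIndexingSession and its methods are hypothetical and merely mirror the order of the Onix calls; they are not the Onix API, and the real functions take arguments not shown here.

```python
class ToyIndexingSession:
    """Hypothetical stand-in for an indexing session; not the Onix API."""

    def __init__(self):
        self.records = [[]]          # the first record is open from the start

    def index_word(self, word):
        self.records[-1].append(word)

    def increment_record(self):
        self.records.append([])      # close the current record, open the next

    def end_session(self):
        # Ending the session closes the current record. An empty one means
        # increment_record was called with nothing left to index.
        if not self.records[-1]:
            raise ValueError("current record is empty")
        return self.records


def index_documents(documents):
    """Index a list of documents, each given as a list of words."""
    session = ToyIndexingSession()
    for position, words in enumerate(documents):
        for word in words:
            session.index_word(word)
        if position < len(documents) - 1:   # more data to index
            session.increment_record()
    return session.end_session()


print(index_documents([["a", "b"], ["c"]]))  # [['a', 'b'], ['c']]
```

Calling increment_record after the last document and then end_session raises the error, which is the toy equivalent of the mistake described in the note above.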


Stemming

Some people choose to "stem" words as they index them. Stemming reduces a word to what is called a normalized form. The idea is that all the different forms of a word are indexed as a single term. Thus the words "run", "running" and "runs" are all indexed as the same term, which allows a user to find all of these forms with a single query. Note that a suffix-stripping stemmer such as Porter's does not map irregular forms like "ran" to the same term.

A stemmer does not always generate a real word. That is not its goal. The goal is to have all forms of a word indexed as the same term. So if you are not showing your wordlist to your users, stemming can be an effective tool. If you do show a stemmed wordlist to users, they are likely to be confused by it without an explanation.

Onix includes a copy of the Porter stemmer, which has been found to be one of the best and fastest stemming algorithms available for English.

Note: As you might expect, to search an index that contains stemmed terms, the query terms must be stemmed in the same way.
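The sketch below shows why the query must be stemmed the same way as the index. The stemmer here is a deliberately crude toy, NOT the Porter algorithm and not ixStemEnglishWord; toy_stem, build_index and search are hypothetical names used only for this example.

```python
def toy_stem(word):
    """Crude suffix stripper for illustration only; this is NOT Porter."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if word[-1] == word[-2] and word[-1] not in "lsz":
            word = word[:-1]         # "runn" -> "run"
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def build_index(records, normalize):
    """Map each normalized term to the set of record numbers containing it."""
    index = {}
    for record_number, text in enumerate(records, start=1):
        for word in text.split():
            index.setdefault(normalize(word), set()).add(record_number)
    return index


def search(index, term, normalize):
    return sorted(index.get(normalize(term), set()))


records = ["she runs daily", "running is fun", "he ran home"]
index = build_index(records, toy_stem)

print(search(index, "running", toy_stem))   # [1, 2]  query stemmed like the index
print(search(index, "running", str.lower))  # []      unstemmed query finds nothing
print(search(index, "ran", toy_stem))       # [3]     irregular form stays separate
```

The index stores "run", not "running", so a query that skips the stemming step looks for a term that was never indexed.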

About Changes To The Index

Most of the indexing process runs independently of the index itself, so it is perfectly safe to access the index while you are indexing. Once you call ixEndIndexingSession, however, changes to the index begin to be made, and you will want to avoid accessing the index from any processes or threads which may be active.

Since ixEndIndexingSession may take a while to run, two other functions are provided. The first, ixFinalProcessIndex, completes the processing of the temporary files generated during indexing, including the index compression. This processing is also independent of the index, so the index remains safe to access. After calling ixFinalProcessIndex, you will need to call ixMakeIndexActive, which brings the new index data into the index. While ixMakeIndexActive runs, the index must not be accessed; otherwise readers may see data in an unexpected state.

How fast is ixMakeIndexActive? If you are using a distributed index, it should take only about 10-20 ms. If you are not using a distributed index, it takes as long as it takes to copy the new index data into the index, and is thus about as fast as your hard drive and OS will allow.

Note: After changes have been made to the index, any other index managers which are accessing the index need to reload it, which may be done by calling ixReloadIndex.
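The stage, activate, reload sequence is a general pattern, and a toy version may make the ordering clearer. The sketch below is an analogy under stated assumptions, not a description of how Onix works internally: it stores a tiny "index" as a JSON file, and build_staged_index, make_active and Reader are hypothetical names standing in for the roles of ixFinalProcessIndex, ixMakeIndexActive and ixReloadIndex.

```python
import json
import os
import tempfile


def build_staged_index(postings, live_path):
    """Write new index data to a temporary file beside the live index.

    The live file is not touched, so readers can keep using it.
    """
    directory = os.path.dirname(os.path.abspath(live_path))
    fd, staged_path = tempfile.mkstemp(dir=directory, suffix=".staged")
    with os.fdopen(fd, "w") as staged_file:
        json.dump(postings, staged_file)
    return staged_path


def make_active(staged_path, live_path):
    """Swap the staged data into place; this is the only step that changes the live index."""
    os.replace(staged_path, live_path)


class Reader:
    """A process that holds its own in-memory copy of the index."""

    def __init__(self, path):
        self.path = path
        self.reload()

    def reload(self):
        with open(self.path) as index_file:
            self.postings = json.load(index_file)

    def lookup(self, word):
        return self.postings.get(word, [])


with tempfile.TemporaryDirectory() as directory:
    live = os.path.join(directory, "index.json")
    make_active(build_staged_index({"fox": [1]}, live), live)
    reader = Reader(live)

    staged = build_staged_index({"fox": [1, 4]}, live)
    before = reader.lookup("fox")   # staging has not touched the live index
    make_active(staged, live)
    stale = reader.lookup("fox")    # reader still holds the old data
    reader.reload()
    fresh = reader.lookup("fox")    # new data is visible after the reload

print(before, stale, fresh)  # [1] [1] [1, 4]
```

The slow work happens in the staging step while the live index stays readable, the activation step is kept short, and a reader sees the new data only after it reloads.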

About Distributed Indexes

Indexes need not be kept within a single file on a single hard disk or partition. It is possible to create what are called distributed indexes. These indexes are spread across several files, which can be on different hard drives or servers. This can enable you to create extremely large indexes - potentially larger than you could fit on a single hard drive.

See Also

ixStartIndexingSession, ixEndIndexingSession, ixIndexWord, ixIndexWordSpecial, ixIncrementRecord, ixStemEnglishWord, ixFinalProcessIndex, ixMakeIndexActive