Onix Text Retrieval Toolkit
API Reference

ixProcessQuery

Name

ixProcessQuery

Synopsis

OnixQueryVectorT ixProcessQuery(OnixIndexManagerT IndexManager,UCharT *RankedQueryString, UCharT *BooleanQueryString, StatusCodeT *Status)

Arguments

IndexManager: An index manager created by a call to ixCreateIndexManager(), with an open index and a retrieval session in progress.

RankedQueryString: A NULL-terminated string containing the query terms by which the results will be sorted (ranked).

BooleanQueryString: A NULL-terminated string containing the boolean query.  The query terms must be represented in hexadecimal.

Status: A pointer to a value of type StatusCodeT representing any error conditions.

Returns

OnixQueryVectorT (contains the results of the search).

If an error occurred, Status will be set to the error number.

Description

ixProcessQuery queries the index currently associated with IndexManager.  To search an index you must first open it with ixOpenIndex() and then begin a retrieval session with ixStartRetrievalSession(). ixProcessQuery takes two query strings. The first is a ranked query, which attempts to determine which records are most relevant to your search; not every word in a ranked query is guaranteed to be present in the returned records. The second is a boolean query, which specifies which words must be in the returned records. Both queries can be passed to ixProcessQuery at once. In that case the query processor returns only those records which satisfy the boolean query, ranked according to relevance. Pass NULL for either query you do not wish to use. (For instance, if you don't want a ranked query, pass NULL for RankedQueryString.)

Query Terms and Character Sets

Onix is character set independent, and query terms are represented in hexadecimal.  This allows any string of binary characters to be both indexed and searched, so you can work with any character set and determine how specific characters are handled. For instance, many indexers cannot handle NULL characters because they assume those represent the end of a string. By storing query terms in hexadecimal you have total freedom and flexibility over your search terms.

To generate a query you must convert your query terms to hexadecimal. A hexadecimal term starts with the text "0x" followed by two hexadecimal digits for each byte. The hexadecimal digits are placed consecutively in the string. A space character marks the end of the term. For example, "ahab & whale" should be passed to ixProcessQuery as "0x61686162 & 0x7768616c65".  The function ixConvertQuery can be used to automatically convert a query of the form "ahab & whale" to the form "0x61686162 & 0x7768616c65".  ixConvertQuery converts many queries, but if your query contains extended characters, Unicode, or boolean operators as part of the query terms, it is advisable to write a conversion function specific to your application's requirements.

All query terms need to be represented in the same form in which they were indexed. In other words, queries are case sensitive, so query terms need to be passed in the same case (upper or lower) in which they were indexed. For instance, if you indexed a word as "Ahab" and search on "ahab", the query will not find "Ahab". This means that some care should be taken in determining how you index your documents. Most applications convert words to lower case before indexing them. Some applications, however, index both the lowercase and uppercase forms of the word.
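For applications that write their own conversion function, the encoding described above can be sketched as follows. This is a minimal illustration, not part of the Onix API: the name term_to_hex and its signature are hypothetical, and it assumes lowercase hexadecimal digits as in the "ahab" example. The length is passed explicitly so that terms containing NULL bytes can be encoded.

```c
#include <stddef.h>

/* Hypothetical helper (not an Onix function): writes "0x" followed by two
 * lowercase hex digits for each of the len bytes in term.
 * Returns 0 on success, or -1 if out is too small. */
int term_to_hex(const unsigned char *term, size_t len,
                char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    size_t needed = 2 + 2 * len + 1;   /* "0x" + hex pairs + terminator */
    size_t i;

    if (out_size < needed)
        return -1;
    out[0] = '0';
    out[1] = 'x';
    for (i = 0; i < len; i++) {
        out[2 + 2 * i]     = digits[term[i] >> 4];    /* high nibble */
        out[2 + 2 * i + 1] = digits[term[i] & 0x0F];  /* low nibble  */
    }
    out[2 + 2 * len] = '\0';
    return 0;
}
```

Encoding "ahab" and "whale" this way and joining the results with " & " gives the query string "0x61686162 & 0x7768616c65" shown above.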

Onix also allows terms to be treated as numbers rather than as strings. To use a number term, enclose it in pound signs (#). Many users prefix their strings to add meta-data to them. To enable the same with numbers, a prefix may be given in hexadecimal form. So to search for the number 1 with a prefix of a space (character 32, which is 20 in hex) you'd search for #0x20.1#. You can also search for ranges of numbers. So to search for all the numbers between 50 and 100 with a prefix of 32 you'd search for #0x20.50-100#. The prefix is optional; to omit it, simply leave it blank. So to search for the number 1 with no prefix you'd use #.1#.
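The number-term syntax can be assembled with a small formatting routine. The sketch below is only an illustration: format_number_term is a hypothetical name, it assumes a single-byte prefix written as two uppercase hex digits, and it assumes non-negative numbers. Note that a space is byte 32 decimal, which is 0x20 in hexadecimal.

```c
#include <stdio.h>

/* Hypothetical helper (not an Onix function): formats "#<prefix>.<lo>#"
 * when lo == hi, or the range "#<prefix>.<lo>-<hi>#" otherwise.
 * Pass a negative prefix_byte to leave the prefix blank.
 * Returns 0 on success, or -1 if out is too small. */
int format_number_term(char *out, size_t out_size,
                       int prefix_byte, long lo, long hi)
{
    char prefix[8] = "";
    int n;

    if (prefix_byte >= 0)
        snprintf(prefix, sizeof prefix, "0x%02X",
                 (unsigned)prefix_byte & 0xFFu);
    if (lo == hi)
        n = snprintf(out, out_size, "#%s.%ld#", prefix, lo);
    else
        n = snprintf(out, out_size, "#%s.%ld-%ld#", prefix, lo, hi);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With no prefix and the number 1 this produces "#.1#"; with a space prefix (0x20) and the range 50 to 100 it produces "#0x20.50-100#".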

Note that to search for numbers you must first have indexed numbers using the routine ixIndexNumber.

Ranked Queries

Onix supplies several different ranking schemes. This allows you to find the ranking scheme that works most intuitively for your application and your customers' demands. The ranking scheme that Onix uses is specified when the index is created, as one of the index creation parameters passed into ixSetIndexCreationParams(). The query processor finds the records most relevant to the query terms that you supply. It then uses the ranking scheme you specified to determine how relevant each record is. When the records are returned in the result vector they are sorted by this relevance.

It is important to remember that not every returned record will necessarily contain all the terms in your query. Records missing terms will be less relevant than records containing all terms, but may still be returned. To specify that returned records must contain a term, prefix that term with the + sign. To specify that a term must not be in returned records, prefix that term with the - sign. You can achieve the same functionality by combining a ranked query with a boolean query. However, using only the ranked query is faster.
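As an illustration of the + and - prefixes, the following sketch builds a ranked query string from terms that have already been converted to hexadecimal. The helper name append_ranked_term and the enum are hypothetical, and placing the sign directly before the "0x" of the term is an assumption based on the description above.

```c
#include <stddef.h>
#include <string.h>

enum term_mode { TERM_OPTIONAL, TERM_REQUIRED, TERM_EXCLUDED };

/* Hypothetical helper (not an Onix function): appends one hex-encoded term
 * to the NULL-terminated string in query, separated by a space and
 * prefixed with '+' (required) or '-' (excluded) as requested.
 * Returns 0 on success, or -1 if the term does not fit. */
int append_ranked_term(char *query, size_t query_size,
                       const char *hex_term, enum term_mode mode)
{
    size_t used = strlen(query);
    size_t extra = strlen(hex_term)
                 + (used ? 1 : 0)                    /* separating space */
                 + (mode != TERM_OPTIONAL ? 1 : 0);  /* '+' or '-'       */

    if (used + extra + 1 > query_size)
        return -1;
    if (used)
        query[used++] = ' ';
    if (mode == TERM_REQUIRED)
        query[used++] = '+';
    else if (mode == TERM_EXCLUDED)
        query[used++] = '-';
    strcpy(query + used, hex_term);
    return 0;
}
```

Starting from an empty string, appending "whale" as required, "ahab" as optional, and "dog" as excluded yields "+0x7768616c65 0x61686162 -0x646f67".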

Boolean Queries

Unlike ranked queries, which return the records most relevant to a search, boolean queries return all the records which satisfy the query. Used in conjunction with a ranked query, a boolean query restricts the results to the top-ranked records which satisfy the boolean expression. Boolean expressions are composed of query terms (operands) and query operators, which let the user specify such things as phrase searching, boolean ANDs, ORs, NOTs, and word proximity searching. Parentheses may be used to group query operations to specify the order in which they must be executed.

The boolean style operators currently supported by Onix are as follows:

Operator   Operator Name
&          Boolean AND
|          Boolean OR
!          Boolean NOT
" "        Phrase
^          Exclusive OR
w:         Within (Word Proximity)
M          Member
( )        Parentheses

 

Boolean Operators

The currently supported boolean operators are "&" (AND), "|" (OR), "!" (NOT), and "^" (EXCLUSIVE OR). Each boolean operator takes two operands (search terms): one on the left side of the operator and one on the right. The boolean operators work as follows:

& -- AND. Finds records which have both operands. For example, the query

cat & dog


finds records which have both the word "cat" AND the word "dog".

------------------------------------------------------------------------



| -- OR. Finds records which have either operand. For example, the query


cat | dog


finds records which have the word "cat" OR the word "dog".

------------------------------------------------------------------------



! -- NOT. Finds records which have the left operand AND NOT the right operand. For example, the query:


cat ! dog


finds records which have the word "cat" AND NOT the word "dog".
------------------------------------------------------------------------



^ -- EXCLUSIVE OR. Finds records which have either operand but not both. For example, the query:


cat ^ dog


finds records which have either "cat" or "dog" but not both "cat" and "dog".
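The semantics of the four operators can be summarized as a small truth function. This is only a model of the descriptions above, not Onix code; record_matches is a hypothetical name, and has_left and has_right stand for whether a given record contains the left and right operand.

```c
/* Illustrative model (not an Onix function): returns 1 if a record that
 * contains (or lacks) the left and right operands satisfies the operator
 * op, and 0 otherwise. Unknown operators never match. */
int record_matches(char op, int has_left, int has_right)
{
    has_left = (has_left != 0);    /* normalize to 0 or 1 */
    has_right = (has_right != 0);

    switch (op) {
    case '&': return has_left && has_right;   /* AND: both operands       */
    case '|': return has_left || has_right;   /* OR: either operand       */
    case '!': return has_left && !has_right;  /* NOT: left but not right  */
    case '^': return has_left != has_right;   /* XOR: exactly one operand */
    default:  return 0;
    }
}
```

For "cat ! dog", a record containing only "cat" matches, while a record containing both words or only "dog" does not.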

 

 


Phrase and Word Proximity

Besides the boolean operators AND, OR, and NOT, Onix also supports phrase searching. Simply put your words (in their hexadecimal form) in quotes. For example, to search for the phrase white whale, search for:


"white whale"


or (using the hexadecimal form the query processor actually takes):


"0x7768697465 0x7768616C65"


The rest of the examples will use normal English words for clarity but keep in mind that the query processor takes the query itself in a hexadecimal representation.

You can specify word proximity by using the w: operator. The w: operator takes a parameter: the maximum distance, measured in words, between the first word and the second word. For example, A w:5 B will find all instances where A is within 5 words of B. Note the colon (:), which separates the operator from its parameter. You can let your users use the NEAR operator by calling ixLongQueryFormToShortQueryForm(). This function converts such things as NEAR to an equivalent query of the form A w:n B.

Parentheses

Parentheses may be used to specify the order of evaluation of terms in a query.  For example, with the query "white & (whale | ahab)", the query processor will first perform a boolean OR on "whale" and "ahab", and then perform a boolean AND of the result with "white".
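A proximity clause can be assembled from two hex-encoded terms and a distance, as sketched below. The helper name format_within is hypothetical and not part of the Onix API; it simply produces the "A w:n B" form described above.

```c
#include <stdio.h>

/* Hypothetical helper (not an Onix function): writes
 * "<left_hex> w:<distance> <right_hex>" into out.
 * Returns 0 on success, or -1 if out is too small. */
int format_within(char *out, size_t out_size,
                  const char *left_hex, unsigned distance,
                  const char *right_hex)
{
    int n = snprintf(out, out_size, "%s w:%u %s",
                     left_hex, distance, right_hex);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For "white" within 5 words of "whale" this gives "0x7768697465 w:5 0x7768616c65".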

General Searching

If words are simply separated by spaces, they are ANDed together. So for example, the query

cat dog jane

would find records where all the words, "cat" and "dog" and "jane" occur.

 

Field Searches

The M operator allows you to specify that a term or series of terms are "members" of a particular field. The M operator can operate on either a single term or a series of ANDed terms. For example, you can search for:


bob M name


which specifies that you are looking for "bob" in a field named "name". (In other words, "bob" is a member of the field "name".) You can also perform a series of ANDs within a given field. This is done by putting the series of ANDs within parentheses, followed by the M operator. For example:


(bob & casey & jones) M name


specifies that you are looking for a record where the words "bob", "casey", and "jones" are all part of the name field.

Field searches can also be implemented by prefixing the terms that are being indexed with a unique prefix that specifies the field. When searching in a field, this same prefix can be prefixed to the search term to specify the field.

For example, the "Name" field can have each word prefixed with the word "Name:" as in:

Name:Jones
Name:Henry
Name:Smith
etc....

 

The same can be done for the other fields. On large indexes this can be faster than using the member (M) operator by a reasonable margin, because the field prefix makes the index entries for the various words significantly shorter and the search engine does not need to decode as much field positioning data during a query.
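When fields are implemented with prefixes, the prefix and the word are encoded together as a single hexadecimal term. A minimal sketch, assuming the prefix is simply concatenated in front of the word exactly as it was indexed (the name prefixed_term_to_hex is hypothetical and not part of the Onix API):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an Onix function): hex-encodes the bytes of
 * prefix followed by the bytes of term as one "0x..." query term.
 * Returns 0 on success, or -1 if out is too small. */
int prefixed_term_to_hex(const char *prefix, const char *term,
                         char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    size_t plen = strlen(prefix);
    size_t tlen = strlen(term);
    size_t pos = 0;
    size_t i;

    if (out_size < 2 + 2 * (plen + tlen) + 1)
        return -1;
    out[pos++] = '0';
    out[pos++] = 'x';
    for (i = 0; i < plen + tlen; i++) {
        /* Take bytes from the prefix first, then from the term. */
        unsigned char c = (unsigned char)(i < plen ? prefix[i]
                                                   : term[i - plen]);
        out[pos++] = digits[c >> 4];
        out[pos++] = digits[c & 0x0F];
    }
    out[pos] = '\0';
    return 0;
}
```

With the prefix "Name:" and the word "Jones", the resulting term is "0x4e616d653a4a6f6e6573". The case of both parts must match what was indexed.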

 

Natural Language Searching

One can allow people to write queries just as if they were asking a question of a human. Typically this is handled by removing the stop words from the query. For example, for the query "Where is Angkor Wat?" one would remove the words "where" and "is", since they are both stop words, leaving the query "Angkor Wat", which is then run as the ranked query. Onix will then find the most closely matching records and return those. Even if, after removing all the stop words, a few words remain which one would not consider key words, that is acceptable: the relevancy ranking algorithms will typically still be able to detect which words are the most important to the user and return the proper set of records.
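The stop-word step might look like the sketch below. It is an illustration under stated assumptions, not Onix code: the function names are hypothetical, the stop-word list is a tiny made-up sample, and words are taken to be runs of ASCII letters and digits. The surviving words keep their original case and would still need case normalization and hexadecimal conversion before being passed as the ranked query.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Sample list only; a real application would supply its own stop words. */
static const char *const STOP_WORDS[] = { "where", "is", "the", "a", "of", "what" };

/* Returns 1 if word[0..len) equals a stop word, ignoring ASCII case. */
static int is_stop_word(const char *word, size_t len)
{
    size_t i, j;

    for (i = 0; i < sizeof STOP_WORDS / sizeof STOP_WORDS[0]; i++) {
        const char *stop = STOP_WORDS[i];
        if (strlen(stop) != len)
            continue;
        for (j = 0; j < len; j++)
            if (tolower((unsigned char)word[j]) != stop[j])
                break;
        if (j == len)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: copies the words of question that are not stop
 * words into out, separated by single spaces; punctuation is dropped.
 * Returns 0 on success, or -1 if out is too small. */
int strip_stop_words(const char *question, char *out, size_t out_size)
{
    const char *p = question;
    size_t pos = 0;

    if (out_size == 0)
        return -1;
    while (*p) {
        const char *start;
        size_t len;

        while (*p && !isalnum((unsigned char)*p))  /* skip separators */
            p++;
        start = p;
        while (*p && isalnum((unsigned char)*p))   /* scan one word */
            p++;
        len = (size_t)(p - start);
        if (len == 0 || is_stop_word(start, len))
            continue;
        if (pos + len + (pos ? 1 : 0) + 1 > out_size)
            return -1;
        if (pos)
            out[pos++] = ' ';
        memcpy(out + pos, start, len);
        pos += len;
    }
    out[pos] = '\0';
    return 0;
}
```

Given "Where is Angkor Wat?", this leaves "Angkor Wat".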

 

Wildcards

Wildcards may be used in a query to specify a class of terms.  For example, "whal*" will match "whale", "whales", "whaling", etc.  The following wildcard characters apply:


* -- Match any sequence of one or more characters
? -- Match any single character
\ -- Escape character


It is important to note that each byte of a term is represented by two hexadecimal characters (both in the range 0-9, A-F).  This means that when wildcarding a query term, a wildcard character takes the place of two characters in the hexadecimal term.  For example, the wildcarded query term "whal*" is 0x7768616c* (the * replacing the "65"). If you need to search for a term which contains a "*", "?", or "\", prefix that character with the escape character "\". The escape character tells the wildcard pattern matcher to accept the next character literally.

ixProcessQuery returns a query vector which contains the results of the query.  A query vector is a list of "hits", the records which match a search.  You can view the search results with the functions ixVectorCurrentHit(), ixVectorNextHit(), and ixVectorPreviousHit(), and find how many records match your query with ixNumHits().
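A conversion routine for wildcarded terms can pass the wildcard characters through unchanged while hex-encoding every other byte. The sketch below is an assumption-laden illustration: wildcard_term_to_hex is a hypothetical name, every "*" and "?" in the input is treated as a wildcard, and escaped literal characters are deliberately not handled, since their exact hexadecimal representation is not described here.

```c
#include <stddef.h>

/* Hypothetical helper (not an Onix function): writes "0x" and then, for
 * each byte of the NULL-terminated term, either the wildcard itself
 * ('*' or '?') or two lowercase hex digits.
 * Returns 0 on success, or -1 if out is too small. */
int wildcard_term_to_hex(const char *term, char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *p = (const unsigned char *)term;
    size_t pos = 0;

    if (out_size < 3)
        return -1;
    out[pos++] = '0';
    out[pos++] = 'x';
    for (; *p; p++) {
        if (*p == '*' || *p == '?') {
            if (pos + 2 > out_size)   /* wildcard + terminator */
                return -1;
            out[pos++] = (char)*p;
        } else {
            if (pos + 3 > out_size)   /* hex pair + terminator */
                return -1;
            out[pos++] = digits[*p >> 4];
            out[pos++] = digits[*p & 0x0F];
        }
    }
    out[pos] = '\0';
    return 0;
}
```

The term "whal*" becomes "0x7768616c*", and "wh?le" becomes "0x7768?6c65", with the "?" standing in for the two hex digits of one byte.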

NOTE: The returned query vector needs to be disposed of after you are finished using it. You can do this by calling the function ixDeleteResultVector.

See Also

Queries
ixVectorNextHit, ixVectorCurrentHit, ixVectorPreviousHit, ixNumHits