Onix Text Retrieval Toolkit
API Reference

UNICODE SUPPORT ROUTINES

Onix supports Unicode in two ways. First, Onix's index structure is flexible enough to let you define which characters denote a word, so Onix can index just about every character set now in use. Second, Onix provides a set of routines to assist in parsing, processing, and searching Unicode data.

ucNormalizeChar and ucTableNormalizeChar convert Unicode letters to either their uppercase or lowercase form. They cover the ASCII code page and the European Latin Unicode code pages. (These routines depend on a table set up by ucInitializeNormalizationTable.)

ixUnicodeCharToHex and ixUnicodeHexToChar convert Unicode words between their binary and hexadecimal encodings. They are roughly the same as ixCharToHex and ixHexToChar, except that they take account of the byte-order (endian) issues that one must deal with when indexing Unicode.
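To illustrate why byte order matters for a hexadecimal encoding, here is a minimal sketch in C. It is not the implementation of ixUnicodeCharToHex, whose exact output format is not described on this page. The helper name utf16_to_hex and the choice of uppercase, four-digit output are assumptions made only for this example. Because each digit is taken by shifting the 16-bit value rather than by reading its bytes from memory, the result is the same on little-endian and big-endian machines.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, NOT part of the Onix API.
 * Writes each 16-bit code unit as four hex digits, most significant
 * digit first, followed by a terminating NUL.
 * 'out' must have room for 4 * count + 1 characters.
 */
static void utf16_to_hex(const uint16_t *units, size_t count, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t i;

    for (i = 0; i < count; i++) {
        /* Shifts operate on the value, so host byte order is irrelevant. */
        out[4 * i]     = digits[(units[i] >> 12) & 0xF];
        out[4 * i + 1] = digits[(units[i] >> 8) & 0xF];
        out[4 * i + 2] = digits[(units[i] >> 4) & 0xF];
        out[4 * i + 3] = digits[units[i] & 0xF];
    }
    out[4 * count] = '\0';
}
```

By contrast, a routine that hex-encoded the raw bytes of the same array in memory order would produce different text on an x86 machine than on a big-endian one.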

When indexing Unicode text, it is best to convert it from UTF-7 or UTF-8 to UTF-16, so that each character is held in its usual 16-bit form. The words should also be indexed in big-endian format, with the most significant byte of each two-byte Unicode character first. If you are using Onix on an Intel x86 class machine, you will need to swap the bytes, because these machines store integers in the opposite byte order from many other systems, such as those that use Motorola or IBM processors. With the bytes in this order, words occur in the wordlist in the order you would expect, which makes finding them and traversing the wordlist much more intuitive.
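The byte swap described above can be sketched in C as follows. This is an illustration only: utf16_to_big_endian is a hypothetical helper, not an Onix routine, and it assumes the text is already held as an array of 16-bit code units. Writing the bytes with shifts avoids having to detect the host byte order, so the same code gives the same output on x86 and on big-endian machines.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, NOT part of the Onix API.
 * Serializes 16-bit code units with the most significant byte first.
 * 'out' must have room for 2 * count bytes.
 */
static void utf16_to_big_endian(const uint16_t *units, size_t count,
                                unsigned char *out)
{
    size_t i;

    for (i = 0; i < count; i++) {
        out[2 * i]     = (unsigned char)(units[i] >> 8);   /* high byte */
        out[2 * i + 1] = (unsigned char)(units[i] & 0xFF); /* low byte  */
    }
}
```

For example, U+00FF becomes the bytes 00 FF and U+0100 becomes 01 00, so a byte-by-byte comparison puts U+00FF first, matching the numeric order of the characters. In little-endian order the same two characters would be FF 00 and 00 01, and a byte-wise comparison would reverse them. Whether the wordlist is in fact ordered byte by byte is an assumption here, but it is consistent with the ordering behavior described above.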

See Also

ucInitializeNormalizationTable, ucTableNormalizeChar, ucNormalizeChar, ixUnicodeCharToHex, ixUnicodeHexToChar, ixHexToChar, ixCharToHex