Onix Text Retrieval Toolkit
API Reference

void ucInitializeNormalizationTable(UnicodeCharT *TableBuffer, size_t MaxChars, BooleanT Lower)


TableBuffer: Array of characters of type UnicodeCharT.

MaxChars: The number of characters (UnicodeCharT elements) in the table.

Lower: A boolean flag representing whether characters will be normalized for case as well as accents and other letter variations.




Many European languages use characters that carry a variety of accents. ucNormalizeChar normalizes these characters to their unaccented form, in either upper or lower case. ucTableNormalizeChar does the same for all the Unicode characters in the Latin (ASCII) and European Latin code pages. It is designed to work with Unicode and does not work with single-byte or other character sets.

You can normalize characters with ucNormalizeChar, but it uses a case statement and is not as fast as a lookup table. ucInitializeNormalizationTable initializes a lookup table covering the characters you will be normalizing, so that you can do direct lookups. When you use ucTableNormalizeChar rather than ucNormalizeChar, it uses this lookup table to normalize each character. This can produce a significant speedup when working extensively with Unicode-based documents.

Note that TableBuffer must be a pre-allocated array of at least MaxChars elements. Remember that the array is sized in terms of UnicodeCharT, not standard C chars. A common error is to allocate MaxChars bytes rather than MaxChars elements of the proper type.
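
The allocation rule above can be sketched as follows. This is a minimal illustration, not toolkit code. The typedefs and the body of ucInitializeNormalizationTable are stand-ins so the sketch compiles without the Onix headers: the real definitions come from the toolkit, and the real function fills in normalized forms rather than an identity mapping. The helper names AllocNormalizationTable and LookupNormalized and the table size EXAMPLE_MAX_CHARS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* ASSUMPTION: stand-in typedefs. The real ones come from the Onix
   headers and may have different underlying types. */
typedef unsigned short UnicodeCharT;
typedef int BooleanT;

/* ASSUMPTION: example table size covering Basic Latin through
   Latin Extended-B. Choose the size your documents require. */
#define EXAMPLE_MAX_CHARS ((size_t)0x0250)

/* STAND-IN with the documented signature. It writes an identity
   mapping only so the sketch can run; the toolkit's function
   stores the normalized form of each character instead. */
void ucInitializeNormalizationTable(UnicodeCharT *TableBuffer,
                                    size_t MaxChars, BooleanT Lower)
{
    size_t i;
    (void)Lower;
    for (i = 0; i < MaxChars; i++)
        TableBuffer[i] = (UnicodeCharT)i;
}

/* Hypothetical helper: allocate MaxChars ELEMENTS, not MaxChars
   bytes, then initialize the table. Returns NULL on failure. */
UnicodeCharT *AllocNormalizationTable(size_t MaxChars, BooleanT Lower)
{
    UnicodeCharT *Table;

    assert(MaxChars > 0);

    /* Correct: element count times element size.
       Wrong:   malloc(MaxChars), which is too small whenever
                UnicodeCharT is wider than one byte. */
    Table = malloc(MaxChars * sizeof *Table);
    if (Table == NULL)
        return NULL;

    ucInitializeNormalizationTable(Table, MaxChars, Lower);
    return Table;
}

/* Hypothetical helper: direct lookup with a bounds check.
   Characters outside the table are returned unchanged. */
UnicodeCharT LookupNormalized(const UnicodeCharT *Table,
                              size_t MaxChars, UnicodeCharT Ch)
{
    return ((size_t)Ch < MaxChars) ? Table[Ch] : Ch;
}
```

A caller would build the table once, for example with AllocNormalizationTable(EXAMPLE_MAX_CHARS, 1), reuse it for every character processed, and release it with free when done. The bounds check in LookupNormalized matters because a table sized for the Latin ranges does not cover every value a UnicodeCharT can hold.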

See Also

ucTableNormalizeChar, ucNormalizeChar, ixUnicodeCharToHex