Language Identifier SDK


liAnalyzeDocumentText

Name

liAnalyzeDocumentText -- Submit text to the language identifier for analysis.

Synopsis

void liAnalyzeDocumentText(LextekLanguageIdentiferT LanguageIdentifer, char *Text, size_t NumBytes, StatusCodeT *Status)

Arguments

LanguageIdentifer: A Language Identifier object that was allocated by liCreateLanguageIdentifier.

Text: A pointer to a buffer containing the text to be analyzed.

NumBytes: The number of bytes in the buffer to be analyzed.

Status: A pointer to a StatusCodeT object (a signed long integer) that receives the status of the call.

Returns

Nothing.

Description

liAnalyzeDocumentText submits data to the language identifier for analysis.

For the language identifier to identify the language and character set successfully, you must submit enough text for it to gather sufficient information. Typically, about 200 characters is enough, and depending on the nature of the text, as few as 100 can suffice. Accuracy continues to improve as the text length increases, up to about 15K, although you rarely need anywhere near that much. (In short, the more text you provide, the more information the language identifier has on which to base its analysis, and the more accurate the result.)

You may call liAnalyzeDocumentText multiple times, passing in a piece of the text each time. However, we strongly recommend passing in as large a buffer of text as possible in each call. (That is, the Language Identifier was not designed to analyze text one or two characters at a time.)
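As a rough illustration of passing text in large pieces, the sketch below splits a buffer into chunks and submits each one with liAnalyzeDocumentText. It is not taken from the SDK: the typedefs and the stub body of liAnalyzeDocumentText are stand-ins so the example compiles on its own (the handle is assumed to be an opaque pointer), and SubmitInChunks and its chunk_size parameter are hypothetical names. In real code you would include the SDK's header and link against the library instead of the stub.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in declarations so the sketch compiles without the SDK.
   Assumptions: the handle is an opaque pointer, and StatusCodeT is a
   signed long (as stated under Arguments). */
typedef void *LextekLanguageIdentiferT;
typedef long StatusCodeT;

/* Hypothetical stub with the documented signature. It only counts the
   calls and bytes it receives; the real function analyzes the text. */
static size_t stub_bytes_seen = 0;
static int stub_calls = 0;

void liAnalyzeDocumentText(LextekLanguageIdentiferT LanguageIdentifer,
                           char *Text, size_t NumBytes, StatusCodeT *Status)
{
    (void)LanguageIdentifer;
    (void)Text;
    stub_bytes_seen += NumBytes;
    stub_calls++;
    *Status = 0;
}

/* Hypothetical helper: submit `len` bytes of `text` in pieces of at most
   `chunk_size` bytes, stopping at the first negative status.
   Returns the number of bytes submitted successfully. */
size_t SubmitInChunks(LextekLanguageIdentiferT li, char *text, size_t len,
                      size_t chunk_size, StatusCodeT *status)
{
    size_t offset = 0;

    assert(status != NULL);
    *status = 0;
    if (chunk_size == 0)
        return 0;

    while (offset < len) {
        size_t n = len - offset;
        if (n > chunk_size)
            n = chunk_size;
        liAnalyzeDocumentText(li, text + offset, n, status);
        if (*status < 0)
            break;              /* error: stop submitting */
        offset += n;
    }
    return offset;
}
```

A larger chunk_size means fewer, bigger calls, which follows the recommendation above; a value of a few kilobytes is only an illustrative choice, not a figure from the SDK.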

The current version of the Language Identifier needs to know the size of each character in bytes. (We will probably change this slightly in future releases.) For most purposes, this is simply one. Despite all the emphasis on Unicode and other multi-byte character sets, most users still work with the 8-bit character set appropriate for their language.

If an error occurs, the StatusCodeT value pointed to by Status is set to a negative value.
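The error convention can be checked along the following lines. This is a minimal sketch under stated assumptions, not SDK code: the typedefs and the stub body of liAnalyzeDocumentText exist only so the example compiles on its own, the stub's choice of -1 for an empty buffer is invented for illustration (the real status codes are not listed here), and AnalyzeChecked is a hypothetical wrapper name. Because this page documents only the negative-on-error case, the wrapper clears the status before the call rather than assuming the function writes it on success.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in declarations so the sketch compiles without the SDK.
   Assumptions: the handle is an opaque pointer, and StatusCodeT is a
   signed long (as stated under Arguments). */
typedef void *LextekLanguageIdentiferT;
typedef long StatusCodeT;

/* Hypothetical stub with the documented signature. It reports -1 for a
   missing or empty buffer purely for illustration. */
void liAnalyzeDocumentText(LextekLanguageIdentiferT LanguageIdentifer,
                           char *Text, size_t NumBytes, StatusCodeT *Status)
{
    (void)LanguageIdentifer;
    assert(Status != NULL);
    if (Text == NULL || NumBytes == 0)
        *Status = -1;
    /* On success the stub leaves *Status untouched, to show why the
       caller initializes it. */
}

/* Hypothetical wrapper: returns 1 if the call succeeded, 0 if the status
   came back negative. The raw status is copied to *status_out when the
   caller supplies a non-NULL pointer. */
int AnalyzeChecked(LextekLanguageIdentiferT li, char *text, size_t num_bytes,
                   StatusCodeT *status_out)
{
    StatusCodeT status = 0;     /* cleared first: only errors are documented */

    liAnalyzeDocumentText(li, text, num_bytes, &status);
    if (status_out != NULL)
        *status_out = status;
    return status >= 0;
}
```

A caller would typically stop submitting text and skip the final identification step once AnalyzeChecked returns 0.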

When you have finished passing data to the language identifier (through one or more calls to liAnalyzeDocumentText), call liEndDocument to identify the language used in your document.

See Also

liStartDocument, liEndDocument