LanguageIdentifer:
A Language Identifier object that was allocated by liCreateLanguageIdentifier.
Text:
A pointer to a buffer containing the text to be analyzed.
NumBytes:
The number of bytes in the buffer to be analyzed.
Status:
A pointer to a StatusCodeT object. (A signed long integer.)
liAnalyzeDocumentText submits data to the
language identifier for analysis.
For the language identifier to successfully identify the language
of your text, you must submit enough data for it to collect the
information it needs to determine the language and character set.
Typically this is around 200 characters, although, depending on
the nature of the text being analyzed, half that can be enough.
Accuracy continues to improve as the text length increases, up to
about 15K, but you rarely need anywhere near that much; 100-200
characters will usually suffice. (In general, the more text you
provide, the more information the language identifier has to base
its analysis on, and the more accurate the result.)
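The sizing guidance above can be captured in a couple of small helpers. This is only a sketch: the helper and macro names are hypothetical and not part of the library, "15K" is assumed to mean 15 * 1024 bytes, and one byte per character is assumed.

```c
#include <stddef.h>

/* Figures taken from the guidance above. Reading "15K" as 15 * 1024 bytes
   and assuming one byte per character are assumptions of this sketch. */
#define SAMPLE_TYPICAL_MIN_BYTES 200u
#define SAMPLE_USEFUL_MAX_BYTES  (15u * 1024u)

/* Returns 1 if the sample meets the typical 200-character guideline,
   0 otherwise. Shorter samples may still work, just less reliably. */
int sample_meets_guideline(size_t num_bytes)
{
    return num_bytes >= SAMPLE_TYPICAL_MIN_BYTES;
}

/* Returns how many bytes to submit: everything available, capped at the
   point beyond which extra text is said to stop improving accuracy. */
size_t sample_size_to_submit(size_t available_bytes)
{
    return available_bytes < SAMPLE_USEFUL_MAX_BYTES
               ? available_bytes
               : SAMPLE_USEFUL_MAX_BYTES;
}
```

A caller might use sample_size_to_submit to choose the NumBytes value, and sample_meets_guideline to decide whether to gather more text before analysis.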
You may call liAnalyzeDocumentText multiple
times, passing in a piece of the text each time. However, we strongly
recommend that you pass in as large a buffer of text as possible
in each call. (The Language Identifier was not designed to analyze
text one or two characters at a time.)
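When the text does arrive in pieces, a loop along the following lines keeps each piece as large as possible. This is a sketch under stated assumptions: the exact prototype of liAnalyzeDocumentText is not shown here, so the loop goes through a hypothetical callback type (SubmitFn) that a caller would implement around the real call. The tally_submit function is a stand-in for demonstration only.

```c
#include <stddef.h>

/* Hypothetical callback type: wraps whatever call actually submits one
   piece of text (for example liAnalyzeDocumentText). It returns a status
   value that is negative on error. */
typedef long (*SubmitFn)(void *ctx, const char *text, size_t num_bytes);

/* Submits len bytes in pieces of at most chunk bytes each.
   Stops at the first negative status and returns it; otherwise returns 0. */
long submit_in_chunks(const char *text, size_t len, size_t chunk,
                      SubmitFn submit, void *ctx)
{
    size_t offset = 0;

    if (chunk == 0)
        return -1;  /* assumption: a zero chunk size is treated as an error */

    while (offset < len) {
        size_t n = len - offset;
        long status;

        if (n > chunk)
            n = chunk;
        status = submit(ctx, text + offset, n);
        if (status < 0)
            return status;
        offset += n;
    }
    return 0;
}

/* Stand-in for the real submit call, for demonstration only: it just
   tallies how many calls and how many bytes it received. */
struct Tally { size_t calls; size_t bytes; };

long tally_submit(void *ctx, const char *text, size_t num_bytes)
{
    struct Tally *t = (struct Tally *)ctx;
    (void)text;  /* the stand-in does not inspect the text */
    t->calls += 1;
    t->bytes += num_bytes;
    return 0;
}
```

Choosing a large chunk value (several kilobytes rather than a few bytes) follows the recommendation above.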
The current version of the Language Identifier
needs to know the size of each character in bytes. (We will probably
change this slightly in future releases.) For most purposes,
this is simply one. For all the emphasis on Unicode and other
multi-byte character sets, the fact of the matter is that almost
everyone still uses the 8-bit character set that is appropriate
for their language.
If an error occurs, the StatusCodeT value pointed
to by Status will be set to a negative value.
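A caller would typically test the status after every call. The sketch below shows one way to do that. The helper name is hypothetical, and StatusCodeT is declared locally as a signed long (as the parameter description states) only so the example compiles on its own; in real code the type comes from the library's header.

```c
#include <stdio.h>

/* Local stand-in so the sketch compiles by itself. Remove this typedef
   when the library's own header is included. */
typedef long StatusCodeT;

/* Returns 1 and logs to stderr when *status is negative (an error);
   returns 0 for any non-negative status or a NULL pointer. */
int status_check_failed(const char *what, const StatusCodeT *status)
{
    if (status != NULL && *status < 0) {
        fprintf(stderr, "%s failed with status %ld\n", what, (long)*status);
        return 1;
    }
    return 0;
}
```

Only the sign is tested, because a negative value is the one error convention stated above.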
When you are finished passing data to the
language identifier (through one or more calls to liAnalyzeDocumentText),
call liEndDocument
to identify which language is used in your document.