Profiling Engine SDK
|
|
Introduction This manual is a brief summary of the query language used by the Lextek Profiling Engine. The manual is designed to outline the structure of the language itself. If you are unfamiliar with programming languages and search engines, we suggest you go through the Tutorial manual. That manual will walk you through conducting searches and designing more complex queries. The Onix Query language is based on regular programming languages that you may already be familiar with. The only difference from languages like C is that logical operations don't operate purely between single valued numbers, but between lists of records. Consider the following C code A = 1; B = 0; C = A & B; // C now holds the value of 0 Contrast this with the following query from the Profiler.
There are a few other differences from C in this code fragment. We'll deal with those a little later. For now we just want to emphasize that unlike many query languages, the Lextek Profiling Engine's query language is oriented around writing complex queries. By following the design of a programming language rather than series of "macros" we are able to make the Profiler more flexible and easily allow further expansion in the future.
The Onix Query language consists of a series of statements. Each statement is a query expression, a variable being set to a query expression, or the definition of a function containing other statements. Query expressions are like the search queries in simple search tools. They consist of a series of operands (terms) combined with operators (symbols representing logical operations). A statement can span multiple lines and is terminated by a semi-colon character (;), just as in C or C++. To send several statements to the query processor you simply create a string with a semi-colon separating each statement.
To have Onix return to you the results of a statement you must make the last statement an expression. This results of this expression will then be stored in what is called a vector, returned by prProcessQuery. (See the API Function Summary for more information on SDK calls). A vector is simply a list of "hits." Each hit is a document and word number with the weight of that hit. Each hit represents those documents you've indexed that match your query. Generally each query will represent a category or concept you've created. So each hit tells you those documents that fit your category and how well they fit your category.
2.0 - Terms Terms are the basic unit of the Profiling Engine. Each term is something that you indexed using prIndexWord or prIndexWordSpecial. Technically they need not be words. For instance some applications index trigrams (3 character pairs) to analyze texts rather than indexing whole words. Further some programmers add control characters flag words as having special meanings. In general though, most terms are words that have been indexed in a document. The simplest query possible is to look for a single term. The following returns all documents that contain the word cat. 'cat'; As with all queries we terminate it with a semi-colon. (See 1.1 Language Basics). Any alpha-numeric term can be specified by simply enclosing it within single quotes. Terms can also be specified in a hexadecimal form. The hexadecimal form consists of a 0x followed by hexadecimal numbers representing each byte in the term. For instance the query above could also have been written as 0x636174; Note that the format of terms are totally controlled by the indexing process. Whatever is passed in to prIndexWord or prIndexWordSpecial is a term. Your program has total control and flexibility over what is or isn't a term. It is therefore important that you remember to match what you use in your queries with what you used in your indexing.
3.0 - Basic Operators The basic operators that are available in most search engines are also available in the Profiler.
Most of these operators are fairly self-explanatory. The only unusual ones are the phrase operators and the near operator. For the phrase operator you can enclose terms between quote characters " " or alternatively braces < >. It is, however, important that you still enclose terms within single quotes if you are using that form. A common mistake is to type "first second" rather than "'first' 'second'" as is required. Remember that the single quotes translate ASCII text into the format that the Profiler understands. The near operator is one of the most useful operators as it lets you specify that a term is significant only if it occurs nearby an other operator. To specify how near they must be you simply use the forward slash character / followed by the number of terms they must occur within. Thus if I wanted to find the term apple within five words of pear I'd type the example given above. Note that the near operator doesn't care what order the words occur in. It simply checks to see how close they are. You can group operators using parenthesis. These allow you to specify the order in which they are evaluated. For example the following query evaluates all documents with pear and orange occuring as well as any document with apple occuring.
4.0 - Weights Weights are values that are associated with each term or query result. You can specify the weight of a term or query result by enclosing the value within brackets. If the term or result already has a weight specifying a new weight will replace the weight. Weights have a meaning for ranking hits. Generally weights take a value between 0 and 1. Weights of 0 are discarded as they are equivalent to "not present" or "not significant." Weights of 1 mean "is the category" or "is a member." Values between 0 and 1 usually are taken to represent the degree to which a term or query is a member of a concept or else the probability that they are members of a concept or category. Technically though the value can mean whatever you wish. There really isn't any requirement that they be between 0 an 1. However if you have values over 1 their meaning is hard to see. Usually values that don't range betwen 0 and 1 are normalized so that they can represent a probability. The Lextek Profiler has several methods for normalizing weigths which are discussed later.
If you don't specify a weight a term defaults to a weight of 1. This is because a weight of 1 represents present or 100% probability. Note that when you specify a weight it applies to the term or expression to its left. In the third example above the weight of .3 is applied to the term 'apple' and not the entire expression. To modify the entire expression we must enclose it in parenthesis as in the second example.
5.0 - More Operators The Profiling query language supplies numerous more operators than the basic ones given above. Because of the numerous operators we can't use the symbolic notation for all of them. Instead they are specified using a functional form. Using these operators is akin to calling a function in C or other programming language. Even the basic operators we listed in 1.2 are available in a functional form. Most of the extra operators function equivalently to our basic operators, but add advanced ranking methods. Ranking methods determine how to deal with terms or subqueries that have different weights. The simplest method is to ignore their weights. Other methods take the maximum or minimum weight. More advanced methods utilize statistical modeling or even vector space modeling. We'll go through each of the methods with a very brief summary along with the functions that use that method. 5.1 - Existential Operators Existential operators simply ignore underlying weights. They care only whether a term exists in a document or not. Thus they either return a weight of 0 (not present) or 1 (present). They don't deal with uncertainty or probability.
5.2 - Fuzzy Logic Operators Fuzzy Logic operators use the rules of fuzzy logic to determine resultant weights. For operators that require all terms to be present they return the maximum weight of the terms. (i.e. operators that work like an and) For operators that don't require all terms be present they return the minimum weight. (i.e. operators that work like an or) They care only whether a term exists in a document or not. Thus they either return a weight of 0 (not present) or 1 (present). They don't deal with uncertainty or probability.
5.3 - Probabilistic Operators Probabilitistic operators treat weights as a probability. To calculate the weight of an operator it calculates the probability of all the returned terms occurring together. Statistically this is equivalent to multiplying the probabilities of each term not occuring and then taking the inverse of that. Thus if I have a term with a weight of .4 and consider that weight as a probability, the probability of that term not belonging is .6. So if I have three terms with weights of .5, .3 and .9 the probability that they all belong is 1 - ( (1-.5)*(1-.3)*(1-.9) ) = .965. This is one of the more commonly used ranking methods. Probabilistic operators usually take the same name as the basic operator but with an r (for ranked) as a prefix.
In addition to the above operators there is one additional operator that uses a slightly different probabilistic calculation. The bayesian method is very similar to the probabilistic method, except that they use a form of statistics called Bayesian Statistics. Bayesian Statistics differ from regular statistics in that they deal with how likely something is based upon incomplete information. The calculation is to take the probability you think something will occur and divide that by the probability you think it won't occur plus the probability you think it will. In practice if you need to use the bayesian operator you'll already understand how it works.
5.4 - P-Norm Operators P-norm operators are a little difficult to explain. Basically they treat weights like a vector in space (if you remember your physics or vector algebra). Each term represents a unique direction. The weight represents the distance along that direction. What the p-norm does is allow you to pick a kind of measurement of the total distance of all the terms together. Each p-norm calculation requires what is called a p-value. The p-value specifies what kind of distance calculation is used. A p-value of 2 is like calculating a traditional vector distance. A p-value of 1 simply adds the weights together. Higher p-values actually make the calculation become more and more like a fuzzy logic calculation. So in a sense a P-norm calculation allows you to specify how much like a distance and how much like a min/max your weight returns. Even though P-norm calculations are very difficult to understand, in practice they are the most effective ranking method available. They are a little more computationally intensive, however, which can be significant if speed is a factor and you are doing thousands of calculations. Since it can be difficult to get a grasp on how P-norm operators work conceptually you should try various p-values to see which give you the best results pragmatically. After you tune your queries in this way you 'll find that P-norm calculations can really help the accuracy of your categorizing. The calculation is to take the weights to the power of p, take the average, and then take the pth root. If you are familar with vector algebra you can see how this is like a geometric distance. As with fuzzy logic calculations, a slightly different form is used for operators that are like an or as opposed to those like an and. Operators that are like an and use the probability of the term not occuring (1 - weight). Operators that are like an or use the weight itself. The first parameter passed to p-norm operators is the p-value. If the operators take further numeric parameters (as with p_near, for example) those other values follow the p-value.
5.5 - MMM Operators The mixed minimum and maximum method (MMM for short) for ranking is actually a variation of the fuzzy logic method. For fuzzy logic we take the maximum weight of underlying terms and subqueries if we require all of them to occur. If we only require some to occur we use the minimum. What the MMM method does is take the opposing weight and use it to bias the fuzzy logic weight. In a sense what the MMM does is take a value, called the m-value, which specifies how much like and and is like an or and vice versa. The m-value is the ratio between those two values. An m-value of .5 means that the and part and the or part contribute equally to the final answer. From testing the most effective m-values for m_and ranges from .5 to .8. For m_or we've found that an m-value of .2 to .5 works best. However you should do some testing to see what value matches your data and categories best. As with the p_norm ranking method, the MMM method requires a little trial and error, but can be very effective and improving your profiling efficiency.
5.6 - Variable Distance Operators Some operators explicitly depend upon how far apart terms are. For some applications terms which occur close to one an other are far more significant than terms which are farther apart. The variable distance method of ranking compares some maximum distance to the distance between terms and uses that as a guide for assigning a rank between 0 and 1.
5.7 - Normalization Operators Sometimes the weights assigned to terms arise out of external programs. Those programs might be due to feedback by users, results of document similarity calculations, or even statistics from a dictionary. Consider, for example, a mail router that routes mail based upon how well it fits categories supplied by a user. Some applications provide a checklist to see how relevant that document really was. Based upon what the user tells them, the weights of certain categories are adjusted so that only appropriate messages make it through. Those calculations may results in weights becoming larger than 1. (For instance a term that already has a weight of 1 which you want to make more significant) The approach to taking a range of values and assigning them new values between 0 and 1 is called normalization. There are many approaches to normalizing ranked results. The simplest, although least accurate, is to simply take the highest value and make that 1. The problem with that method is that it means that a collection of somewhat accurate data is treated the same as a collection of very accurate data. Sadly this is the typical method most ranking schemes use - often in a fashion that doesn't let the programmer adjust the data themselves. Sometimes this renomralization is entirely hidden from the programmer, leading to misleading statistics and weights. The most accurate method of normalizing weights is to look at all the data and utilize features from the data itself to get an idea of the "typical" value. The most widely used method is called the RMS method. This method, familar to people in physics or engineering, is to take an average of the square of each value. You then take the square root of that average. To normalize you then divide each value by this RMS figure. Any value that would be greater than 1 is set to 1. In this way the several values that are most likely to be the sought for hits are treated as being most significant. This helps eliminate misleading data and results in a ranking that is most accurate. Since the RMS method doesn't meet everyones' needs, we've included a traditional ranking which simply divides all the values by the maximum value. We've also included a custom normalization method that lets you specify what value is the "maximum" value.
5.8 - Other Operators In addition to the various operators listed above, we have some specialized operators that don't bear a clear connection to any ranking scheme or to our basic operators. The first group of operators are like an if statement in programming, but which functions on hits rather than a normal condition. These fit fairly specialized needs, but allow you to return terms or subqueries based upon the results of other subqueries. Usually these operators are parts of fairly complex queries.
The main difference between gate and iif is that gate works on the record level. If a record contains the first term, then it returns the second term otherwise the third term. Basically it is an "if-statement" that is evaluated for each and every record. The iif statement works over the entire set of records being profiled. Thus if the first term has a hit in any record it returns the hits for term 2 othewise term 3. The reason we've supplied both forms is that many clients break up documents into multiple records while profiling. Each record has a special meaning. While this use of the Profiler isn't typical, for those who need the power it comes in very handy.
6.0 Comments To enable you to document your queries and add notes, the language allows the use of C and C++ style comments. Everything within /* */ is treated as a comment and the comments can span numerous lines. This form of comments is useful for commenting out blocks of code or writing long explanations. Everything between the double slash // and the end of the line is also a comment. This is useful for putting short notes after a statement.
7.0 Variables Variables are used to hold the results of a subquery. It is important to note that they exist only to hold results. If you index more documents, the variables will not update to reflect the new documents. If you need variables to reflect the current state of the index you should use functions, discussed below. Variables are most often used for clarity or to hold intermediate steps when conducting interactive uses of the Profiler. They work exactly the same as variables in a traditional programming language. Because of the nature of queries, we find that you will rarely need to use variables. They exist mainly for special circumstances. To set a variable you simply type the name of the variable followed by an equals sign and the subquery you wish to assign to the variable. The name of a variable starts with a letter followed by any combination of alphanumeric characters and the underscore character. Variable names are not case sensitive. Thus BOBs_Query and Bobs_Query are equivalent.
It is important to remember that in this version of the Profiler, all variables are global. In practice this shouldn't be a problem. However when you use variables within functions you should remember that those variable names are not local to that function.
7.0 Functions Functions are very similar to variables, except that they reflect the current state of the index. Thus if you reset the Profiler and re-index some new documents, the information returned by a function will reflect the new documents. In a sense a function is basically a named query. It is important to use functions whenever possible. Not only do they make your queries easier to read and understand, but the Profiler uses it to optimize your queries. The more functions your query uses, the more the Profiler can optimize it. A function name must start with a letter followed by any combination of alpha-numeric characters and the underscore character. The names are not case sensitive. If you attempt to create a function with a name that has already been used an error will be returned. To define a function you type the function name followed by an equals sign. After the equals sign you put the subquery between braces { }. Each statement in the subquery must end with a semicolon. The function must also terminate with a semi-colon.
It is important to remember that variables are not local to functions. While most functions typically consist of a single operator that then calls other subqueries, a function can consist of more than one statement. In this case only the last statement should return a value. Any other statements return values are ignored.
|