Profiling Engine SDK
Query Language Introduction

Query Language
Introduction
Summary
Operators
Tips, Questions, and Answers
   
 
Main Index
Index
Tutorial
API Functions
Query Language
   
Technology Overview
   
Contact Us
   
 
Other Products
Onix Text Search and Retrieval Engine
Brevity Document Summarizer
Lextek Document Profiler & Categorizer
RouteX Document Routing Engine
Lextek Language Identifier
 

 

The Lextek Profiling Engine was designed with three main goals in mind. The first, speed, is a product of our advanced underlying technology. The other two, power and efficiency, are as much a result of our query language as it is our engine. To make an analogy, a well designed query language can be as big an improvement over a simple query parser as a word processor is over a type-writer. If you are already a programmer, then a more apt analogy would be the difference between programming in languages like Java, C or Pascal versus programming in assembly language.

 

Structure

The design of the query language focused on how people actually use profilers. Most profiling applications are oriented around designing queries that represent specific concepts or categories. Not only do these applications attempt to represent these concepts, but they often do so using other concepts. This observation leads us to naturally want to create a query language that mirrors this structure.

What we have done is design a query language so that any large query is made up of blocks of named sub-queries. These blocks mirror the components of a mental concept or category. They form a conceptual unit that allows your queries to be quite modular. Not only do you build new concepts from existing ones, but you can use other concepts without having to know how they work. After you have finished writing a category or concept you can rely on it to continue working in various situations. Others can thus use the categories and concepts you've created as building blocks for new queries.

These blocks resemble what are called functions in traditional programming languages. Because of this resemblance we've used the metaphor of a programming language to help you understand how the query language works. It is, however, important to realize that you need have no programming experience to use the query language. If you prefer, you can simply consider the blocks of query statements you write to be concepts or categories rather than functions. In practice all these terms can be thought of as synonymous.

 

Terms and Operators

Each block in a query consists of terms, operators, and functions. A term is simply something that was indexed in a document. You indexed these terms using the API. We won't go through the indexing process, but it is important to realize that what terms are available and their format is determined by you as you call the API. This includes such things as character sets, special control characters representing information, and even the case of words. For the purpose of this introduction we'll assume the typical case where all the words in the documents you indexed were converted to lower case and indexed as ASCII strings. Special cases, such as Unicode, we'll discuss later in the questions and answer section.

A term can be specified in one of two ways. The easiest, and most typical, is to use an ASCII representation. An ASCII representation consists of a series of ASCII characters enclosed within single quotes. For instance to find the word cat, you'd type 'cat'. The second, more powerful, way is to use what we call hexadecimal form. Hexadecimal form consists of a zero followed by an x and then a series of hexadecimal digits. Each pair of hexadecimal digits represents a byte in a byte stream. Thus to find the word cat using hexadecimal form, you'd type 0x636174. The number 63 is ASCII "c", 61 is "a", and 74 is "t". The advantage to using hexadecimal form is that you can find control characters and non-ASCII letters. Non-English character sets, such as those for Japanese or Arabic typically require you to use the hexadecimal form. Several programming tricks also utilize the hexadecimal form for terms.

Each term can be assigned a weight or rank. The weight is a value, typically between 0 and 1, that tells how significant the term is or how probable its occurrence is in a concept. A weight of 1 means that the term is absolutely part of a concept. A value of 0 means that it is absolutely not part of a concept. Most queries or programs typically end up with categories assigned various weights. These weights represent how well a document matches that query. If you don't specify a weight it defaults to 1.

We provide numerous operators that allow you to manipulate terms. Most of these are based upon the Boolean operators that you are familiar with while searching for web pages or using other indexes.

Basic Operators Summary
all, and, & All the terms or sub-queries must be present.
any, or, | Any of the terms could be present.
not, ! The term can not be present
phrase, " " The terms must occur in the order presented and be beside one an other, just as in a literal phrase.
near The terms must be near to one an other. (You can specify within how many words they must occur)
ordered The terms must occur in the order presented.
ordered near The terms must occur in the order presented and be within a certain number of words. (Phrase is like this, but with a distance of 1 word - i.e. adjacent)
ordered The terms must occur in the order presented.

 

Weights and Ranking

Most of these operators have several methods for ranking their results. Each name is typically the normal name of the operator with an abbreviation representing the type of ranking method it uses. For a full list of actual operator names, please see the

 

Existential Method

The simplest ranking method simply assigns a weight of 1 for all returned hits. (Weights of 0 are never returned, since a weight of 0 means irrelevant or not present - any term with a weight of 0 is discarded) We sometimes call operators that return only weights of 1 existential operators. They get this name as they only care about whether subterms exist or not. If you've used traditional indexes before, this is the way most of their operators work. Effectively you simply don't worry about weight at all.

 

Probabilistic Method

The most common form of the operators uses what we call the probabilistic method of ranking. Basically this treats the weight of each subterm as the probability that term belongs to the concept. It uses the standard formula for computing probabilities for several events transpiring together.

For instance if I had three terms in a record with weights of .5, .7 and .9 the returned weight from an OR would be 0.985.

 

Fuzzy Logic Method

The next commonly used form of operators is fuzzy logic. In this form each term weight is the value of a term in fuzzy logic. When calculating the rank of a operation with many terms, we need to know whether all the terms must occur or simply any term. If all the terms must occur then the weight of the result is the weight of the most significant term in the calculation. For instance if I was doing an AND of three terms with weights of .4, .5 and .7 the result would have a weight of .7. If any term can occur, then the weight is the weight of the least significant term. If I was doing an OR of three terms with weights of .4, .5, and .7 and the first two occurred, then the resultant weight would be .4. Operators that use fuzzy logic thus return either the maximum or the minimum weight, depending upon whether all terms must be present or not.

 

Bayesian Method

An other frequently used form of operator ranking is the bayesian method. These are very similar to the probabilistic method, except that they use a form of statistics called Bayesian Statistics. Bayesian Statistics differ from regular statistics in that they deal with how likely something is based upon incomplete information. For instance if we calculate the probabilities of rolling dice, we know what the probabilities are. (1 in 6 for any particular side of the die) But what if we are calculating statistics about something we aren't sure about? In this case our statistics don't really represent how probable something is, but rather how probable we guess it to be. We won't go through the details of the math here. There are quite a few books on this. It turns out that the bayesian method is very useful for many types of analysis in information retrieval and analysis. The formula is a little more complicated than the probabilistic method. It is basically the probability that we think something will happen divided by the probability it will happen plus the probability it won't happen.

If we had three terms with weights of .4, .5 and .7 in a record the weights returned by the Bayesian function for that record would be about 0.609.

 

P-Norm Method

One very useful, but not widely used ranking method is called the p-norm method. Unless you have some background in information retrieval, it can be difficult to understand how a p-norm function works. One way to think about it is to consider each weight as a type of "distance" in space. Each weight is a length and each term represents a unique direction. The simplest p-norm finds the overall length of all the terms together using methods similar to those you used with vectors in college. (For two terms you simply have the square root of x^2 + y^2) In addition, however, p-norm ranking uses a "p-value" which determines how much the ranking is like finding a distance as compared to how much it is like traditional fuzzy logic. The lower the p-value the more it is like finding a distance. The higher the p-value the more it is like the fuzzy logic method. There is a slight difference between the p-norm calculation for operators where all terms must be present and those where only a few need be present. We present the formula for both below. The first is for OR like operators and the second is for AND like operators.

While it is quite difficult to understand what a p-norm method means or how it works, it turns out that they are one of the most accurate and useful methods of ranking operations. Quite a few studies have found that using p-norm ranking can dramatically improve the quality of your queries as compared to traditional means. The downside is that a bit of trial and error might be needed to find the proper p-value that meets your needs. However after you've tried using p-norm ranking a bit you will find that they are extremely useful in developing queries and categories.

 

 

 

 

MMM Method

The mixed min and max method (MMM method for short) is a variation on our Fuzzy Logic method. As you recall in Fuzzy Logic methods we return either the minimum weight or the maximum weight depending upon whether all terms had to be present or not. Basically what the MMM method of ranking does is smooth out the effects of fuzzy logic by including both the minimum rank and the maximum rank in the final calculation. This is done by providing a "m-value" which says how much it is like a traditional calculation. While MMM can be helpful for some specific needs, in general we think you'll find that our other ranking methods are more useful.

 

Syntax

The basic syntax of the Profiler query language is designed to be very similar to what you may have used with other query parsers. The basic operators are the logical operators with their standard symbolic form. The typical way to write statements with these operators is using their symbolic form such as A & B. However you can also write them in their functional form which looks like a function in a programming language. The following gives the main operators in both their symbolic form and their functional form.

 &
AND
term1 & term 2
AND( term 1, term 2 )

both terms must occur in the document
 |
OR
term1 | term 2
OR( term 1, term 2 )

either terms must occur in the document
 !
NOT
term1 ! term 2
NOT( term 1, term 2 )

term 1 must occur without term 2 occuring in the document
 ^
XOR
term1 ^ term 2
XOR( term 1, term 2 )

either terms must occur but not both in the document

In addition to the standard Boolean operators, the Profiler provides phrase and proximity searching. Once again these come in both their typical symbolic forms and then a functional form. We provide the functional form as in some cases it ends up being a little clearer.

When entering a statement always end with a semi-colon (;). You need to put a semi-colon because the typical query people send to the Profiler consists of hundreds or even thousands of indvidual statements. (Don't worry if your queries aren't so complex - just be glad that the Profiler can handle that kind of complexity if you require it) To ensure readibility a single query statement might span numerous lines of text. The Profiler needs some way to know when a statement is done. The semi-colon is an easy solution to this and is a familiar one to most C programmers.

To help with readibility and also the documentation of complex queries we also allow you to include comments with your queries. The comments are the same style as found in C or C++. Anything enclosed within /* and */ is ignore. This is useful for commenting out old code you don't want to delete or simply put notes into your code. In addition all text following two slashes // up to the end of the line is ignore. Thus the following are valid comments for the Profiler.

// This is a comment for the Profiler



/* This is also a comment for the

   Profiler.  Notice how this comment

   spans numerous lines.  */

Functions

As we mentioned at the beginning of this summary, the query language is designed to help you break your queries into logical blocks that represent concepts or categories. To do this you create what are called functions. Each function is a named set of queries.

The name of a function can be any set of alphanumeric characters along with the underscore. Each name must, however, start with a letter. It is important to remember that, unlike some languages, the Profiler is not case sensitive when it comes to function names. (It is, however, case sensitive when it comes to specifying terms)

To define a function you type the name of the function, an equals sign and then enclose a subquery in braces. At the end of your block you put a semi-colon to tell the parser that you are done defining a function.

The following is an example of a query that is made up of several named functions. It illustrates several features of the query language, including structuring your queries, commenting your sub-queries, and presenting them in an easy to read and debug format.

// Our main product and the forms that it's name might take.



MP_3_Player = {

           OR( "'mp3' 'player'",  // note default weight = 1

               NEAR( 4,'mp3', 'player' ) [0.8], 

               NEAR( 5, 'plays', ( 'mp3s' | 'mp3' ) ) [.6],

           );

};

            

DVD_Player = {

           OR( "'dvd' 'players'",

               NEAR( 4, 'dvd', 'player') [0.8],

               NEAR( 5, 'plays', ( 'dvds', 'dvd' ) ) [.6],

           );

};



Products = { DVD_Player | MP_3_Player;

};





Review = { 'review' [.9] |    // notice here we're using the

           'reviews' [.9] |   // symbolic notation for or

           "'product' 'review'" [1] |

           NEAR( 5, 'liked' | 'like', Products ) [.6] |

           "'good' 'value'" [.5] |

           "'report' 'card'" [.8]; 

};



Product_Review = { Products & Review;

};

Notice how we broke up our final concept, Product_Reviews, into several sub-queries. Each sub-query represented the concepts that made up our final concept. We first set up functions representing the concept for each of our products. We then created a more general function for the concept of Product. Finally we created a function representing the concept of a review consisting of terms likely to belong the concept of a review. We ranked those terms based upon how likely we thought they were part of the concept. Obviously the phrase "product review" is about product reviews. The phrase "report card" was likely about a product review, but could potentially been about grades in school. Talk about liking one of our products could well be part of a review, but could also be coincidental. We thus assigned it a much lower value.

You can see how structuring your query like the above makes modifying it much easier. For instance we can quickly see how easy it would be to add a new product to our analysis - say tape recorders. By indenting lines we can quickly see what makes up each component. Thiswould, for instance, let us easily see the terms and phrases making up our concept of a review. We could easily delete or add terms to that concept.