Profiling Engine SDK
The basic way people query indexes is with what are called "Boolean operators." These operators let you search not only for a specific term, but also for documents that contain all of several terms or any term in a list of terms, and even exclude documents that contain a given term. The operators use names that match the way we talk about matches in English. Nearly every indexer provides the operators "and", "or", and "not" along with the ability to group these into more complex queries. Some indexers also provide methods for ranking results based upon how well the indexer thinks a match corresponds to what you are looking for.
The problem with traditional Boolean operators is that, as many studies have found, they rarely retrieve just the documents that are desired. Many documents are missed, and at the same time many documents are retrieved that are not needed. In the context of router or categorization services this means that a document may be "tagged" as belonging to some category when it really doesn't. Or, often worse, it may mean that documents that ought to be tagged aren't.
Most categorization and router companies attempt to deal with this by creating extremely complex queries using Boolean operators. Creating effective queries can become quite an art and is often why people use these companies' products. Making effective queries using just Boolean operators is very difficult. However, quite a bit of research in the science of Information Retrieval shows that other, non-Boolean functions can improve accuracy and performance dramatically. Boolean operators also don't deal well with ranked information. Most indexers that provide ranking information don't combine it in a helpful fashion with their Boolean operations. Thus other operators and functions are needed to deal with the rank of various index terms.
There are several ways to consider ranking queries. The most popular method is to compare the frequency of a term in a given document to the frequency of that term in a collection of documents. While this type of ranking is provided in our Onix Indexing Engine, we have not at present included it in the Lextek Profiling Engine. There are several reasons for this. One major reason is that our current clients in the categorization industry do not use these methods of ranking. We have found that our clients are deeply involved in what their industry actually needs rather than what other people think their industry needs. Because of this we put an emphasis on designing our Profiler to do what our clients need. When our clients point out a need that isn't being met, we try to make it a priority for our development.
The main reason for using a different technique is that frequency-based ranking always ranks with respect to a given document collection. Thus this type of ranking is intrinsically tied to large, static indexes. Yet those types of indexes were the very things we were attempting to eliminate to improve performance. If you need this sort of ranking, we'd encourage you to look at our Onix Indexing Engine. It has quite advanced ranking features of this traditional sort. In addition, most of the advanced functionality we've created for the Profiler will be part of the next version of the Onix Indexing Engine we are releasing.
A third reason for not including this technique of ranking is that most categorization is based upon very specific ranking of ideas, concepts or queries that the categorization company creates. Categorizers usually are concerned with how significant a term, "idea," or category is relative to the overall query, not how significant a term is to a given set of documents. In a sense the type of ranking most categorizers need is the exact reverse of how many traditional indexers rank. Conceptually, categorizers are comparing a document to a query, not a query to numerous documents. This conceptual difference applies to ranking just as much as it does to indexing. We have found that in practice the advantages of using a document-based ranking don't always apply to the categorization industry. Further, while frequency-based ranking can provide general categories, it typically becomes fairly inaccurate as more specific categories or concepts are developed. Those general categories can often be developed more easily using other methods.
In the Lextek Profiler we handle ranking in the following way. Any term or query result can have a rank or weight manually assigned to it. These ranked terms or query results can then be used in further queries that make use of the ranking information. We have quite a few operators and functions that allow one to combine these ranking results into more complex rankings. Remember that the rank of a query is how relevant that query is to the overall search. In the context of a categorization project what we will ultimately be ranking are categories. Once we've created several categories we can use them to create new categories. By assigning a rank to these categories we are saying how significant each subcategory is to an overall category. It is thus important that the way query functions rank makes sense for a given project and improves the accuracy and performance of a given query.
An example can help illustrate this. Consider a router that sends news articles about major stock fluctuations in a company named Gizmo. The name of the company is very significant in our search. Indeed, it is one of the most important categories that make up our overall analysis. In addition we will likely have some terms that, while they refer to Gizmo Inc., could also refer to several other companies. For instance, the term "inc" is found not only in "Gizmo Inc." but also in "Plastic Furniture Inc."
These other terms, while significant, aren't as significant as the name of the company in determining what a document is about. Thus while we want documents returned that refer to Gizmo somewhat ambiguously, we don't want them to count as much as documents that refer explicitly to Gizmo. We would weigh terms that deal with stock fluctuations in the same way. Some terms, phrases or sub-queries might indicate a stock fluctuation very strongly. Others may not necessarily signify the stock fluctuations we are after. Consider the word "gains." That word might be part of the phrase "Gizmo stock made significant gains this week." However, it might also be part of the phrase "Gizmo gains a new CEO this week." So while "gains" is significant, we want to take into consideration how significant it is.
Really what we are saying is that we want to specify how likely it is that a term or sub-query tells us something about a document. In mathematics we deal with likelihoods using probabilities. To describe the probability of a term describing a concept we assign that term a value between 0 and 1. A value of 1 means that the term definitely describes the concept. A value of 0 means that it doesn't at all. Values between 0 and 1 tell us the degree to which that term belongs to a given category, class or idea. For example, we might assign a weight of 0.5 to the term "glider" as part of the concept of airplanes. While a glider is an airplane, it often isn't the type of airplane that we are usually interested in. It isn't as probable that "glider" tells us that our notion of airplane is present.
You don't need to know anything about mathematics to use ranking. The way to think about weights and ranking in the Profiler is to think in terms of concepts and categories. You rank terms and sub-categories based upon the degree to which they belong to a given category. In a categorization engine, for example, you have a list of categories that you are assigning to a document. Think of those categories as real concepts. Next think of the concepts that make up those categories. For each of these sub-categories think of the degree to which it corresponds to your larger category. By thinking of ranking in this way you will have an excellent intuitive feel for how the Lextek Profiler works.
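The idea of categories built from weighted terms and sub-categories can be pictured as a small data structure. The following is only an illustrative sketch in Python, not the Profiling Engine API: the `Category` class and its `add` method are hypothetical, and every weight other than the 0.5 given above for "glider" is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Category:
    """A concept made up of weighted terms and sub-categories (hypothetical)."""
    name: str
    # Maps each member (a term, or the name of a sub-category) to its weight.
    members: Dict[str, float] = field(default_factory=dict)

    def add(self, member: Union[str, "Category"], weight: float) -> None:
        """Record how strongly a term or sub-category belongs to this category."""
        if not 0.0 <= weight <= 1.0:
            raise ValueError("a weight must lie between 0 and 1")
        key = member.name if isinstance(member, Category) else member
        self.members[key] = weight


# "glider" belongs to the airplane concept only to degree 0.5 (value from the text).
airplanes = Category("airplanes")
airplanes.add("glider", 0.5)
airplanes.add("jetliner", 1.0)  # assumed weight, for illustration only

# A category can itself be a weighted member of a larger category.
aviation = Category("aviation")
aviation.add(airplanes, 0.9)    # assumed weight, for illustration only
```

Each weight answers the same question at every level: to what degree does this member belong to the category that contains it?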
Operators and Functions
As we mentioned, basically all indexers provide simple Boolean operations. Unfortunately, Boolean operations don't deal with ranking very well. Many traditional indexers simply separate ranking from Boolean searches. The problem is, of course, that you want to search based upon operations like "and", "or", and so forth. Such operations tell you something about how terms or categories are related. Yet we also want to keep the idea that some documents are more like our category than others.
The Lextek Profiling Engine allows you to do this by providing numerous functions that combine set operations with ranking. The standard Boolean operations have been expanded to take into consideration the rank of the terms they operate on. Many of these operations work the same way we use them in regular language when we talk about how probable something is. For instance, "and" returns the weight of the most probable term in the list of terms (the highest weight in the list). This is because if we want all the terms, then obviously the most significant term is present. With an "or" we consider the rank to be the minimum value. This is because any of the terms could count, so the importance of that query is the importance of the least significant component. We won't go through the details here of why people in Information Retrieval have decided that fuzzy logic should work this way. It should be sufficient to say that quite a few people have studied this issue in depth and decided that operations like "and" should work the way we have them working. The language summary goes through each of these commands and describes them, along with why you might wish to use them. The following is a summary of the basic kinds of operations we provide. We provide numerous variations of these kinds of operations designed to work with specific ranking methods.
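As a rough illustration of the ranked "and" and "or" just described, the sketch below follows this section's description literally: "and" requires every term to be present and yields the highest weight, while "or" needs only one term and yields the lowest weight among the terms found. The function names `ranked_and` and `ranked_or`, the sample weights, and the convention that an absent term is passed as `None` are all assumptions made for the example; the language summary is the authority on the actual operators.

```python
from typing import Optional, Sequence


def ranked_and(weights: Sequence[Optional[float]]) -> float:
    """All terms must be present; the result is the highest weight."""
    if not weights or any(w is None for w in weights):
        return 0.0  # a term is missing, so the "and" does not match
    return max(weights)


def ranked_or(weights: Sequence[Optional[float]]) -> float:
    """Any term may be present; the result is the lowest weight found."""
    present = [w for w in weights if w is not None]
    return min(present) if present else 0.0


# Weights of three terms in a document (None marks a term that is absent).
print(ranked_and([0.3, 0.7, 0.5]))   # 0.7
print(ranked_and([0.3, None, 0.5]))  # 0.0
print(ranked_or([0.3, None, 0.5]))   # 0.3
```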
As we mentioned earlier, you can think of the weight or rank of a term as representing the degree to which it is part of a larger concept or category. Consider the name of the CEO of a company. Let us say that the CEO is James Smith. His first name is fairly common. Thus if we find the term "James" it isn't that likely to represent the concept of our CEO in a document. The term "Smith" likewise isn't that likely to represent our concept. If the phrase "James Smith" occurs, it is more likely to be our CEO, but might well actually be someone else entirely. We can quickly see that choosing a proper rank is quite important.
Once you have a rank, though, it is important to see how functions and operators deal with it. For instance, consider our above example. Let us say that we've assigned weights of 0.1 to each individual name and a weight of 0.5 to the full name. Further, let us say that for the name of the company, Gizmo, we've assigned a weight of 0.8. (After all, there are other companies with similar names.) The concept of the CEO of Gizmo will be made up of the concept of the CEO's name and the concept of the company name. How do the weights making up our sub-concepts relate to this new concept?
Rather than forcing you into picking just one method of dealing with the underlying weights of terms, we've decided to provide several distinct ranking methods. Each of these methods is appropriate for specific needs and relationships. For instance, the simplest method is to discard the underlying weights and make the new weight 1.
Consider our example. Let us say that we wish to determine the concept of Gizmo's CEO by all occurrences of his name within 5 words of the company name. Now let us suppose that "James Smith" occurs near "Gizmo" and we want to know the weight of this new occurrence. What is it? One way to deal with it is to say that because both concepts occurred, we know that it is the right CEO. In this case we want the weight to be 1, as a weight of 1 means that we know absolutely that it is the right category. However, is this really what we want? A better way might be to recognize that we only know the concept as well as we know the underlying concepts. In this case we might take the maximum weight of the concepts making it up. Thus we'd have a weight of 0.8, since that is the higher of the two weights (0.5 for the full name and 0.8 for the company name). We have other methods as well which arise out of the mathematics of probability and logic. In this case, for instance, we'd probably want to use the method utilizing probabilities. If both "James Smith" and "Gizmo" occurred, then the probability that it is the CEO is 0.9. We won't go through the details of the mathematics. You can find that in the query language manual.
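The 0.9 figure can be reproduced with a short calculation. This section does not state the formula, so the sketch below assumes one common rule: treat each weight as independent evidence and combine the weights as 1 minus the product of their complements. The function name `combine_probabilistic` is hypothetical, and the query language manual remains the reference for what the engine actually computes.

```python
from typing import Iterable


def combine_probabilistic(weights: Iterable[float]) -> float:
    """Combine weights as independent evidence: 1 - product of (1 - w)."""
    remaining_doubt = 1.0
    for w in weights:
        remaining_doubt *= (1.0 - w)
    return 1.0 - remaining_doubt


# "James Smith" has weight 0.5 and "Gizmo" has weight 0.8 (values from the text).
ceo_weight = combine_probabilistic([0.5, 0.8])
print(round(ceo_weight, 3))  # 0.9
```

Under this assumption the arithmetic is 1 - (1 - 0.5)(1 - 0.8) = 1 - 0.1 = 0.9, which matches the figure given for the example.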
We have several ranking methods. The first is called the existential method. Basically this is a method that simply determines whether the underlying categories, concepts or terms exist or not. If they do, then the rank is 1. If they don't then it is 0. The way to think about this method is to think that either your concept is present or it isn't. This method is the way most traditional indexers work. Either you find what you're looking for or you don't. While this can be useful in certain circumstances, often you want to use something more sophisticated.
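A minimal sketch of the existential method as described above: the underlying weights are ignored and only presence counts. The function name `existential_rank`, the sample terms, and the representation of a document as a set of found terms are assumptions made for illustration.

```python
from typing import Iterable, Set


def existential_rank(required: Iterable[str], found: Set[str]) -> float:
    """Return 1.0 if every required term or concept was found, else 0.0."""
    return 1.0 if all(term in found for term in required) else 0.0


# Terms and concepts detected in a hypothetical document.
found_in_document = {"James Smith", "Gizmo", "stock"}
print(existential_rank(["James Smith", "Gizmo"], found_in_document))  # 1.0
print(existential_rank(["James Smith", "Acme"], found_in_document))   # 0.0
```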
The most common ranking method our clients use is the probabilistic method. This method considers each weight or rank as the probability that it is what you are looking for. Thus if a term is part of a larger query, it is the probability of that term being part of the larger query. To calculate the weight of a query made up of several terms, you simply calculate the probability that all the present terms would occur together. A variation of the probabilistic method is called the Bayesian method. It is very similar conceptually, but deals with how probable you think something is as opposed to knowing how probable it is. In practice this tends to be a distinction only significant to a few people. However, the underlying mathematics is different and therefore the resultant weights are also different.
Another method is the fuzzy logic method. This considers each weight the degree to which a term belongs to a given category. In practice this isn't that different from thinking of it as a probability. However, when we calculate the weight of several terms all occurring, we pick out the weight of the most significant term (the highest weight). If we are calculating the weight of any term being present, it is the lowest weight present. Once again we won't go through here why fuzzy logic works this way. You'll find, however, that this method of ranking is very helpful and fits in many circumstances. The way to think about fuzzy logic is by thinking in terms of the significance of terms. If you want all terms you are dealing with the most significant term. If you want any term you are dealing with the least significant term.
We have several other more sophisticated ranking methods and also many other functions. However, those tend to be for more specialized needs. We won't deal with those in this manual. We encourage you to read through the Parser Language manual for an extended discussion of those features. The important thing to realize is that the Profiling Engine gives you a great deal of flexibility with ranking. Further, the ranking is always tied to relatively easy-to-understand ideas that relate terms to concepts or categories. We have tried to make it so that weights and ranks are tied to thinking about documents in terms of categories or concepts. When you think about your analysis in this more natural way, you'll quickly find that you are better able to understand what you are doing. Further, you can easily move from idea to query without much difficulty.