Profiling Engine SDK
Precalculating and Reusing Queries
As we mentioned in our introduction, quite a bit of the value that our clients add to the Profiling engine is found in their pre-built categories. Accurate queries are quite difficult to construct. This is where categorization and routing industries really differ from the typical users of an index. A router or categorizer needs accuracy far more than a user simply searching for a web page. When a typical user searches for a web page they are able to go through the results and decide which are really wanted. With a router or categorizer this process is automated. There isn't a person who is able to go through the query results and discard wrong answers. This need for accuracy means that queries have to become quite complex. In a sense you are trying to automate many features of the decision making process in a human mind.
As queries increase in complexity and sophistication, programmers find that they need to structure the query. This structure organizes their queries so they can understand what each part of the query does. As we've mentioned, with our Profiler it is best to think in terms of high-level categories or concepts. Each category or conceptual representation is made up of smaller concepts. Those in turn are made up of other concepts. As you break your queries up into these smaller sub-queries you will quickly find that the smaller pieces appear in many places. Breaking your categories up in terms of other concepts can get quite confusing. It is very helpful to be able to name the queries making up these sub-concepts. The Lextek Profiling Engine allows you to do this.
As you refer to sub-queries in terms of the concept names they represent, your queries become much more readable. When you refer to a name you don't need to retype the sub-query each time you use it. Most importantly, naming your queries means you can easily see the relationships that make up your overall query. This is immensely helpful when a new employee has to understand the queries someone else has written. It allows you to tie your queries to what they represent, rather than leaving them as an unmanageable mess of operators and terms. It also lets you debug your queries more easily. Anyone who has tried to go through a complex query and understand what is wrong will quickly appreciate this. When your queries are hundreds of pages long this becomes almost essential.
Not only does naming queries allow one to understand the underlying processes, it also encourages you to think in terms of concepts and relationships. This, we've found, can significantly improve your ability to create accurate and useful queries.
What we've just described is very much like the way we write computer programs. Because of the similarity with programming we've used the programming metaphor for writing reusable queries.
Functions are named queries that are re-evaluated every time you call them. This means that you can write very complex queries, pass them to the Profiler and then reuse them even when you've indexed new documents. The Profiler is intelligent enough to know if the results of a function may have changed. This means that the Profiler can use functions to optimize your queries for greater speed. If you have a sub-query that is used in many places, the Profiler only has to calculate the results of that sub-query once. Without named sub-queries, the sub-query would have to be recalculated each time its results are used. The resulting speed increase can be dramatic: for some complex queries it is more than ten-fold.
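One way to picture the reuse described above is the following minimal Python sketch. It is not the Profiler's API or its actual implementation: the TinyIndex and NamedQueries classes, the version counter, and the query names are all hypothetical, and the sketch assumes that "knowing whether results may have changed" can be modeled by comparing an index version number.

```python
from typing import Callable, Dict, Set, Tuple


class TinyIndex:
    """Toy inverted index (term -> document ids); not the Profiler's API."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[int]] = {}
        self.version = 0  # bumped each time a document is indexed

    def add(self, doc_id: int, text: str) -> None:
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(doc_id)
        self.version += 1

    def term(self, word: str) -> Set[int]:
        return set(self.postings.get(word, ()))


class NamedQueries:
    """Hypothetical registry of named sub-queries, cached per index version."""

    def __init__(self, index: TinyIndex) -> None:
        self.index = index
        self.bodies: Dict[str, Callable[["NamedQueries"], Set[int]]] = {}
        self.cache: Dict[str, Tuple[int, Set[int]]] = {}
        self.evaluations = 0  # counts how often a query body actually runs

    def define(self, name: str, body: Callable[["NamedQueries"], Set[int]]) -> None:
        self.bodies[name] = body

    def call(self, name: str) -> Set[int]:
        cached = self.cache.get(name)
        if cached is not None and cached[0] == self.index.version:
            return cached[1]  # index unchanged, so reuse the earlier result
        self.evaluations += 1
        result = self.bodies[name](self)
        self.cache[name] = (self.index.version, result)
        return result


index = TinyIndex()
index.add(1, "the cat sat on the mat")
index.add(2, "a dog chased the cat")
index.add(3, "stock prices fell")

queries = NamedQueries(index)
queries.define("pets", lambda q: q.index.term("cat") | q.index.term("dog"))
queries.define("pets_indoors", lambda q: q.call("pets") & q.index.term("mat"))
queries.define("pets_chasing", lambda q: q.call("pets") & q.index.term("chased"))

indoors = queries.call("pets_indoors")   # {1}
chasing = queries.call("pets_chasing")   # {2}
first_pass = queries.evaluations         # 3: "pets" ran once, not twice

index.add(4, "another cat on a mat")     # new document invalidates the cache
refreshed = queries.call("pets_indoors") # {1, 4}
```

In this sketch the shared sub-query "pets" is computed once for both categories, and everything is recomputed only after a new document is indexed.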
Even more valuable is that functions delay the evaluation of a query until it is needed. Imagine a complex set of categories and concepts that you've created to describe an entire set of knowledge. Probably only a very small number of the categories you've created really apply to any given document. In many indexers, when you run your query, every part of the query is run against the index. If only 5% of your query is really applicable then this is an enormous waste of time and resources. With functions, only those sub-queries that are needed are ever evaluated. This means that you don't need to break your queries up yourself so that they are only evaluated when appropriate. We do it for you. This lets you focus on describing conceptual relationships and categories, and lets the Profiler be intelligent enough to know what to evaluate.
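The delayed evaluation described above can be illustrated with a small Python sketch. It is only an analogy, not the Profiler's mechanism: the LazyQuery class, the category names, and the sample document are hypothetical, and Python's short-circuiting "and" stands in for whatever the Profiler does internally to skip sub-queries that cannot matter.

```python
from typing import Callable, List, Optional

evaluated: List[str] = []  # records which named sub-queries actually ran


class LazyQuery:
    """Hypothetical named query whose body runs only on first use."""

    def __init__(self, name: str, body: Callable[[], bool]) -> None:
        self.name = name
        self.body = body
        self.result: Optional[bool] = None

    def value(self) -> bool:
        if self.result is None:
            evaluated.append(self.name)
            self.result = self.body()
        return self.result


# A single incoming document, reduced to its set of words.
words = set("quarterly earnings rose as the stock rallied".split())

finance = LazyQuery("finance", lambda: "stock" in words or "earnings" in words)
sports = LazyQuery("sports", lambda: "goal" in words or "match" in words)
injury = LazyQuery("injury", lambda: "injured" in words or "sprain" in words)

# Top-level categories are built from the named concepts above.
finance_report = LazyQuery(
    "finance_report", lambda: finance.value() and "quarterly" in words
)
sports_injury = LazyQuery(
    "sports_injury", lambda: sports.value() and injury.value()
)

matches = [q.name for q in (finance_report, sports_injury) if q.value()]
# matches   -> ["finance_report"]
# evaluated -> ["finance_report", "finance", "sports_injury", "sports"]
# "injury" never ran, because "sports" had already ruled the category out.
```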
We have found that not only does this improve speed, but it also helps project managers organize their categories. It leads to thinking in terms of large libraries that you can use over and over again. The libraries are described in terms of concepts. In a sense the "query" part of the categorization process is largely hidden. This, in turn, can help improve your productivity as you develop your routing descriptions or categories.
Index terms are the basic unit of the query processor. Each one represents a term that you indexed in a document. Because we support Unicode and byte streams, you can write a term either in hexadecimal format or as a word within single quotes. Thus the word cat could be entered as 0x636174 (the hex stream for the letters c, a and t) or as 'cat'. The hexadecimal format may seem bulky and unwieldy at first, but it can be extremely powerful. Consider, for instance, working with foreign languages: byte streams let you deal easily with Unicode or other character sets. They also allow you to write routers that work not only on text, but also on byte streams. This can be important if you are using the Profiler to categorize application binaries or e-mail attachments. You can even use the Profiler to help detect viruses or work with n-gram analysis of language.
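The relationship between the quoted and hexadecimal forms can be checked with a few lines of Python. This is a sketch for producing hex strings outside the Profiler, not part of the SDK: the helper names are hypothetical, and the use of UTF-8 for non-ASCII text is an assumption, since the encoding must match whatever bytes were actually indexed.

```python
def to_hex_term(term: bytes) -> str:
    """Render raw term bytes in the 0x... form shown above (hypothetical helper)."""
    return "0x" + term.hex()


def from_hex_term(hex_term: str) -> bytes:
    """Recover the raw bytes from a 0x... term (hypothetical helper)."""
    if not hex_term.startswith("0x"):
        raise ValueError("expected a 0x prefix")
    return bytes.fromhex(hex_term[2:])


cat_hex = to_hex_term("cat".encode("ascii"))       # "0x636174"
cafe_hex = to_hex_term("caf\u00e9".encode("utf-8"))  # "0x636166c3a9" (UTF-8 assumed)
header_hex = to_hex_term(bytes([0x4D, 0x5A]))      # "0x4d5a", two arbitrary binary bytes

round_trip = from_hex_term(cat_hex).decode("ascii")  # "cat"
```

The same helper works for text and for arbitrary binary data, which is the property that makes the hexadecimal form useful for attachments or binaries.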
To learn about the specifics of the language we encourage you to read through the Query Language Manual. It goes through all the functions and operators in the language along with the syntax for writing queries.