Onix Text Retrieval Toolkit
API Reference

ROBOTS.TXT

If you are using Onix to build an internet or intranet webcrawler and indexer, Onix provides the functionality to parse robots.txt files, the standard mechanism by which a site tells webcrawlers which files and directories they may crawl and index.

The specification for the robots.txt standard is located at:

http://info.webcrawler.com/mak/projects/robots/norobots.html

It details how to write a robots.txt parser as well as the format of the robots.txt file itself.

Onix also allows you to output a "compact" form of robots.txt using ixOutputCompactRobotsTxt(). This lets you save a shortened copy of the robots.txt file containing only the portions that apply to your webcrawler.

After creating the robots.txt parser with a call to ixCreateRobotsTxtParser(), set your webcrawler's name with a call to ixSetRobotName(). The matcher uses this name to separate the instructions in the robots.txt file that are directed at your webcrawler from those directed at other webcrawlers.
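
As an illustration, a minimal sketch of this setup is shown below. The handle type, parameter lists, and the crawler name are assumptions made for the example; the authoritative declarations are in the Onix header and on each function's reference page.

    /* Sketch only: the handle type and signatures below are assumed
       for illustration; the real declarations come from the Onix header. */
    typedef void *RobotsTxtParser;
    RobotsTxtParser ixCreateRobotsTxtParser(void);
    void            ixSetRobotName(RobotsTxtParser parser, const char *robotName);

    /* Create a parser and tell it which User-agent name our crawler uses. */
    RobotsTxtParser create_parser_for_crawler(void)
    {
        RobotsTxtParser parser = ixCreateRobotsTxtParser();
        ixSetRobotName(parser, "MyCrawler");   /* your crawler's name */
        return parser;
    }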

You can test whether a directory or URL is eligible for crawling and indexing with calls to ixRobotsPermissionGranted() and ixRobotsPermissionGrantedFullURL().

When you are finished using the robots.txt parser, you may delete it by a call to ixDeleteRobotsTxtParser().
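
Putting the remaining steps together, the following sketch continues from the setup example above: it checks permissions, saves a compact copy, and deletes the parser. Again, the parameter lists are assumptions made for illustration, and the call that feeds the downloaded robots.txt text into the parser is not shown here; consult each function's reference page for the actual signatures.

    #include <stdio.h>

    /* Sketch continuing the setup example above; parameter lists are assumed. */
    int  ixRobotsPermissionGranted(RobotsTxtParser parser, const char *path);
    int  ixRobotsPermissionGrantedFullURL(RobotsTxtParser parser, const char *url);
    void ixOutputCompactRobotsTxt(RobotsTxtParser parser, const char *fileName);
    void ixDeleteRobotsTxtParser(RobotsTxtParser parser);

    void check_and_cleanup(RobotsTxtParser parser)
    {
        /* Ask whether a directory or a full URL may be crawled and indexed. */
        if (ixRobotsPermissionGranted(parser, "/docs/"))
            printf("/docs/ may be crawled and indexed\n");

        if (ixRobotsPermissionGrantedFullURL(parser, "http://www.example.com/docs/index.html"))
            printf("that page may be crawled and indexed\n");

        /* Optionally keep a compact copy containing only the rules that
           apply to this crawler (the file-name parameter is assumed). */
        ixOutputCompactRobotsTxt(parser, "robots-compact.txt");

        /* Release the parser when finished with this site's robots.txt. */
        ixDeleteRobotsTxtParser(parser);
    }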

See Also

ixProcessRecordID, ixRetrieveRecordID, ixFindRecordID