It often makes sense not to index "stopwords" during the indexing process. Stopwords are words that carry very little informational content, such as: and, the, of, it, as, may, that, a, an, off, etc.
Studies have shown that removing stopwords from the index can reduce index size without significantly affecting the accuracy of a user's query. Care must be taken, however, to take the user's needs into account. For example, the phrase "to be or not to be" from Hamlet is composed entirely of stopwords. Most Internet search engines eliminate all stopwords from their indexes. Eliminating stopwords typically reduces the index size by about 33% for a word-level index. For a record-level index or IDF-level index, stopwords are not typically eliminated, as they do not add significantly to the index size.
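The filtering step described above can be sketched as follows. This is a minimal illustration in Python, not Onix code: the names STOPWORDS, tokenize, and index_terms are hypothetical, and the word list is the sample list from this page plus a few extra words (to, be, or, not, is) added for the example. Each kept word retains its original position, which is one possible design choice for a word-level index.

```python
import re

# Hypothetical stopword list for illustration only; it is NOT one of the
# lists shipped with Onix. "to", "be", "or", "not", "is" are added here
# so that the Hamlet example behaves as described in the text.
STOPWORDS = frozenset({
    "and", "the", "of", "it", "as", "may", "that", "a", "an", "off",
    "to", "be", "or", "not", "is",
})

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def index_terms(text, stopwords=STOPWORDS):
    """Return (position, word) pairs for every token that is not a stopword.

    Positions are those of the original token stream, so the gaps left by
    removed stopwords remain visible to the index.
    """
    return [(pos, word)
            for pos, word in enumerate(tokenize(text))
            if word not in stopwords]

kept = index_terms("The quality of mercy is not strained")
hamlet = index_terms("To be or not to be")

print(kept)    # three of the seven words survive
print(hamlet)  # nothing survives: the phrase is all stopwords
```

The second call shows the risk mentioned above: with this list, the Hamlet phrase contributes nothing to the index.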
NOTE: If stopwords are not indexed, then to avoid confusing the user, it is advisable to preprocess the user's queries to remove the stopwords from them as well.
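One way to preprocess a query is sketched below in Python. It is an assumption-laden illustration rather than part of Onix: strip_stopwords and the STOPWORDS list are hypothetical names, and the query is split on whitespace only. The function also returns the words it removed, so the application can tell the user which words were ignored or warn when nothing is left to search for.

```python
# Hypothetical stopword list for illustration; substitute the list that
# was actually used when the index was built.
STOPWORDS = frozenset({
    "and", "the", "of", "it", "as", "may", "that", "a", "an", "off",
    "to", "be", "or", "not",
})

def strip_stopwords(query, stopwords=STOPWORDS):
    """Split a query on whitespace and drop stopwords.

    Returns (cleaned_query, removed_words). Matching ignores case and
    surrounding punctuation; the kept words are returned unchanged.
    """
    kept, removed = [], []
    for word in query.split():
        key = word.lower().strip(".,;:!?\"'()")
        if key in stopwords:
            removed.append(word)
        else:
            kept.append(word)
    return " ".join(kept), removed

cleaned, removed = strip_stopwords("the history of the printing press")
empty, all_removed = strip_stopwords("To be or not to be")

if not empty:
    # Every word was a stopword; report this instead of running the query.
    print("Query contains only stopwords:", all_removed)
```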
Onix provides a range of functions to assist in keeping track of stopwords and removing them from the index stream. Onix also provides two different stopword lists (Stopword List 1, Stopword List 2) for you to use. You may want to select words from one or both lists to create your own stopword list. The various references on the Brown Corpus (a standardized corpus of English text) are useful when building your own stopword list, and a variety of linguistics references for the Brown Corpus are available on the World Wide Web if you want to go that route. (Most people do quite well either with one of the two lists provided or with a selection of words from the two.)
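Selecting words from two lists to form a custom list might look like the following Python sketch. The function build_stopword_list and the contents of list_1 and list_2 are hypothetical; the short lists stand in for the two provided lists, whose actual contents are not reproduced here. The keep parameter names words that should stay searchable even though a source list contains them.

```python
def build_stopword_list(list_a, list_b, keep=(), extra=()):
    """Merge two stopword lists into one sorted, de-duplicated list.

    Words in `keep` are removed from the result so they remain indexed;
    words in `extra` are added. All words are normalized to lowercase.
    """
    merged = {w.strip().lower()
              for w in list(list_a) + list(list_b)
              if w.strip()}
    merged -= {w.lower() for w in keep}
    merged |= {w.lower() for w in extra}
    return sorted(merged)

# Placeholder data for illustration only.
list_1 = ["a", "an", "and", "the", "of", "to", "be"]
list_2 = ["The", "of", "off", "it", "not", "or"]

# Keep the words of "to be or not to be" searchable.
custom = build_stopword_list(list_1, list_2,
                             keep=["to", "be", "or", "not"])
print(custom)
```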