lucene-java-user mailing list archives

From Bill Taylor
Subject Installing a custom tokenizer
Date Tue, 29 Aug 2006 14:40:04 GMT
I am indexing documents which are filled with government jargon.  As 
one would expect, the standard tokenizer has problems with it.

In particular, the documents use words such as 310N-P-Q as references 
to other documents.  The standard tokenizer breaks this "word" at the 
dashes so that I can find P or Q but not the entire token.
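
To make the problem concrete, here is a minimal plain-Java sketch of the two behaviours.  It does not use Lucene at all: the class and method names are made up for this example, and the regular expression is only a rough stand-in for what the standard tokenizer seems to do with this one input, not its actual grammar.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSplitDemo {

    // Assumption: approximates the splitting I observe by breaking on
    // every run of characters that is not an ASCII letter or digit.
    static List<String> splitOnNonAlphanumerics(String text) {
        return nonEmpty(text.split("[^A-Za-z0-9]+"));
    }

    // Breaks only on whitespace, so embedded dashes stay inside the token.
    static List<String> splitOnWhitespace(String text) {
        return nonEmpty(text.trim().split("\\s+"));
    }

    // String.split can yield empty strings at the edges; drop them.
    private static List<String> nonEmpty(String[] parts) {
        List<String> out = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "See 310N-P-Q for details";
        System.out.println(splitOnNonAlphanumerics(text)); // [See, 310N, P, Q, for, details]
        System.out.println(splitOnWhitespace(text));       // [See, 310N-P-Q, for, details]
    }
}
```

The second result is what I want in the index: 310N-P-Q kept as a single searchable token.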

I know how to write a new tokenizer.  I would like hints on how to 
install it and get my indexing system to use it.  I don't want to 
modify the standard .jar file.  What I think I want to do is set up my 
indexing operation to use the WhitespaceTokenizer instead of the normal 
one, but I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method.  The document 
formats are rather complicated and I need special code to isolate the 
text strings which should be indexed.   My file analyzer isolates the 
string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>, 
Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a 
different tokenizer.  Can anyone help?


Bill Taylor
