lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Installing a custom tokenizer
Date Tue, 29 Aug 2006 17:46:12 GMT
I'm in a real rush here, so pardon my brevity, but... one of the
constructors for IndexWriter takes an Analyzer as a parameter, and that
analyzer can be a PerFieldAnalyzerWrapper. That, if I understand your
issue, should fix you right up.
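If it helps, here's a minimal sketch of that wiring (assuming the Lucene
2.0-era API; the field name "contents" and the index path are placeholders
for whatever your setup uses):

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexWithPerFieldAnalyzer {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer for every field by default...
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        // ...but whitespace-only tokenization for the jargon-heavy field,
        // so tokens like 310N-P-Q survive intact.
        wrapper.addAnalyzer("contents", new WhitespaceAnalyzer());

        // The wrapper is just an Analyzer, so it drops into IndexWriter.
        IndexWriter writer = new IndexWriter("/tmp/myindex", wrapper, true);
        Document doc = new Document();
        doc.add(new Field("contents", "see document 310N-P-Q",
                Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}
```

No change to the standard jar is needed — the per-field choice lives entirely
in your own indexing code.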

Same kind of thing for a Query: QueryParser also takes an Analyzer, so the
query text gets tokenized the same way the field was at index time.
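For the search side, a sketch under the same assumptions — hand the same
wrapper to QueryParser so the query is tokenized the way the index was
built (again, "contents" is a placeholder field name):

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SearchWithPerFieldAnalyzer {
    public static void main(String[] args) throws Exception {
        // Same wrapper as at index time: whitespace tokenization for
        // "contents", StandardAnalyzer for everything else.
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("contents", new WhitespaceAnalyzer());

        // QueryParser runs the query text through the analyzer, so
        // 310N-P-Q stays a single term instead of splitting at the dashes.
        QueryParser parser = new QueryParser("contents", wrapper);
        Query q = parser.parse("310N-P-Q");
        System.out.println(q);
    }
}
```

The key point is symmetry: if index-time and query-time analysis disagree,
the single token 310N-P-Q in the index will never match the three tokens a
different analyzer would produce from the query.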


On 8/29/06, Bill Taylor <> wrote:
> I am indexing documents which are filled with government jargon.  As
> one would expect, the standard tokenizer has problems with
> governmenteese.
> In particular, the documents use words such as 310N-P-Q as references
> to other documents.  The standard tokenizer breaks this "word" at the
> dashes so that I can find P or Q but not the entire token.
> I know how to write a new tokenizer.  I would like hints on how to
> install it and get my indexing system to use it.  I don't want to
> modify the standard .jar file.  What I think I want to do is set up my
> indexing operation to use the WhitespaceTokenizer instead of the normal
> one, but I am unsure how to do this.
> I know that the IndexTask has a setAnalyzer method.  The document
> formats are rather complicated and I need special code to isolate the
> text strings which should be indexed.   My file analyzer isolates the
> string I want to index, then does
> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
> Field.Store.YES, Field.Index.TOKENIZED));
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer.  Can anyone help?
> Thanks.
> Bill Taylor
