lucene-java-user mailing list archives

From Bill Taylor <watay...@as-st.com>
Subject Re: Installing a custom tokenizer
Date Tue, 29 Aug 2006 16:07:30 GMT
Is there some way to get the standard Field constructor to use, say,
the WhitespaceTokenizer instead of the StandardTokenizer?
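
What I have in mind is roughly the following (just a sketch; indexDir
and text here are placeholders, and I'm assuming the IndexWriter
constructor that takes an Analyzer):

// whitespace-only tokenization for every TOKENIZED field in this index
IndexWriter writer = new IndexWriter(indexDir, new WhitespaceAnalyzer(), true);
Document doc = new Document();
doc.add(new Field(DocFormatters.CONTENT_FIELD, text,
    Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);   // 310N-P-Q would survive as a single token
writer.close();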

On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote:

>> I suspect that my issue is getting the Field constructor to use a
>> different tokenizer.  Can anyone help?
>
> You basically need to come up with your own Tokenizer (you can always
> write a corresponding JavaCC grammar; compiling it will give you the
> Tokenizer).
> Then extend the org.apache.lucene.analysis.Analyzer class and override
> its tokenStream() method. Finally, wherever you are indexing or
> searching, use an instance of this custom Analyzer.
> import java.io.Reader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
>
> public class MyAnalyzer extends Analyzer
> {
> 	public TokenStream tokenStream(String fieldName, Reader reader)
> 	{
> 		TokenStream ts = new MyTokenizer(reader);
> 		/* Pass this TokenStream through any other filters you are
> 		   interested in before returning it */
> 		return ts;
> 	}
> }
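>
> For the Tokenizer itself you don't have to use JavaCC; one simple
> option (just a sketch, assuming all you want is to keep '-' inside
> tokens) is to extend CharTokenizer:
>
> import java.io.Reader;
> import org.apache.lucene.analysis.CharTokenizer;
>
> public class MyTokenizer extends CharTokenizer
> {
> 	public MyTokenizer(Reader reader)
> 	{
> 		super(reader);
> 	}
>
> 	// Letters, digits and '-' all count as token characters, so a
> 	// reference like 310N-P-Q comes through as one token.
> 	protected boolean isTokenChar(char c)
> 	{
> 		return Character.isLetterOrDigit(c) || c == '-';
> 	}
> }
>
> Then hand a new MyAnalyzer() to your IndexWriter when indexing and to
> your QueryParser when searching, so both sides tokenize the same way.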
>
> Krovi.
>
> -----Original Message-----
> From: Bill Taylor [mailto:wataylor@as-st.com]
> Sent: Tuesday, August 29, 2006 8:10 PM
> To: java-user@lucene.apache.org
> Subject: Installing a custom tokenizer
>
> I am indexing documents which are filled with government jargon.  As
> one would expect, the standard tokenizer has problems with
> governmenteese.
>
> In particular, the documents use words such as 310N-P-Q as references
> to other documents.  The standard tokenizer breaks this "word" at the
> dashes so that I can find P or Q but not the entire token.
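>
> (To check exactly which tokens come out, a short loop like the
> following works; this is just a sketch assuming the TokenStream.next()
> / Token.termText() API:
>
> import java.io.StringReader;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>
> // inside a method that is declared to throw IOException
> TokenStream ts = new StandardAnalyzer().tokenStream(
>     DocFormatters.CONTENT_FIELD, new StringReader("see 310N-P-Q"));
> Token t;
> while ((t = ts.next()) != null)
>     System.out.println(t.termText());
>
> and the printout shows the reference broken into pieces.)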
>
> I know how to write a new tokenizer.  I would like hints on how to
> install it and get my indexing system to use it.  I don't want to
> modify the standard .jar file.  What I think I want to do is set up my
> indexing operation to use the WhitespaceTokenizer instead of the normal
> one, but I am unsure how to do this.
>
> I know that the IndexTask has a setAnalyzer method.  The document
> formats are rather complicated, so I need special code to isolate the
> text strings that should be indexed.  Once my file analyzer has
> extracted the string I want to index, it does
>
> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
> Field.Store.YES, Field.Index.TOKENIZED));
>
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer.  Can anyone help?
>
> Thanks.
>
> Bill Taylor
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

