lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ronnie Kolehmainen <ronnie.kolehmai...@ub.uu.se>
Subject Re: Installing a custom tokenizer
Date Tue, 29 Aug 2006 16:13:07 GMT
Have a look at PerFieldAnalyzerWrapper:


http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html


Citerar Bill Taylor <wataylor@as-st.com>:

> is there some way to get the standard Field constructor to use, say, 
> the Whitespace Tokenizer as opposed to the standard tokenizer?
> 
> On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote:
> 
> >> I suspect that my issue is getting the Field constructor to use a
> >> different tokenizer.  Can anyone help?
> >
> > You need to basically come up with your own Tokenizer (You can always
> > write a corresponding JavaCC grammar and compiling it would give the
> > Tokenizer)
> > Then you need to extend org.apache.lucene.analysis.Analyzer class and
> > override the tokenStream() method. Now, wherever you are
> > indexing/searching, use the object of this CustomAnalyzer.
> > Public class MyAnalyzer extended Analyzer
> > {
> > 	public TokenStream tokenStream(....)
> > 	{
> > 		TokenStream ts = null;
> > 		ts = new MyTokenizer(reader);
> > 		/* Pass this tokenstream through other filters you are
> > interested in */
> > 	}
> > }
> >
> > Krovi.
> >
> > -----Original Message-----
> > From: Bill Taylor [mailto:wataylor@as-st.com]
> > Sent: Tuesday, August 29, 2006 8:10 PM
> > To: java-user@lucene.apache.org
> > Subject: Installing a custom tokenizer
> >
> > I am indexing documents which are filled with government jargon.  As
> > one would expect, the standard tokenizer has problems with
> > governmenteese.
> >
> > In particular, the documents use words such as 310N-P-Q as references
> > to other documents.  The standard tokenizer breaks this "word" at the
> > dashes so that I can find P or Q but not the entire token.
> >
> > I know how to write a new tokenizer.  I would like hints on how to
> > install it and get my indexing system to use it.  I don't want to
> > modify the standard .jar file.  What I think I want to do is set up my
> > indexing operation to use the WhitespaceTokenizer instead of the normal
> > one, but I am unsure how to do this.
> >
> > I know that the IndexTask has a setAnalyzer method.  The document
> > formats are rather complicated and I need special code to isolate the
> > text strings which should be indexed.   My file analyzer isolates the
> > string I want to index, then does
> >
> > doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
> > Field.Store.YES, Field.index.TOKENIZED));
> >
> > I suspect that my issue is getting the Field constructor to use a
> > different tokenizer.  Can anyone help?
> >
> > Thanks.
> >
> > Bill Taylor
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message