Subject: RE: Installing a custom tokenizer
From: "Krovi, DVSR_Sarma"
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Date: Tue, 29 Aug 2006 20:20:01 +0530
Message-ID: <33432A11DBA32B4EACBF6E37C5671DFA04F4EF@mailhyd2.hyd.deshaw.com>

> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?

You basically need to write your own Tokenizer (you can always write a
corresponding JavaCC grammar; compiling it will give you the Tokenizer).
Then extend the org.apache.lucene.analysis.Analyzer class and override
its tokenStream() method. Wherever you index or search, use an instance
of this custom Analyzer:

public class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new MyTokenizer(reader);
        // Pass this token stream through other filters you are interested in
        return ts;
    }
}

Krovi.

-----Original Message-----
From: Bill Taylor [mailto:wataylor@as-st.com]
Sent: Tuesday, August 29, 2006 8:10 PM
To: java-user@lucene.apache.org
Subject: Installing a custom tokenizer

I am indexing documents which are filled with government jargon. As one
would expect, the standard tokenizer has problems with governmenteese.

In particular, the documents use words such as 310N-P-Q as references to
other documents. The standard tokenizer breaks this "word" at the dashes
so that I can find P or Q but not the entire token.

I know how to write a new tokenizer. I would like hints on how to
install it and get my indexing system to use it. I don't want to modify
the standard .jar file. What I think I want to do is set up my indexing
operation to use the WhitespaceTokenizer instead of the normal one, but
I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method. The document formats
are rather complicated and I need special code to isolate the text
strings which should be indexed.
My file analyzer isolates the string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, ,
                  Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a
different tokenizer. Can anyone help?

Thanks.

Bill Taylor

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
---------------------------------------------------------------------
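[Archive editor's note: the heart of the problem above is that the standard tokenizer splits a reference like 310N-P-Q at the dashes, while a whitespace-based tokenizer keeps it whole. The following is a minimal, Lucene-free sketch of that difference; the split rules only approximate what StandardTokenizer and WhitespaceTokenizer actually do, and the class and method names are made up for illustration.]

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {
    // Roughly what the standard tokenizer does to "310N-P-Q":
    // it breaks on whitespace AND on the dashes.
    static List<String> standardLike(String text) {
        return Arrays.asList(text.split("[\\s\\-]+"));
    }

    // Roughly what a whitespace tokenizer does: it breaks on whitespace
    // only, so document references like "310N-P-Q" survive as one token.
    static List<String> whitespaceLike(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "see document 310N-P-Q for details";
        System.out.println(standardLike(text));
        // [see, document, 310N, P, Q, for, details]
        System.out.println(whitespaceLike(text));
        // [see, document, 310N-P-Q, for, details]
    }
}
```

An Analyzer built on the whitespace-only rule, installed as in Krovi's reply, would therefore make "310N-P-Q" searchable as a single term.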