Subject: RE: Installing a custom tokenizer
From: "Krovi, DVSR_Sarma"
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Date: Tue, 29 Aug 2006 20:20:01 +0530
Message-ID: <33432A11DBA32B4EACBF6E37C5671DFA04F4EF@mailhyd2.hyd.deshaw.com>

> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?

You basically need to write your own Tokenizer (you can always write a
corresponding JavaCC grammar; compiling it will give you the Tokenizer).
Then extend the org.apache.lucene.analysis.Analyzer class and override
its tokenStream() method. Wherever you index or search, use an instance
of this custom Analyzer:

public class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new MyTokenizer(reader);
        // Pass this token stream through other filters you are interested in
        return ts;
    }
}

Krovi.

-----Original Message-----
From: Bill Taylor [mailto:wataylor@as-st.com]
Sent: Tuesday, August 29, 2006 8:10 PM
To: java-user@lucene.apache.org
Subject: Installing a custom tokenizer

I am indexing documents which are filled with government jargon. As one
would expect, the standard tokenizer has problems with governmenteese.

In particular, the documents use words such as 310N-P-Q as references to
other documents. The standard tokenizer breaks this "word" at the dashes
so that I can find P or Q but not the entire token.

I know how to write a new tokenizer. I would like hints on how to
install it and get my indexing system to use it. I don't want to modify
the standard .jar file. What I think I want to do is set up my indexing
operation to use the WhitespaceTokenizer instead of the normal one, but
I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method. The document formats
are rather complicated and I need special code to isolate the text
strings which should be indexed.
My file analyzer isolates the string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, ,
                  Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a
different tokenizer. Can anyone help?

Thanks.

Bill Taylor

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
---------------------------------------------------------------------
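[Archive editor's note: the heart of the problem above is that the standard tokenizer splits a reference like 310N-P-Q at the dashes, while a whitespace-based tokenizer keeps it whole. The following is a minimal, Lucene-free sketch of that difference; the split rules only approximate what StandardTokenizer and WhitespaceTokenizer actually do, and the class and method names are made up for illustration.]

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {
    // Roughly what the standard tokenizer does to "310N-P-Q":
    // it breaks on whitespace AND on the dashes.
    static List<String> standardLike(String text) {
        return Arrays.asList(text.split("[\\s\\-]+"));
    }

    // Roughly what a whitespace tokenizer does: it breaks on whitespace
    // only, so document references like "310N-P-Q" survive as one token.
    static List<String> whitespaceLike(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "see document 310N-P-Q for details";
        System.out.println(standardLike(text));
        // [see, document, 310N, P, Q, for, details]
        System.out.println(whitespaceLike(text));
        // [see, document, 310N-P-Q, for, details]
    }
}
```

An Analyzer built on the whitespace-only rule, installed as in Krovi's reply, would therefore make "310N-P-Q" searchable as a single term.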