Message-ID:
<359a92830608291046l4cd85347ice7e5f75af63ad72@mail.gmail.com>
Date: Tue, 29 Aug 2006 13:46:12 -0400
From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: Installing a custom tokenizer
References: <44F3F64B.8030709@sirma.bg>

I'm in a real rush here, so pardon my brevity, but...

One of the constructors for IndexWriter takes an Analyzer as a parameter,
and that Analyzer can be a PerFieldAnalyzerWrapper. If I understand your
issue, that should fix you right up. The same kind of thing applies for a
Query.

Erick

On 8/29/06, Bill Taylor wrote:
>
> I am indexing documents which are filled with government jargon. As
> one would expect, the standard tokenizer has problems with
> governmentese.
>
> In particular, the documents use words such as 310N-P-Q as references
> to other documents. The standard tokenizer breaks this "word" at the
> dashes so that I can find P or Q but not the entire token.
>
> I know how to write a new tokenizer. I would like hints on how to
> install it and get my indexing system to use it. I don't want to
> modify the standard .jar file. What I think I want to do is set up my
> indexing operation to use the WhitespaceTokenizer instead of the normal
> one, but I am unsure how to do this.
>
> I know that the IndexTask has a setAnalyzer method. The document
> formats are rather complicated and I need special code to isolate the
> text strings which should be indexed. My file analyzer isolates the
> string I want to index, then does
>
> doc.add(new Field(DocFormatters.CONTENT_FIELD, ,
> Field.Store.YES, Field.Index.TOKENIZED));
>
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?
>
> Thanks.
>
> Bill Taylor
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
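
To make the per-field idea in the reply concrete, here is a minimal plain-Java sketch of what a wrapper like PerFieldAnalyzerWrapper does conceptually: it keeps a default tokenizer and lets individual fields override it. This is not Lucene code and does not use the Lucene API. The class name, the method names, and the "content" and "title" field names are all hypothetical, and the punctuation-splitting function only roughly imitates the behavior Bill describes, not the real StandardTokenizer grammar. In Lucene itself, the corresponding step would be handing the wrapper to the IndexWriter constructor, as the reply suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: models per-field tokenizer selection
// in plain Java. None of these names come from Lucene.
class PerFieldTokenizingSketch {

    // Stand-in for whitespace tokenization: split on runs of whitespace only,
    // so a reference such as 310N-P-Q survives as one token.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Rough stand-in for a tokenizer that breaks at dashes (assumption:
    // splits on every character that is not a letter or a digit).
    static List<String> punctuationSplittingTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    private final Function<String, List<String>> defaultTokenizer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldTokenizingSketch(Function<String, List<String>> defaultTokenizer) {
        this.defaultTokenizer = defaultTokenizer;
    }

    // Register a tokenizer that overrides the default for one field.
    void addTokenizer(String field, Function<String, List<String>> tokenizer) {
        perField.put(field, tokenizer);
    }

    // Use the field-specific tokenizer if one is registered, else the default.
    List<String> tokenize(String field, String text) {
        return perField.getOrDefault(field, defaultTokenizer).apply(text);
    }

    public static void main(String[] args) {
        PerFieldTokenizingSketch wrapper =
            new PerFieldTokenizingSketch(PerFieldTokenizingSketch::punctuationSplittingTokens);
        wrapper.addTokenizer("content", PerFieldTokenizingSketch::whitespaceTokens);

        String text = "See 310N-P-Q for details";
        System.out.println(wrapper.tokenize("content", text)); // whitespace rules
        System.out.println(wrapper.tokenize("title", text));   // default rules
    }
}
```

With this setup the "content" field keeps 310N-P-Q intact, while any other field falls back to the default and sees it split into 310N, P, and Q. The same wrapper would have to be used at query time, otherwise the query text is tokenized differently from the indexed text and the full reference still will not match.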