Date: Wed, 21 Apr 2010 16:31:03 +0200
Subject: Re: are long words split into up to 256 long tokens?
From: jm
To: java-user@lucene.apache.org

ok
https://issues.apache.org/jira/browse/LUCENE-2407

On Wed, Apr 21, 2010 at 4:18 PM, Uwe Schindler wrote:
> Can you open a bug report to make this configurable, so we don't forget this? E.g. StandardTokenizer is able to change this.
>
> Thanks,
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: jm [mailto:jmuguruza@gmail.com]
>> Sent: Wednesday, April 21, 2010 3:59 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: are long words split into up to 256 long tokens?
>>
>> oh, yes it does extend CharTokenizer..thanks Ahmet. I had searched
>> lucene source code for 256 and found nothing suspicious, and that was
>> itself suspicious cause it looked clearly like an inner limit. Of
>> course I should have searched for 255...
>>
>> I'll see how I proceed cause I don't want to use a custom build.
>>
>> On Wed, Apr 21, 2010 at 3:50 PM, Ahmet Arslan wrote:
>> >> Is 256 some inner maximum too in some lucene internal that causes
>> >> this? What is happening is that the long word is split into smaller
>> >> words up to 256 and then the min and max limit applied. Is that correct?
>> >> I have removed LengthFilter and still see the splitting at 256
>> >> happen. I would like not to have this, and remove altogether any
>> >> word longer than max, without decomposing into smaller ones. Is
>> >> there a way to achieve this?
>> >>
>> >> Using lucene 3.0.1
>> >
>> > Assuming your Tokenizer extends CharTokenizer:
>> >
>> > CharTokenizer.java has this field:
>> > private static final int MAX_WORD_LEN = 255;
>> >
>> > you can modify CharTokenizer.java according to your needs.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
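
For anyone reading this thread later, here is a rough standalone sketch of the two behaviors discussed above. It does not use any Lucene classes. The class name LongTokenSketch and both method names are made up for illustration, whitespace is assumed to be the only token delimiter, and the cap is modeled as a hard cut at MAX_WORD_LEN characters, which is a simplification of whatever CharTokenizer does internally. The first method shows why a length filter applied afterwards only ever sees fragments; the second shows the behavior asked for in the thread, where an over-long word is dropped whole.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone model of the thread's problem; not Lucene code. */
class LongTokenSketch {
    // Same value as the CharTokenizer.MAX_WORD_LEN field quoted in the thread.
    static final int MAX_WORD_LEN = 255;

    /** Models a fixed-size token buffer: a full buffer is emitted as a token. */
    static List<String> splitAtBufferLimit(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                buf.append(c);
                if (buf.length() == MAX_WORD_LEN) {
                    // Buffer full: the word is cut here and continues as a new token.
                    tokens.add(buf.toString());
                    buf.setLength(0);
                }
            } else if (buf.length() > 0) {
                tokens.add(buf.toString());
                buf.setLength(0);
            }
        }
        if (buf.length() > 0) {
            tokens.add(buf.toString());
        }
        return tokens;
    }

    /** Measures each whole word first, then keeps it only if minLen <= length <= maxLen. */
    static List<String> dropLongWords(String text, int minLen, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current word began, or -1 between words
        for (int i = 0; i <= text.length(); i++) {
            boolean inWord = i < text.length() && !Character.isWhitespace(text.charAt(i));
            if (inWord) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                int len = i - start;
                if (len >= minLen && len <= maxLen) {
                    tokens.add(text.substring(start, i));
                }
                start = -1; // over-long or too-short words are skipped entirely
            }
        }
        return tokens;
    }
}
```

With an input made of a short word, a 600-character word, and another short word, splitAtBufferLimit returns five tokens (the long word becomes pieces of 255, 255 and 90 characters), while dropLongWords with a max of 255 returns only the two short words. In a real analyzer the equivalent would presumably need the tokenizer's limit to be configurable, which is what LUCENE-2407 asks for.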