Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <005e01cae15d$88164750$9842d5f0$@de>
References: <j2vf071061b1004210620k7bbf23cbi21ea63a89670735b@mail.gmail.com>
	 <470047.33619.qm@web52908.mail.re2.yahoo.com>
	 <y2gf071061b1004210658sa9583ce3uca62d4b349399bcb@mail.gmail.com>
	 <005e01cae15d$88164750$9842d5f0$@de>
Date: Wed, 21 Apr 2010 16:31:03 +0200
Message-ID: <w2wf071061b1004210731p49ff73c2q26cbd73a58c77c5d@mail.gmail.com>
Subject: Re: are long words split into up to 256 long tokens?
From: jm <jmuguruza@gmail.com>
To: java-user@lucene.apache.org
Content-Type: text/plain; charset=ISO-8859-1

OK, filed as https://issues.apache.org/jira/browse/LUCENE-2407
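
For the archives, here is a rough sketch of the behaviour discussed below.
It does not use Lucene at all. splitAtLimit is a simplified stand-in for a
CharTokenizer-style loop, with 255 taken from the MAX_WORD_LEN field quoted
further down, and dropOverlong is a hypothetical alternative that discards
an over-long word instead of cutting it into pieces. The class and method
names are made up for illustration and are not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

public class LongWordSplitDemo {
    // Same value as the private constant quoted from CharTokenizer.java.
    static final int MAX_WORD_LEN = 255;

    // Simplified stand-in (not Lucene's code): letters accumulate into a
    // token, which is emitted at a non-letter or as soon as it reaches
    // maxLen characters. The forced emit is what splits long words.
    static List<String> splitAtLimit(String text, int maxLen) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                buf.append(c);
                if (buf.length() == maxLen) { // buffer full: forced split
                    tokens.add(buf.toString());
                    buf.setLength(0);
                }
            } else if (buf.length() > 0) {    // word boundary
                tokens.add(buf.toString());
                buf.setLength(0);
            }
        }
        if (buf.length() > 0) {
            tokens.add(buf.toString());
        }
        return tokens;
    }

    // Hypothetical alternative: find the whole run of letters first, then
    // keep it only if it is no longer than maxLen. Nothing is split.
    static List<String> dropOverlong(String text, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current word began, or -1
        for (int i = 0; i <= text.length(); i++) {
            boolean letter = i < text.length() && Character.isLetter(text.charAt(i));
            if (letter) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start <= maxLen) {
                    tokens.add(text.substring(start, i));
                }
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("short ");
        for (int i = 0; i < 600; i++) {
            sb.append('x');
        }
        sb.append(" tail");
        String text = sb.toString();

        for (String t : splitAtLimit(text, MAX_WORD_LEN)) {
            System.out.println(t.length());
        }
        System.out.println(dropOverlong(text, MAX_WORD_LEN));
    }
}
```

For the input "short", a 600-letter word, then "tail", the first method gives
tokens of length 5, 255, 255, 90 and 4, so a LengthFilter applied afterwards
only ever sees the pieces. The second gives just "short" and "tail".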

On Wed, Apr 21, 2010 at 4:18 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Can you open a bug report to make this configurable, so we don't forget it? For example, StandardTokenizer already lets you change its limit.
>
> Thanks,
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: jm [mailto:jmuguruza@gmail.com]
>> Sent: Wednesday, April 21, 2010 3:59 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: are long words split into up to 256 long tokens?
>>
>> Oh, yes, it does extend CharTokenizer. Thanks, Ahmet. I had searched the
>> Lucene source code for 256 and found nothing suspicious, which was
>> itself suspicious because it clearly looked like an internal limit. Of
>> course, I should have searched for 255...
>>
>> I'll see how to proceed, because I don't want to use a custom build.
>>
>> On Wed, Apr 21, 2010 at 3:50 PM, Ahmet Arslan <iorixxx@yahoo.com>
>> wrote:
>> >> Is 256 some internal maximum in Lucene that causes this? What seems
>> >> to be happening is that the long word is split into smaller words of
>> >> up to 256 characters, and then the min and max limits are applied. Is
>> >> that correct? I have removed LengthFilter and still see the splitting
>> >> at 256 happen. I would like to avoid this and remove altogether any
>> >> word longer than max, without decomposing it into smaller ones. Is
>> >> there a way to achieve this?
>> >>
>> >> Using lucene 3.0.1
>> >
>> >
>> > Assuming your Tokenizer extends CharTokenizer:
>> >
>> > CharTokenizer.java has this field:
>> > private static final int MAX_WORD_LEN = 255;
>> >
>> > You can modify CharTokenizer.java to suit your needs.
>> >
>> >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
