lucene-java-user mailing list archives

From jm <jmugur...@gmail.com>
Subject Re: are long words split into up to 256 long tokens?
Date Wed, 21 Apr 2010 13:58:33 GMT
Oh, yes, it does extend CharTokenizer... thanks, Ahmet. I had searched
the Lucene source code for 256 and found nothing suspicious, which was
itself suspicious because it so clearly looked like an internal limit.
Of course, I should have searched for 255...

I'll see how I proceed, because I don't want to use a custom Lucene build.
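
One option I'm looking at, to avoid patching Lucene itself: copy the
tokenizer code into my own project under a different name, raise the
cap there, and let LengthFilter drop over-long words whole. A rough
sketch follows; LongWordTokenizer is a made-up name for a private copy
of my CharTokenizer-based tokenizer with MAX_WORD_LEN raised, and the
min/max limits are illustrative. Only Analyzer, TokenStream and
LengthFilter are real Lucene 3.0.1 classes here.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.TokenStream;

public class DropLongWordsAnalyzer extends Analyzer {
    // Illustrative limits: keep words of 2..64 chars, drop the rest whole.
    private static final int MIN_LEN = 2;
    private static final int MAX_LEN = 64;

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Hypothetical private copy of the CharTokenizer-based tokenizer,
        // with MAX_WORD_LEN raised above 255 so a long word reaches the
        // filter in one piece instead of as 255-char fragments.
        TokenStream ts = new LongWordTokenizer(reader);
        // LengthFilter (Lucene 3.0.1) then discards any token whose length
        // falls outside [MIN_LEN, MAX_LEN], i.e. the whole long word.
        return new LengthFilter(ts, MIN_LEN, MAX_LEN);
    }
}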

On Wed, Apr 21, 2010 at 3:50 PM, Ahmet Arslan <iorixxx@yahoo.com> wrote:
>> Is 256 some inner maximum in some Lucene internal that causes this?
>> What is happening is that the long word is split into smaller words
>> of up to 256 characters, and then the min and max limits are applied.
>> Is that correct? I have removed LengthFilter and still see the
>> splitting at 256 happen. I would like not to have this, and instead
>> to remove altogether any word longer than max, without decomposing
>> it into smaller ones. Is there a way to achieve this?
>>
>> Using Lucene 3.0.1
>
>
> Assuming your Tokenizer extends CharTokenizer:
>
> CharTokenizer.java has this field:
> private static final int MAX_WORD_LEN = 255;
>
> You can modify CharTokenizer.java according to your needs.
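For reference, the field Ahmet points to in the 3.0.1 source is the one
below; in a private copy of the class, raising it above the longest word
you expect keeps long words in a single token until a filter can see
them (the 4096 is an illustrative value, not a recommendation):

// CharTokenizer.java (Lucene 3.0.1): the cap behind the 255-char splits.
// Original value:
//     private static final int MAX_WORD_LEN = 255;
// In a private copy, raised above the longest expected word:
private static final int MAX_WORD_LEN = 4096;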

