lucene-java-user mailing list archives

From jm <jmugur...@gmail.com>
Subject are long words split into up to 256 long tokens?
Date Wed, 21 Apr 2010 13:20:39 GMT
I am analyzing this with my custom analyzer:
String s = "mail77 mail88888 tc ro45mine durante
jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddkkkkkkkkkkkkkkkkkkkkkkkkdssssssss230iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
juju";

My custom analyzer is basically:
        TokenStream result = new LowerCaseLetterNumberTokenizer(reader);
        result = new LengthFilter(result, 3, 256);
        result = new StopFilter(enablePositionIncrements, result, stopWords, true);

Depending on the value I pass as the max word length to LengthFilter,
I get different results:

max 30
1: [mail77:0->6:word]
2: [mail88888:7->16:word]
3: [tc:17->19:word]
4: [ro45mine:20->28:word]
5: [juju:334->338:word]

(this is what I expected)

max 70, 100 or 253
1: [mail77:0->6:word]
2: [mail88888:7->16:word]
3: [tc:17->19:word]
4: [ro45mine:20->28:word]
5: [iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii:292->333:word]
6: [juju:334->338:word]


max 256 or 270
1: [mail77:0->6:word]
2: [mail88888:7->16:word]
3: [tc:17->19:word]
4: [ro45mine:20->28:word]
5: [jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddkkkkkkkkkkkkkkkkkkkkkkkkdssssssss230iiiiiiiiiiiiiiiiiiiiiiiii:37->292:word]
6: [iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii:292->333:word]
7: [juju:334->338:word]

Token 5 has a length of 256. Is 256 also some internal maximum in
Lucene that causes this? What seems to be happening is that the long
word is split into smaller words of up to 256 characters, and then the
min and max limits are applied. Is that correct? I have removed
LengthFilter and still see the splitting at 256 happen. I would like
to avoid this and altogether remove any word longer than max, without
decomposing it into smaller ones. Is there a way to achieve this?
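
To make the question concrete, here is a minimal standalone sketch of
the two behaviors. It is plain Java and does not use or reproduce any
Lucene class. The names splitWithCap, dropLongWords and MAX_WORD_LEN
are hypothetical, and the cap of 255 is an assumption read off the
37->292 offsets of token 5 above, which span 255 characters.

```java
import java.util.ArrayList;
import java.util.List;

public class LongWordSplitSketch {

    // Assumed cap on token length (hypothetical; taken from the 37->292 offsets).
    static final int MAX_WORD_LEN = 255;

    // Observed behavior: a run of letters/digits longer than maxLen is cut
    // into consecutive chunks of at most maxLen characters.
    static List<String> splitWithCap(String text, int maxLen) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
                if (current.length() == maxLen) { // cap reached: emit a chunk
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else if (current.length() > 0) {    // separator ends the word
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Desired behavior: measure the whole run first and drop it entirely
    // when its length falls outside [minLen, maxLen].
    static List<String> dropLongWords(String text, int minLen, int maxLen) {
        List<String> tokens = new ArrayList<String>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean wordChar = i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i));
            if (wordChar) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                int len = i - start;
                if (len >= minLen && len <= maxLen) {
                    tokens.add(text.substring(start, i).toLowerCase());
                }
                start = -1;
            }
        }
        return tokens;
    }

    // Helper: builds a string of n copies of c.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "mail77 tc " + repeat('j', 296) + " juju";
        System.out.println(splitWithCap(text, MAX_WORD_LEN));
        System.out.println(dropLongWords(text, 3, 30));
    }
}
```

With a 296-character word, splitWithCap yields one chunk of 255
characters followed by one of 41, which is the pattern in the output
above; dropLongWords discards the whole word instead.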

Using Lucene 3.0.1

thanks
javier
