lucene-java-user mailing list archives

From Chris Hostetter <>
Subject Re: 2.3.2 -> 2.4.0 StandardTokenizer issue
Date Sat, 21 Feb 2009 01:11:00 GMT

: In 2.3.2, if the token 'Cómo' came through this, it would get changed to
: 'como' by the time it made it through the filters. In 2.4.0 this isn't
: the case. It treats this one token as two, so we get 'co' and 'mo'. So
: instead of searching 'como' or 'Cómo' to get all the hits, we now have to
: search them separately.

Some changes were made to the StandardTokenizer.jflex grammar (you can svn
diff the two URLs fairly trivially) to better deal with correctly
identifying word characters, but from what I can tell that should have
reduced the number of splits, not increased them.

It's hard to tell from your email (because it was sent in the windows-1252
charset), but what exactly are the Unicode characters you are putting
through the tokenizer (i.e.: "\u0030")?  Knowing where it's splitting would
help figure out what's happening.
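One quick way to answer that question is to dump each char of the input as a \uXXXX escape, so the exact characters reaching the tokenizer are unambiguous regardless of your terminal or mail charset. This is just a standalone sketch (the class and method names are illustrative, not part of Lucene); in particular it will show whether the ó arrives as a single precomposed character (U+00F3) or as a base 'o' followed by a combining acute accent (U+0301), which are easy to confuse but can tokenize very differently:

```java
public class CodePoints {

    // Print each char of s as a \uXXXX escape, so the exact input to the
    // tokenizer is visible even when the display charset mangles it.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String precomposed = "C\u00F3mo";   // "Cómo" with precomposed ó (4 chars)
        String decomposed  = "Co\u0301mo";  // "Cómo" with o + combining acute (5 chars)
        System.out.println(dump(precomposed)); // \u0043 \u00F3 \u006D \u006F
        System.out.println(dump(decomposed));  // \u0043 \u006F \u0301 \u006D \u006F
    }
}
```

If the two strings look identical on screen but dump differently, that difference is what the grammar sees.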

Worst case scenario, you could probably use the StandardTokenizer from
2.3.2 with the rest of the 2.4 code.

This will show you exactly what changed:
svn diff
