lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Philip Puffinburger" <ppuffinbur...@tlcdelivers.com>
Subject RE: 2.3.2 -> 2.4.0 StandardTokenizer issue
Date Sat, 21 Feb 2009 04:16:54 GMT
>some changes were made to the StandardTokenizer.jflex grammer (you can svn diff the two
URLs fairly trivially) to better deal with correctly >identifying word characters, but
from what i can tell that should have reduced the number of splits, not increased them.
>
>it's hard to tell from your email (because it was sent in the windows-1252
>charset) but what exactly are the unicode characters you are putting through the tokenizer
(ie: "\u0030") ?  knowing where it's splitting would >help figure out what's happening.

These are the characters that of going through:

\u0043 \u006F \u0301 \u006D \u006F - C o <Combining Acute Accent> m o

It's splitting at the \u0301.

>worst case scenerio, you could probably use the StandardTokenizer from
>2.3.2 with the rest of the 2.4 code.

We've thought of that, but would be the last thing we did to get it back to working.

>this will show you exactly what changed...
>svn diff >http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

Thanks for the links.   I've never dealt with JFlex, so I'll have to do some reading to know
what is going on in those files.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message