lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: 2.3.2 -> 2.4.0 StandardTokenizer issue
Date Sat, 21 Feb 2009 13:34:43 GMT
Normalize your text to NFC before tokenizing. The o and the combining acute
accent then compose into a single character, so the input becomes
\u0043 \u00F3 \u006D \u006F and will work...
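
A minimal sketch of that normalization step, assuming Java 6 or later, where java.text.Normalizer is part of the standard library. The class name NfcExample and the helpers toNfc and escape are illustrative only and are not part of Lucene. The idea is to run the text through toNfc before it reaches the tokenizer, presumably at both index time and query time so the two sides agree.

```java
import java.text.Normalizer;

public class NfcExample {

    // Compose base characters with the combining marks that follow them,
    // wherever Unicode defines a precomposed character (NFC).
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Render each UTF-16 code unit as a backslash-u escape, separated by
    // spaces, so the exact input to the tokenizer can be inspected.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("\\u%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The decomposed input from the report: C, o, combining acute, m, o.
        String decomposed = "Co\u0301mo";
        String composed = toNfc(decomposed);

        System.out.println("before: " + escape(decomposed));
        System.out.println("after:  " + escape(composed));
    }
}
```

Printing the escapes before and after makes it easy to confirm that the combining accent is gone from the string that is actually handed to the analyzer.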

On Fri, Feb 20, 2009 at 11:16 PM, Philip Puffinburger <
ppuffinburger@tlcdelivers.com> wrote:

> >some changes were made to the StandardTokenizer.jflex grammar (you can svn
> >diff the two URLs fairly trivially) to better deal with correctly
> >identifying word characters, but from what i can tell that should have
> >reduced the number of splits, not increased them.
> >
> >it's hard to tell from your email (because it was sent in the windows-1252
> >charset) but what exactly are the unicode characters you are putting
> >through the tokenizer (ie: "\u0030") ?  knowing where it's splitting would
> >help figure out what's happening.
>
> These are the characters that are going through:
>
> \u0043 \u006F \u0301 \u006D \u006F - C o <Combining Acute Accent> m o
>
> It's splitting at the \u0301.
>
> >worst case scenario, you could probably use the StandardTokenizer from
> >2.3.2 with the rest of the 2.4 code.
>
> We've thought of that, but it would be the last thing we'd do to get it
> working again.
>
> >this will show you exactly what changed...
> >svn diff
> >http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
> >http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> Thanks for the links.   I've never dealt with JFlex, so I'll have to do
> some reading to know what is going on in those files.


-- 
Robert Muir
rcmuir@gmail.com
