lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: 2.3.2 -> 2.4.0 StandardTokenizer issue
Date Sat, 21 Feb 2009 16:51:52 GMT
that was just a suggestion as a quick hack...

it still won't really fix the problem because some character + accent
combinations don't have composed forms.

even if you added entire combining diacritical marks block to the jflex
grammar, its still wrong... what needs to be supported is \p{Word_Break =
Extend} property, etc etc.


On Sat, Feb 21, 2009 at 11:23 AM, Philip Puffinburger <
ppuffinburger@tlcdelivers.com> wrote:

> That's something we can try.   I don't know how much it performance we'd
> lose doing that as our custom filter has to decompose the tokens to do its
> operations.   So instead of 0..1 conversions we'd be doing 1..2 conversions
> during indexing and searching.
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Saturday, February 21, 2009 8:35 AM
> To: java-user@lucene.apache.org
> Subject: Re: 2.3.2 -> 2.4.0 StandardTokenizer issue
>
> normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and
> will work...
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message