lucene-java-user mailing list archives

From: Robert Muir
Subject: Re: Unicode normalisation *before* tokenisation?
Date: Mon, 17 Jan 2011 00:53:36 GMT

On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz wrote:
> So I guess I have two questions:
>    1. Is there some way to do filtering to the text before
> tokenisation without upsetting the offsets reported by the tokeniser?
>    2. Is there some more general solution to this problem, such as an
> existing tokeniser similar to StandardTokeniser but with better
> Unicode awareness?

Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
you pass Version.LUCENE_31 to get the new behavior).
It implements the UAX#29 word-boundary algorithm, which respects
canonical equivalence... it sounds like that's what you want.
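
To make "canonical equivalence" and the offset problem from question 1
concrete, here is a rough sketch using only the JDK's java.text.Normalizer
(no Lucene involved). The class name, helper methods and sample text are
made up for illustration. It shows that a decomposed and a precomposed
spelling compare unequal as raw strings but are equal after NFC
normalisation, and that normalising up front shifts character offsets,
which is why filtering before tokenisation upsets the reported offsets.

```java
import java.text.Normalizer;

// Illustrative only: class and method names are hypothetical, not Lucene API.
public class CanonicalEquivalenceDemo {

    // Returns the NFC (canonically composed) form of the input.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Two strings are canonically equivalent if their NFC forms are equal.
    static boolean canonicallyEquivalent(String a, String b) {
        return nfc(a).equals(nfc(b));
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
        String decomposed = "cafe\u0301 au lait";
        // single precomposed U+00E9 (composed form)
        String composed = "caf\u00e9 au lait";

        // Raw comparison sees different char sequences.
        System.out.println(decomposed.equals(composed));                 // false
        // After normalisation the two spellings are the same text.
        System.out.println(canonicallyEquivalent(decomposed, composed)); // true

        // Offsets into the original text and the normalised text differ,
        // because NFC collapsed two chars into one before "au".
        System.out.println(decomposed.indexOf("au"));      // 6
        System.out.println(nfc(decomposed).indexOf("au")); // 5
    }
}
```

A tokenizer that handles combining marks itself can report offsets
against the original, un-normalised text, so highlighting stays aligned.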
