lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Unicode normalisation *before* tokenisation?
Date Mon, 17 Jan 2011 00:53:36 GMT
On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <trejkaz@trypticon.org> wrote:
> So I guess I have two questions:
>    1. Is there some way to do filtering to the text before
> tokenisation without upsetting the offsets reported by the tokeniser?
>    2. Is there some more general solution to this problem, such as an
> existing tokeniser similar to StandardTokeniser but with better
> Unicode awareness?
>

Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
you pass Version.LUCENE_31 to get the new behavior).
It implements the UAX#29 word-break algorithm, which respects canonical
equivalence, so composed and decomposed forms of the same text should
tokenize the same way. It sounds like that's what you want.

http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
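
To illustrate the first question above, here is a minimal sketch that uses
only the JDK's java.text.Normalizer (no Lucene classes). It shows why
normalising the text before tokenisation shifts offsets relative to the
original input. The class name OffsetShiftDemo, the helper nfc(), and the
sample string are made up for this illustration.

```java
import java.text.Normalizer;

public class OffsetShiftDemo {
    // Hypothetical helper: normalise to NFC (canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "cafe" + COMBINING ACUTE ACCENT (U+0301) + " bar": decomposed input.
        String original = "cafe\u0301 bar";
        String normalized = nfc(original);

        // The two strings are canonically equivalent but differ in length,
        // because NFC folds 'e' + U+0301 into the single character U+00E9.
        System.out.println(original.length());   // 9
        System.out.println(normalized.length()); // 8

        // An offset computed on the normalised text no longer points at the
        // same place in the original text.
        System.out.println(original.indexOf("bar"));   // 6
        System.out.println(normalized.indexOf("bar")); // 5
    }
}
```

A tokeniser that treats both forms alike can run on the original text, so
the offsets it reports still refer to the original characters.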
