lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Re: Unicode normalisation *before* tokenisation?
Date Mon, 17 Jan 2011 01:54:18 GMT
On Mon, Jan 17, 2011 at 11:53 AM, Robert Muir <rcmuir@gmail.com> wrote:
> On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <trejkaz@trypticon.org> wrote:
>> So I guess I have two questions:
>>    1. Is there some way to do filtering to the text before
>> tokenisation without upsetting the offsets reported by the tokeniser?
>>    2. Is there some more general solution to this problem, such as an
>> existing tokeniser similar to StandardTokeniser but with better
>> Unicode awareness?
>>
>
> Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
> you pass Version.LUCENE_31 to get the new behavior)
> It implements UAX#29 algorithm which respects canonical equivalence...
> it sounds like thats what you want.

This does sound like what we want, although it may take some time to
confirm that UAX#29 breaks text the way we need. Unfortunately, the
standard itself has no solid examples of how the algorithm behaves on
different kinds of text.

The other problem is that we're still stuck on 2.9 because our
codebase still uses deprecated features, and we have very little time
to do anything about it.  Moving to the new API is taking a while, as
some of the API changes are quite tricky to refactor for. TokenStream
in particular makes fixing a single class take half a day once you add
the time to verify that it is working correctly.
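
On question 1 in the quoted text (filtering before tokenisation
without upsetting the offsets), here is a rough sketch of the idea
using only the JDK, not any Lucene API. It normalises the text to NFC
one segment at a time and remembers which original offsets each
normalised char came from, so token offsets computed on the normalised
text can be mapped back. The class name OffsetPreservingNormalizer and
its methods are made up for illustration. The segmentation (one code
point plus any combining marks that follow it) is a simplifying
assumption: it will not match whole-string NFC in every case, for
example conjoining Hangul jamo.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: NFC-normalises text while recording, for every char
// of the normalised output, the original offsets of the segment it came from.
final class OffsetPreservingNormalizer {
    final String normalized;
    private final int[] origStart; // original start offset per normalised char
    private final int[] origEnd;   // original end offset (exclusive) per normalised char

    OffsetPreservingNormalizer(String original) {
        StringBuilder out = new StringBuilder(original.length());
        List<Integer> starts = new ArrayList<>();
        List<Integer> ends = new ArrayList<>();
        int i = 0;
        while (i < original.length()) {
            // Assumption: a segment is one code point plus trailing combining marks.
            int j = i + Character.charCount(original.codePointAt(i));
            while (j < original.length() && isMark(original.codePointAt(j))) {
                j += Character.charCount(original.codePointAt(j));
            }
            String piece = Normalizer.normalize(original.substring(i, j), Normalizer.Form.NFC);
            for (int k = 0; k < piece.length(); k++) {
                starts.add(i);
                ends.add(j);
            }
            out.append(piece);
            i = j;
        }
        normalized = out.toString();
        origStart = new int[starts.size()];
        origEnd = new int[ends.size()];
        for (int k = 0; k < origStart.length; k++) {
            origStart[k] = starts.get(k);
            origEnd[k] = ends.get(k);
        }
    }

    private static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Maps a token start offset in the normalised text back to the original text.
    int originalStart(int normalizedStart) {
        return origStart[normalizedStart];
    }

    // Maps a token end offset (exclusive) in the normalised text back to the original text.
    int originalEnd(int normalizedEnd) {
        return normalizedEnd == 0 ? 0 : origEnd[normalizedEnd - 1];
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (combining acute accent) composes to one char under NFC.
        OffsetPreservingNormalizer n = new OffsetPreservingNormalizer("cafe\u0301 au lait");
        // Pretend a tokeniser reported the first token as [0, 4) in the normalised text.
        System.out.println(n.normalized.substring(0, 4) + " -> original ["
                + n.originalStart(0) + ", " + n.originalEnd(4) + ")");
    }
}
```

A tokeniser would then run over the normalised string, and each
reported start/end would go through originalStart/originalEnd before
being used for highlighting against the original text.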

TX
