Date: Mon, 17 Jan 2011 12:54:18 +1100
Subject: Re: Unicode normalisation *before* tokenisation?
From: Trejkaz <trejkaz@trypticon.org>
To: java-user@lucene.apache.org

On Mon, Jan 17, 2011 at 11:53 AM, Robert Muir <rcmuir@gmail.com> wrote:
> On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <trejkaz@trypticon.org> wrote:
>> So I guess I have two questions:
>>    1. Is there some way to do filtering to the text before
>> tokenisation without upsetting the offsets reported by the tokeniser?
>>    2. Is there some more general solution to this problem, such as an
>> existing tokeniser similar to StandardTokeniser but with better
>> Unicode awareness?
>>
>
> Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
> you pass Version.LUCENE_31 to get the new behavior)
> It implements UAX#29 algorithm which respects canonical equivalence...
> it sounds like that's what you want.

This does sound like what we want, although it may take some time to
confirm that UAX#29 breaks the text the way we want (the standard
itself doesn't include any solid examples of how the algorithm behaves
on different kinds of text, which is a bit unfortunate).

The other problem is that we're still stuck on 2.9, because our
codebase still uses deprecated features and we have very little time
to do anything about it.  Moving to the new API is taking a while, as
some of the API changes are quite tricky to refactor for (TokenStream
in particular: fixing a single class takes half a day once you add the
time to verify that it is working correctly).
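
In the meantime, a stopgap for my first question might be to normalise
the text ourselves and keep a map back to the original offsets, so
that whatever the tokeniser reports can be corrected afterwards.  A
rough sketch of the idea follows.  NormalizedText and originalOffset
are hypothetical names of my own, not anything in Lucene, and the code
only uses java.text.Normalizer from the JDK.  It assumes that
normalising one base character plus its trailing combining marks at a
time is good enough, which will not match whole-string NFC in every
case (Hangul jamo sequences, for instance).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene class): NFC text plus a map back to original offsets. */
class NormalizedText {
    final String text;    // the NFC-normalised text to hand to the tokeniser
    final int[] offsets;  // offsets[i] = original offset for normalised offset i

    private NormalizedText(String text, int[] offsets) {
        this.text = text;
        this.offsets = offsets;
    }

    /** Maps a start or end offset in the normalised text back to the original text. */
    int originalOffset(int normalizedOffset) {
        return offsets[normalizedOffset];
    }

    static NormalizedText normalize(String original) {
        StringBuilder out = new StringBuilder();
        List<Integer> map = new ArrayList<Integer>();
        int i = 0;
        while (i < original.length()) {
            int start = i;
            // Take one code point, then any combining marks that follow it.
            i += Character.charCount(original.codePointAt(i));
            while (i < original.length() && isMark(original.codePointAt(i))) {
                i += Character.charCount(original.codePointAt(i));
            }
            String nfc = Normalizer.normalize(original.substring(start, i),
                                              Normalizer.Form.NFC);
            // Every char produced from this cluster points at the cluster start.
            for (int k = 0; k < nfc.length(); k++) {
                map.add(start);
            }
            out.append(nfc);
        }
        map.add(original.length()); // lets end offsets be mapped as well
        int[] offsets = new int[map.size()];
        for (int k = 0; k < offsets.length; k++) {
            offsets[k] = map.get(k);
        }
        return new NormalizedText(out.toString(), offsets);
    }

    // Simplifying assumption: only general categories Mn, Mc and Me extend a cluster.
    private static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }
}
```

For example, with the input "cafe\u0301 noir" the normalised text is
one char shorter, and a token "noir" reported at 5-9 in the normalised
text maps back to 6-10 in the original via originalOffset().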

TX
