Date: Mon, 17 Jan 2011 12:54:18 +1100
Subject: Re: Unicode normalisation *before* tokenisation?
From: Trejkaz
To: java-user@lucene.apache.org

On Mon, Jan 17, 2011 at 11:53 AM, Robert Muir wrote:
> On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz wrote:
>> So I guess I have two questions:
>>    1. Is there some way to do filtering to the text before
>> tokenisation without upsetting the offsets reported by the tokeniser?
>>    2. Is there some more general solution to this problem, such as an
>> existing tokeniser similar to StandardTokeniser but with better
>> Unicode awareness?
>>
>
> Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
> you pass Version.LUCENE_31 to get the new behavior)
> It implements UAX#29 algorithm which respects canonical equivalence...
> it sounds like thats what you want.

This does sound like what we want, although it will take some time to
confirm that UAX#29 breaks the text the way we need. Unfortunately, the
standard itself doesn't give any solid examples of how the algorithm
behaves on different kinds of text.

The other problem is that we're still stuck on 2.9, because our
codebase still uses deprecated features and we have very little time
to do anything about it. Moving to the new API is taking a while, as
some of the API changes are quite tricky to refactor for. TokenStream
in particular makes fixing a single class take half a day, once you
add the time to verify that it is working correctly.

TX
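
For readers following the thread, here is a minimal sketch of why normalising text before tokenisation can upset offsets. It uses only the JDK's java.text.Normalizer, not any Lucene API, and the class name CanonicalEquivalenceDemo, its helper methods, and the sample word are illustrative assumptions rather than anything from the thread. Two canonically equivalent spellings of the same word have different lengths, so offsets computed on the normalised text need not line up with the original.

```java
import java.text.Normalizer;

// Hypothetical demo class: shows canonical equivalence with the JDK only.
final class CanonicalEquivalenceDemo {

    // Normalise to NFC (canonical composition).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Two strings are canonically equivalent if their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "caf\u00e9";    // e-acute as one code point
        String decomposed = "cafe\u0301"; // 'e' followed by a combining acute accent

        System.out.println(composed.equals(decomposed));                 // false
        System.out.println(canonicallyEquivalent(composed, decomposed)); // true

        // The lengths differ, which is why offsets shift after normalisation.
        System.out.println(composed.length() + " vs " + decomposed.length()); // 4 vs 5
    }
}
```

Any filter that normalises ahead of the tokeniser would have to keep a mapping from positions in the normalised text back to positions in the original for the reported offsets to stay meaningful.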