Date: Fri, 25 May 2012 14:46:23 +0000 (UTC)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Updated] (LUCENE-4072) CharFilter that Unicode-normalizes input

    [ https://issues.apache.org/jira/browse/LUCENE-4072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4072:
--------------------------------

    Attachment: LUCENE-4072.patch

Attached is the filter, turned into a patch. However, I added an additional random test and it currently fails. I will look into this more.

> CharFilter that Unicode-normalizes input
> ----------------------------------------
>
>                 Key: LUCENE-4072
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4072
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Ippei UKAI
>         Attachments: LUCENE-4072.patch, ippeiukai-ICUNormalizer2CharFilter-4752cad.zip
>
>
> I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.
> The benefit of doing this in a CharFilter is that the tokenizer works on normalised text, while offset correction ensures that the fast vector highlighter and other offset-dependent features do not break.
> The implementation is available at the following repository:
> https://github.com/ippeiukai/ICUNormalizer2CharFilter
> Unfortunately, this is an unpaid side project, and I cannot spend much time merging my work into Lucene to make a proper patch. I'd appreciate it if anyone could give it a go. I'm happy to relicense it under whatever terms meet your needs.
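
To illustrate the offset correction described in the quoted issue, here is a minimal sketch. It is not the attached patch and not the ICUNormalizer2CharFilter code. It uses the JDK's java.text.Normalizer in place of ICU4J, and the class name NormalizingOffsetSketch, its Result type, and the segmentation rule (one base character plus any trailing combining marks) are assumptions made purely for illustration. A real CharFilter would also have to deal with streaming input and with normalization boundaries that this simple rule does not cover.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: normalizes a whole String and remembers,
// for every char of the normalized text, where its source segment started.
final class NormalizingOffsetSketch {

    /** Normalized text plus a map from normalized offsets back to original offsets. */
    static final class Result {
        final String text;
        private final int[] originalOffsets;
        private final int originalLength;

        Result(String text, int[] originalOffsets, int originalLength) {
            this.text = text;
            this.originalOffsets = originalOffsets;
            this.originalLength = originalLength;
        }

        /** Maps an offset in the normalized text to an offset in the original text. */
        int correctOffset(int normalizedOffset) {
            return normalizedOffset < originalOffsets.length
                    ? originalOffsets[normalizedOffset]
                    : originalLength; // end-of-text maps to end of the original
        }
    }

    // Assumption: treat general categories Mn, Mc and Me as "attaches to the previous char".
    private static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    static Result normalize(String input, Normalizer.Form form) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int start = i;
            // Take one base code point ...
            i += Character.charCount(input.codePointAt(i));
            // ... and every combining mark that follows it.
            while (i < input.length() && isCombiningMark(input.codePointAt(i))) {
                i += Character.charCount(input.codePointAt(i));
            }
            String normalized = Normalizer.normalize(input.substring(start, i), form);
            // Every char produced from this segment points back to the segment start.
            for (int k = 0; k < normalized.length(); k++) {
                offsets.add(start);
            }
            out.append(normalized);
        }
        int[] map = new int[offsets.size()];
        for (int k = 0; k < map.length; k++) {
            map[k] = offsets.get(k);
        }
        return new Result(out.toString(), map, input.length());
    }
}
```

For example, normalizing "e" + U+0301 + "x" with NFC yields a two-char string, and correctOffset(1) returns 2. A token that starts at "x" in the normalized text is therefore highlighted at the right position in the original text.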