Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Reply-To: dev@lucene.apache.org
Message-ID: <7832941.148651271969691130.JavaMail.jira@thor>
Date: Thu, 22 Apr 2010 16:54:51 -0400 (EDT)
From: "Robert Muir (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-2414) add icu-based tokenizer for unicode text segmentation
In-Reply-To: <17205450.148581271969572970.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

[ 
https://issues.apache.org/jira/browse/LUCENE-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2414:
--------------------------------

    Attachment: LUCENE-2414.patch

Attached is a patch. After applying it, run 'ant genrbbi', which compiles the rule tailorings to binary DFAs for faster loading. This step is optional: you can always create these from a String instead, but loading the precompiled form is much faster.

> add icu-based tokenizer for unicode text segmentation
> -----------------------------------------------------
>
>                 Key: LUCENE-2414
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2414
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2414.patch
>
>
> I pulled out the last part of LUCENE-1488, the tokenizer itself, and cleaned it up some.
> The idea is simple:
> * The first step is to divide the text at writing system (script) boundaries.
> * You supply an ICUTokenizerConfig (or just use the default), which lets you tailor segmentation on a per-writing-system basis.
> * This tailoring can be any BreakIterator: rule-based, dictionary-based, or your own.
> The default implementation (if you do not customize) just does UAX#29, but with tailorings for scripts with no clear word division:
> * Thai (uses dictionary-based word breaking)
> * Khmer, Myanmar, Lao (use custom rules for syllabification)
> Additionally, as more of an example, I have a tailoring for Hebrew that treats punctuation specially. (People have asked before
> for ways to make StandardAnalyzer treat dashes differently, etc.)
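
The two steps described in the issue (split the text into script runs, then apply a per-script BreakIterator) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: it uses the JDK's java.text.BreakIterator and Character.UnicodeScript rather than ICU4J, and the names ScriptRunTokenizerSketch and SegmenterConfig are hypothetical (SegmenterConfig stands in for the role ICUTokenizerConfig plays in the description).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the LUCENE-2414 patch: JDK classes stand in for ICU4J.
final class ScriptRunTokenizerSketch {

    /** Hypothetical stand-in for ICUTokenizerConfig: picks a BreakIterator per script. */
    interface SegmenterConfig {
        BreakIterator breakIteratorFor(Character.UnicodeScript script);
    }

    /** Default: the JDK's generic word BreakIterator for every script. */
    static final SegmenterConfig DEFAULT =
            script -> BreakIterator.getWordInstance(Locale.ROOT);

    static List<String> tokenize(String text, SegmenterConfig config) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            // Step 1: extend the run until a different (non-neutral) script appears.
            Character.UnicodeScript runScript = null;
            int end = start;
            while (end < text.length()) {
                int cp = text.codePointAt(end);
                Character.UnicodeScript s = Character.UnicodeScript.of(cp);
                boolean neutral = s == Character.UnicodeScript.COMMON
                        || s == Character.UnicodeScript.INHERITED;
                if (!neutral) {
                    if (runScript == null) {
                        runScript = s;          // first real script seen in this run
                    } else if (s != runScript) {
                        break;                  // script changed: the run ends here
                    }
                }
                end += Character.charCount(cp);
            }
            if (runScript == null) {
                runScript = Character.UnicodeScript.COMMON; // run of spaces/punctuation only
            }

            // Step 2: segment the run with the BreakIterator chosen for its script.
            String run = text.substring(start, end);
            BreakIterator bi = config.breakIteratorFor(runScript);
            bi.setText(run);
            int from = bi.first();
            for (int to = bi.next(); to != BreakIterator.DONE; from = to, to = bi.next()) {
                String piece = run.substring(from, to);
                // Keep only pieces that contain a letter or digit (drops spaces, punctuation).
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    tokens.add(runScript + ":" + piece);
                }
            }
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Hello" followed by a Cyrillic word, written with escapes to avoid encoding issues.
        System.out.println(tokenize("Hello \u043c\u0438\u0440", DEFAULT));
    }
}
```

A tailoring in this sketch is just a different SegmenterConfig, for example one that returns a dictionary-based iterator for Thai and falls back to the default for everything else. The real tokenizer would presumably do the same thing with ICU break iterators.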