lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Created: (LUCENE-2414) add icu-based tokenizer for unicode text segmentation
Date Thu, 22 Apr 2010 20:52:52 GMT
add icu-based tokenizer for unicode text segmentation

                 Key: LUCENE-2414
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*
    Affects Versions: 3.1
            Reporter: Robert Muir
             Fix For: 3.1

I pulled out the last part of LUCENE-1488, the tokenizer itself, and cleaned it up a bit.

The idea is simple:
* The first step is to divide the text at writing-system (script) boundaries.
* You supply an ICUTokenizerConfig (or just use the default), which lets you tailor segmentation on a per-writing-system basis (see the sketch after this list).
* This tailoring can be any BreakIterator: rule-based, dictionary-based, or your own.
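
For concreteness, here is a minimal sketch of a custom config, assuming the ICUTokenizerConfig API from the attached patch (a getBreakIterator(int script) hook for segmentation, plus getType for token typing); names and packages may still change before commit:

{code:java}
import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig;

// Illustrative tailoring: override segmentation for one script,
// delegate everything else to the defaults.
public class MyTokenizerConfig extends ICUTokenizerConfig {
  private final ICUTokenizerConfig defaults = new DefaultICUTokenizerConfig();

  @Override
  public BreakIterator getBreakIterator(int script) {
    if (script == UScript.THAI) {
      // ICU's Thai word instance is dictionary-based
      return BreakIterator.getWordInstance(new ULocale("th"));
    }
    return defaults.getBreakIterator(script); // plain UAX#29 otherwise
  }

  @Override
  public String getType(int script, int ruleStatus) {
    return defaults.getType(script, ruleStatus); // keep default token types
  }
}
{code}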

The default implementation (if you do not customize) just does UAX#29, but with tailorings
for scripts with no clear word division:
* Thai (uses dictionary-based word breaking)
* Khmer, Myanmar, Lao (uses custom rules for syllabification)
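
If you don't tailor anything, usage is just constructing the tokenizer without a config; roughly like this (constructor shape per the patch, so treat it as illustrative):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ICUTokenizerDemo {
  public static void main(String[] args) throws Exception {
    // No config supplied: UAX#29 segmentation plus the tailorings above,
    // so the Thai here gets dictionary-based word breaking.
    Tokenizer ts = new ICUTokenizer(new StringReader("testing ภาษาไทย"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.close();
  }
}
{code}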

Additionally, as more of an example, I have a tailoring for Hebrew that treats punctuation
specially; see the sketch below. (People have asked before
for ways to make StandardAnalyzer treat dashes differently, etc.)
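
To give a flavor of what such a tailoring can look like, here is a purely hypothetical RBBI fragment (not the rules from the patch) that keeps quote marks and gershayim inside Hebrew words so acronyms survive; you would return it from getBreakIterator for UScript.HEBREW:

{code:java}
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

public class HebrewBreaks {
  // Hypothetical rules, for illustration only; a real rule set would be
  // more complete. Text the rules don't match falls back to a
  // one-character advance, which is acceptable for a sketch.
  static BreakIterator hebrewWords() {
    String rules =
        "$Hebrew = [\\p{Hebrew}];"
      + "$Quote  = [\\u0022\\u05F3\\u05F4];"  // ASCII quote, geresh, gershayim
      + "!!forward;"
      + "$Hebrew+ ($Quote $Hebrew+)*;";
    return new RuleBasedBreakIterator(rules);
  }
}
{code}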

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
