lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text
Date Thu, 04 Jun 2009 14:21:07 GMT


Robert Muir updated LUCENE-1488:

    Attachment: LUCENE-1488.patch

Updated patch; not ready yet, but you can see where I am going.

ICUTokenizer: Breaks text into words according to UAX #29: Unicode Text Segmentation. Text
is divided at script boundaries so that segmentation can be tailored per writing system;
for example, Thai text is segmented with a different method. Both the default and the
script-specific rules can be tailored; in the resources folder I have some examples for
Southeast Asian scripts, etc. Since I need script boundaries for tailoring, I stuff the
ISO 15924 script code constant into the token flags; this could be useful for downstream
consumers.
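
For illustration, a minimal sketch of the underlying ICU4J calls (stock ICU4J, independent
of this patch): BreakIterator's word instance implements the UAX #29 default rules, and
UScript maps codepoints to ISO 15924 script codes.

    import com.ibm.icu.lang.UScript;
    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class WordBreakDemo {
      public static void main(String[] args) {
        String text = "test \u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22"; // "test" + Thai
        BreakIterator wb = BreakIterator.getWordInstance(ULocale.ROOT);
        wb.setText(text);
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
          String token = text.substring(start, end);
          if (token.trim().length() == 0) continue; // skip the whitespace segments
          int script = UScript.getScript(token.codePointAt(0));
          // getShortName returns the ISO 15924 code, e.g. "Latn" or "Thai"
          System.out.println(token + " -> " + UScript.getShortName(script));
        }
      }
    }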

ICUCaseFoldingFilter: Folds case according to Unicode Default Caseless Matching (full case
folding). This may change the length of the token; for example, the German sharp s (ß) is
folded to 'ss'. This filter interacts with the downstream normalization filter in a special
way: you can provide a hint as to what the desired normalization form will be. In the NFKC
or NFKD case it will apply the NFKC_Closure set, so you do not have to
Normalize(Fold(Normalize(Fold(x)))).
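
The folding itself is ICU's standard full case folding; a tiny sketch, again using stock
ICU4J rather than the patch:

    import com.ibm.icu.lang.UCharacter;

    public class CaseFoldDemo {
      public static void main(String[] args) {
        // Full case folding; 'true' selects the default (non-Turkic) mappings.
        String folded = UCharacter.foldCase("Flu\u00DF", true); // "Fluß", 4 chars
        System.out.println(folded + " (" + folded.length() + " chars)"); // fluss (5 chars)
      }
    }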

ICUDigitFoldingFilter: Standardizes digits from different scripts to the Latin values 0-9.
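
A sketch of the underlying idea, using UCharacter.digit from stock ICU4J (the patch's
filter operates on token streams, but the per-codepoint logic is the same in spirit):

    import com.ibm.icu.lang.UCharacter;

    public class DigitFoldDemo {
      public static void main(String[] args) {
        String s = "\u0664\u0662 and \u0E53"; // ARABIC-INDIC 4 and 2, THAI DIGIT 3
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
          int cp = s.codePointAt(i); // codepoint-aware, so surrogate pairs are safe
          int d = UCharacter.digit(cp, 10);
          if (d >= 0) sb.append((char) ('0' + d)); // replace with the Latin digit
          else sb.appendCodePoint(cp);
          i += Character.charCount(cp);
        }
        System.out.println(sb); // prints "42 and 3"
      }
    }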

ICUFormatFilter: Removes identifier-ignorable codepoints, specifically those in the Format
(Cf) category.
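
For example, with stock ICU4J's UCharacter.isIdentifierIgnorable (a sketch of the
per-codepoint test, not the patch's filter):

    import com.ibm.icu.lang.UCharacter;

    public class StripIgnorableDemo {
      public static void main(String[] args) {
        // SOFT HYPHEN (U+00AD) and ZERO WIDTH NON-JOINER (U+200C), both category Cf
        String s = "co\u00ADop\u200Cerate";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
          int cp = s.codePointAt(i);
          if (!UCharacter.isIdentifierIgnorable(cp)) sb.appendCodePoint(cp);
          i += Character.charCount(cp);
        }
        System.out.println(sb); // prints "cooperate"
      }
    }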

ICUNormalizationFilter: Applies Unicode normalization to text. This is accelerated with a
quick check.
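
The quick check is the standard ICU one; a sketch of the normalize-only-when-needed
pattern with stock ICU4J:

    import com.ibm.icu.text.Normalizer;

    public class NormalizeDemo {
      public static void main(String[] args) {
        String s = "cafe\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        // quickCheck lets text that is already normalized pass through untouched
        if (Normalizer.quickCheck(s, Normalizer.NFC) != Normalizer.YES) {
          s = Normalizer.normalize(s, Normalizer.NFC);
        }
        System.out.println(s.length()); // 4: the pair composed to U+00E9
      }
    }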

ICUAnalyzer ties all this together. All of these components should also work correctly with
surrogate-pair data. 
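
To give an idea of the composition, a hypothetical sketch in the Lucene 2.x Analyzer
style; the constructor signatures below are assumptions for illustration, not the patch's
actual API (for instance, the real ICUCaseFoldingFilter also takes the normalization-form
hint described above):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;

    public class ICUAnalyzerSketch extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // All constructors here are assumed single-argument for illustration.
        TokenStream ts = new ICUTokenizer(reader);   // UAX #29 segmentation
        ts = new ICUCaseFoldingFilter(ts);           // full case folding
        ts = new ICUDigitFoldingFilter(ts);          // digits -> 0-9
        ts = new ICUFormatFilter(ts);                // drop Format codepoints
        ts = new ICUNormalizationFilter(ts);         // e.g. NFC
        return ts;
      }
    }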

Needs more documentation and tests. Any comments appreciated.

> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>                 Key: LUCENE-1488
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking
> text into words, especially with respect to non-alphabetic scripts. This is because it is
> unaware of the Unicode word-break properties.
> I actually couldn't figure out how the Thai analyzer could possibly be working until
> I looked at the JFlex rules and saw that the codepoint range for most of the Thai block
> was added to the alphanum specification. Defining exact codepoint ranges like this for
> every language could help with the problem, but you'd basically be reimplementing the
> word-break properties already stated in the Unicode standard.
> In general this kind of behavior is bad in Lucene even for Latin; for instance, the
> analyzer will break words around accent marks in decomposed form. While most Latin letter
> + accent combinations have composed forms in Unicode, some do not. (This is also an issue
> for ASCIIFoldingFilter, I suppose.)
> I've got a partially tested standard analyzer that uses the ICU RuleBasedBreakIterator
> instead of JFlex. Using this method you can define word boundaries according to the
> Unicode word-break properties. After getting it into good shape I'd be happy to
> contribute it to contrib, but I wonder if there's a better solution so that out of the
> box Lucene will be more friendly to non-ASCII text. Unfortunately it seems JFlex does not
> support the use of these properties, such as [\p{Word_Break = Extend}], so this is
> probably the major barrier.
> Thanks,
> Robert
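
To make the decomposed-form problem quoted above concrete, a JDK-only demonstration
(java.text.Normalizer, Java 6+); the combining mark is a separate char that a
break-unaware tokenizer can treat as a word boundary:

    import java.text.Normalizer;

    public class DecomposedDemo {
      public static void main(String[] args) {
        String composed = "caf\u00E9"; // "café", 4 chars
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        // 5 chars: 'e' + U+0301 COMBINING ACUTE ACCENT
        System.out.println(decomposed.length());
        // A tokenizer unaware of Word_Break=Extend may split between them,
        // yielding "cafe" plus an orphaned accent instead of one word.
      }
    }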

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
