lucene-dev mailing list archives

From "DM Smith (JIRA)" <>
Subject [jira] Commented: (LUCENE-1488) multilingual analyzer based on icu
Date Wed, 02 Dec 2009 22:46:20 GMT


DM Smith commented on LUCENE-1488:

Robert, I just finished reviewing the code. Looks great! Doesn't look like there's too much
left. All I see is a bit of JavaDoc and one unused variable (ICUTokenizer: private
PositionIncrementAttribute posIncAtt;).

The documentation in ICUNormalizationFilter is very instructive. Kudos. The only part that's
hard for me to understand is the filter order dependency, but then again that's a hard
topic in the first place.
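Filter order matters because some operations only succeed on a particular normalization form. As a JDK-only sketch (java.text.Normalizer standing in for the ICU filter, which is an assumption on my part), stripping diacritics only works if NFD decomposition runs first, so the combining marks exist as separate code points to remove:

```java
import java.text.Normalizer;

public class FilterOrderDemo {
    // Decompose first (NFD), then strip combining marks (category Mn).
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        String composed = "caf\u00e9"; // "café" with precomposed é (U+00E9)
        // Without NFD, U+00E9 contains no separate mark to strip:
        System.out.println(composed.replaceAll("\\p{Mn}+", "")); // still "café"
        // With NFD first, the accent is removed:
        System.out.println(stripMarks(composed)); // "cafe"
    }
}
```

Swap the two steps and the mark-stripping filter becomes a no-op on composed input, which is exactly the kind of order dependency the docs describe.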

I'm wondering whether it would make sense to have multiple representations of a token at
the same position in the index, specifically transliterations and case-foldings, so that
one is a "synonym" for the other. Is that possible, and does it make sense? I'm imagining
a use case where an end user enters a Latin-script transliteration of
Greek, "uios", as a search request, but might also enter "υιος".
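For what it's worth, Lucene's usual mechanism for this is to emit the extra form with a position increment of 0, so both forms occupy the same slot. A library-free sketch of the transliteration half, using a tiny, hypothetical Greek-to-Latin map (a real analyzer would use an ICU Transliterator rule set):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TranslitSynonymDemo {
    // Hypothetical per-letter map, for illustration only.
    static final Map<Character, String> GREEK_TO_LATIN = new LinkedHashMap<>();
    static {
        GREEK_TO_LATIN.put('\u03c5', "u"); // υ
        GREEK_TO_LATIN.put('\u03b9', "i"); // ι
        GREEK_TO_LATIN.put('\u03bf', "o"); // ο
        GREEK_TO_LATIN.put('\u03c3', "s"); // σ
        GREEK_TO_LATIN.put('\u03c2', "s"); // ς (final sigma)
    }

    static String transliterate(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            String latin = GREEK_TO_LATIN.get(c);
            sb.append(latin != null ? latin : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Indexing the transliteration at the same position (increment 0)
        // would let "uios" and "υιος" match the same document.
        String original = "\u03c5\u03b9\u03bf\u03c2"; // υιος
        System.out.println(original + " -> " + transliterate(original));
    }
}
```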

The other question on my mind: given a text of German, Greek and Hebrew (three distinct
scripts), does it make sense to apply stop words to them based on script? And should stop words
be normalized on load with the ICUNormalizationFilter? Or is it a given that they work as
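The reason I ask: if the token stream is normalized but the stop-word set isn't, visually identical strings can fail to compare equal. A small JDK-only sketch of normalizing stop words on load (NFC here, which is an assumption about what the filter produces):

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordLoadDemo {
    // Normalize stop words the same way the token stream is normalized,
    // so lookups compare canonically equivalent strings as equal.
    static Set<String> loadStopWords(String... words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            set.add(Normalizer.normalize(w.toLowerCase(Locale.ROOT),
                                         Normalizer.Form.NFC));
        }
        return set;
    }

    public static void main(String[] args) {
        // German "über" loaded in decomposed form (u + combining diaeresis):
        Set<String> stops = loadStopWords("u\u0308ber");
        // A token arriving in composed form still matches, because both
        // sides went through the same normalization:
        String token = Normalizer.normalize("\u00fcber", Normalizer.Form.NFC);
        System.out.println(stops.contains(token)); // true
    }
}
```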

Can all of this integrate with stemmers, and if so, how?

Again, many thanks! (Btw, special thanks for this working with 2.9 and Java 1.4!)

> multilingual analyzer based on icu
> ----------------------------------
>                 Key: LUCENE-1488
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, LUCENE-1488.patch,
LUCENE-1488.txt, LUCENE-1488.txt
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking
text into words, especially with respect to non-alphabetic scripts. This is because it is
unaware of the Unicode boundary properties.
> I actually couldn't figure out how the Thai analyzer could possibly be working until
I looked at the JFlex rules and saw that the codepoint range for most of the Thai block was added
to the alphanum specification. Defining exact codepoint ranges like this for every language
could help with the problem, but you'd basically be reimplementing the boundary properties already
stated in the Unicode standard.
> In general this kind of behavior is bad in Lucene even for Latin; for instance,
the analyzer will break words around accent marks in decomposed form. While most Latin letter
+ accent combinations have composed forms in Unicode, some do not. (This is also an issue
for ASCIIFoldingFilter, I suppose.)
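The decomposed-accent problem above can be seen with the JDK alone; a tokenizer that only accepts letter runs (a stand-in for the behavior described, not the actual StandardAnalyzer grammar) fragments the word at every combining mark:

```java
import java.text.Normalizer;

public class DecomposedAccentDemo {
    // Runs of ASCII letters: a stand-in for a tokenizer that is
    // unaware of combining marks.
    static String[] letterRuns(String text) {
        return text.split("[^A-Za-z]+");
    }

    public static void main(String[] args) {
        // "résumé" in decomposed (NFD) form: each é becomes e + U+0301.
        String nfd = Normalizer.normalize("r\u00e9sum\u00e9", Normalizer.Form.NFD);
        System.out.println(nfd.length()); // 8 chars, not 6
        // The combining marks split the word into fragments:
        System.out.println(String.join("|", letterRuns(nfd))); // re|sume
    }
}
```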
> I've got a partially tested StandardAnalyzer that uses the ICU RuleBasedBreakIterator instead
of JFlex. Using this method you can define word boundaries according to the Unicode boundary
properties. After getting it into good shape I'd be happy to contribute it for contrib,
but I wonder if there's a better solution so that out of the box Lucene will be more friendly to
non-ASCII text. Unfortunately it seems JFlex does not support the use of these properties, such
as [\p{Word_Break = Extend}], so this is probably the major barrier.
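The boundary-analysis approach can be sketched with the JDK's own java.text.BreakIterator, a rough stand-in for the ICU RuleBasedBreakIterator in the patch (the JDK version implements the same UAX #29 style of word segmentation, though with fewer customization hooks):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Extract word tokens by iterating boundary pairs and keeping
    // only the segments that start with a letter or digit.
    static List<String> words(String text, Locale locale) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Unicode text boundaries", Locale.ROOT));
        // [Unicode, text, boundaries]
    }
}
```

The point is that the boundary rules come from data, not from hand-written codepoint ranges per script.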
> Thanks,
> Robert

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
