lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text
Date Sat, 13 Jun 2009 14:49:12 GMT


Robert Muir commented on LUCENE-1488:

Michael, I don't think it will be ready for 2.9; here are some answers to your questions.

Going with your Arabic example:
The only thing this absorbs is language-specific tokenization (like ArabicLetterTokenizer),
because as mentioned I think that's generally the wrong approach.
But this can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems Arabic text
in a language-specific way, which has a huge effect on retrieval quality for the Arabic language.

Some of what it does, though, the language-specific analyzers don't do.

In this specific example, it would be nice if ArabicAnalyzer really used the functionality
here, then did its Arabic-specific stuff!
Because this functionality will do things like normalize 'Arabic Presentation Forms' and deal
with Arabic digits, things that aren't in the ArabicAnalyzer. It will also treat any non-Arabic
text in your corpus very nicely!
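
(For illustration, here is a minimal sketch of that kind of presentation-forms normalization
using ICU4J's Normalizer2; this is illustrative only, not code from the attached patch:

    import com.ibm.icu.text.Normalizer2;

    public class PresentationFormsDemo {
      public static void main(String[] args) {
        Normalizer2 nfkc = Normalizer2.getNFKCInstance();
        // U+FEFB is ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM, an
        // 'Arabic Presentation Forms' codepoint. NFKC normalization maps
        // it back to the plain letters U+0644 (lam) + U+0627 (alef), so
        // indexed text and user queries agree on one spelling.
        System.out.println(nfkc.normalize("\uFEFB").equals("\u0644\u0627")); // true
      }
    }
)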

Yes, you are correct about the difference from StandardAnalyzer, and I would argue there are
tokenization bugs in how StandardAnalyzer handles European languages too; just see LUCENE-1545!

I know StandardAnalyzer does these things. This tokenizer has some built-in types already,
such as number. If you want to add more types, it's easy: just make a .txt file with your
grammar, create a RuleBasedBreakIterator with it, and pass it along to the tokenizer
constructor. You will have to subclass the tokenizer and override getType() for any new types,
though, because RBBI 'types' are really just integer codes in the rule file, and you have to
map them to some text such as "WORD".

Yes, case-folding will work better than lowercasing for a few European languages.
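
For example, a minimal sketch with ICU4J's UCharacter.foldCase (illustrative only):

    import com.ibm.icu.lang.UCharacter;

    public class FoldVsLowerDemo {
      public static void main(String[] args) {
        // German ß has no uppercase form, so lowercasing leaves it alone,
        // while full case folding maps it to "ss" -- letting a query for
        // "weiss" match indexed "Weiß":
        System.out.println("Weiß".toLowerCase());              // weiß
        System.out.println(UCharacter.foldCase("Weiß", true)); // weiss
      }
    }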

> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>                 Key: LUCENE-1488
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking
> text into words, especially with respect to non-alphabetic scripts. This is because it is
> unaware of the Unicode word-boundary properties.
> I actually couldn't figure out how the Thai analyzer could possibly be working until
> I looked at the JFlex rules and saw that the codepoint range for most of the Thai block was
> added to the alphanum specification. Defining the exact codepoint ranges like this for every
> language could help with the problem, but you'd basically be reimplementing the boundary
> properties already stated in the Unicode standard.
> In general it looks like this kind of behavior is bad in Lucene even for Latin; for instance,
> the analyzer will break words around accent marks in decomposed form. While most Latin letter
> + accent combinations have composed forms in Unicode, some do not. (This is also an issue
> for ASCIIFoldingFilter, I suppose.)
> I've got a partially tested StandardAnalyzer that uses the ICU RuleBasedBreakIterator instead
> of JFlex. Using this method you can define word boundaries according to the Unicode boundary
> properties. After getting it into some good shape I'd be happy to contribute it to contrib,
> but I wonder if there's a better solution so that out of the box Lucene will be more friendly
> to non-ASCII text. Unfortunately it seems JFlex does not support the use of these properties,
> such as [\p{Word_Break = Extend}], so this is probably the major barrier.
> Thanks,
> Robert

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

