lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text
Date Fri, 05 Jun 2009 17:43:07 GMT


Robert Muir commented on LUCENE-1488:

Here's a simple description of what the current functionality buys you:

All Indic languages (Hindi, Bengali, Tamil, ...) and Middle Eastern languages (Arabic, Hebrew,
etc.) will work pretty well here (by that I mean tokenized, normalized, etc.). Most of these,
Lucene cannot parse correctly with any of the built-in analyzers.

Obviously Lucene already handles European languages quite well, but Unicode still offers some
improvements here, e.g. better case-folding.
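
To illustrate the kind of case-folding improvement I mean, here's a minimal sketch using ICU4J's UCharacter.foldCase (this is just an illustration, not the actual filter in the patch):

    import com.ibm.icu.lang.UCharacter;

    public class CaseFoldDemo {
        public static void main(String[] args) {
            String s = "Weiß";
            // plain lowercasing leaves U+00DF untouched: "weiß"
            System.out.println(s.toLowerCase());
            // ICU full case folding maps ß to "ss": "weiss",
            // so "Weiß" and "WEISS" fold to the same term
            System.out.println(UCharacter.foldCase(s, true));
        }
    }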

And finally, of course, the situation where you have data in a bunch of these different languages!

In general, the Unicode defaults work quite well for almost all languages, with the exception
of CJK and Southeast Asian languages.
It's not my intent to really solve those harder cases, only to provide a mechanism for someone
else to deal with them if they don't like the defaults.

A great example is the Arabic tokenizer: it should not exist, because the Unicode defaults work
great for that language. And it would be silly to think about a HindiTokenizer, BengaliTokenizer,
etc. when the Unicode defaults will tokenize those languages correctly as well.
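
For a feel of what the Unicode defaults do, here is a minimal sketch (the class name and sample text are just illustrative; this isn't the patch itself) that breaks text on UAX #29 default word boundaries with ICU4J's BreakIterator:

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class DefaultWordBreakDemo {
        public static void main(String[] args) {
            // Arabic sample: "Arabic is written from right to left"
            String text = "العربية تكتب من اليمين إلى اليسار";
            BreakIterator wb = BreakIterator.getWordInstance(ULocale.ROOT);
            wb.setText(text);
            int start = wb.first();
            for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
                String token = text.substring(start, end);
                // keep only segments that contain at least one letter or digit
                if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    System.out.println(token);
                }
            }
        }
    }

The same loop handles Hindi, Hebrew, etc. without any per-language rules, which is exactly why per-language tokenizers shouldn't be needed for these scripts.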

There's still some annoying complexity here, and any comments are appreciated. Especially
tricky is the complexity/performance/maintenance balance; e.g. the case-folding filter could
be a lot faster, but then it would have to be updated whenever a new Unicode version is released...
Another thing is that I didn't optimize the BMP case anywhere [i.e. everything works at the
32-bit codepoint level to ensure surrogate data works], and I think that's worth considering...
something like 99.9% of data is in the BMP :)
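
On that BMP point, the cost I mean is always working on 32-bit codepoints rather than chars; a tiny sketch of the difference (the supplementary character here is just an arbitrary example):

    public class CodePointDemo {
        public static void main(String[] args) {
            // U+10384 is outside the BMP, so Java stores it as a surrogate pair
            String s = "a\uD800\uDF84b";
            System.out.println("chars:       " + s.length());                      // 4
            System.out.println("code points: " + s.codePointCount(0, s.length())); // 3
            // iterate by code point so the surrogate pair is treated as one character
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);
            }
        }
    }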


> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>                 Key: LUCENE-1488
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking text into words, especially with respect to non-alphabetic scripts. This is because it is unaware of the Unicode word boundary properties.
> I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the JFlex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification. Defining the exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the boundary properties already stated in the Unicode standard.
> In general it looks like this kind of behavior is bad in Lucene even for Latin; for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.)
> I've got a partially tested StandardAnalyzer that uses the ICU RuleBasedBreakIterator instead of JFlex. Using this method you can define word boundaries according to the Unicode boundary properties. After getting it into some good shape I'd be happy to contribute it for contrib, but I wonder if there's a better solution so that out of the box Lucene will be more friendly to non-ASCII text. Unfortunately it seems JFlex does not support the use of these properties, such as [\p{Word_Break = Extend}], so this is probably the major barrier.
> Thanks,
> Robert

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

