lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text
Date Sat, 13 Jun 2009 10:23:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719109#action_12719109 ]

Michael McCandless commented on LUCENE-1488:
--------------------------------------------

ICUAnalyzer looks very useful!  Good work, Robert.  (And, thanks!)

Do you think this'll be ready to go in time for 2.9 (which we are
trying to wrap up soonish)?

It seems like this absorbs the functionality of many of Lucene's
current analyzers.  E.g., you mentioned ArabicAnalyzer already.  What
other analyzers (e.g. in contrib/analyzers/*) would you say are
logically subsumed by this?

Also, this seems quite different from StandardAnalyzer, in that it
focuses entirely on doing "good" tokenization by relying on the
Unicode standard's defaults instead of StandardAnalyzer's fixed char
ranges.  So it fixes many bugs in how StandardAnalyzer tokenizes,
especially on non-European languages.
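
Just to make that concrete, here's a tiny standalone sketch of the
kind of segmentation ICU gives you (UAX #29 defaults, plus ICU's
dictionary-based handling of Thai); I'm assuming the patch does
something roughly along these lines:

{code:java}
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class UAX29Demo {
  public static void main(String[] args) {
    // "ภาษาไทย" (Thai for "Thai language") has no spaces between words;
    // fixed char-range rules see one long run, while ICU's word break
    // iterator finds the boundary between the two words.
    String text = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
    BreakIterator bi = BreakIterator.getWordInstance(ULocale.ROOT);
    bi.setText(text);
    int start = bi.first();
    int end = bi.next();
    while (end != BreakIterator.DONE) {
      String token = text.substring(start, end);
      if (token.trim().length() > 0) {   // skip whitespace/punct "words"
        System.out.println(token);       // prints two tokens for this input
      }
      start = end;
      end = bi.next();
    }
  }
}
{code}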

Also, StandardAnalyzer goes beyond producing the initial tokens: it
also tries to label things as acronym, host name, number, etc., and it
filters out stop words.

I assume ICUCaseFoldingFilter logically subsumes LowercaseFilter?
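
(To illustrate the difference I have in mind, using ICU4J's
UCharacter.foldCase directly; whether ICUCaseFoldingFilter uses
exactly these mappings is my assumption:)

{code:java}
import com.ibm.icu.lang.UCharacter;

public class FoldDemo {
  public static void main(String[] args) {
    String a = "FUSSBALL";
    String b = "Fu\u00DFball";  // "Fußball", with U+00DF SHARP S
    // Plain lowercasing never touches the sharp s, so these two
    // spellings of the same word don't match:
    System.out.println(a.toLowerCase() + " / " + b.toLowerCase());
    // Unicode default full case folding maps U+00DF to "ss", so they do:
    System.out.println(UCharacter.foldCase(a, true) + " / "
        + UCharacter.foldCase(b, true));
  }
}
{code}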

bq. Especially tricky is the complexity-performance-maintenance balance, i.e. the case-folding
filter could be a lot faster, but then it would have to be updated when a new Unicode version
is released.

I think it's fine to worry about this later.  Correctness is more
important than performance at this point.


> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking
> text into words, especially with respect to non-alphabetic scripts. This is because it is
> unaware of the Unicode word-boundary properties.
> I actually couldn't figure out how the Thai analyzer could possibly be working until I
> looked at the JFlex rules and saw that the codepoint range for most of the Thai block was
> added to the alphanum specification. Defining the exact codepoint ranges like this for
> every language could help with the problem, but you'd basically be reimplementing the
> boundary properties already stated in the Unicode standard.
> In general this kind of behavior is bad in Lucene even for Latin; for instance, the
> analyzer will break words around accent marks in decomposed form. While most Latin
> letter + accent combinations have composed forms in Unicode, some do not. (This is also
> an issue for ASCIIFoldingFilter, I suppose.)
> I've got a partially tested StandardAnalyzer that uses the ICU RuleBasedBreakIterator
> instead of JFlex. Using this method you can define word boundaries according to the
> Unicode boundary properties. After getting it into some good shape I'd be happy to
> contribute it to contrib, but I wonder if there's a better solution so that Lucene will
> be more friendly to non-ASCII text out of the box. Unfortunately it seems JFlex does not
> support use of these properties, such as [\p{Word_Break = Extend}], so this is probably
> the major barrier.
> Thanks,
> Robert
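
To make the decomposed-form issue above concrete, here's a small
sketch (it assumes Java 6's java.text.Normalizer to produce the NFD
form; the printed split is the behavior Robert describes):

{code:java}
import java.io.StringReader;
import java.text.Normalizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NfdSplitDemo {
  public static void main(String[] args) throws Exception {
    // "naïve" in NFD: the diaeresis becomes combining U+0308, which the
    // current grammar does not treat as part of a word.
    String nfd = Normalizer.normalize("na\u00EFve", Normalizer.Form.NFD);
    TokenStream ts = new StandardAnalyzer().tokenStream(
        "f", new StringReader(nfd));
    TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.term());  // prints "nai" then "ve"
    }
    ts.close();
  }
}
{code}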


