lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
Date Wed, 10 Nov 2010 17:59:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930681#action_12930681 ]

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. That is, SimpleAnalyzer is not appropriate for many languages. If it were based upon a
variation of UAX29Tokenizer, but didn't handle NUM or ALPHANUM, but WORD instead, it would
be the same type of token stream, just alpha words.

Ok, now I understand you, and yes, I agree... My question is, should we even bother fixing
it? Would anyone who actually cares about Unicode really want only some hacked subset of
UAX#29?

These simple ones like SimpleAnalyzer, WhitespaceAnalyzer, and StopAnalyzer are all really
bad for Unicode text in different ways, though Simple/Stop are the bigger offenders, I
think, because they will separate a base character from its combining characters (in my
opinion, this should always be avoided) and, worse, will break tokens on those combining
characters.
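
To make that concrete (a minimal sketch against the 3.1-era analysis API; the class and
field names are mine, not anything in Lucene): feed SimpleAnalyzer a decomposed "café" and
the combining acute accent (U+0301) fails the Character.isLetter() test, so the accent is
silently dropped and you index "cafe", while the same text in NFC form would index as
"café":

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CombiningMarkDemo {
  public static void main(String[] args) throws Exception {
    // "café" in NFD form: plain 'e' followed by U+0301 COMBINING ACUTE
    // ACCENT. Character.isLetter('\u0301') is false (category Mn), so a
    // Letter-based tokenizer treats it as a token boundary.
    String nfd = "cafe\u0301";
    TokenStream ts = new SimpleAnalyzer(Version.LUCENE_31)
        .tokenStream("field", new StringReader(nfd));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // prints "cafe": accent is lost
    }
    ts.end();
    ts.close();
  }
}
{code}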

But people using them are probably happy? E.g., you can do what Solr does: use
WhitespaceAnalyzer and follow through with something like WordDelimiterFilter, and it's
mostly OK, depending upon the options, except for cases like CJK, where it's a death trap.
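
Roughly the Solr-style chain I mean (a sketch only; the fieldType name is made up, and
which WordDelimiterFilter options you enable matters a lot):

{code:xml}
<!-- Illustrative field type: whitespace tokenization, then
     WordDelimiterFilter to split/join on intra-word punctuation. -->
<fieldType name="text_ws_wd" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}

For CJK text, though, there is typically no whitespace between words at all, so the
whitespace step never produces sensible tokens for WordDelimiterFilter to fix up.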

Personally, I just don't use these things, since I know the problems, but we could document
"this is simplistic and won't work well for many languages" and keep them around for people
who don't care?

And yeah, I suppose it's confusing that these really "simple" ones are in the .core package,
but to me the package name is meaningless; I was just trying to keep the analyzers arranged
in some kind of order (e.g. pattern-based analysis in the .pattern package, etc.).

We could just as well call the package .basic or .simple or something else; it's just a name.


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide
> language-neutral tokenization.  Lucene contains several language-specific tokenizers that
> should be replaced by the UAX#29-based StandardTokenizer (deprecated in 3.1 and removed
> in 4.0).  The language-specific *analyzers*, by contrast, should remain, because they
> contain language-specific post-tokenization filters.  The language-specific analyzers
> should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond just
> replacing the tokenizer in the language-specific analyzer.
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer and depends on the
> fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner;
> U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary.  Robert
> Muir has suggested using a char filter that converts ZWNJ to spaces prior to
> StandardTokenizer in the converted PersianAnalyzer.
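
A sketch of that suggestion (assuming the 3.x CharFilter API; the class and method names
below are illustrative, not from the attached patch): map U+200C to a space before
StandardTokenizer sees the text, so UAX#29 finds a boundary there.

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class PersianZwnjSketch {
  /** Wraps the reader so ZWNJ becomes a plain space, then tokenizes. */
  public static Tokenizer zwnjAwareTokenizer(Reader reader) {
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " "); // ZERO WIDTH NON-JOINER -> space
    Reader filtered = new MappingCharFilter(map, CharReader.get(reader));
    return new StandardTokenizer(Version.LUCENE_31, filtered);
  }
}
{code}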

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


