lucene-dev mailing list archives

From "DM Smith (JIRA)" <>
Subject [jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
Date Tue, 09 Nov 2010 13:56:08 GMT


DM Smith commented on LUCENE-2747:

bq. DM, can you elaborate here?

I was a bit trigger-happy with the comment. I should have looked at the code rather than the
JIRA comments alone. The old StandardAnalyzer had a kitchen-sink approach to tokenization,
trying to do too much with *modern* constructs, e.g. URLs, email addresses, acronyms, and so on.
It and SimpleAnalyzer would produce about the same stream on "old" English and some other texts,
but StandardAnalyzer was much slower. (I don't remember how much slower, but it was obvious.)

Both of these were weak when it came to non-English/non-Western texts. Thus I could take the
language-specific tokenizers, lists of stop words, and stemmers, and create variations of SimpleAnalyzer
that properly handled a particular language. (I created my own analyzers because I wanted
to make stop words and stemming optional.)
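The sort of analyzer variation described above, with stop-word removal and stemming as opt-in steps, could be sketched in plain Java. This is a conceptual illustration only, not the Lucene Analyzer API; the class name, the letter-run tokenization, and the toy stemmer are all hypothetical stand-ins.

```java
import java.util.*;

// Conceptual sketch (not the Lucene API): a SimpleAnalyzer-style pipeline
// where stop-word removal and stemming are each optional.
public class OptionalFilterAnalyzer {
    private final Set<String> stopWords; // null = no stop-word filtering
    private final boolean stem;          // false = no stemming

    public OptionalFilterAnalyzer(Set<String> stopWords, boolean stem) {
        this.stopWords = stopWords;
        this.stem = stem;
    }

    // SimpleAnalyzer-like tokenization: lowercased runs of letters.
    public List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (tok.isEmpty()) continue;
            if (stopWords != null && stopWords.contains(tok)) continue;
            out.add(stem ? crudeStem(tok) : tok);
        }
        return out;
    }

    // Toy stemmer for illustration only: strips a trailing "s".
    private static String crudeStem(String tok) {
        return tok.length() > 3 && tok.endsWith("s")
                ? tok.substring(0, tok.length() - 1) : tok;
    }
}
```

A real Lucene analyzer would instead compose a Tokenizer with TokenFilters, but the shape of the decision, the same token source with filters switched on or off per language, is the same.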

In looking at the code in trunk (I should have done that before making my comment), I see that
UAX29Tokenizer is duplicated in StandardAnalyzer's JFlex grammar and that ClassicAnalyzer uses the
old JFlex grammar. Also, the new StandardAnalyzer does a lot less.

If I understand the suggestion of this and the other 2 issues, StandardAnalyzer will no longer
handle modern constructs. As I see it, this is what SimpleAnalyzer should be: based on UAX#29
and doing little else. Hence my confusion: is there a point to having SimpleAnalyzer? Shouldn't
UAX29Tokenizer be moved to core? (What is core, anyway?)

And if I understand where this is going: would there be a way to plug in ICUTokenizer as a
replacement for UAX29Tokenizer inside StandardTokenizer, such that all analyzers using StandardTokenizer
would get the alternate implementation?
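The plug-in idea asked about above amounts to making analyzers depend on a tokenizer factory rather than a concrete tokenizer, so that swapping the factory swaps the tokenizer everywhere. The sketch below is plain Java, not Lucene code; the names and the two stand-in implementations (whitespace split vs. lowercasing split, standing in for UAX#29 vs. ICU) are hypothetical.

```java
import java.util.*;
import java.util.function.Function;

// Conceptual sketch of pluggable tokenization (not Lucene code).
public class PluggableTokenization {
    // A "tokenizer" here is just a function from text to tokens.
    public interface TokenizerFactory extends Function<String, List<String>> {}

    // Stand-in for the default (UAX#29-style) implementation.
    public static final TokenizerFactory DEFAULT =
            text -> Arrays.asList(text.trim().split("\\s+"));

    // Stand-in for an alternate (e.g. ICU-backed) implementation:
    // same contract, different behavior (here: it also lowercases).
    public static final TokenizerFactory ALTERNATE =
            text -> Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));

    // An "analyzer" built against the factory interface, not a concrete
    // tokenizer, picks up whichever implementation is plugged in.
    public static List<String> analyze(TokenizerFactory factory, String text) {
        return factory.apply(text);
    }
}
```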

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>                 Key: LUCENE-2747
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide
> language-neutral tokenization. Lucene contains several language-specific tokenizers that
> should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0).
> The language-specific *analyzers*, by contrast, should remain, because they contain language-specific
> post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer
> in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond just replacing
> the tokenizer in the language-specific analyzer.
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the
> fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C),
> but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested
> using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted
> PersianAnalyzer.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
