lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
Date Tue, 09 Nov 2010 14:34:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930137#action_12930137 ]

Robert Muir commented on LUCENE-2747:
-------------------------------------

DM, thanks, I see exactly where you are coming from.

I see your point: previously it was much easier to take something like SimpleAnalyzer and
'adapt' it to a given language based on things like Unicode properties.
In fact that's exactly what we did in the cases here (Arabic, Persian, Hindi, etc.).

But now we can actually tokenize "correctly" for more languages with JFlex, thanks to its
improved Unicode support, and it's superior to these previous hacks :)

To try to answer some of your questions (all my opinion):

bq. Is there a point to having SimpleAnalyzer

I guess so, a lot of people can use this if they have English-only content and are probably
happy with it discarding numbers etc... it's not a big loss to me if it goes though.

bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (4.x codeline) there is no core, contrib, or Solr for analyzer components any more.
They are all combined into modules/analysis.
In branch_3x (3.x codeline) we did not make this rather disruptive refactor: there UAX29Tokenizer
is in fact in Lucene core.

bq. Would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer into StandardTokenizer,
such that all Analyzers using StandardTokenizer would get the alternate implementation?

Personally, I would prefer if we move towards a factory model where things like these supplied
"language analyzers" are actually xml/json/properties snippets.
In other words, they are just example configurations that build your analyzer, like Solr
does.
This is nice, because then you don't have to write code to customize how your analyzer
works.

I think we have been making slow steps towards this, just doing basic things like moving stopword
lists to .txt files.
But I think the next step would be LUCENE-2510, where we have factories/config attribute parsers
for all these analysis components already written.

Then we could have support for declarative analyzer specification via xml/json/.properties/whatever,
and move all these Analyzers to that.
I still think you should be able to code up your own analyzer, but in my opinion this is much
easier and preferred for the ones we supply.
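
To make the "analyzer as a configuration snippet" idea concrete, here is a minimal sketch in plain Java.
It is not Lucene or Solr code: the class name, the property keys (tokenizer, lowercase, stopwords) and
the two tokenizer names are all hypothetical, chosen only to show an analysis chain being driven by a
.properties snippet instead of by a hand-written Analyzer subclass.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

/** Hypothetical sketch: none of these names are Lucene or Solr APIs. */
final class DeclarativeAnalyzerSketch {

    /** Tokenizes text according to a .properties "analyzer spec" (assumed format). */
    public static List<String> analyze(String spec, String text) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(spec));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // The tokenizer is chosen by the config, not by code.
        String tokenizer = config.getProperty("tokenizer", "whitespace");
        String splitPattern;
        if (tokenizer.equals("whitespace")) {
            splitPattern = "\\s+";
        } else if (tokenizer.equals("letter")) {
            splitPattern = "[^\\p{L}]+"; // break on anything that is not a letter
        } else {
            throw new IllegalArgumentException("unknown tokenizer: " + tokenizer);
        }

        // Optional filters, also driven purely by the config.
        boolean lowercase = Boolean.parseBoolean(config.getProperty("lowercase", "false"));
        Set<String> stopwords = new HashSet<>();
        String stop = config.getProperty("stopwords", "").trim();
        if (!stop.isEmpty()) {
            stopwords.addAll(Arrays.asList(stop.split("\\s*,\\s*")));
        }

        List<String> tokens = new ArrayList<>();
        for (String token : text.split(splitPattern)) {
            if (token.isEmpty()) continue;            // leading separator yields ""
            if (lowercase) token = token.toLowerCase(Locale.ROOT);
            if (stopwords.contains(token)) continue;  // stop filter runs after lowercasing
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String spec = "tokenizer=letter\nlowercase=true\nstopwords=the,a\n";
        System.out.println(analyze(spec, "The quick fox, a 2nd time"));
    }
}
```

Changing the behavior here means editing the three-line spec, not the code. Keeping an old spec (and an
old jar) around is what would let someone reproduce the old behavior exactly.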

Also I think this would solve a lot of analyzer-backwards-compat problems, because then our
supplied analyzers are really just configuration file examples,
and we can change our examples however we want... someone can use their old config file (and
hopefully old analysis module jar file!) to guarantee
the exact same behavior if they want.

Finally, most of the benefits of ICUTokenizer are actually in the UAX29 support... the tokenizers
are pretty close with some minor differences:
* the JFlex-based implementation is faster, and better in my opinion.
* the ICU-based implementation allows tailoring, and supplies tailored tokenization for several
complex scripts (JFlex doesn't have this... yet)
* the ICU-based implementation works with all of Unicode; at the moment JFlex is limited to
the Basic Multilingual Plane.
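
As a small illustration of what the Basic Multilingual Plane limit in the last point means on the Java
side: a code point above U+FFFF is stored as two chars (a surrogate pair), so a scanner that works char
by char never sees it as one character. The sketch below uses only java.lang.String and
java.lang.Character; the class and method names are made up for the example, and it does not model how
JFlex or ICU actually work internally.

```java
/** Hypothetical helper, only meant to show BMP vs. supplementary code points. */
final class SupplementaryPlaneSketch {

    /** Returns true if the text contains any code point above U+FFFF. */
    public static boolean hasSupplementary(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // U+10330 GOTHIC LETTER AHSA, written as its UTF-16 surrogate pair.
        String gothic = "\uD800\uDF30";

        // Two UTF-16 code units, but a single Unicode code point.
        System.out.println(gothic.length());
        System.out.println(gothic.codePointCount(0, gothic.length()));

        System.out.println(hasSupplementary(gothic));
        System.out.println(hasSupplementary("Lucene"));
    }
}
```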

In my opinion the last 2 points will probably be resolved eventually... I could see our ICUTokenizer
possibly becoming obsolete down the road
thanks to better JFlex support, though it would probably need hooks into ICU for the
complex script support (so we get it for free from ICU).


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide
> language-neutral tokenization.  Lucene contains several language-specific tokenizers that
> should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0).
> The language-specific *analyzers*, by contrast, should remain, because they contain language-specific
> post-tokenization filters.  The language-specific analyzers should switch to StandardTokenizer
> in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond just replacing
> the tokenizer in the language-specific analyzer.
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the
> fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C),
> but in the UAX#29 word boundary rules, ZWNJ is not a word boundary.  Robert Muir has suggested
> using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted
> PersianAnalyzer.
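
The ZWNJ suggestion in the description above can be sketched in plain Java. This is only an illustration
of the idea, not the Lucene char filter API and not the attached patch: the class and method names are
hypothetical, and a whitespace split stands in for StandardTokenizer. Because one char is replaced by
exactly one char, character offsets into the original text stay valid.

```java
/** Hypothetical sketch of "map ZWNJ to space before tokenizing"; not Lucene code. */
final class ZwnjToSpaceSketch {

    /** Zero-width non-joiner, U+200C. */
    private static final char ZWNJ = '\u200C';

    /** Replaces every ZWNJ with a space so that a later tokenizer breaks there. */
    public static String zwnjToSpace(String text) {
        return text.replace(ZWNJ, ' ');
    }

    /** Stand-in for the real tokenizer: splits the filtered text on whitespace. */
    public static String[] tokenize(String text) {
        return zwnjToSpace(text).trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Persian word written with a ZWNJ between its two parts (escaped for portability).
        String word = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";
        for (String token : tokenize(word)) {
            System.out.println(token);
        }
    }
}
```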

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


