Date: Tue, 9 Nov 2010 08:56:08 -0500 (EST)
From: "DM Smith (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930119#action_12930119 ]

DM Smith commented on LUCENE-2747:
----------------------------------

bq. DM, can you elaborate here?

I was a bit trigger-happy with the comment. I should have looked at the code rather than the JIRA comments alone.

The old StandardAnalyzer took a kitchen-sink approach to tokenization, trying to do too much with *modern* constructs, e.g. URLs, email addresses, acronyms, and so on. It and SimpleAnalyzer would produce about the same token stream on "old" English and some other texts, but StandardAnalyzer was much slower. (I don't remember how much slower, but it was obvious.) Both were weak when it came to non-English/non-Western texts. Thus I could take the language-specific tokenizers, stop word lists, and stemmers and create variations of SimpleAnalyzer that properly handled a particular language. (I created my own analyzers because I wanted to make stop words and stemming optional; a sketch of that pattern follows below.)

Looking at the code in trunk (I should have done that before commenting), I see that the UAX29Tokenizer grammar is duplicated in StandardAnalyzer's JFlex, and that ClassicAnalyzer uses the old JFlex grammar. Also, the new StandardAnalyzer does a lot less. If I understand the suggestion of this and the other two issues, StandardAnalyzer will no longer handle modern constructs. As I see it, this is what SimpleAnalyzer should be: based on UAX#29, doing little else. Thus my confusion: is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)
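For what it's worth, here is a minimal sketch of that pattern against the Lucene 3.1 analysis API. The class name and the optional stop-word/stemming knobs are my own invention, purely for illustration, not code from any Lucene module:

{code:java}
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter; // contrib/snowball
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

/** A SimpleAnalyzer-like analyzer where stop words and stemming are optional. */
public class OptionalFilteringAnalyzer extends Analyzer {
  private final Set<?> stopWords;   // null means "no stop word removal"
  private final String stemmerName; // Snowball stemmer name, e.g. "English"; null means "no stemming"

  public OptionalFilteringAnalyzer(Set<?> stopWords, String stemmerName) {
    this.stopWords = stopWords;
    this.stemmerName = stemmerName;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // UAX#29-based tokenization, then lowercasing, then the optional filters.
    TokenStream ts = new StandardTokenizer(Version.LUCENE_31, reader);
    ts = new LowerCaseFilter(Version.LUCENE_31, ts);
    if (stopWords != null) {
      ts = new StopFilter(Version.LUCENE_31, ts, stopWords);
    }
    if (stemmerName != null) {
      ts = new SnowballFilter(ts, stemmerName);
    }
    return ts;
  }
}
{code}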
And if I understand where this is going: would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer inside StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation?

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements the UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by the UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1.
>
> Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules ZWNJ is not a word boundary. Robert Muir has suggested using a char filter that converts ZWNJ to spaces ahead of StandardTokenizer in the converted PersianAnalyzer.
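As an aside, the char filter idea mentioned above for the converted PersianAnalyzer could look roughly like the following against the Lucene 3.x API. This is my own illustrative sketch built on MappingCharFilter, not the actual PersianAnalyzer code:

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class ZwnjToSpaceExample {
  /** Maps ZWNJ (U+200C) to a space so UAX#29 sees a word boundary there. */
  static TokenStream tokenize(Reader reader) {
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " "); // ZWNJ -> space, applied before tokenization
    Reader filtered = new MappingCharFilter(map, CharReader.get(reader));
    return new StandardTokenizer(Version.LUCENE_31, filtered);
  }
}
{code}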