lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
Date Tue, 09 Nov 2010 10:29:06 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930072#action_12930072 ]

Simon Willnauer commented on LUCENE-2747:
-----------------------------------------

I looked at the patch briefly and the charStream(Reader) extension looks good to me, though
I would make it protected and have it throw an IOException. Since this API is public and folks
will use it in the wild, we need to make sure we don't have to add the exception later, and
that people creating Readers don't have to play tricks just because the interface declares no
IOException. About making it protected: do we need to call it in a non-protected context?
Maybe I'm missing something.
{code}
public Reader charStream(Reader reader) {
  return reader;
}

// should be

protected Reader charStream(Reader reader) throws IOException {
  return reader;
}
{code}
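
To make the concern concrete, here is a minimal sketch (class names are hypothetical, not
from the patch) of the kind of override that can only be written cleanly if the hook
declares IOException, since it has to touch the stream before handing it on:
{code}
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

// Hypothetical stand-in for the class that would expose the hook.
abstract class BaseAnalyzer {
  protected Reader charStream(Reader reader) throws IOException {
    return reader;
  }
}

// An override that peeks at the stream: read()/unread() both declare
// IOException, so without the throws clause the subclass would have to
// swallow or wrap the exception.
class BomSkippingAnalyzer extends BaseAnalyzer {
  @Override
  protected Reader charStream(Reader reader) throws IOException {
    PushbackReader pushback = new PushbackReader(reader, 1);
    int first = pushback.read();
    if (first != -1 && first != '\uFEFF') {
      pushback.unread(first); // not a BOM, put it back
    }
    return pushback;
  }
}
{code}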


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements the UAX#29 word boundary rules to provide
> language-neutral tokenization. Lucene contains several language-specific tokenizers that
> should be deprecated in 3.1 and removed in 4.0 in favor of the UAX#29-based StandardTokenizer.
> The language-specific *analyzers*, by contrast, should remain, because they contain
> language-specific post-tokenization filters. The language-specific analyzers should switch
> to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond just replacing
> the tokenizer in the language-specific analyzer.
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer and depends on the fact
> that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner, U+200C),
> but under the UAX#29 word boundary rules ZWNJ is not a word boundary. Robert Muir has
> suggested converting ZWNJ to spaces with a char filter ahead of StandardTokenizer in the
> converted PersianAnalyzer.
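
A sketch of how that suggestion might look against the Lucene 3.x analysis API (my reading
of the suggestion, not code from the patch or the eventual PersianAnalyzer):
{code}
import java.io.Reader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class ZwnjToSpaceExample {
  // Map ZWNJ to a space before tokenization so that the UAX#29 rules in
  // StandardTokenizer see a word boundary where ArabicLetterTokenizer
  // used to break tokens.
  static Tokenizer persianTokenizer(Reader reader) {
    NormalizeCharMap zwnjMap = new NormalizeCharMap();
    zwnjMap.add("\u200C", " "); // ZWNJ -> space
    Reader filtered = new MappingCharFilter(zwnjMap, CharReader.get(reader));
    return new StandardTokenizer(Version.LUCENE_31, filtered);
  }
}
{code}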

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


