lucene-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer
Date Sun, 21 Jun 2009 00:44:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722283#action_12722283 ]

Koji Sekiguchi commented on LUCENE-1466:
----------------------------------------

Oops. Thanks for the updated patch, Mike!
{quote}
    * Can you add a CHANGES entry describing this new feature, as well
      as the change in type of Tokenizer.input?
    * Can we rename NormalizeMap -> NormalizeCharMap?
    * Could you add some javadocs to NormalizeCharMap,
      MappingCharFilter, BaseCharFilter?
{quote}
Your patch looks good!
{quote}
    * The BaseCharFilter correct method looks spookily costly (has a for
      loop, going backwards for all added mappings). It seems like in
      practice it should not be costly, because typically one corrects
      the offset only for the "current" token? And, one could always
      build their own CharFilter (eg using arrays of ints or something)
      if they needed a more efficient mapping.
{quote}
Yes, users can create their own CharFilter if they need a more efficient mapping.
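
To illustrate the "arrays of ints" idea from the quote, here is a rough sketch of an offset corrector that uses parallel int arrays and a binary search instead of a backwards loop. It is not part of the patch: the class name ArrayOffsetCorrector and its methods are hypothetical, and it assumes mappings are added in strictly increasing offset order, each carrying the cumulative difference (original offset minus filtered offset) that applies from that point on.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; the class and method names are not Lucene's.
final class ArrayOffsetCorrector {
    private int[] offsets = new int[8]; // filtered-text offsets where the diff changes
    private int[] diffs = new int[8];   // cumulative diff that applies from that offset on
    private int size = 0;

    // Assumption: callers add mappings in strictly increasing offset order.
    void addMapping(int offset, int cumulativeDiff) {
        if (size > 0 && offset <= offsets[size - 1]) {
            throw new IllegalArgumentException("offsets must be strictly increasing");
        }
        if (size == offsets.length) {
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = offset;
        diffs[size] = cumulativeDiff;
        size++;
    }

    // Maps an offset in the filtered text back to an offset in the original text.
    int correct(int currentOff) {
        int idx = Arrays.binarySearch(offsets, 0, size, currentOff);
        if (idx < 0) {
            idx = -idx - 2; // index of the last mapping whose offset is below currentOff
        }
        return idx < 0 ? currentOff : currentOff + diffs[idx];
    }
}
```

For example, if "a&amp;b" is filtered to "a&b", a single call addMapping(2, 4) records that filtered offset 2 (the "b") sits at original offset 6. Each lookup then costs O(log n) in the number of mappings rather than a linear scan.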
{quote}
    * MappingCharFilter's match method is recursive. But I think the
      depth of that recursion equals the length of the character sequence
      that's being mapped, right? So the risk of stack overflow should be
      basically zero, unless someone does some insanely long character
      string mappings?
{quote}
You are correct.
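
As a rough illustration of why the depth is bounded, here is a minimal sketch of a recursive longest-match lookup over a character trie. It is not the code from the patch: the class RecursiveCharMap and the deepestCall counter are hypothetical and exist only to make the recursion depth observable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch of a recursive longest-match lookup.
final class RecursiveCharMap {
    private final Map<Character, RecursiveCharMap> children = new HashMap<>();
    private String replacement; // non-null if a mapped sequence ends at this node
    int deepestCall = 0;        // set on the root; deepest recursion level of the last lookup

    void add(String from, String to) {
        RecursiveCharMap node = this;
        for (int i = 0; i < from.length(); i++) {
            node = node.children.computeIfAbsent(from.charAt(i), c -> new RecursiveCharMap());
        }
        node.replacement = to;
    }

    // Returns the replacement for the longest mapped sequence starting at pos, or null.
    String match(CharSequence input, int pos) {
        deepestCall = 0;
        return match(this, input, pos, 0);
    }

    private String match(RecursiveCharMap node, CharSequence input, int pos, int depth) {
        deepestCall = Math.max(deepestCall, depth);
        String longer = null;
        if (pos < input.length()) {
            RecursiveCharMap child = node.children.get(input.charAt(pos));
            if (child != null) {
                // One level of recursion per matched character, so the depth
                // can never exceed the length of the longest mapped sequence.
                longer = match(child, input, pos + 1, depth + 1);
            }
        }
        return longer != null ? longer : node.replacement;
    }
}
```

With the mappings "ab" and "abcd" registered, a lookup can recurse at most four levels deep, however long the input is.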

{quote}
I think we should make an exception to back-compat here, and simply
change TokenStream.input from Reader to CharStream (subclasses
Reader). Properly respecting back-compat will be a lot of work, and,
if external subclasses are directly assigning to input, they really
ought to be using reset(Reader) instead.
{quote}
I agree with you, Mike.
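
To make the back-compat trade-off concrete, here is a small sketch of a tokenizer whose input field is narrowed from Reader to a Reader subclass. All names (SketchCharStream, PlainReaderStream, SketchTokenizer) are hypothetical stand-ins, not the classes in the patch. A subclass that assigns a plain Reader to input would no longer compile, while callers that go through reset(Reader) keep working because the method wraps the reader when needed.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical stand-in for a Reader subclass that can correct offsets.
abstract class SketchCharStream extends Reader {
    abstract int correctOffset(int currentOff);
}

// Wraps a plain Reader; no characters are changed, so offsets pass through as-is.
final class PlainReaderStream extends SketchCharStream {
    private final Reader in;

    PlainReaderStream(Reader in) {
        this.in = in;
    }

    @Override
    int correctOffset(int currentOff) {
        return currentOff;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        return in.read(buf, off, len);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

// Hypothetical tokenizer whose input field has the narrower type.
class SketchTokenizer {
    protected SketchCharStream input;

    // Accepts any Reader, so existing callers of reset(Reader) are unaffected.
    void reset(Reader reader) {
        input = (reader instanceof SketchCharStream)
                ? (SketchCharStream) reader
                : new PlainReaderStream(reader);
    }
}
```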

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-1466
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1466-back.patch, LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch
>
>
> This proposes importing the CharFilter that was introduced in Solr 1.4.
> Please see the following for details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

