lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2098) make BaseCharFilter more efficient in performance
Date Tue, 16 Mar 2010 10:13:27 GMT


Michael McCandless commented on LUCENE-2098:

Ahh ok.

Probably we should switch to parallel arrays here, to make it very fast... yes this will consume
RAM (8 bytes per position, if we keep all of them).
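The parallel-arrays idea could look something like the sketch below. This is only an illustration of the approach, not the actual Lucene implementation; the class and field names are hypothetical. Two growable int arrays record, for each correction point, the output offset where it takes effect and the cumulative diff to add, which is where the "8 bytes per position" figure comes from; a binary search then maps an offset back.

```java
import java.util.Arrays;

// Hypothetical sketch of the parallel-arrays idea (names are illustrative,
// not the real BaseCharFilter API): each correction is one slot in two
// int arrays, and correctOffset() binary-searches for the last correction
// at or before the given offset.
class ParallelArrayOffsets {
    private int[] offsets = new int[16]; // output offsets where a correction starts
    private int[] diffs = new int[16];   // cumulative diff to apply from that offset on
    private int size = 0;

    void addOffCorrectMap(int off, int cumulativeDiff) {
        if (size == offsets.length) {
            // grow both arrays in lockstep
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = off;
        diffs[size] = cumulativeDiff;
        size++;
    }

    int correctOffset(int currentOff) {
        // binary search for the last entry with offsets[i] <= currentOff
        int lo = 0, hi = size - 1, ans = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offsets[mid] <= currentOff) {
                ans = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return ans < 0 ? currentOff : currentOff + diffs[ans];
    }
}
```

Compared with allocating an object per call to addOffCorrectMap, this costs exactly two ints per recorded position and no per-call garbage.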

Really, most apps do not need all positions stored; they typically only need to see the
current token.  So maybe we could make a filter that takes a "lookbehind size" and only
keeps that number of mappings cached?  That'd have to be > the max size of any token you
may analyze, so it's hard to bound perfectly, but eg setting it to the max token length allowed
by IndexWriter would guarantee that we'd never have a miss?
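A lookbehind-bounded cache could be sketched roughly as below. Again, this is a hypothetical illustration, not Lucene code: once more than `lookbehind` mappings are buffered, the oldest is evicted, so memory stays constant regardless of input length, at the cost of the "miss" case for offsets that fall behind the retained window.

```java
import java.util.ArrayDeque;

// Illustrative sketch (not the Lucene API) of a lookbehind-bounded
// correction cache: only the newest `lookbehind` mappings are retained.
class BoundedOffsets {
    private final ArrayDeque<int[]> mappings = new ArrayDeque<>(); // {offset, cumulativeDiff}
    private final int lookbehind;

    BoundedOffsets(int lookbehind) {
        this.lookbehind = lookbehind;
    }

    void addOffCorrectMap(int off, int cumulativeDiff) {
        mappings.addLast(new int[] {off, cumulativeDiff});
        if (mappings.size() > lookbehind) {
            // assume the consumer has moved past offsets this far behind
            mappings.removeFirst();
        }
    }

    int correctOffset(int currentOff) {
        int diff = 0;
        for (int[] m : mappings) {
            if (m[0] > currentOff) break;
            diff = m[1];
        }
        // if currentOff predates the oldest retained mapping, diff stays 0:
        // that is the "miss" case -- the offset gets no correction
        return currentOff + diff;
    }
}
```

Sizing `lookbehind` to at least the maximum token length means every offset the consumer can still ask about is covered, which is the guarantee described above.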

For analyzers that buffer tokens... they'd have to set this max to infinity, or, ensure they
remap the offsets before capturing the token's state?

> make BaseCharFilter more efficient in performance
> -------------------------------------------------
>                 Key: LUCENE-2098
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-2098.patch
> Performance degradation in Solr 1.4 was reported. See:
> The inefficiency has been pointed out in BaseCharFilter javadoc by Mike:
> {panel}
> NOTE: This class is not particularly efficient. For example, a new class instance is
> created for every call to addOffCorrectMap(int, int), which is then appended to a private
> {panel}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

