lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2450) Explore write-once attr bindings in the analysis chain
Date Fri, 07 May 2010 15:10:48 GMT


Michael McCandless commented on LUCENE-2450:

TokenFilters that need to capture & restore state should also be more efficient, since
they only need to capture attrs "live" at their point in the pipeline.  Ie, the attrs added
by future stages in the pipeline are not visible to it and don't need to be captured.

Also, when custom attrs are added to the pipeline solely for communication b/w stages, the
consuming stage can remove the attr from the namespace of all stages after it.

> Explore write-once attr bindings in the analysis chain
> ------------------------------------------------------
>                 Key: LUCENE-2450
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Michael McCandless
>         Attachments:
> I'd like to propose a new means of tracking attrs through the analysis
> chain, whereby a given stage in the pipeline cannot overwrite attrs
> from stages before it (write once).  It can only write to new attrs
> (possibly w/ the same name) that future stages can see; it can never
> alter the attrs or bindings from the prior stages.
> I coded up a prototype chain in python (I'll attach), showing the
> equivalent of WhitespaceTokenizer -> StopFilter -> SynonymFilter ->
> Indexer.
> Each stage "sees" a frozen namespace of attr bindings as its input;
> these attrs are all read-only from its standpoint.  Then, it writes to
> an "output namespace", which is read/write, eg it can add new attrs,
> remove attrs from its input, change the values of attrs.  If that
> stage doesn't alter a given attr it "passes through", unchanged.
> This would be an enormous change to how attrs are managed... so this
> is very very exploratory at this point.  Once we decouple indexer from
> analysis, creating such an alternate chain should be possible -- it'd
> at least be a good test that we've decoupled enough :)
> I think the idea offers some compelling improvements over the "global
> read/write namespace" (AttrFactory) approach we have today:
>   * Injection filters can be more efficient -- they need not
>     capture/restoreState at all
>   * No more need for the initial tokenizer to "clear all attrs" --
>     each stage becomes responsible for clearing the attrs it "owns"
>   * You can truly stack stages (vs having to make a custom
>     AttrFactory) -- eg you could make a Bocu1 stage which can stack
>     onto any other stage.  It'd look up the CharTermAttr, remove it
>     from its output namespace, and add a BytesRefTermAttr.
>   * Indexer should be more efficient, in that it doesn't need to
>     re-get the attrs on each next() -- it gets them up front, and
>     re-uses them.
> Note that in this model, the indexer itself is just another stage in
> the pipeline, so you could do some wild things like use 2 indexer
> stages (writing to different indexes, or maybe the same index but
> somehow with further processing or something).
> Also, in this approach, the analysis chain is more informed about the
> what each stage is allowed to change, up front after the chain is
> created.  EG (say) we will know that only 2 stages write to the term
> attr, and that only 1 writes posIncr/offset attrs, etc.  Not sure
> if/how this helps us... but it's more strongly typed than what we have
> today.
> I think we could use a similar chain for processing a document at the
> field level, ie, different stages could add/remove/change different
> fields in the doc....

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message