lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <>
Subject [jira] [Commented] (LUCENE-2309) Fully decouple IndexWriter from analyzers
Date Fri, 22 Jul 2011 11:20:58 GMT


Uwe Schindler commented on LUCENE-2309:

bq. I think Robert has stated here that he's comfortable continuing to use TokenStream as
the API for IW to get the terms it indexes, is that what others feel too? I agree the inverted
API I proposed is a little convoluted and I'm sure we can come up with a simple Consumable-like
abstraction (which Robert did also suggest above). But if people are content with TokenStream
then there's no need.

I feel the same. The API of TokenStream is so stupid-simple, why replace it with another push-like
API that is neither simpler nor more complicated, just different? I see no reason for this. IW
should simply request a TokenStream from the field and consume it.

bq. Likewise, for multi-valued fields, IW shouldn't "see" the separate
values; it should just receive a single token stream, and under the
hood (in Document/Field impl) it's concatenating separate token
streams, adding posIncr/offset gaps, etc. This too is now hardwired
in indexer but shouldn't be. Maybe an app wants to insert custom
"separator" tokens between the values...

I agree with that, too. There is one problem with this: concatenating TokenStreams is not easy
to do, as they have different attribute instances, so IW, having fetched all attributes at the start,
would then somehow have to switch to different attribute instances in the middle of the TS.

To implement this fast (without wrapping and copying), we would need some notification that the
consumer of a TokenStream has to "request" the attribute instances again, but this is a
"bad" idea. For me the only simple solution to this problem is to make the Field return an
iterator of TokenStreams; IW consumes them one after another, calling addAttribute
before consuming each separate instance.
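
A minimal sketch of that consumer loop, to make the idea concrete. It does not use Lucene's
classes: SimpleTokenStream and TermAttr are hypothetical stand-ins for TokenStream and a term
attribute, and consume() stands in for the indexing code in IW. The only point shown is that the
attribute instance is fetched again for every stream the field hands out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PerValueStreamSketch {

    /** Hypothetical stand-in for a term attribute; not Lucene's CharTermAttribute. */
    static final class TermAttr {
        String term;
    }

    /** Hypothetical stand-in for TokenStream: every instance owns its own attribute. */
    static final class SimpleTokenStream {
        private final TermAttr termAttr = new TermAttr();
        private final Iterator<String> words;

        SimpleTokenStream(String text) {
            this.words = Arrays.asList(text.split(" ")).iterator();
        }

        /** Mirrors the idea of addAttribute: returns this stream's own instance. */
        TermAttr addAttribute() {
            return termAttr;
        }

        /** Advances to the next token and writes it into the attribute. */
        boolean incrementToken() {
            if (!words.hasNext()) {
                return false;
            }
            termAttr.term = words.next();
            return true;
        }
    }

    /** Consumer side: request the attribute again for each stream, then drain it. */
    static List<String> consume(Iterator<SimpleTokenStream> streams) {
        List<String> indexed = new ArrayList<>();
        while (streams.hasNext()) {
            SimpleTokenStream ts = streams.next();
            TermAttr term = ts.addAttribute(); // re-requested per stream, never cached across streams
            while (ts.incrementToken()) {
                indexed.add(term.term);
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Two values of one multi-valued field, each with its own stream.
        List<SimpleTokenStream> values = Arrays.asList(
                new SimpleTokenStream("quick brown"),
                new SimpleTokenStream("lazy dog"));
        System.out.println(consume(values.iterator()));
    }
}
```

Because the consumer never holds an attribute reference across two streams, no notification
mechanism and no wrapping or copying is needed in this sketch.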

About the PosIncr gap: the field can change the final offsets/posIncr in end() before handing
over to a new TokenStream. IW would only consume TokenStreams one by one.
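
A rough sketch of how the gap could travel through end(), again with hypothetical stand-in types
rather than Lucene's classes. GapTokenStream and its gapAfter parameter are assumptions made for
illustration; the consumer only adds whatever increment the stream reports after end(), so the
gap policy stays with the field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class GapInEndSketch {

    /** Hypothetical stand-in for a TokenStream exposing a term and a position increment. */
    static final class GapTokenStream {
        String term;   // current term, valid after incrementToken() returns true
        int posIncr;   // increment of the current token, or the trailing gap after end()
        private final Iterator<String> words;
        private final int gapAfter; // chosen by the field, unknown to the consumer

        GapTokenStream(String text, int gapAfter) {
            this.words = Arrays.asList(text.split(" ")).iterator();
            this.gapAfter = gapAfter;
        }

        boolean incrementToken() {
            if (!words.hasNext()) {
                return false;
            }
            term = words.next();
            posIncr = 1;
            return true;
        }

        /** Called once after the last token; the field publishes its gap here. */
        void end() {
            posIncr = gapAfter;
        }
    }

    /** Consumer side: drains the streams one by one and records term@position. */
    static List<String> consume(List<GapTokenStream> streams) {
        List<String> postings = new ArrayList<>();
        int position = -1;
        for (GapTokenStream ts : streams) {
            while (ts.incrementToken()) {
                position += ts.posIncr;
                postings.add(ts.term + "@" + position);
            }
            ts.end();
            position += ts.posIncr; // gap supplied by the field via end()
        }
        return postings;
    }

    public static void main(String[] args) {
        List<GapTokenStream> values = Arrays.asList(
                new GapTokenStream("a b", 100),
                new GapTokenStream("c", 0));
        System.out.println(consume(values)); // prints [a@0, b@1, c@102]
    }
}
```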

> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>                 Key: LUCENE-2309
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>         Attachments: LUCENE-2309-analyzer-based.patch, LUCENE-2309.patch
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a Reader,
> String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).
