lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
Date Fri, 12 Mar 2010 14:20:27 GMT


Shai Erera commented on LUCENE-2309:

bq. We should really move back to JIRA / devlist for such discussions

+1 !! I also find it very hard to track so many sources of discussion (JIRA, java-dev, java-user,
general, and now IRC). Also, IRC is not logged/archived or searchable (I think?), which makes
it impossible to trace back a discussion or to randomly stumble upon it in Google.

I'd like to add my two cents here - we've just recently changed the TokenStream API, but
we still kept its concept, i.e. IW consumes tokens; only the API has changed slightly.
The proposals here w/ the AttConsumer/Acceptor, where IW delegates itself to a Field
so that the Field calls back into IW, seem far too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources
should not care how they are indexed/stored etc. Forcing them to implement this push logic
to IW seems to me like really unnecessary overhead and complexity.

And having the Field control the flow of indexing also seems dangerous ... it might expose Lucene
to lots of bugs introduced by users. Today, when IW controls it, there's one place to look, but tomorrow,
when the Field controls it, where do we look? In the app's custom Field code? In IW's
attribute-consuming methods?

Will the Field also control how stored fields are added? Or only AttributeSourced ones?

Maybe I need to get used to this change, but currently it looks wrong to reverse the control
flow. Maybe in principle the DocInverter now accepts tokens from IW, and so it looks as if
we can pass it to the Field (as IW's AttAcceptor), but still the concept is different. We
(IW) control the indexing flow, and not the user.
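
To make the two control flows concrete, here is a minimal sketch in Java. None of the types below are Lucene classes: `PullField`, `PushField`, `Writer`, and `whitespaceTokens` are hypothetical names made up for illustration, and plain strings stand in for attributes. It only shows who owns the loop in each model, not how IW or DocInverter actually work.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical types for illustration only; none of these are Lucene classes.
final class ControlFlowSketch {

    /** Pull model: the field only exposes tokens; the writer drives the loop. */
    interface PullField {
        Iterator<String> tokens();
    }

    /** Push model: the field drives the loop and calls back into the writer. */
    interface PushField {
        void pushTokens(Consumer<String> acceptor);
    }

    /** Stand-in for the writer; it simply records the tokens it receives. */
    static final class Writer {
        final List<String> indexed = new ArrayList<>();

        // Writer-controlled flow: the iteration lives here, in one place.
        void addPull(PullField field) {
            Iterator<String> it = field.tokens();
            while (it.hasNext()) {
                indexed.add(it.next());
            }
        }

        // Field-controlled flow: the writer hands over an acceptor and the
        // field decides when, and how often, to call it.
        void addPush(PushField field) {
            field.pushTokens(indexed::add);
        }
    }

    /** Toy tokenizer: splits on whitespace and drops empty strings. */
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Writer writer = new Writer();
        writer.addPull(() -> whitespaceTokens("fully decouple").iterator());
        writer.addPush(acceptor -> whitespaceTokens("from analyzers").forEach(acceptor));
        System.out.println(writer.indexed); // prints [fully, decouple, from, analyzers]
    }
}
```

In the pull variant a bug in the loop can only live in `Writer.addPull`; in the push variant the loop lives in whatever user code implements `PushField`, which is the concern raised above.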

I also may not understand what that will give users. Shouldn't users get enough flexibility
w/ the current API and the Flex (once out) stuff? Do they really need to be bothered w/ pushing
tokens to IW?

> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>                 Key: LUCENE-2309
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a Reader,
> String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.