lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
Date Fri, 12 Mar 2010 14:31:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844515#action_12844515 ]

Uwe Schindler commented on LUCENE-2309:
---------------------------------------

bq. I'd like to donate my two cents here - we've just recently changed the TokenStream API,
but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly.
The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field,
so the Field will call back to IW seems too much complicated to me. Users that write Analyzers/TokenStreams/AttributeSources,
should not care how they are indexed/stored etc. Forcing them to implement this push logic
to IW seems to me like a real unnecessary overhead and complexity.

The idea was not to change this behaviour, but to also give the user the possibility to reverse
it. For some TokenStreams this would simplify things considerably. The current IndexWriter code works
exactly like that:
# DocInverter gets TokenStream
# DocInverter calls reset() -- to be left out and moved to field/analyzer
# DocInverter runs a while loop over incrementToken(), i.e. it iterates. For each token it calls add()
on the field consumer
# DocInverter calls end() and updates end offset
# DocInverter calls close() -- to be left out and moved to field/analyzer
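
The five steps above can be sketched roughly as follows. This is only a simplified illustration: SimpleTokenStream, FieldConsumer and invert() are hypothetical stand-ins written for this sketch, not Lucene's real TokenStream or DocInverter classes, and the token is reduced to a plain String.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-ins; these are NOT Lucene's real classes.
final class PullModelSketch {

    /** Minimal stand-in for a token stream: the consumer pulls tokens from it. */
    interface SimpleTokenStream {
        void reset();
        boolean incrementToken(); // advance to the next token; false when exhausted
        String term();            // text of the current token
        void end();
        void close();
    }

    /** Stand-in for the per-field consumer inside the indexer. */
    interface FieldConsumer {
        void add(String term);
    }

    /** The indexer drives everything: it owns the loop and the stream lifecycle. */
    static void invert(SimpleTokenStream stream, FieldConsumer consumer) { // step 1: gets the stream
        stream.reset();                        // step 2
        while (stream.incrementToken()) {      // step 3: the indexer iterates...
            consumer.add(stream.term());       // ...and hands each token to the consumer
        }
        stream.end();                          // step 4
        stream.close();                        // step 5
    }

    /** Runs invert() over the given terms and records the order of the calls. */
    static List<String> demo(List<String> terms) {
        List<String> log = new ArrayList<>();
        SimpleTokenStream stream = new SimpleTokenStream() {
            private Iterator<String> it;
            private String current;
            public void reset() { it = terms.iterator(); log.add("reset"); }
            public boolean incrementToken() {
                if (!it.hasNext()) return false;
                current = it.next();
                return true;
            }
            public String term() { return current; }
            public void end() { log.add("end"); }
            public void close() { log.add("close"); }
        };
        invert(stream, term -> log.add("add:" + term));
        return log;
    }
}
```

Calling PullModelSketch.demo(List.of("foo", "bar")) records the call order reset, add:foo, add:bar, end, close, which mirrors steps 2 to 5.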

The change is simply that step (3) is removed from DocInverter, which then only provides the add()
method for accepting tokens. The existing while loop simply moves into the TokenStream/Field
code, so nobody needs to change their code. But somebody who actively wants to push tokens
can now do so. Today that is not possible without heavy buffering.
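
A rough sketch of the reversed control flow, under the same caveats: TokenAcceptor, PushableField, parseWords() and index() are hypothetical names invented for this illustration and are not taken from any patch on this issue. The indexer side only offers add(); the default path keeps the loop on the field side, while the expert path lets a callback-driven producer push tokens directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; all names are illustrative, not a real or proposed Lucene API.
final class PushModelSketch {

    /** All the indexer side exposes here: a callback that accepts one token. */
    interface TokenAcceptor {
        void add(String term);
    }

    /** A field decides for itself how its tokens reach the acceptor. */
    interface PushableField {
        void pushTokens(TokenAcceptor acceptor);
    }

    /** Default path: the old while loop, now living on the field side. */
    static PushableField fromIterable(Iterable<String> tokens) {
        return acceptor -> {
            for (String t : tokens) {
                acceptor.add(t);
            }
        };
    }

    /** Stand-in for an event-driven producer that reports words via a callback. */
    static void parseWords(String text, Consumer<String> onWord) {
        for (String part : text.split("\\s+")) {
            if (!part.isEmpty()) {
                onWord.accept(part);
            }
        }
    }

    /** Expert path: the producer's callbacks are forwarded without buffering. */
    static PushableField fromCallbackSource(String text) {
        return acceptor -> parseWords(text, acceptor::add);
    }

    /** Indexer side: hands its add() callback to the field and collects tokens. */
    static List<String> index(PushableField field) {
        List<String> accepted = new ArrayList<>();
        field.pushTokens(accepted::add);
        return accepted;
    }
}
```

In this sketch, existing pull-style sources go through fromIterable() unchanged, while fromCallbackSource() shows how a producer that already works with callbacks could feed the indexer without first collecting its tokens.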

So the push API will be very expert, and the current TokenStream is just one user of this API.

> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>                 Key: LUCENE-2309
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2309
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a Reader,
> String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



