lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <>
Subject [jira] [Resolved] (LUCENE-2935) Let Codec consume entire document
Date Thu, 09 Jun 2011 10:48:58 GMT


Simon Willnauer resolved LUCENE-2935.

    Resolution: Fixed

The main infrastructure has been committed to the docvalues branch; moving out here.

> Let Codec consume entire document
> ---------------------------------
>                 Key: LUCENE-2935
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs, core/index
>    Affects Versions: CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: CSF branch, 4.0
> Currently the codec API is limited to consuming Terms & Postings upon a segment flush.
> To enable stored fields & DocValues to make use of the Codec abstraction, codecs should
> allow a consumer to be pulled ahead of flush time and to consume all values from a
> document's fields through a consumer API. An alternative to consuming the entire document
> would be extending FieldsConsumer to return a StoredValueConsumer / DocValuesConsumer side
> by side with the TermsConsumer, as is done in the DocValues branch right now. Yet extending
> this has proven to be very tricky and error-prone for several reasons:
> * FieldsConsumer requires a SegmentWriteState, which might differ at flush time from the
>   one in effect when the document is consumed. The SegmentWriteState must therefore be
>   created twice: (1) when the first docvalues field is indexed and (2) at flush.
> * FieldsConsumers are currently pulled for each indexed field, whether or not there are
>   terms to be indexed. Yet if we use something like DocValuesCodec, which essentially wraps
>   another codec and creates FieldConsumer on demand, the wrapped codec's consumer might not
>   be initialized even if the field is indexed. This causes problems once such a field is
>   opened but the files required by that codec are missing. I added some harsh logic to work
>   around this, which should not be necessary.
> * SegmentCodecs are created for each SegmentWriteState, which might yield wrong codec
>   IDs depending on how field numbers are assigned. We currently depend on the fact that all
>   fields for a segment, and therefore their codecs, are known when SegmentCodecs are built.
>   To enable consuming per-document values in codecs, we need to do that incrementally.
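
The last point suggests that codec IDs would have to be handed out as fields first appear, rather than computed from a complete field list. A minimal sketch of that idea follows; `IncrementalCodecIds` and its methods are hypothetical names chosen for illustration and are not Lucene's `SegmentCodecs` API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry (not Lucene's SegmentCodecs): assigns codec IDs on first sight. */
class IncrementalCodecIds {
    private final Map<String, Integer> idByCodecName = new HashMap<>();
    private final List<String> codecNameById = new ArrayList<>();
    private final Map<String, Integer> idByField = new HashMap<>();

    /** Registers a field with its codec name and returns that codec's ID. */
    int register(String field, String codecName) {
        Integer id = idByCodecName.get(codecName);
        if (id == null) {
            // First time this codec is seen: the next free ID is its permanent ID.
            id = codecNameById.size();
            codecNameById.add(codecName);
            idByCodecName.put(codecName, id);
        }
        idByField.put(field, id);
        return id;
    }

    /** Returns the codec name registered for a field, or null if the field is unknown. */
    String codecFor(String field) {
        Integer id = idByField.get(field);
        return id == null ? null : codecNameById.get(id);
    }
}
```

Because an ID is fixed the first time a codec is seen, a field that shows up late in the segment cannot shift the IDs already given to earlier fields.
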
> Codecs should instead provide a DocumentConsumer, side by side with the FieldsConsumer,
> that is created prior to flush. This is also a prerequisite for LUCENE-2621.
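
To make the proposal concrete, here is a rough sketch of a codec that exposes a per-document consumer alongside a flush-time consumer. Every type below (`SketchCodec`, `SketchDocumentConsumer`, `SketchFieldsConsumer`, `InMemorySketchCodec`) is a hypothetical stand-in invented for illustration. The method names and signatures are assumptions and do not reflect what was committed to the docvalues branch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-document consumer: pulled before flush, fed as each document is indexed. */
interface SketchDocumentConsumer {
    void startDocument(int docId);
    void addValue(String field, Object value); // e.g. a stored field or DocValues value
    void finishDocument();
}

/** Hypothetical stand-in for the flush-time FieldsConsumer (terms and postings). */
interface SketchFieldsConsumer {
    void addTerm(String field, String term, List<Integer> docIds);
}

/** Hypothetical codec exposing both consumers side by side. */
interface SketchCodec {
    SketchDocumentConsumer documentConsumer();
    SketchFieldsConsumer fieldsConsumer();
}

/** Toy in-memory codec that records what each consumer received. */
class InMemorySketchCodec implements SketchCodec {
    final List<String> perDocLog = new ArrayList<>();
    final List<String> postingsLog = new ArrayList<>();

    @Override
    public SketchDocumentConsumer documentConsumer() {
        return new SketchDocumentConsumer() {
            private int current = -1;
            @Override public void startDocument(int docId) { current = docId; }
            @Override public void addValue(String field, Object value) {
                perDocLog.add("doc " + current + ": " + field + "=" + value);
            }
            @Override public void finishDocument() { current = -1; }
        };
    }

    @Override
    public SketchFieldsConsumer fieldsConsumer() {
        return (field, term, docIds) -> postingsLog.add(field + ":" + term + " -> " + docIds);
    }
}

class DocumentConsumerSketch {
    /** Indexes two tiny documents, then "flushes" the buffered postings. */
    static InMemorySketchCodec run() {
        InMemorySketchCodec codec = new InMemorySketchCodec();
        SketchDocumentConsumer perDoc = codec.documentConsumer(); // pulled ahead of flush
        Map<String, List<Integer>> buffered = new TreeMap<>();     // term -> doc IDs

        String[] titles = {"lucene", "codec"};
        for (int docId = 0; docId < titles.length; docId++) {
            perDoc.startDocument(docId);
            perDoc.addValue("title", titles[docId]); // per-document value consumed immediately
            perDoc.finishDocument();
            buffered.computeIfAbsent(titles[docId], t -> new ArrayList<>()).add(docId);
        }

        SketchFieldsConsumer fields = codec.fieldsConsumer();      // pulled only at flush
        buffered.forEach((term, docs) -> fields.addTerm("title", term, docs));
        return codec;
    }

    public static void main(String[] args) {
        InMemorySketchCodec codec = run();
        codec.perDocLog.forEach(System.out::println);
        codec.postingsLog.forEach(System.out::println);
    }
}
```

In this sketch the per-document consumer sees values while documents are still being added, whereas terms are buffered and handed to the flush-time consumer in one pass. That split is the one the issue argues for.
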

This message is automatically generated by JIRA.