lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <>
Subject [jira] Commented: (LUCENE-2700) Expose DocValues via Fields
Date Sat, 30 Oct 2010 12:12:19 GMT


Simon Willnauer commented on LUCENE-2700:

Some of you might have followed the recent commit on the branch; some didn't, so I will
sum up what has happened so far.
I integrated the currently named "DocValues" (we might need to rename it to something like
"PerDocValues" due to the naming conflict with function queries - I will wait for other suggestions
though) into the 4-dimensional Flex API and in turn changed the FieldsConsumer and FieldsProducer
interfaces to accept a new "DocValuesConsumer" / "DocValuesProducer" (implementing Fields)
respectively. We have a default implementation for both of them, but neither is used
by the "Term / Postings" codecs yet. I added a DocValuesCodec which wraps any other codec
and forwards to it if a TermsConsumer / Producer is requested. The test case already uses
a random codec wrapped by DocValuesCodec, so they are ultimately pluggable.
DocValues are supported at the SegmentReader as well as the DirectoryReader level, i.e. they are
integrated into MultiFields in the same way as Terms / DocsEnum etc. are.
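
To make the wrapping idea concrete, here is a minimal sketch of a codec that forwards term requests to the codec it wraps and handles per-document values itself. Every type in it (SketchCodec, TermsSink, PerDocSink, RecordingCodec, WrappingDocValuesCodec) is a hypothetical, simplified stand-in invented for illustration; none of it is the actual Flex API or the code on the branch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins; these are NOT Lucene's real types.
interface TermsSink {
    void addTerm(String term);
}

interface PerDocSink {
    void addValue(int docId, long value);
}

interface SketchCodec {
    TermsSink termsConsumer(String field);
}

// Stands in for any existing term / postings codec; it only records what it receives.
class RecordingCodec implements SketchCodec {
    final List<String> terms = new ArrayList<>();

    @Override
    public TermsSink termsConsumer(String field) {
        return term -> terms.add(field + ":" + term);
    }
}

// Wraps another codec: term requests are forwarded, per-document values are handled here.
class WrappingDocValuesCodec implements SketchCodec {
    private final SketchCodec wrapped;
    final List<String> values = new ArrayList<>();

    WrappingDocValuesCodec(SketchCodec wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public TermsSink termsConsumer(String field) {
        // Forward unchanged, so any codec can be plugged in underneath.
        return wrapped.termsConsumer(field);
    }

    PerDocSink docValuesConsumer(String field) {
        return (docId, value) -> values.add(field + "[" + docId + "]=" + value);
    }
}
```

Because the wrapper only depends on the SketchCodec interface, any codec can sit underneath it, which is the property the randomized test relies on.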

I ran into one rather big issue while integrating a "PerDoc" consumer / producer into Codec.
When a codec instantiates a FieldsConsumer, most of the codecs already create all necessary
"resources" to consume terms and postings. This is problematic since PerDocConsumers are
created way before the segment is flushed, while "TermConsumer" instances are created / needed
only before / during flush. So in the case of DocValues I pass the SegmentsWriteState into
Codec#fieldsConsumer(..), and once the segment is flushed DocumentsWriter creates another one,
which in turn fails since the files for this codec / consumer have already been created. The
solution I have implemented / hacked :) for now is to initialize the wrapped codec lazily with
the SegmentsWriteState passed to Codec#fieldsConsumer(..) before the flush. This only works as
long as nobody tries to get a TermsConsumer before we are ready to flush, which is kind of flaky.
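
The lazy initialization described above could look roughly like the following sketch. EagerTermsResources and LazyFieldsConsumer are hypothetical names made up for this example, and the "resources" are simulated; this is not the code on the branch, only an illustration of deferring creation of the wrapped consumer until terms are first requested.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a term / postings consumer that allocates its
// files as soon as it is constructed.
class EagerTermsResources {
    EagerTermsResources() {
        // A real consumer would create its output files here.
    }
}

// Defers building the wrapped consumer until the first request for it.
class LazyFieldsConsumer {
    private final Supplier<EagerTermsResources> factory;
    private EagerTermsResources delegate; // stays null until terms are requested
    private int perDocValuesSeen = 0;

    LazyFieldsConsumer(Supplier<EagerTermsResources> factory) {
        this.factory = factory;
    }

    // Per-document values can be consumed without touching the delegate.
    void addDocValue(int docId, long value) {
        perDocValuesSeen++;
    }

    // The first call creates the wrapped consumer; later calls reuse it.
    EagerTermsResources termsConsumer() {
        if (delegate == null) {
            delegate = factory.get();
        }
        return delegate;
    }

    boolean isInitialized() {
        return delegate != null;
    }

    int perDocValuesSeen() {
        return perDocValuesSeen;
    }
}
```

The weakness is visible in the sketch: nothing stops a caller from invoking termsConsumer() early, which would create the resources before the flush.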

IMO we should not necessarily create all resources / files in the directory etc. when a FieldsConsumer
is created, but move that one level down and do it once a TermsConsumer is requested. We are going
to need these facilities anyway to integrate StoredFields etc., since they are per doc too.

Comments welcome.

> Expose DocValues via Fields
> ---------------------------
>                 Key: LUCENE-2700
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: CSF branch
> DocValues readers are currently exposed / accessed directly via IndexReader. To integrate
> the new feature in a more "native" way we should expose the DocValues via Fields on a per-segment
> level and on MultiFields in the multi-reader case. DocValues should sit side by side with Fields.terms,
> enabling access to Source, SortedSource and ValuesEnum, something like this:
> {code}
> public abstract class Fields {
> ...
>   public abstract DocValues values();
> }
> public abstract class DocValues {
>   /** on-disk, enum-based API */
>   public abstract ValuesEnum getEnum() throws IOException;
>   /** in-memory random access API with enum support; the first call loads the values into RAM */
>   public abstract Source getSource() throws IOException;
>   /** sorted in-memory random access API - optional operation */
>   public SortedSource getSortedSource(Comparator<BytesRef> comparator) throws IOException {
>     throw new UnsupportedOperationException();
>   }
>   /** unloads a previously loaded source only, but keeps the doc values open */
>   public abstract void unload();
>   /** closes the doc values */
>   public abstract void close();
> }
> {code}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
