lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
Date Wed, 26 Mar 2008 22:45:24 GMT


Michael McCandless commented on LUCENE-1231:

Sorry you're right: the payload is the binary data.

So there are a number of features these fields would have that differ from other fields:

Maybe add "stored in its own file" or some such, to that list.  Ie to
efficiently update field X I would think you want it stored in its own
file.  We would then fully write a new geneation of that file whenever
it had changes.

I agree it would be great to implement this as "flexible indexing",
such that these are simply a-la-cart options on how the field is
indexed, rather than make a new specialized kind of field that just
does one of these "combinations".  But I haven't wrapped my brain
around what all this will entail... it's a biggie!

BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for
updating sparse fields.
I like that!

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>                 Key: LUCENE-1231
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
> This new feature has been proposed and discussed here:
> Currently it is possible in Lucene to store data as stored fields or as payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is an sequential I/O operation.
> If you however want to load the data from one field for a large number of 
> documents, then stored fields perform quite badly, because lot's of I/O seeks 
> might have to be performed. 
> A better way to do this is using payloads. By creating a "special" posting list
> that has one posting with payload for each document you can "simulate" a column-
> stride field. The performance is significantly better compared to stored fields,
> however still not optimal. The reason is that for each document the freq value,
> which is in this particular case always 1, has to be decoded, also one position
> value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for column-
> stride data, once we decide for a final name for this feature we can change 
> this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
> FixedLengthList --> <Payload>^SegSize 
> VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
> Payload --> Byte^PayloadLength 
> PayloadLength --> VInt 
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data structure. 
> This could work like this: When the DocumentsWriter writes a segment it checks 
> whether all values of a field have the same length. If yes, it stores them as 
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger 
> merges two or more segments it checks if all segments have a FixedLengthList 
> with the same length for a column-stride field. If not, it writes a 
> VariableLengthList to the new segment. 
> Once this feature is implemented, we should think about making the column-
> stride fields updateable, similar to the norms. This will be a very powerful
> feature that can for example be used for low-latency tagging of documents.
> Other use cases:
> - replace norms
> - allow to store boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide for a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these 
> fields are supposed to be updateable.
> - Define datastructures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start 
> implementing.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message