lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
Date Wed, 08 Apr 2009 10:20:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696972#action_12696972
] 

Michael McCandless commented on LUCENE-1231:
--------------------------------------------

bq. If you e.g. want to show 5 fields per search result, you need to perform 1 seek per doc
using stored fields, but will need 5 seeks with CSFs.

If the CSFs have been loaded into the FieldCache then it's fast.  If they haven't, I agree
(one seek per doc).

Anyway this would just be "sugar" (a single 'retrieve doc' API that'd pull from the FieldCache
& from stored fields, and merge).  EG if I store body text as stored field, but then have
a bunch of small "metadata" fields (price, weight, manufacturer, etc.) stored in FieldCache
(loaded via CSF), it'd be convenient to call document() somewhere and get all fields back.

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is an sequential I/O operation.
> If you however want to load the data from one field for a large number of 
> documents, then stored fields perform quite badly, because lot's of I/O seeks 
> might have to be performed. 
> A better way to do this is using payloads. By creating a "special" posting list
> that has one posting with payload for each document you can "simulate" a column-
> stride field. The performance is significantly better compared to stored fields,
> however still not optimal. The reason is that for each document the freq value,
> which is in this particular case always 1, has to be decoded, also one position
> value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for column-
> stride data, once we decide for a final name for this feature we can change 
> this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
> FixedLengthList --> <Payload>^SegSize 
> VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
> Payload --> Byte^PayloadLength 
> PayloadLength --> VInt 
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data structure. 
> This could work like this: When the DocumentsWriter writes a segment it checks 
> whether all values of a field have the same length. If yes, it stores them as 
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger 
> merges two or more segments it checks if all segments have a FixedLengthList 
> with the same length for a column-stride field. If not, it writes a 
> VariableLengthList to the new segment. 
> Once this feature is implemented, we should think about making the column-
> stride fields updateable, similar to the norms. This will be a very powerful
> feature that can for example be used for low-latency tagging of documents.
> Other use cases:
> - replace norms
> - allow to store boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide for a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these 
> fields are supposed to be updateable.
> - Define datastructures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start 
> implementing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message