lucene-dev mailing list archives

From "Earwin Burrfoot (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
Date Wed, 08 Apr 2009 11:00:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696977#action_12696977 ]

Earwin Burrfoot commented on LUCENE-1231:
-----------------------------------------

I can share my design for doc loading, if anybody needs it:

public interface FieldCache {
  DocLoader loader(FieldInfo<?>... fields);
  ....
}

public interface DocLoader {
  void load(Doc doc);
  <T> T value(FieldInfo<T> field);
}

Doc is my analog of ScoreDoc; for these purposes it is completely identical.
FieldInfos are constants defined like UserFields.EMAIL. They hold the field's type, its
name, its indexing method, whether it is cached, and how it is cached. Two synthetic
fields exist, LUCENE_ID and SCORE; they allow the same API to be used for anything field-related.

Workflow looks like this:

// I create a loader. Fields are checked against the cache; for those that aren't cached
// I create a FieldSelector
loader = searcher.fieldCache().loader(concat(payloadFields, ID, DOCUMENT_TYPE, sortBy.field));

// Then, for each document I'm going to send in the response to a search request, I load that document.
// An indexReader.document(fieldSelector) call happens here if there are any uncached fields
loader.load(doc);

// Then I extract the values I need. Cached ones arrive from the cache; uncached ones are decoded
// from the Document retrieved in the previous step
hit = new Hit(loader.value(ID), loader.value(DOCUMENT_TYPE), loader.value(sortBy.field)) // etc, etc
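
For readers who want something compilable, here is a minimal sketch of how such a loader might be wired up. It is not my production code and not Lucene API: the class bodies, the StoredFieldReader interface (a stand-in for indexReader.document(fieldSelector)), and the array-per-field cache layout are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for the types named in the message.
final class DocLoaderSketch {

    /** Typed field constant: carries the value type, the name, and whether it is cached. */
    static final class FieldInfo<T> {
        final String name;
        final Class<T> type;
        final boolean cached;

        FieldInfo(String name, Class<T> type, boolean cached) {
            this.name = name;
            this.type = type;
            this.cached = cached;
        }
    }

    /** Minimal analog of ScoreDoc: only the document id matters here. */
    static final class Doc {
        final int id;
        Doc(int id) { this.id = id; }
    }

    /** Assumed stand-in for indexReader.document(fieldSelector). */
    interface StoredFieldReader {
        Map<String, Object> read(int docId, List<String> fieldNames);
    }

    static final class DocLoader {
        private final Map<String, Object[]> cache;  // field name -> one value per doc id
        private final StoredFieldReader reader;
        private final List<String> uncached = new ArrayList<>();  // plays the FieldSelector role
        private Doc current;
        private Map<String, Object> stored = new HashMap<>();

        DocLoader(Map<String, Object[]> cache, StoredFieldReader reader, FieldInfo<?>... fields) {
            this.cache = cache;
            this.reader = reader;
            for (FieldInfo<?> f : fields) {
                if (!f.cached) uncached.add(f.name);
            }
        }

        void load(Doc doc) {
            current = doc;
            // Stored fields are touched only if at least one requested field is uncached.
            stored = uncached.isEmpty() ? new HashMap<>() : reader.read(doc.id, uncached);
        }

        <T> T value(FieldInfo<T> field) {
            Object raw = field.cached ? cache.get(field.name)[current.id] : stored.get(field.name);
            return field.type.cast(raw);
        }
    }

    public static void main(String[] args) {
        FieldInfo<Integer> id = new FieldInfo<>("id", Integer.class, true);
        FieldInfo<String> email = new FieldInfo<>("email", String.class, false);

        Map<String, Object[]> cache = new HashMap<>();
        cache.put("id", new Object[] {100, 101});
        StoredFieldReader reader = (docId, names) -> {
            Map<String, Object> values = new HashMap<>();
            values.put("email", "user" + docId + "@example.org");
            return values;
        };

        DocLoader loader = new DocLoader(cache, reader, id, email);
        loader.load(new Doc(1));
        System.out.println(loader.value(id) + " " + loader.value(email));
    }
}
```

The point of the typed FieldInfo<T> is that value() needs no casts at the call site, and the loader can decide once, at construction time, whether stored fields have to be read at all.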


Having a single API to retrieve values regardless of the way they are stored/cached is very
handy. Loading a mix of stored/column-stride fields (if I understand correctly what they are)
is pointless: you're more likely to lose performance than to gain it. Loading a mix of cached/uncached
fields is a massive win, and it becomes even more massive if all required fields happen to be cached.

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is a sequential I/O operation.
> If you however want to load the data from one field for a large number of 
> documents, then stored fields perform quite badly, because lots of I/O seeks 
> might have to be performed. 
> A better way to do this is using payloads. By creating a "special" posting list
> that has one posting with payload for each document you can "simulate" a column-
> stride field. The performance is significantly better compared to stored fields,
> however still not optimal. The reason is that for each document the freq value,
> which is in this particular case always 1, has to be decoded, also one position
> value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for column-
> stride data; once we decide on a final name for this feature we can change 
> this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
> FixedLengthList --> <Payload>^SegSize 
> VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
> Payload --> Byte^PayloadLength 
> PayloadLength --> VInt 
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data structure. 
> This could work like this: When the DocumentsWriter writes a segment it checks 
> whether all values of a field have the same length. If yes, it stores them as 
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger 
> merges two or more segments it checks if all segments have a FixedLengthList 
> with the same length for a column-stride field. If not, it writes a 
> VariableLengthList to the new segment. 
> Once this feature is implemented, we should think about making the column-
> stride fields updateable, similar to the norms. This will be a very powerful
> feature that can for example be used for low-latency tagging of documents.
> Other use cases:
> - replace norms
> - allow storing boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide on a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these 
> fields are supposed to be updateable.
> - Define data structures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start 
> implementing.
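
The fixed-versus-variable selection rule in the quoted description can be sketched roughly as below. This is only an illustration of the rule as stated there, not code from any patch: the class and method names are hypothetical, -1 as the marker for a variable-length segment is an assumption, and so is treating an empty segment as variable-length. The offset helper assumes the FixedLengthList layout <Payload>^SegSize, that is, exactly one payload per document in the segment.

```java
import java.util.List;

// Hypothetical helper names; nothing here is Lucene API.
final class CsdFormatSketch {

    enum Format { FIXED_LENGTH, VARIABLE_LENGTH }

    /** Returns the length shared by all values, or -1 if lengths differ or there are no values. */
    static int commonLength(List<byte[]> values) {
        if (values.isEmpty()) return -1;  // assumption: empty input is treated as variable-length
        int len = values.get(0).length;
        for (byte[] v : values) {
            if (v.length != len) return -1;
        }
        return len;
    }

    /** Segment write: fixed-length only if every value of the field has the same length. */
    static Format chooseForSegment(List<byte[]> values) {
        return commonLength(values) >= 0 ? Format.FIXED_LENGTH : Format.VARIABLE_LENGTH;
    }

    /**
     * Merge: fixed-length only if every input segment has a fixed-length list
     * with the same length. Each entry is that segment's fixed length, or -1
     * if the segment stores the field as a variable-length list.
     */
    static Format chooseForMerge(int[] segmentLengths) {
        if (segmentLengths.length == 0 || segmentLengths[0] < 0) return Format.VARIABLE_LENGTH;
        for (int len : segmentLengths) {
            if (len != segmentLengths[0]) return Format.VARIABLE_LENGTH;
        }
        return Format.FIXED_LENGTH;
    }

    /** With one fixed-size payload per document, a value's byte offset is computable directly. */
    static long fixedOffset(int docId, int payloadLength) {
        return (long) docId * payloadLength;
    }

    public static void main(String[] args) {
        System.out.println(chooseForSegment(List.of(new byte[4], new byte[4])));
        System.out.println(chooseForSegment(List.of(new byte[4], new byte[2])));
        System.out.println(chooseForMerge(new int[] {4, 4, 8}));
        System.out.println(fixedOffset(10, 4));
    }
}
```

The directly computable offset is what the fixed-length case buys over the payload-based simulation, where a freq and a position value have to be decoded for every document.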

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

