lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Per-document Payloads (was: Re: lucene indexing and merge process)
Date Sat, 20 Oct 2007 12:11:37 GMT

On Oct 19, 2007, at 6:53 PM, Michael Busch wrote:

> John Wang wrote:
>>
>>      I can try to get some numbers for loading an int[] array vs
>> FieldCache.getInts().
>
> I've had a similar performance problem when I used the FieldCache.
> Loading is apparently so slow because each value is stored as a term
> in the dictionary. To load the cache it is necessary to iterate over
> all terms for the field in the dictionary, and for each term its
> posting list is opened to check which documents have that value.
>
> If you store unique docIds, then no two documents share the same
> value. That means that each value gets its own entry in the
> dictionary, and loading each value requires two random I/O seeks
> (one for the term lookup + one to open the posting list).
>
> In my app, filling the cache like that took several minutes for a big
> index.
>
> To speed things up I did essentially what Ning suggested. Now I store
> the values as payloads in the posting list of an artificial term. To
> fill my cache it's only necessary to perform a couple of I/O seeks for
> opening the posting list of the specific term, then it is just a
> sequential scan to load all values. With this approach the time for
> filling the cache went down from minutes to seconds!
>
> Now this approach is already much better than the current field cache
> implementation, but it can still be improved. In fact, we already have
> a mechanism for doing that: the norms. Norms are stored with a fixed
> size, which means both random access and sequential scan are optimal.
> Norms are also cached in memory, and filling that cache is much faster
> compared to the current FieldCache approach.
>
> I was therefore thinking about adding per-document payloads to Lucene
> (we can also call it document-metadata). The API could look like this:
>
> Document d = new Document();
> byte[] uidValue = ...
> d.addMetadata("uid", uidValue);
>
> And on the retrieval side all values could either be loaded into the
> field cache, or, if the index is too big, a new API can be used:
>
> IndexReader reader = IndexReader.open(...);
> DocumentMetadataIterator it = reader.metadataIterator("uid");
>
> where DocumentMetadataIterator is an interface similar to TermDocs:
>
> interface DocumentMetadataIterator {
>   void seek(String name);
>   boolean next();
>   boolean skipTo(int doc);
>
>   int doc();
>   byte[] getMetadata();
> }
>
> The next question would be how to store the per-doc payloads (PDP). If
> all values have the same length (like the unique docIds), then we
> should store them as efficiently as possible, like the norms. However,
> we still want to offer the flexibility of having variable-length
> values. For this case we could use a new data structure similar to our
> posting list.
>
> PDPList               --> FixedLengthPDPList |
>                           <VariableLengthPDPList, SkipList>
> FixedLengthPDPList    --> <Payload>^SegSize
> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
> Payload               --> Byte^PayloadLength
> PayloadLength         --> VInt
> SkipList              --> see frq.file
>
> Because we don't have global field semantics, Lucene should
> automatically pick the "right" data structure. This could work like
> this: when the DocumentsWriter writes a segment, it checks whether all
> values of a PDP have the same length. If yes, it stores them as a
> FixedLengthPDPList; if not, as a VariableLengthPDPList.
> When the SegmentMerger merges two or more segments, it checks whether
> all segments have a FixedLengthPDPList with the same length for a PDP.
> If not, it writes a VariableLengthPDPList to the new segment.
>
> I think this would be a nice new feature for Lucene. We could then
> have user-defined and Lucene-specific PDPs. For example, norms would be
> in the latter category (this way we would get rid of the special code
> for norms, as they could be handled as PDPs). It would also be easy to
> add new features in the future, like splitting the norms into two
> values: a norm and a boost value.
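
The layout rule described above (fixed-length if every value in a segment has the same length, and on merge only if every segment agrees on that length) can be sketched roughly as follows. This is a minimal illustration, not existing Lucene code: the class PdpLayout, its methods, and the VARIABLE sentinel are hypothetical names, and it assumes every document in a segment carries a value for the PDP.

```java
import java.util.List;

// Hypothetical helper; none of these names are existing Lucene API.
final class PdpLayout {
    // Sentinel meaning "lengths differ, write a VariableLengthPDPList".
    static final int VARIABLE = -1;

    // Flush-time rule: returns the shared value length if all values in
    // the segment have the same length, otherwise VARIABLE.
    static int chooseForSegment(List<byte[]> values) {
        if (values.isEmpty()) {
            return VARIABLE; // assumption: no values means no fixed width
        }
        int length = values.get(0).length;
        for (byte[] value : values) {
            if (value.length != length) {
                return VARIABLE;
            }
        }
        return length;
    }

    // Merge-time rule: the merged segment stays fixed-length only if every
    // input segment is fixed-length with the same width.
    static int chooseForMerge(int[] segmentLayouts) {
        if (segmentLayouts.length == 0) {
            return VARIABLE;
        }
        int length = segmentLayouts[0];
        for (int layout : segmentLayouts) {
            if (layout == VARIABLE || layout != length) {
                return VARIABLE;
            }
        }
        return length;
    }
}
```

Note that under this rule a single odd-sized value, or one segment with a different width, is enough to push the merged segment to the variable-length layout.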

Some randomly pieced-together thoughts (I may not even be fully awake
yet :-) so feel free to tell me I'm not understanding this correctly).

My first thought was: how is this different from just having a binary
field? But if I understand correctly, it is to be stored in a separate
file?

So you are proposing a faster storage mechanism for them, essentially,
since they are to be stored separately from the Documents themselves?
But the other key is that they are all stored next to each other,
right, so the scan is a lot faster?
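
To make the "stored next to each other" point concrete, here is a rough sketch of reading a fixed-length block in which document N's value sits at byte offset N * width. The method names mirror the DocumentMetadataIterator interface proposed above, but the class FixedLengthPdpReader and its in-memory byte[] representation are assumptions for illustration only.

```java
import java.util.Arrays;

// Hypothetical in-memory view of a FixedLengthPDPList: one value per
// document, all the same width, stored back to back.
final class FixedLengthPdpReader {
    private final byte[] block;
    private final int width;
    private int doc = -1; // positioned before the first document

    FixedLengthPdpReader(byte[] block, int width) {
        if (width <= 0 || block.length % width != 0) {
            throw new IllegalArgumentException("block is not a multiple of width");
        }
        this.block = block;
        this.width = width;
    }

    int docCount() {
        return block.length / width;
    }

    // Sequential scan: step to the next document.
    boolean next() {
        return skipTo(doc + 1);
    }

    // With a fixed width no skip list is needed; any target is a direct jump.
    boolean skipTo(int target) {
        int count = docCount();
        doc = Math.min(Math.max(target, 0), count);
        return doc < count;
    }

    int doc() {
        return doc;
    }

    // Copies out the value of the current document.
    byte[] getMetadata() {
        if (doc < 0 || doc >= docCount()) {
            throw new IllegalStateException("not positioned on a document");
        }
        int from = doc * width;
        return Arrays.copyOfRange(block, from, from + width);
    }
}
```

Filling a cache is then a single pass with next(), and a per-hit lookup is one skipTo() with no dictionary or posting-list seek involved.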

I think one of the questions that will come up from users is: when
should I use addMetadata and when should I use addField? Why expose
the distinction to the user? Fields have always represented metadata;
all you're doing is optimizing their internal storage. So from an
interface side of things, I would just make it a new Field type.
Essentially what we are doing is creating a two-level document store,
right? The first level contains all of the small metadata that is
likely to be accessed on every hit, and the second level contains all
of the non-essential fields, right? Perhaps in this way, if users were
willing to commit to fixed-length fields for the first level, we could
also make updating these fields possible w/o having to reindex?
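
The reason a fixed length would matter for updates is that a new value occupies exactly the bytes of the old one, so nothing after it has to move. A minimal sketch of that idea, assuming 4-byte int values held in a ByteBuffer (FixedWidthIntColumn is a hypothetical name, and this ignores that Lucene treats segment files as write-once):

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width column: one 4-byte slot per document.
final class FixedWidthIntColumn {
    private static final int WIDTH = Integer.BYTES;
    private final ByteBuffer slots;

    FixedWidthIntColumn(int docCount) {
        // Freshly allocated buffers are zero-filled, so every doc starts at 0.
        this.slots = ByteBuffer.allocate(docCount * WIDTH);
    }

    // Random access: the slot position is computed from the doc number.
    int get(int doc) {
        return slots.getInt(doc * WIDTH);
    }

    // In-place update: overwrites one slot and leaves all others untouched.
    void update(int doc, int value) {
        slots.putInt(doc * WIDTH, value);
    }
}
```

With variable-length values the same update could change the size of an entry and shift everything stored after it, which is why the fixed-length commitment would be the enabling condition.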

Btw, I've thought ever since we added payloads that we should find a
way to hook scoring into the binary fields, and I would presume
people would eventually want scoring of metadata too, just like the
FunctionQuery stuff does.

And yes, to Nicholas's point, it starts to sound like flexible
indexing. :-) Which I would still like to get to sometime in my
lifetime...


Cheers,
Grant
