lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-755) Payloads
Date Thu, 15 Mar 2007 04:18:09 GMT


Michael Busch commented on LUCENE-755:

Grant Ingersoll commented on LUCENE-755:

> OK, I've applied the patch.  All tests pass for me.  I think it looks  
> good.  Have you run any benchmarks on it?  I ran the standard one on  
> the patched version and on trunk, in a totally unscientific test.  In  
> theory, the case with no payloads should perform very closely to the  
> existing code, and this seems to be borne out by me running the micro-
> standard (ant run-task in contrib/benchmark).   Once we have this

Grant, thank you for running the benchmarks!
If no payloads are used, there is indeed no performance decrease to
expect, because the file format does not change at all in that case.

> committed someone can take a crack at adding support to the  
> benchmarker for payloads.

Good point! This will help us find possible optimizations.

> Payload should probably be serializable.

Agreed. Will do ...

> All in all, I think we could commit this, then adding the search/ 
> scoring capabilities like we've talked about.  I like the  
> documentation/comments you have added, very useful.  (One of these  
> days I will take on documenting the index package like I intend to,  
> so what you've added will be quite helpful!)   We will/may want to  

That's what I was planning to do as well... haven't had time yet. But 
good that there's another volunteer, so we can split the work ;-)

> add in, for example, a PayloadQuery and derivatives and a QueryParser  
> operator that supported searching in the payload, or possibly  
> boosting if a certain term has a certain type of payload (not that I  
> want anything to do with the QueryParser).  Even beyond that,  
> SpanPayloadQuery, etc.  I will possibly have some cycles to actually  
> write some code for these next week.

Yes there are lots of things we could do. I was also thinking about
providing a demo that uses payloads. Let's commit this first, then
we can start working on these items...

> Just throwing this out there, I'm not sure I really mean it or  
> not  :-) , but:
> do you think it would be useful to consider restricting the size of  
> the payload?  I know, I know, as soon as we put a limit on it,  
> someone will want to expand it, but I was thinking if we knew the  
> size had a limit we could better control the performance and caching,  
> etc. on the scoring/search side.    I guess it is buyer beware, maybe  
> we put some javadocs on this.

Hmm, I'm not sure if we should limit the size... since there are
so many different use cases I wouldn't even know how to pick such 
a limit. However, if we discover later that a limit would be helpful
to optimize things on the search side we could think about a limit
parameter on field level, which would be easy to add if we introduce
a schema and global field semantics with FI.

> Also, I started as I  
> think we will want to have some docs explaining why Payloads are  
> useful in non-javadoc format.

Cool, that will be helpful!

> On a side note, have a look at 
> PatchCheckList to see if there is anything you feel you can add.

Thanks for reviewing this so thoroughly, Grant! I will commit it soon!

> Payloads
> --------
>                 Key: LUCENE-755
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch, payloads.patch, payloads.patch, payloads.patch
> This patch adds the possibility to store arbitrary metadata (payloads) together with
each position of a term in its posting lists. A while ago this was discussed on the dev mailing
list, where I proposed an initial design. This patch has a much improved design with modifications
that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore
this patch provides low-level APIs to simply store and retrieve byte arrays in the posting
lists in an efficient way. 
> API and Usage
> ------------------------------   
> The new class index.Payload is basically just a wrapper around a byte[] array together
with int variables for offset and length. So a user does not have to create a byte array for
every payload, but can rather allocate one array for all payloads of a document and provide
offset and length information. This reduces object allocations on the application side.
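As an illustration of the shared-array idea described above, here is a minimal sketch. PayloadSlice and PayloadSliceDemo are hypothetical stand-ins written for this example, not the index.Payload class from the patch; only the "byte[] plus offset plus length" shape is taken from the description.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the patch's index.Payload; not the actual Lucene class. */
final class PayloadSlice {
    final byte[] data;  // shared backing array, referenced rather than copied
    final int offset;   // start of this payload within data
    final int length;   // number of bytes that belong to this payload

    PayloadSlice(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > data.length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    /** Copies the slice out of the shared array, e.g. for inspection. */
    byte[] toByteArray() {
        return Arrays.copyOfRange(data, offset, offset + length);
    }
}

final class PayloadSliceDemo {
    /** Builds one fixed-length slice per token over a single shared buffer. */
    static PayloadSlice[] slicesFor(byte[] shared, int payloadLength) {
        int count = shared.length / payloadLength;
        PayloadSlice[] slices = new PayloadSlice[count];
        for (int i = 0; i < count; i++) {
            slices[i] = new PayloadSlice(shared, i * payloadLength, payloadLength);
        }
        return slices;
    }
}
```

With this layout a document with N tokens needs one byte[] allocation for all payload data instead of N small arrays.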
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter
that produces Tokens with payloads. I added the following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now offers two
new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    * 
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    * 
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose. 
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int
offset, int length). So far there was only a writeBytes() method without an offset argument.
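The buffer-reuse contract in the getPayload(byte[] data, int offset) javadoc above can be sketched as follows. PayloadBufferSketch and copyPayload are hypothetical names for this example. The sketch assumes that a freshly allocated array is sized exactly to the payload and filled from index 0, which the javadoc does not spell out.

```java
/** Hypothetical illustration of the getPayload(byte[], int) buffer contract. */
final class PayloadBufferSketch {
    /**
     * Copies payload into data starting at offset if it fits there;
     * otherwise allocates a new array for it.
     * Assumption: the new array holds exactly the payload, starting at index 0.
     */
    static byte[] copyPayload(byte[] payload, byte[] data, int offset) {
        byte[] target = data;
        int start = offset;
        if (data == null || data.length - offset < payload.length) {
            // The caller's array is missing or too small, so allocate a new one.
            target = new byte[payload.length];
            start = 0;
        }
        System.arraycopy(payload, 0, target, start, payload.length);
        return target;
    }
}
```

A caller should therefore always use the returned array, because it is only the same object as the one passed in when the payload fit.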

> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field.
The user does not have to enable payloads for a field; this is done automatically:
>    * The DocumentWriter enables payloads for a field if one or more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled
for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and
FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload
of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta
is shifted left by one bit. The lowest bit is used to indicate whether the length of the following
payload is stored explicitly. If not, i.e. the bit is not set, then the payload has the same
length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at every skip
point has to be known. Therefore the payload length is also stored in the skip list located
in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip
is used to indicate if the payload length is stored for a SkipDatum or if the length is the
same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only
the position and the payload length are loaded from the ProxFile. If the user calls getPayload()
then the payload is actually loaded. If getPayload() is not called before nextPosition() is
called again, then the payload data is just skipped.
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of the sixth
lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for
the corresponding field. 
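The flag handling described above can be sketched with plain bit operations. FieldBitsSketch and the constant name STORE_PAYLOADS are hypothetical names for this example; only the bit value 0x20 and the "enabled if any segment has it" merge rule come from the description.

```java
/** Hypothetical illustration of the payload flag in the FieldBits byte. */
final class FieldBitsSketch {
    /** Sixth lowest-order bit, as stated in the description; the name is an assumption. */
    static final int STORE_PAYLOADS = 0x20;

    /** Returns true if the payload flag is set in the given FieldBits. */
    static boolean hasPayloads(int fieldBits) {
        return (fieldBits & STORE_PAYLOADS) != 0;
    }

    /** Returns fieldBits with the payload flag set and all other bits unchanged. */
    static int enablePayloads(int fieldBits) {
        return fieldBits | STORE_PAYLOADS;
    }

    /** Merge rule: payloads are enabled if any of the segments has them enabled. */
    static boolean enabledAfterMerge(int... segmentFieldBits) {
        for (int bits : segmentFieldBits) {
            if (hasPayloads(bits)) {
                return true;
            }
        }
        return false;
    }
}
```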
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt 
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the
document and the previous occurrence (or zero, if this is the first occurrence in this document).
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the
document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored.
If PositionDelta is even, then the length of the current payload equals the length of the
previous payload and thus PayloadLength is omitted.
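To make the payload-enabled .prx encoding above concrete, here is a minimal sketch for the positions of one term in one document. ProxPayloadSketch and its methods are hypothetical names for this example, and it writes into a ByteArrayOutputStream rather than a Lucene IndexOutput. Starting with a "last length" of -1, so that the first length is always written, is an assumption of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the payload-enabled position encoding. */
final class ProxPayloadSketch {
    /** Writes a VInt: 7 bits per byte, high bit set on all but the last byte. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt at cursor[0] and advances the cursor past it. */
    static int readVInt(byte[] in, int[] cursor) {
        byte b = in[cursor[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[cursor[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes ascending positions; payloads[i] may be empty but not null. */
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // assumption: forces the first length to be written
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            lastPosition = positions[i];
            int length = payloads[i].length;
            if (length != lastLength) {
                writeVInt(out, (delta << 1) | 1); // odd: PayloadLength follows
                writeVInt(out, length);
                lastLength = length;
            } else {
                writeVInt(out, delta << 1);       // even: same length as before
            }
            out.write(payloads[i], 0, length);    // PayloadData
        }
        return out.toByteArray();
    }

    /** Decodes positions only and skips the payload bytes, as a lazy reader might. */
    static List<Integer> decodePositions(byte[] in, int freq) {
        List<Integer> positions = new ArrayList<>();
        int[] cursor = {0};
        int position = 0;
        int length = 0;
        for (int i = 0; i < freq; i++) {
            int code = readVInt(in, cursor);
            position += code >>> 1;               // PositionDelta/2
            if ((code & 1) != 0) {
                length = readVInt(in, cursor);    // explicit PayloadLength
            }
            cursor[0] += length;                  // skip PayloadData unread
            positions.add(position);
        }
        return positions;
    }
}
```

For positions 3, 7 and 20 with payload lengths 2, 2 and 1, only the first and third entries carry an explicit length; the second reuses the previous one.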
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval-th document in TermFreqs.
Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs.
If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the
payload at the current skip point equals the length of the payload at the last skip point
and thus PayloadLength is omitted.
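The same low-bit trick applied to skip entries can be sketched like this. SkipDatumSketch is a hypothetical name for this example, and each input row is assumed to already hold the delta-encoded values {docSkip, payloadLength, freqSkip, proxSkip}. As before, the initial "last length" of -1 is an assumption of the sketch.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration of SkipDatum encoding with payloads enabled. */
final class SkipDatumSketch {
    /** Writes a VInt: 7 bits per byte, high bit set on all but the last byte. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Each row is {docSkip, payloadLength, freqSkip, proxSkip}. */
    static byte[] encode(int[][] skips) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastLength = -1; // assumption: forces the first length to be written
        for (int[] skip : skips) {
            int docSkip = skip[0];
            int length = skip[1];
            if (length != lastLength) {
                writeVInt(out, (docSkip << 1) | 1); // odd: PayloadLength follows
                writeVInt(out, length);
                lastLength = length;
            } else {
                writeVInt(out, docSkip << 1);       // even: length unchanged
            }
            writeVInt(out, skip[2]);                // FreqSkip
            writeVInt(out, skip[3]);                // ProxSkip
        }
        return out.toByteArray();
    }
}
```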
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for
the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length
only has to be stored once for every term. This should be a common case, because users probably
use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space because
we benefit again from the same-length-compression since we only have to store the length zero
for the empty payloads once per term.
> All unit tests pass.

