lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: Payloads
Date Wed, 20 Dec 2006 14:31:37 GMT
Hi Michael,

Have a look at

I am planning on starting on this soon (I know, I have been saying  
that for a while, but I really am.)  At any rate, another set of eyes  
would be good and I would be interested in hearing how your version  
compares/works with this patch from Nicolas.


On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:

> Hi all,
> currently it is not possible to add generic payloads to a posting  
> list. However, this feature would be useful for various use cases.  
> Some examples:
> - XML search
>  to index XML documents and allow structured search (e.g. XPath) it  
> is necessary to store the depths of the terms
> - part-of-speech
>  payloads can be used to store the part of speech of a term occurrence
> - term boost
>  for terms that occur e.g. in bold font a payload containing a  
> boost value can be stored
> - ...
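
A minimal sketch of how the per-occurrence values from the examples above could be packed into a byte[] payload. The class PayloadCodec and its methods are hypothetical helpers for illustration only; they are not part of Lucene or of the patch, and the one-byte depth and four-byte boost layouts are assumptions.

```java
import java.nio.ByteBuffer;

// Hypothetical helper; not part of Lucene or of the proposed patch.
class PayloadCodec {
    // XML search: nesting depth of a term occurrence, stored in one byte
    // (assumes the depth never exceeds 255).
    static byte[] encodeDepth(int depth) {
        if (depth < 0 || depth > 255) {
            throw new IllegalArgumentException("depth out of range: " + depth);
        }
        return new byte[] { (byte) depth };
    }

    static int decodeDepth(byte[] payload) {
        return payload[0] & 0xFF; // mask to read the byte as unsigned
    }

    // Term boost: a 4-byte float in big-endian order (the ByteBuffer default).
    static byte[] encodeBoost(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    static float decodeBoost(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }
}
```

A one-byte depth keeps the payload small, while the boost costs four bytes per occurrence; a coarser one-byte boost would be an alternative if disk space matters more than precision.
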
> The payloads feature has been requested and discussed a couple of  
> times, e.g. in
> -
> -
> In the latter thread I proposed a design a couple of months ago  
> that adds the possibility to Lucene to store variable-length  
> payloads inline in the posting list of a term. However, this design  
> had some drawbacks: the already complex field API was extended and  
> the payloads encoding was not optimal in terms of disk space.   
> Furthermore, the overall Lucene runtime performance suffered due to  
> the growth of the .prx file. In the meantime the patch LUCENE-687  
> (Lazy skipping on proximity file) was committed, which reduces the  
> number of reads and seeks on the .prx file. This minimizes the  
> performance degradation of a bigger .prx file. Also, LUCENE-695  
> (Improve BufferedIndexInput.readBytes() performance) was committed,  
> which speeds up reading mid-size chunks of bytes; this is  
> beneficial for payloads that are bigger than just a few bytes.
> Some weeks ago I started working on an improved design which I  
> would like to propose now. The new design simplifies the API  
> extensions (the Field API remains unchanged) and uses less disk  
> space in most use cases. Now there are only two classes that get  
> new methods:
> - Token.setPayload()
>  Use this method to add arbitrary metadata to a Token in the form  
> of a byte[] array.
> - TermPositions.getPayload()
>  Use this method to retrieve the payload of a term occurrence.
> The implementation is very flexible: the user does not have to  
> enable payloads explicitly for a field and can add payloads to all,  
> some or no Tokens. Due to the improved encoding those use cases are  
> handled efficiently in terms of disk space.
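
A rough sketch of how the two proposed methods might be used together, with payloads attached to only some tokens. SketchToken, SketchTermPositions and PayloadSketch are hypothetical in-memory stand-ins written for this illustration; they are not the Lucene classes and say nothing about the on-disk encoding in the patch. Only the method names setPayload() and getPayload() come from the proposal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token; NOT the Lucene Token class.
class SketchToken {
    final String text;
    private byte[] payload; // null means "no payload for this occurrence"

    SketchToken(String text) { this.text = text; }

    void setPayload(byte[] payload) { this.payload = payload; }
    byte[] getPayload() { return payload; }
}

// Hypothetical in-memory model of one term's positions in one document.
class SketchTermPositions {
    private final List<Integer> positions = new ArrayList<>();
    private final List<byte[]> payloads = new ArrayList<>();
    private int cursor = -1;

    void add(int position, byte[] payload) {
        positions.add(position);
        payloads.add(payload);
    }

    boolean hasNext() { return cursor + 1 < positions.size(); }

    int nextPosition() {
        cursor++;
        return positions.get(cursor);
    }

    // Payload of the occurrence last returned by nextPosition(), or null.
    byte[] getPayload() { return payloads.get(cursor); }
}

class PayloadSketch {
    // Collect the occurrences of one term, carrying each token's payload along.
    static SketchTermPositions index(List<SketchToken> tokens, String term) {
        SketchTermPositions tp = new SketchTermPositions();
        for (int pos = 0; pos < tokens.size(); pos++) {
            SketchToken t = tokens.get(pos);
            if (t.text.equals(term)) {
                tp.add(pos, t.getPayload());
            }
        }
        return tp;
    }
}
```

In this model a token without a payload simply yields null at read time, which mirrors the idea that payloads can be added to all, some or no tokens without enabling anything per field.
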
> Another thing I would like to point out is that this feature is  
> backwards compatible, meaning that the file format only changes if  
> the user explicitly adds payloads to the index. If no payloads are  
> used, all data structures remain unchanged.
> I'm going to open a new JIRA issue soon containing the patch and  
> details about implementation and file format changes.
> One more comment: It is a rather big patch and this is the initial  
> version, so I'm sure there will be a lot of discussions. I would  
> like to encourage people who consider this feature useful to try  
> it out and give me some feedback about possible improvements.
> Best regards,
> - Michael

Grant Ingersoll
