lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Payloads
Date Wed, 20 Dec 2006 14:19:18 GMT
Hi all,

Currently it is not possible to add generic payloads to a posting list. 
However, this feature would be useful for various use cases. Some examples:
- XML search
  To index XML documents and allow structured search (e.g. XPath), it is 
necessary to store the depths of the terms
- part-of-speech
  payloads can be used to store the part of speech of a term occurrence
- term boost
  For terms that occur e.g. in bold font, a payload containing a boost 
value can be stored
- ...
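
As a rough illustration of the examples above, the following sketch packs 
each kind of per-occurrence value into a small byte array. It assumes a 
payload is nothing more than an opaque sequence of bytes attached to one 
term occurrence. The class and method names are hypothetical and not part 
of Lucene, and the chosen encodings (one byte for depth, ASCII for the 
part-of-speech tag, four bytes for the boost) are just one possible choice.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers, not part of Lucene: each pair converts one kind of
// per-occurrence value to and from a byte array usable as a payload.
class PayloadEncodingSketch {

    // XML search: nesting depth of the term, stored in a single byte (0..255).
    static byte[] encodeDepth(int depth) {
        if (depth < 0 || depth > 255) {
            throw new IllegalArgumentException("depth out of range: " + depth);
        }
        return new byte[] { (byte) depth };
    }

    static int decodeDepth(byte[] payload) {
        return payload[0] & 0xFF; // mask to read the byte as unsigned
    }

    // Part of speech: a short ASCII tag such as "NN" or "VB".
    static byte[] encodePartOfSpeech(String tag) {
        return tag.getBytes(StandardCharsets.US_ASCII);
    }

    static String decodePartOfSpeech(byte[] payload) {
        return new String(payload, StandardCharsets.US_ASCII);
    }

    // Term boost: a float stored in four bytes.
    static byte[] encodeBoost(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    static float decodeBoost(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    public static void main(String[] args) {
        System.out.println("depth: " + decodeDepth(encodeDepth(3)));
        System.out.println("pos:   " + decodePartOfSpeech(encodePartOfSpeech("NN")));
        System.out.println("boost: " + decodeBoost(encodeBoost(2.0f)));
    }
}
```

The three encodings differ in length (1, 2 and 4 bytes here), which is why 
variable-length payloads matter for these use cases.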

The payloads feature has been requested and discussed a couple of times, 
e.g. in
- http://www.gossamer-threads.com/lists/lucene/java-dev/29465
- http://www.gossamer-threads.com/lists/lucene/java-dev/37409

In the latter thread I proposed a design a couple of months ago that 
lets Lucene store variable-length payloads inline in the posting list 
of a term. However, this design had some drawbacks: the already complex 
Field API was extended and the payload encoding was not optimal in 
terms of disk space. Furthermore, overall Lucene runtime performance 
suffered due to the growth of the .prx file. In the meantime the patch 
LUCENE-687 (Lazy skipping on proximity file) was committed, which 
reduces the number of reads and seeks on the .prx file and thereby 
limits the performance degradation caused by a bigger .prx file. Also, 
LUCENE-695 (Improve BufferedIndexInput.readBytes() performance) was 
committed, which speeds up reading mid-size chunks of bytes; this is 
beneficial for payloads that are bigger than just a few bytes.

Some weeks ago I started working on an improved design which I would 
like to propose now. The new design simplifies the API extensions (the 
Field API remains unchanged) and uses less disk space in most use cases. 
Now there are only two classes that get new methods:
- Token.setPayload()
  Use this method to add arbitrary metadata to a Token in the form of a 
byte array (byte[]).
 
- TermPositions.getPayload()
  Use this method to retrieve the payload of a term occurrence.
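
To make the intended usage of these two methods more concrete, here is a 
minimal, self-contained sketch. It does not use Lucene at all: SketchToken 
and SketchTermPositions are hypothetical stand-ins that only mimic the two 
proposed methods, and the exact signatures in the patch may differ. The 
in-memory lists merely stand in for the posting list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not the Lucene classes: they only model attaching
// a payload to a token and reading it back per term occurrence.
class PayloadApiSketch {

    // Stand-in for a token produced during analysis.
    static final class SketchToken {
        final String text;
        private byte[] payload; // null means "no payload for this token"

        SketchToken(String text) { this.text = text; }

        void setPayload(byte[] payload) { this.payload = payload; }

        byte[] getPayload() { return payload; }
    }

    // Stand-in for iterating over the occurrences of one term in one document.
    static final class SketchTermPositions {
        private final List<Integer> positions = new ArrayList<>();
        private final List<byte[]> payloads = new ArrayList<>();
        private int current = -1;

        SketchTermPositions(List<SketchToken> tokens, String term) {
            for (int i = 0; i < tokens.size(); i++) {
                SketchToken token = tokens.get(i);
                if (token.text.equals(term)) {
                    positions.add(i);
                    payloads.add(token.getPayload());
                }
            }
        }

        boolean hasNext() { return current + 1 < positions.size(); }

        // Advances to the next occurrence and returns its position.
        int nextPosition() {
            current++;
            return positions.get(current);
        }

        // Payload of the current occurrence, or null if none was set.
        byte[] getPayload() { return payloads.get(current); }
    }

    // Three tokens; only the first occurrence of "lucene" carries a payload.
    static List<SketchToken> sampleTokens() {
        List<SketchToken> tokens = new ArrayList<>();
        SketchToken bold = new SketchToken("lucene");
        bold.setPayload(new byte[] { 2 }); // e.g. a boost marker for bold text
        tokens.add(bold);
        tokens.add(new SketchToken("stores"));
        tokens.add(new SketchToken("lucene")); // same term, no payload
        return tokens;
    }

    public static void main(String[] args) {
        SketchTermPositions tp = new SketchTermPositions(sampleTokens(), "lucene");
        while (tp.hasNext()) {
            int position = tp.nextPosition();
            byte[] payload = tp.getPayload();
            System.out.println(position + " -> "
                + (payload == null ? "no payload" : payload.length + " byte(s)"));
        }
    }
}
```

In this sketch a token without a payload is simply represented by null, 
so payloads can be present on all, some or no occurrences of a term.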
 
The implementation is very flexible: the user does not have to enable 
payloads explicitly for a field and can add payloads to all, some or no 
Tokens. Due to the improved encoding, all of these cases are handled 
efficiently in terms of disk space.

Another thing I would like to point out is that this feature is 
backwards compatible, meaning that the file format only changes if the 
user explicitly adds payloads to the index. If no payloads are used, all 
data structures remain unchanged.

I'm going to open a new JIRA issue soon containing the patch and details 
about implementation and file format changes.

One more comment: It is a rather big patch and this is the initial 
version, so I'm sure there will be a lot of discussion. I would like to 
encourage people who consider this feature useful to try it out and 
give me some feedback about possible improvements.

Best regards,
- Michael



