lucene-dev mailing list archives

From Michael Busch
Subject Flexible index format / Payloads Cont'd
Date Thu, 29 Jun 2006 21:22:42 GMT
Hi everyone,

I work for IBM and recently started looking into Lucene.
I am very interested in the topic "flexible indexing / payloads",
which was discussed several times over the last two months. I
searched the mailing lists and found several threads about this
topic, but none of them really reached a conclusion. That is my
reason for starting this new thread. I hope to find out:
   - Who is working on this feature?
   - Is there a concrete design?
   - Which functions/changes will the implementation include?
Furthermore, I would like to describe the work I have done so far
on this feature.

To sum up the recent discussions, I'm going to list the different
threads about this topic:

--> There is a page in the Lucene Wiki to plan / track this topic:

--> May 08, 2006 - May 10, 2006

    - Grant Ingersoll mentions that he is interested in working
      on this topic.
    - Doug suggests storing docs, frequencies, positions, and
      norms in one postings file (freqs, positions, and norms
      optional). A suggested file format for such a postings file
      can be found on the Wiki page mentioned above.

--> May 28, 2006 - May 31, 2006;#36039

    - Nadav Har'El suggests associating arbitrary data with
      each posting, i.e. a variable-length payload stored with
      each position, an idea Nadav and I discussed earlier. Doug
      voted +1 for this idea.

--> May 31, 2006 - Jun  2, 2006;#36210
    - Marvin Humphrey talks about a pluggable PostingsWriter/Reader
      to make the postings file customizable. Marvin goes a step
      further and suggests using plugins for other index files
      as well.

I have the feeling that many people are interested in having a
flexible index format. There are already various use cases:
   - Efficient parametric search
   - XML search
   - Part Of Speech (POS) annotations with each position
   - Multi-faceted search
   - ...

But I also have the feeling that no clear course of action has
been defined yet. The issue is quite complex: it is not easy to
generalize the index data structures to satisfy all demands and
use cases while maintaining the straightforwardness of Lucene.

In the following I would like to describe the work I have done so
far on this issue and propose a strategy for working on it in the
future that keeps the complexity under control.

I have made a prototype implementation of payloads. In my approach
I leave the frequency file as is and only change the positions file,
so that I can store a variable-length payload (byte[]) with each
position. Payloads can be enabled/disabled at field level. The API
changes are:
  - a new Field constructor that takes a Payload as additional data
  - a Token stores a Payload, so an analyzer can produce tokens with
    arbitrary payloads
  - TermPositions gets a getPayload() method
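
To illustrate the Token/Payload idea, here is a minimal Java sketch.
The classes Payload and Token and the analyze method are hypothetical
stand-ins, not the prototype's or Lucene's actual API, and the
"word/TAG" input convention is assumed only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in classes; none of this is Lucene's real API.
final class PayloadApiSketch {

    /** Opaque per-position data, stored as raw bytes. */
    static final class Payload {
        final byte[] data;
        Payload(byte[] data) { this.data = data.clone(); }
    }

    /** A token that carries a payload next to its term and position. */
    static final class Token {
        final String term;
        final int position;
        final Payload payload;
        Token(String term, int position, Payload payload) {
            this.term = term;
            this.position = position;
            this.payload = payload;
        }
    }

    /**
     * Toy analyzer: splits on whitespace. A piece written as "word/TAG"
     * yields the term "word" with the UTF-8 bytes of "TAG" as payload;
     * a piece without a slash gets an empty payload.
     */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        for (String piece : text.trim().split("\\s+")) {
            if (piece.isEmpty()) continue;
            int slash = piece.lastIndexOf('/');
            String term = slash < 0 ? piece : piece.substring(0, slash);
            byte[] tag = slash < 0
                    ? new byte[0]
                    : piece.substring(slash + 1).getBytes(StandardCharsets.UTF_8);
            tokens.add(new Token(term, position++, new Payload(tag)));
        }
        return tokens;
    }
}
```

Calling PayloadApiSketch.analyze("Lucene/NN rocks/VB") would produce
two tokens whose payloads hold the part-of-speech tags, which is the
kind of data a getPayload() call could later return per position.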

This prototype works very well, and we use it to experiment with
multi-faceted search. But I think I should go a bit further and
merge the frequency and position files into a single postings file,
which seemed to be the prevailing opinion in the mailing list
threads.

I suggest splitting the work into smaller work items with clearly
defined milestones. I propose the following steps:
1. Introduce postings file with the following format:
   <DocDelta, Payload>*
     DocDelta --> VInt
     DocDelta/2 is the difference between this document number and
     the previous document number.
     Payload --> Byte, if DocDelta is even
     Payload --> <Payload_Length, Payload_Data>, if DocDelta is odd
       Payload_Length --> VInt
       Payload_Data   --> Byte^Payload_Length

   Furthermore, it should be possible to enable/disable payloads
   at field level.

2. Add multilevel skipping (tree structure) for the postings file.
   One-level skipping, as currently used in Lucene, is probably
   not efficient enough for the new postings file, because that
   file can become very big. Question: Should we include the
   skipping information directly in the postings file, or should
   we introduce a new file containing it? I think having the skip
   tree in a separate file should improve cache performance.

3. Optional: Add a type system for the payloads to make it
   easier to develop PostingsWriter/Reader plugins.

4. Make the PostingsWriter/Reader pluggable and develop default
   PostingsWriter/Reader plugins that store frequencies, positions,
   and norms as payloads in the postings file. This should be
   configurable, to enable the different options Doug suggested:
   a. <doc>+
   b. <doc, boost>+
   c. <doc, freq, <position>+ >+
   d. <doc, freq, <position, boost>+ >+

5. Develop new or extend existing PostingsWriter/Reader plugins for
   desired features like XML search, POS, multi-faceted search, ...
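
As an illustration of the file format in step 1, the following Java
sketch encodes and decodes a <DocDelta, Payload>* stream. All class
and method names are hypothetical. The VInt layout (7 bits per byte,
low-order bits first, high bit as continuation flag) follows Lucene's
documented file format, and the sketch assumes that a payload of
exactly one byte is written in the even form.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; a sketch of the step 1 format only.
final class PostingsSketch {

    /** One decoded posting: a document number and its payload bytes. */
    static final class Posting {
        final int doc;
        final byte[] payload;
        Posting(int doc, byte[] payload) { this.doc = doc; this.payload = payload; }
    }

    /** Writes a non-negative int as a VInt. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit: more bytes follow
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt from data at pos[0] and advances pos[0]. */
    static int readVInt(byte[] data, int[] pos) {
        byte b = data[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes postings sorted by ascending doc number (deltas below 2^30). */
    static byte[] encode(List<Posting> postings) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (Posting p : postings) {
            int delta = p.doc - lastDoc;
            lastDoc = p.doc;
            if (p.payload.length == 1) {
                writeVInt(out, delta << 1);        // even: single-byte payload
                out.write(p.payload[0]);
            } else {
                writeVInt(out, (delta << 1) | 1);  // odd: length-prefixed payload
                writeVInt(out, p.payload.length);
                out.write(p.payload, 0, p.payload.length);
            }
        }
        return out.toByteArray();
    }

    /** Decodes a stream produced by encode. */
    static List<Posting> decode(byte[] data) {
        List<Posting> result = new ArrayList<>();
        int[] pos = {0};
        int doc = 0;
        while (pos[0] < data.length) {
            int code = readVInt(data, pos);
            doc += code >>> 1;                     // DocDelta/2 is the doc gap
            int len = (code & 1) == 0 ? 1 : readVInt(data, pos);
            byte[] payload = new byte[len];
            System.arraycopy(data, pos[0], payload, 0, len);
            pos[0] += len;
            result.add(new Posting(doc, payload));
        }
        return result;
    }
}
```

The low bit of DocDelta saves one length byte for every single-byte
payload, which matters when most postings carry something small such
as a norm.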
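
The multilevel skipping proposed in step 2 can be sketched in memory
as follows. This is a simplified model under stated assumptions, not
a file format: the class name, the interval parameter, and the use of
array indices in place of file pointers are all hypothetical. A skip
entry on level k is modeled as the posting at every interval^k-th
index.

```java
// Hypothetical in-memory model of multilevel skipping over sorted doc numbers.
final class MultiLevelSkipSketch {
    private final int[] docs;     // sorted doc numbers of one postings list
    private final int interval;   // fan-out between adjacent skip levels
    private final int topStep;    // step width of the highest skip level

    MultiLevelSkipSketch(int[] sortedDocs, int interval) {
        if (interval < 2) {
            throw new IllegalArgumentException("interval must be >= 2");
        }
        this.docs = sortedDocs;
        this.interval = interval;
        int step = 1;
        // Grow the step while a skip of that width still fits in the list.
        while ((long) step * interval < sortedDocs.length) {
            step *= interval;
        }
        this.topStep = step;
    }

    /** Returns the index of the first posting with doc >= target, or docs.length. */
    int skipTo(int target) {
        int pos = 0;
        // Start on the coarsest level and descend to finer ones.
        for (int step = topStep; step > 1; step /= interval) {
            while (pos + step < docs.length && docs[pos + step] < target) {
                pos += step;
            }
        }
        // Finish with a short linear scan on the postings themselves.
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos;
    }
}
```

With one level, skipTo on a long list still walks a number of skip
entries proportional to the list length; with several levels the
number of entries inspected grows roughly logarithmically.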
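
The configurable options a to d in step 4 could be modeled as
interchangeable writers. The sketch below is only an assumption about
how such a plugin might look: the PostingsWriter interface shown here,
the Layout enum, and the use of plain ints for boosts are hypothetical
and chosen for brevity.

```java
import java.util.List;

// Hypothetical plugin model for the four layouts listed in step 4.
final class PostingsLayoutSketch {

    enum Layout { DOC, DOC_BOOST, DOC_FREQ_POS, DOC_FREQ_POS_BOOST }

    /** Hypothetical plugin interface: appends one posting to out. */
    interface PostingsWriter {
        void write(int doc, int docBoost, int[] positions, int[] posBoosts,
                   List<Integer> out);
    }

    static PostingsWriter writerFor(Layout layout) {
        switch (layout) {
            case DOC:                 // a. <doc>+
                return (doc, docBoost, positions, posBoosts, out) -> out.add(doc);
            case DOC_BOOST:           // b. <doc, boost>+
                return (doc, docBoost, positions, posBoosts, out) -> {
                    out.add(doc);
                    out.add(docBoost);
                };
            case DOC_FREQ_POS:        // c. <doc, freq, <position>+ >+
                return (doc, docBoost, positions, posBoosts, out) -> {
                    out.add(doc);
                    out.add(positions.length);   // freq = number of positions
                    for (int p : positions) out.add(p);
                };
            case DOC_FREQ_POS_BOOST:  // d. <doc, freq, <position, boost>+ >+
                return (doc, docBoost, positions, posBoosts, out) -> {
                    out.add(doc);
                    out.add(positions.length);
                    for (int i = 0; i < positions.length; i++) {
                        out.add(positions[i]);
                        out.add(posBoosts[i]);
                    }
                };
            default:
                throw new IllegalArgumentException("unknown layout: " + layout);
        }
    }
}
```

Each writer sees the same input and decides which parts to persist,
so adding a fifth layout would not require touching the callers.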

Please let me know what you think about my suggestions. If people
like this approach, then I can add the information to the Wiki
planning page and start working on it.

Best Regards,
  Michael Busch
