lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flexible index format / Payloads Cont'd
Date Thu, 29 Jun 2006 23:47:28 GMT

On Jun 29, 2006, at 2:22 PM, Michael Busch wrote:

>   - Is there a concrete design?

Not that I am aware of.

> I have the feeling, that many people are interested in having a
> flexible index format. There are already various use cases:
>   - Efficient parametric search

This comes at the expense of a significant file size increase and a
performance hit.  Think of a book index that lists not only page
numbers but also a category for each entry.

   axle => 3, 67, 89, 244

vs...

   axle => 3 cars, 67 cars, 89 trucks, 244 cars

Scanning through the latter is going to be more expensive.  It might
be worth it in specific cases, but it's not the long-hoped-for
panacea that would give Lucene all the features of an RDBMS without
incurring any costs.  :)
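
To put a rough number on the book-index analogy, here is a minimal sketch that encodes the "axle" entry both ways. It assumes delta-encoded page numbers written as variable-length ints and a one-byte category id per entry; the class name, method names, and category ids are all made up for illustration and are not an actual Lucene format.

```java
import java.io.ByteArrayOutputStream;

class PostingSizeSketch {
    // Variable-length int: 7 data bits per byte, high bit set on all but the last byte.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Plain entry: delta-encoded page numbers only.
    static byte[] encodePlain(int[] pages) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int page : pages) {
            writeVInt(out, page - prev);
            prev = page;
        }
        return out.toByteArray();
    }

    // Annotated entry: every delta is followed by a one-byte category id (assumption).
    static byte[] encodeWithCategory(int[] pages, byte[] categories) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int i = 0; i < pages.length; i++) {
            writeVInt(out, pages[i] - prev);
            out.write(categories[i]);
            prev = pages[i];
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] pages = {3, 67, 89, 244};
        byte[] categories = {0, 0, 1, 0};   // hypothetical ids: 0 = cars, 1 = trucks
        System.out.println("plain:     " + encodePlain(pages).length + " bytes");
        System.out.println("annotated: " + encodeWithCategory(pages, categories).length + " bytes");
    }
}
```

On this toy entry the plain list takes 5 bytes and the annotated one 9, and a reader that only wants page numbers still has to step over every category byte.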

>   - Part Of Speech (POS) annotations with each position

This is an example of where it might be worth it... to Grant, and  
Grant only.

Personally, I'm less interested in adding new features than I am in  
solidifying and improving the core.

The benefits I care about are:

   * Decouple Lucene from its file format.
     o Make back-compatibility easier.
     o Make refactoring easier.
     o All the other goodness that comes with loose coupling.
   * Improve IR precision, by writing a Boolean Scorer that
     takes position into account, a la Brin/Page '98.
   * Decrease time to launch a Searcher from rest.
   * Simplify Lucene, conceptually.
     o Indexes would have three parts: Term dictionary,
       Postings, and Storage.
     o Each part could be pluggable, following this format:
       <header><object>+
       * The de-serialization for each object is determined by
         a plugin spec'd in the header.
       * It's probably better to have separate header and data
         files.
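
A rough sketch of the <header><object>+ idea, with the header and data kept in separate files as suggested above. Everything here is an assumption made for illustration: the ObjectDecoder interface, the plugin registry, the plugin names, and the header layout (plugin name followed by an object count). None of it is an existing Lucene API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plugin interface: knows how to de-serialize one object.
interface ObjectDecoder {
    Object read(DataInput in) throws IOException;
}

class PluggableFileSketch {
    // Hypothetical registry mapping the plugin name stored in the header to a decoder.
    static final Map<String, ObjectDecoder> PLUGINS = new HashMap<>();
    static {
        PLUGINS.put("int32", in -> in.readInt());
        PLUGINS.put("utf", in -> in.readUTF());
    }

    // Header file (assumed layout): plugin name, then number of objects in the data file.
    static byte[] header(String pluginName, int count) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(pluginName);
            out.writeInt(count);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Data file holding 32-bit ints, used here only to build a demo input.
    static byte[] intData(int... values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (int v : values) out.writeInt(v);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the header, look up the plugin it names, and let that plugin decode every object.
    static List<Object> readAll(byte[] headerFile, byte[] dataFile) {
        try {
            DataInputStream header = new DataInputStream(new ByteArrayInputStream(headerFile));
            String pluginName = header.readUTF();
            int count = header.readInt();
            ObjectDecoder decoder = PLUGINS.get(pluginName);
            if (decoder == null) {
                throw new IllegalArgumentException("no plugin named " + pluginName);
            }
            DataInputStream data = new DataInputStream(new ByteArrayInputStream(dataFile));
            List<Object> objects = new ArrayList<>();
            for (int i = 0; i < count; i++) objects.add(decoder.read(data));
            return objects;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reading loop never inspects an individual object to decide how to decode it; that decision is made once, from the header.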

> I would suggest to split up the whole work to have smaller work items
> and to have clearly defined milestones. Thus I suggest the
> following steps:
> 1. Introduce postings file with the following format:
>   <DocDelta, Payload>*
>     DocDelta --> VInt
>     DocDelta/2 is the difference between this document number and
>     the previous document number.
>     Payload --> Byte, if DocDelta is even
>     Payload --> <Payload_Length, Payload_Data>, if DocDelta is odd
>       Payload_Length --> VInt
>       Payload_Data   --> Byte^Payload_Length

Good stuff!  Now, if you put that whole thing in a plugin, you'll  
have the chance to refine it even after deployment if you think of a  
way to improve it -- by adding another plugin.  And, if it becomes  
too unwieldy and inflexible, you're not stuck with it.
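
For concreteness, the quoted <DocDelta, Payload>* layout can be sketched roughly as below. This is only one reading of the proposal: the low bit of DocDelta flags a length-prefixed payload, the remaining bits carry the document gap, and the first gap is assumed to be measured from document 0. The Posting and PayloadPostingsSketch names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one decoded posting.
class Posting {
    final int doc;
    final byte[] payload;
    Posting(int doc, byte[] payload) { this.doc = doc; this.payload = payload; }
}

class PayloadPostingsSketch {
    // Variable-length int: 7 data bits per byte, high bit set on all but the last byte.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Assumes well-formed input; no end-of-stream checks in this sketch.
    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    static byte[] encode(List<Posting> postings) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prevDoc = 0;
        for (Posting p : postings) {
            int gap = p.doc - prevDoc;
            if (p.payload.length == 1) {
                writeVInt(out, gap << 1);           // even DocDelta: one payload byte follows
                out.write(p.payload[0]);
            } else {
                writeVInt(out, (gap << 1) | 1);     // odd DocDelta: Payload_Length, Payload_Data
                writeVInt(out, p.payload.length);
                out.write(p.payload, 0, p.payload.length);
            }
            prevDoc = p.doc;
        }
        return out.toByteArray();
    }

    static List<Posting> decode(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        List<Posting> postings = new ArrayList<>();
        int doc = 0;
        while (in.available() > 0) {
            int docDelta = readVInt(in);
            doc += docDelta >>> 1;                  // DocDelta/2 is the gap to the previous doc
            byte[] payload;
            if ((docDelta & 1) == 0) {
                payload = new byte[] { (byte) in.read() };
            } else {
                payload = new byte[readVInt(in)];
                in.read(payload, 0, payload.length);
            }
            postings.add(new Posting(doc, payload));
        }
        return postings;
    }
}
```

Note that under this reading a posting with no payload at all still costs an odd DocDelta plus a zero length byte.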

>   Furthermore, it should be possible to enable/disable payloads
>   on field level.

Maybe each field should get its own file, and its own
encoding/decoding object.  Then you don't have to check each
object/record to see which codec to use.

Or maybe there should be an array of codec objects, indexed by field  
number.

   fieldNum = input->readVint();
   decoders[fieldNum].read(input);
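
Fleshed out a little in Java, the two lines above might look like the sketch below. The FieldDecoder interface, the two example field encodings, and the record layout (a VInt field number followed by that field's record) are assumptions made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-field codec: reads exactly one record from the stream.
interface FieldDecoder {
    String read(ByteArrayInputStream in);
}

class PerFieldDecoderSketch {
    // Variable-length int reader; assumes well-formed input.
    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Example field 0 (assumption): a single unsigned byte, rendered as decimal text.
    static final FieldDecoder BYTE_FIELD = in -> Integer.toString(in.read());

    // Example field 1 (assumption): VInt length followed by that many UTF-8 bytes.
    static final FieldDecoder TEXT_FIELD = in -> {
        byte[] buf = new byte[readVInt(in)];
        in.read(buf, 0, buf.length);
        return new String(buf, StandardCharsets.UTF_8);
    };

    // Each record starts with its field number, which selects the decoder by array index.
    static List<String> decodeAll(byte[] data, FieldDecoder[] decoders) {
        ByteArrayInputStream input = new ByteArrayInputStream(data);
        List<String> records = new ArrayList<>();
        while (input.available() > 0) {
            int fieldNum = readVInt(input);
            records.add(decoders[fieldNum].read(input));
        }
        return records;
    }
}
```

The array lookup replaces any per-record type check, at the price of one field number per record; the one-file-per-field variant would avoid even that.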

> 2. Add multilevel skipping (tree structure) for the postings-file.
>   One-level skipping, as being used now in Lucene, is probably
>   not efficient enough for the new postings file, because it can
>   be very big. Question: Should we include skipping information
>   directly in the postings file, or should we introduce a new file
>   containing the skipping infos? I think it should improve cache
>   performance to have the skip tree in a different file.

Interesting.  I think I'd punt and leave it up to the plugin.  Maybe  
you'd have an extra large header if there was a lot of stuff to be  
cached.
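
For anyone who hasn't seen multilevel skipping, here is a small in-memory sketch of the general technique, not of any proposed file layout. It assumes a sorted array of document numbers and a fixed skip interval, and it keeps the skip levels in arrays separate from the postings, loosely mirroring the "different file" option. The class and field names are invented.

```java
import java.util.ArrayList;
import java.util.List;

class MultiLevelSkipSketch {
    private final int[] docs;     // sorted document numbers (the postings)
    private final int[] steps;    // steps[level] = postings spanned by one skip entry
    private final int[][] skip;   // skip[level][i] = docs[i * steps[level]]

    MultiLevelSkipSketch(int[] sortedDocs, int interval) {
        if (interval < 2) throw new IllegalArgumentException("interval must be >= 2");
        docs = sortedDocs;
        List<int[]> levels = new ArrayList<>();
        List<Integer> stepList = new ArrayList<>();
        // Level 0 samples every interval-th posting, level 1 every interval^2-th, and so on.
        for (long step = interval; step < docs.length; step *= interval) {
            int n = (int) ((docs.length + step - 1) / step);
            int[] entries = new int[n];
            for (int i = 0; i < n; i++) entries[i] = docs[(int) (i * step)];
            levels.add(entries);
            stepList.add((int) step);
        }
        skip = levels.toArray(new int[0][]);
        steps = new int[stepList.size()];
        for (int i = 0; i < steps.length; i++) steps[i] = stepList.get(i);
    }

    // Returns the index of the first posting whose doc is >= target, or docs.length if none.
    int advance(int target) {
        int pos = 0;
        // Walk from the coarsest level down, moving forward while the next entry is still too small.
        for (int level = steps.length - 1; level >= 0; level--) {
            int i = pos / steps[level];
            while (i + 1 < skip[level].length && skip[level][i + 1] < target) i++;
            pos = i * steps[level];
        }
        // Finish with a short linear scan inside the last interval.
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos;
    }
}
```

With interval k, each level needs at most about k comparisons, so a lookup touches on the order of k * log_k(n) entries instead of n / k with a single level.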

> 3. Optional: Add a type-system for the payloads to make it
>   easier to develop PostingsWriter/Reader plugins.

IMO, this should wait.  It's going to be freakishly difficult to get
this stuff to work and maintain the commitments that Doug has laid
out for backwards compatibility.  There are also going to be
trade-offs, and so I'd anticipate contentious, interminable debate
along the lines of the recent Java 1.4/1.5 thread once there's real
code and it becomes clear who's lost a clock tick or two.

Actually, I think pushing this forward is going to be so difficult
that I'll be focusing my attention on implementing it elsewhere.

> 4. Make the PostingsWriter/Reader pluggable and develop default
>   PostingsWriter/Reader plugins, that store frequencies, positions,
>   and norms as payloads in the postings file. Should be configurable,
>   to enable the different options Doug suggested:
>   a. <doc>+
>   b. <doc, boost>+
>   c. <doc, freq, <position>+ >+
>   d. <doc, freq, <position, boost>+ >+

Got any ideas as to how the Field constructors should look?

> 5. Develop new or extend existing PostingsWriter/Reader plugins for
>   desired features like XML search, POS, multi-faceted search, ...

People will definitely want to scratch their own itches, but I'd  
argue that this stuff should start out private.  And maybe stay that  
way!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



