lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: Flexible index format / Payloads Cont'd
Date Fri, 30 Jun 2006 08:55:10 GMT
Marvin Humphrey wrote:
>
> Personally, I'm less interested in adding new features than I am in 
> solidifying and improving the core.
>
> The benefits I care about are:
>
>   * Decouple Lucene from its file format.
>     o Make back-compatibility easier.
>     o Make refactoring easier.
>     o All the other goodness that comes with loose coupling.
>   * Improve IR precision, by writing a Boolean Scorer that
>     takes position into account, a la Brin/Page '98.
>   * Decrease time to launch a Searcher from rest.
>   * Simplify Lucene, conceptually.
>     o Indexes would have three parts: Term dictionary,
>       Postings, and Storage.
>     o Each part could be pluggable, following this format:
>       <header><object>+
>       * The de-serialization for each object is determined by
>         a plugin spec'd in the header.
>       * It's probably better to have separate header and data
>         files. 

>> 3. Optional: Add a type-system for the payloads to make it
>>   easier to develop PostingsWriter/Reader plugins.
>
> IMO, this should wait.  It's going to be freakishly difficult to get 
> this stuff to work and maintain the commitments that Doug has laid out 
> for backwards compatibility.  There's also going to be trade-offs, and 
> so I'd anticipate contentious, interminable debate along the lines of 
> the recent Java 1.4/1.5 thread once there's real code and it becomes 
> clear who's lost a clock tick or two.
>
> Actually, I think pushing this forward is going to be so difficult, 
> that I'll be focusing my attentions on implementing it elsewhere.

I understand that backward compatibility is a big concern. Doug pointed
out that Y.X+1 versions should be backward compatible with Y.X. The
things we are talking about (fundamental changes to the index data
structures, plugins) will break compatibility, so they should be targeted
at Lucene 3.

To have payloads in an earlier 2.X release, we could take a simpler route
and use the implementation I've done so far, which I'll finish soon. Below
I describe this implementation in detail.

* File changes
   - Field Infos
     I'm using the 6th lowest-order bit of FieldBits, which is currently
     unused, to store whether payloads are enabled for a given field.
   - Positions file
     For fields with disabled payloads, the format of the positions file
     does not change at all. If payloads are enabled, then a variable-length
     payload is stored with each position:

     ProxFile (.prx) --> <TermPositions>^TermCount
     TermPositions   --> <Positions>^DocFreq
     Positions       --> <PositionDelta, Payload>^Freq
     PositionDelta   --> VInt
     Payload         --> Byte+   

     Encoding of the Payload:
     If the payload is exactly one byte long, then
        - if the value of the byte is <128, this byte is stored as is
        - if the value of the byte is >=128, a byte 10000001 (0x81)
          is stored, followed by the payload byte itself
     If the payload is longer than one byte but shorter than 127 bytes, then
        - a byte (0x80 | length) is stored, followed by the payload bytes
     If the payload length is >=127, then
        - payload_length-127 is stored as a VInt, followed by the payload
          bytes
     If the payload length is 0, then
        - one byte 0x80 is stored. This is done to distinguish a
          payload with length=0 from a payload with length=1 and value=0
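
     The short-payload cases above can be sketched roughly as follows.
     This is only an illustration and not the actual patch code: the class
     and method names are made up for this example, and payloads of 127
     bytes or more are left out of the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical helper (not part of the patch) for payloads of 0..126 bytes. */
final class PayloadCodecSketch {

    /** Encodes a short payload following the rules listed in the message. */
    static byte[] encode(byte[] payload) {
        int len = payload.length;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (len == 0) {
            out.write(0x80);                 // empty payload marker
        } else if (len == 1 && (payload[0] & 0xFF) < 128) {
            out.write(payload[0]);           // small single byte, stored as is
        } else if (len < 127) {
            out.write(0x80 | len);           // 0x81 for one byte >= 128, else 0x80 | length
            out.write(payload, 0, len);
        } else {
            // Assumption of this sketch: the long-payload (VInt) case is not handled.
            throw new IllegalArgumentException("payloads >= 127 bytes not covered by this sketch");
        }
        return out.toByteArray();
    }

    /** Decodes one short payload starting at offset; mirrors encode() only. */
    static byte[] decode(byte[] buf, int offset) {
        int first = buf[offset] & 0xFF;
        if (first < 128) {
            return new byte[] { (byte) first };   // single byte stored as is
        }
        if (first == 0xFF) {
            // 0xFF is never produced by encode() above.
            throw new IllegalArgumentException("not a short-payload header");
        }
        int len = first & 0x7F;                   // 0 for 0x80, 1 for 0x81, ...
        return Arrays.copyOfRange(buf, offset + 1, offset + 1 + len);
    }
}
```

     With this sketch, a one-byte payload with value 5 takes a single byte
     in the positions file, while a one-byte payload with value 200 takes
     two bytes (0x81 followed by the value).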
       

* API changes
   - org.apache.lucene.index.Payload
     Added this class with the following constructor and getter method:
     * public Payload(byte[] value);
     * public byte[] getValue();

   - org.apache.lucene.analysis.Token
     Added two new constructors and getter/setter:
     * public Token(String text, int start, int end, Payload payload);
     * public Token(String text, int start, int end, String typ,
                    Payload payload);
     * public Payload getPayload();
     * public void setPayload(Payload payload);


   - org.apache.lucene.document.Field
     Added PayloadParameter.YES/.NO to indicate whether the Field stores
     payloads, and added new constructors to create a field with payloads
     enabled:
     * public Field(String name, String value, Store store, Index index,
                    TermVector termVector, PayloadParameter payloadParam);
     * public Field(String name, String value, Store store, Index index,
                    TermVector termVector, Payload payload);
     * public Field(String name, Reader reader, TermVector termVector,
                    PayloadParameter payloadParam);

     Furthermore:
     * public Payload getPayload();
     * public boolean isPayloadStored();

   - org.apache.lucene.index.TermPositions
     Added the new method:
     * public Payload getPayload() throws IOException;
     Remark: In contrast to nextPosition(), this method does not move the
             pointer in the prox file. Therefore it should always be called
             after nextPosition().
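
     To make the call order concrete, here is a small self-contained
     sketch. The Payload class below is only a stand-in with the
     constructor and getter listed above (the defensive copies are an
     assumption), and ToyTermPositions is a made-up in-memory class, not
     the real TermPositions, so it does not throw IOException.

```java
/** Stand-in for the proposed Payload class; copying the array is an assumption. */
final class Payload {
    private final byte[] value;

    Payload(byte[] value) {
        this.value = value.clone();
    }

    byte[] getValue() {
        return value.clone();
    }
}

/** Hypothetical in-memory imitation of the positions of one term in one document. */
final class ToyTermPositions {
    private final int[] positions;
    private final Payload[] payloads;
    private int current = -1;   // index of the position returned last

    ToyTermPositions(int[] positions, Payload[] payloads) {
        this.positions = positions;
        this.payloads = payloads;
    }

    /** Moves the pointer forward and returns the next position. */
    int nextPosition() {
        current++;
        return positions[current];
    }

    /** Returns the payload of the current position; does not move the pointer. */
    Payload getPayload() {
        if (current < 0) {
            throw new IllegalStateException("call nextPosition() first");
        }
        return payloads[current];
    }
}
```

     In this sketch, calling getPayload() twice in a row returns the
     payload of the same position; only nextPosition() advances.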


So in my opinion, adding this payload feature to the Lucene core in a 2.X
release is not a big risk, for the following reasons:
   - The API is only extended.
   - Lucene 2.X will be able to read an index created with an earlier
     version, because the Payload bit in FieldInfos will always be 0 then.
   - Payloads are disabled by default. They will only be enabled by using
     the new API.
   - If payloads are disabled, then Lucene 2.0 is able to read an index
     created with Lucene 2.X, because the file formats don't change at all
     in that case.
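
The second point comes down to a simple bit test on FieldBits. A minimal
sketch, assuming the 6th lowest-order bit corresponds to the mask 0x20; the
class, constant, and method names are invented for this example:

```java
/** Hypothetical illustration of the payload flag in FieldBits. */
final class FieldBitsSketch {
    // 6th lowest-order bit; the constant name is an assumption of this sketch.
    static final byte STORE_PAYLOADS = 0x20;

    /** True if the payload bit is set for this field. */
    static boolean payloadsEnabled(byte fieldBits) {
        return (fieldBits & STORE_PAYLOADS) != 0;
    }

    /** Returns fieldBits with the payload bit switched on. */
    static byte withPayloads(byte fieldBits) {
        return (byte) (fieldBits | STORE_PAYLOADS);
    }
}
```

An index written by an earlier version never sets this bit, so every field
in it reads as "payloads disabled" and its positions are parsed in the old
format.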

So we could go ahead and add this to 2.X and keep working on the more
fundamental changes for Lucene 3. Sounds like a plan?

>
>
>> 5. Develop new or extend existing PostingsWriter/Reader plugins for
>>   desired features like XML search, POS, multi-faceted search, ...
>
> People will definitely want to scratch their own itches, but I'd argue 
> that this stuff should start out private.  And maybe stay that way!

I agree with that. We should focus on improving the Lucene core and start
offering a flexible payload mechanism, so that people can start developing
their own stuff. Later, if people submit good solutions, those might be
good candidates for contrib.

>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
Regards,
  Michael Busch
