lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Flexible index format / Payloads Cont'd
Date Fri, 21 Jul 2006 10:37:03 GMT

On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> In fact, that was my first implementation. The problem with that is that you
> can only store one value. But thinking a little more about it, storing one or
> more values is not an issue, because with the solution I proposed, no space
> is saved at all.
> In fact, when I thought about this format of field metadata, I was thinking
> about a way to let the Lucene user specify how to store it in the Lucene
> index format. For instance, the simplest one would specify that it's a
> pointer to some metadata (as you proposed), another one would specify that
> there are two pointers (in my use case, one for the type, the other for the
> language), and another one would specify that it will be stored directly, as
> it is actually an integer (so no need for a pointer to an integer). But it
> was just a thought; I don't know if it is possible. WDYT?

I'm thinking that there would be a codecs file, say with the  
extension .cdx and this format:

   Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
   CodecCount     --> Uint32
   CodecClassName --> String

That file would be read in its entirety when the index was  
initialized and expanded into an array of codec objects, one per  
CodecClassName.
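
To make the .cdx idea concrete, here is a rough Java sketch of writing and reading such a file. It is only an illustration under assumptions: the class name CodecsFileSketch and its methods are hypothetical, java.io's writeUTF/readUTF stand in for Lucene's own String encoding, and the Uint32 count is read as a signed Java int. It reads back only the class names; turning each name into a codec object is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed .cdx layout; not part of Lucene's API.
class CodecsFileSketch {

    // Writes CodecCount as a 32-bit int, then one class name per codec.
    // Assumption: writeUTF stands in for Lucene's String encoding.
    static byte[] write(List<String> codecClassNames) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(codecClassNames.size());
            for (String name : codecClassNames) {
                out.writeUTF(name);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file up front; a name's list index is its codec number.
    static List<String> read(byte[] cdx) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(cdx));
            int codecCount = in.readInt();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < codecCount; i++) {
                names.add(in.readUTF());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] cdx = write(Arrays.asList("FrenchDocument", "ArabicDocument"));
        System.out.println(read(cdx));
    }
}
```

Because the list is built once at startup, looking up a codec by number later is just an array-style index.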

The .fdx file would add an additional int per doc...

   FieldIndex (.fdx)      --> <FieldValuesPosition, FieldValuesCodecNumber>SegSize
   FieldValuesPosition    --> Uint64
   FieldValuesCodecNumber --> Uint32
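
As a sketch of what that fixed-width entry buys you, the Java below packs a Uint64 position and a Uint32 codec number into 12 bytes per document and looks either one up by doc number with plain arithmetic. The class FieldIndexSketch and its method names are hypothetical, an in-memory byte array stands in for the .fdx file, and big-endian byte order is an assumption.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the extended .fdx entry; not part of Lucene's API.
class FieldIndexSketch {

    // Uint64 position + Uint32 codec number = 12 bytes per document.
    static final int ENTRY_BYTES = Long.BYTES + Integer.BYTES;

    // Writes one fixed-width entry per document (big-endian by default).
    static byte[] write(long[] positions, int[] codecNumbers) {
        ByteBuffer buf = ByteBuffer.allocate(positions.length * ENTRY_BYTES);
        for (int doc = 0; doc < positions.length; doc++) {
            buf.putLong(positions[doc]);
            buf.putInt(codecNumbers[doc]);
        }
        return buf.array();
    }

    // Offset of this document's data in the .fdt file.
    static long positionOf(byte[] fdx, int doc) {
        return ByteBuffer.wrap(fdx).getLong(doc * ENTRY_BYTES);
    }

    // Index into the codecs array for this document.
    static int codecNumberOf(byte[] fdx, int doc) {
        return ByteBuffer.wrap(fdx).getInt(doc * ENTRY_BYTES + Long.BYTES);
    }

    public static void main(String[] args) {
        byte[] fdx = write(new long[] {0L, 57L, 300L}, new int[] {0, 1, 0});
        System.out.println(positionOf(fdx, 1) + " " + codecNumberOf(fdx, 1));
    }
}
```

The cost is four extra bytes per document in the index file, in exchange for knowing the codec before touching the .fdt data.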

Now, before you read any data from the .fdt file, you know how to  
interpret it.  You seek the .fdt IndexInput to the right spot, then  
feed it to the appropriate codec object from the codecs array.  The  
codec does the rest.  In your case, you might write a codec that  
would read a few bytes and strings of metadata up front.  Or you  
might have several different codecs, the identity of which indicates  
fixed values for certain metadata fields: FrenchDocument,  
ArabicDocument, etc.
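
Roughly, the read path could look like the Java sketch below. Everything in it is hypothetical and only meant to illustrate the dispatch: the FieldCodec interface, the three example codecs, and the readDoc helper are made-up names, a byte array stands in for the .fdt IndexInput, and readUTF stands in for Lucene's String encoding. Two codecs imply the language by their identity; the third reads a language tag up front.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of codec dispatch; not part of Lucene's API.
class CodecDispatchSketch {

    // A codec knows how to interpret the bytes at a document's position.
    interface FieldCodec {
        String decode(DataInputStream in) throws IOException;
    }

    // The codec's identity fixes the language; only the text is stored.
    static final FieldCodec FRENCH = in -> "fr:" + in.readUTF();
    static final FieldCodec ARABIC = in -> "ar:" + in.readUTF();

    // This codec reads a metadata string up front, then the text.
    static final FieldCodec TAGGED = in -> {
        String lang = in.readUTF();
        return lang + ":" + in.readUTF();
    };

    // Stand-in for the array built from the .cdx file; index == codec number.
    static final FieldCodec[] CODECS = { FRENCH, ARABIC, TAGGED };

    // Start reading at the position, then let the chosen codec do the rest.
    static String readDoc(byte[] fdt, long position, int codecNumber) {
        try {
            int start = (int) position;
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(fdt, start, fdt.length - start));
            return CODECS[codecNumber].decode(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper that builds sample .fdt bytes from a sequence of strings.
    static byte[] utf(String... parts) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String part : parts) {
                out.writeUTF(part);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "bonjour" takes 2 length bytes + 7 characters, so doc 1 starts at 9.
        byte[] fdt = utf("bonjour", "en", "hello");
        System.out.println(readDoc(fdt, 0L, 0));
        System.out.println(readDoc(fdt, 9L, 2));
    }
}
```

The reader never inspects the data itself to decide how to parse it; the position and codec number from the .fdx entry are enough.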

Would that scheme meet your needs?

Marvin Humphrey
Rectangular Research
