lucene-dev mailing list archives

From Nicolas Lalevée
Subject Re: Flexible index format / Payloads Cont'd
Date Mon, 31 Jul 2006 15:25:26 GMT
On Friday 21 July 2006 12:37, Marvin Humphrey wrote:
> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> > In fact, that was my first implementation. The problem with that is
> > that you can only store one value. But thinking a little more about it,
> > storing one or more values is not an issue, because with the solution
> > I proposed no space is saved at all.
> > In fact, when I thought about this format of field metadata, I was
> > thinking about a way to let the Lucene user specify how to store it in
> > the Lucene index format. For instance, the simple one would specify
> > that it's a pointer to some metadata (as you proposed), another one
> > would specify that there are two pointers (in my use case, one for the
> > type, the other one for the language), and another one would specify
> > that it will be stored directly, as it is actually an integer (so no
> > need to make a pointer to an integer). But it was just a thought, I
> > don't know if it is possible. WDYT?
> I'm thinking that there would be a codecs file, say with the
> extension .cdx and this format:
>    Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
>    CodecCount     --> Uint32
>    CodecClassName --> String
> That file would be read in its entirety when the index was
> initialized and expanded into an array of codec objects, one per
> CodecClassName.
> The .fdx file would add an additional int per doc...
>    FieldIndex (.fdx) -->  <FieldValuesPosition,
>                            FieldValuesCodecNumber>SegSize
>    FieldValuesPosition    --> Uint64
>    FieldValuesCodecNumber --> Uint32
> Now, before you read any data from the .fdt file, you know how to
> interpret it.  You seek the .fdt IndexInput to the right spot, then
> feed it to the appropriate codec object from the codecs array.  The
> codec does the rest.  In your case, you might write a codec that
> would read a few bytes and strings of metadata up front.  Or you
> might have several different codecs, the identity of which indicates
> fixed values for certain metadata fields: FrenchDocument,
> ArabicDocument, etc.
> Would that scheme meet your needs?

That looks good, but there is one restriction: the codec has to be chosen per
document. Let me explain my needs a little more.

My app has to index data that is structured as an RDF graph. Each RDF
resource has a title and a description, and each title and description can
exist in several languages. The model we chose is to map an RDF resource to
a document. The field name is then the URI of the RDF property, and the field
value is the literal or another resource.
For instance:
doc1 : URI:   title:[en]foo   title:[fr]truc
So a single document will have several fields in different languages. For my
use case, I actually need only one "codec". It is a codec that handles 3
values, 2 of them being optional: a language, a type, and a value.
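
To make that concrete, here is a minimal sketch of such a codec in Java. It is only an illustration under my own assumptions: the class name TypedLiteralCodec and the flags-byte layout are hypothetical, and it uses plain java.io streams rather than Lucene's own input/output classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical codec: one flags byte, then the optional language and type,
// then the mandatory value. Not an existing Lucene class.
class TypedLiteralCodec {
    static final int HAS_LANGUAGE = 0x01;
    static final int HAS_TYPE = 0x02;

    // language and type may be null; value must not be null.
    public static byte[] encode(String language, String type, String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            int flags = (language != null ? HAS_LANGUAGE : 0)
                      | (type != null ? HAS_TYPE : 0);
            out.writeByte(flags);
            if (language != null) out.writeUTF(language);
            if (type != null) out.writeUTF(type);
            out.writeUTF(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {language, type, value}; absent optional parts come back as null.
    public static String[] decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int flags = in.readByte();
            String language = (flags & HAS_LANGUAGE) != 0 ? in.readUTF() : null;
            String type = (flags & HAS_TYPE) != 0 ? in.readUTF() : null;
            String value = in.readUTF();
            return new String[] { language, type, value };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, a field without language or type costs only one extra byte, and title:[fr]truc would be written as encode("fr", null, "truc").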

In fact I was thinking about a more generic version that would preserve format
compatibility, keeping .fdx as is:

FieldData (.fdt) -->  <DocFieldData>SegSize
DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount

The default FieldsDataWriter would be the current one; it would treat the
RawData as Bits, Value, with Value -->  String | BinaryValue,....
Then, for my app, I would provide a custom FieldsDataWriter that does
exactly what I want.
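
As a rough sketch of the idea, assuming a hypothetical RawDataCodec interface as the extension point (none of these names exist in Lucene, and I use fixed-size ints from java.io where the real format would probably use VInts):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration of DocFieldData with a pluggable RawData part.
class PluggableFieldData {
    // Assumed extension point: writes and reads the RawData of one field.
    public interface RawDataCodec {
        void write(DataOutput out, String value) throws IOException;
        String read(DataInput in) throws IOException;
    }

    // Stand-in for the default behaviour: a Bits byte followed by a string Value.
    public static final RawDataCodec DEFAULT = new RawDataCodec() {
        @Override
        public void write(DataOutput out, String value) throws IOException {
            out.writeByte(0); // Bits: no flags set in this sketch
            out.writeUTF(value);
        }

        @Override
        public String read(DataInput in) throws IOException {
            in.readByte(); // Bits, ignored here
            return in.readUTF();
        }
    };

    // DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
    public static byte[] writeDoc(int[] fieldNums, String[] values, RawDataCodec codec) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(fieldNums.length);      // FieldCount
            for (int i = 0; i < fieldNums.length; i++) {
                out.writeInt(fieldNums[i]);      // FieldNum
                codec.write(out, values[i]);     // RawData, opaque to this method
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the field values back; the codec decides how to interpret RawData.
    public static String[] readDoc(byte[] data, RawDataCodec codec) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int fieldCount = in.readInt();
            String[] values = new String[fieldCount];
            for (int i = 0; i < fieldCount; i++) {
                in.readInt();                    // FieldNum, not needed in this sketch
                values[i] = codec.read(in);
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point is that only the codec knows the layout of RawData, so an index written with the default codec stays readable as long as the default codec keeps the current layout.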

What I don't know yet is how much this breaks the API... because if I want to
provide my own FieldsDataWriter, I would also want to have my own
implementation of Fieldable...
If you think this is a good idea, I will try to implement it.

