lucene-dev mailing list archives

From robert engels <reng...@ix.netcom.com>
Subject Re: Flexible index format / Payloads Cont'd
Date Mon, 31 Jul 2006 15:28:30 GMT
Doing this breaks compatibility with non-Java Lucene implementations.
Not sure it matters, but I thought I would point it out. I have  
always thought that Lucene should be compatible at an API level only,  
and MAYBE create a network access protocol for queries and updates.

On Jul 31, 2006, at 10:25 AM, Nicolas Lalevée wrote:

> On Friday 21 July 2006 12:37, Marvin Humphrey wrote:
>> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
>>> In fact, that was my first implementation. The problem with that is
>>> that you can only store one value. But thinking a little more about
>>> it, storing one or more values is not an issue, because with the
>>> solution I proposed, no space is saved at all.
>>> In fact, when I thought about this format of field metadata, I was
>>> thinking about a way to let the Lucene user specify how to store it
>>> in the Lucene index format. For instance, the simple one would
>>> specify that it's a pointer to some metadata (as you proposed),
>>> another one would specify that there are two pointers (in my use
>>> case, one for the type, the other for the language), and another one
>>> would specify that it will be stored directly, as it is actually an
>>> integer (so no need to make a pointer to an integer). But it was
>>> just a thought; I don't know if it is possible. WDYT?
>>
>> I'm thinking that there would be a codecs file, say with the
>> extension .cdx and this format:
>>
>>    Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
>>    CodecCount     --> Uint32
>>    CodecClassName --> String
>>
>> That file would be read in its entirety when the index was
>> initialized and expanded into an array of codec objects, one per
>> CodecClassName.
>>
>> The .fdx file would add an additional int per doc...
>>
>>    FieldIndex (.fdx) -->  <FieldValuesPosition,
>>                            FieldValuesCodecNumber>SegSize
>>    FieldValuesPosition    --> Uint64
>>    FieldValuesCodecNumber --> Uint32
>>
>> Now, before you read any data from the .fdt file, you know how to
>> interpret it.  You seek the .fdt IndexInput to the right spot, then
>> feed it to the appropriate codec object from the codecs array.  The
>> codec does the rest.  In your case, you might write a codec that
>> would read a few bytes and strings of metadata up front.  Or you
>> might have several different codecs, the identity of which indicates
>> fixed values for certain metadata fields: FrenchDocument,
>> ArabicDocument, etc.
>>
>> Would that scheme meet your needs?
>
> That looks good, but there is one restriction: it has to be per
> document.
> Let me explain my needs a little bit more.
>
> In fact my app has to index some data which is structured as an RDF
> graph. Each RDF resource has a title and a description, each title and
> description being in different languages. The model we chose is to map
> an RDF resource to a document. Then the field name is the URI of the
> RDF property, and the field value is the literal or another resource.
> For instance:
> doc1 : URI:http://foo.com   title:[en]foo   title:[fr]truc
> So, in a document I will have several fields with different
> languages. For my use case, in fact, I need only one "codec". It is a
> codec that will get 3 values, 2 of them being optional: a language, a
> type, and a value.
>
> In fact I was thinking about a more generic version that would keep
> format compatibility, keeping .fdx as is:
>
> FieldData (.fdt) -->  <DocFieldData>SegSize
> DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
>
> And the default FieldsDataWriter will be the current one; it will read
> the RawData as Bits, Value, with Value --> String | BinaryValue, ....
> Then, for my app, I will provide a custom FieldsDataWriter that will
> do exactly what I want.
>
> What I don't know yet is how it breaks the API... because if I want to
> provide my own FieldsDataWriter, I would also want to have my own
> implementation of Fieldable...
> If you think this is a good idea, I will try to implement it.
>
> cheers,
> Nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
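
For reference, the .cdx / .fdx layout quoted above could be sketched roughly as follows. This is only an illustration under assumptions, not Lucene code: the class and method names (CodecIndexSketch, writeCodecs, positionOf, ...) are made up, writeUTF stands in for Lucene's own String encoding, and Uint32/Uint64 are approximated with Java's signed int/long.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical sketch of the proposed .cdx / .fdx layout; not part of Lucene.
class CodecIndexSketch {

    // Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
    static byte[] writeCodecs(String[] codecClassNames) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(codecClassNames.length); // CodecCount
            for (String name : codecClassNames) {
                out.writeUTF(name);               // stand-in for Lucene's String format
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole codecs file up front, as it would be at index initialization.
    static String[] readCodecs(byte[] cdx) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(cdx));
            String[] names = new String[in.readInt()];
            for (int i = 0; i < names.length; i++) {
                names[i] = in.readUTF();
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // FieldIndex (.fdx) --> <FieldValuesPosition, FieldValuesCodecNumber>SegSize
    // Each entry is fixed-width, so a document's entry sits at doc * FDX_ENTRY_SIZE.
    static final int FDX_ENTRY_SIZE = Long.BYTES + Integer.BYTES;

    static byte[] writeFieldIndex(long[] positions, int[] codecNumbers) {
        ByteBuffer buf = ByteBuffer.allocate(positions.length * FDX_ENTRY_SIZE);
        for (int doc = 0; doc < positions.length; doc++) {
            buf.putLong(positions[doc]);   // FieldValuesPosition
            buf.putInt(codecNumbers[doc]); // FieldValuesCodecNumber
        }
        return buf.array();
    }

    // Offset into the .fdt data for the given document.
    static long positionOf(byte[] fdx, int doc) {
        return ByteBuffer.wrap(fdx).getLong(doc * FDX_ENTRY_SIZE);
    }

    // Index into the codecs array for the given document.
    static int codecNumberOf(byte[] fdx, int doc) {
        return ByteBuffer.wrap(fdx).getInt(doc * FDX_ENTRY_SIZE + Long.BYTES);
    }
}
```

A reader would call readCodecs once, then for each document use codecNumberOf to pick the codec name and positionOf to know where to seek in the .fdt data before handing the input to that codec.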

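The per-field alternative quoted above (a RawData blob per field, holding a value plus an optional language and an optional type) might look something like the sketch below. Again, this is an assumption-laden illustration: TypedFieldSketch, the flag bits, and the byte layout are invented here, and it does not use the real FieldsDataWriter or Fieldable API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical RawData layout: flags byte, [language], [type], value.
class TypedFieldSketch {
    static final int HAS_LANGUAGE = 1; // bit 0: a language string follows the flags
    static final int HAS_TYPE = 2;     // bit 1: a type string follows

    final String language; // optional, may be null
    final String type;     // optional, may be null
    final String value;    // always present

    TypedFieldSketch(String language, String type, String value) {
        this.language = language;
        this.type = type;
        this.value = value;
    }

    // Serializes this field into the bytes that would be stored as RawData.
    byte[] encode() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            int flags = (language != null ? HAS_LANGUAGE : 0)
                      | (type != null ? HAS_TYPE : 0);
            out.writeByte(flags);
            if (language != null) out.writeUTF(language);
            if (type != null) out.writeUTF(type);
            out.writeUTF(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuilds a field from RawData, reading only the parts the flags announce.
    static TypedFieldSketch decode(byte[] rawData) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(rawData));
            int flags = in.readUnsignedByte();
            String language = (flags & HAS_LANGUAGE) != 0 ? in.readUTF() : null;
            String type = (flags & HAS_TYPE) != 0 ? in.readUTF() : null;
            return new TypedFieldSketch(language, type, in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout the title:[fr]truc field from the example would be encoded as new TypedFieldSketch("fr", null, "truc").encode(), and a field with no language or type costs only one extra flags byte.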
