lucene-dev mailing list archives

From Nicolas Lalevée <nicolas.lale...@anyware-tech.com>
Subject Re: Flexible index format / Payloads Cont'd
Date Mon, 31 Jul 2006 16:26:01 GMT
Hi,

On Monday 31 July 2006 17:28, robert engels wrote:
> Doing this breaks compatibility with non-Java Lucene implementations.

For me, that kind of compatibility is about the file format. Am I wrong?
In that case, I don't see any compatibility break, since the default
implementation of FieldsDataWriter is the current one. And if I generate an
index with my custom writer, I would expect my index to be incompatible with
other implementations, even other Java ones.
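
To make this concrete, here is a minimal sketch of what I mean, not a real
patch. All the names (RawDataWriter, StoredValue, DefaultRawDataWriter,
RdfRawDataWriter, FieldDataSketch) are hypothetical, and the
java.io.DataOutputStream encodings only stand in for Lucene's own VInt and
String encodings. The default writer emits "Bits, Value" as today; the custom
one emits an optional language, an optional type, and the value, so the two
byte streams differ:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical extension point: writes the RawData part of one stored field.
interface RawDataWriter {
    void writeRawData(DataOutputStream out, StoredValue field) throws IOException;
}

// Hypothetical value holder: language and type are optional (may be null).
final class StoredValue {
    final String language;
    final String type;
    final String value;

    StoredValue(String language, String type, String value) {
        this.language = language;
        this.type = type;
        this.value = value;
    }
}

// Stand-in for the current format: a flags byte (Bits) followed by the value.
final class DefaultRawDataWriter implements RawDataWriter {
    public void writeRawData(DataOutputStream out, StoredValue f) throws IOException {
        out.writeByte(0);
        out.writeUTF(f.value);
    }
}

// Custom format: presence flags, then optional language and type, then the value.
final class RdfRawDataWriter implements RawDataWriter {
    static final int HAS_LANGUAGE = 1;
    static final int HAS_TYPE = 2;

    public void writeRawData(DataOutputStream out, StoredValue f) throws IOException {
        int flags = (f.language != null ? HAS_LANGUAGE : 0) | (f.type != null ? HAS_TYPE : 0);
        out.writeByte(flags);
        if (f.language != null) out.writeUTF(f.language);
        if (f.type != null) out.writeUTF(f.type);
        out.writeUTF(f.value);
    }
}

final class FieldDataSketch {
    // DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount (one field here).
    static byte[] encode(RawDataWriter writer, int fieldNum, StoredValue field) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(1);        // FieldCount
            out.writeInt(fieldNum); // FieldNum
            writer.writeRawData(out, field);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only the RawData part changes between the two writers; FieldCount and FieldNum
are written the same way, which is why .fdx could stay as is.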

> Not sure it matters, but I thought I would point it out. I have
> always thought that Lucene should be compatible at an API level only,
> and MAYBE create a network access protocol for queries and updates.

I didn't talk about network access... I don't see your point...

>
> On Jul 31, 2006, at 10:25 AM, Nicolas Lalevée wrote:
> > On Friday 21 July 2006 12:37, Marvin Humphrey wrote:
> >> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> >>> In fact, that was my first implementation. The problem with that is
> >>> you can only store one value. But thinking a little more about it,
> >>> storing one or more values is not an issue, because with the solution
> >>> I proposed, no space is saved at all.
> >>> In fact, when I thought about this format of field metadata, I was
> >>> thinking about a way to let the Lucene user specify how to store it in
> >>> the Lucene index format. For instance, the simple one would specify
> >>> that it's a pointer to some metadata (as you proposed), another one
> >>> would specify that there are two pointers (in my use case, one for the
> >>> type, the other one for the language), and another one would specify
> >>> that it will be stored directly, as it is actually an integer (so no
> >>> need to make a pointer to an integer). But it was just a thought, I
> >>> don't know if it is possible. WDYT?
> >>
> >> I'm thinking that there would be a codecs file, say with the
> >> extension .cdx and this format:
> >>
> >>    Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
> >>    CodecCount     --> Uint32
> >>    CodecClassName --> String
> >>
> >> That file would be read in its entirety when the index was
> >> initialized and expanded into an array of codec objects, one per
> >> CodecClassName.
> >>
> >> The .fdx file would add an additional int per doc...
> >>
> >>    FieldIndex (.fdx) -->  <FieldValuesPosition,
> >>                            FieldValuesCodecNumber>SegSize
> >>    FieldValuesPosition    --> Uint64
> >>    FieldValuesCodecNumber --> Uint32
> >>
> >> Now, before you read any data from the .fdt file, you know how to
> >> interpret it.  You seek the .fdt IndexInput to the right spot, then
> >> feed it to the appropriate codec object from the codecs array.  The
> >> codec does the rest.  In your case, you might write a codec that
> >> would read a few bytes and strings of metadata up front.  Or you
> >> might have several different codecs, the identity of which indicates
> >> fixed values for certain metadata fields: FrenchDocument,
> >> ArabicDocument, etc.
> >>
> >> Would that scheme meet your needs?
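
A minimal sketch of how I read the scheme above, under my own assumptions.
FieldCodec, PlainCodec, FrenchDocumentCodec and CodecTableSketch are
hypothetical names, readInt/readUTF stand in for Uint32 and Lucene's String
encoding, and a switch replaces whatever class loading a real implementation
would use:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical codec: knows how to interpret the bytes at a .fdt position.
interface FieldCodec {
    String decode(DataInputStream fdt) throws IOException;
}

final class PlainCodec implements FieldCodec {
    public String decode(DataInputStream fdt) throws IOException {
        return fdt.readUTF();
    }
}

// The codec's identity implies a fixed metadata value (here, the language).
final class FrenchDocumentCodec implements FieldCodec {
    public String decode(DataInputStream fdt) throws IOException {
        return "[fr]" + fdt.readUTF();
    }
}

final class CodecTableSketch {
    // Maps a stored class name to an instance; an assumption replacing reflection.
    static FieldCodec instantiate(String className) {
        switch (className) {
            case "PlainCodec": return new PlainCodec();
            case "FrenchDocumentCodec": return new FrenchDocumentCodec();
            default: throw new IllegalArgumentException("Unknown codec: " + className);
        }
    }

    // Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
    static byte[] writeCdx(String... classNames) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(classNames.length);
            for (String name : classNames) out.writeUTF(name);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole codecs file once and expands it into codec objects.
    static List<FieldCodec> readCdx(byte[] cdx) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(cdx));
            int count = in.readInt();
            List<FieldCodec> codecs = new ArrayList<>(count);
            for (int i = 0; i < count; i++) codecs.add(instantiate(in.readUTF()));
            return codecs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Toy .fdt content: the values written one after another.
    static byte[] writeFdt(String... values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String value : values) out.writeUTF(value);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Seeks to FieldValuesPosition, then hands the input to the codec
    // selected by FieldValuesCodecNumber.
    static String readStored(List<FieldCodec> codecs, byte[] fdt, long position, int codecNumber) {
        try {
            int offset = (int) position;
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(fdt, offset, fdt.length - offset));
            return codecs.get(codecNumber).decode(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The position and codec number passed to readStored are what the extended .fdx
entry would supply for each document.
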
> >
> > That looks good, but there is one restriction: it has to be per
> > document.
> > Let me explain my needs a little more.
> >
> > In fact my app has to index some data which is structured as an RDF
> > graph. Each RDF resource has a title and a description, each title and
> > description being in different languages. The model we chose is to map
> > an RDF resource to a document. Then the field name is the URI of the RDF
> > property, and the field value is the literal or another resource.
> > For instance:
> > doc1 : URI:http://foo.com   title:[en]foo   title:[fr]truc
> > So, in a document I will have several fields with different languages.
> > For my use case, in fact I need only one "codec". It is a codec that will
> > get 3 values, 2 of them being optional: a language, a type, and a value.
> >
> > In fact I was thinking about a more generic version that would preserve
> > format compatibility, keeping .fdx as is:
> >
> > FieldData (.fdt) -->  <DocFieldData>SegSize
> > DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
> >
> > And a default FieldsDataWriter will be the current one; it will read
> > the RawData as Bits, Value, with Value -->  String | BinaryValue,....
> > Then, for my app, I will provide a custom FieldsDataWriter that will do
> > exactly what I want.
> >
> > What I don't know yet is how this breaks the API... because if I want
> > to provide my own FieldsDataWriter, I would also want to have my own
> > implementation of Fieldable...
> > If you think this is a good idea, I will try to implement it.
> >
> > cheers,
> > Nicolas
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org

