From: Nicolas Lalevée
Organization: Anyware Technologies
To: java-dev@lucene.apache.org
Subject: Re: Flexible index format / Payloads Cont'd
Date: Mon, 31 Jul 2006 17:25:26 +0200

On Friday 21 July 2006 12:37, Marvin Humphrey wrote:
> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> > In fact, that was my first implementation. The problem with that is you can
> > only store one value. But thinking a little more about it, storing one or
> > more values is not an issue, because with the solution I proposed, no space
> > is saved at all.
> > In fact, when I thought about this format of field metadata, I was thinking
> > about a way to let the Lucene user specify how to store it in the Lucene
> > index format. For instance, the simple one would specify that it's a pointer
> > to some metadata (as you proposed), another one would specify that there are
> > two pointers (in my use case, one for the type, the other one for the
> > language), and another one would specify that it will be stored directly, as
> > it is actually an integer (so no need to make a pointer to an integer). But
> > it was just a thought, I don't know if it is possible. WDYT?
>
> I'm thinking that there would be a codecs file, say with the
> extension .cdx and this format:
>
>   Codecs (.cdx) --> CodecCount, <CodecClassName>^CodecCount
>   CodecCount --> Uint32
>   CodecClassName --> String
>
> That file would be read in its entirety when the index was
> initialized and expanded into an array of codec objects, one per
> CodecClassName.
>
> The .fdx file would add an additional int per doc...
>
>   FieldIndex (.fdx) --> <FieldValuesPosition, FieldValuesCodecNumber>^SegSize
>   FieldValuesPosition --> Uint64
>   FieldValuesCodecNumber --> Uint32
>
> Now, before you read any data from the .fdt file, you know how to
> interpret it. You seek the .fdt IndexInput to the right spot, then
> feed it to the appropriate codec object from the codecs array. The
> codec does the rest. In your case, you might write a codec that
> would read a few bytes and strings of metadata up front. Or you
> might have several different codecs, the identity of which indicates
> fixed values for certain metadata fields: FrenchDocument,
> ArabicDocument, etc.
>
> Would that scheme meet your needs?

That looks good, but there is one restriction: it has to be per document.
Let me explain my needs a little bit more.

In fact my app has to index some data which is structured as an RDF graph.
Each RDF resource has a title and a description, each title and description
being in different languages. The model we chose is to map an RDF resource
onto a document. Then the field name is the URI of the RDF property, and the
field value is the literal or another resource. For instance:

doc1:
  URI:http://foo.com
  title:[en]foo
  title:[fr]truc

So, in a document I will have several fields with different languages. For my
use case, in fact, I need only one "codec". It is a codec that will get 3
values, 2 of them being optional: a language, a type, and a value.
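
A minimal sketch of such a three-value codec, assuming a one-byte flag prefix marks which of the optional parts are present. The class name `TypedFieldCodec`, the flag layout, and the use of plain `java.io` streams instead of Lucene's own index I/O classes are all illustrative assumptions, not part of Lucene or of the scheme discussed above:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical codec: one flag byte, then optional language, optional type, then the value. */
final class TypedFieldCodec {
    private static final int HAS_LANGUAGE = 0x01;
    private static final int HAS_TYPE = 0x02;

    /** Decoded triple; language and type are null when they were not stored. */
    static final class Decoded {
        final String language;
        final String type;
        final String value;

        Decoded(String language, String type, String value) {
            this.language = language;
            this.type = type;
            this.value = value;
        }
    }

    /** Serializes the triple; pass null for an absent language or type. */
    static byte[] encode(String language, String type, String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            int flags = (language != null ? HAS_LANGUAGE : 0) | (type != null ? HAS_TYPE : 0);
            out.writeByte(flags);
            if (language != null) out.writeUTF(language);
            if (type != null) out.writeUTF(type);
            out.writeUTF(value); // the value itself is mandatory
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the flag byte first, so it knows which optional parts follow. */
    static Decoded decode(byte[] raw) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
            int flags = in.readUnsignedByte();
            String language = (flags & HAS_LANGUAGE) != 0 ? in.readUTF() : null;
            String type = (flags & HAS_TYPE) != 0 ? in.readUTF() : null;
            String value = in.readUTF();
            return new Decoded(language, type, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The title:[fr]truc field from the example: a language, no type.
        Decoded d = decode(encode("fr", null, "truc"));
        System.out.println("[" + d.language + "] " + d.value);
    }
}
```

Running `main` prints `[fr] truc`. A field with neither language nor type costs only the single flag byte on top of the value, which is the point of making the two extra values optional.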

In fact I was thinking about a more generic version that would keep format
compatibility, leaving .fdx as is:

  FieldData (.fdt) --> <DocFieldData>^SegSize
  DocFieldData --> FieldCount, FieldCount

And the default FieldsDataWriter would be the current one; it would read the
RawData as Bits, Value, with Value --> String | BinaryValue, ...

Then, for my app, I will provide a custom FieldsDataWriter that does exactly
what I want.

What I don't know yet is how much it breaks the API... because if I want to
provide my own FieldsDataWriter, I would also want to have my own
implementation of Fieldable...

If you think this is a good idea, I will try to implement it.

cheers,
Nicolas