From: Michael Busch
To: java-dev@lucene.apache.org
Date: Fri, 30 Jun 2006 10:55:10 +0200
Subject: Re: Flexible index format / Payloads Cont'd

Marvin Humphrey wrote:
>
> Personally, I'm less interested in adding new features than I am in
> solidifying and improving the core.
>
> The benefits I care about are:
>
>     * Decouple Lucene from its file format.
>           o Make back-compatibility easier.
>           o Make refactoring easier.
>           o All the other goodness that comes with loose coupling.
>     * Improve IR precision, by writing a Boolean Scorer that
>       takes position into account, a la Brin/Page '98.
>     * Decrease time to launch a Searcher from rest.
>     * Simplify Lucene, conceptually.
>           o Indexes would have three parts: Term dictionary,
>             Postings, and Storage.
>           o Each part could be pluggable, following this format:
>
>                 +
>                       * The de-serialization for each object is determined by
>                         a plugin spec'd in the header.
>                       * It's probably better to have separate header and data
>                         files.
>
>> 3. Optional: Add a type-system for the payloads to make it
>> easier to develop PostingsWriter/Reader plugins.
>
> IMO, this should wait. It's going to be freakishly difficult to get
> this stuff to work and maintain the commitments that Doug has laid out
> for backwards compatibility. There's also going to be trade-offs, and
> so I'd anticipate contentious, interminable debate along the lines of
> the recent Java 1.4/1.5 thread once there's real code and it becomes
> clear who's lost a clock tick or two.
>
> Actually, I think pushing this forward is going to be so difficult,
> that I'll be focusing my attentions on implementing it elsewhere.

I understand that backward compatibility is a big concern. Doug pointed
out that Y.X+1 versions should be backward compatible with Y.X. The
things we are talking about (fundamental changes to the index data
structures, plugins) will break compatibility, so they should be
targeted for Lucene 3.

To have payloads in an earlier 2.X release, we could take a simpler
route and use the implementation I've done so far, which I'll finish
soon. Below I describe this implementation in detail.

* File changes

  - Field Infos

    I'm using the 6th lowest-order bit of FieldBits, which is currently
    unused, to store whether payloads are enabled for a given field.

  - Positions file

    For fields with payloads disabled, the format of the positions file
    does not change at all.
    If payloads are enabled, then a variable-length payload is stored
    with each position:

      ProxFile (.prx) --> <TermPositions>^TermCount
      TermPositions   --> <Positions>^DocFreq
      Positions       --> <PositionDelta, Payload>^Freq
      PositionDelta   --> VInt
      Payload         --> Byte+

    Encoding of the payload:

    If the payload is exactly one byte long, then
      - if the value of the byte is <128, the byte is stored as is
      - if the value of the byte is >=128, a byte 10000001 (0x81) is
        stored, followed by the payload byte itself

    If the payload is longer than one byte but shorter than 127 bytes,
    then
      - a byte (0x80 | length) is stored, followed by the payload bytes

    If the payload length is >=127, then
      - payload_length-127 is stored as a VInt, followed by the payload
        bytes

    If the payload length is 0, then
      - one byte 0x80 is stored. This is done to distinguish a payload
        with length=0 from a payload with length=1 and value=0

* API changes

  - org.apache.lucene.index.Payload

    Added this class with the following constructor and getter method:
      * public Payload(byte[] value);
      * public byte[] getValue();

  - org.apache.lucene.analysis.Token

    Added two new constructors and a getter/setter pair:
      * public Token(String text, int start, int end, Payload payload);
      * public Token(String text, int start, int end, String typ,
                     Payload payload);
      * public Payload getPayload();
      * public void setPayload(Payload payload);

  - org.apache.lucene.document.Field

    Added PayloadParameter.YES/.NO to indicate whether a Field stores
    payloads, and added new constructors to create a field with
    payloads enabled:
      * public Field(String name, String value, Store store, Index index,
                     TermVector termVector, PayloadParameter payloadParam);
      * public Field(String name, String value, Store store, Index index,
                     TermVector termVector, Payload payload);
      * public Field(String name, Reader reader, TermVector termVector,
                     PayloadParameter payloadParam);
    Furthermore:
      * public Payload getPayload();
      * public boolean isPayloadStored();

  - org.apache.lucene.index.TermPositions

    Added the new method:
      * public Payload
        getPayload() throws IOException;

    Remark: In contrast to nextPosition(), this method does not move the
    pointer in the prox file. Therefore it should always be called after
    nextPosition().

So adding this payload feature to the Lucene core for a 2.X release is
not a big risk in my opinion, for the following reasons:

  - The API is only extended.
  - Lucene 2.X will be able to read an index created with an earlier
    version, because the payload bit in FieldInfos will always be 0
    then.
  - Payloads are disabled by default. They are only enabled through the
    new API.
  - If payloads are disabled, Lucene 2.0 is able to read an index
    created with Lucene 2.X, because the file formats don't change at
    all in that case.

So we could go ahead and add this to 2.X and keep working on the more
fundamental changes for Lucene 3. Sounds like a plan?

>
>
>> 5. Develop new or extend existing PostingsWriter/Reader plugins for
>> desired features like XML search, POS, multi-faceted search, ...
>
> People will definitely want to scratch their own itches, but I'd argue
> that this stuff should start out private. And maybe stay that way!

I agree with that. We should focus on improving the Lucene core and
start offering a flexible payload mechanism, so that people can start
developing their own stuff. Later, if people submit good solutions,
those might be good candidates for contrib.

>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/

Regards,
Michael Busch
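
For illustration, the payload encoding listed under "Positions file" can be sketched in Java roughly as follows. This is not the code from the implementation described in this message. The class and method names (PayloadCodecSketch, encode, decode, writeVInt) are hypothetical. One point is an assumption: the message does not say how a reader recognizes a payload of length >=127, so the sketch assumes the otherwise unused byte 0xFF is written as a marker before the VInt.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Illustrative sketch only; names and the 0xFF marker are assumptions. */
final class PayloadCodecSketch {

    // ASSUMPTION: 0xFF (0x80 | 127) introduces a long payload.
    // The message itself does not name a marker byte.
    static final int LONG_MARKER = 0xFF;

    /** Encodes one payload following the rules listed in the message. */
    static byte[] encode(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int len = payload.length;
        if (len == 0) {
            out.write(0x80);                  // empty payload
        } else if (len == 1 && (payload[0] & 0xFF) < 128) {
            out.write(payload[0]);            // single small byte, stored as is
        } else if (len < 127) {
            out.write(0x80 | len);            // also covers 0x81 + one byte >= 128
            out.write(payload, 0, len);
        } else {
            out.write(LONG_MARKER);           // assumed marker, see above
            writeVInt(out, len - 127);
            out.write(payload, 0, len);
        }
        return out.toByteArray();
    }

    /** Decodes a single payload that starts at index 0 of the input. */
    static byte[] decode(byte[] in) {
        int first = in[0] & 0xFF;
        if (first < 128) {
            return new byte[] { (byte) first };   // one-byte payload stored as is
        }
        int pos = 1;
        int len = first & 0x7F;                   // 0 for 0x80, 1..126 otherwise
        if (first == LONG_MARKER) {
            int shift = 0;
            int b;
            len = 0;
            do {                                  // read the VInt, low bits first
                b = in[pos++] & 0xFF;
                len |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            len += 127;
        }
        return Arrays.copyOfRange(in, pos, pos + len);
    }

    /** Lucene-style VInt: 7 data bits per byte, high bit set on all but the last. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static void main(String[] args) {
        byte[][] samples = {
            {}, {5}, {(byte) 200}, {1, 2, 3}, new byte[126], new byte[127], new byte[300]
        };
        for (byte[] sample : samples) {
            byte[] encoded = encode(sample);
            if (!Arrays.equals(sample, decode(encoded))) {
                throw new IllegalStateException("round trip failed for length " + sample.length);
            }
            System.out.println(sample.length + " payload bytes -> " + encoded.length + " stored bytes");
        }
    }
}
```

In this sketch a one-byte payload below 128 costs no extra byte, payloads of up to 126 bytes cost one length byte, and longer payloads cost the assumed marker plus the VInt.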