From: Roy
To: Lucene Developers List
Subject: Re: Binary fields and data compression
Date: Wed, 1 Sep 2004 14:25:15 -0700
Message-ID: <7f5f89bf04090114253b55f678@mail.gmail.com>
In-Reply-To: <4136344E.4030808@intrafind.de>
References: <41339EF6.1080906@intrafind.de> <4136283D.5070309@apache.org> <4136344E.4030808@intrafind.de>

I also tried Drew
Farris's binary patch. It works fine with a few test cases of mine.
However, I didn't have enough time to do a thorough performance
comparison. I suggest the patch be checked into CVS.

On Wed, 01 Sep 2004 22:42:54 +0200, Bernhard Messer wrote:
> Doug Cutting wrote:
>
> > Bernhard Messer wrote:
> >
> >> a few months ago, there was a very interesting discussion about field
> >> compression and the possibility to store binary field values within a
> >> Lucene document. On this topic, Drew Farris came up with a
> >> patch to add the necessary functionality. I ran all the necessary
> >> tests on his implementation and didn't find a single problem. So the
> >> original implementation from Drew could now be enhanced to compress
> >> the binary field data (maybe even the text fields if they are stored
> >> only) before writing to disk. I made some simple statistical
> >> measurements using the java.util.zip package for data compression.
> >> Enabling it, we could save about 40% of the data when compressing
> >> plain text files with a size from 1KB to 4KB. If there is still some
> >> interest, we could first try to update the patch, because it's
> >> outdated due to several changes within the Fields class. After
> >> finishing that, compression could be added to the updated version
> >> of the patch.
> >
> >
> > I like this patch and support upgrading it and adding it to Lucene.
>
>
> Having a single, huge patch implementing all the functionality seems
> to be very difficult to maintain through Bugzilla. So I would suggest
> splitting the whole implementation into maybe 3 different steps:
> 1) updating the binary field patch and adding it to Lucene
> 2) making FieldsReader and FieldsWriter more readable using private
>    static finals, and adding compression
> 3) additional thoughts about compressing whole documents instead of
>    single fields.
>
> > I imagine a public API like:
> >
> >   public static final class Store {
> >
> >     [ ... ]
> >
> >     public static final Store COMPRESS = new Store();
> >   }
> >
> >   new Field(String, byte[]) // stored, not compressed or indexed
> >   new Field(String, byte[], Store)
> >
> > Also, in Field.java, perhaps we could replace:
> >
> >   String stringValue;
> >   Reader readerValue;
> >   byte[] binaryValue;
> >
> > with:
> >
> >   Object value;
> >
> > And in FieldsReader.java and FieldsWriter.java, some package-private
> > constants would make the code more readable, like:
> >
> >   // in FieldsWriter
> >   static final int IS_TOKENIZED = 1;
> >   static final int IS_BINARY = 2;
> >   static final int IS_COMPRESSED = 4;
> >
> > Note that it makes sense to compress non-binary values. One could use
> > String.getBytes("UTF-8") and compress that.
>
>
> I'm totally with you. Compressing string values would make sense if the
> length reaches a certain size (the same for byte[]). This limit, the
> minimum size a compression candidate has to have, is something we
> still have to figure out. During my tests, I saw that everything up to
> 100 bytes is a perfect candidate for compression. But there is much
> more work to do in that area.
>
> > I wonder if it might make more sense to compress entire document
> > records, rather than individual fields. This would probably do better
> > when documents have lots of short text fields, as is not uncommon, and
> > would also minimize the fixed compression/decompression setup costs
> > (i.e., inflater/deflater allocations). We could instead add an
> > "isCompressed" flag to Document, and then, in Fields{Reader,Writer},
> > store a bit per document indicating whether it is compressed.
> > Document records could first be serialized uncompressed to a buffer
> > which is then compressed and written. Thoughts?
>
>
> Interesting idea. I think this strongly depends on the fields, the
> options they have and, not least, their values. Would it make sense to
> compress a field which is tokenized and indexed but not stored?
> Maybe we could think of some kind of algorithm that checks the
> document's field settings and decides whether it is a candidate for
> compression. Just a thought ;-)
>
> >
> > Doug
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
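
For reference, the per-field compression idea discussed in this thread
could be sketched roughly as follows. This is only an illustration, not
code from Drew Farris's patch or from Lucene: the class name
FieldCompressionSketch, the MIN_COMPRESS_SIZE threshold of 100 bytes and
the "one flag byte followed by the payload" layout are all assumptions
made up for this example. Only the three flag values and the use of
java.util.zip come from the messages above.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class FieldCompressionSketch {
    // Flag bits as proposed in the thread.
    public static final int IS_TOKENIZED = 1;
    public static final int IS_BINARY = 2;
    public static final int IS_COMPRESSED = 4;

    // Hypothetical threshold; the thread leaves the real value open.
    public static final int MIN_COMPRESS_SIZE = 100;

    /** Returns one flag byte followed by the raw or deflated value (assumed layout). */
    public static byte[] encode(byte[] value, boolean binary) {
        int flags = binary ? IS_BINARY : 0;
        byte[] payload = value;
        if (value.length >= MIN_COMPRESS_SIZE) {
            byte[] deflated = deflate(value);
            // Keep the compressed form only if it actually saves space.
            if (deflated.length < value.length) {
                payload = deflated;
                flags |= IS_COMPRESSED;
            }
        }
        byte[] record = new byte[payload.length + 1];
        record[0] = (byte) flags;
        System.arraycopy(payload, 0, record, 1, payload.length);
        return record;
    }

    /** Reverses encode(): strips the flag byte and inflates if IS_COMPRESSED is set. */
    public static byte[] decode(byte[] record) {
        byte[] payload = Arrays.copyOfRange(record, 1, record.length);
        return (record[0] & IS_COMPRESSED) != 0 ? inflate(payload) : payload;
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    private static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated compressed data");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid compressed data", e);
        } finally {
            inflater.end();
        }
    }
}
```

A text value would go through the same path after
String.getBytes("UTF-8"), as suggested above. Whole-document compression
would apply deflate() once to the serialized record instead of once per
field, which avoids repeating the Deflater/Inflater setup cost for many
short fields.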