Message-ID: <40A537DC.8020403@earthlink.net>
Date: Fri, 14 May 2004 15:19:24 -0600
From: Dmitry Serebrennikov
To: Lucene Developers List
Subject: Re: stored field compression
In-Reply-To: <40A5108D.9080605@apache.org>

Doug Cutting wrote:

> Dmitry Serebrennikov wrote:
>
>> A different approach would be to just allow binary data in fields.
>> That way applications can compress and decompress as they see fit,
>> plus they would be able to store numerical and other data more
>> efficiently.
>
>
> That's an interesting idea.  One could, for convenience and
> compatibility, add accessor methods to Field that, when you add a
> String, convert it to UTF-8 bytes, and make stringValue() parse (and
> possibly cache) a UTF-8 string from the binary value.  There'd be
> another allocation per field read: FieldReader would construct a
> byte[], then stringValue() would construct a String with a char[].
> Right now we only construct a String with a char[] per stringValue().
> Perhaps this is moot, especially if we're lazy about constructing the
> strings and they're cached.
> That way, for all the fields you don't
> access you save an allocation.

Actually, I was thinking of something simpler... Something like a
special case where one could supply binary data directly into a stored
field. Something like:

    public class Field {
        public static Field Binary(String name, byte[] value);
        public boolean isBinary();
        public byte[] binaryValue();
    }

This would automatically become a stored field. Lucene wouldn't need to
know what the data means - just carry it around. The binaryValue() can
return null unless isBinary() is true, in which case you'd get the data
back and stringValue() would return null instead.

This would be a start. If we want to provide special handling for ints,
floats, and so on, we provide a BinaryField class, a la DateField. We
might lose some efficiency because ints and longs would be better off if
they were stored as ints and longs rather than a byte[]... Actually, we
might be able to represent binary data fields as offsets into the
complete byte[] that was read from the index file in the first place.
That way we wouldn't need to copy the data until the binaryValue()
method was called. Also, the BinaryField class can do byte[] -> int
conversion directly from the offsets into the main byte[] buffer, again
saving a byte[] allocation.

Would binary fields only be useful for stored fields? I can't really see
how binary data could be usefully tokenized, but maybe in some
multimedia applications? Binary keyword fields might be interesting.
These could allow searching on integer ranges, more straightforward
date ranges, and more efficient data storage in some cases. That's a big
change though. We'd have to change all searching to be based on binary
tokens instead of strings.

>
>
>> Of course, this would then be a per-value compression and probably
>> not as effective as a whole index compression that could be done with
>> the other approaches.
>
>
> But, since documents are accessed randomly, we can't easily do a lot
> better for field data.
I don't know much about how the Zip algorithm works internally, but it
seems that there could be a parallel between a zip file with its zip
entries and a Lucene index with its Lucene documents.

> This feature is primarily intended to make life easier for folks who
> want to store whole documents in the index.  Selective use of gzip
> would be a huge improvement over the present situation.  Alternate
> compression algorithms might make things a bit better yet, but
> probably not hugely.

I agree, unless one can figure out how to share the dictionary across
documents. If we just go now with the simple binary data-bucket design
described above, applications can do any clever implementation they
choose. The BinaryField class will provide helper methods for the most
common things. Perhaps GZipField is another good candidate for the
immediate future.

Going forward, perhaps there is a way to do compression such that a
dictionary is managed for each segment of the index, and merged when
the segments are merged? If this is possible, it would be a good
argument for Lucene to be compression-aware.

How does all of this sound?

Dmitry.
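
For illustration only, here is a rough sketch of what the application
side of the data-bucket idea might look like. It assumes a hypothetical
standalone helper class, BinarySketch, whose names are made up for this
example and are not part of Lucene. It shows an application gzipping
text into a byte[] that could then be handed to a binary stored field,
and a big-endian int being decoded straight from an offset into a larger
buffer without copying a sub-array first. It uses present-day Java
library calls for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration; none of these names exist in Lucene. */
public class BinarySketch {

    /** Application-side compression: text -> UTF-8 bytes -> gzip bytes. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed (and therefore finished) at this point.
        return out.toByteArray();
    }

    /** Reverses gzip(): gzip bytes -> UTF-8 bytes -> text. */
    public static String gunzip(byte[] packed) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(packed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Decodes a big-endian int directly from an offset into a shared
     * buffer, so no intermediate byte[] has to be allocated.
     */
    public static int readInt(byte[] buf, int offset) {
        return ((buf[offset] & 0xFF) << 24)
             | ((buf[offset + 1] & 0xFF) << 16)
             | ((buf[offset + 2] & 0xFF) << 8)
             | (buf[offset + 3] & 0xFF);
    }

    public static void main(String[] args) {
        String text = "whole document text stored in the index";
        byte[] packed = gzip(text);  // what a binary field would carry
        System.out.println(gunzip(packed).equals(text));

        // One leading byte of unrelated data, then a 4-byte int at offset 1.
        byte[] shared = {9, 0, 0, 1, 2};
        System.out.println(readInt(shared, 1));
    }
}
```

Since the compression here is per value, very short values may come out
larger than the input because of the gzip header and trailer, which is
one reason to leave the choice to the application.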