Message-ID: <40A537DC.8020403@earthlink.net>
Date: Fri, 14 May 2004 15:19:24 -0600
From: Dmitry Serebrennikov <dmitrys@earthlink.net>
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: stored field compression
References: <20040514115837.33290.qmail@web12703.mail.yahoo.com>
 <200405141610.08425.ykingma@xs4all.nl> <40A4F1A4.4090408@apache.org>
 <40A4F24D.5000907@apache.org> <40A508A0.2000204@earthlink.net>
 <40A5108D.9080605@apache.org>
In-Reply-To: <40A5108D.9080605@apache.org>

Doug Cutting wrote:

> Dmitry Serebrennikov wrote:
>
>> A different approach would be to just allow binary data in fields. 
>> That way applications can compress and decompress as they see fit, 
>> plus they would be able to store numerical and other data more 
>> efficiently.
>
>
> That's an interesting idea.  One could, for convenience and 
> compatibility, add accessor methods to Field that, when you add a 
> String, convert it to UTF-8 bytes, and make stringValue() parse (and 
> possibly cache) a UTF-8 string from the binary value.  There'd be 
> another allocation per field read: FieldReader would construct a 
> byte[], then stringValue() would construct a String with a char[].  
> Right now we only construct a String with a char[] per stringValue().  
> Perhaps this is moot, especially if we're lazy about constructing the 
> strings and they're cached.  That way, for all the fields you don't 
> access you save an allocation.

Actually, I was thinking of something simpler: a special case where one 
could supply binary data directly to a stored field. 
Something like:
public class Field {
    // Proposed additions, signatures only:
    public static Field Binary(String name, byte[] value); // always stored
    public boolean isBinary();
    public byte[] binaryValue(); // null unless isBinary()
}

This would automatically become a stored field. Lucene wouldn't need to 
know what the data means; it would just carry it around. binaryValue() 
would return null unless isBinary() is true, in which case you'd get the 
data back and stringValue() would return null instead.
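
To make the contract concrete, here is a minimal sketch of how the two 
accessors could exclude each other. FieldSketch and its text() factory 
are hypothetical names used only for illustration; this is not the real 
Field class, and the defensive copying is an assumption rather than part 
of the proposal.

```java
import java.util.Arrays;

// Hypothetical stand-in for Field; not the real Lucene class.
final class FieldSketch {
    private final String name;
    private final String stringValue; // null for binary fields
    private final byte[] binaryValue; // null for text fields

    private FieldSketch(String name, String stringValue, byte[] binaryValue) {
        this.name = name;
        this.stringValue = stringValue;
        this.binaryValue = binaryValue;
    }

    // Ordinary stored text field.
    static FieldSketch text(String name, String value) {
        return new FieldSketch(name, value, null);
    }

    // Stored binary field; the bytes are copied so later changes by the
    // caller do not leak into the field (an assumption of this sketch).
    static FieldSketch binary(String name, byte[] value) {
        return new FieldSketch(name, null, Arrays.copyOf(value, value.length));
    }

    String name() { return name; }

    boolean isBinary() { return binaryValue != null; }

    // Returns null when the field is binary.
    String stringValue() { return stringValue; }

    // Returns null unless isBinary() is true.
    byte[] binaryValue() {
        return binaryValue == null ? null : Arrays.copyOf(binaryValue, binaryValue.length);
    }
}
```

An application would check isBinary() first and then call whichever 
accessor applies.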

This would be a start. If we want to provide special handling for ints, 
floats, and so on, we provide a BinaryField class, a la DateField.

We might lose some efficiency because ints and longs would be better off 
if they were stored as ints and longs rather than a byte[]...

Actually, we might be able to represent binary data fields as offsets 
into the complete byte[] that was read from the index file in the first 
place. That way we wouldn't need to copy the data until binaryValue() 
is called. Also, the BinaryField class could do the byte[] -> int 
conversion directly from the offsets into the main byte[] buffer, again 
saving a byte[] allocation.
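
A rough sketch of the offset idea, assuming a big-endian int layout 
(the byte order is my assumption; nothing above fixes it). BinarySlice 
is a hypothetical name. The point is that intAt() reads from the shared 
buffer without allocating, and a copy is made only when the caller asks 
for the bytes.

```java
// Hypothetical zero-copy view over a region of a larger buffer.
final class BinarySlice {
    private final byte[] buffer; // shared buffer, e.g. bytes read from the index
    private final int offset;    // start of this field's data within buffer
    private final int length;    // number of bytes belonging to this field

    BinarySlice(byte[] buffer, int offset, int length) {
        if (offset < 0 || length < 0 || offset > buffer.length - length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    // Decodes a big-endian int straight from the shared buffer.
    int intAt(int index) {
        if (index < 0 || index > length - 4) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int p = offset + index;
        return ((buffer[p] & 0xFF) << 24)
             | ((buffer[p + 1] & 0xFF) << 16)
             | ((buffer[p + 2] & 0xFF) << 8)
             | (buffer[p + 3] & 0xFF);
    }

    // Copies the slice only when the caller actually wants the bytes.
    byte[] binaryValue() {
        byte[] copy = new byte[length];
        System.arraycopy(buffer, offset, copy, 0, length);
        return copy;
    }
}
```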

Would binary fields only be useful for stored fields? I can't really see 
how binary data could be usefully tokenized, but maybe in some 
multimedia applications? Binary keyword fields might be interesting. 
These could allow searching on integer ranges, more straightforward 
date ranges, and more efficient data storage in some cases. That's a big 
change though. We'd have to change all searching to be based on binary 
tokens instead of strings.
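
For integer ranges to work on binary tokens, the bytes would have to 
sort in the same order as the numbers. One general way to get that, 
sketched here purely as an illustration, is to flip the sign bit, write 
the int big-endian, and compare terms as unsigned bytes. 
SortableIntBytes is a hypothetical name, and nothing here is an existing 
Lucene API.

```java
// Hypothetical encoder: ints become 4-byte keys whose unsigned
// lexicographic order matches the numeric order of the ints.
final class SortableIntBytes {
    private SortableIntBytes() {}

    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000; // negatives now sort before positives
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    static int decode(byte[] b) {
        int flipped = ((b[0] & 0xFF) << 24)
                    | ((b[1] & 0xFF) << 16)
                    | ((b[2] & 0xFF) << 8)
                    | (b[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    // Unsigned byte-by-byte comparison, as a binary term dictionary might do.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

With keys like these, a range search becomes a scan between two encoded 
bounds, the same way string ranges work today.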

>
>
>> Of course, this would then be a per-value compression and probably 
>> not as effective as a whole index compression that could be done with 
>> the other approaches.
>
>
> But, since documents are accessed randomly, we can't easily do a lot 
> better for field data. 

I don't know much about how the Zip algorithm works internally, but it 
seems that there could be a parallel between a zip file with its zip 
entries and a Lucene index with its Lucene documents.

> This feature is primarily intended to make life easier for folks who 
> want to store whole documents in the index.  Selective use of gzip 
> would be a huge improvement over the present situation.  Alternate 
> compression algorithms might make things a bit better yet, but 
> probably not hugely. 

I agree, unless one can figure out how to share the dictionary across 
documents.
If we go now with the simple binary data-bucket design described 
above, applications can do any clever implementation they choose. 
The BinaryField class would provide helper methods for the most common 
things. Perhaps GZipField is another good candidate for the immediate 
future.
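
As an illustration of what an application (or a GZipField-style helper) 
could do on its own with a binary field, here is a sketch using the 
standard java.util.zip gzip streams. GzipCodec is a hypothetical name, 
and encoding the text as UTF-8 is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip a string into bytes suitable for a binary field.
final class GzipCodec {
    private GzipCodec() {}

    static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed by now, so the gzip trailer has been written.
        return out.toByteArray();
    }

    static String decompress(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The application would store compress(text) as the binary value and call 
decompress() on whatever binaryValue() hands back.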

Going forward, perhaps there is a way to do compression such that a 
dictionary is managed for each segment of the index, and merged when the 
segments are merged? If this is possible, it would be a good argument 
for Lucene to be compression-aware.
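
For what it's worth, java.util.zip already supports a preset dictionary 
on Deflater and Inflater, so a shared dictionary is at least expressible. 
The sketch below only shows that mechanism. SharedDictionaryCodec is a 
hypothetical name, and how a per-segment dictionary would be chosen, 
stored, and merged is left entirely open.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical codec: every value is deflated against one shared dictionary.
final class SharedDictionaryCodec {
    private final byte[] dictionary;

    SharedDictionaryCodec(byte[] dictionary) {
        this.dictionary = dictionary.clone();
    }

    byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setDictionary(dictionary); // must precede the first deflate()
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                out.write(buf, 0, n);
                if (n == 0 && !inflater.finished()) {
                    if (inflater.needsDictionary()) {
                        // The stream asks for the dictionary it was built with.
                        inflater.setDictionary(dictionary);
                    } else {
                        throw new IllegalArgumentException("truncated or corrupt data");
                    }
                }
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt data", e);
        } finally {
            inflater.end();
        }
    }
}
```

The dictionary would hold byte sequences that recur across documents, so 
that short values can refer back to it instead of repeating them.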

How does all of this sound?

Dmitry.

