lucene-dev mailing list archives

From Roy <royde...@gmail.com>
Subject Re: Binary fields and data compression
Date Wed, 01 Sep 2004 21:25:15 GMT
I also tried Drew Farris's binary patch. It works fine with a few of my
test cases, though I didn't have enough time to do a thorough
performance comparison. I suggest the patch be checked into CVS.

On Wed, 01 Sep 2004 22:42:54 +0200, Bernhard Messer
<bernhard.messer@intrafind.de> wrote:
> Doug Cutting wrote:
> 
> > Bernhard Messer wrote:
> >
> >> a few months ago, there was a very interesting discussion about field
> >> compression and the possibility of storing binary field values within a
> >> Lucene document. On this topic, Drew Farris came up with a
> >> patch adding the necessary functionality. I ran all the necessary
> >> tests on his implementation and didn't find a single problem. The
> >> original implementation from Drew could now be enhanced to compress
> >> the binary field data (maybe even the text fields, if they are stored
> >> only) before writing to disk. I made some simple statistical
> >> measurements using the java.util.zip package for data compression.
> >> Enabling it, we could save about 40% of the data when compressing
> >> plain text files with sizes from 1 KB to 4 KB. If there is still
> >> interest, we could first update the patch, since it's outdated due to
> >> several changes within the Fields class, and then add compression to
> >> the updated version.
> >
> >
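Bernhard's measurement can be reproduced with a short java.util.zip sketch. The input text and sizes below are illustrative, not his actual test data:

```java
import java.util.zip.Deflater;

public class CompressionRatioSketch {
    // Compress a byte[] with java.util.zip.Deflater and return the result.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64]; // worst case can be slightly larger
        int len = 0;
        while (!deflater.finished()) {
            // deflate() may need several calls; grow the buffer on demand
            len += deflater.deflate(buf, len, buf.length - len);
            if (len == buf.length && !deflater.finished()) {
                byte[] bigger = new byte[buf.length * 2];
                System.arraycopy(buf, 0, bigger, 0, len);
                buf = bigger;
            }
        }
        deflater.end();
        byte[] out = new byte[len];
        System.arraycopy(buf, 0, out, 0, len);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Plain text around the 1 KB - 4 KB range mentioned above.
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 2048) sb.append("the quick brown fox jumps over the lazy dog ");
        byte[] plain = sb.toString().getBytes("UTF-8");
        byte[] packed = compress(plain);
        System.out.println("plain=" + plain.length + " compressed=" + packed.length);
    }
}
```

The exact ratio depends on the input, but natural-language text in this size range typically shrinks substantially.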
> > I like this patch and support upgrading it and adding it to Lucene.
> >
> Having a single, huge patch implementing all the functionality seems
> very difficult to maintain through Bugzilla, so I would suggest
> splitting the implementation into three steps:
> 1) update the binary field patch and add it to Lucene
> 2) make FieldsReader and FieldsWriter more readable using private
> static finals, and add compression
> 3) think further about compressing whole documents instead of
> single fields
> 
> > I imagine a public API like:
> >
> >   public static final class Store {
> >
> >      [ ... ]
> >
> >      public static final Store COMPRESS = new Store();
> >   }
> >
> >   new Field(String, byte[]) // stored, not compressed or indexed
> >   new Field(String, byte[], Store)
> >
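The `Store` constant above follows the typesafe-enum idiom common in pre-Java-5 code. A self-contained sketch of the pattern, using names from Doug's proposal but with a stand-in field class (this is not Lucene's actual implementation):

```java
// Minimal sketch of the typesafe-enum idiom behind the proposed API.
// FieldSketch is a hypothetical stand-in for Lucene's Field class.
public class FieldSketch {
    public static final class Store {
        private final String name;
        private Store(String name) { this.name = name; }
        public String toString() { return name; }

        // Private constructor means these are the only possible instances.
        public static final Store YES = new Store("YES");
        public static final Store COMPRESS = new Store("COMPRESS");
    }

    private final String name;
    private final byte[] binaryValue;
    private final Store store;

    // stored, not compressed or indexed
    public FieldSketch(String name, byte[] value) {
        this(name, value, Store.YES);
    }

    public FieldSketch(String name, byte[] value, Store store) {
        this.name = name;
        this.binaryValue = value;
        this.store = store;
    }

    public boolean isCompressed() { return store == Store.COMPRESS; }

    public static void main(String[] args) {
        FieldSketch f = new FieldSketch("body", new byte[]{1, 2, 3}, Store.COMPRESS);
        System.out.println(f.isCompressed()); // prints "true"
    }
}
```

Because the constructor is private, callers can only pass one of the predefined constants, which makes invalid store modes unrepresentable at compile time.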
> > Also, in Field.java, perhaps we could replace:
> >
> >   String stringValue;
> >   Reader readerValue;
> >   byte[] binaryValue;
> >
> > with:
> >
> >   Object value;
> >
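Collapsing the three typed slots into one `Object` works because a field only ever holds one of them; the accessors would then dispatch on the runtime type. A sketch of the idea (not Lucene's code):

```java
import java.io.Reader;

public class ValueSlotSketch {
    // Single slot replacing stringValue / readerValue / binaryValue.
    private final Object value;

    public ValueSlotSketch(Object value) { this.value = value; }

    // Each accessor returns null when the slot holds a different type,
    // mirroring the behaviour of three separate nullable fields.
    public String stringValue() { return value instanceof String ? (String) value : null; }
    public Reader readerValue() { return value instanceof Reader ? (Reader) value : null; }
    public byte[] binaryValue() { return value instanceof byte[] ? (byte[]) value : null; }

    public static void main(String[] args) {
        ValueSlotSketch s = new ValueSlotSketch("hello");
        System.out.println(s.stringValue()); // prints "hello"
        System.out.println(s.binaryValue()); // prints "null"
    }
}
```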
> > And in FieldsReader.java and FieldsWriter.java, some package-private
> > constants would make the code more readable, like:
> >
> >   // in FieldsWriter:
> >   static final int IS_TOKENIZED = 1;
> >   static final int IS_BINARY = 2;
> >   static final int IS_COMPRESSED = 4;
> >
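The constants are bit flags, so all of a stored field's properties pack into a single byte. A sketch of how the writer and reader sides would combine and test them, with the constant names from the mail and the rest illustrative:

```java
public class FieldFlagsSketch {
    static final int IS_TOKENIZED = 1;
    static final int IS_BINARY = 2;
    static final int IS_COMPRESSED = 4;

    // Writer side: OR together the flags that apply to this field.
    static int encode(boolean tokenized, boolean binary, boolean compressed) {
        int bits = 0;
        if (tokenized)  bits |= IS_TOKENIZED;
        if (binary)     bits |= IS_BINARY;
        if (compressed) bits |= IS_COMPRESSED;
        return bits;
    }

    // Reader side: AND against the mask to test one flag.
    static boolean has(int bits, int flag) { return (bits & flag) != 0; }

    public static void main(String[] args) {
        int bits = encode(false, true, true);
        System.out.println(bits); // prints "6"
        System.out.println(has(bits, IS_TOKENIZED)); // prints "false"
    }
}
```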
> > Note that it makes sense to compress non-binary values.  One could use
> > String.getBytes("UTF-8") and compress that.
> >
> I'm totally with you. Compressing string values would make sense once
> the length reaches a certain size (the same goes for byte[]). This
> limit is something we still have to figure out: what the minimum size
> of a compression candidate has to be. During my tests, I saw that
> everything upwards of 100 bytes is a good candidate for compression.
> But there is much more work to do in that area.
> 
> > I wonder if it might make more sense to compress entire document
> > records, rather than individual fields.  This would probably do better
> > when documents have lots of short text fields, as is not uncommon, and
> > would also minimize the fixed compression/decompression setup costs
> > (i.e., Inflater/Deflater allocations).  We could instead add a
> > "isCompressed" flag to Document, and then, in Field{Reader,Writer},
> > store a bit per document indicating whether it is compressed.
> > Document records could first be serialized uncompressed to a buffer
> > which is then compressed and written.  Thoughts?
> >
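Doug's idea amounts to writing every stored field of a document into one buffer and running a single Deflater over the whole record, paying the setup cost once per document rather than once per field. A rough sketch, where the record layout (field count, then name/value pairs) is invented for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.zip.Deflater;

public class DocumentCompressionSketch {
    // Serialize all stored fields uncompressed into one buffer,
    // then compress the whole record in a single pass.
    static byte[] compressRecord(Map<String, String> fields) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(fields.size());
        for (Map.Entry<String, String> e : fields.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        out.flush();
        byte[] plain = buffer.toByteArray();

        // One Deflater per document instead of one per field.
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        byte[] chunk = new byte[512];
        while (!deflater.finished()) {
            packed.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();
        return packed.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> doc = Map.of("title", "Lucene", "body", "field compression");
        System.out.println(compressRecord(doc).length + " bytes");
    }
}
```

Many short fields in one record give the compressor a larger window to find redundancy across field boundaries, which is why this can beat per-field compression on documents with lots of small text fields.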
> Interesting idea. I think this strongly depends on the fields, the
> options they have and, not least, their values. Would it make sense to
> compress a field which is tokenized and indexed but not stored? Maybe
> we could think of some kind of algorithm that checks the document's
> field settings and decides whether it is a candidate for compression.
> Just a thought ;-)
> 
> 
> 
> > Doug
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> >
> 
> 
>


