lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Binary fields and data compression
Date Wed, 01 Sep 2004 19:51:25 GMT
Bernhard Messer wrote:
> a few months ago, there was a very interesting discussion about field 
> compression and the possibility of storing binary field values within a 
> Lucene document. On this topic, Drew Farris came up with a 
> patch to add the necessary functionality. I ran all the necessary tests 
> against his implementation and didn't find a single problem. The original 
> implementation from Drew could now be enhanced to compress the binary 
> field data (maybe even the text fields, if they are only stored) before 
> writing to disk. I made some simple statistical measurements using the 
> java.util.zip package for data compression. With it enabled, we could save 
> about 40% of the data when compressing plain text files with sizes from 1KB to 
> 4KB. If there is still some interest, we could first update the 
> patch, since it's outdated due to several changes within the Fields 
> class. After that, compression could be added to the updated 
> version of the patch.
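
A minimal sketch of the kind of measurement described above, using 
java.util.zip's Deflater (the sample text and the helper class here are 
illustrative, not the original benchmark):

```java
import java.util.zip.Deflater;

// Hypothetical sketch: count how many bytes a Deflater produces for a
// small plain-text sample, to estimate the savings from compression.
public class CompressionRatio {
    public static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            // We only need the count, so the buffer can be reused.
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 2048) {
            sb.append("Lucene currently stores field values verbatim. ");
        }
        byte[] raw = sb.toString().getBytes("UTF-8");
        System.out.println(raw.length + " -> " + compressedSize(raw) + " bytes");
    }
}
```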

I like this patch and support upgrading it and adding it to Lucene.

I imagine a public API like:

   public static final class Store {

      [ ... ]

      public static final Store COMPRESS = new Store();
   }

   new Field(String, byte[]) // stored, not compressed or indexed
   new Field(String, byte[], Store)
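
To make the sketch above concrete, here is a self-contained, runnable 
version of the idea, using the pre-enum typesafe-constant pattern the 
Store class already follows. FieldSketch is a stand-in, not Lucene's 
Field class, and the constant names are the ones proposed above:

```java
// Hypothetical stand-in for the proposed Field API; only the pieces
// relevant to binary storage and compression are sketched.
public class FieldSketch {
    public static final class Store {
        private final String name;
        private Store(String name) { this.name = name; }
        public String toString() { return name; }
        public static final Store YES = new Store("YES");
        public static final Store COMPRESS = new Store("COMPRESS");
    }

    private final String name;
    private final byte[] binaryValue;
    private final boolean compressed;

    // Stored, not compressed or indexed.
    public FieldSketch(String name, byte[] value) {
        this(name, value, Store.YES);
    }

    public FieldSketch(String name, byte[] value, Store store) {
        this.name = name;
        this.binaryValue = value;
        this.compressed = (store == Store.COMPRESS);
    }

    public boolean isCompressed() { return compressed; }
}
```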

Also, in Field.java, perhaps we could replace:

   String stringValue;
   Reader readerValue;
   byte[] binaryValue;

with:

   Object value;

And in FieldsReader.java and FieldsWriter.java, some package-private 
constants would make the code more readable, like:

   // in FieldsWriter.java
   static final int IS_TOKENIZED = 1;
   static final int IS_BINARY = 2;
   static final int IS_COMPRESSED = 4;
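
A small sketch of how FieldsWriter and FieldsReader might combine and 
test those bits in the single flags byte written per field (the encode 
and isCompressed helpers are illustrative, not existing methods):

```java
// Hypothetical helpers showing how the suggested flag constants would
// be packed into one byte on write and tested on read.
public class FieldFlags {
    static final int IS_TOKENIZED = 1;
    static final int IS_BINARY = 2;
    static final int IS_COMPRESSED = 4;

    // Pack the per-field properties into a single flags value.
    static int encode(boolean tokenized, boolean binary, boolean compressed) {
        int bits = 0;
        if (tokenized)  bits |= IS_TOKENIZED;
        if (binary)     bits |= IS_BINARY;
        if (compressed) bits |= IS_COMPRESSED;
        return bits;
    }

    // Test one property when reading the flags back.
    static boolean isCompressed(int bits) {
        return (bits & IS_COMPRESSED) != 0;
    }
}
```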

Note that it makes sense to compress non-binary values.  One could use 
String.getBytes("UTF-8") and compress that.
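
A sketch of that round trip with java.util.zip: deflate the UTF-8 bytes 
of a string value on write, then inflate and decode on read (the helper 
class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: compress a stored text value via its UTF-8
// bytes, and recover the original String by inflating and decoding.
public class StringCompress {
    static byte[] deflate(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] input) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inf.finished()) out.write(buf, 0, inf.inflate(buf));
        inf.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String value = "some stored field text, stored field text";
        byte[] packed = deflate(value.getBytes("UTF-8"));
        String back = new String(inflate(packed), "UTF-8");
        System.out.println(value.equals(back));
    }
}
```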

I wonder if it might make more sense to compress entire document 
records, rather than individual fields.  This would probably do better 
when documents have lots of short text fields, as is not uncommon, and 
would also minimize the fixed compression/decompression setup costs 
(i.e., Inflater/Deflater allocations).  We could instead add an 
"isCompressed" flag to Document, and then, in Field{Reader,Writer}, 
store a bit per document indicating whether it is compressed.  Document 
records could first be serialized uncompressed to a buffer which is then 
compressed and written.  Thoughts?
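
The buffer-then-compress step might look roughly like this (the class 
and the record layout are illustrative, not Lucene's actual fields-file 
format):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.util.zip.DeflaterOutputStream;

// Hypothetical sketch of whole-record compression: serialize every
// field of a document uncompressed into a buffer, then deflate the
// buffer in one pass, amortizing compressor setup over all fields.
public class DocRecordWriter {
    static byte[] compressRecord(String[] fieldValues) throws Exception {
        // 1. Serialize the record uncompressed to an in-memory buffer.
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(record);
        data.writeInt(fieldValues.length);
        for (int i = 0; i < fieldValues.length; i++) {
            data.writeUTF(fieldValues[i]);
        }
        data.flush();

        // 2. Compress the whole buffer with a single Deflater pass;
        //    the result is what would be written to the fields file.
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        DeflaterOutputStream deflate = new DeflaterOutputStream(packed);
        record.writeTo(deflate);
        deflate.finish();
        return packed.toByteArray();
    }
}
```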

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

