lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: stored field compression
Date Fri, 14 May 2004 18:31:41 GMT
Dmitry Serebrennikov wrote:
> A different approach would be to just allow binary data in fields. That 
> way applications can compress and decompress as they see fit, plus they 
> would be able to store numerical and other data more efficiently.

That's an interesting idea.  One could, for convenience and 
compatibility, add accessor methods to Field that, when you add a 
String, convert it to UTF-8 bytes, and make stringValue() parse (and 
possibly cache) a UTF-8 string from the binary value.  There'd be 
another allocation per field read: FieldReader would construct a byte[], 
then stringValue() would construct a String with a char[].  Right now we 
only construct a String with a char[] per stringValue().  Perhaps this 
is moot, especially if we're lazy about constructing the strings and 
they're cached.  That way, for all the fields you don't access you save 
an allocation.
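
A minimal sketch of the lazy, cached decoding described above. The class name BinaryBackedField and the method binaryValue() are hypothetical and only illustrate the idea; this is not the existing Field API, and it assumes a Java version that provides StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Lucene's actual Field class.
final class BinaryBackedField {
    private final byte[] bytes; // the value as it would be stored in the index
    private String cached;      // decoded on first access, then reused

    // Adding a String converts it to UTF-8 bytes; the String itself is kept
    // as the cached value, so no decoding is needed later.
    BinaryBackedField(String text) {
        this.bytes = text.getBytes(StandardCharsets.UTF_8);
        this.cached = text;
    }

    // Reading from the index supplies only the bytes; no String is built yet.
    BinaryBackedField(byte[] stored) {
        this.bytes = stored;
    }

    byte[] binaryValue() {
        return bytes;
    }

    // The String (and its char[]) is allocated only if a caller asks for it.
    String stringValue() {
        if (cached == null) {
            cached = new String(bytes, StandardCharsets.UTF_8);
        }
        return cached;
    }
}
```

Fields that are read but never accessed through stringValue() cost only the byte[] allocation.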

Then you could also add intValue() and floatValue() methods, etc. which 
use binary representations.  These could speed up lots of stuff.
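
As a rough sketch of what such binary representations might look like, the helper below encodes an int or a float as four big-endian bytes using java.nio.ByteBuffer. The class NumericFieldValues and its method names are assumptions made for illustration, not a proposed API.

```java
import java.nio.ByteBuffer;

// Hypothetical helpers; names are illustrative only.
final class NumericFieldValues {
    private NumericFieldValues() {}

    // ByteBuffer defaults to big-endian byte order, so the layout is fixed.
    static byte[] fromInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    static int intValue(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    static byte[] fromFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    static float floatValue(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getFloat();
    }
}
```

Each value occupies exactly four bytes and is read back without any string parsing.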

For easy extensibility you could do something like:

   interface FieldValue {
     byte[] getBytes();
     void setValue(byte[] bytes);
   }

   /** Extracts the value of the field into <code>value</code>.
    * @see FieldValue#setValue(byte[])
    */
   void getValue(FieldValue value) {
     value.setValue(getBytes());
   }

   // replace the base Field ctor with:
   public Field(String name, FieldValue value,
                boolean store, boolean index,
                boolean token, boolean vector) {
     ...
     bytes = value.getBytes();
     ...
   }

   public class CompressedTextFieldValue implements FieldValue {
     public CompressedTextFieldValue(String text) { ... }
     public String toString() { ... }
     ...
   }

   public class SerializableFieldValue implements FieldValue {
     public SerializableFieldValue(Serializable value) { ... }
     public Serializable getSerializable() { ... }
     ...
   }

It could be up to the application to always use the same FieldValue 
class with a field, or we could add the FieldValue class to the index's 
FieldInfos...

I'd like to continue to be able to avoid storing type information per field 
instance, and to avoid re-inventing object serialization, but maybe I 
need to give these up...

> Of course, this would then be a per-value compression and probably not 
> as effective as a whole index compression that could be done with the 
> other approaches.

But, since documents are accessed randomly, we can't easily do a lot 
better for field data.

> Doug, what compression algorithm did you have in mind 
> for the actual compression?

I was just thinking gzip.  Alternately, one could make it extensible, 
and tag each item with the compression algorithm, but I think that gets 
to be a mess.  Also, it's good to stick to a standard algorithm, so that 
Perl, C#, C++, etc. ports can easily incorporate the feature.
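
For illustration, here is one way the CompressedTextFieldValue from the sketch above could be filled in with gzip from java.util.zip. The FieldValue interface is repeated from the proposal; the method bodies, the no-argument constructor, and the choice of UTF-8 for the uncompressed text are assumptions, not a settled design.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Interface as proposed in this message.
interface FieldValue {
    byte[] getBytes();
    void setValue(byte[] bytes);
}

// Hypothetical implementation: stores text as gzip-compressed UTF-8.
final class CompressedTextFieldValue implements FieldValue {
    private String text;

    // Assumed convenience constructor for values that are filled in by setValue().
    CompressedTextFieldValue() {
        this("");
    }

    CompressedTextFieldValue(String text) {
        this.text = text;
    }

    // Compresses the text into the bytes that would be written to the index.
    @Override
    public byte[] getBytes() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompresses stored bytes back into the text.
    @Override
    public void setValue(byte[] bytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        text = new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    @Override
    public String toString() {
        return text;
    }
}
```

Because the output is a standard gzip stream, a port in another language can read it with that language's own gzip library.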

This feature is primarily intended to make life easier for folks who 
want to store whole documents in the index.  Selective use of gzip would 
be a huge improvement over the present situation.  Alternate compression 
algorithms might make things a bit better yet, but probably not hugely.

Doug
