lucene-dev mailing list archives

From Robert Engels <>
Subject Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields
Date Thu, 17 Aug 2006 00:33:11 GMT
I just think the compressed field type should be removed from Lucene altogether. Only the
binary field type should remain, and the application can externally compress/uncompress fields
via a facade/containment hierarchy around Document.

That is

class MyDocument {
    Document doc;

    String getField(String name) {
        if (isCompressed(name)) {
            return decompress(doc.getBinaryField(name));
        }
        return doc.getField(name);
    }
}

Or some such thing, and not deal with compression at the Lucene level. In order to have
Lucene deal with the compression, you would really need to settle on the compression type,
the parameters, and how they would be stored; otherwise cross-platform implementations (e.g. Plucene)
would never be able to read the index. If the compression were external, all an implementation
needs is binary field support, and it would only be unable to access the compressed
fields if it did not have a suitable way to decompress them.
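The external approach above can be sketched with plain java.util.zip, with no Lucene dependency at all. This is a hypothetical helper (the class and method names are illustrative, not any Lucene API): the application deflates the value before storing it as an opaque binary field, and inflates it again after retrieval.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical application-side helper: compression stays entirely outside
// Lucene, which only ever sees opaque binary field values.
class FieldCompressor {

    // Deflate raw field bytes at a caller-chosen level (0-9), so the
    // application, not the index format, decides the compression trade-off.
    static byte[] compress(byte[] data, int level) throws Exception {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inverse of compress(): inflate the stored binary bytes back to the
    // original field value.
    static byte[] decompress(byte[] data) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

Because both directions live in application code, two applications sharing one index only need to agree on the scheme between themselves, not with the index format.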

Otherwise, I think you need a much more advanced compression scheme - similar to the PDF specification
- because different fields would ideally be compressed using different algorithms, and forcing
a one-size-fits-all choice doesn't normally work well in such a low-level library.
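A minimal sketch of that per-field idea, kept in application code: a registry maps field names to codecs, with a pass-through default for everything else. All names here are hypothetical, invented for illustration; nothing below is a Lucene interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-field codec selection, loosely in the spirit of PDF's
// per-stream filters: each field name can carry its own encode/decode pair.
interface FieldCodec {
    byte[] encode(byte[] raw);
    byte[] decode(byte[] stored);
}

class CodecRegistry {
    private final Map<String, FieldCodec> codecs = new HashMap<String, FieldCodec>();

    // Pass-through codec used for any field with no registered algorithm.
    private final FieldCodec identity = new FieldCodec() {
        public byte[] encode(byte[] raw) { return raw; }
        public byte[] decode(byte[] stored) { return stored; }
    };

    void register(String fieldName, FieldCodec codec) {
        codecs.put(fieldName, codec);
    }

    // Look up the codec for a field, falling back to the identity codec.
    FieldCodec codecFor(String fieldName) {
        FieldCodec c = codecs.get(fieldName);
        return c != null ? c : identity;
    }
}
```

The registry is the whole point: the library never has to standardize on one algorithm, because the mapping from field to codec is the application's configuration.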

-----Original Message-----
>From: Grant Ingersoll <>
>Sent: Aug 16, 2006 6:51 AM
>Subject: Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields
>On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:
>> Hi,
>> In the issue, you wrote that "This way the indexing level just stores opaque
>> binary fields, and then Document handles compress/uncompressing as needed."
>> I have looked into the Lucene code, and it seems to me that it is Field that
>> should take care of compress/uncompress, and that the FieldsReader and
>> FieldsWriter should only see binary data.
>> Or do you mean that compression should be completely external to Lucene?
>I believe the consensus is it should be done externally.
>> In fact, from the end of the other thread "Flexible index format / Payloads
>> Cont'd", I was discussing how to customize the way data are stored. So I
>> have looked deeper into the code and I think I have found a way to do so. And
>> as you can change the way it is stored, you can also define the compression
>> level, or handle your own compression algorithm. I will show you a patch, but
>> I have modified so much code over my several tries that I first need to
>> remove the unnecessary changes. To describe it shortly:
>> - I have provided a way to supply your own FieldsReader and FieldsWriter (via
>> a factory). To create an IndexReader, you have to provide that factory; the
>> actual API is just using a default factory.
>> - I have moved the code of FieldsReader and FieldsWriter that does the field
>> data reading to a new class, FieldData. The FieldsReader instantiates a
>> FieldData, does a, and does a new Field(fielddata, ...). The
>> FieldsWriter does a field.getFieldData().write(output);
>> - so by extending FieldsReader, you can provide your own implementation of
>> FieldData, and thus implement the way you want data to be stored and read.
>> The tests pass successfully, but I have an issue with that design: one thing
>> that I think is important is that in the current design, we can read an index
>> in an old format, and just do a writer.addIndexes() into a new format. With
>> the new design, you cannot, because the writer will use the FieldData.write
>> provided by the reader.
>> To be continued...
>I would love to see this patch.  I think one could make a pretty good  
>argument for this kind of implementation being done "cleanly", that  
>is, it shouldn't necessarily involve reworking the internals, but  
>instead could represent the foundation for a new, codec based  
>indexing mechanism (with an implementation that can read/write the  
>existing file format.)
>> cheers,
>> Nicolas
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
>Grant Ingersoll
>Sr. Software Engineer
>Center for Natural Language Processing
>Syracuse University
>335 Hinds Hall
>Syracuse, NY 13244
>Voice: 315-443-5484
>Skype: grant_ingersoll
>Fax: 315-443-6886

