lucene-dev mailing list archives

From Grant Ingersoll <gsing...@syr.edu>
Subject Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields
Date Thu, 17 Aug 2006 01:25:10 GMT
I agree.  I would vote for deprecating the compression stuff.  I am  
still interested in the flexible indexing part mentioned later in  
Nicolas' response, but that is a separate thread.



On Aug 16, 2006, at 8:33 PM, Robert Engels wrote:

> I just think the compressed field type should be removed from
> Lucene altogether. Only the binary field type should remain, and
> the application can externally compress/uncompress fields using a
> facade/containment hierarchy using Document.
>
> That is
>
> class MyDocument {
>     Document doc;
>
>     String getField(String name) {
>         // isCompressed() and decompress() are application-defined helpers
>         if (isCompressed(name)) {
>             return decompress(doc.getBinaryValue(name));
>         } else {
>             return doc.get(name);
>         }
>     }
> }
>
> Or some such thing, and not deal with the compression at a Lucene
> level. In order to have Lucene deal with the compression, you would
> really need to settle on the compression type, its parameters, and
> how they would be stored - otherwise cross-platform ports (or Plucene)
> would never be able to read the index. If the compression
> were external, all an implementation would need is binary field
> support, and it would only be unable to access the compressed fields
> if it did not have a suitable way to decompress them.
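
A minimal sketch of the external approach described above, assuming the application compresses a value itself and hands only the resulting bytes to a binary field. The class and method names (ExternalCompression, compress, decompress) are hypothetical; only the java.util.zip calls are standard. The level parameter is the same knob that LUCENE-648 asks for, moved to the application side.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical application-side helpers; the index would only ever see byte[].
class ExternalCompression {

    /** Compresses text at the given Deflater level (0 = none, 9 = best). */
    static byte[] compress(String text, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    /** Reverses compress(); rejects corrupt or truncated input. */
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid compressed data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog, again and again";
        byte[] stored = compress(text, Deflater.BEST_COMPRESSION);
        System.out.println(decompress(stored).equals(text));
    }
}
```

A facade such as MyDocument would call compress() before adding the binary field and decompress() after reading it back, so the index format itself never has to know which algorithm or level was used.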
>
> Otherwise, I think you need a much more advanced compression scheme
> - similar to the PDF specification - because different fields would
> ideally be compressed using different algorithms, and forcing a
> one-size-fits-all approach doesn't normally work well in such a
> low-level library.
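
One way to picture the "compression type and how it is stored" problem raised above is a stored value that carries its own scheme tag, so a reader can tell how to decode it. The sketch below is only an illustration under that assumption: TaggedFieldCodec, the RAW and DEFLATE tags, and the one-byte header are all hypothetical and are not part of any Lucene format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical layout: byte 0 names the scheme, the rest is the payload.
class TaggedFieldCodec {
    static final byte RAW = 0;
    static final byte DEFLATE = 1;

    /** Encodes text with the chosen scheme and prefixes the scheme tag. */
    static byte[] encode(String text, byte scheme) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        stored.write(scheme);
        try {
            if (scheme == RAW) {
                stored.write(raw);
            } else if (scheme == DEFLATE) {
                // closing the stream flushes the final compressed block
                try (OutputStream out = new DeflaterOutputStream(stored)) {
                    out.write(raw);
                }
            } else {
                throw new IllegalArgumentException("unknown scheme " + scheme);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stored.toByteArray();
    }

    /** Reads the scheme tag and decodes the payload accordingly. */
    static String decode(byte[] stored) {
        if (stored.length == 0) {
            throw new IllegalArgumentException("empty stored value");
        }
        InputStream body = new ByteArrayInputStream(stored, 1, stored.length - 1);
        InputStream in;
        if (stored[0] == RAW) {
            in = body;
        } else if (stored[0] == DEFLATE) {
            in = new InflaterInputStream(body);
        } else {
            throw new IllegalArgumentException("unknown scheme " + stored[0]);
        }
        try {
            ByteArrayOutputStream text = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            for (int n; (n = in.read(buffer)) != -1; ) {
                text.write(buffer, 0, n);
            }
            return new String(text.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] title = encode("Short title", RAW);
        byte[] body = encode("A long body, a long body, a long body", DEFLATE);
        System.out.println(decode(title) + " / " + decode(body));
    }
}
```

An implementation that does not recognize a tag fails only on that field, which matches the point that ports without a suitable decompressor would lose access to the compressed fields and nothing else.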
>
>
>
> -----Original Message-----
>> From: Grant Ingersoll <gsingers@syr.edu>
>> Sent: Aug 16, 2006 6:51 AM
>> To: java-dev@lucene.apache.org
>> Subject: Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP  
>> compression level for compressed fields
>>
>>
>> On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:
>>
>>> Hi,
>>>
>>> In the issue, you wrote that "This way the indexing level just
>>> stores opaque binary fields, and then Document handles
>>> compress/uncompressing as needed."
>>>
>>> I have looked into the Lucene code, and it seems to me that it is
>>> Field that should take care of compress/uncompress, and it is the
>>> FieldsReader and FieldsWriter that should only view binary data.
>>> Or do you mean that compression should be completely external to
>>> Lucene?
>>>
>>
>> I believe the consensus is it should be done externally.
>>
>>> In fact, at the end of the other thread "Flexible index format /
>>> Payloads Cont'd", I was discussing how to customize the way data is
>>> stored. So I have looked deeper into the code and I think I have
>>> found a way to do so. And as you could change the way it is stored,
>>> you could also define the compression level, or handle your own
>>> compression algorithm. I will show you a patch, but I have modified
>>> so much code over my several tries that I first need to remove the
>>> unnecessary changes. To describe it briefly:
>>> - I have provided a way to supply your own FieldsReader and
>>> FieldsWriter (via a factory). To create an IndexReader, you have to
>>> provide that factory; the current API just uses a default factory.
>>> - I have moved the code of FieldsReader and FieldsWriter that does
>>> the field data reading and writing to a new class, FieldData. The
>>> FieldsReader instantiates a FieldData, does a fielddata.read(input),
>>> and does a new Field(fielddata,...). The FieldsWriter does a
>>> field.getFieldData().write(output);
>>> - so by extending FieldsReader, you can provide your own
>>> implementation of FieldData, and thereby control how data is
>>> stored and read.
>>> The tests pass successfully, but I have an issue with that design:
>>> one thing that I think is important is that in the current design,
>>> we can read an index in an old format and just do a
>>> writer.addIndexes() into a new format. With the new design, you
>>> cannot, because the writer will use the FieldData.write provided by
>>> the reader.
>>> To be continued...
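
The patch itself is not shown in this thread, so the following is only a rough sketch of the shape described above: a FieldData that reads and writes itself, and a factory that the reader uses to create it. Every type and method name here (FieldDataSketch, FieldData, FieldDataFactory, StringFieldData, readField, writeField) is hypothetical, and plain java.io streams stand in for Lucene's own input and output classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; none of these types are part of Lucene.
class FieldDataSketch {

    /** One stored field value that knows how to read and write itself. */
    interface FieldData {
        void read(DataInput in) throws IOException;
        void write(DataOutput out) throws IOException;
    }

    /** What a custom reader would be given in order to create FieldData. */
    interface FieldDataFactory {
        FieldData newFieldData();
    }

    /** Example implementation: the value is stored via writeUTF/readUTF. */
    static class StringFieldData implements FieldData {
        String value = "";

        @Override
        public void read(DataInput in) throws IOException {
            value = in.readUTF();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(value);
        }
    }

    /** Reader side: instantiate a FieldData via the factory, then let it read. */
    static FieldData readField(FieldDataFactory factory, byte[] stored) {
        FieldData data = factory.newFieldData();
        try {
            data.read(new DataInputStream(new ByteArrayInputStream(stored)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return data;
    }

    /** Writer side: the stored form is whatever the FieldData itself writes. */
    static byte[] writeField(FieldData data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            data.write(new DataOutputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        StringFieldData original = new StringFieldData();
        original.value = "stored text";
        byte[] stored = writeField(original);
        StringFieldData copy = (StringFieldData) readField(StringFieldData::new, stored);
        System.out.println(copy.value);
    }
}
```

In this sketch writeField() has no say in the format: it emits whatever the FieldData instance writes. That is the addIndexes() concern in miniature, since a FieldData created by an old-format reader would write the old format again unless something converts it first.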
>>
>> I would love to see this patch.  I think one could make a pretty good
>> argument for this kind of implementation being done "cleanly", that
>> is, it shouldn't necessarily involve reworking the internals, but
>> instead could represent the foundation for a new, codec based
>> indexing mechanism (with an implementation that can read/write the
>> existing file format.)
>>
>>
>>>
>>> cheers,
>>> Nicolas
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> Sr. Software Engineer
>> Center for Natural Language Processing
>> Syracuse University
>> 335 Hinds Hall
>> Syracuse, NY 13244
>> http://www.cnlp.org
>>
>> Voice: 315-443-5484
>> Skype: grant_ingersoll
>> Fax: 315-443-6886




