hadoop-common-dev mailing list archives

From: Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject: Re: [jira] Commented: (HADOOP-54) SequenceFile should compress blocks, not individual entries
Date: Tue, 25 Jul 2006 02:32:29 GMT
I completely agree that you should decompress incrementally.  The
right answer might be to decompress just enough for the next entry,
or into a small buffer; we should performance test that.
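
A minimal sketch of that incremental approach (the record framing and
names here are illustrative, not the actual SequenceFile code):

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.zip.InflaterInputStream;

    class IncrementalBlockReader {
      private final DataInputStream in;

      IncrementalBlockReader(byte[] compressedBlock) {
        // InflaterInputStream decompresses lazily as read() is called,
        // rather than inflating the whole block up front.
        this.in = new DataInputStream(
            new InflaterInputStream(new ByteArrayInputStream(compressedBlock)));
      }

      // Assumed record framing: a 4-byte length, then the value bytes.
      byte[] nextValue() throws IOException {
        int length = in.readInt();   // inflates only a few bytes
        byte[] value = new byte[length];
        in.readFully(value);         // inflates just this one entry
        return value;
      }
    }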

My point on raw is that you can return a reference tuple in an object:

    <raw bytes, is-compressed flag, compressor class>

Then you read the bytes: decompressed if they come from a
block-compressed or an uncompressed file, still compressed if they
come from an item-compressed file.

Then you pass this reference to the target sequence file's raw write  
method.  The target then compresses or decompresses as needed.
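
A sketch of the reference-and-raw-write idea with hypothetical names
(this is not the actual SequenceFile API, and it skips the codec-class
check a real implementation would need):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.DeflaterOutputStream;
    import java.util.zip.InflaterInputStream;

    // The reference tuple: raw bytes, an is-compressed flag, and the
    // compressor class.  All names are illustrative.
    final class RawRecord {
      final byte[] bytes;
      final boolean compressed;
      final Class<?> codec;   // a real version would compare codecs too
      RawRecord(byte[] bytes, boolean compressed, Class<?> codec) {
        this.bytes = bytes;
        this.compressed = compressed;
        this.codec = codec;
      }
    }

    final class RawCopyingWriter {
      private final boolean itemCompressed;   // does this target compress values?
      private final ByteArrayOutputStream out = new ByteArrayOutputStream();

      RawCopyingWriter(boolean itemCompressed) {
        this.itemCompressed = itemCompressed;
      }

      // The target recodes only when its format disagrees with the
      // state recorded in the reference.
      void appendRaw(RawRecord ref) throws IOException {
        if (ref.compressed == itemCompressed) {
          out.write(ref.bytes);            // formats agree: pass through untouched
        } else if (itemCompressed) {
          out.write(deflate(ref.bytes));   // plain source, compressed target
        } else {
          out.write(inflate(ref.bytes));   // compressed source, plain target
        }
      }

      private static byte[] deflate(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DeflaterOutputStream dos = new DeflaterOutputStream(buf);
        dos.write(data);
        dos.close();
        return buf.toByteArray();
      }

      private static byte[] inflate(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        InflaterInputStream iis =
            new InflaterInputStream(new ByteArrayInputStream(data));
        byte[] chunk = new byte[4096];
        for (int n; (n = iis.read(chunk)) != -1; ) buf.write(chunk, 0, n);
        return buf.toByteArray();
      }
    }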

Since all of this is packaged up behind an API, folks will not get
confused into using this essentially internal API to do the wrong
thing, and it will efficiently pass item-compressed values from one
such stream to another when given the chance.

This may be worth considering, since sorts and merges will often
operate on item-compressed values, and this would avoid a lot of
spurious decompression and recompression.
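
For example, a merge whose source and target are both item-compressed
could, under the sketch above, copy every value untouched (runsToMerge
is an assumed Iterable<RawRecord>):

    // Both source and target are item-compressed, so appendRaw() takes
    // the pass-through branch and never recodes the value bytes.
    RawCopyingWriter merged = new RawCopyingWriter(true);
    for (RawRecord ref : runsToMerge) {
      merged.appendRaw(ref);
    }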

PS: we should probably only bother doing this for values.

On Jul 24, 2006, at 2:33 PM, Owen O'Malley (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-54?page=comments#action_12423167 ]
>
> Owen O'Malley commented on HADOOP-54:
> -------------------------------------
>
> Eric, I don't see how to implement both block compression, which is
> a huge win, and access to a pre-decompression representation,
> especially if what you want to do with the pre-decompression
> representation is sorting or merging. Therefore, I was (and am)
> proposing that the "raw" access is a little less raw and that the
> byte[] representation is always decompressed. Am I missing
> something? This is a semantic change to the "raw" SequenceFile
> API, but I think it is required to get block-level compression.
>
> On a slight tangent, I think that the SequenceFile.Reader should
> not decompress the entire block but just enough to get the next
> key/value pair.
>
>> SequenceFile should compress blocks, not individual entries
>> -----------------------------------------------------------
>>
>>                 Key: HADOOP-54
>>                 URL: http://issues.apache.org/jira/browse/HADOOP-54
>>             Project: Hadoop
>>          Issue Type: Improvement
>>          Components: io
>>    Affects Versions: 0.2.0
>>            Reporter: Doug Cutting
>>         Assigned To: Arun C Murthy
>>             Fix For: 0.5.0
>>
>>         Attachments: VIntCompressionResults.txt
>>
>>
>> SequenceFile will optionally compress individual values.  But both
>> compression and performance would be much better if sequences of
>> keys and values were compressed together.  Sync marks should only
>> be placed between blocks.  This will require some changes to
>> MapFile too, so that all file positions stored there are the
>> positions of blocks, not of entries within blocks.  Probably this
>> can be accomplished by adding a getBlockStartPosition() method to
>> SequenceFile.Writer.
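
A minimal sketch of the block-writing scheme described in the issue,
with hypothetical names (the sync-mark format, block threshold, and
length framing are all placeholders):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.zip.DeflaterOutputStream;

    // Entries accumulate in a buffer; each flushed block is preceded by
    // a sync mark, and getBlockStartPosition() reports where the current
    // block begins, so MapFile can index blocks rather than entries.
    class BlockWriter {
      private static final byte[] SYNC = new byte[16];   // placeholder sync mark
      private static final int BLOCK_SIZE = 64 * 1024;   // assumed threshold

      private final DataOutputStream out;
      private final ByteArrayOutputStream block = new ByteArrayOutputStream();
      private long blockStart = 0;   // file offset where the current block starts

      BlockWriter(DataOutputStream out) { this.out = out; }

      // Length framing of keys and values is omitted for brevity.
      void append(byte[] key, byte[] value) throws IOException {
        block.write(key);
        block.write(value);
        if (block.size() >= BLOCK_SIZE) flushBlock();
      }

      // MapFile would store this offset instead of a per-entry position.
      long getBlockStartPosition() { return blockStart; }

      void flushBlock() throws IOException {
        out.write(SYNC);               // sync marks go between blocks only
        DeflaterOutputStream dos = new DeflaterOutputStream(out);
        block.writeTo(dos);            // keys and values compressed together
        dos.finish();                  // flush the block, keep out open
        blockStart = out.size();       // the next block starts here
        block.reset();
      }
    }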

