hadoop-common-dev mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: [jira] Commented: (HADOOP-54) SequenceFile should compress blocks, not individual entries
Date Mon, 17 Jul 2006 02:05:36 GMT
Wouldn't knowing the first key of the next chunk suffice?  That is  
inexpensive to uncompress (coming first).
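
A minimal sketch of that idea in Java, assuming each chunk exposes its first key without decompressing the rest and that keys are strictly ascending (as in a MapFile data file). The class and method names are hypothetical, not part of SequenceFile:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; nothing here is Hadoop API.
final class ChunkSeek {

    /**
     * firstKeys.get(i) is the first key of chunk i, with chunks in ascending
     * key order. Returns the index of the only chunk that can hold target,
     * or -1 if target sorts before every key in the file.
     */
    static int chunkFor(List<String> firstKeys, String target) {
        if (firstKeys.isEmpty() || firstKeys.get(0).compareTo(target) > 0) {
            return -1;
        }
        int i = 0;
        // If the next chunk starts at or before target, chunk i cannot hold
        // it, so chunk i is skipped without decompressing its keys.
        while (i + 1 < firstKeys.size()
                && firstKeys.get(i + 1).compareTo(target) <= 0) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        List<String> firstKeys = Arrays.asList("apple", "grape", "peach");
        System.out.println(chunkFor(firstKeys, "kiwi"));
    }
}
```

Only the chunk returned by chunkFor would need its key array decompressed. With duplicate keys the comparison would have to be strict, since a run of equal keys could straddle a chunk boundary.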

On Jul 15, 2006, at 11:00 PM, Bryan Pendleton (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-54?page=comments#action_12421380 ]
> Bryan Pendleton commented on HADOOP-54:
> ---------------------------------------
> Another feature that might be useful: include an (optional)  
> uncompressed copy of the *last* key in a given compression spill.  
> Why? Because, for sorted SequenceFiles (specifically, for the data  
> file of a MapFile), when seeking through to find a given key/value,  
> knowing the last key in a given chunk allows skipping of  
> decompressing the entire key array.
> It should, of course, be optional, both because keys can potentially be  
> large, and because SequenceFiles aren't all sorted. But, in the  
> MapFile case, it could reduce the cost of finding a hit during  
> lookups.
> Might also be useful to try to do some DFS block-aligning, to avoid  
> block requests and CRC calculations for data that's not really  
> going to be used. That sounds like it might be tricky to get,  
> though, because default DFS blocks are so large, and we're probably  
> talking about much smaller compression chunks. Does DFS have  
> variable-length block writing yet?
>> SequenceFile should compress blocks, not individual entries
>> -----------------------------------------------------------
>>                 Key: HADOOP-54
>>                 URL: http://issues.apache.org/jira/browse/HADOOP-54
>>             Project: Hadoop
>>          Issue Type: Improvement
>>          Components: io
>>    Affects Versions: 0.2.0
>>            Reporter: Doug Cutting
>>         Assigned To: Michel Tourn
>>             Fix For: 0.5.0
>> SequenceFile will optionally compress individual values.  But both  
>> compression and performance would be much better if sequences of  
>> keys and values are compressed together.  Sync marks should only  
>> be placed between blocks.  This will require some changes to  
>> MapFile too, so that all file positions stored there are the  
>> positions of blocks, not entries within blocks.  Probably this can  
>> be accomplished by adding a getBlockStartPosition() method to  
>> SequenceFile.Writer.
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the  
> administrators: http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
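
For reference, the issue description quoted above suggests a getBlockStartPosition() method so that MapFile can store block positions rather than entry positions. A rough sketch of what that might look like, assuming a toy writer that compresses a fixed number of entries per block; everything except the method name is hypothetical, and sync marks are omitted:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;

// Toy block writer for illustration; not Hadoop's SequenceFile.Writer.
final class BlockWriterSketch {
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private final int entriesPerBlock;
    private int pendingEntries = 0;

    BlockWriterSketch(int entriesPerBlock) {
        this.entriesPerBlock = entriesPerBlock;
    }

    /** File offset of the block that will hold the next appended entry. */
    long getBlockStartPosition() {
        // Buffered entries are not in the file yet, so the current file
        // length is where their block will start.
        return file.size();
    }

    void append(String key, String value) {
        try {
            DataOutputStream out = new DataOutputStream(pending);
            out.writeUTF(key);
            out.writeUTF(value);
            out.flush();
            if (++pendingEntries == entriesPerBlock) {
                flushBlock();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses all buffered entries together and writes them as one block. */
    void flushBlock() throws IOException {
        if (pendingEntries == 0) {
            return;
        }
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(compressed)) {
            pending.writeTo(z);
        }
        DataOutputStream out = new DataOutputStream(file);
        out.writeInt(compressed.size()); // length prefix for the block
        compressed.writeTo(out);
        out.flush();
        pending.reset();
        pendingEntries = 0;
    }
}
```

An index writer would call getBlockStartPosition() just before each append and store that offset with the key, so every entry in the same block maps to the same position.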
