hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Assigned: (HADOOP-54) SequenceFile should compress blocks, not individual entries
Date Wed, 19 Jul 2006 08:07:17 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-54?page=all ]

Arun C Murthy reassigned HADOOP-54:

    Assignee: Arun C Murthy  (was: Michel Tourn)

 +1 for Owen's proposal.

 An unrelated issue: the 'append' method in SequenceFile.Writer is passed 2 Writables: key
and value. The Writable interface doesn't have a 'getLength' method. This means one would
have to write out the key/value to a temporary buffer to actually figure out its 'length'.
The lengths are particularly relevant here to ensure that the key/value pair can be put into
the keyBuffer/valueBuffer without violating the 'configured' maxBufferSize...
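
 To make the cost concrete, here is a minimal sketch of the temporary-buffer workaround
described above. It is not taken from the Hadoop source: the nested Writable interface is
a local stand-in so the example compiles on its own, and IntLongRecord and serializedLength
are hypothetical names introduced only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class SerializedLengthSketch {
    /** Local stand-in for the write half of a Writable; there is no getLength(). */
    interface Writable {
        void write(DataOutput out) throws IOException;
    }

    /** Hypothetical sample record: a 4-byte int followed by an 8-byte long. */
    static final class IntLongRecord implements Writable {
        private final int id;
        private final long stamp;

        IntLongRecord(int id, long stamp) {
            this.id = id;
            this.stamp = stamp;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeLong(stamp);
        }
    }

    /** Serializes w into a throwaway buffer purely to learn its byte length. */
    static int serializedLength(Writable w) {
        ByteArrayOutputStream scratch = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(scratch)) {
            w.write(out);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return scratch.size();
    }
}
```

 Every append would pay for this extra serialization pass (plus a copy into the real
buffer) if the size had to be known before the pair is buffered.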

 To get around this issue: how about making the 'configured' bufferSize the 'lower_bound'
instead of the 'upper_bound'? This will ensure we can write out the key/value and then check
the buffer size, and if need be go ahead and compress etc. This will save the construction
of the temporary buffer for getting the key/value lengths. Related gain: it's far simpler
with this scheme to deal with outlier/rogue keys/values which are larger than bufferSize itself.
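
 A rough sketch of the 'lower_bound' scheme, under stated assumptions: the class and member
names are hypothetical, the Writable interface is a local stand-in, the threshold is
compared against the combined key and value buffer sizes, and the actual compression and
file output are replaced by recording the size of each flushed block.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LowerBoundBlockWriterSketch {
    /** Local stand-in for the write half of a Writable. */
    interface Writable {
        void write(DataOutput out) throws IOException;
    }

    private final int minBlockSize; // the 'configured' size, treated as a lower bound
    private final ByteArrayOutputStream keyBytes = new ByteArrayOutputStream();
    private final ByteArrayOutputStream valueBytes = new ByteArrayOutputStream();
    private final DataOutputStream keyBuffer = new DataOutputStream(keyBytes);
    private final DataOutputStream valueBuffer = new DataOutputStream(valueBytes);

    /** Uncompressed size of every block flushed so far (stands in for real output). */
    final List<Integer> flushedBlockSizes = new ArrayList<>();

    LowerBoundBlockWriterSketch(int minBlockSize) {
        this.minBlockSize = minBlockSize;
    }

    /** Writes first, checks afterwards: no scratch buffer and no length lookup. */
    void append(Writable key, Writable value) {
        try {
            key.write(keyBuffer);
            value.write(valueBuffer);
            keyBuffer.flush();
            valueBuffer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (keyBytes.size() + valueBytes.size() >= minBlockSize) {
            flushBlock();
        }
    }

    /** Emits whatever is buffered as one block; a real writer would compress here. */
    void flushBlock() {
        int size = keyBytes.size() + valueBytes.size();
        if (size == 0) {
            return;
        }
        flushedBlockSizes.add(size);
        keyBytes.reset();
        valueBytes.reset();
    }
}
```

 An oversized pair needs no special case in this sketch: it is buffered like any other
pair, pushes the total past the threshold, and goes out in a block of its own size.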

 Logical next step: make this 'bufferSize' configurable per SequenceFile; this will let applications
control it depending on the sizes of their keys/values. I propose to introduce a new constructor
with this as an argument for SequenceFile.Writer. This will then be written out as a part
of the file-header (along with compression details) and the SequenceFile.Reader can pick this
up and read accordingly. (Of course there will be a system-wide default if unspecified per
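
 For illustration only, the header round trip might look roughly like the sketch below.
The field layout (a compression flag followed by the block size), the default value, and
all names are assumptions made for this example; they do not describe the actual
SequenceFile header format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class BlockSizeHeaderSketch {
    /** Hypothetical system-wide default, used when the caller gives no size. */
    static final int DEFAULT_BLOCK_SIZE = 1_000_000;

    /** Writer side: record the compression flag and the per-file block size. */
    static byte[] writeHeader(boolean blockCompressed, int blockSize) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(blockCompressed);
            out.writeInt(blockSize);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Overload without an explicit size: falls back to the default. */
    static byte[] writeHeader(boolean blockCompressed) {
        return writeHeader(blockCompressed, DEFAULT_BLOCK_SIZE);
    }

    /** Reader side: skip the flag and pick up the block size the writer stored. */
    static int readBlockSize(byte[] header) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
            in.readBoolean();
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

 Because the size travels with the file, a reader in this sketch needs no configuration of
its own to size its buffers.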




> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>                 Key: HADOOP-54
>                 URL: http://issues.apache.org/jira/browse/HADOOP-54
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.2.0
>            Reporter: Doug Cutting
>         Assigned To: Arun C Murthy
>             Fix For: 0.5.0
> SequenceFile will optionally compress individual values.  But both compression and performance
would be much better if sequences of keys and values are compressed together.  Sync marks
should only be placed between blocks.  This will require some changes to MapFile too, so that
all file positions stored there are the positions of blocks, not entries within blocks.  Probably
this can be accomplished by adding a getBlockStartPosition() method to SequenceFile.Writer.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

