hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-87) SequenceFile performance degrades substantially compression is on and large values are encountered
Date Fri, 17 Mar 2006 20:06:22 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-87?page=comments#action_12370876 ] 

Doug Cutting commented on HADOOP-87:
------------------------------------

There's a lot of commented-out code added by this patch.  Can you remove that, or is it important
that it remain?  You also add a new public append() method, but nothing calls it outside of
this file.  So it probably doesn't need to be public.  But it replicates a lot of the logic
from another append() method.  Can't we somehow implement this with the old append method,
or define the old public append method in terms of this new private method?  Replicating logic
is not good.  Finally, there are some spurious whitespace changes in your patch.
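
One way to read the refactoring suggested above, as a minimal sketch only: the public append() serializes its arguments and delegates to a single private method that owns the record layout. The class name AppendSketch, the appendRaw() helper, and the length-prefixed layout are all hypothetical and are not taken from SequenceFile or from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a record writer; NOT the real SequenceFile.Writer.
class AppendSketch {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    /** Existing public entry point: serialize the arguments, then delegate. */
    public void append(String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        appendRaw(k, k.length, v, v.length);
    }

    /** The only place that knows the (assumed) record layout. */
    private void appendRaw(byte[] key, int keyLen, byte[] val, int valLen) {
        writeInt(keyLen + valLen);      // total record length
        writeInt(keyLen);               // key length
        buffer.write(key, 0, keyLen);   // key bytes
        buffer.write(val, 0, valLen);   // value bytes
    }

    /** Writes a 4-byte big-endian int. */
    private void writeInt(int v) {
        buffer.write(v >>> 24);
        buffer.write(v >>> 16);
        buffer.write(v >>> 8);
        buffer.write(v);
    }

    public int size() {
        return buffer.size();
    }

    public static void main(String[] args) {
        AppendSketch w = new AppendSketch();
        w.append("k", "value");
        System.out.println(w.size()); // 4 + 4 + 1 + 5 = 14
    }
}
```

With this shape the new method can stay private, and any later change to the record layout touches one method instead of two.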

> SequenceFile performance degrades substantially compression is on and large values are encountered
> --------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-87
>          URL: http://issues.apache.org/jira/browse/HADOOP-87
>      Project: Hadoop
>         Type: Improvement
>   Components: io
>     Versions: 0.1
>     Reporter: Sameer Paranjpye
>      Fix For: 0.1
>  Attachments: hadoop_87.fix
>
> The code snippet in question is:
>       if (deflateValues) {
>         deflateIn.reset();
>         val.write(deflateIn);
>         deflater.reset();
>         deflater.setInput(deflateIn.getData(), 0, deflateIn.getLength());
>         deflater.finish();
>         while (!deflater.finished()) {
>           int count = deflater.deflate(deflateOut);
>           buffer.write(deflateOut, 0, count);
>         }
>       } else {
>   
> A couple of issues with this code:
> 1. The value is serialized to the 'deflateIn' buffer, which is an instance of 'DataOutputBuffer'. This buffer grows as large as needed to store the serialized value and stays as large as the largest serialized value encountered. If, for instance, a stream has a single 8MB value followed by several 8KB values, the size of the buffer stays at 8MB. The problem is that the *entire* 8MB buffer is always copied over the JNI boundary regardless of the size of the value. We've observed this over several runs, where compression performance degrades by a couple of orders of magnitude once a very large value is encountered. Shrinking the buffer fixes the problem.
> 2. Data is copied lots of times. First, the value is serialized into 'deflateIn'. Second, the value is copied over the JNI boundary in *every* iteration of the while loop. Third, the compressed data is copied piecemeal into 'deflateOut'. Finally, it is appended to 'buffer'.
> Proposed fix:
> 1. Don't let big buffers persist. Allow 'deflateIn' to grow to a *persistent* maximum reasonable size, say 64KB. If a larger value is encountered, grow the buffer in order to process the value, then shrink it back to the maximum size. To do this, we add a 'reset' method which takes a buffer size.
> 2. Don't use a loop to deflate. The maximum size of the output can be determined by 'maxOutputSize = inputSize * 1.01 + 12'. This is the maximum output size that zlib will produce. We allocate a large enough output buffer and compress everything in 1 pass. The output buffer, of course, needs to shrink as well.
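
The two proposed steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: the class BoundedDeflateSketch, its methods, and the use of plain byte arrays in place of 'DataOutputBuffer' are hypothetical. Only the 64KB cap and the 'inputSize * 1.01 + 12' bound come from the description above; the java.util.zip calls are the standard JDK ones.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch; plain byte[] buffers stand in for DataOutputBuffer.
class BoundedDeflateSketch {
    static final int MAX_PERSISTENT = 64 * 1024; // cap on buffers kept between calls

    private byte[] in = new byte[MAX_PERSISTENT];
    private byte[] out = new byte[maxOutputSize(MAX_PERSISTENT)];
    private final Deflater deflater = new Deflater();

    /** Output bound quoted in the issue: inputSize * 1.01 + 12, rounded up. */
    static int maxOutputSize(int inputSize) {
        return (int) Math.ceil(inputSize * 1.01) + 12;
    }

    /** Compresses one value with a single deflate() call, then shrinks oversized buffers. */
    byte[] compress(byte[] value) {
        int len = value.length;
        if (in.length < len) in = new byte[len];           // grow only for this value
        System.arraycopy(value, 0, in, 0, len);            // stands in for val.write(deflateIn)
        int bound = maxOutputSize(len);
        if (out.length < bound) out = new byte[bound];

        deflater.reset();
        deflater.setInput(in, 0, len);                     // pass only the bytes in use
        deflater.finish();
        int count = deflater.deflate(out, 0, bound);       // one pass, no loop
        if (!deflater.finished()) {
            throw new IllegalStateException("output bound too small");
        }
        byte[] result = Arrays.copyOf(out, count);

        // Do not let big buffers persist past this call.
        if (in.length > MAX_PERSISTENT) in = new byte[MAX_PERSISTENT];
        if (out.length > maxOutputSize(MAX_PERSISTENT)) {
            out = new byte[maxOutputSize(MAX_PERSISTENT)];
        }
        return result;
    }

    int inputCapacity() {
        return in.length;
    }

    /** Round-trip helper used only to check the sketch. */
    static byte[] decompress(byte[] packed, int originalLen) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            byte[] restored = new byte[originalLen];
            int n = 0;
            while (n < originalLen && !inflater.finished()) {
                int r = inflater.inflate(restored, n, originalLen - n);
                if (r == 0 && (inflater.needsInput() || inflater.needsDictionary())) break;
                n += r;
            }
            return Arrays.copyOf(restored, n);
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        BoundedDeflateSketch c = new BoundedDeflateSketch();
        byte[] big = new byte[8 * 1024 * 1024];            // one 8MB value
        byte[] packed = c.compress(big);
        System.out.println(Arrays.equals(big, decompress(packed, big.length)));
        System.out.println(c.inputCapacity() == MAX_PERSISTENT);
    }
}
```

The finished() check is a guard rather than an expected path: if the bound ever turned out to be too small for some input, the sketch fails loudly instead of silently truncating the compressed value.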


