hadoop-common-issues mailing list archives

From "Xiao Kang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4196) Possible performance enhancement in Hadoop compress module
Date Fri, 02 Apr 2010 05:30:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852715#action_12852715 ]

Xiao Kang commented on HADOOP-4196:
-----------------------------------

Thanks to Hong Tang for noticing the duplication with another JIRA, HADOOP-6662.

Since the first performance enhancement suggestion is clear and easy to implement, maybe we can resolve it separately: close HADOOP-6662 and move its patch to this issue, or continue the discussion in HADOOP-6662.

> Possible performance enhancement in Hadoop compress module
> ----------------------------------------------------------
>
>                 Key: HADOOP-4196
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4196
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.18.0
>            Reporter: Hong Tang
>
> There are several implementation issues in the current Hadoop compression module that hurt performance. Generally, the opportunities all stem from the fact that the granularities of I/O operations in CompressionStream and DecompressionStream are not controllable by users, who are therefore forced to attach a BufferedInputStream or BufferedOutputStream to both ends of the CompressionStream and DecompressionStream:
> - ZlibCompressor: always returns false from needsInput() after setInput(), which leads to a native call to deflateBytesDirect() for almost every write() from CompressorStream. This becomes problematic when applications call write() on the CompressorStream with small write sizes (e.g. one byte at a time). It would be better to follow a code path similar to LzoCompressor's and append to an internal uncompressed-data buffer (see the sketch after this list).
> - CompressorStream: whenever the compressor produces some compressed data, it immediately issues write() calls to the downstream. This could be improved by appending to the byte[] until it is full (or half full) before writing to the downstream. Otherwise, applications have to use a BufferedOutputStream as the downstream in case the output sizes from CompressorStream are too small, which generally causes double buffering.
> - BlockCompressorStream: the same issue as described above.
> - BlockDecompressorStream: getCompressedData() reads only one compressed chunk at a time. It would be better to read a full buffer and then extract compressed chunks from it (similar to what DecompressorStream does, though admittedly a bit more complicated).
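
A minimal sketch of the buffering approach suggested in the ZlibCompressor bullet. This is illustrative only: the class name, the 64KB buffer size, and the simplified contract are assumptions here, not the actual ZlibCompressor/LzoCompressor code.

    // Sketch: accumulate small setInput() calls in an internal byte[] so the
    // native library is only invoked once the buffer fills, instead of once
    // per write(). Names and sizes are hypothetical.
    class BufferingCompressor {
        private final byte[] uncompressed = new byte[64 * 1024]; // tune per codec
        private int len = 0;

        // Keep reporting "more input welcome" while the buffer has room, so
        // CompressorStream keeps appending instead of calling compress().
        boolean needsInput() {
            return len < uncompressed.length;
        }

        // Append input; a full implementation would keep any overflow pending.
        void setInput(byte[] b, int off, int n) {
            int toCopy = Math.min(n, uncompressed.length - len);
            System.arraycopy(b, off, uncompressed, len, toCopy);
            len += toCopy;
        }

        // compress() (not shown) would make the single JNI call, e.g.
        // deflateBytesDirect(), on the full buffer, amortizing its cost.
    }
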
> In general, the following could serve as guidelines for Compressor/Decompressor and CompressorStream/DecompressorStream design and implementation that would give users some performance guarantees:
> - Compressor and Decompressor each keep two DirectByteBuffers, whose sizes should be tuned to be optimal for the specific compression/decompression algorithm. Ensure that Compressor.compress() is always called with a full (or nearly full) DirectByteBuffer of uncompressed data.
> - CompressorStream and DecompressorStream maintain a byte[] for exchanging data with the downstream. The size of the byte[] should be user-customizable (add a bufferSize parameter to CompressionCodec's createInputStream and createOutputStream interfaces; see the sketch after this list). Ensure that I/O to the downstream happens at or near the granularity of the byte[] size, so applications can simply rely on the buffering inside CompressorStream and DecompressorStream (in the LZO case, BlockCompressorStream and BlockDecompressorStream).
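
One way the proposed bufferSize knob could look. This is a hypothetical extension, not the existing CompressionCodec interface; only the bufferSize parameter is new relative to the current factory methods.

    // Hypothetical codec interface with the suggested bufferSize parameter.
    // The returned streams would allocate an internal byte[bufferSize] and do
    // downstream I/O at (or near) that granularity, so callers no longer need
    // to wrap them in Buffered{Input,Output}Stream.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.io.compress.CompressionInputStream;
    import org.apache.hadoop.io.compress.CompressionOutputStream;

    interface SizedCompressionCodec {
        CompressionOutputStream createOutputStream(OutputStream out, int bufferSize)
            throws IOException;
        CompressionInputStream createInputStream(InputStream in, int bufferSize)
            throws IOException;
    }
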
> A more radical change would be to let the downstream InputStream deposit data directly into a ByteBuffer, or the downstream OutputStream accept its input from a ByteBuffer. We might call these ByteBufferInputStream and ByteBufferOutputStream. CompressorStream and DecompressorStream could simply test whether the downstream implements such an interface and, if so, bypass their own byte[] buffers.
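
A sketch of how that test could look on the output side. The ByteBufferOutput interface and all method names here are made up for illustration; the original description only names the idea (ByteBufferOutputStream), not an API.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.ByteBuffer;

    // Hypothetical optional interface a downstream sink can implement.
    interface ByteBufferOutput {
        void write(ByteBuffer buf) throws IOException; // consumes buf directly
    }

    class CompressedWriteExample {
        // Write compressed output, bypassing the byte[] copy when the
        // downstream can accept a ByteBuffer directly.
        static void writeCompressed(OutputStream down, ByteBuffer compressed,
                                    byte[] scratch) throws IOException {
            if (down instanceof ByteBufferOutput) {
                ((ByteBufferOutput) down).write(compressed); // fast path
            } else {
                // Fallback: drain through the stream's own byte[] buffer.
                while (compressed.hasRemaining()) {
                    int n = Math.min(scratch.length, compressed.remaining());
                    compressed.get(scratch, 0, n);
                    down.write(scratch, 0, n);
                }
            }
        }
    }
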

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

