hadoop-common-issues mailing list archives

From "Tim Broberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8148) Zero-copy ByteBuffer-based compressor / decompressor API
Date Mon, 23 Apr 2012 18:54:34 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259836#comment-13259836 ]

Tim Broberg commented on HADOOP-8148:

More dust:
1 - Block-based, non-scatter-gather libraries (basically every software codec except gzip) won't
readily support the scatter-gather List<ByteBuffer> interface. I think we should drop
it and just pass ByteBuffers.
2 - Direct buffers have a reputation for being costly to create. As I understand it, the reason
the codec pool class exists is to allow compressors with direct buffers to be reused without
having to create a new direct buffer each time a record is read. The interface proposed does
not address ownership or recycling of the buffers. We could add buffer-management calls to
each interface that passes these buffers, or the buffers themselves could have a call to
return them to a pool from which they can be reused. Managing the number of elements in the
pool and the size of the buffers is a nontrivial task.
3 - If we do address buffer recycling, the codec pool approach would appear to be obsolete.
Note that, outside of compression streams, codec pool is the only customer that cares about
the compression interface any longer - an extreme statement, but witness that bzip doesn't
implement a compressor interface at all except for dummy stubs to show to codec pool.
4 - The interface of the existing compressor / decompressor classes packs a lot of baggage
from the gzip interface, which decouples the input from the output for a streaming compressor
class. setInput, needsInput, finished, finish, reset, and reinit all manage state between
the input and output, where a simple compress(ByteBuffer src, ByteBuffer dst) could replace
the existing compress call and all the rest. (Full disclosure: I personally want all those
other calls dead because all that state makes asynchronous compression a nightmare.)
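
To illustrate the state juggling in point 4, here is a minimal sketch that uses
java.util.zip.Deflater as a stand-in for the gzip-style interface (it is not the Hadoop
Compressor class itself, and the class name StreamingStateSketch is made up for this
example). The caller has to drive setInput, finish, and finished around every deflate call:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; Deflater stands in for a gzip-style streaming compressor.
final class StreamingStateSketch {

    /** Compresses input by driving the setInput / finish / finished state machine. */
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);   // state: input is pending
        deflater.finish();          // state: no further input will follow
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[64];
        while (!deflater.finished()) {      // caller polls the state between calls
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inverse operation, used here only to check the round trip. */
    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] result = new byte[originalLength];
        int off = 0;
        try {
            while (!inflater.finished() && off < result.length) {
                off += inflater.inflate(result, off, result.length - off);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return result;
    }
}
```

Every one of those calls mutates state shared between input and output, which is the part
that gets awkward once the work is handed off asynchronously.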

So, I'm highly tempted to sweep away the compressor interface and replace it with a much simpler
one -
 - compress(src, dst) to process data
 - finish() to allow cleaning up open streams
 - getBytesRead(), getBytesWritten() for statistics
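
As a rough sketch of what that could look like (not the attached patch), the interface
below has just those calls. The names SimpleCompressor and DeflaterSimpleCompressor are
hypothetical, each compress() call is assumed to handle one self-contained block, and the
ByteBuffer overloads of Deflater used here need Java 11 or later:

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Hypothetical interface shape; names and return types are assumptions.
interface SimpleCompressor {
    /** Compresses src.remaining() bytes into dst; returns the bytes written. */
    int compress(ByteBuffer src, ByteBuffer dst);

    /** Cleans up any open streams or native resources. */
    void finish();

    long getBytesRead();

    long getBytesWritten();
}

// Illustrative implementation on top of java.util.zip.Deflater.
final class DeflaterSimpleCompressor implements SimpleCompressor {
    private long bytesRead;
    private long bytesWritten;

    @Override
    public int compress(ByteBuffer src, ByteBuffer dst) {
        int inStart = src.position();
        int outStart = dst.position();
        Deflater deflater = new Deflater();   // per-call state only
        try {
            deflater.setInput(src);           // Java 11+ ByteBuffer overload
            deflater.finish();
            while (!deflater.finished()) {
                if (!dst.hasRemaining()) {
                    throw new BufferOverflowException(); // dst too small for this block
                }
                deflater.deflate(dst);        // advances both buffer positions
            }
        } finally {
            deflater.end();
        }
        int produced = dst.position() - outStart;
        bytesRead += src.position() - inStart;
        bytesWritten += produced;
        return produced;
    }

    @Override
    public void finish() {
        // Nothing to release in this sketch; each call cleans up after itself.
    }

    @Override
    public long getBytesRead() { return bytesRead; }

    @Override
    public long getBytesWritten() { return bytesWritten; }
}
```

The caller owns both buffers and nothing carries over between calls, so there is no
setInput / needsInput / finished dance left to get wrong. Sizing dst is still the caller's
problem in this sketch.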

Replace the codec pool with a pool of buffers extending ByteBuffer which have a callback method
to recycle them.
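
A minimal sketch of such a pool follows. One caveat: java.nio.ByteBuffer cannot be
subclassed outside java.nio (its constructors are package-private), so this version wraps
the buffer instead of extending it. The names BufferPool, PooledBuffer, and recycle() are
hypothetical, and the fixed buffer size and pool cap are placeholder policies for the
sizing problem mentioned in point 2:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical wrapper: holds a direct buffer plus a callback to its pool.
final class PooledBuffer implements AutoCloseable {
    private final ByteBuffer buffer;
    private final BufferPool owner;

    PooledBuffer(ByteBuffer buffer, BufferPool owner) {
        this.buffer = buffer;
        this.owner = owner;
    }

    ByteBuffer buffer() { return buffer; }

    /** Callback that hands this buffer back to its pool for reuse. */
    void recycle() { owner.giveBack(this); }

    @Override
    public void close() { recycle(); }
}

// Hypothetical pool: fixed-size direct buffers, capped number of idle entries.
final class BufferPool {
    private final ArrayDeque<PooledBuffer> free = new ArrayDeque<>();
    private final int bufferSize;
    private final int maxPooled;
    private int created;   // how many direct buffers were actually allocated

    BufferPool(int bufferSize, int maxPooled) {
        this.bufferSize = bufferSize;
        this.maxPooled = maxPooled;
    }

    synchronized PooledBuffer acquire() {
        PooledBuffer pooled = free.pollFirst();
        if (pooled == null) {
            created++;     // the costly path: a new direct buffer
            pooled = new PooledBuffer(ByteBuffer.allocateDirect(bufferSize), this);
        }
        pooled.buffer().clear();   // reset position and limit for the next user
        return pooled;
    }

    synchronized void giveBack(PooledBuffer pooled) {
        // Ignore double recycles; drop the buffer if the pool is already full.
        if (free.size() < maxPooled && !free.contains(pooled)) {
            free.addFirst(pooled);
        }
    }

    synchronized int createdCount() { return created; }
}
```

Because PooledBuffer is AutoCloseable, a caller could write
try (PooledBuffer b = pool.acquire()) { ... } and have the buffer returned automatically.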

Too radical? What would be a better way to solve the problems? Any problems this doesn't solve?

> Zero-copy ByteBuffer-based compressor / decompressor API
> --------------------------------------------------------
>                 Key: HADOOP-8148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8148
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Assignee: Tim Broberg
>         Attachments: hadoop8148.patch
> Per Todd Lipcon's comment in HDFS-2834, "
>   Whenever a native decompression codec is being used, ... we generally have the following
>   1) Socket -> DirectByteBuffer (in SocketChannel implementation)
>   2) DirectByteBuffer -> byte[] (in SocketInputStream)
>   3) byte[] -> Native buffer (set up for decompression)
>   4*) decompression to a different native buffer (not really a copy - decompression necessarily
>   5) native buffer -> byte[]
>   with the proposed improvement we can hopefully eliminate #2,#3 for all applications,
>   and #2,#3,and #5 for libhdfs.
> "
> The interfaces in the attached patch attempt to address:
>  A - Compression and decompression based on ByteBuffers (HDFS-2834)
>  B - Zero-copy compression and decompression (HDFS-3051)
>  C - Provide the caller a way to know the max space required to hold compressed output.

