hadoop-common-issues mailing list archives

From "Tim Broberg (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8148) Zero-copy ByteBuffer-based compressor / decompressor API
Date Wed, 11 Apr 2012 23:59:18 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252067#comment-13252067 ]

Tim Broberg commented on HADOOP-8148:
-------------------------------------

I've been thinking about this, and here's some dust to kick up: this interface seems to work
well enough for software codecs, but for multicore hardware codecs it is necessary to process
multiple records in parallel. To do this, the decompressor needs to start reading and
decompressing data before the caller shows up with a direct buffer.

This suggests that the stream should return a prefilled ByteBuffer instead of filling one
provided by the caller.

...but then the caller needs an efficient way to recycle it, since direct buffers are
(supposedly) costly to build. So now we need to add a call to release the buffer.

ByteBuffer read();
void releaseBuffer(ByteBuffer buffer);

This has the added benefit that the compression stream now has control over the sizes of all
buffers, so there is no problem keeping the source and destination sizes appropriate for each
other.
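
As a rough sketch of how a caller might use this input side (the stream type and helper names
here are purely illustrative, not anything from the attached patch):

  ByteBufferDecompressionStream in = ...;   // hypothetical stream wrapping the codec
  ByteBuffer buf;
  while ((buf = in.read()) != null) {       // codec hands back a prefilled direct buffer
      consume(buf);                         // caller drains the decompressed bytes
      in.releaseBuffer(buf);                // buffer goes back to the codec's pool for reuse
  }

Returning null at end-of-stream is just one possible convention; the point is that the codec,
not the caller, owns and recycles the direct buffers.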

On the output side we have

void write(ByteBuffer buffer);
ByteBuffer getBuffer();

This saves a copy for HW / multithreaded codecs, which are especially sensitive to copies.
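
Again purely as an illustrative sketch (none of these names are from the patch), the caller's
side of the output path could be:

  ByteBufferCompressionStream out = ...;    // hypothetical stream wrapping the codec
  ByteBuffer buf = out.getBuffer();         // codec lends a buffer it sized itself
  fillWithUncompressedData(buf);            // caller writes its record into the lent buffer
  buf.flip();
  out.write(buf);                           // codec takes the buffer back and compresses, possibly asynchronously

so the codec can keep several such buffers in flight across cores without an intermediate copy.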

So, is this interface too much complexity / divergence from the ByteBufferReadable interface
to justify making upcoming fast compressors faster?
                
> Zero-copy ByteBuffer-based compressor / decompressor API
> --------------------------------------------------------
>
>                 Key: HADOOP-8148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8148
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Assignee: Tim Broberg
>         Attachments: hadoop8148.patch
>
>
> Per Todd Lipcon's comment in HDFS-2834, "
>   Whenever a native decompression codec is being used, ... we generally have the following copies:
>   1) Socket -> DirectByteBuffer (in SocketChannel implementation)
>   2) DirectByteBuffer -> byte[] (in SocketInputStream)
>   3) byte[] -> Native buffer (set up for decompression)
>   4*) decompression to a different native buffer (not really a copy - decompression necessarily rewrites)
>   5) native buffer -> byte[]
>   with the proposed improvement we can hopefully eliminate #2, #3 for all applications, and #2, #3, and #5 for libhdfs.
> "
> The interfaces in the attached patch attempt to address:
>  A - Compression and decompression based on ByteBuffers (HDFS-2834)
>  B - Zero-copy compression and decompression (HDFS-3051)
>  C - Provide the caller a way to know the maximum space required to hold the compressed output.
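
For point C, one hedged sketch of what such a query might look like (just an illustrative
signature, not necessarily what the attached patch defines):

  int maxCompressedLength(int uncompressedLength);   // worst-case compressed size, so the caller can size a destination buffer that is guaranteed to fit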

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
