hadoop-common-issues mailing list archives

From "Tim Broberg (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8148) Zero-copy ByteBuffer-based compressor / decompressor API
Date Tue, 03 Jul 2012 02:00:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405492#comment-13405492
] 

Tim Broberg commented on HADOOP-8148:
-------------------------------------

Ok, Owen. Sorry not to get back sooner. Last week was crazy.

Non-empty tarball with my proposal is attached.

Comments on yours are below, keeping in mind that I'm somewhat of a Java neophyte. If I say
something stupid, please try to be gentle:
 1 - By declaring an abstract stream class in the codec, you are forcing implementations to
extend those classes, which means they can't reuse any of the existing compression stream
code. What benefits do you see coming from this that warrant the tradeoff and/or what am I
missing?
 2 - What are the benefits of creating a separate codec / factory class vs just returning
a compression stream and checking whether it implements ZeroCopy<whatever>?
 3 - The read(ByteBuffer) definition just fundamentally doesn't work for my particular use
case, where there is significant latency on reads. If the caller provides the ByteBuffer,
then read() has to wait out the latency of the entire operation, or else perform a copy from
its own pool of completed buffers, which defeats the whole point of the exercise. This is
why I felt I needed "ByteBuffer readBuffer()" in my interface.
 4 - Am I seeing compress and decompress functions at the codec level? I take it you're trying
to hide the complex setInput / compress interface of the existing compressor - that is, you
want the compressors to still be available in non-stream form, but with a cleaner interface?
 5 - What about the codec pool? Do you see that disappearing?
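
To make point 3 concrete, here is a minimal sketch (all names hypothetical, not from either
attached proposal) of a decompressor that reads ahead on worker threads and keeps a pool of
completed buffers. With a caller-supplied buffer, read() is forced to copy out of that pool;
with readBuffer(), the caller gets the completed buffer directly:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of a read-ahead decompressor's buffer handling.
class ReadAheadDecompressor {
    // Buffers already decompressed by background read-ahead workers.
    private final Queue<ByteBuffer> completed = new ArrayDeque<ByteBuffer>();

    // Simulates the read-ahead worker handing over a finished buffer.
    void onBufferReady(ByteBuffer buf) {
        completed.add(buf);
    }

    // Caller-supplied buffer: the data already lives in one of our pooled
    // buffers, so we must copy it into 'dst' - the copy that defeats the
    // zero-copy goal (or else we'd have to block for the full op latency).
    int read(ByteBuffer dst) {
        ByteBuffer src = completed.poll();
        if (src == null) {
            return -1; // nothing ready; a real stream would block here
        }
        int n = src.remaining();
        dst.put(src); // the unavoidable copy
        return n;
    }

    // Returning our own buffer: no copy, the caller reads it directly.
    ByteBuffer readBuffer() {
        return completed.poll();
    }
}
```

The essential difference is ownership: read(ByteBuffer) fixes the destination before the data
exists, while readBuffer() lets the destination be whatever buffer the read-ahead already filled.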

You mentioned channels. One alternative approach would be to use the new Java 7 asynchronous
channels, but making compression require Java 7 would have some pretty broad implications
that I don't expect we're ready to deal with. Also, that approach makes the caller deal with
the asynchronicity instead of having the stream just read ahead and return buffers on command.
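
For illustration, here is what caller-managed asynchrony looks like with a plain
AsynchronousFileChannel (Java 7 NIO.2, nothing from either proposal): the caller, not the
stream, ends up holding the Future and deciding when to block or poll.

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

class AsyncReadDemo {
    // Reads the first chunk of the file; returns the number of bytes read.
    // The asynchrony leaks into the caller's code: it owns the Future.
    static int readFirst(Path p) throws Exception {
        AsynchronousFileChannel ch =
            AsynchronousFileChannel.open(p, StandardOpenOption.READ);
        try {
            ByteBuffer buf = ByteBuffer.allocate(16);
            Future<Integer> pending = ch.read(buf, 0); // caller holds the Future
            return pending.get();                      // caller must block or poll
        } finally {
            ch.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("async-demo", ".bin");
        Files.write(p, new byte[]{10, 20, 30});
        System.out.println(readFirst(p));
        Files.delete(p);
    }
}
```

With a read-ahead stream, by contrast, the Future-juggling happens inside the stream and the
caller just asks for the next completed buffer.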

Ideally, it would be great to see an interface that doesn't require the caller to deal with
pooling the buffers below its level in the stream, but I haven't found a way to do this that
doesn't require copies for streams that implement any kind of read-ahead.
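
One pattern that might soften this - purely a sketch with hypothetical names, not something
in either tarball - is to keep pool ownership inside the stream but let the caller borrow
completed buffers and hand them back, so read-ahead can recycle them without ever copying:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical borrow/release scheme: the stream owns the pool, the caller
// borrows filled buffers and returns them when done. No copies either way.
class BorrowingStream {
    private final Queue<ByteBuffer> free = new ArrayDeque<ByteBuffer>();
    private final Queue<ByteBuffer> ready = new ArrayDeque<ByteBuffer>();

    BorrowingStream(int buffers, int size) {
        for (int i = 0; i < buffers; i++) {
            free.add(ByteBuffer.allocateDirect(size));
        }
    }

    // Read-ahead worker side: grab a free buffer, fill it, queue it as ready.
    ByteBuffer takeFree()          { return free.poll(); }
    void markReady(ByteBuffer b)   { ready.add(b); }

    // Caller side: borrow a completed buffer (zero copies)...
    ByteBuffer readBuffer()        { return ready.poll(); }

    // ...and return it so the read-ahead can reuse it.
    void releaseBuffer(ByteBuffer b) { b.clear(); free.add(b); }

    int freeCount()                { return free.size(); }
}
```

The cost is an extra releaseBuffer() call in the caller's contract; the benefit is that the
pool never escapes the stream and read-ahead never has to copy into caller-owned storage.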

I'm sure at present I'm firmly in the minority in caring about multithreaded decompressors.
I've tried to keep the overhead innocuous such that it's worth the trouble. Any ideas you
have on addressing the problem more efficiently would be most welcome, but read(ByteBuffer)
doesn't do it.
                
> Zero-copy ByteBuffer-based compressor / decompressor API
> --------------------------------------------------------
>
>                 Key: HADOOP-8148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8148
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io, performance
>            Reporter: Tim Broberg
>            Assignee: Tim Broberg
>         Attachments: hadoop-8148.patch, hadoop8148.patch, zerocopyifc.tgz
>
>
> Per Todd Lipcon's comment in HDFS-2834, "
>   Whenever a native decompression codec is being used, ... we generally have the following copies:
>   1) Socket -> DirectByteBuffer (in SocketChannel implementation)
>   2) DirectByteBuffer -> byte[] (in SocketInputStream)
>   3) byte[] -> Native buffer (set up for decompression)
>   4*) decompression to a different native buffer (not really a copy - decompression necessarily rewrites)
>   5) native buffer -> byte[]
>   With the proposed improvement we can hopefully eliminate #2 and #3 for all applications, and #2, #3, and #5 for libhdfs.
> "
> The interfaces in the attached patch attempt to address:
>  A - Compression and decompression based on ByteBuffers (HDFS-2834)
>  B - Zero-copy compression and decompression (HDFS-3051)
>  C - Provide the caller a way to know the maximum space required to hold the compressed output.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
