hadoop-common-dev mailing list archives

From "paul sutter (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-195) transfer map output transfer with http instead of rpc
Date Sun, 07 May 2006 21:18:21 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378335 ] 

paul sutter commented on HADOOP-195:


buffer copies are nowhere near a bottleneck in hadoop, yet. right now we have lots of wins
just from getting our buffering right.

reducing buffer copies only matters when buffer copies are a bottleneck. you would have to
use a profiler to see how much time was being spent in your serialization/deserialization
code, for example. if your code is the bottleneck, then reducing buffer copies might not matter.
how long are your requests? if they are small, it's not likely to matter. if they are gigabytes,
then it could matter a lot.

other questions about your use of NIO:
- did you try using the native endian-ness with NIO? or the default? (the default is big-endian,
the evil Sun endian-ness)
- are you using direct buffers, or indirect? (indirect buffers still cost you a buffer copy
in user space)
- are you using memory mapping, or buffered io? (buffered io costs you a buffer copy in kernel
space)
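
as a rough illustration of the three NIO choices asked about above, here is a minimal sketch. the class name NioChoices and its method names are hypothetical and are not part of hadoop; only the java.nio calls are real. it shows a direct buffer switched to native byte order, a heap (indirect) buffer, a read-only memory mapping, and a plain channel read into a caller-supplied buffer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, for illustration only.
final class NioChoices {

    // Direct buffer in the platform's native byte order.
    // A new ByteBuffer is always BIG_ENDIAN until order() is called.
    static ByteBuffer nativeDirectBuffer(int capacity) {
        return ByteBuffer.allocateDirect(capacity).order(ByteOrder.nativeOrder());
    }

    // Heap ("indirect") buffer: channel I/O on it typically goes through a
    // temporary direct buffer, which is the extra user-space copy.
    static ByteBuffer heapBuffer(int capacity) {
        return ByteBuffer.allocate(capacity).order(ByteOrder.nativeOrder());
    }

    // Memory-map the whole file read-only. The mapping stays valid after
    // the channel is closed.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            mapped.order(ByteOrder.nativeOrder());
            return mapped;
        }
    }

    // Ordinary channel read into dst until dst is full or EOF is reached.
    // Returns the number of bytes read and leaves dst flipped for reading.
    static int readInto(Path file, ByteBuffer dst) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            int total = 0;
            while (dst.hasRemaining()) {
                int n = ch.read(dst);
                if (n < 0) {
                    break; // end of file
                }
                total += n;
            }
            dst.flip();
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("nio-choices", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4, 5, 6, 7, 8});

        ByteBuffer direct = nativeDirectBuffer(8);
        System.out.println("direct=" + direct.isDirect() + " order=" + direct.order());
        System.out.println("bytes read via channel: " + readInto(tmp, direct));
        System.out.println("bytes mapped: " + mapReadOnly(tmp).capacity());
    }
}
```

which of these is actually faster depends on the workload, so this is only a starting point for measuring, not a recommendation.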

of course, an honest-to-god unbuffered read is so much better than memory mapping. someone
who is more of a unix guy could help you figure out which linux filesystem supports real unbuffered
io, and how to make that happen from java. when you're memory mapped, it's hard to coerce the
system into doing the multimegabyte double-buffered reads that you really want to do if you
are interested in performance. you might have to use JNI to make that io fast. but again,
it's only worthwhile if you know where the bottlenecks are. windows nt is popular among sort
people because it's so easy to get honest-to-god unbuffered io.
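
to make "multimegabyte double-buffered reads" concrete, here is a minimal pure-java sketch. everything named here (DoubleBufferedReader, ChunkHandler, readAll, the 4 MB default) is hypothetical. it uses ordinary FileChannel reads, so the data still passes through the kernel page cache; it is not the unbuffered io described above, only the double-buffering pattern: one buffer is being filled by a background thread while the other is being processed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, for illustration only.
final class DoubleBufferedReader {

    static final int DEFAULT_CHUNK = 4 * 1024 * 1024; // assumed 4 MB chunk size

    interface ChunkHandler {
        void handle(ByteBuffer chunk); // chunk is flipped and ready to read
    }

    // Fill buf from the channel until it is full or EOF; returns bytes read.
    private static int fill(FileChannel ch, ByteBuffer buf) throws IOException {
        buf.clear();
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0) {
                break; // end of file
            }
        }
        buf.flip();
        return buf.limit();
    }

    private static Future<Integer> startFill(ExecutorService io, FileChannel ch, ByteBuffer buf) {
        Callable<Integer> task = () -> fill(ch, buf);
        return io.submit(task);
    }

    // Reads the whole file in chunkSize pieces, overlapping the read of
    // chunk N+1 with the handling of chunk N. Returns total bytes read.
    static long readAll(Path file, int chunkSize, ChunkHandler handler)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer[] bufs = {
                ByteBuffer.allocateDirect(chunkSize),
                ByteBuffer.allocateDirect(chunkSize)
            };
            int active = 0;
            long total = 0;
            Future<Integer> pending = startFill(io, ch, bufs[active]);
            while (true) {
                int n = pending.get();          // wait for the in-flight read
                if (n <= 0) {
                    break;                      // nothing left to read
                }
                ByteBuffer ready = bufs[active];
                active = 1 - active;
                pending = startFill(io, ch, bufs[active]); // read ahead
                handler.handle(ready);          // process while the next read runs
                total += n;
            }
            return total;
        } finally {
            io.shutdownNow();
        }
    }
}
```

a caller would pass something like `DoubleBufferedReader.readAll(path, DoubleBufferedReader.DEFAULT_CHUNK, chunk -> process(chunk))`, where `process` is whatever consumes the data. whether this beats a plain sequential read is something only a profiler can tell you.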

but again, none of that matters if you're not moving much data, or if you don't have a buffer
copy bottleneck.

using the JNI interface you mention sounds interesting. of course, if we're going to go non-pure-java,
we might as well use owen's idea of an http server to serve up the map output data, since
that server will already be tuned. we're using lighttpd here and getting super good performance
(for a different application of course).

i'm super glad there's an interest in performance here!


> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3

> The data transfer of the map output should be transferred via http instead of rpc, because
> rpc is very slow for this application and the timeout behavior is suboptimal. (server sends
> data and client ignores it because it took more than 10 seconds to be received.)
