hadoop-common-dev mailing list archives

From "Dominik Friedrich (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-195) transfer map output transfer with http instead of rpc
Date Sun, 07 May 2006 23:07:21 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378341 ] 

Dominik Friedrich commented on HADOOP-195:


The file IO tests were really pretty simple. To be honest, I don't remember my actual setup
anymore, only that I saw little difference between normal stream IO and NIO, but the data
was far from gigabytes in size.

I used default endianness, direct buffers and buffered IO. From what I have read, memory
mapping gives almost no performance gain over buffered IO for streaming IO. It makes sense if
you have random access within a limited region of a file. In both cases you have buffered IO
because of the OS's file system buffer. From what I know about OSs, there is no copy to kernel
space, and the file system buffer of a modern OS is hard to beat performance-wise. Unbuffered
IO would actually decrease performance because the file system cannot change the write order
or do other tricks to reduce seeks. In general I don't think there is much room for file
IO performance improvements in Hadoop, except by using e.g. APR through JNI.
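
For anyone who wants to repeat this kind of comparison, here is a rough sketch in Java. It is
not the original test (that setup is not recorded here). The class name IoComparison, the
64 KB buffer size and the 16 MB test file are all assumptions chosen for illustration. It
reads the same file sequentially through a buffered stream, an NIO direct buffer and a memory
mapping, and sums the bytes so the reads cannot be optimized away.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

public class IoComparison {
    // Assumed buffer size; not taken from the original test.
    private static final int BUFFER_SIZE = 64 * 1024;

    /** Sequential read through a buffered stream; returns the sum of all bytes. */
    public static long readWithStream(Path file) throws IOException {
        long sum = 0;
        byte[] buf = new byte[BUFFER_SIZE];
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), BUFFER_SIZE)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) sum += buf[i] & 0xFF;
            }
        }
        return sum;
    }

    /** Sequential read through a FileChannel into a direct ByteBuffer. */
    public static long readWithDirectBuffer(Path file) throws IOException {
        long sum = 0;
        ByteBuffer buf = ByteBuffer.allocateDirect(BUFFER_SIZE);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (ch.read(buf) != -1) {
                buf.flip();                       // switch from filling to draining
                while (buf.hasRemaining()) sum += buf.get() & 0xFF;
                buf.clear();                      // make the buffer ready for the next read
            }
        }
        return sum;
    }

    /** Sequential read through a read-only memory mapping of the whole file. */
    public static long readWithMmap(Path file) throws IOException {
        long sum = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes, which is
            // enough for a small test file like the one used here.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            while (map.hasRemaining()) sum += map.get() & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("io-comparison", ".bin");
        try {
            byte[] data = new byte[16 * 1024 * 1024]; // assumed test size, far from gigabytes
            new Random(42).nextBytes(data);
            Files.write(file, data);

            long t0 = System.nanoTime();
            long a = readWithStream(file);
            long t1 = System.nanoTime();
            long b = readWithDirectBuffer(file);
            long t2 = System.nanoTime();
            long c = readWithMmap(file);
            long t3 = System.nanoTime();

            System.out.printf("stream: %d ms (sum %d)%n", (t1 - t0) / 1_000_000, a);
            System.out.printf("direct: %d ms (sum %d)%n", (t2 - t1) / 1_000_000, b);
            System.out.printf("mmap:   %d ms (sum %d)%n", (t3 - t2) / 1_000_000, c);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Note that the file has just been written, so all three reads are most likely served from the
OS file system buffer. The timings therefore say little about disk speed, and a single
unwarmed run like this is only a rough indication.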

To improve the sorting performance I'd start by looking at the algorithm itself, because there
seem to be better algorithms out there. This huge difference in performance cannot be caused
by a suboptimal implementation. The bottlenecks are file and network IO, so the goal is to
reduce those. I haven't used Nutch/Hadoop for some time now, so I'm not up to date with the current

This is a really interesting problem; it could be a nice project for Google's Summer of Code.

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3

> The map output data should be transferred via http instead of rpc, because
> rpc is very slow for this application and the timeout behavior is suboptimal. (server sends
> data and client ignores it because it took more than 10 seconds to be received.)

This message is automatically generated by JIRA.