hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-195) transfer map output transfer with http instead of rpc
Date Fri, 05 May 2006 19:55:28 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378127 ] 

Doug Cutting commented on HADOOP-195:

> 1 gigabyte in 7 hours on 188 nodes

Note that the total size of the data sorted in this case is several terabytes.  Owen said
1G per reduce task, and I think he has a few thousand reduce tasks.  (Owen?)  So, you're right,
we're not yet down to 7 minutes/terabyte, but things still aren't quite as bad as you state.

Google's MapReduce paper reports a terabyte sort time of 891 seconds using a cluster of 1800
dual Xeon nodes.

The Indy test you cite (TB in 7 minutes on 80 Itaniums) also has fancier switches and disk
arrays than Google's cluster or Owen's cluster.  In particular, it's not clear that it would
easily scale to sorting 10 terabytes in 7 minutes on 800 Itaniums, since 800-port switches
are harder to find.
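
To put the figures above on one scale, here is a rough per-node throughput comparison. It is
only a back-of-the-envelope sketch: it assumes decimal terabytes (10**12 bytes) and ignores
hardware differences entirely. The total size of the 188-node run is not stated exactly
("several terabytes"), so it is left as a parameter, and the helper names are made up for
illustration.

```python
TB = 10**12  # assumption: decimal terabyte


def per_node_mb_per_s(total_bytes, seconds, nodes):
    """Average sort throughput per node in MB/s (1 MB = 10**6 bytes)."""
    return total_bytes / seconds / nodes / 1e6


# Figures quoted in the message.
google = per_node_mb_per_s(1 * TB, 891, 1800)   # 1 TB, 891 s, 1800 nodes
indy = per_node_mb_per_s(1 * TB, 7 * 60, 80)    # 1 TB, 7 min, 80 Itaniums


def hadoop_run(total_tb):
    """7 hours on 188 nodes; total_tb is a guess supplied by the caller."""
    return per_node_mb_per_s(total_tb * TB, 7 * 3600, 188)


print(f"Google MapReduce: {google:.2f} MB/s per node")
print(f"Indy sort:        {indy:.2f} MB/s per node")
# HYPOTHETICAL input: 3 TB is a placeholder, not a number from the thread.
print(f"188-node run at 3 TB: {hadoop_run(3):.2f} MB/s per node")
```

On these numbers the Indy result moves roughly 48 times more data per node per second than
the Google result, which is consistent with the point about fancier switches and disk arrays.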

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3

> The map output should be transferred via http instead of rpc, because rpc is very slow for
> this application and the timeout behavior is suboptimal. (The server sends data and the
> client ignores it because it took more than 10 seconds to be received.)

This message is automatically generated by JIRA.
