hadoop-common-dev mailing list archives

From "paul sutter (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-195) transfer map output transfer with http instead of rpc
Date Fri, 12 May 2006 01:27:09 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-195?page=all ]

paul sutter updated HADOOP-195:
-------------------------------

    Attachment: MapFileSimulator.java
                mapfilesimulator-sort2.txt
                mapfilesimulator-big.txt


I don't have a 188-node cluster, so I wrote a simulator to test the impact of file sizes and
buffer sizes on performance for a single node in Owen's test. The program has a simulated
map step, which creates the output files using the configured buffer size, and a copy step,
which copies those files.
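
As an illustration only (this is not the attached MapFileSimulator.java), the two steps
could be sketched roughly as below. The class name, method names, the 100-byte record size
and the 1MB test file are all hypothetical choices made for the sketch; only the idea of a
buffered write step followed by a buffered copy step comes from the description above.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a map-then-copy disk test; not the attached simulator.
class BufferSizeSketch {

    // Simulated map step: write totalBytes of filler data in small records,
    // letting a BufferedOutputStream of the given size batch the disk writes.
    static void simulateMap(Path file, long totalBytes, int bufferSize) throws IOException {
        byte[] record = new byte[100]; // assumed record size, for illustration
        try (OutputStream out =
                 new BufferedOutputStream(Files.newOutputStream(file), bufferSize)) {
            long written = 0;
            while (written < totalBytes) {
                int n = (int) Math.min(record.length, totalBytes - written);
                out.write(record, 0, n);
                written += n;
            }
        }
    }

    // Simulated copy step: read and write in chunks of bufferSize bytes.
    // Returns the number of bytes copied.
    static long simulateCopy(Path src, Path dst, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long copied = 0;
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                copied += n;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        int bufferSize = 4 * 1024;     // 4KB, as in the "sort2" configuration
        long fileBytes = 1024 * 1024;  // small test file, assumed for the sketch
        Path dir = Files.createTempDirectory("mapsim");
        Path src = dir.resolve("map-output");
        Path dst = dir.resolve("copied");

        long t0 = System.nanoTime();
        simulateMap(src, fileBytes, bufferSize);
        long t1 = System.nanoTime();
        long copied = simulateCopy(src, dst, bufferSize);
        long t2 = System.nanoTime();

        System.out.printf("map: %d ms, copy: %d ms, bytes copied: %d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, copied);
    }
}
```

A real comparison would need files much larger than RAM and many files per node;
otherwise the operating system's page cache hides most of the disk cost.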

The idea is to isolate filesystem/disk performance issues from any interaction with RPC, TCP,
switches, etc.

Results show an 8-10X speedup with larger files and buffers:

Configuration "sort2":
(32MB DFS blocks, 4KB buffer, 320 mappers/node, 356 reducers total, 10GB total data): 
- map phase: 48 minutes and 45 seconds
- copy phase: 70 minutes and 38 seconds

Configuration "big":
(1GB DFS blocks, 1MB buffer, 10 mappers/node, 356 reducers total, 10GB total data):
- map phase: 6 minutes and 24 seconds
- copy phase: 7 minutes and 56 seconds
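
For reference, the per-phase ratios can be recomputed from the timings listed above. The
class and method names below are hypothetical; only the four durations come from the
results. The arithmetic gives roughly 7.6X for the map phase, 8.9X for the copy phase, and
8.3X for the two phases combined.

```java
// Hypothetical helper that recomputes speedups from the reported timings.
class SpeedupSketch {

    // Convert a "minutes and seconds" duration to seconds.
    static int seconds(int minutes, int secs) {
        return minutes * 60 + secs;
    }

    // Ratio of the slow run's duration to the fast run's duration.
    static double speedup(int slowSeconds, int fastSeconds) {
        return (double) slowSeconds / fastSeconds;
    }

    public static void main(String[] args) {
        int sort2Map  = seconds(48, 45); // "sort2" map phase
        int sort2Copy = seconds(70, 38); // "sort2" copy phase
        int bigMap    = seconds(6, 24);  // "big" map phase
        int bigCopy   = seconds(7, 56);  // "big" copy phase

        System.out.printf("map speedup:   %.1f%n", speedup(sort2Map, bigMap));
        System.out.printf("copy speedup:  %.1f%n", speedup(sort2Copy, bigCopy));
        System.out.printf("total speedup: %.1f%n",
                speedup(sort2Map + sort2Copy, bigMap + bigCopy));
    }
}
```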

That final copy phase was only running at 30MB/sec, so it should be easy to move that across
the network if those 188 nodes were on one big switch. This is about half the speed of the
bare drive, so another 2X improvement is possible while still fitting within gigabit network
limits.
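
A quick check of that headroom claim, as a sketch under simple assumptions: a gigabit link
is taken as a raw 1000 Mbit/sec with protocol overhead ignored, and 1MB is taken as 10^6
bytes. The class and method names are hypothetical. Under those assumptions, 30MB/sec is
240 Mbit/sec and the doubled rate of 60MB/sec is 480 Mbit/sec, both below the link rate.

```java
// Hypothetical check of disk throughput against a gigabit link.
class NetworkHeadroomSketch {

    // Assumed raw line rate of gigabit Ethernet, ignoring protocol overhead.
    static final double GIGABIT_MBIT_PER_SEC = 1000.0;

    // Convert megabytes per second to megabits per second.
    static double toMegabitsPerSec(double megabytesPerSec) {
        return megabytesPerSec * 8.0;
    }

    // True if the given throughput fits within the assumed gigabit line rate.
    static boolean fitsGigabit(double megabytesPerSec) {
        return toMegabitsPerSec(megabytesPerSec) <= GIGABIT_MBIT_PER_SEC;
    }

    public static void main(String[] args) {
        double measured = 30.0;          // MB/sec, from the "big" copy phase
        double doubled = measured * 2.0; // the further 2X improvement

        System.out.printf("%.0f MB/sec = %.0f Mbit/sec, fits: %b%n",
                measured, toMegabitsPerSec(measured), fitsGigabit(measured));
        System.out.printf("%.0f MB/sec = %.0f Mbit/sec, fits: %b%n",
                doubled, toMegabitsPerSec(doubled), fitsGigabit(doubled));
    }
}
```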

Like Owen's sort test, this test generates 10GB of data per node, and the node I'm using has
4GB of RAM. The program is attached, along with output from a run using Owen's last
configuration and a run with larger files and buffers.

The program also has a configuration called "sort1" that matches the original configuration,
but it would take too long to run, so I didn't run it.

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3
>  Attachments: MapFileSimulator.java, data-transfer-chart.pdf, mapfilesimulator-big.txt,
>               mapfilesimulator-sort2.txt, netstat.log, netstat.xls
>
> The data transfer of the map output should be transferred via http instead of rpc, because
> rpc is very slow for this application and the timeout behavior is suboptimal. (The server
> sends data and the client ignores it because it took more than 10 seconds to be received.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

