hadoop-common-dev mailing list archives

From "paul sutter (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-195) transfer map output transfer with http instead of rpc
Date Sun, 07 May 2006 17:30:21 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378307 ] 

paul sutter commented on HADOOP-195:

Here are a few things you can try for your sort performance issues:
Nagle's algorithm can add up to 200ms to a TCP transaction, and could account for a chunk
of that average 385 milliseconds you're seeing. Nagle is enabled by default on Java sockets; you could
try disabling it with setTcpNoDelay on the RPC-server-side socket.

With 64,000 files, one Nagle delay per file could waste hours (64,000 x 200ms is over three and a half hours).
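A minimal sketch of the option in question; the helper and the unconnected socket are just for illustration, the setTcpNoDelay call is the point:

```java
import java.net.Socket;
import java.net.SocketException;

public class NoDelayExample {
    // Disable Nagle's algorithm on a connection so small writes go out
    // immediately instead of waiting up to 200ms to be coalesced. In a
    // real server this would be applied to each accepted socket.
    static void disableNagle(Socket s) throws SocketException {
        s.setTcpNoDelay(true);
    }

    public static void main(String[] args) throws Exception {
        Socket s = new Socket(); // unconnected socket, options still apply
        disableNagle(s);
        System.out.println("TCP_NODELAY=" + s.getTcpNoDelay());
    }
}
```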

Note that by disabling Nagle, you'll get some runt packets (for example, for the header and at
the boundary of each socket write). This shouldn't be a material performance issue, and you
can reduce it arbitrarily by increasing the size of the buffer (right now it's a hardcoded
8KB buffer).
It would be great if Java allowed the use of TCP_CORK and the use of sendfile(), since you could
avoid runts and avoid a few buffer copies, because sendfile() does it all in the kernel. That's
what HTTP servers do, which was your original suggestion.
Other TCP tuning could help. My intuition is that these wildly varying transfer times are
related to packet loss, and I expect some level of tuning to help this, but I haven't had time
to look into it. Disabling linger should help when packets are lost in the closing handshake.
 I'm also an advocate of much larger buffers, although that isn't an issue for your particular
case. Another minor issue (and not causing your particular performance issue) is that we should
turn on keepalives with a short retry, since without keepalives TCP does not _always_ detect
hung sessions, and the current code that ignores socket timeouts can leave us with a collection
of zombie sessions. But of course, that's not an issue for your situation.
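The linger and keepalive settings discussed above can both be flipped on a Java socket; a sketch, noting that the keepalive probe interval itself is an OS-level setting (e.g. tcp_keepalive_time on Linux) that Java does not control:

```java
import java.net.Socket;
import java.net.SocketException;

public class TcpTuning {
    // Apply the tuning discussed above to one socket:
    //  - SO_LINGER disabled so close() never blocks waiting for lost
    //    packets during the closing handshake,
    //  - SO_KEEPALIVE enabled so hung peers are eventually detected
    //    instead of leaving zombie sessions behind.
    static void tune(Socket s) throws SocketException {
        s.setSoLinger(false, 0); // disable linger
        s.setKeepAlive(true);    // enable keepalive probes
    }

    public static void main(String[] args) throws Exception {
        Socket s = new Socket();
        tune(s);
        // getSoLinger() returns -1 when linger is disabled
        System.out.println(s.getSoLinger() + " " + s.getKeepAlive());
    }
}
```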

Modern disks transfer at over 50MB/s, yet seek no faster than the drives used by the ancients.
15KB takes less than 300 microseconds to transfer to or from a SATA disk. Since a seek takes
probably more than 10 milliseconds, you're at least 97% seek-bound with a 15KB file. It's
actually worse: how many disk seeks does it take to create a file and then delete it later?
(I don't know.)
Transferring even 1MB whenever you touch disk is 30-50% seek-bound, because the 20ms it takes
to transfer 1MB is in the same ballpark as the seek time of the drive (I get 10-20ms when
I measure SATA seek). 
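The arithmetic behind those percentages, using the 50MB/s transfer rate and 10ms seek assumed above:

```java
public class SeekBound {
    // Fraction of total I/O time spent seeking, for one seek followed by
    // one sequential transfer of the given size.
    static double seekFraction(double bytes, double mbPerSec, double seekMs) {
        double transferMs = bytes / (mbPerSec * 1024 * 1024) * 1000.0;
        return seekMs / (seekMs + transferMs);
    }

    public static void main(String[] args) {
        // 15KB at 50MB/s with a 10ms seek: about 97% of the time is seek.
        System.out.printf("15KB: %.0f%%%n", 100 * seekFraction(15 * 1024, 50, 10));
        // 1MB at 50MB/s transfers in 20ms, comparable to the seek itself.
        System.out.printf("1MB:  %.0f%%%n", 100 * seekFraction(1024 * 1024, 50, 10));
    }
}
```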

You might as well transfer 4MB or 16MB each time you touch disk, then your code will still
work a few years from now.
The increase in drive transfer speed comes from increasing data density: the more bits per inch,
the more bits per second at a given rotational speed. Our grandfathers got away with using
4KB and 8KB block sizes in their JCL because data densities were just really low back then.

We're moving terabytes, yet the sort path writes map output to disk and reads it back five
times before the data reaches the reducer:
1 - mapper output, 
2 - copy individual map output files to the reducer, 
3 - concatenate little files into a big file, 
4 - write merge file when the sort buffer is full, 
5 - write merged data to disk.
Steps 3 and 5 are the first opportunity for a speedup because they exist purely for programmer
convenience. The sorter can read directly from a list of files instead of one file, and the
reducer could call the merger directly to collect the input records instead of storing them
to disk as an intermediate step.
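One way to realize the step-3 suggestion above is to present the list of copied map-output files to the sorter as a single logical stream; a sketch, not Hadoop's actual API, with illustrative names and buffer size:

```java
import java.io.*;
import java.util.*;

public class ConcatStream {
    // Expose a list of map-output files as one logical input stream, so
    // concatenating little files into a big file never touches disk.
    static InputStream openAll(List<File> parts) throws IOException {
        Vector<InputStream> streams = new Vector<>();
        for (File f : parts) {
            // large per-file buffer; 1MB is an illustrative choice
            streams.add(new BufferedInputStream(new FileInputStream(f), 1 << 20));
        }
        return new SequenceInputStream(streams.elements());
    }

    public static void main(String[] args) throws IOException {
        List<File> parts = new ArrayList<>();
        for (String text : new String[] {"map-0 ", "map-1"}) {
            File f = File.createTempFile("part", ".out");
            f.deleteOnExit();
            try (FileOutputStream out = new FileOutputStream(f)) {
                out.write(text.getBytes("UTF-8"));
            }
            parts.add(f);
        }
        try (InputStream in = openAll(parts)) {
            System.out.println(new String(in.readAllBytes(), "UTF-8"));
        }
    }
}
```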
Step 2 permits the mapper-to-reducer copy to be restarted if it fails, but strictly speaking
that data only needs to go to disk if the sort buffer doesn't have enough space left to hold
the entire file (since it would be easy to back out of a failed transfer as long as the sort
buffer doesn't fill up). Also, you could just start the sort early whenever you encounter a
file that would fit in the sort buffer but there isn't enough room left in the sort buffer to hold it.
Step 4 only needs to occur when the data is too large to fit into RAM.
Whether or not step 1 needs to occur could be a matter of debate.
The default 4KB buffer size is really bad for the merge phase (see suggestion 2). Since the
merger is reading from a bunch of merge files, from the perspective of the OS you are
performing one read at a time from a randomly-chosen merge file, so it's unlikely the OS will
figure out that you really want to sequentialize. 

It's a little daredevil to rely on Linux doing 1000-to-1 sequentialization anyway; you might
as well just use a big buffer and be done with it. I'm not sure of the benefit of the small
default buffer.
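Replacing the 4KB default is a one-line change when opening each merge input; a sketch, where the 16MB size is an assumption rather than a measured optimum:

```java
import java.io.*;

public class MergeReader {
    // Give each merge input its own large read buffer so every trip to
    // disk moves megabytes, not the 4KB default. 16MB per input is an
    // illustrative figure; it costs RAM proportional to the merge fan-in.
    static final int MERGE_READ_BUFFER = 16 * 1024 * 1024;

    static DataInputStream open(File f) throws IOException {
        return new DataInputStream(
            new BufferedInputStream(new FileInputStream(f), MERGE_READ_BUFFER));
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("merge", ".part");
        f.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(42);
        }
        try (DataInputStream in = open(f)) {
            System.out.println(in.readInt()); // 42
        }
    }
}
```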
The DFS write of the data coming out of the reducer currently writes a local copy of data
to disk, but this could be eliminated by using a 32MB RAM buffer instead of a disk file for
recoveries from connection failures. 
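The RAM buffer for replaying a failed DFS connection could look something like this; the class, its capacity, and the replay protocol are all assumptions sketched for illustration:

```java
import java.util.Arrays;

public class ReplayBuffer {
    // Hold the most recent unacknowledged output in RAM so a failed DFS
    // connection can be replayed without keeping a local disk copy.
    private final byte[] buf;
    private int len = 0;

    ReplayBuffer(int capacity) { buf = new byte[capacity]; }

    // Remember bytes as they are sent; if the unacknowledged window ever
    // exceeds capacity, the writer would have to fall back to disk.
    void record(byte[] b, int off, int n) {
        if (len + n > buf.length) throw new IllegalStateException("buffer full");
        System.arraycopy(b, off, buf, len, n);
        len += n;
    }

    // On reconnect, resend everything not yet acknowledged.
    byte[] replay() { return Arrays.copyOf(buf, len); }

    // Once the remote end acknowledges, the space can be reused.
    void acknowledge() { len = 0; }

    public static void main(String[] args) {
        ReplayBuffer rb = new ReplayBuffer(32 * 1024 * 1024); // 32MB, per the text
        byte[] chunk = "record".getBytes();
        rb.record(chunk, 0, chunk.length);
        System.out.println(new String(rb.replay()));
    }
}
```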
In general, we'd be better off using NIO instead of stream IO because of fewer buffer copies
and better control of buffering and endianness. Stream IO does at least two buffer copies
of the data: one in the kernel because of the disk cache, and at least one in Java.
Yes, I realize that this would be a major rewrite and less important than the others mentioned.
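As a small illustration of the endianness point: with NIO the byte order is a property of the buffer, instead of the fixed big-endian order of DataOutputStream.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class NioEndian {
    public static void main(String[] args) {
        // Byte order is chosen per buffer; direct buffers additionally let
        // channel I/O skip a Java-side copy.
        ByteBuffer b = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(1);
        System.out.println(b.get(0)); // prints 1: low byte first in little-endian
    }
}
```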

It would be interesting to calculate the number of buffer copies in the current sort path
above considering that you get two buffer copies when you write the disk and two when you
read the disk, and you have 5 extra visits to disk plus the network transfer. 
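A back-of-envelope version of that calculation, under the stated assumption that each of the five disk visits is one write plus one read, each costing two buffer copies:

```java
public class CopyCount {
    public static void main(String[] args) {
        int diskVisits = 5;          // the five steps in the sort path above
        int copiesPerWrite = 2;      // one in the page cache, one in Java
        int copiesPerRead = 2;       // same on the way back
        int total = diskVisits * (copiesPerWrite + copiesPerRead);
        System.out.println(total + " buffer copies, before the network transfer");
    }
}
```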

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3

> The data transfer of the map output should be transferred via http instead of rpc, because
> rpc is very slow for this application and the timeout behavior is suboptimal. (server sends
> data and client ignores it because it took more than 10 seconds to be received.)

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:
