hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-195) transfer map output transfer with http instead of rpc
Date Fri, 05 May 2006 18:09:31 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378086 ] 

Doug Cutting commented on HADOOP-195:

So you're seeing nearly a 50% timeout rate.  That's not good.  It would be nice to know more
about how these occur.  For example, if they're occurring because the client connection to
that host is busy, then increasing the number of serving threads alone won't help.  You'd
also need to disable or enhance connection pooling on the client.  Parallel RPC calls will
probably increase the timeout rate, but might still improve things, since (hopefully) one
of the requests will find an available server and keep the client's network interface busy.

Disabling pooling for these calls should not be hard.  Add an option to RPC.getProxy() to
disable connection pooling, and store this as a new field in the Invoker.  Then add an option
to Client.call() that it passes to getConnection() to bypass the connection cache.  Then,
in a finally clause of Client.call(), close the connection when noCacheConnections was passed.
You could even bypass the connection's thread and simply read the response directly under
Client.call(), since there will be no other calls multiplexed over the connection.
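
As a rough illustration of the steps above, here is a minimal, self-contained sketch.
It is not Hadoop's actual code: SketchClient, SketchConnection, sendAndReceive() and the
counters are hypothetical stand-ins, and only the names call(), getConnection() and
noCacheConnections are taken from the description. The RPC.getProxy()/Invoker plumbing
is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an RPC connection; not Hadoop's real class. */
class SketchConnection {
    final String host;
    boolean closed = false;

    SketchConnection(String host) { this.host = host; }

    /** Pretends to send a request and read the reply on the caller's thread. */
    String sendAndReceive(String request) { return "reply-to:" + request; }

    void close() { closed = true; }
}

/** Hypothetical client showing the noCacheConnections bypass. */
class SketchClient {
    private final Map<String, SketchConnection> cache = new HashMap<>();
    private int closedCount = 0;

    /** Returns a cached connection, or a fresh uncached one when asked to bypass the cache. */
    private synchronized SketchConnection getConnection(String host, boolean noCacheConnections) {
        if (noCacheConnections) {
            return new SketchConnection(host);   // never enters the cache
        }
        return cache.computeIfAbsent(host, SketchConnection::new);
    }

    String call(String request, String host, boolean noCacheConnections) {
        SketchConnection conn = getConnection(host, noCacheConnections);
        try {
            // No other calls share an uncached connection, so the reply
            // can be read directly here rather than by a reader thread.
            return conn.sendAndReceive(request);
        } finally {
            if (noCacheConnections) {
                conn.close();                    // dedicated connection: close after the call
                synchronized (this) { closedCount++; }
            }
        }
    }

    synchronized int cachedConnectionCount() { return cache.size(); }
    synchronized int closedConnectionCount() { return closedCount; }
}
```

With this shape, new SketchClient().call("ping", "hostA", true) opens, uses and closes a
private connection, while passing false reuses one cached connection per host.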

Enhancing pooling might be better yet.  One could, e.g., remove a connection from the pool
while a request is outstanding, and then add it back into the pool only if there were no I/O
errors, if there's not already a pooled connection to that host, and perhaps if the pool's
not too big.  So each call could have a dedicated connection.
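
The check-out/check-in policy described above might look roughly like the sketch below.
All names here (CheckoutPool, PooledConn, roundTrip(), maxIdle) are hypothetical and the
I/O is simulated; the empty-request failure exists only to exercise the error path.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical connection; roundTrip() fails on an empty request to simulate an I/O error. */
class PooledConn {
    final String host;
    boolean closed = false;

    PooledConn(String host) { this.host = host; }

    String roundTrip(String request) throws IOException {
        if (request.isEmpty()) {
            throw new IOException("simulated I/O error");
        }
        return "reply-to:" + request;
    }

    void close() { closed = true; }
}

/** Pool that lends each connection to exactly one call at a time. */
class CheckoutPool {
    private final Map<String, PooledConn> idle = new HashMap<>();
    private final int maxIdle;   // assumed cap on pooled connections

    CheckoutPool(int maxIdle) { this.maxIdle = maxIdle; }

    /** Removes the idle connection for this host, or opens a new one. */
    private synchronized PooledConn checkOut(String host) {
        PooledConn c = idle.remove(host);
        return c != null ? c : new PooledConn(host);
    }

    /** Re-pools the connection unless the host already has one or the pool is full. */
    private synchronized boolean checkIn(PooledConn c) {
        if (idle.containsKey(c.host) || idle.size() >= maxIdle) {
            return false;
        }
        idle.put(c.host, c);
        return true;
    }

    String call(String host, String request) throws IOException {
        PooledConn c = checkOut(host);   // dedicated to this call while outstanding
        boolean ok = false;
        try {
            String reply = c.roundTrip(request);
            ok = true;
            return reply;
        } finally {
            // Close on I/O error, or when the pool declines to take it back.
            if (!ok || !checkIn(c)) {
                c.close();
            }
        }
    }

    synchronized int idleCount() { return idle.size(); }
}
```

Because a connection is out of the pool while its request is outstanding, two concurrent
calls to the same host each get their own connection, and at most one of them is pooled
again afterwards.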

The ability to multiplex requests and responses over a single connection is still probably
an important feature for, e.g., Nutch's distributed search, where you might have a hundred
clients all broadcasting queries to a hundred backends.  In this case you don't want to have
10,000 connections, but rather just 100, and you don't want to force fast queries to wait
for slower queries to complete.  So dedicated connections per call should be an option, not

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3

> The data transfer of the map output should be transferred via http instead of rpc, because
> rpc is very slow for this application and the timeout behavior is suboptimal. (server sends
> data and client ignores it because it took more than 10 seconds to be received.)

This message is automatically generated by JIRA.
