hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-255) Client Calls are not cancelled after a call timeout
Date Fri, 26 May 2006 06:51:30 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12413382 ] 

Owen O'Malley commented on HADOOP-255:

There are three parts to this. The first part is that I'm just about done with converting
the map output to use http rather than rpc. So that aspect of this problem will go away.

The second part is that this behavior happens on all of the rpc's, not just map output transfer.
However, sending a message now to potentially prevent a future message is not necessarily
a winning game. If you did send a "cancel call" message, you'd probably want to remove the
call from the server's work queue if it is still waiting to be processed.
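For illustration, a "cancel call" handler on the server could do no more than the following sketch (hypothetical names; no such message exists in the code today): it only helps if the call is still sitting in the queue of work not yet started.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of handling a "cancel call" message. The best the
// server can do is pull the call out of the queue of not-yet-started work;
// if a handler thread has already picked it up, the cancel was wasted traffic.
class ServerWorkQueue {
    static class Call {
        final int id;
        Call(int id) { this.id = id; }
    }

    private final ConcurrentLinkedQueue<Call> waiting = new ConcurrentLinkedQueue<>();

    void enqueue(Call c) { waiting.add(c); }

    // Returns true only if the call had not started processing yet.
    boolean cancel(int callId) {
        return waiting.removeIf(c -> c.id == callId);
    }
}
```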

It is tempting to try a strategy where you check the age of a call when you start processing
it on the server and reject messages that are too old, but the problem is that it is _not_
10 seconds from the start of the call, but rather 10 seconds with no data received on the
socket, which is hard for the server to estimate. 
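To make the mismatch concrete, here is a minimal sketch (hypothetical names, not the actual ipc.Server code) of the tempting age-based check. It is wrong precisely because the client's timer is an idle timer that resets whenever any bytes arrive on the socket, so the server cannot reconstruct the real deadline from the call's arrival time alone.

```java
// Hypothetical sketch: age-based rejection on the server side.
class Call {
    final long receivedAt;  // when the server received the call, in ms
    Call(long receivedAt) { this.receivedAt = receivedAt; }
}

class Server {
    static final long CLIENT_TIMEOUT_MS = 10_000;

    // Tempting but incorrect: the client's real deadline is "no bytes
    // received for CLIENT_TIMEOUT_MS", and that idle timer resets whenever
    // any data arrives on the socket, so a call older than the timeout may
    // still have a live, waiting caller.
    boolean shouldProcess(Call call, long now) {
        return now - call.receivedAt <= CLIENT_TIMEOUT_MS;
    }
}
```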

The final part is that the server cannot assume that the return message from an rpc call
was actually received, and that is a problem in itself. For example, I had a problem with
pollForNewTask timing out and dropping tasks. I fixed that by adding a timeout so that after
a task is assigned to a task tracker, if it does not show up in a task tracker status message
within 10 minutes it is considered lost. However, this applies to _all_ of the rpc messages.
You always need to make sure that if the return value of an rpc call were to disappear into
thin air, the problem would eventually be detected. There are other instances of this kind
of problem that still exist in the code that need to be identified and fixed.
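The lost-task timeout described above might look roughly like this (illustrative names and structure; the actual JobTracker code differs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the "assigned but never reported" timeout: if a
// task handed to a task tracker is not mentioned in any status message
// within the timeout, treat it as lost so it can be rescheduled.
class AssignedTasks {
    static final long LOST_TASK_TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes

    private final Map<String, Long> assignedAt = new HashMap<>();

    void recordAssignment(String taskId, long now) {
        assignedAt.put(taskId, now);
    }

    // Called when a task tracker status message mentions the task.
    void recordStatus(String taskId) {
        assignedAt.remove(taskId);
    }

    // Periodic sweep: anything assigned but silent for too long is lost.
    List<String> findLostTasks(long now) {
        List<String> lost = new ArrayList<>();
        for (Iterator<Map.Entry<String, Long>> it = assignedAt.entrySet().iterator(); it.hasNext();) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() > LOST_TASK_TIMEOUT_MS) {
                lost.add(e.getKey());
                it.remove();
            }
        }
        return lost;
    }
}
```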

> Client Calls are not cancelled after a call timeout
> ---------------------------------------------------
>          Key: HADOOP-255
>          URL: http://issues.apache.org/jira/browse/HADOOP-255
>      Project: Hadoop
>         Type: Bug
>   Components: ipc
>     Versions: 0.2.1
>  Environment: Tested on Linux 2.6
>     Reporter: Naveen Nalam

> In ipc/Client.java, if a call times out, a SocketTimeoutException is thrown but the Call
object still exists on the queue.
> What I found was that when transferring very large amounts of data, it's common for queued-up
calls to time out. Yet even though the caller is no longer waiting, the request is still
serviced on the server and the data is sent to the client. After receiving the full response,
the client calls callComplete(), which is a no-op since nobody is waiting.
> The problem is that the calls that time out will retry, and the system gets into a situation
where data is being transferred around, but it's all data for timed-out requests and no progress
is ever made.
> My quick solution to this was to add a "boolean timedout" field to the Call object, which
I set to true whenever the queued caller times out. Then, when the client starts to pull
over the response data (in Connection::run), it first checks whether the Call has timed out
and, if so, immediately closes the connection.
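A minimal sketch of that workaround (field and method names are illustrative, not the exact ipc/Client.java code): the waiting thread marks the Call on timeout, and the receive loop closes the connection instead of draining a response nobody wants.

```java
// Illustrative sketch of the "boolean timedout" workaround.
class Call {
    volatile boolean timedout = false;  // set by the caller on SocketTimeoutException
}

class Connection {
    boolean closed = false;

    // In the receive loop: before pulling a (possibly large) response off
    // the wire, check whether the caller already gave up. Closing the
    // connection discards the stale response rather than reading it.
    boolean shouldReceive(Call call) {
        if (call.timedout) {
            close();        // drop the stale response on the floor
            return false;
        }
        return true;
    }

    void close() { closed = true; /* tear down the socket; reopen later */ }
}
```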
> I think a good fix for this is to queue requests on the client, and do a single sendParam
only when there is no outstanding request. This will allow closing the connection when receiving
a response for a request we no longer have pending, reopen the connection, and resend the
next queued request. I can provide a patch for this, but I've seen a lot of recent activity
in this area so I'd like to get some feedback first.
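The queueing proposal above could be sketched like this (a hypothetical structure, not a patch): calls wait in a client-side queue, only one is on the wire at a time, and a response that does not match the outstanding call signals that the connection should be closed and reopened before the next send.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: one outstanding request per connection.
class ClientCallQueue {
    private final Deque<Integer> pending = new ArrayDeque<>(); // call ids waiting to be sent
    private Integer outstanding = null;                        // id currently on the wire

    void submit(int callId) {
        pending.add(callId);
        maybeSend();
    }

    private void maybeSend() {
        if (outstanding == null && !pending.isEmpty()) {
            outstanding = pending.poll();
            // sendParam(outstanding) would go here
        }
    }

    // Returns true if the response matched the outstanding call; false means
    // the response is stale, so close and reopen the connection, then resend.
    boolean onResponse(int callId) {
        if (outstanding != null && outstanding == callId) {
            outstanding = null;
            maybeSend();
            return true;
        }
        return false;  // response for a call we no longer have pending
    }
}
```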

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:
