hadoop-common-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6762) exception while doing RPC I/O closes channel
Date Fri, 04 Jun 2010 02:25:57 GMT

https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875431#action_12875431

Todd Lipcon commented on HADOOP-6762:

bq. re: timeout, so if a server disappeared, the ping would fail and the RPC would fail that
way? if that's the case, then I think removing the timeout on the Future.get() is fine.

Yep, that should be the case. Of course, a server can stay up but be unresponsive (e.g., deadlocked).
In those cases it's annoying that clients get blocked forever, but I don't think we could change
the behavior to be timeout-based at this point without worrying that it would break lots
and lots of downstream users :(

bq. We have seen one case of distributed deadlock here on the IPC workers in the DN, so this
isn't 100% theory

Yep, I've seen internode deadlocks several times as well. Not pretty! However, I can't think
of a situation where this could happen here -- the only thing that can block one of these
sendParam calls is TCP backpressure on the socket, and that only happens when the network
is stalled. I don't see a case where allowing other threads to start sending would have unstalled
a prior sender.

We could actually enforce the max-one-thread-per-connection thing by synchronizing on Connection.this.out
*outside* the submission of the runnable. That way we know there's only one send going
on at a time, and we're using the thread exactly for avoiding interruption and nothing else.
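A minimal sketch of that idea (hypothetical names, not Hadoop's actual Client code): the caller takes the lock on the connection's output stream *before* submitting the write, so at most one sender runs per connection, and the pool thread exists purely so that an interrupt to the calling thread never lands mid-write on the stream itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SendParamSketch {
    private final ExecutorService sender = Executors.newSingleThreadExecutor();
    final ByteArrayOutputStream buf = new ByteArrayOutputStream(); // stands in for the socket stream
    private final DataOutputStream out = new DataOutputStream(buf);

    void sendParam(byte[] payload) throws Exception {
        synchronized (out) {               // lock held outside the submission, per the proposal
            Future<?> f = sender.submit(() -> {
                out.write(payload);        // only the pool thread touches the stream,
                out.flush();               // so a caller interrupt can't close the channel
                return null;
            });
            f.get();                       // no timeout, per the discussion above;
                                           // an interrupt here aborts only the wait
        }
    }

    public static void main(String[] args) throws Exception {
        SendParamSketch c = new SendParamSketch();
        c.sendParam("ping".getBytes());
        System.out.println(c.buf.toString());
        c.sender.shutdown();
    }
}
```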

> exception while doing RPC I/O closes channel
> --------------------------------------------
>                 Key: HADOOP-6762
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6762
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: sam rash
>            Assignee: sam rash
>         Attachments: hadoop-6762-1.txt, hadoop-6762-2.txt, hadoop-6762-3.txt, hadoop-6762-4.txt,
> If a single process creates two unique FileSystems to the same NN using FileSystem.newInstance(),
> and one of them issues a close(), the leasechecker thread is interrupted.  This interrupt
> races with the RPC namenode.renew() and can cause a ClosedByInterruptException.  This closes
> the underlying channel, and the other filesystem sharing the connection will get errors.
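The failure mode described above is standard java.nio behavior, which a minimal standalone demo (not Hadoop code, using a Pipe in place of the RPC socket) can show: interrupting a thread blocked in a channel operation throws ClosedByInterruptException and closes the channel for every user sharing it.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;

public class InterruptClosesChannel {
    public static void main(String[] args) throws Exception {
        Pipe pipe = Pipe.open();
        Thread reader = new Thread(() -> {
            try {
                // Blocks: nothing has been written to the pipe, like an RPC
                // thread waiting on renew() when the interrupt arrives.
                pipe.source().read(ByteBuffer.allocate(8));
            } catch (ClosedByInterruptException e) {
                System.out.println("caught ClosedByInterruptException");
            } catch (Exception e) {
                System.out.println("unexpected: " + e);
            }
        });
        reader.start();
        Thread.sleep(200);    // let the reader block inside read()
        reader.interrupt();   // analogous to the leasechecker interrupt racing the RPC
        reader.join();
        // The channel is now closed for everyone, not just the interrupted thread.
        System.out.println("source open after interrupt? " + pipe.source().isOpen());
    }
}
```

This is why the patch moves the actual I/O onto a separate thread: the interruptible-channel contract offers no way to interrupt a blocked thread without losing the channel.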

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
