hadoop-common-issues mailing list archives

From "sam rash (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6762) exception while doing RPC I/O closes channel
Date Wed, 12 May 2010 16:30:41 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866619#action_12866619 ]

sam rash commented on HADOOP-6762:

The general problem is that 'client' threads hold the socket and write to it directly to
send RPCs. If a client thread is interrupted while doing so, it leaves the socket in an
unusable state.
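
To make the failure mode concrete, here is a minimal sketch that reproduces it with plain java.nio, outside of Hadoop. It assumes the RPC socket behaves like any other InterruptibleChannel; the class name and the use of a Pipe in place of a real socket are purely illustrative. One thread ("client A") writes with its interrupt status set, which closes the channel, and a second write from another thread ("client B") then fails.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Pipe;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a Pipe stands in for the shared RPC socket.
class InterruptClosesChannel {

    /** Returns {client A saw ClosedByInterruptException, client B's write failed}. */
    static boolean[] run() {
        try {
            Pipe pipe = Pipe.open();
            Pipe.SinkChannel shared = pipe.sink(); // blocking mode by default
            AtomicBoolean closedByInterrupt = new AtomicBoolean(false);

            // Client A: interrupted before (or during) a blocking write.
            Thread clientA = new Thread(() -> {
                Thread.currentThread().interrupt();
                try {
                    shared.write(ByteBuffer.wrap(new byte[] {1}));
                } catch (ClosedByInterruptException e) {
                    closedByInterrupt.set(true); // the channel is now closed
                } catch (IOException e) {
                    // not expected in this sketch
                }
            });
            clientA.start();
            clientA.join();

            // Client B: was never interrupted, but shares the same channel.
            boolean clientBFailed = false;
            try {
                shared.write(ByteBuffer.wrap(new byte[] {2}));
            } catch (ClosedChannelException e) {
                clientBFailed = true;
            }
            return new boolean[] {closedByInterrupt.get(), clientBFailed};
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("client A closed channel by interrupt: " + r[0]);
        System.out.println("client B write failed afterwards:     " + r[1]);
    }
}
```

The point of the sketch is that the second writer did nothing wrong; it only shares the channel with a thread that was interrupted.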

I have a test for this general case and a patch that moves the actual socket writes to a
thread owned by the Client object.  This means a client can be interrupted without ruining
the socket for other clients.
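
The idea can be sketched roughly as follows. This is not the attached patch, only a minimal illustration under assumed names (SharedConnection, send, and the Pipe used in place of a socket are all hypothetical). Callers hand their payload to a single writer thread and wait on a Future, so an interrupt lands on the caller's wait and never on the thread that touches the channel.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.WritableByteChannel;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: all channel writes happen on one connection-owned thread.
class SharedConnection {
    private final WritableByteChannel channel;
    private final ExecutorService writer = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "connection-writer");
        t.setDaemon(true); // do not keep the JVM alive for this sketch
        return t;
    });

    SharedConnection(WritableByteChannel channel) {
        this.channel = channel;
    }

    /** Queues the write on the writer thread and waits for it to finish. */
    void send(byte[] payload) throws IOException, InterruptedException {
        Future<?> pending = writer.submit(() -> {
            ByteBuffer buf = ByteBuffer.wrap(payload);
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            return null;
        });
        try {
            // An interrupt surfaces here as InterruptedException. The queued
            // write is deliberately NOT cancelled with cancel(true), because
            // that would interrupt the writer thread and close the channel.
            pending.get();
        } catch (ExecutionException e) {
            throw new IOException("write failed", e.getCause());
        }
    }

    /** Interrupts one caller, then checks the channel still works for the next. */
    static boolean demo() {
        try {
            Pipe pipe = Pipe.open();
            SharedConnection conn = new SharedConnection(pipe.sink());

            Thread.currentThread().interrupt();   // simulate an interrupted client
            try {
                conn.send(new byte[] {1});
            } catch (InterruptedException expected) {
                // the caller gives up waiting; the channel is untouched
            }
            Thread.interrupted();                 // clear any leftover status

            conn.send(new byte[] {2});            // a later client still succeeds

            ByteBuffer in = ByteBuffer.allocate(2);
            while (in.hasRemaining()) {
                pipe.source().read(in);
            }
            return pipe.sink().isOpen() && in.get(0) == 1 && in.get(1) == 2;
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("channel usable after interrupted caller: " + demo());
    }
}
```

One consequence of this design is that an interrupted caller cannot tell whether its bytes were sent; in the sketch the queued write still goes out, which keeps the byte stream consistent for everyone else.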

Note:  other socket errors may also make the socket unusable. The patch doesn't handle
those; it is only intended to help with the interrupt case, which is common with
FileSystem.close().

We might also want to find a way to fail fast when RPC goes bad.  As far as I can tell
from watching this happen, the underlying RPC stays in a bad state until the filesystem
is closed.  It seems like we could fail on one operation, detect the bad socket, and
perhaps recreate the socket or the whole RPC object.  I'm not sure where this retry logic
should go.
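
As a rough illustration of the fail-fast idea, and only that, the sketch below fails the operation that hits the bad channel, discards the channel, and opens a fresh one on the next call. Everything here is hypothetical (ReconnectingSender, ChannelFactory, the Pipe used as a stand-in); it says nothing about where such logic would belong in the Hadoop RPC code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one operation fails, the next one gets a fresh channel.
class ReconnectingSender {
    /** Assumed hook for creating a new connection. */
    interface ChannelFactory {
        WritableByteChannel open() throws IOException;
    }

    private final ChannelFactory factory;
    private WritableByteChannel channel;
    private int opens;

    ReconnectingSender(ChannelFactory factory) {
        this.factory = factory;
    }

    synchronized void send(byte[] payload) throws IOException {
        // Detect a channel that was closed since the last call and replace it.
        if (channel == null || !channel.isOpen()) {
            channel = factory.open();
            opens++;
        }
        try {
            ByteBuffer buf = ByteBuffer.wrap(payload);
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
        } catch (IOException e) {
            // Fail this operation only; drop the channel so the next call reopens.
            try {
                channel.close();
            } catch (IOException ignored) {
                // already broken, nothing more to do
            }
            channel = null;
            throw e;
        }
    }

    synchronized WritableByteChannel current() {
        return channel;
    }

    synchronized int opens() {
        return opens;
    }

    /** Returns how many channels were opened across a simulated failure. */
    static int demo() {
        try {
            List<Pipe> pipes = new ArrayList<>(); // keep read ends referenced
            ReconnectingSender sender = new ReconnectingSender(() -> {
                Pipe p = Pipe.open();
                pipes.add(p);
                return p.sink();
            });
            sender.send(new byte[] {1});
            sender.current().close();     // simulate the channel going bad
            sender.send(new byte[] {2});  // transparently reopens
            return sender.opens();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("channels opened: " + demo());
    }
}
```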

> exception while doing RPC I/O closes channel
> --------------------------------------------
>                 Key: HADOOP-6762
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6762
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: sam rash
> If a single process creates two unique fileSystems to the same NN using FileSystem.newInstance(),
> and one of them issues a close(), the leasechecker thread is interrupted.  This interrupt
> races with the rpc namenode.renew() and can cause a ClosedByInterruptException.  This closes
> the underlying channel, and the other filesystem, which shares the connection, will get errors.

