hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4659) Root cause of connection failure is being lost to code that uses it for delaying startup
Date Mon, 17 Nov 2008 11:20:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648138#action_12648138
] 

Steve Loughran commented on HADOOP-4659:
----------------------------------------

Raghu, 

> why does Client wrap one IOException in another?

I don't know the original reason; HADOOP-3844 retained this feature and included the hostname/port
at fault, which is handy for identifying configuration problems. The patch only adds these diagnostics
to ConnectExceptions and passes the rest up.
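
To make the idea concrete, here is a minimal sketch of adding host/port diagnostics to a ConnectException while keeping its type, and passing every other IOException up untouched. The class name, the helper name wrapConnectFailure and the message text are hypothetical and are not taken from the attached patch.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;

public class WrapSketch {
    /**
     * Hypothetical helper (not the code in hadoop-4659.patch).
     * Adds the remote address to a ConnectException but keeps the
     * exception type, so callers that catch ConnectException still match.
     * Any other IOException is returned unchanged.
     */
    public static IOException wrapConnectFailure(InetSocketAddress addr, IOException e) {
        if (e instanceof ConnectException) {
            ConnectException wrapped = new ConnectException(
                "Call to " + addr + " failed on connection exception: " + e);
            wrapped.initCause(e); // keep the original exception as the cause
            return wrapped;
        }
        return e; // pass the rest up as-is
    }

    public static void main(String[] args) {
        // Assumed host and port, for illustration only; no lookup is performed.
        InetSocketAddress addr =
            InetSocketAddress.createUnresolved("jobtracker.example.org", 8012);
        IOException out =
            wrapConnectFailure(addr, new ConnectException("Connection refused"));
        System.out.println(out.getClass().getName() + ": " + out.getMessage());
    }
}
```

The point of the design is that the diagnostic text changes but the type does not, so any catch(ConnectException) further up the stack keeps working.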

>is this a vanilla 0.18?

I only work with SVN_HEAD; it's present there. If Hairong thinks it came in with HADOOP-2188,
then it also exists in 0.18, but that will need a different patch.

> Also , "org.apache.hadoop.ipc.Client.call" does not actually catch exception from getConnection()
...

Client.call doesn't catch the exception. The problem is that RPC.waitForProxy does, and it
handles ConnectException and SocketTimeoutException by logging, sleeping, and trying again.
That retry was not happening while the ConnectException was being downgraded to a wrapping
IOException, so the task tracker was failing if it came up before the job tracker, rather than
waiting quietly for the job tracker to come up. As a result there is a race condition in
cluster startup and the cluster is more brittle.

Here's where the exceptions get picked up in RPC.java:

  public static VersionedProtocol waitForProxy(Class protocol,
                                               long clientVersion,
                                               InetSocketAddress addr,
                                               Configuration conf
                                               ) throws IOException {
    while (true) {
      try {
        return getProxy(protocol, clientVersion, addr, conf);
      } catch(ConnectException se) {  // namenode has not been started
        LOG.info("Server at " + addr + " not available yet, Zzzzz...");
      } catch(SocketTimeoutException te) {  // namenode is busy
        LOG.info("Problem connecting to server: " + addr);
      }
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {
        // IGNORE
      }
    }
  }
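
The effect of the wrapping on a loop like this can be reproduced in isolation. The following is a small self-contained sketch, not Hadoop code: the Connector interface, flaky and waitFor are hypothetical stand-ins, the loop is bounded, and the sleep is omitted so the example finishes immediately.

```java
import java.io.IOException;
import java.net.ConnectException;

public class RetryDemo {
    /** Hypothetical stand-in for the call that opens the connection. */
    public interface Connector {
        String connect() throws IOException;
    }

    /** Bounded variant of the retry loop: only ConnectException triggers a retry. */
    public static String waitFor(Connector c, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConnectException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return c.connect();
            } catch (ConnectException ce) {
                last = ce; // server not available yet; try again
            }
        }
        throw last;
    }

    /**
     * Fails the first 'failures' calls, then succeeds. With downgrade=true the
     * ConnectException is wrapped in a plain IOException, as described above.
     */
    public static Connector flaky(int failures, boolean downgrade) {
        int[] remaining = {failures};
        return () -> {
            if (remaining[0]-- > 0) {
                ConnectException ce = new ConnectException("Connection refused");
                if (downgrade) {
                    throw new IOException("Call failed", ce);
                }
                throw ce;
            }
            return "proxy";
        };
    }

    public static void main(String[] args) throws IOException {
        // Type preserved: the loop retries until the server answers.
        System.out.println(waitFor(flaky(2, false), 5));
        // Type lost: the first failure escapes the loop without any retry.
        try {
            waitFor(flaky(2, true), 5);
        } catch (IOException e) {
            System.out.println("gave up on first failure: " + e.getMessage());
        }
    }
}
```

In the downgraded case the root cause is still reachable through getCause(), but a catch clause matches on the outer type only, which is why the retry branch is never taken.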



> Root cause of connection failure is being lost to code that uses it for delaying startup
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4659
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4659
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.18.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>             Fix For: 0.18.3
>
>         Attachments: hadoop-4659.patch
>
>
> In ipc.Client, the root cause of a connection failure is being lost as the exception is wrapped,
> hence the outside code, the one that looks for that root cause, isn't working as expected.
> The result is that you can't bring up a task tracker before the job tracker, and probably the same
> for a datanode before a namenode. The change that triggered this has not yet been located; I had
> thought it was HADOOP-3844, but I no longer believe this is the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

