hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
Date Tue, 18 Mar 2014 07:46:44 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938931#comment-13938931 ]

Jing Zhao commented on HDFS-6089:
---------------------------------

Thanks for the comments, Andrew and Todd!

bq. In EditLogTailer#doTailEdits, I believe that rolling the edit log right before is intended
to freshen up the edit log for consumption by the SbNN.
But in the current code, the auto trigger still runs periodically, which means we
cannot guarantee that the edit log is rolled before we call doTailEdits. During failover,
we call editLog.recoverUnclosedStreams() and EditLogTailer#catchupDuringFailover in FSNamesystem#startActiveServices
to guarantee that the SBN tails all of the edit log. Before failover, if we make the auto-roller
on the active NN more aggressive (as you suggested), we can still guarantee that the SBN will
not have to replay a lot of edits on failover. What do you think?

bq. we'll need to update its check period and thresholds to be more aggressive.
Yes, agreed. We should assign a smaller value to the sleep interval (maybe 2 minutes, the same
as the SBN).
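
To make the idea concrete, here is a rough sketch of what a more aggressive auto-roller check
could look like. This is not the actual NameNode code: the class name EditLogRollCheck, its
fields, and the threshold of 1000 transactions are all made up for illustration. Only the
2-minute interval comes from the discussion above.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of an auto-roller policy. The names and the threshold
 * value are assumptions for illustration, not the real HDFS implementation.
 */
public class EditLogRollCheck {
    private final long checkIntervalMs; // how long the roller sleeps between checks
    private final long txnThreshold;    // roll once the open segment exceeds this many txns

    public EditLogRollCheck(long checkIntervalMs, long txnThreshold) {
        if (checkIntervalMs <= 0 || txnThreshold < 0) {
            throw new IllegalArgumentException("interval must be > 0 and threshold >= 0");
        }
        this.checkIntervalMs = checkIntervalMs;
        this.txnThreshold = txnThreshold;
    }

    public long getCheckIntervalMs() {
        return checkIntervalMs;
    }

    /** Returns true when the open edit log segment is large enough to be rolled. */
    public boolean shouldRoll(long txnsInOpenSegment) {
        return txnsInOpenSegment > txnThreshold;
    }

    public static void main(String[] args) {
        // 2 minutes, matching the interval suggested above; 1000 is an assumed threshold.
        EditLogRollCheck check = new EditLogRollCheck(TimeUnit.MINUTES.toMillis(2), 1000);
        System.out.println("check interval (ms): " + check.getCheckIntervalMs());
        System.out.println("roll at 1001 txns? " + check.shouldRoll(1001));
    }
}
```

A shorter interval and a lower threshold both reduce how many edits can pile up in the
open segment, which is what bounds the replay work the SBN has to do on failover.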

bq. Maybe we should just have a shorter timeout on the rollEditLog call. Or somehow..
We can also do this. But having two auto-rollers running on the two NNs at the same time still
does not seem necessary to me.
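
For reference, the shorter-timeout option could look roughly like the sketch below. It uses only
java.util.concurrent and does not touch the real rollEditLog RPC: BoundedRollCall and
runWithTimeout are hypothetical names, and the Runnable stands in for whatever call would
otherwise block on the suspended NN.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounds how long a blocking call may take. */
public class BoundedRollCall {

    /**
     * Runs the call on a separate daemon thread and waits at most timeoutMs.
     * Returns true if the call finished in time, false if it timed out or
     * the waiting thread was interrupted.
     */
    public static boolean runWithTimeout(Runnable call, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-roll-call");
            t.setDaemon(true); // do not keep the JVM alive for a stuck call
            return t;
        });
        Future<?> future = executor.submit(call);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck call
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a call to a suspended peer: it just sleeps past the timeout.
        boolean finished = runWithTimeout(() -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 100);
        System.out.println("finished in time? " + finished);
    }
}
```

Note that cancelling the future only interrupts the worker thread. A call blocked in socket I/O
may not react to the interrupt, so in practice the timeout would more likely have to be set on
the RPC or socket itself.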

> Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch
>
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active.
> What was noticed is that sometimes the call to get the service state of nn2 got a socket
> timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)