hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
Date Wed, 19 Mar 2014 06:23:42 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940229#comment-13940229
] 

Jing Zhao commented on HDFS-6089:
---------------------------------

Hi Andrew, thanks for the explanation. I think I understand your concern now: rolling on
the ANN based only on the number of edits may cause issues in some scenarios. If there are
no further operations, the SBN may wait a long time to tail the edits that sit in an
in-progress segment.

bq. Checkpointing combines the edit log with the fsimage, and we purge unnecessary log segments
afterwards.
But I'm still a little confused about this part. I fail to see the difference between
time-based rolling triggered from the SBN and from the ANN. In the current code, the SBN still
triggers rolling through an RPC to the ANN. Also, this does not affect checkpointing and
purging: when the SBN does a checkpoint, both the SBN and the ANN purge old edits in their own
storage (the SBN purges before uploading the checkpoint, and the ANN purges after receiving
the new fsimage).

So I guess a possible solution may be to just let the ANN roll every 2 min. I think this
can achieve almost the same effect as the current mechanism, without delaying the failover.
Or do you see any counterexamples with this change?
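
To make the proposal concrete, here is a minimal sketch of a roll decision made locally on
the ANN, combining the existing edit-count trigger with a 2 min age trigger. The class name
EditLogRollPolicy, its parameters, and the choice to skip empty segments are all assumptions
for illustration; none of this is taken from the HDFS code or the attached patches.

```java
/**
 * Hypothetical helper (not an HDFS class): decides whether the active
 * NameNode should roll its in-progress edit log segment.
 */
final class EditLogRollPolicy {
    private final long maxEditsPerSegment;   // edit-count threshold (assumed)
    private final long maxSegmentAgeMillis;  // age threshold, e.g. 2 minutes

    EditLogRollPolicy(long maxEditsPerSegment, long maxSegmentAgeMillis) {
        this.maxEditsPerSegment = maxEditsPerSegment;
        this.maxSegmentAgeMillis = maxSegmentAgeMillis;
    }

    /**
     * @param editsInSegment   number of edits written to the in-progress segment
     * @param segmentAgeMillis time since the in-progress segment was opened
     */
    boolean shouldRoll(long editsInSegment, long segmentAgeMillis) {
        // Assumption: an empty segment has nothing for the standby to tail,
        // so rolling it would only create empty finalized segments.
        if (editsInSegment <= 0) {
            return false;
        }
        // Roll when either threshold is reached, so a quiet period cannot
        // leave edits stuck in an in-progress segment indefinitely.
        return editsInSegment >= maxEditsPerSegment
            || segmentAgeMillis >= maxSegmentAgeMillis;
    }
}
```

With thresholds of, say, 1000 edits and 120000 ms, a segment holding 5 edits would be rolled
once it is 2 min old, so the SBN can tail it without sending any RPC to the ANN.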

Back to the solution of changing the RPC timeout. It looks like we have not set a timeout
for this NN-->NN RPC right now (correct me if I'm wrong). Setting a timeout (e.g., 20s, like
the default timeout from client to NN) can of course improve the failover time in our test
case, but I still prefer the above solution because it makes the rolling behavior simpler
and more predictable (in particular, it removes the RPC call from the SBN to the ANN).
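
For comparison, the timeout alternative amounts to bounding how long the caller waits on a
peer that may be suspended. The sketch below shows that idea with plain java.util.concurrent
only; it is not the Hadoop RPC API, and BoundedCall and callWithTimeout are hypothetical
names. In HDFS itself the timeout would more likely be set on the RPC proxy or in
configuration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a blocking call but gives up after a deadline. */
final class BoundedCall {
    /**
     * Runs {@code call} on a worker thread and waits at most
     * {@code timeoutMillis}. Throws java.util.concurrent.TimeoutException
     * if the call has not completed by then (e.g. the peer is suspended).
     */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(call);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Interrupt the worker so a hung call does not leak a thread.
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap the remote roll request in callWithTimeout with a limit such as 20000 ms
and treat a TimeoutException as "the other NN is unreachable", instead of blocking the
transition to active.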

> Standby NN while transitioning to active throws a connection refused error when the prior
active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch
>
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active.
> What was noticed was that sometimes the call to get the service state of nn2 got a socket
timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
