hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
Date Wed, 19 Mar 2014 01:04:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940037#comment-13940037 ]

Andrew Wang commented on HDFS-6089:
-----------------------------------

Hi guys, hope I can clarify what I said:

bq. Could you please explain how we avoid this issue with the current strategy?

Checkpointing combines the edit log with the fsimage, and we purge unnecessary log segments
afterwards. It's okay for the standby to roll the edit log on a time basis because it's also
doing checkpoints, so it's clearing edit log segments as it's making them.
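
To make that concrete, here is a rough toy model (my own sketch, not NameNode code). It
assumes a checkpoint purges every finalized segment it covers, which is a simplification,
and the function name and parameters are made up for illustration:

```python
def segments_on_disk(total_seconds, roll_period, checkpoint_period=None):
    """Toy model: count finalized edit log segments left on disk.

    Assumption (simplified): every checkpoint purges all finalized
    segments. Pass checkpoint_period=None to model a node that rolls
    but never checkpoints.
    Returns (segments at the end, peak number of segments seen).
    """
    retained = 0
    peak = 0
    for t in range(1, total_seconds + 1):
        if t % roll_period == 0:
            retained += 1              # a roll finalizes one more segment
            peak = max(peak, retained)
        if checkpoint_period and t % checkpoint_period == 0:
            retained = 0               # checkpoint merges edits, segments purged
    return retained, peak


# Rolling every 2 minutes with an hourly checkpoint stays bounded.
print(segments_on_disk(7200, 120, 3600))   # (0, 30)
# The same roll period without checkpoints only ever grows.
print(segments_on_disk(7200, 120, None))   # (60, 60)
```

With checkpoints the segment count is capped by the number of rolls per checkpoint interval;
without them it grows for as long as the process runs.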

The same is not true on the active NN. In the absence of checkpointing, rolling on a time
basis produces a lot of small edit log segments that just accumulate. This is why the
active NN autoroller triggers on a large number of edits, not on time: it's trying to avoid
a single giant edit log segment by making some number of medium-sized ones. Rolling the
active on a time basis as well (which is good for standby tailing) would bring back the
problem of many small segments.
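
A quick sketch of the trade-off, assuming a constant edit rate. The rate, roll period, and
threshold below are illustrative numbers I picked, not values read from any configuration,
and the function names are hypothetical:

```python
def roll_by_time(edits_per_second, total_seconds, roll_period):
    """Sizes (in edits) of segments finalized when rolling every roll_period seconds."""
    sizes, current = [], 0
    for t in range(1, total_seconds + 1):
        current += edits_per_second
        if t % roll_period == 0:
            sizes.append(current)
            current = 0
    return sizes


def roll_by_count(edits_per_second, total_seconds, edit_threshold):
    """Sizes (in edits) of segments finalized when rolling at edit_threshold edits."""
    sizes, current = [], 0
    for _ in range(total_seconds):
        current += edits_per_second
        if current >= edit_threshold:
            sizes.append(current)
            current = 0
    return sizes


# One day at an assumed 50 edits/s, with no checkpoint purging anything.
by_time = roll_by_time(50, 86400, 120)
by_count = roll_by_count(50, 86400, 2_000_000)
print(len(by_time), by_time[0])    # 720 6000
print(len(by_count), by_count[0])  # 2 2000000
```

Under these assumptions, time-based rolling leaves 720 small segments after a day, while
count-based rolling leaves 2 medium-sized ones plus whatever is still in progress.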

I guess we can shrug and say this is alright, but if there is a workable alternative,
maybe we can pursue that instead.

> Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch
>
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active.
> What was noticed is that sometimes the call to get the service state of nn2 got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
