hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6089) Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
Date Wed, 19 Mar 2014 18:54:02 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940834#comment-13940834 ]

Andrew Wang commented on HDFS-6089:
-----------------------------------

Hmm, let me try to explain one more time. My concern wasn't about staleness here, but spamming
the edits dirs with a lot of small files.

I agree that having the ANN roll on a time basis is fine when SbNN and ANN are both up and
checkpointing is happening.

An issue arises if checkpoints aren't happening, either because the SbNN is down, or checkpoints
are otherwise broken (e.g. the "edit log op too big" bug, transfer timeouts with a large fsimage,
or some of the fallout from the PB-fication of the fsimage). In this scenario, the ANN will
keep rolling every 2 minutes, generating a lot of edit log segments that aren't being cleared
out. I've seen oblivious customers run for a month while checkpointing was nonfunctional,
and 2-minute rolling would lead to an awful lot of files. This is one reason why the ANN autoroller
rolls on a size basis rather than a time basis. It should generate fewer, larger segments, which
is more manageable.
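
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The 30-day
outage and the 2-minute roll interval come from the scenario above; the edit rate and the
size threshold (expressed here in transactions) are purely hypothetical values chosen for
illustration, not measured figures or Hadoop defaults.

```python
def segments_from_time_roll(days, roll_interval_minutes):
    """Segments produced when the edit log rolls on a fixed timer."""
    total_minutes = days * 24 * 60
    return total_minutes // roll_interval_minutes


def segments_from_size_roll(days, txns_per_day, txns_per_segment):
    """Segments produced when the edit log rolls only after a size threshold."""
    total_txns = days * txns_per_day
    # Ceiling division: a partially filled last segment still counts as a file.
    return -(-total_txns // txns_per_segment)


# Scenario from the comment: checkpointing broken for a month, rolling every 2 minutes.
time_based = segments_from_time_roll(days=30, roll_interval_minutes=2)

# Hypothetical workload: 1M transactions/day, roll after 2M transactions.
size_based = segments_from_size_roll(
    days=30, txns_per_day=1_000_000, txns_per_segment=2_000_000
)

print(time_based)  # 21600
print(size_based)  # 15
```

Under these assumptions, time-based rolling leaves tens of thousands of small segments
behind, while size-based rolling leaves a handful of larger ones. The exact ratio depends
entirely on the real edit rate and threshold.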

This isn't a super major issue, but I thought I'd bring it up as a reason why we might prefer
a different solution.

> Standby NN while transitioning to active throws a connection refused error when the prior active NN process is suspended
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6089
>                 URL: https://issues.apache.org/jira/browse/HDFS-6089
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-6089.000.patch, HDFS-6089.001.patch
>
>
> The following scenario was tested:
> * Determine Active NN and suspend the process (kill -19)
> * Wait about 60s to let the standby transition to active
> * Get the service state for nn1 and nn2 and make sure nn2 has transitioned to active.
> What was noticed is that sometimes the call to get the service state of nn2 got a socket timeout exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
