hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5058) QJM should validate startLogSegment() more strictly
Date Tue, 06 Aug 2013 18:39:48 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731070#comment-13731070 ]

Todd Lipcon commented on HDFS-5058:
-----------------------------------

Hi Fengdong. I think improving the user experience of broken setups is a separate task from
this JIRA, which is just a bug fix. I don't want to let the scope creep here, since it's an
important fix for data safety.

Additionally, always telling the admin to copy the data dir between nodes is dangerous --
once we're in an inconsistent state, an expert should really look at it to understand the
correct recovery. Giving resolution advice in an error message is risky: since we're already
in a bad state, we may end up giving the wrong advice.
                
> QJM should validate startLogSegment() more strictly
> ---------------------------------------------------
>
>                 Key: HDFS-5058
>                 URL: https://issues.apache.org/jira/browse/HDFS-5058
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: qjm
>    Affects Versions: 3.0.0, 2.1.0-beta
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-5058.txt
>
>
> A small handful of times, we've seen a case where one of the NNs in an HA cluster ends
> up with an fsimage checkpoint that falls in the middle of an edit segment. We're not sure
> yet how this happens, but the following problem can result:
> - Node has fsimage_500. Cluster has edits_1-1000, edits_1001_inprogress
> - Node restarts, loads fsimage_500
> - Node wants to become active. It calls selectInputStreams(500). Currently, this API
>   logs a WARN that 500 falls in the middle of the 1-1000 segment, but continues and returns
>   no results.
> - Node calls startLogSegment(501).
> Currently, the QJM will accept this (incorrectly). The node then crashes when it first
> tries to journal a real transaction, but it ends up leaving the edits_501_inprogress lying
> around, potentially causing more issues later.
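
One way to picture the stricter check described above is the sketch below. It is a minimal
illustration in Java, not the attached patch and not the real QJM code: the class
SegmentStartCheck, its nested Segment type, and the validateStart method are hypothetical
names, and the sketch assumes the journal can list its finalized segments as inclusive txid
ranges. The idea is only that a request to start a segment at a txid already covered by a
finalized segment is refused up front instead of being accepted.

```java
import java.util.List;

/**
 * Hypothetical sketch of a stricter start-of-segment check.
 * Not the HDFS-5058 patch and not the real QJM Journal class.
 */
public final class SegmentStartCheck {

    /** A finalized edit log segment covering [firstTxId, lastTxId], inclusive. */
    public static final class Segment {
        final long firstTxId;
        final long lastTxId;

        public Segment(long firstTxId, long lastTxId) {
            if (firstTxId > lastTxId) {
                throw new IllegalArgumentException(
                    "firstTxId " + firstTxId + " is after lastTxId " + lastTxId);
            }
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }
    }

    private SegmentStartCheck() {}

    /**
     * Throws IllegalStateException if newFirstTxId is already covered by
     * one of the given finalized segments; otherwise returns normally.
     */
    public static void validateStart(List<Segment> finalized, long newFirstTxId) {
        for (Segment s : finalized) {
            boolean covered = newFirstTxId >= s.firstTxId && newFirstTxId <= s.lastTxId;
            if (covered) {
                throw new IllegalStateException(
                    "Cannot start segment at txid " + newFirstTxId
                    + ": already covered by finalized segment "
                    + s.firstTxId + "-" + s.lastTxId);
            }
        }
    }
}
```

With a single finalized segment covering 1-1000, as in the scenario above,
validateStart(segments, 501) throws IllegalStateException, while validateStart(segments, 1001)
returns normally. In-progress segments are deliberately left out of this sketch.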

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
