hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2955) HA: IllegalStateException during standby startup in getCurSegmentTxId
Date Thu, 16 Feb 2012 08:30:59 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209221#comment-13209221 ]

Aaron T. Myers commented on HDFS-2955:
--------------------------------------

bq. What is the behavior by standby, with this patch, if it has completely read the last segment
and is waiting for the new segment to be completed? I believe in that case it would anyway
return zero.

Not quite. With this patch, if the standby NN has never been in the active state, the metric
will always output 18179, probably because of some oddity with the way metrics output negative
values (since curSegmentTxId is initially set to HdfsConstants.INVALID_TXID, which is -12345.)
This is obviously incorrect. If the standby NN has previously been in the active state, this
metric will always output 2, which is also incorrect.

bq. We will end up reading from in_progress log for automatic failover to reduce the failover
times.

Maybe. I strongly suspect that the time for automatic failover will be greatly dominated by
the time to detect failure of the active and fence it, not the time it takes to read the most
recent edit log segment once we've decided to fail over, in which case this optimization of
reading in-progress edit logs will provide little benefit.

Regardless, this isn't how it's implemented now.

bq. This would be one less place to change when standby starts reading from in_progress.

Except that we should write a test that this metric outputs the correct values, in which case
this code might change anyway. We don't yet know how reading in-progress edit logs will be
implemented.

bq. Regarding testing, any HA test will run into it. I have a 100% hit rate on the actual
cluster

Sure, but none of the tests will _fail_ because of this error, will they? You'll see an error
in the NN log, but only if you look for it. And even if tests were failing without this patch,
there's still no test asserting that the metric outputs the correct value in the case of the
standby NN.
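
A minimal sketch of the kind of assertion meant here, using a simplified stand-in model rather than the real FSEditLog. The class and method names (EditLogModel, startSegment, logEdit, transactionsSinceLastLogRoll) are hypothetical, and the sketch assumes that zero is the intended metric value when no segment is open for writing, which this thread does not settle.

```java
// Simplified stand-in for an edit log. This is NOT the real FSEditLog;
// all names below are hypothetical and exist only for illustration.
class EditLogModel {
    // Sentinel value quoted in the comment above.
    static final long INVALID_TXID = -12345L;

    private boolean segmentOpen = false;          // a standby never opens a segment for writing
    private long curSegmentTxId = INVALID_TXID;
    private long lastWrittenTxId = 0L;

    // Active NN only: open a new segment starting at firstTxId.
    void startSegment(long firstTxId) {
        segmentOpen = true;
        curSegmentTxId = firstTxId;
        lastWrittenTxId = firstTxId - 1;
    }

    // Active NN only: record one transaction in the open segment.
    void logEdit() {
        if (!segmentOpen) {
            throw new IllegalStateException("no segment open for writing");
        }
        lastWrittenTxId++;
    }

    // Mirrors the reported failure: the getter assumes an open segment.
    long getCurSegmentTxId() {
        if (!segmentOpen) {
            throw new IllegalStateException("no segment open for writing");
        }
        return curSegmentTxId;
    }

    // Guarded metric: never touches the getter unless a segment is open.
    // Returning 0 in the closed case is an assumption of this sketch.
    long transactionsSinceLastLogRoll() {
        if (!segmentOpen) {
            return 0L;
        }
        return lastWrittenTxId - getCurSegmentTxId() + 1;
    }
}

class EditLogModelDemo {
    public static void main(String[] args) {
        EditLogModel standby = new EditLogModel();
        System.out.println("standby metric: " + standby.transactionsSinceLastLogRoll());

        EditLogModel active = new EditLogModel();
        active.startSegment(1);
        active.logEdit();
        active.logEdit();
        active.logEdit();
        System.out.println("active metric: " + active.transactionsSinceLastLogRoll());
    }
}
```

A real test would make the same two checks against an actual standby and an actual active NN: that reading the metric does not raise an exception, and that the value matches whatever the agreed-upon semantics turn out to be.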
                
> HA: IllegalStateException during standby startup in getCurSegmentTxId
> ---------------------------------------------------------------------
>
>                 Key: HDFS-2955
>                 URL: https://issues.apache.org/jira/browse/HDFS-2955
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
>         Attachments: HDFS-2955-HDFS-1623.patch, HDFS-2955-HDFS-1623.patch
>
>
> During standby restarts, a new routine getTransactionsSinceLastLogRoll() has been introduced
> for metrics which is calling getCurSegmentTxId(). checkstate() in getCurSegmentTxId() assumes
> that log is opened for writing and this is not the case in standby.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
