hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3906) QJM: quorum timeout on failover with large log segment
Date Fri, 07 Sep 2012 21:30:07 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450998#comment-13450998 ]

Todd Lipcon commented on HDFS-3906:

There are two ways to solve this:

1) Avoid log validation during the recovery step

This has the advantage of a faster failover.

2) Increase the default timeouts

Originally the timeouts were set pretty low (~10sec) because I didn't think about the O(n)
nature of this step. It wouldn't be unreasonable to bump them to a couple minutes (only downside
would be a slower failure case when a quorum is actually down).

I'm going to make an attempt at solution #1, but if it gets too complex, will punt and do
solution #2.
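If we end up going with solution #2, the bump would just be a config change. A sketch of what that could look like in hdfs-site.xml, assuming the per-RPC QJM timeout keys on the HDFS-3077 branch (the exact property names here, e.g. dfs.qjournal.new-epoch.timeout.ms, are an assumption based on the current branch naming and may differ):

```xml
<!-- Hypothetical hdfs-site.xml fragment: raise the QJM recovery-path
     timeouts from ~10s to 2 minutes so validating a large in-progress
     segment doesn't trip the quorum timeout during failover. -->
<property>
  <name>dfs.qjournal.new-epoch.timeout.ms</name>
  <value>120000</value>
</property>
<property>
  <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
  <value>120000</value>
</property>
```

The tradeoff, as noted above, is that a genuinely down quorum now takes up to two minutes to be declared failed instead of ten seconds.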

> QJM: quorum timeout on failover with large log segment
> ------------------------------------------------------
>                 Key: HDFS-3906
>                 URL: https://issues.apache.org/jira/browse/HDFS-3906
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
> In doing some stress tests, I ran into an issue with failover if the current edit log
> segment written by the old active is large. With a 327MB log segment containing 6.4M transactions,
> the JN took ~11 seconds to read and validate it during the recovery step. This was longer
> than the 10 second timeout for createNewEpoch, which caused the recovery to fail.
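The numbers in the description make the failure mode easy to see with back-of-envelope math: at the observed validation rate, any segment larger than roughly 300MB exceeds the 10-second timeout. A small self-contained sketch (the figures are taken from the report above; nothing here is QJM code):

```java
/**
 * Back-of-envelope check of why a 327MB segment blows the
 * 10-second createNewEpoch timeout, using the figures reported
 * in the issue description.
 */
public class RecoveryTimeoutEstimate {
    public static void main(String[] args) {
        double segmentMb = 327.0;       // observed in-progress segment size
        double validateSeconds = 11.0;  // observed JN read+validate time
        double timeoutSeconds = 10.0;   // createNewEpoch timeout

        // Effective validation throughput of the JournalNode.
        double throughputMbPerSec = segmentMb / validateSeconds;

        // Largest segment that can be validated before the timeout fires.
        double maxMbWithinTimeout = throughputMbPerSec * timeoutSeconds;

        System.out.printf("throughput ~%.1f MB/s, segment ceiling ~%.0f MB%n",
                throughputMbPerSec, maxMbWithinTimeout);
        // The 327MB segment is above the ceiling, so recovery times out.
    }
}
```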

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
