hadoop-hdfs-issues mailing list archives

From "Ekanth Sethuramalingam (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-14317) Standby does not trigger edit log rolling when in-progress edit log tailing is enabled
Date Thu, 28 Feb 2019 22:37:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780997#comment-16780997 ]

Ekanth Sethuramalingam commented on HDFS-14317:

Thanks for the review, Erik.
{quote} * The active parameter to waitForStandbyToCatchUpWithInProgressEdits() is not used.{quote}
Good catch. Will do.
{quote} * The until parameters could use a more descriptive name, like maxWaitSec{quote}
Will do.
{quote} * Why do we need to test with both nn0 and nn1 as active? Is there a difference between
these two situations?{quote}
I thought about it, but followed the other test, which cycles between the namenodes. I am not
sure whether I would miss something by not doing it. I'll remove this for now.
{quote} * It seems like there's a pretty long wait time involved. If we make the edit tail
period to something like 10ms, can we reduce the wait times?{quote}
I did think about this. Unfortunately, {{dfs.ha.tail-edits.period}} and {{dfs.ha.log-roll.period}}
are both in seconds.
{quote} * For the checkForLogRoll call on L406, it seems like waitFor() is a bit overkill
here. Can't we just sleep for 2s and then check if it has changed?{quote}
Using waitFor() allows for reuse, consistency with the other waits in the test class, and fast
failure. I also don't see this as a significant cost worth optimizing here. I can change it if
you feel strongly about it.
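
For context on the waitFor() versus fixed-sleep trade-off, here is a minimal standalone sketch of a polling wait. It is not Hadoop's test utility: the class name PollingWaitSketch, the method signature, and the use of an unchecked exception on timeout are all assumptions made for illustration. It returns as soon as the condition holds and throws once the deadline passes, whereas a fixed 2s sleep always pays the full delay.

```java
import java.util.function.BooleanSupplier;

// Standalone sketch (hypothetical names); not Hadoop's own test helper.
class PollingWaitSketch {

  /**
   * Polls the condition every checkEveryMillis until it is true.
   * Throws IllegalStateException if waitForMillis elapses first.
   */
  static void waitFor(BooleanSupplier check, long checkEveryMillis, long waitForMillis) {
    long deadlineNanos = System.nanoTime() + waitForMillis * 1_000_000L;
    while (!check.getAsBoolean()) {
      // Overflow-safe comparison of nanoTime values.
      if (System.nanoTime() - deadlineNanos >= 0) {
        throw new IllegalStateException(
            "Condition not met within " + waitForMillis + " ms");
      }
      try {
        Thread.sleep(checkEveryMillis);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and stop waiting.
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }
    }
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    long readyAt = start + 200L * 1_000_000L; // condition becomes true after ~200 ms
    waitFor(() -> System.nanoTime() - readyAt >= 0, 50, 2_000);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
    System.out.println("condition met after ~" + elapsedMs + " ms");
  }
}
```

In this sketch the caller waits only as long as the condition needs, up to the 2s limit, and a timeout surfaces as an exception rather than as a later assertion failure.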

> Standby does not trigger edit log rolling when in-progress edit log tailing is enabled
> --------------------------------------------------------------------------------------
>                 Key: HDFS-14317
>                 URL: https://issues.apache.org/jira/browse/HDFS-14317
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.0, 3.0.0
>            Reporter: Ekanth Sethuramalingam
>            Assignee: Ekanth Sethuramalingam
>            Priority: Critical
>         Attachments: HDFS-14317.001.patch
> The standby uses the following method to check whether it is time to trigger an edit log
roll on the active.
> {code}
>   /**
>    * @return true if the configured log roll period has elapsed.
>    */
>   private boolean tooLongSinceLastLoad() {
>     return logRollPeriodMs >= 0 &&
>       (monotonicNow() - lastLoadTimeMs) > logRollPeriodMs;
>   }
> {code}
> In doTailEdits(), lastLoadTimeMs is updated whenever the standby successfully tails
any edits:
> {code}
>       if (editsLoaded > 0) {
>         lastLoadTimeMs = monotonicNow();
>       }
> {code}
> The default for {{dfs.ha.log-roll.period}} is 120 seconds and the default for {{dfs.ha.tail-edits.period}}
is 60 seconds. With in-progress edit log tailing enabled, nearly every tail cycle loads some
edits and resets lastLoadTimeMs, so tooLongSinceLastLoad() will almost never return true. As a
result, edit logs are not rolled for a long time, until
{{dfs.namenode.edit.log.autoroll.multiplier.threshold}} takes effect.
> [In our deployment, this resulted in in-progress edit logs getting deleted. The sequence
of events was that the standby was able to checkpoint twice while the in-progress edit log was
growing on the active. When the NNStorageRetentionManager decided to clean up old checkpoints and
edit logs, it removed the in-progress edit log from the active and QJM (as the txnid on the in-progress
edit log was older than the 2 most recent checkpoints), resulting in the irrecoverable loss of
a few minutes' worth of metadata].
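
The interaction described above can be illustrated with a small simulation. This is a simplified model under stated assumptions, not the real EditLogTailer: the class LogRollCheckSketch, the method countRollTriggers, and the fake clock are hypothetical. It assumes that with in-progress tailing every tail cycle loads some edits, that without it edits are loaded only after a triggered roll has finalized a segment, and it ignores the active's own auto-roll.

```java
// Simplified model of the standby's roll check (hypothetical names).
class LogRollCheckSketch {
  // Defaults quoted in the issue, converted to milliseconds.
  static final long TAIL_PERIOD_MS = 60_000L;      // dfs.ha.tail-edits.period
  static final long LOG_ROLL_PERIOD_MS = 120_000L; // dfs.ha.log-roll.period

  // Same condition as tooLongSinceLastLoad(), with the clock passed in.
  static boolean tooLongSinceLastLoad(long nowMs, long lastLoadTimeMs) {
    return LOG_ROLL_PERIOD_MS >= 0
        && (nowMs - lastLoadTimeMs) > LOG_ROLL_PERIOD_MS;
  }

  /** Counts how often the standby would trigger a roll over the given tail cycles. */
  static int countRollTriggers(int cycles, boolean inProgressTailing) {
    long nowMs = 0L;           // stand-in for monotonicNow()
    long lastLoadTimeMs = 0L;
    boolean finalizedSegmentAvailable = false;
    int triggers = 0;
    for (int i = 0; i < cycles; i++) {
      nowMs += TAIL_PERIOD_MS; // one tail period passes
      if (tooLongSinceLastLoad(nowMs, lastLoadTimeMs)) {
        triggers++;
        finalizedSegmentAvailable = true; // assumed: the roll finalizes a segment
      }
      // Assumed: in-progress tailing always finds new edits to load.
      boolean editsLoaded = inProgressTailing || finalizedSegmentAvailable;
      if (editsLoaded) {
        lastLoadTimeMs = nowMs; // mirrors the update in doTailEdits()
        finalizedSegmentAvailable = false;
      }
    }
    return triggers;
  }

  public static void main(String[] args) {
    System.out.println("in-progress tailing on:  "
        + countRollTriggers(10, true) + " roll triggers");
    System.out.println("in-progress tailing off: "
        + countRollTriggers(10, false) + " roll triggers");
  }
}
```

Under these assumptions, the elapsed time since the last load never exceeds one tail period when in-progress tailing is on, so the 120-second threshold is never crossed.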
