hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
Date Tue, 20 Jan 2015 20:33:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284357#comment-14284357 ]

Daryn Sharp commented on HDFS-7587:

{{verifyQuota}} is already invoked so the quota counts shouldn't go out of sync.  {{updateSpaceConsumed}}
calls {{updateCount}}, which calls {{verifyQuota}} prior to invoking {{unprotectedUpdateCount}}.
 The quotas aren't going to change so it seems calling {{verifyQuota}} explicitly is wasted
processing time.
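
To make the call chain above concrete, here is a minimal sketch in Java. It is not the actual HDFS code: the class {{QuotaDirSketch}}, its fields, and the simplified method signatures are assumptions that only mirror the method names mentioned in this comment. It shows why an explicit {{verifyQuota}} call ahead of {{updateSpaceConsumed}} would repeat a check that already runs inside {{updateCount}}.

```java
// Hypothetical, simplified model of the call chain named in the comment.
// These are NOT the real HDFS classes or signatures.
class QuotaExceededException extends Exception {
    QuotaExceededException(String msg) { super(msg); }
}

class QuotaDirSketch {
    private final long spaceQuota;   // assumed: a single space quota in bytes
    private long spaceConsumed;
    int verifyCalls = 0;             // counts how often the check runs

    QuotaDirSketch(long spaceQuota) { this.spaceQuota = spaceQuota; }

    // Entry point: delegates to updateCount.
    void updateSpaceConsumed(long delta) throws QuotaExceededException {
        updateCount(delta);
    }

    // Verifies first, then applies the change without further checks.
    void updateCount(long delta) throws QuotaExceededException {
        verifyQuota(delta);
        unprotectedUpdateCount(delta);
    }

    // Throws before anything is modified if the delta would exceed the quota.
    void verifyQuota(long delta) throws QuotaExceededException {
        verifyCalls++;
        if (delta > 0 && spaceConsumed + delta > spaceQuota) {
            throw new QuotaExceededException(
                "quota " + spaceQuota + " exceeded by delta " + delta);
        }
    }

    // Applies the change unconditionally.
    void unprotectedUpdateCount(long delta) { spaceConsumed += delta; }

    long getSpaceConsumed() { return spaceConsumed; }
}
```

In this sketch a rejected update leaves {{spaceConsumed}} untouched, because the check throws before {{unprotectedUpdateCount}} runs.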

bq.  Otherwise, the quota counts will be incorrect if there is an exception thrown later on.

Do you have a scenario in mind?  I.e., what is "later on"?  Moving the file to UC and associating
the lease aren't going to throw checked exceptions, although they might throw a runtime exception.
The NN has no concept of a transaction (no rollback), so we're fully committed to finishing
the op once we start updating data structures.  In this patch, once the quota update is successful,
we're committed to moving the file to UC and assigning a lease.  If we think those final steps
will throw, then we're in trouble because we can't roll back.  Even if that were to happen,
an out-of-sync quota is better than a corrupted in-memory state and edit logs caused by the
NN throwing runtime exceptions that don't cause an abort.
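
The ordering argument above (do the step that can fail first, then perform the updates that cannot be rolled back) can be sketched as follows. This is an illustration under assumptions, not the patch itself: {{AppendSketch}}, {{reserveSpace}}, {{prepareForAppend}}, and the set-based bookkeeping are hypothetical stand-ins for the NN's quota, under-construction, and lease state.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a checked quota failure.
class QuotaViolation extends Exception {
    QuotaViolation(String msg) { super(msg); }
}

// Hypothetical model: there is no rollback, so the fallible step goes first.
class AppendSketch {
    private long remainingQuota;
    private final Set<String> underConstruction = new HashSet<>();
    private final Set<String> leases = new HashSet<>();

    AppendSketch(long remainingQuota) { this.remainingQuota = remainingQuota; }

    // The only step that throws a checked exception. It fails before
    // any state has been modified.
    private void reserveSpace(long bytes) throws QuotaViolation {
        if (bytes > remainingQuota) {
            throw new QuotaViolation("need " + bytes + ", have " + remainingQuota);
        }
        remainingQuota -= bytes;
    }

    void prepareForAppend(String path, String holder, long bytes)
            throws QuotaViolation {
        reserveSpace(bytes);              // may throw; nothing else touched yet
        underConstruction.add(path);      // past this point we are committed
        leases.add(path + "@" + holder);  // no checked exceptions from here on
    }

    boolean isUnderConstruction(String path) {
        return underConstruction.contains(path);
    }

    int leaseCount() { return leases.size(); }
}
```

With this ordering, a quota failure leaves no under-construction marker and no lease behind, which is the state the issue description says was left dangling when the check ran too late.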

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>                 Key: HDFS-7587
>                 URL: https://issues.apache.org/jira/browse/HDFS-7587
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Kihwal Lee
>            Assignee: Daryn Sharp
>            Priority: Blocker
>         Attachments: HDFS-7587.patch
> We have seen a standby namenode crashing due to edit log corruption. It was complaining
> that {{OP_CLOSE}} cannot be applied because the file is not under-construction.
> When a client was trying to append to the file, the remaining space quota was very small.
> This caused a failure in {{prepareFileForWrite()}}, but after the inode was already converted
> for writing and a lease added. Since these were not undone when the quota violation was detected,
> the file was left in under-construction with an active lease without edit logging {{OP_ADD}}.
> A subsequent {{append()}} eventually caused a lease recovery after the soft limit period.
> This resulted in {{commitBlockSynchronization()}}, which closed the file with {{OP_CLOSE}}
> being logged.  Since there was no corresponding {{OP_ADD}}, edit replaying could not apply
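
The replay failure described in the issue can be illustrated with a small sketch. It is a deliberately simplified model, not the NN's edit log loader: {{ReplaySketch}}, its {{apply}} method, and the single under-construction set are assumptions. Only the op names {{OP_ADD}} and {{OP_CLOSE}} come from the issue text.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical replayer: tracks only which paths are under construction.
class ReplaySketch {
    enum Op { OP_ADD, OP_CLOSE }

    private final Set<String> underConstruction = new HashSet<>();

    void apply(Op op, String path) {
        switch (op) {
            case OP_ADD:
                // Opening (or reopening) a file marks it under construction.
                underConstruction.add(path);
                break;
            case OP_CLOSE:
                // Closing is only valid for a file that is under construction.
                if (!underConstruction.remove(path)) {
                    throw new IllegalStateException(
                        "OP_CLOSE: " + path + " is not under construction");
                }
                break;
        }
    }
}
```

Replaying {{OP_ADD}} followed by {{OP_CLOSE}} succeeds in this model, while an {{OP_CLOSE}} with no preceding {{OP_ADD}} is rejected, which mirrors the complaint reported by the standby namenode.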

This message was sent by Atlassian JIRA
