hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7809) Block and lease recovery failure caused by snapshot issue
Date Wed, 18 Feb 2015 19:06:11 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326398#comment-14326398 ]

Jing Zhao commented on HDFS-7809:

[This comment|https://issues.apache.org/jira/browse/HDFS-7056?focusedCommentId=14197363&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197363]
(#7) in HDFS-7056 should be the same issue as this one.

> Block and lease recovery failure caused by snapshot issue
> ---------------------------------------------------------
>                 Key: HDFS-7809
>                 URL: https://issues.apache.org/jira/browse/HDFS-7809
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Priority: Critical
> On a cluster running 2.5, we observed a decommissioning failure due to a file that
> had been under construction for 3 days. It turned out that the file had been abandoned
> and that the name node had carried out a lease recovery 3 days earlier.
> The block recovery failed because the name node threw a quota exception while serving
> {{commitBlockSynchronization()}}. After this failure, no further recovery attempt was
> made, leaving the file in the under-construction state forever.
> Furthermore, the nature of the recovery failure is very strange. Even though *snapshot
> was never used* in the cluster, the name node tried to record the diff, which required
> incrementing {{nsquota}} by 1. The user happened to run out of his {{nsquota}} at that
> time, so the increment failed and caused {{commitBlockSynchronization()}} to fail. We do
> see quota discrepancies occasionally. Were those perhaps caused by something like this
> all along?
> A few observations:
> - Lease recovery did not complete, yet it was not retried.
> - No snapshot was in use, but recovery somehow went through a snapshot-related code path.
> - The quota update during {{commitBlockSynchronization()}} should be done unconditionally.
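
As a rough illustration of the failure described in the quoted report, the toy Java model below contrasts a commit that verifies the namespace quota before finalizing a file with one that updates usage unconditionally. Every class and method name here is hypothetical and chosen for illustration; this is not the NameNode code, and it only assumes the behavior the report describes (a quota charge of 1 that fails and aborts the commit).

```java
// Toy model only: all names below are hypothetical and are NOT HDFS classes.
public class QuotaRecoverySketch {

    /** Stand-in for the quota exception mentioned in the report. */
    static class QuotaExceeded extends Exception {
        QuotaExceeded(String msg) { super(msg); }
    }

    /** Hypothetical directory with a namespace quota and a usage counter. */
    static class Dir {
        final long nsQuota;
        long nsUsed;

        Dir(long nsQuota, long nsUsed) {
            this.nsQuota = nsQuota;
            this.nsUsed = nsUsed;
        }

        /** Verified update: rejects the change if it would exceed the quota. */
        void addVerified(long delta) throws QuotaExceeded {
            if (nsUsed + delta > nsQuota) {
                throw new QuotaExceeded(
                    "nsquota exceeded: " + (nsUsed + delta) + " > " + nsQuota);
            }
            nsUsed += delta;
        }

        /** Unconditional update: records usage even if it goes over quota. */
        void addUnverified(long delta) {
            nsUsed += delta;
        }
    }

    /** Hypothetical file that starts in the under-construction state. */
    static class File {
        boolean underConstruction = true;
    }

    /** Commit that charges quota with verification; may abort before closing the file. */
    static void commitVerified(Dir dir, File f) throws QuotaExceeded {
        dir.addVerified(1);            // throws here when the quota is exhausted
        f.underConstruction = false;   // never reached on failure
    }

    /** Commit that charges quota unconditionally, so the file is always closed. */
    static void commitUnconditional(Dir dir, File f) {
        dir.addUnverified(1);
        f.underConstruction = false;
    }

    public static void main(String[] args) {
        File stuck = new File();
        try {
            commitVerified(new Dir(10, 10), stuck);
        } catch (QuotaExceeded e) {
            System.out.println("commit failed: " + e.getMessage());
        }
        System.out.println("verified path, under construction: " + stuck.underConstruction);

        File closed = new File();
        commitUnconditional(new Dir(10, 10), closed);
        System.out.println("unconditional path, under construction: " + closed.underConstruction);
    }
}
```

In this sketch the verified path leaves the file under construction when the quota is already exhausted, which mirrors the stuck file in the report. The unconditional path closes the file and lets usage exceed the quota, so the overage would have to be handled elsewhere.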

This message was sent by Atlassian JIRA