hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
Date Mon, 28 Jun 2010 21:10:54 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883296#action_12883296 ]

Todd Lipcon commented on HDFS-1262:

bq. so it really is a glorified 'cleanup and close' which has the same behavior as if the
lease expired--nice and tidy imo. It does have the slight delay of lease recovery, though.

I think that makes sense - best to do recovery since we might have gotten halfway through
creating the pipeline, for example, and this will move the blocks back to finalized state
on the DNs. Performance shouldn't be a concern, since this is such a rare case.
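
As a point of reference, branches that include HDFS-1520 expose this recovery path directly as DistributedFileSystem.recoverLease(Path). A minimal sketch of a recovery client forcing recovery that way, assuming a build where that API is present:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ForceLeaseRecovery {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    Path path = new Path(args[0]);

    // Ask the NN to begin lease recovery right away instead of waiting
    // for the soft lease limit to expire. Recovery moves the last block
    // back to finalized state on the DNs and frees the lease.
    boolean closed = dfs.recoverLease(path);
    while (!closed) {
      Thread.sleep(1000);       // recovery is asynchronous; poll until done
      closed = dfs.recoverLease(path);
    }
  }
}
{code}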

bq. While in theory it could happen on the NN side, right now, the namenode RPC for create
happens and then all we do is start the streamer (hence i don't have a test case for it yet).

What happens if we have a transient network error? For example, let's say the client is on
the same machine as the NN, but it got partitioned from the network for a bit. When we call
create(), it succeeds, but then when we actually try to write the blocks, it fails temporarily.
This currently leaves a 0-length file, but does it also orphan the lease for that file?
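
To make that concrete, a rough sketch of the sequence in question (the path is made up, and the partition is assumed to fail DN traffic while the co-located client's NN RPCs still get through):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OrphanedLeaseSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/tmp/orphaned-lease-test");   // hypothetical path

    // create() is a pure NN RPC, so it succeeds even though the client
    // can't reach any DN. The NN now records this DFSClient as the
    // lease holder of a 0-length file.
    FSDataOutputStream out = fs.create(path);

    try {
      // Writing and closing forces pipeline setup to the DNs, which
      // fails for as long as the partition lasts.
      out.write("some data".getBytes());
      out.close();
    } catch (IOException e) {
      // If the client gives up here without a cleanup path that
      // releases the lease, the NN still lists it as the holder of a
      // 0-length file -- the orphaned lease asked about above.
    }
  }
}
{code}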

> Failed pipeline creation during append leaves lease hanging on NN
> -----------------------------------------------------------------
>                 Key: HDFS-1262
>                 URL: https://issues.apache.org/jira/browse/HDFS-1262
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Assignee: sam rash
>            Priority: Critical
>             Fix For: 0.20-append
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so until soft
> lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append pipeline creation
> failed on all datanodes 6 times, causing the append() call to throw an exception back to HBase
> master. HBase assumed the file wasn't open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned to the same
> DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this JIRA is
> that the client can think it failed to open a file for append while the NN thinks the writer
> still holds a lease. Since the writer keeps renewing its lease, recovery never happens, and
> no one can open or recover the file until the DFS client shuts down.
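
A rough sketch of steps 4-6 above as client code, assuming the cached FileSystem hands both attempts the same DFSClient (the WAL path is hypothetical):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StuckLeaseSketch {
  public static void main(String[] args) throws Exception {
    // FileSystem instances are cached per conf/user, so both append()
    // calls below go through the same DFSClient, i.e. the same lease holder.
    FileSystem fs = FileSystem.get(new Configuration());
    Path wal = new Path("/hbase/.logs/rs-1/wal-000001");   // hypothetical

    try {
      // Steps 4-5: append() succeeds on the NN and assigns the lease to
      // this DFSClient, but block recovery during pipeline setup then
      // fails on every DN and the call throws.
      FSDataOutputStream out = fs.append(wal);
      out.close();
    } catch (IOException e) {
      // The caller concludes the file isn't open and queues a retry.
    }

    // Step 6: the retry runs in the same DFSClient, whose lease renewer
    // has been renewing the lease taken above all along, so the NN
    // rejects it (AlreadyBeingCreatedException) and nothing can recover
    // the file until this DFSClient shuts down or the hard limit expires.
    FSDataOutputStream out2 = fs.append(wal);
    out2.close();
  }
}
{code}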

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
