hadoop-hdfs-issues mailing list archives

From "sam rash (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1262) Failed pipeline creation during append leaves lease hanging on NN
Date Wed, 23 Jun 2010 16:57:51 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881748#action_12881748 ]

sam rash commented on HDFS-1262:

I think something along the lines of option 2 sounds cleaner.

But I have another question: does the error you see include
"because current leaseholder is trying to recreate file"

It sounds like this code is executing:

        // We found the lease for this file. And surprisingly the original
        // holder is trying to recreate this file. This should never occur.
        if (lease != null) {
          Lease leaseFile = leaseManager.getLeaseByPath(src);
          if (leaseFile != null && leaseFile.equals(lease)) {
            throw new AlreadyBeingCreatedException(
                "failed to create file " + src + " for " + holder +
                " on client " + clientMachine +
                " because current leaseholder is trying to recreate file.");
          }
        }

And any time I see a comment saying "this should never happen", it sounds to me like the
handling of that case might be suboptimal. Is there any reason a client shouldn't be able to
open a file in the same mode it already has it open? NN-side, it's basically a no-op, or an
explicit lease renewal.

Any reason we can't make the above code do that? (log something and return)
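To make the suggestion concrete, here is a minimal, self-contained sketch of that idea. This is not the actual FSNamesystem/LeaseManager code; the class and method names are illustrative. The point is only the control flow: when the current lease holder re-opens the path it already holds, renew the lease and return instead of throwing AlreadyBeingCreatedException.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the NN-side check discussed above (hypothetical names).
// A re-open by the existing holder is treated as an implicit lease
// renewal (a no-op plus a timestamp bump), not an error.
class ToyLeaseManager {
    private final Map<String, String> holderByPath = new HashMap<>();
    private final Map<String, Long> lastRenewedByPath = new HashMap<>();

    // Returns true when the open succeeds, including the
    // renew-in-place case for the current lease holder.
    boolean startFile(String src, String holder, long now) {
        String existing = holderByPath.get(src);
        if (existing != null) {
            if (existing.equals(holder)) {
                // Same client re-opening the file it already holds:
                // log-and-return semantics, renewing the lease.
                lastRenewedByPath.put(src, now);
                return true;
            }
            // A different client holds the lease: still an error.
            throw new IllegalStateException(
                "failed to create file " + src + " for " + holder +
                " because current leaseholder is trying to recreate file.");
        }
        holderByPath.put(src, holder);
        lastRenewedByPath.put(src, now);
        return true;
    }

    Long lastRenewed(String src) {
        return lastRenewedByPath.get(src);
    }
}
```

Under this sketch, step 6 of the scenario below would succeed: the retry from the same DFS client would simply renew the lease rather than bounce off the exception.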

> Failed pipeline creation during append leaves lease hanging on NN
> -----------------------------------------------------------------
>                 Key: HDFS-1262
>                 URL: https://issues.apache.org/jira/browse/HDFS-1262
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Priority: Critical
>             Fix For: 0.20-append
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened was:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append pipeline creation failed on all datanodes 6 times, causing the append() call to throw an exception back to the HBase master. HBase assumed the file wasn't open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this JIRA is that the client can think it failed to open a file for append while the NN thinks the writer holds a lease. Since the writer keeps renewing its lease, recovery never happens, and no one can open or recover the file until the DFS client shuts down.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
