hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3584) Blocks are getting marked as corrupt with append operation under high load.
Date Mon, 02 Jul 2012 12:48:23 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405043#comment-13405043 ]

Uma Maheswara Rao G commented on HDFS-3584:
-------------------------------------------

Thanks Brahma and Amith for digging into it.

This looks like a bug.
We trigger the recovery from append on the lease-expiry check, which means we are trusting
that the older client has gone down: there has been no renewal from the client, and the soft
limit has expired. The append call then triggers the recovery itself and throws an exception
back to the user, saying the file is not yet closed, try again later. Note that we are also
renewing the lease from the append call itself:

{code}
if (lease.expiredSoftLimit()) {
  LOG.info("startFile: recover lease " + lease + ", src=" + src +
      " from client " + pendingFile.getClientName());
  boolean isClosed = internalReleaseLease(lease, src, null,
      lease.expiredSoftLimit());
  if (!isClosed)
    throw new RecoveryInProgressException(
        "Failed to close file " + src +
        ". Lease recovery is in progress. Try again later.");
}
{code}
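
For context, the soft limit is the window within which the writer must keep renewing its lease before another client may force recovery. A minimal, self-contained model of that check (assuming the 60-second soft limit from HdfsConstants.LEASE_SOFTLIMIT_PERIOD; this is not the real Lease class):

{code}
// Minimal model of Lease.expiredSoftLimit() (assumption: 60s soft limit,
// mirroring HdfsConstants.LEASE_SOFTLIMIT_PERIOD; not the real class).
public class SoftLimitSketch {
  static final long SOFT_LIMIT_MS = 60 * 1000L;

  static boolean expiredSoftLimit(long lastRenewalMs) {
    // The lease is soft-expired once the holder has gone this long
    // without a renewal; append is then allowed to start recovery.
    return System.currentTimeMillis() - lastRenewalMs > SOFT_LIMIT_MS;
  }

  public static void main(String[] args) {
    long lastRenewal = System.currentTimeMillis() - 90_000L; // 90s ago
    System.out.println("soft limit expired: " + expiredSoftLimit(lastRenewal));
  }
}
{code}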

And in internalReleaseLease:

{code}
    case UNDER_RECOVERY:
      final BlockInfoUnderConstruction uc = (BlockInfoUnderConstruction)lastBlock;
      // setup the last block locations from the blockManager if not known
      if (uc.getNumExpectedLocations() == 0) {
        uc.setExpectedLocations(blockManager.getNodes(lastBlock));
      }
      // start recovery of the last block for this file
      long blockRecoveryId = nextGenerationStamp();
      lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
      uc.initializeBlockRecovery(blockRecoveryId);
      leaseManager.renewLease(lease);
{code}

Here, block recovery happens in the background on the primary DN, and the call returns immediately.

But unfortunately, a close call then came in from the old client and the file got closed. This
seems to happen under high load.
The block generation stamps, however, have already been bumped on the DNs, so the reported
replicas will be rejected on the NN side, because the file was closed with the older genstamp.
commitBlockSynchronization also fails for the same reason.
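
To make the ordering concrete, here is a small self-contained sketch of the race (plain Java modelling the genstamps only; none of these are real HDFS classes):

{code}
// Sketch of the race (assumed model, not actual NN/DN code): recovery bumps
// the genstamp on the DNs, but the old client's close finalizes the file
// with the previous genstamp, so the reported replica looks corrupt.
public class GenStampRaceSketch {
  public static void main(String[] args) {
    long oldGenStamp = 1001L;                 // genstamp while cli1 was writing
    long recoveryGenStamp = oldGenStamp + 1;  // bumped by initializeBlockRecovery

    // cli1's close wins the race: NN finalizes the file at the old genstamp.
    long nnFinalizedGenStamp = oldGenStamp;

    // DNs finish recovery and report the block with the bumped genstamp.
    long dnReportedGenStamp = recoveryGenStamp;

    if (dnReportedGenStamp != nnFinalizedGenStamp) {
      // Mirrors the effect of block-report processing on a finalized file:
      // a genstamp mismatch marks the replica corrupt; for the same reason
      // commitBlockSynchronization is rejected against the closed file.
      System.out.println("replica GS " + dnReportedGenStamp
          + " != stored GS " + nnFinalizedGenStamp + " -> marked corrupt");
    }
  }
}
{code}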

I think we need to prevent the older client from closing the file at this stage.

What if the append call takes over the lease ownership and removes the older client's lease?
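
Roughly, the idea would be something like the following (a toy model using a plain map as a stand-in; the real fix would of course go through LeaseManager):

{code}
// Toy model of the proposed takeover (hypothetical, not the real
// LeaseManager API): append reassigns the lease to the new holder, so the
// old client's close fails the holder check instead of finalizing the file
// while recovery is still running.
import java.util.HashMap;
import java.util.Map;

public class LeaseTakeoverSketch {
  static final Map<String, String> holderByPath = new HashMap<>();

  static void appendTakesOwnership(String src, String newHolder) {
    // Drop the old client's lease and record the appender as the holder.
    holderByPath.put(src, newHolder);
  }

  static void close(String src, String holder) {
    String current = holderByPath.get(src);
    if (!holder.equals(current)) {
      throw new IllegalStateException("Lease on " + src + " is held by "
          + current + "; close from " + holder + " is rejected");
    }
    holderByPath.remove(src);
  }

  public static void main(String[] args) {
    holderByPath.put("/f1", "cli1");        // cli1 wrote /f1 and never closed it
    appendTakesOwnership("/f1", "cli2");    // cli2's append takes the lease
    try {
      close("/f1", "cli1");                 // old client's stale close
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());   // rejected, as desired
    }
  }
}
{code}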

The close call checks the lease anyway:

{code}
try {
  pendingFile = checkLease(src, holder);
} catch (LeaseExpiredException lee) {
  INodeFile file = dir.getFileINode(src);
  if (file != null && !file.isUnderConstruction()) {
    // This could be a retry RPC - i.e. the client tried to close
    // the file, but missed the RPC response. Thus, it is trying
    // again to close the file. If the file still exists and
    // the client's view of the last block matches the actual
    // last block, then we'll treat it as a successful close.
    // See HDFS-3031.
    Block realLastBlock = file.getLastBlock();
    if (Block.matchingIdAndGenStamp(last, realLastBlock)) {
      NameNode.stateChangeLog.info("DIR* NameSystem.completeFile: " +
          "received request from " + holder + " to complete file " + src +
          " which is already closed. But, it appears to be an RPC " +
          "retry. Returning success.");
      return true;
    }
  }
  throw lee;
}
{code}
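
Note that the retry shortcut above only applies when the client's view of the last block still matches the real one; once recovery has bumped the genstamp, it would not. A small self-contained sketch of that check (using a simplified Block, not the real org.apache.hadoop.hdfs.protocol.Block):

{code}
// Simplified model of Block.matchingIdAndGenStamp (assumption: toy Block
// type, not the real org.apache.hadoop.hdfs.protocol.Block).
public class MatchSketch {
  static class Block {
    final long id;
    final long genStamp;
    Block(long id, long genStamp) { this.id = id; this.genStamp = genStamp; }
  }

  static boolean matchingIdAndGenStamp(Block a, Block b) {
    if (a == null || b == null) return false;
    return a.id == b.id && a.genStamp == b.genStamp;
  }

  public static void main(String[] args) {
    Block clientView = new Block(42L, 1001L); // genstamp before recovery
    Block realLast   = new Block(42L, 1002L); // genstamp after recovery bump
    // false: the stale close would not be treated as a benign RPC retry.
    System.out.println(matchingIdAndGenStamp(clientView, realLast));
  }
}
{code}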
I am not sure whether I am missing something here.
Would greatly appreciate your suggestions on this.

> Blocks are getting marked as corrupt with append operation under high load.
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-3584
>                 URL: https://issues.apache.org/jira/browse/HDFS-3584
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.1-alpha
>            Reporter: Brahma Reddy Battula
>
> Scenario:
> =========
> 1. There are two clients, cli1 and cli2. cli1 writes a file F1 and does not close it.
> 2. cli2 calls append on the unclosed file, which triggers a lease recovery.
> 3. cli1 closes the file.
> 4. The lease recovery completes with an updated GS on the DNs. When the block report
> comes in, the block is marked corrupt because of the GS mismatch.
> 5. A commitBlockSync then arrives; it also fails, since the file has already been closed
> by cli1 and its state on the NN is Finalized.
