From: "Todd Lipcon (JIRA)"
To: hdfs-dev@hadoop.apache.org
Reply-To: hdfs-dev@hadoop.apache.org
Date: Fri, 8 Jun 2012 22:48:23 +0000 (UTC)
Message-ID: <1059204687.54742.1339195703051.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Created] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace

Todd Lipcon created HDFS-3519:
---------------------------------

             Summary: Checkpoint upload may interfere with a concurrent saveNamespace
                 Key: HDFS-3519
                 URL: https://issues.apache.org/jira/browse/HDFS-3519
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: name-node
    Affects Versions: 2.0.0-alpha, 1.0.3
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Critical

TestStandbyCheckpoints failed in [precommit build 2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/] due to the following issue:
- both nodes were in Standby state, and configured to checkpoint "as fast as possible"
- NN1 starts to save its own namespace
- NN2 starts to upload a checkpoint for the same txid. So, both threads are writing to the same file fsimage.ckpt_12, but the actual file contents correspond to the uploading thread's data.
- NN1 finished its saveNamespace operation while NN2 was still uploading. So, it renamed the ckpt file. However, the contents of the file are still empty since NN2 hasn't sent any bytes
- NN2 finishes the upload, and the rename() call fails, which causes the directory to be marked failed, etc.

The result is that there is a file fsimage_12 which appears to be a finalized image but in fact is incompletely transferred. When the transfer completes, the problem "heals itself" so there wouldn't be persistent corruption unless the machine crashes at the same time. And even then, we'd still have the earlier checkpoint to restore from.

This same race could occur in a non-HA setup if a user puts the NN in safe mode and issues saveNamespace operations concurrent with a 2NN checkpointing, I believe.
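
The interleaving described above can be replayed in miniature. The following is a minimal Python sketch, not HDFS code, and it assumes POSIX rename semantics (an open file can be renamed, and later writes land in the renamed file). The names `simulate_unguarded`, `CheckpointGuard`, and `demo` are hypothetical, and NN1's own writes to the temp file are left out for brevity. The guard is only one possible way to keep two writers off the same txid; it is not necessarily the fix adopted for HDFS-3519.

```python
import os
import tempfile
import threading


def simulate_unguarded(directory, txid, payload):
    """Replay the reported interleaving step by step (no real threads).

    Returns (size of finalized image right after NN1's rename,
             whether NN2's later rename failed,
             size of finalized image after the upload completes).
    """
    ckpt = os.path.join(directory, "fsimage.ckpt_%d" % txid)
    final = os.path.join(directory, "fsimage_%d" % txid)

    # NN2's upload opens the shared temp file; no bytes have arrived yet.
    upload = open(ckpt, "wb")

    # NN1's saveNamespace finishes first and renames the temp file.
    os.rename(ckpt, final)
    size_at_rename = os.path.getsize(final)  # looks finalized, but is empty

    # NN2's bytes arrive; they land in the already-renamed file.
    upload.write(payload)
    upload.close()

    # NN2 now tries its own rename, but the temp name no longer exists.
    try:
        os.rename(ckpt, final)
        rename_failed = False
    except FileNotFoundError:
        rename_failed = True

    return size_at_rename, rename_failed, os.path.getsize(final)


class CheckpointGuard:
    """Hypothetical guard: at most one writer per txid at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = set()

    def try_begin(self, txid):
        # Claim the txid; return False if another writer already holds it.
        with self._lock:
            if txid in self._in_progress:
                return False
            self._in_progress.add(txid)
            return True

    def end(self, txid):
        # Release the txid once the image is finalized or abandoned.
        with self._lock:
            self._in_progress.discard(txid)


def demo():
    # Run the unguarded replay in a scratch directory.
    with tempfile.TemporaryDirectory() as d:
        return simulate_unguarded(d, 12, b"image-bytes")
```

Calling `demo()` shows the three symptoms from the report: the finalized file is empty at the moment of NN1's rename, NN2's rename fails, and the file reaches its full size only once the upload completes. With the guard, a second writer whose `try_begin(12)` returns `False` would skip or defer its write instead of sharing `fsimage.ckpt_12`.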