Date: Thu, 8 Jun 2017 22:58:18 +0000 (UTC)
From: "Duo Xu (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Created] (HADOOP-14512) WASB atomic rename should not throw exception if the file is neither in src nor in dst when doing the rename

Duo Xu created HADOOP-14512:
-------------------------------

             Summary: WASB atomic rename should not throw exception if the file is neither in src nor in dst when doing the rename
                 Key: HADOOP-14512
                 URL: https://issues.apache.org/jira/browse/HADOOP-14512
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs/azure
            Reporter: Duo Xu


During an atomic rename operation, WASB creates a rename-pending JSON file that records which files need to be renamed and their destination. WASB then reads this file and renames the files one by one.

A recent customer incident in HBase exposed a potential bug in the atomic rename implementation. For example, below is a rename-pending JSON file:

{code}
{
  FormatVersion: "1.0",
  OperationUTCTime: "2017-04-29 06:08:57.465",
  OldFolderName: "hbase\/data\/default\/abc",
  NewFolderName: "hbase\/.tmp\/data\/default\/abc",
  FileList: [
    ".tabledesc",
    ".tabledesc\/.tableinfo.0000000001",
    ".tmp",
    "08e698e0b7d4132c0456b16dcf3772af",
    "08e698e0b7d4132c0456b16dcf3772af\/.regioninfo",
    "08e698e0b7d4132c0456b16dcf3772af\/0\/617294e0737e4d37920e1609cf539a83",
    "08e698e0b7d4132c0456b16dcf3772af\/recovered.edits\/185.seqid",
    "08e698e0b7d4132c0456b16dcf3772af\/.regioninfo",
    "08e698e0b7d4132c0456b16dcf3772af\/0",
    "08e698e0b7d4132c0456b16dcf3772af\/0\/617294e0737e4d37920e1609cf539a83",
    "08e698e0b7d4132c0456b16dcf3772af\/recovered.edits",
    "08e698e0b7d4132c0456b16dcf3772af\/recovered.edits\/185.seqid"
  ]
}
{code}

The HBase regionserver process (which uses the WASB driver underneath) was renaming "08e698e0b7d4132c0456b16dcf3772af\/.regioninfo" when the process crashed or the VM was rebooted for system maintenance. When the regionserver process started running again, it found the rename-pending JSON file and tried to redo the rename operation. However, when it read the first entry in the file list, ".tabledesc", it could not find the file in either the src folder or the destination folder. The file was not in the src folder because it had already been renamed (moved) to the destination folder. It was not in the destination folder because HBase, on startup, cleans up all the files under /hbase/.tmp.

The current implementation throws an exception in this case:

{code}
else {
  throw new IOException(
      "Attempting to complete rename of file " + srcKey + "/" + fileName
      + " during folder rename redo, and file was not found in source "
      + "or destination.");
}
{code}

This causes HBase HMaster initialization to fail, and restarting HMaster does not help because the same exception is thrown again.

My proposal is that if, during the redo, WASB finds a file that is in neither src nor dst, WASB should skip this file and process the next one rather than throw the error and leave the user to fix it manually. The reasons are:

1. Since the rename-pending JSON file contains file A, if file A is not in src, it must have been renamed.
2. If file A is in neither src nor dst, the upper-layer service must have removed it.

One thing to note is that the folder is locked during the atomic rename. So the only situation in which a file can be deleted is when the VM reboots or the service process crashes. When the service process restarts, some operations may run before the atomic rename redo, as in the HBase example above.
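
To make the proposed behavior concrete, here is a minimal in-memory sketch of a redo loop that skips a file missing from both folders. It is not the WASB implementation: the class name RenameRedoSketch, the redo method, and the use of plain sets to stand in for the blobs under src and dst are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch of the proposed redo behavior. The two sets stand in
 * for the files currently present under the source and destination folders;
 * nothing here calls the real WASB driver.
 */
class RenameRedoSketch {

    /**
     * Replays the file list from a rename-pending file.
     * Returns the entries that were skipped because they were found in
     * neither src nor dst (the case that currently throws IOException).
     */
    static List<String> redo(List<String> fileList, Set<String> src, Set<String> dst) {
        List<String> skipped = new ArrayList<>();
        for (String fileName : fileList) {
            if (src.contains(fileName)) {
                // Still in the source folder: finish the interrupted rename.
                src.remove(fileName);
                dst.add(fileName);
            } else if (dst.contains(fileName)) {
                // Already renamed before the crash: nothing left to do.
                continue;
            } else {
                // Proposed change: record and skip instead of throwing, on the
                // assumption that an upper-layer service removed the file.
                skipped.add(fileName);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // ".tabledesc" was renamed and then cleaned up from the destination;
        // ".regioninfo" was still waiting to be renamed when the crash happened.
        Set<String> src = new HashSet<>(Arrays.asList(".regioninfo"));
        Set<String> dst = new HashSet<>();
        List<String> skipped = redo(Arrays.asList(".tabledesc", ".regioninfo"), src, dst);
        System.out.println("Skipped: " + skipped);
        System.out.println("Destination now holds: " + dst);
    }
}
```

A real change would probably also log each skipped file, so that an operator can tell a benign cleanup apart from unexpected data loss.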