Date: Thu, 2 Jul 2015 20:42:05 +0000 (UTC)
From: "Enis Soztutar (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

[ https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612488#comment-14612488 ]

Enis Soztutar commented on HBASE-13832:
---------------------------------------

bq. to get the same behavior you need to force running to false when you set syncException. so you prevent other procedure to be added.

I am not sure we gain anything by ensuring that running is set to false before the next execution of syncLoop. The WAL store will abort when the master calls abort. Before that happens, concurrent calls to {{pushData()}} will still get the exception, because the exception from sync is never cleared.
So the semantics are that if {{sync()}} + WAL roll fails, we effectively start rejecting all requests to {{pushData()}}, which is similar to making sure to check isRunning().

> Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13832
>                 URL: https://issues.apache.org/jira/browse/HBASE-13832
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master, proc-v2
>    Affects Versions: 2.0.0, 1.1.0, 1.2.0
>            Reporter: Stephen Yuan Jiang
>            Assignee: Matteo Bertozzi
>            Priority: Critical
>             Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1
>
>         Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, HBASE-13832-v2.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, hbase-13832-v3.patch
>
>
> When the data node count is < 3, we get a failure in WALProcedureStore#syncLoop() during master start. The failure prevents the master from starting.
> {noformat}
> 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort.
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
> {noformat}
> One proposal is to implement logic similar to FSHLog: if an IOException is thrown during syncLoop in WALProcedureStore#start(), instead of aborting immediately, we could try to roll the log and see whether that resolves the issue; if the new log cannot be created, or rolling the log throws further exceptions, we then abort.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
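The "roll on sync failure, abort only if the roll also fails" idea in the proposal above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual WALProcedureStore code: the Wal interface, the rollWriter() hook, and the class name are all hypothetical stand-ins.

```java
import java.io.IOException;

// Hypothetical sketch of the proposed recovery, mirroring FSHLog's approach:
// on a sync failure, attempt one log roll and retry before giving up.
public class SyncLoopSketch {

    // Hypothetical WAL abstraction; the real store talks to an HDFS output stream.
    interface Wal {
        void sync() throws IOException;
        boolean rollWriter() throws IOException; // true if a new log was created
    }

    private final Wal wal;
    private volatile boolean running = true; // pushData() would check this flag

    SyncLoopSketch(Wal wal) {
        this.wal = wal;
    }

    /**
     * Returns true if the slot was synced (possibly after a log roll),
     * false if both the sync and the roll failed and we must abort.
     */
    boolean syncSlot() {
        try {
            wal.sync();
            return true;
        } catch (IOException syncEx) {
            try {
                // First failure: roll the log and retry once instead of aborting.
                if (wal.rollWriter()) {
                    wal.sync();
                    return true;
                }
            } catch (IOException rollEx) {
                // fall through: rolling (or the retried sync) also failed
            }
            running = false; // reject further pushData() calls, then abort
            return false;
        }
    }

    boolean isRunning() {
        return running;
    }
}
```

With this shape, a transient pipeline error (e.g. the datanode-replacement failure above) is absorbed by the roll, while a persistent one still flips {{running}} and leads to the abort path, matching the semantics discussed in the comment.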