Date: Tue, 4 Feb 2014 19:50:16 +0000 (UTC)
From: "Jing Zhao (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-5399) Revisit SafeModeException and corresponding retry policies

    [ https://issues.apache.org/jira/browse/HDFS-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891102#comment-13891102 ]

Jing Zhao commented on HDFS-5399:
---------------------------------

Thanks for the comment, Todd!

bq. Maybe we should consider changing the extension so that, if we don't have a significant number of under-replicated blocks, we don't go through the extension?

+1 for this. [~kihwal] has a similar proposal in HDFS-5145.
Since DNs keep sending block reports to the SBN, and the NN will process all pending DN messages while starting the active services, maybe we can simply skip the safemode extension even without checking the number of under-replicated blocks?

bq. we should limit the number of retries as Jing proposed above

I will create a jira and upload a patch for this.

> Revisit SafeModeException and corresponding retry policies
> ----------------------------------------------------------
>
>                 Key: HDFS-5399
>                 URL: https://issues.apache.org/jira/browse/HDFS-5399
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.3.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>
> Currently for NN SafeMode, we have the following corresponding retry policies:
> # In a non-HA setup, for certain API calls ("create"), the client will retry if the NN is in SafeMode. Specifically, the client side's RPC adopts the MultipleLinearRandomRetry policy for a wrapped SafeModeException when retry is enabled.
> # In an HA setup, the client will retry if the NN is Active and in SafeMode. Specifically, the SafeModeException is wrapped as a RetriableException on the server side. The client side's RPC uses the FailoverOnNetworkExceptionRetry policy, which recognizes RetriableException (see HDFS-5291).
> There are several possible issues in the current implementation:
> # The NN SafeMode can be a "Manual" SafeMode (i.e., started by an administrator through the CLI), and clients may not want to retry on this type of SafeMode.
> # Clients may want to retry on other API calls in a non-HA setup.
> # We should have a single generic strategy to address the mapping between SafeMode and retry policy for both HA and non-HA setups. A possible straightforward solution is to always wrap the SafeModeException in a RetriableException to indicate that the clients should retry.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
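To illustrate the strategy sketched in point 3 of the description (and the "Manual" SafeMode caveat in point 1), here is a minimal, self-contained Java sketch. The classes below are simplified stand-ins I made up for this example, not the real Hadoop implementations: the server wraps SafeModeException in RetriableException only when safe mode was entered automatically, so clients retry on startup safe mode but fail fast when an administrator has forced safe mode via the CLI.

```java
// Hypothetical stand-in for Hadoop's SafeModeException; the "manual" flag
// records whether an admin entered safe mode via the CLI.
class SafeModeException extends Exception {
    final boolean manual;
    SafeModeException(String msg, boolean manual) {
        super(msg);
        this.manual = manual;
    }
}

// Hypothetical stand-in for Hadoop's RetriableException: a marker that tells
// a retry-aware client "this failure is transient, try again".
class RetriableException extends Exception {
    RetriableException(Throwable cause) {
        super(cause);
    }
}

public class SafeModeRetrySketch {
    /**
     * Server-side mapping: automatic safe mode (e.g. during startup) is
     * surfaced as RetriableException so clients retry; manual safe mode is
     * surfaced as-is so clients do not retry indefinitely.
     */
    static Exception toClientException(SafeModeException smx) {
        if (smx.manual) {
            return smx;                      // manual safe mode: do not signal retry
        }
        return new RetriableException(smx);  // automatic safe mode: signal retry
    }

    public static void main(String[] args) {
        Exception auto = toClientException(
                new SafeModeException("NameNode is starting up", false));
        Exception manual = toClientException(
                new SafeModeException("Safe mode entered by admin", true));
        System.out.println(auto instanceof RetriableException);
        System.out.println(manual instanceof RetriableException);
    }
}
```

This keeps the retry decision on the server side, which is what makes the policy uniform across HA and non-HA setups: the client only needs a retry policy that recognizes RetriableException, regardless of how it connects.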