Mailing-List: contact dev-help@ambari.apache.org; run by ezmlm
Reply-To: dev@ambari.apache.org
Content-Type: multipart/alternative; boundary="===============8540556213246818832=="
MIME-Version: 1.0
Subject: Re: Review Request 41691: Namenode start fails when time taken to get out of safemode is more than 20 minutes.
 Additional patch
From: "Eugene Chekanskiy"
To: "Alejandro Fernandez", "Sumit Mohanty", "Vitalyi Brodetskyi", "Eugene Chekanskiy"
Cc: "Dmitro Lisnichenko", "Ambari", "Apache Ambari"
Date: Mon, 28 Dec 2015 19:03:23 -0000
Message-ID: <20151228190323.4182.97816@reviews.apache.org>
X-ReviewBoard-URL: https://reviews.apache.org/
Auto-Submitted: auto-generated
Sender: "Eugene Chekanskiy"
X-ReviewGroup: Ambari
X-Auto-Response-Suppress: DR, RN, OOF, AutoReply
X-ReviewRequest-URL: https://reviews.apache.org/r/41691/
X-Sender: "Eugene Chekanskiy"
References: <20151228165841.4181.83100@reviews.apache.org>
In-Reply-To: <20151228165841.4181.83100@reviews.apache.org>
Reply-To: "Eugene Chekanskiy"
X-ReviewRequest-Repository: ambari

--===============8540556213246818832==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

> On Dec. 28, 2015, 4:58 p.m., Apache Ambari wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py, line 45
> >
> > I spoke to Aravindan about this.
> > Consider what happens when the server timeout value is:
> >
> > A. < 30 mins (default of 20): If NN takes more than 30 mins to come out of safemode, then the task will be aborted and the user will have to retry the step again (e.g., NameNode restart and wait again)
> >
> > B. 30 or higher: Then NN will wait up to 30 mins. If after 30 mins still in safemode, then the task will proceed.
> >
> > For a very large cluster, this can take much longer than 30 mins and we'll be in the same boat again.
> > There are 2 other potential solutions:
> > 1. Have a timeout value in ambari.properties that is specific for waiting to leave safemode
> > 2. Pass in the value of the server timeout to the command. So if the user bumps it up to 40 mins, then NameNode can always wait up to x-5 mins.
> >
> > What do you think?
>
> Sumit Mohanty wrote:
>     The problem here is that any limit we can configure could be smaller than the time taken to come out of safemode. So we can define a new property to capture the NN timeout, but it will still be guesswork as to what the value should be. The long-term solution seems to be a feature where the user can tell Ambari to abort or continue to wait for NN to come out of safemode. Is that something EU does today? (EU will allow users to retry, won't it?)
>
>     This specific JIRA is tracking the problem of the default timeout being out of sync with the default retry duration. So we should fix that and open a new task to discuss how to track getting out of safemode gracefully.
>
> Apache Ambari wrote:
>     yes, both solutions I proposed would handle this. #2 is easiest to do. #1 would need any NameNode restart operation to change the default timeout value of the task.

Agree that moving to more advanced safemode-leaving mechanisms needs to be discussed in a separate task. It is not a big change, but there are a lot of options for how we can handle this and how it can be configured.


- Eugene


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41691/#review112008
-----------------------------------------------------------


On Dec. 23, 2015, 5:20 p.m., Dmitro Lisnichenko wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41691/
> -----------------------------------------------------------
>
> (Updated Dec. 23, 2015, 5:20 p.m.)
>
>
> Review request for Ambari, Alejandro Fernandez, Eugene Chekanskiy, Sumit Mohanty, and Vitalyi Brodetskyi.
>
>
> Bugs: AMBARI-14479
>     https://issues.apache.org/jira/browse/AMBARI-14479
>
>
> Repository: ambari
>
>
> Description
> -------
>
> Issue
> Namenode safemode check timeout value of 30mins is more than the server timeout of 20mins for a task. Hence, the server kills the namenode startup script if it takes more than 20mins to get out of safemode.
>
>
> Diffs
> -----
>
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 1766c44
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py 67db735
>   ambari-server/src/test/python/stacks/2.0.6/HDFS/test_namenode.py 399fd8d
>
> Diff: https://reviews.apache.org/r/41691/diff/
>
>
> Testing
> -------
>
> mvn clean test
>
>
> Thanks,
>
> Dmitro Lisnichenko
>
>

--===============8540556213246818832==--
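
Option 2 from the thread above (pass the server timeout to the command and let the NameNode wait up to "x-5" minutes) could be sketched roughly as follows. This is only an illustration of the arithmetic, not the patch under review: the function names, the 10-second polling interval, and the way the server timeout would reach the script are all assumptions, and none of them are existing Ambari APIs.

```python
# Hypothetical sketch of "wait up to x-5 mins"; names are illustrative only.
SAFETY_MARGIN_MINS = 5  # the margin suggested in the review comment


def safemode_wait_minutes(server_timeout_mins, margin_mins=SAFETY_MARGIN_MINS):
    """Return how long the script may wait for the NameNode to leave safemode.

    The wait is kept below the server task timeout so the script can finish
    on its own before the server would abort the task.
    """
    if server_timeout_mins <= margin_mins:
        raise ValueError("server timeout must exceed the safety margin")
    return server_timeout_mins - margin_mins


def retry_plan(wait_mins, sleep_secs=10):
    """Convert a wait budget in minutes into (tries, sleep_secs) for polling."""
    tries = (wait_mins * 60) // sleep_secs
    return tries, sleep_secs


if __name__ == "__main__":
    # With the default 20-minute server timeout, the wait budget is 15 minutes.
    print(retry_plan(safemode_wait_minutes(20)))
```

With this shape, raising the server timeout automatically raises the safemode wait, so the two values cannot drift out of sync the way the 20-minute and 30-minute defaults did.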