Mailing-List: contact dev-help@ambari.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ambari.apache.org
Content-Type: multipart/alternative;
 boundary="===============8540556213246818832=="
MIME-Version: 1.0
Subject: Re: Review Request 41691: Namenode start fails when time taken to get
 out of safemode is more than 20 minutes. Additional patch
From: "Eugene Chekanskiy" <echekanskiy@hortonworks.com>
To: "Alejandro Fernandez" <afernandez@hortonworks.com>,
 "Sumit Mohanty" <smohanty@hortonworks.com>,
 "Vitalyi Brodetskyi" <vbrodetskyi@hortonworks.com>,
 "Eugene Chekanskiy" <echekanskiy@hortonworks.com>
Cc: "Dmitro Lisnichenko" <dlysnichenko@hortonworks.com>,
 "Ambari" <dev@ambari.apache.org>, "Apache Ambari" <apache.ambari@gmail.com>
Date: Mon, 28 Dec 2015 19:03:23 -0000
Message-ID: <20151228190323.4182.97816@reviews.apache.org>
Auto-Submitted: auto-generated
Sender: "Eugene Chekanskiy" <noreply@reviews.apache.org>
References: <20151228165841.4181.83100@reviews.apache.org>
In-Reply-To: <20151228165841.4181.83100@reviews.apache.org>
Reply-To: "Eugene Chekanskiy" <echekanskiy@hortonworks.com>

--===============8540556213246818832==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit


> On Dec. 28, 2015, 4:58 p.m., Apache Ambari wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py, line 45
> > <https://reviews.apache.org/r/41691/diff/1/?file=1175382#file1175382line45>
> >
> >     I spoke to Aravindan about this.
> >     Consider what happens when the server timeout value is:
> >     
> >     A. < 30 mins (default of 20): If NN takes longer than the server timeout to come out of safemode, then the task will be aborted and the user will have to retry the step (e.g., restart NameNode and wait again).
> >     
> >     B. 30 or higher: Then NN will wait up to 30 mins. If it is still in safemode after 30 mins, the task will proceed.
> >     
> >     For a very large cluster, this can take much longer than 30 mins and we'll be in the same boat again.
> >     There are 2 other potential solutions:
> >     1. Have a timeout value in ambari.properties that is specific to waiting to leave safemode.
> >     2. Pass in the value of the server timeout to the command. So if the user bumps it up to 40 mins, then NameNode can always wait up to x-5 mins, where x is the server timeout.
> >     
> >     What do you think?
> 
> Sumit Mohanty wrote:
>     The problem here is that any limit we configure could be smaller than the time taken to come out of safemode. So we can define a new property to capture the NN timeout, but it will still be guesswork as to what the value should be. The long-term solution seems to be a feature where the user can tell Ambari to abort or continue to wait for NN to come out of safemode. Is that something that the EU does today? (The EU will allow users to retry, won't it?)
>     
>     This specific JIRA is tracking the problem of the default timeout being out of sync with the default retry duration. So we should fix that and open a new task to discuss how to handle getting out of safemode gracefully.
> 
> Apache Ambari wrote:
>     Yes, both solutions I proposed would handle this. #2 is the easiest to do. #1 would need any NameNode restart operation to change the default timeout value of the task.

I agree that more advanced mechanisms for leaving safemode need to be discussed in a separate task. It is not a big change, but there are a lot of options for how we can handle this and how it can be configured.
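
To make option #2 from the thread concrete, here is a rough sketch of a wait loop that derives its budget from the server timeout ("x-5 mins"). This is only an illustration, not the code in hdfs_namenode.py: the function names, the is_safemode_off callback, the poll interval, and the injectable clock/sleep parameters are all hypothetical, and it assumes Python 3 for time.monotonic.

```python
import time

# The 5-minute margin comes from the "x-5 mins" suggestion in the thread.
SAFETY_MARGIN_SECONDS = 5 * 60


def safemode_wait_budget(server_timeout_seconds,
                         margin_seconds=SAFETY_MARGIN_SECONDS):
    """Return how long the script may wait for safemode exit, in seconds.

    The budget is the server task timeout minus a margin, never negative.
    """
    return max(0, server_timeout_seconds - margin_seconds)


def wait_for_safemode_off(is_safemode_off, server_timeout_seconds,
                          poll_interval_seconds=10,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until safemode is off or the budget is used up.

    is_safemode_off is a hypothetical callable that returns True once the
    NameNode has left safemode. Returns True on success, False on timeout.
    """
    deadline = clock() + safemode_wait_budget(server_timeout_seconds)
    while True:
        if is_safemode_off():
            return True
        # Stop if another sleep would push us past the deadline, so the
        # script exits on its own before the server kills the task.
        if clock() + poll_interval_seconds > deadline:
            return False
        sleep(poll_interval_seconds)
```

With a 40-minute server timeout this waits at most 35 minutes, and with the default 20 minutes at most 15, so the script always finishes before the server timeout. What should happen after a False result (abort, or proceed while still in safemode) is exactly the part that belongs in the separate task.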


- Eugene


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41691/#review112008
-----------------------------------------------------------


On Dec. 23, 2015, 5:20 p.m., Dmitro Lisnichenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41691/
> -----------------------------------------------------------
> 
> (Updated Dec. 23, 2015, 5:20 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Eugene Chekanskiy, Sumit Mohanty, and Vitalyi Brodetskyi.
> 
> 
> Bugs: AMBARI-14479
>     https://issues.apache.org/jira/browse/AMBARI-14479
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Issue
> The NameNode safemode check timeout of 30 mins is longer than the server's task timeout of 20 mins. Hence, the server kills the NameNode startup script if it takes more than 20 mins to get out of safemode.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 1766c44 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py 67db735 
>   ambari-server/src/test/python/stacks/2.0.6/HDFS/test_namenode.py 399fd8d 
> 
> Diff: https://reviews.apache.org/r/41691/diff/
> 
> 
> Testing
> -------
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Dmitro Lisnichenko
> 
>


--===============8540556213246818832==--