ambari-dev mailing list archives

From "Eugene Chekanskiy" <echekans...@hortonworks.com>
Subject Re: Review Request 41691: Namenode start fails when time taken to get out of safemode is more than 20 minutes. Additional patch
Date Mon, 28 Dec 2015 19:03:23 GMT


> On Dec. 28, 2015, 4:58 p.m., Apache Ambari wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py, line 45
> > <https://reviews.apache.org/r/41691/diff/1/?file=1175382#file1175382line45>
> >
> >     I spoke to Aravindan about this.
> >     Consider what happens when the server timeout value is:
> >     
> >     A. < 30 mins (default of 20): If NN takes more than 30 mins to come out of safemode, then the task will be aborted and the user will have to retry the step (e.g., NameNode restart and wait again)
> >     
> >     B. 30 or higher: Then NN will wait up to 30 mins. If it is still in safemode after 30 mins, then the task will proceed.
> >     
> >     For a very large cluster, this can take much longer than 30 mins and we'll be in the same boat again.
> >     There are 2 other potential solutions:
> >     1. Have a timeout value in ambari.properties that is specific for waiting to leave safemode
> >     2. Pass in the value of the server timeout to the command. So if the user bumps it up to 40 mins, then NameNode can always wait up to x-5 mins.
> >     
> >     What do you think?
> 
> Sumit Mohanty wrote:
>     The problem here is that any limit we can configure could be smaller than the time taken to come out of safemode. So we can define a new property to capture the NN timeout, but it will still be guesswork as to what the value should be. The long-term solution seems to be a feature where the user can tell Ambari to abort or continue to wait for NN to come out of safemode. Is it something that the EU does today? (EU will allow users to retry, will it?)
>     
>     This specific JIRA is tracking the problem of the default timeout being out of sync with the default retry duration. So we should fix that and open a new Task to discuss the solution for how to track getting out of safemode gracefully.
> 
> Apache Ambari wrote:
>     Yes, both solutions I proposed would handle this. #2 is easiest to do. #1 would need any NameNode restart operation to change the default timeout value of the task.

Agreed that any more advanced mechanism for leaving safemode needs to be discussed in a separate
task. It is not a large change, but there are a lot of options for how we can handle this and how it
can be configured.
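
To make option #2 from the quoted discussion concrete, here is a minimal sketch
of deriving the safemode wait from the server task timeout, keeping the 5-minute
margin ("x-5 mins") mentioned above. The function names, the 10-second poll
interval, and the 60-second floor are assumptions for illustration only; this is
not the patch under review and not existing Ambari code.

```python
# Hypothetical helpers illustrating option #2 (wait up to "x - 5 mins").
# Only the 5-minute margin comes from the thread; all names, the poll
# interval and the minimum wait are assumptions made for this sketch.

SAFETY_MARGIN_SECS = 5 * 60   # margin from the discussion: x - 5 mins
MIN_WAIT_SECS = 60            # assumed floor so the wait is never <= 0


def safemode_wait_budget(server_timeout_secs, margin_secs=SAFETY_MARGIN_SECS):
    """Return how long the NameNode start script may wait for safemode exit.

    The budget is the server task timeout minus a margin, so the script
    finishes (or gives up) before the server kills the task.
    """
    return max(server_timeout_secs - margin_secs, MIN_WAIT_SECS)


def retry_plan(budget_secs, sleep_secs=10):
    """Convert a wait budget into (tries, sleep_secs) for a polling loop."""
    tries = max(1, budget_secs // sleep_secs)
    return tries, sleep_secs


if __name__ == "__main__":
    # Default 20-minute server timeout: wait at most 15 minutes.
    budget = safemode_wait_budget(20 * 60)
    print(budget, retry_plan(budget))
```

With this approach a user who raises the server timeout to 40 mins would get a
35-minute safemode wait without touching a second setting, which is the appeal
of #2; it still does not help when safemode takes longer than any configured
limit.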


- Eugene


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41691/#review112008
-----------------------------------------------------------


On Dec. 23, 2015, 5:20 p.m., Dmitro Lisnichenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41691/
> -----------------------------------------------------------
> 
> (Updated Dec. 23, 2015, 5:20 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Eugene Chekanskiy, Sumit Mohanty, and Vitalyi Brodetskyi.
> 
> 
> Bugs: AMBARI-14479
>     https://issues.apache.org/jira/browse/AMBARI-14479
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Issue
> Namenode safemode check timeout value of 30 mins is more than the server timeout of 20 mins for a task. Hence, the server kills the namenode startup script if it takes more than 20 mins to get out of safemode.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 1766c44
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py 67db735
>   ambari-server/src/test/python/stacks/2.0.6/HDFS/test_namenode.py 399fd8d
> 
> Diff: https://reviews.apache.org/r/41691/diff/
> 
> 
> Testing
> -------
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Dmitro Lisnichenko
> 
>

