ambari-dev mailing list archives

From "Alejandro Fernandez" <afernan...@hortonworks.com>
Subject Re: Review Request 38651: AMBARI-13194. Alert definition when DataNode data dirs are likely to become unmounted
Date Wed, 23 Sep 2015 16:21:33 GMT


> On Sept. 22, 2015, 10:31 p.m., Sumit Mohanty wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py, line 77
> > <https://reviews.apache.org/r/38651/diff/1/?file=1081590#file1081590line77>
> >
> >     Will it result in an alert after an Ambari upgrade? I'm not sure that requiring a DN restart to get rid of an alert is a good idea.
> 
> Alejandro Fernandez wrote:
>     No upgrade is needed to pick up added alert definitions in Ambari 2.1; ambari-server actually loads them from the JSON file on start.
>     The alert checks whether the history file exists, whether the data dirs exist, and whether it's possible for the data dirs to have become unmounted.
>     One way to fix a missing history file or a missing data dir is to restart the DN, but that's not necessarily required.
> 
> Sumit Mohanty wrote:
>     What I meant is that when I upgrade from Ambari 2.1.0 to 2.1.2, the history file will not exist. Will we see a WARN alert?

The history file was added in either Ambari 1.7.0 or 2.0.0, and it is created the first time that DataNode starts.
This means that existing clusters should not see any warnings; warnings only show up during the installation of a brand new cluster.


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38651/#review100084
-----------------------------------------------------------


On Sept. 22, 2015, 10:17 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38651/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2015, 10:17 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Sumit Mohanty, Srimanth Gunturi, and Sid Wagle.
> 
> 
> Bugs: AMBARI-13194
>     https://issues.apache.org/jira/browse/AMBARI-13194
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Ambari uses the dfs.datanode.data.dir.mount.file property in HDFS, whose value is typically /etc/hadoop/conf/dfs_data_dir_mount.hist, to track the mount points for each of the data dirs.
> 
> E.g.,
> {code}
> /hadoop01/data,/device1
> /hadoop02/data,/device2
> /hadoop03/data,/     # this one is on root, the others are all on mount points.
> {code}
> 
> Whenever a drive becomes unmounted, Ambari detects that the data dir was previously on a mount and will not create that data dir; HDFS can still tolerate the failure if dfs.datanode.failed.volumes.tolerated is greater than 0.
> However, if the /etc/hadoop/conf/dfs_data_dir_mount.hist file is deleted, Ambari no longer has this knowledge and will create the data dir (even if it's on the root partition).
> 
> To improve tracking, create an alert definition that checks the following:
> * warning status if the /etc/hadoop/conf/dfs_data_dir_mount.hist file is deleted
> * critical status if at least one of the data dirs is mounted on the root partition, and at least one data dir is on a mount
> 
> 
> Diffs
> -----
> 
>   ambari-common/src/main/python/resource_management/core/providers/system.py 213adc5
>   ambari-common/src/main/python/resource_management/libraries/functions/dfs_datanode_helper.py a05e162
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/alerts.json 477fd95
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/alerts/alert_datanode_unmounted_data_dir.py PRE-CREATION
>   ambari-server/src/test/python/stacks/2.0.6/HDFS/test_alert_datanode_unmounted_data_dir.py PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/38651/diff/
> 
> 
> Testing
> -------
> 
> * Python unit tests passed
> * Verified that the alert worked on several hosts for all 3 types of statuses (WARNING, CRITICAL, OK)
> * Also checked that it did not run on a host without DataNode, and that it did run once I added DataNode to that host
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
>
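
For readers following the thread, the alert logic in the quoted description can be sketched roughly as below. This is only an illustration under stated assumptions, not the code under review in alert_datanode_unmounted_data_dir.py: the function names (parse_mount_history, current_mount_point, classify, check) and the message strings are hypothetical, the file format is inferred from the {code} example in the description, and the sketch takes the list of data dirs from the history file itself, which may differ from what the real script does.

```python
import os

OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


def parse_mount_history(text):
    """Parse 'data_dir,mount_point' lines; '#' starts a comment (assumed format)."""
    mounts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "," not in line:
            continue  # skip blank, comment-only, or malformed lines
        data_dir, mount_point = (part.strip() for part in line.split(",", 1))
        mounts[data_dir] = mount_point
    return mounts


def current_mount_point(path):
    """Walk up from path until a mount point is reached (hypothetical helper)."""
    path = os.path.abspath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def classify(mounts):
    """CRITICAL when data dirs are split between the root partition and other mounts."""
    on_root = sorted(d for d, m in mounts.items() if m == "/")
    on_mount = sorted(d for d, m in mounts.items() if m != "/")
    if on_root and on_mount:
        return CRITICAL, "Data dirs on root while others are on mounts: " + ", ".join(on_root)
    return OK, "All data dirs are consistently placed"


def check(hist_path):
    """WARNING if the history file is missing, otherwise classify the current mounts."""
    if not os.path.isfile(hist_path):
        return WARNING, "Mount history file is missing: " + hist_path
    with open(hist_path) as f:
        recorded = parse_mount_history(f.read())
    # Assumption: the data dirs to inspect are the ones listed in the history file.
    current = {d: current_mount_point(d) for d in recorded}
    return classify(current)
```

Feeding the three example lines from the description into parse_mount_history and then classify yields a CRITICAL status, because /hadoop03/data is on the root partition while the other two data dirs are on mounts.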

