ambari-dev mailing list archives

From "Alejandro Fernandez (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMBARI-12252) Prevent datanode from creating an HDFS datadir when drive becomes unmounted
Date Thu, 02 Jul 2015 04:56:04 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-12252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Fernandez updated AMBARI-12252:
-----------------------------------------
    Description: 
This is related to AMBARI-7506

Ambari maintains a file, /etc/hadoop/conf/dfs_data_dir_mount.hist,
that maps each HDFS data dir to its last known mount point.

This is used to detect when a data dir becomes unmounted, in order to prevent HDFS from writing
to the root partition.

Consider the example of a data node configured with these volumes: 

/dev/sda -> / 
/dev/sdb -> /grid/0
/dev/sdc -> /grid/1
/dev/sdd -> /grid/2

Typically, each /grid/#/ directory contains a data folder.
Today, if a data directory becomes unmounted, the directory will not exist and Ambari
will not create it automatically. Ambari will simply log a warning and update its cache with
the new mount point, which is "/". That cache update is the underlying bug.
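
To make the failure mode concrete, here is a minimal sketch of how the mount point of a
data dir can be resolved. It is not taken from the attached patch; the function name
find_mount_point and the injectable ismount parameter are illustrative assumptions.

```python
import os


def find_mount_point(path, ismount=os.path.ismount):
    """Return the mount point that contains `path`.

    Walks up from `path` until a mount point is found. "/" is always a
    mount point, so the walk terminates there at the latest.
    `ismount` is injectable (hypothetical parameter) so the logic can be
    exercised without real mounts.
    """
    current = os.path.abspath(path)
    while not ismount(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current
```

With the volumes above, a data dir under /grid/1 resolves to /grid/1 while /dev/sdc is
mounted. Once the drive is unmounted the same path resolves to "/", which is the value
that ends up in the cache.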

If hdfs-site contains dfs.datanode.failed.volumes.tolerated with a value > 0, then DataNode
will tolerate the failure (up to that many failed volumes); otherwise, the DataNode will die.

Because Ambari has already recorded "/" in its cache file, the fact that the data dir used
to be mounted on a non-root drive is lost. The next time DataNode is restarted, Ambari will
create the data dir, which now lives on the root partition. This is a serious problem because
HDFS will then fill up the root drive.

The admin can still remount the partition, but then needs to restart DataNode so Ambari can
update its cache.
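
The behavior this issue asks for can be sketched as follows. This is an illustration
under stated assumptions, not the attached patch: the function names are hypothetical, and
an in-memory dict stands in for the history file, whose on-disk format is not described
here.

```python
def should_create_data_dir(data_dir, last_known_mounts, current_mount):
    """Decide whether it is safe to create `data_dir`.

    last_known_mounts: dict of data dir -> mount point recorded earlier
    current_mount: mount point that holds `data_dir` right now
    """
    previous = last_known_mounts.get(data_dir)
    if previous is None:
        # Never seen before: nothing to compare against, so allow creation.
        return True
    if previous != "/" and current_mount == "/":
        # Was on a dedicated drive and now falls on root:
        # the drive has most likely been unmounted.
        return False
    return True


def update_history(data_dir, last_known_mounts, current_mount):
    """Record the mount point, but never overwrite a non-root entry with "/"."""
    previous = last_known_mounts.get(data_dir)
    if previous not in (None, "/") and current_mount == "/":
        return  # keep the evidence that this dir used to be on its own drive
    last_known_mounts[data_dir] = current_mount
```

The key design choice is that a non-root entry is never replaced by "/", so a restart
after an unmount still sees the original mount point and refuses to create the directory
on the root partition.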

The ideal way to fix this in Ambari 2.2 is as follows:
* Track which data dirs the admin wants mounted on a non-root partition. If the admin wishes
all data dirs to be on non-root mounts, but the initial install is incorrect, then this should
be reported as a problem. 
* Keep the history of the mount points in the database. Today, if the cache file is deleted
or the host reimaged, then this information is lost.
* Introduce a new state between FAILED and COMPLETED, such as COMPLETED_WITH_ERRORS, that
will allow tasks to look different in the UI, so the user can clearly detect when a critical
but non-fatal error happened.


> Prevent datanode from creating an HDFS datadir when drive becomes unmounted
> ---------------------------------------------------------------------------
>
>                 Key: AMBARI-12252
>                 URL: https://issues.apache.org/jira/browse/AMBARI-12252
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: AMBARI-12252.branch-2.1.patch, AMBARI-12252.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
