ambari-dev mailing list archives

From "Greg Hill (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMBARI-9902) Decommission DATANODE silently fails if in maintenance mode
Date Tue, 03 Mar 2015 17:49:04 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Hill updated AMBARI-9902:
------------------------------
    Description: 
If you set maintenance mode on multiple hosts and then attempt to decommission the DATANODE on
those hosts, Ambari reports that the operation succeeded, but it does not actually decommission
any nodes in HDFS. This can lead to data loss, as the customer might assume it is safe to remove
those hosts from the pool.

The request looks like:
{noformat}
         "RequestInfo": {
                "command": "DECOMMISSION",
                "context": "Decommission DataNode”,
                "parameters": {"slave_type": “DATANODE", "excluded_hosts": “slave-3.local,slave-1.local"},
                "operation_level": {
                    “level”: “CLUSTER”,
                    “cluster_name”: cluster_name
                },
            },
            "Requests/resource_filters": [{
                "service_name": “HDFS",
                "component_name": “NAMENODE",
            }],
{noformat}
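
For reference, this payload is submitted as a POST to the cluster's requests endpoint. The
following is a minimal sketch of that call; the server address, credentials, and cluster name
are placeholders, not values taken from this report:

{noformat}
# Sketch: submit the DECOMMISSION request shown above via Ambari's REST API.
# Server address, credentials and cluster name below are placeholders.
import json
import requests

AMBARI = 'http://ambari-server.local:8080'   # placeholder server address
AUTH = ('admin', 'admin')                    # placeholder credentials
HEADERS = {'X-Requested-By': 'ambari'}
CLUSTER = 'cluster_name'

body = {
    'RequestInfo': {
        'command': 'DECOMMISSION',
        'context': 'Decommission DataNode',
        'parameters': {
            'slave_type': 'DATANODE',
            'excluded_hosts': 'slave-3.local,slave-1.local',
        },
        'operation_level': {'level': 'CLUSTER', 'cluster_name': CLUSTER},
    },
    'Requests/resource_filters': [
        {'service_name': 'HDFS', 'component_name': 'NAMENODE'},
    ],
}

resp = requests.post('%s/api/v1/clusters/%s/requests' % (AMBARI, CLUSTER),
                     auth=AUTH, headers=HEADERS, data=json.dumps(body))
resp.raise_for_status()
print(resp.json()['Requests']['id'])   # request id, useful for polling task status
{noformat}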

The task output makes it look as though it worked:

{noformat}
File['/etc/hadoop/conf/dfs.exclude'] {'owner': 'hdfs', 'content': Template('exclude_hosts_list.j2'),
'group': 'hadoop'}
Execute[''] {'user': 'hdfs'}
ExecuteHadoop['dfsadmin -refreshNodes'] {'bin_dir': '/usr/hdp/current/hadoop-client/bin',
'conf_dir': '/etc/hadoop/conf', 'kinit_override': True, 'user': 'hdfs'}
Execute['hadoop --config /etc/hadoop/conf dfsadmin -refreshNodes'] {'logoutput': False, 'path':
['/usr/hdp/current/hadoop-client/bin'], 'tries': 1, 'user': 'hdfs', 'try_sleep': 0}
{noformat}
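
In plain terms, those two resources amount to writing the exclusion list and asking the NameNode
to re-read it. A plain-Python equivalent is sketched below for illustration only; the real agent
code goes through the resource_management DSL shown in the log:

{noformat}
# Illustration only: plain-Python equivalent of the two resources in the
# task output above. The real agent code uses the resource_management DSL.
import subprocess

EXCLUDE_FILE = '/etc/hadoop/conf/dfs.exclude'
HOSTS = ['slave-3.local', 'slave-1.local']

# 1. File[...] with Template('exclude_hosts_list.j2'): write the hosts that
#    should be decommissioned into dfs.exclude.
with open(EXCLUDE_FILE, 'w') as f:
    f.write('\n'.join(HOSTS) + '\n')

# 2. ExecuteHadoop['dfsadmin -refreshNodes']: tell the NameNode to re-read
#    the exclude file and begin decommissioning those DataNodes.
subprocess.check_call(
    ['sudo', '-u', 'hdfs',
     '/usr/hdp/current/hadoop-client/bin/hadoop',
     '--config', '/etc/hadoop/conf',
     'dfsadmin', '-refreshNodes'])
{noformat}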

But it did not actually write any content to the file. If it had, this line would have appeared
in the output:

{noformat}
Writing File['/etc/hadoop/conf/dfs.exclude'] because contents don't match
{noformat}
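
One quick way to confirm the failure on the NameNode host is to compare dfs.exclude against the
hosts that were requested; a hypothetical check (the path matches the task output above):

{noformat}
# Hypothetical check on the NameNode host: did the requested hosts actually
# end up in dfs.exclude?
expected = {'slave-1.local', 'slave-3.local'}

with open('/etc/hadoop/conf/dfs.exclude') as f:
    actual = {line.strip() for line in f if line.strip()}

print('missing from dfs.exclude:', sorted(expected - actual))
{noformat}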

The command JSON file for the task has the correct host list as a parameter:

{noformat}
"commandParams": {
        "service_package_folder": "HDP/2.0.6/services/HDFS/package",
        "update_exclude_file_only": "false",
        "script": "scripts/namenode.py",
        "hooks_folder": "HDP/2.0.6/hooks",
        "excluded_hosts": "slave-3.local,slave-1.local",
        "command_timeout": "600",
        "slave_type": "DATANODE",
        "script_type": "PYTHON"
    },
{noformat}
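
This can be confirmed on the agent host by inspecting the command files the agent received for
the task; a hypothetical check (the data directory below is the agent's usual default, adjust if
yours differs):

{noformat}
# Hypothetical check on the agent host: print the exclusion-related
# parameters from each DATANODE decommission command the agent received.
# /var/lib/ambari-agent/data is the usual agent data directory.
import glob
import json

for path in sorted(glob.glob('/var/lib/ambari-agent/data/command-*.json')):
    with open(path) as f:
        cmd = json.load(f)
    params = cmd.get('commandParams', {})
    if params.get('slave_type') == 'DATANODE':
        print(path, params.get('excluded_hosts'))
{noformat}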

So something external to the command JSON is filtering the list.

If maintenance mode is not set, everything works as expected. I don't believe there is a
legitimate reason to disallow decommissioning nodes that are in maintenance mode, since that
sequence (set maintenance, decommission, remove) seems to be the expected course of action for
dealing with a problematic host.
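
For completeness, the maintenance-mode step that triggers the failure can be reproduced with a
call like the one below; the endpoint follows the standard host resource API, and the server
address and credentials are placeholders:

{noformat}
# Sketch: put the hosts into maintenance mode before issuing the
# DECOMMISSION request shown earlier. Server address and credentials are
# placeholders, not values from this report.
import json
import requests

AMBARI = 'http://ambari-server.local:8080'   # placeholder server address
AUTH = ('admin', 'admin')                    # placeholder credentials
HEADERS = {'X-Requested-By': 'ambari'}
CLUSTER = 'cluster_name'

body = {
    'RequestInfo': {'context': 'Turn on maintenance mode'},
    'Body': {'Hosts': {'maintenance_state': 'ON'}},
}

for host in ('slave-1.local', 'slave-3.local'):
    r = requests.put('%s/api/v1/clusters/%s/hosts/%s' % (AMBARI, CLUSTER, host),
                     auth=AUTH, headers=HEADERS, data=json.dumps(body))
    r.raise_for_status()
{noformat}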


> Decommission DATANODE silently fails if in maintenance mode
> -----------------------------------------------------------
>
>                 Key: AMBARI-9902
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9902
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>            Reporter: Greg Hill
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
