ambari-dev mailing list archives

From "Greg Hill (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-9902) Decommission DATANODE silently fails if in maintenance mode
Date Tue, 03 Mar 2015 17:43:04 GMT
Greg Hill created AMBARI-9902:
---------------------------------

             Summary: Decommission DATANODE silently fails if in maintenance mode
                 Key: AMBARI-9902
                 URL: https://issues.apache.org/jira/browse/AMBARI-9902
             Project: Ambari
          Issue Type: Bug
          Components: ambari-agent
    Affects Versions: 1.7.0
            Reporter: Greg Hill


If you set maintenance mode on multiple hosts and then attempt to decommission the DATANODE on
those hosts, Ambari reports success but does not actually decommission any nodes in HDFS.
This can lead to data loss, as the customer might assume it's safe to remove those hosts
from the pool.

The request looks like:
<noformat>
        "RequestInfo": {
            "command": "DECOMMISSION",
            "context": "Decommission DataNode",
            "parameters": {"slave_type": "DATANODE", "excluded_hosts": "slave-3.local,slave-1.local"},
            "operation_level": {
                "level": "CLUSTER",
                "cluster_name": cluster_name
            }
        },
        "Requests/resource_filters": [{
            "service_name": "HDFS",
            "component_name": "NAMENODE"
        }]
</noformat>
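
For reference, here is a minimal sketch of how such a request might be submitted against the
Ambari REST API. The server URL, cluster name "c1", and admin credentials are placeholders, not
values from this ticket:

<noformat>
import json
import requests

AMBARI = "http://ambari-server:8080"   # placeholder Ambari server URL
CLUSTER = "c1"                         # placeholder cluster name

body = {
    "RequestInfo": {
        "command": "DECOMMISSION",
        "context": "Decommission DataNode",
        "parameters": {
            "slave_type": "DATANODE",
            "excluded_hosts": "slave-3.local,slave-1.local",
        },
        "operation_level": {"level": "CLUSTER", "cluster_name": CLUSTER},
    },
    "Requests/resource_filters": [
        {"service_name": "HDFS", "component_name": "NAMENODE"}
    ],
}

resp = requests.post(
    "%s/api/v1/clusters/%s/requests" % (AMBARI, CLUSTER),
    auth=("admin", "admin"),                # placeholder credentials
    headers={"X-Requested-By": "ambari"},   # header required by the Ambari API
    data=json.dumps(body),
)
resp.raise_for_status()
print(resp.json())
</noformat>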

The task output appears to indicate success:

<noformat>
File['/etc/hadoop/conf/dfs.exclude'] {'owner': 'hdfs', 'content': Template('exclude_hosts_list.j2'),
'group': 'hadoop'}
Execute[''] {'user': 'hdfs'}
ExecuteHadoop['dfsadmin -refreshNodes'] {'bin_dir': '/usr/hdp/current/hadoop-client/bin',
'conf_dir': '/etc/hadoop/conf', 'kinit_override': True, 'user': 'hdfs'}
Execute['hadoop --config /etc/hadoop/conf dfsadmin -refreshNodes'] {'logoutput': False, 'path':
['/usr/hdp/current/hadoop-client/bin'], 'tries': 1, 'user': 'hdfs', 'try_sleep': 0}
</noformat>
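
For context, the resources in that output correspond roughly to the decommission step in the
HDFS service scripts. A simplified sketch (not the actual Ambari source; hard-coded paths and
users are taken from the output above):

<noformat>
from resource_management import File, Template, ExecuteHadoop

def decommission():
    # Render dfs.exclude from the excluded hosts list, then tell the NameNode
    # to re-read it. If the template renders empty, the File resource makes no
    # change and no "Writing File[...]" line appears in the task output.
    File("/etc/hadoop/conf/dfs.exclude",
         content=Template("exclude_hosts_list.j2"),
         owner="hdfs",
         group="hadoop")
    ExecuteHadoop("dfsadmin -refreshNodes",
                  user="hdfs",
                  conf_dir="/etc/hadoop/conf",
                  bin_dir="/usr/hdp/current/hadoop-client/bin",
                  kinit_override=True)
</noformat>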

But it did not actually write any contents to the file.  If it had, this line would have
appeared in the output:

<noformat>
Writing File['/etc/hadoop/conf/dfs.exclude'] because contents don't match
</noformat>

The command json file for the task has the right hosts list as a parameter:

<noformat>
"commandParams": {
        "service_package_folder": "HDP/2.0.6/services/HDFS/package",
        "update_exclude_file_only": "false",
        "script": "scripts/namenode.py",
        "hooks_folder": "HDP/2.0.6/hooks",
        "excluded_hosts": "slave-3.local,slave-1.local",
        "command_timeout": "600",
        "slave_type": "DATANODE",
        "script_type": "PYTHON"
    },
</noformat>

So something external to the task command is filtering the host list.
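
One way to confirm the mismatch on the NameNode host is to compare the excluded_hosts value in
the agent's command JSON with what actually landed in dfs.exclude. A quick diagnostic sketch
(the command JSON filename is hypothetical; the real file sits under /var/lib/ambari-agent/data/):

<noformat>
import json

COMMAND_JSON = "/var/lib/ambari-agent/data/command-42.json"   # hypothetical task id
EXCLUDE_FILE = "/etc/hadoop/conf/dfs.exclude"

with open(COMMAND_JSON) as f:
    cmd = json.load(f)

expected = set(cmd["commandParams"]["excluded_hosts"].split(","))
with open(EXCLUDE_FILE) as f:
    actual = set(line.strip() for line in f if line.strip())

print("requested for exclusion:", sorted(expected))
print("present in dfs.exclude: ", sorted(actual))
print("missing:                ", sorted(expected - actual))
</noformat>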

If maintenance mode is not set, everything works as expected.  I don't believe there's a
legitimate reason to disallow decommissioning nodes that are in maintenance mode, since that
seems to be the expected course of action (set maintenance, decommission, remove) for dealing
with a problematic host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
