ambari-dev mailing list archives

From "Greg Hill (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (AMBARI-9902) Decommission DATANODE silently fails if in maintenance mode
Date Mon, 11 May 2015 17:47:01 GMT

     [ https://issues.apache.org/jira/browse/AMBARI-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Hill resolved AMBARI-9902.
-------------------------------
    Resolution: Invalid

I needed to set the operation_level differently for this to work properly.
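
For reference, a minimal sketch of the adjusted operation_level follows. The resolution doesn't record the exact values, so the HOST_COMPONENT level and the host_name placeholder below are assumptions about what "differently" means here, not a confirmed fix; the rest of the request (command, parameters, resource_filters) would stay as in the report quoted below.

{noformat}
"operation_level": {
    "level": "HOST_COMPONENT",
    "cluster_name": cluster_name,
    "host_name": host_name,
    "service_name": "HDFS"
},
{noformat}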

> Decommission DATANODE silently fails if in maintenance mode
> -----------------------------------------------------------
>
>                 Key: AMBARI-9902
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9902
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>            Reporter: Greg Hill
>
> If you set maintenance mode on multiple hosts and then attempt to decommission the DATANODE
> on those hosts, Ambari reports that the decommission succeeded, but it does not actually
> decommission any nodes in HDFS. This can lead to data loss, as the customer might assume it
> is safe to remove those hosts from the pool.
> The request looks like:
> {noformat}
>          "RequestInfo": {
>                 "command": "DECOMMISSION",
>                 "context": "Decommission DataNode",
>                 "parameters": {"slave_type": "DATANODE", "excluded_hosts": "slave-3.local,slave-1.local"},
>                 "operation_level": {
>                     "level": "CLUSTER",
>                     "cluster_name": cluster_name
>                 }
>             },
>             "Requests/resource_filters": [{
>                 "service_name": "HDFS",
>                 "component_name": "NAMENODE"
>             }],
> {noformat}
> The task output looks like it succeeded:
> {noformat}
> File['/etc/hadoop/conf/dfs.exclude'] {'owner': 'hdfs', 'content': Template('exclude_hosts_list.j2'), 'group': 'hadoop'}
> Execute[''] {'user': 'hdfs'}
> ExecuteHadoop['dfsadmin -refreshNodes'] {'bin_dir': '/usr/hdp/current/hadoop-client/bin', 'conf_dir': '/etc/hadoop/conf', 'kinit_override': True, 'user': 'hdfs'}
> Execute['hadoop --config /etc/hadoop/conf dfsadmin -refreshNodes'] {'logoutput': False, 'path': ['/usr/hdp/current/hadoop-client/bin'], 'tries': 1, 'user': 'hdfs', 'try_sleep': 0}
> {noformat}
> But it didn't actually write any contents to the file. If it had, this line would have
> appeared in the task output:
> {noformat}
> Writing File['/etc/hadoop/conf/dfs.exclude'] because contents don't match
> {noformat}
> The command JSON file for the task has the correct host list as a parameter:
> {noformat}
> "commandParams": {
>         "service_package_folder": "HDP/2.0.6/services/HDFS/package",
>         "update_exclude_file_only": "false",
>         "script": "scripts/namenode.py",
>         "hooks_folder": "HDP/2.0.6/hooks",
>         "excluded_hosts": "slave-3.local,slave-1.local",
>         "command_timeout": "600",
>         "slave_type": "DATANODE",
>         "script_type": "PYTHON"
>     },
> {noformat}
> So something external to the command parameters is filtering the list.
> If maintenance mode is not set, everything works as expected. I don't believe there is a
> legitimate reason to disallow decommissioning nodes in maintenance mode, since the sequence
> (set maintenance, decommission, remove) seems to be the expected course of action for
> dealing with a problematic host.
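
For context, the workflow described in the report (set maintenance, decommission, remove) maps roughly to the following REST calls. This is a sketch only; the endpoints and payloads are assumptions based on the Ambari v1 API and are not part of the original report.

{noformat}
# 1. Put the host into maintenance mode
PUT /api/v1/clusters/<cluster>/hosts/<host>
{"Hosts": {"maintenance_state": "ON"}}

# 2. Decommission the DataNode (the DECOMMISSION request shown above)
POST /api/v1/clusters/<cluster>/requests

# 3. Remove the host once its components have been stopped and deleted
DELETE /api/v1/clusters/<cluster>/hosts/<host>
{noformat}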



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
