Mailing-List: contact dev-help@ambari.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ambari.apache.org
Date: Mon, 11 May 2015 17:47:01 +0000 (UTC)
From: "Greg Hill (JIRA)" <jira@apache.org>
To: dev@ambari.apache.org
Message-ID: <JIRA.12779127.1425404547000.75709.1431366421578@Atlassian.JIRA>
In-Reply-To: <JIRA.12779127.1425404547000@Atlassian.JIRA>
References: <JIRA.12779127.1425404547000@Atlassian.JIRA>
 <JIRA.12779127.1425404547003@arcas>
Subject: [jira] [Resolved] (AMBARI-9902) Decommission DATANODE silently
 fails if in maintenance mode
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/AMBARI-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Hill resolved AMBARI-9902.
-------------------------------
    Resolution: Invalid

I needed to set the operation_level differently for this to work properly.
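
For illustration only, here is a minimal Python sketch of the request quoted below with operation_level made a parameter. The issue does not record which value made it work; the HOST_COMPONENT level and its host_name/service_name fields are assumptions, as are the helper name build_decommission_request and the cluster name "mycluster".

```python
import json


def build_decommission_request(hosts, operation_level):
    """Build the DECOMMISSION request body from the report.

    operation_level is supplied by the caller so different levels can be tried.
    """
    return {
        "RequestInfo": {
            "command": "DECOMMISSION",
            "context": "Decommission DataNode",
            "parameters": {
                "slave_type": "DATANODE",
                "excluded_hosts": ",".join(hosts),
            },
            "operation_level": operation_level,
        },
        "Requests/resource_filters": [
            {"service_name": "HDFS", "component_name": "NAMENODE"},
        ],
    }


# Level used in the original report (cluster name is a placeholder).
cluster_level = {"level": "CLUSTER", "cluster_name": "mycluster"}

# ASSUMPTION: a narrower level; the resolution comment does not say
# which value was actually used.
host_component_level = {
    "level": "HOST_COMPONENT",
    "cluster_name": "mycluster",
    "host_name": "slave-1.local",
    "service_name": "HDFS",
}

body = build_decommission_request(
    ["slave-3.local", "slave-1.local"], host_component_level
)
print(json.dumps(body, indent=2))
```

Keeping operation_level as an argument makes it easy to compare the CLUSTER request from the report against any other level without touching the rest of the body.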

> Decommission DATANODE silently fails if in maintenance mode
> -----------------------------------------------------------
>
>                 Key: AMBARI-9902
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9902
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>            Reporter: Greg Hill
>
> If you set maintenance mode on multiple hosts, then attempt to decommission the DATANODE on those hosts, it says that it succeeded but it did not actually decommission any nodes in HDFS.  This can lead to data loss as the customer might assume that it's safe to remove those hosts from the pool.
> The request looks like:
> {noformat}
>          "RequestInfo": {
>                 "command": "DECOMMISSION",
>                 "context": "Decommission DataNode",
>                 "parameters": {"slave_type": "DATANODE", "excluded_hosts": "slave-3.local,slave-1.local"},
>                 "operation_level": {
>                     "level": "CLUSTER",
>                     "cluster_name": cluster_name
>                 },
>             },
>             "Requests/resource_filters": [{
>                 "service_name": "HDFS",
>                 "component_name": "NAMENODE",
>             }],
> {noformat}
> The task output appears to work:
> {noformat}
> File['/etc/hadoop/conf/dfs.exclude'] {'owner': 'hdfs', 'content': Template('exclude_hosts_list.j2'), 'group': 'hadoop'}
> Execute[''] {'user': 'hdfs'}
> ExecuteHadoop['dfsadmin -refreshNodes'] {'bin_dir': '/usr/hdp/current/hadoop-client/bin', 'conf_dir': '/etc/hadoop/conf', 'kinit_override': True, 'user': 'hdfs'}
> Execute['hadoop --config /etc/hadoop/conf dfsadmin -refreshNodes'] {'logoutput': False, 'path': ['/usr/hdp/current/hadoop-client/bin'], 'tries': 1, 'user': 'hdfs', 'try_sleep': 0}
> {noformat}
> But it didn't actually write any contents to the file.  If it had, this line would have been in there:
> {noformat}
> Writing File['/etc/hadoop/conf/dfs.exclude'] because contents don't match
> {noformat}
> The command json file for the task has the right hosts list as a parameter:
> {noformat}
> "commandParams": {
>         "service_package_folder": "HDP/2.0.6/services/HDFS/package",
>         "update_exclude_file_only": "false",
>         "script": "scripts/namenode.py",
>         "hooks_folder": "HDP/2.0.6/hooks",
>         "excluded_hosts": "slave-3.local,slave-1.local",
>         "command_timeout": "600",
>         "slave_type": "DATANODE",
>         "script_type": "PYTHON"
>     },
> {noformat}
> So something is filtering the list external to that.
> If maintenance mode was not set, everything works as expected.  I don't believe there's a legitimate reason to disallow decommissioning nodes in maintenance mode, as that seems to be the expected course of action (set maintenance, decommission, remove) for dealing with a problematic host.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)