ambari-issues mailing list archives

From "Jonathan Hurley (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-18262) When Enabling NameNode HA Via the UI Wizard, the Second NN Fails to Start
Date Thu, 25 Aug 2016 16:18:21 GMT
Jonathan Hurley created AMBARI-18262:
----------------------------------------

             Summary: When Enabling NameNode HA Via the UI Wizard, the Second NN Fails to Start
                 Key: AMBARI-18262
                 URL: https://issues.apache.org/jira/browse/AMBARI-18262
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 2.4.0
            Reporter: Jonathan Hurley
            Assignee: Jonathan Hurley
            Priority: Blocker
             Fix For: trunk


Caused by: AMBARI-18240

In the Enable NameNode HA wizard, the failure occurs at the "Start Additional NameNode" step.

The first NameNode starts...

{code}
 "href" : "https://172.22.115.113:8443/api/v1/clusters/cl1/requests/46/tasks/368",
  "Tasks" : {
    "attempt_cnt" : 1,
    "cluster_name" : "cl1",
    "command" : "START",
    "command_detail" : "NAMENODE START",
    "end_time" : 1472080011602,
    "error_log" : "/var/lib/ambari-agent/data/errors-368.txt",
    "exit_code" : 0,
    "host_name" : "nat-sp12-rnqs-amb-views-ha-6-5.openstacklocal",
    "id" : 368,
    "output_log" : "/var/lib/ambari-agent/data/output-368.txt",
    "request_id" : 46,
    "role" : "NAMENODE",
    "stage_id" : 0,
    "start_time" : 1472079963470,
    "status" : "COMPLETED",
    "stderr" : "2016-08-24 23:06:11,102 - Getting jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-5.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
(most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
line 42, in get_value_from_jmx\n    return data_dict[\"beans\"][0][property]\nIndexError:
list index out of range\n2016-08-24 23:06:14,332 - Getting jmx metrics from NN failed. URL:
http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
(most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user,
quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate
-u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
1>/tmp/tmprdewEy 2>/tmp/tmpAmLket' returned 7. \n\n2016-08-24 23:06:22,280 - Getting
jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
(most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user,
quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate
-u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
1>/tmp/tmpHKH50b 2>/tmp/tmp6yyuWH' returned 7. \n\n2016-08-24 23:06:30,637 - Getting
jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
(most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user,
quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate
-u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
1>/tmp/tmpCXMjfH 2>/tmp/tmpq103ei' returned 7. \n\n2016-08-24 23:06:39,495 - Getting
jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
(most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user,
quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate
-u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
1>/tmp/tmpvdE9iJ 2>/tmp/tmpy9eAby' returned 7. \n\n2016-08-24 23:06:47,584 - Getting
jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
(most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user,
quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate
-u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
1>/tmp/tmp0Jx91E 2>/tmp/tmp6qu0gW' returned 7.",
{code}

The second does not:
{code}
{
  "href" : "https://172.22.115.113:8443/api/v1/clusters/cl1/requests/47/tasks/369",
  "Tasks" : {
    "attempt_cnt" : 1,
    "cluster_name" : "cl1",
    "command" : "START",
    "command_detail" : "NAMENODE START",
    "end_time" : 1472080160611,
    "error_log" : "/var/lib/ambari-agent/data/errors-369.txt",
    "exit_code" : 1,
    "host_name" : "nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal",
    "id" : 369,
    "output_log" : "/var/lib/ambari-agent/data/output-369.txt",
    "request_id" : 47,
    "role" : "NAMENODE",
    "stage_id" : 0,
    "start_time" : 1472080026015,
    "status" : "FAILED",
    "stderr" : "2016-08-24 23:07:13,642 - Getting jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
(most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
line 42, in get_value_from_jmx\n    return data_dict[\"beans\"][0][property]\nIndexError:
list index out of range\nTraceback (most recent call last):\n  File \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py\",
line 420, in <module>\n    NameNode().execute()\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py\",
line 280, in execute\n    method(env)\n  File \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py\",
line 101, in start\n    upgrade_suspended=params.upgrade_suspended, env=env)\n  File \"/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py\",
line 89, in thunk\n    return fn(*args, **kwargs)\n  File \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py\",
line 184, in namenode\n    if is_this_namenode_active() is False:\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py\",
line 55, in wrapper\n    return function(*args, **kwargs)\n  File \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py\",
line 549, in is_this_namenode_active\n    raise Fail(format(\"The NameNode {namenode_id} is
not listed as Active or Standby, waiting...\"))\nresource_management.core.exceptions.Fail:
The NameNode nn2 is not listed as Active or Standby, waiting...",
{code}

When the UI enables NN HA, it first starts NN1 and then NN2. At this stage both NNs are in 'standby'
mode. The active node is elected only later (I believe when ZKFC is installed and started),
so I think the second NN start should not fail when no active NameNode is found:

1st NN start:
{code:title=nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal}
2016-08-24 23:08:20,037 - NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn1',
'nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal:50070')], unknown_namenodes = [(u'nn2', 'nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal:50070')]
2016-08-24 23:08:20,037 - No active NameNode was found after 5 retries. Will return current
NameNode HA states
2016-08-24 23:08:20,037 - Skipping Safemode check due to the following conditions: HA: True,
isActive: False, upgradeType: None
2016-08-24 23:08:20,037 - Skipping creation of HDFS directories since this is either not the
Active NameNode or we did not wait for Safemode to finish.

Command completed successfully!
{code}

2nd NN start:
{code:title=nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal}
2016-08-24 23:10:51,011 - NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn1',
'nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal:50070'), (u'nn2', 'nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal:50070')],
unknown_namenodes = []
2016-08-24 23:10:51,012 - No active NameNode was found after 5 retries. Will return current
NameNode HA states

Command failed after 1 tries
{code}

Since the 2nd NN start failed, the wizard does not continue with installing ZKFC and the rest of
the steps.
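
The behavior suggested above could be sketched roughly as follows. This is only an illustration of the proposed rule, not the actual Ambari code or the eventual fix: the function name should_fail_namenode_start and its arguments are hypothetical, and the lists of (id, address) tuples simply mirror the "NameNode HA states" log lines quoted earlier. The idea is that a start fails only when the NameNode being started is not reported as Active or Standby, and an empty active list on its own is not treated as an error.

```python
def should_fail_namenode_start(namenode_id, active, standby, unknown):
    """Decide whether a NameNode START command should be reported as failed.

    Hypothetical helper, not part of Ambari. Each of `active`, `standby`
    and `unknown` is a list of (namenode_id, address) tuples, in the same
    shape as the "NameNode HA states" log line.
    """
    active_ids = set(nn_id for nn_id, _ in active)
    standby_ids = set(nn_id for nn_id, _ in standby)

    # The NameNode came up and reports an HA state: treat the start as
    # successful even if no active NameNode has been elected yet
    # (assumption: ZKFC is installed and started in a later wizard step).
    if namenode_id in active_ids or namenode_id in standby_ids:
        return False

    # The NameNode is unknown or missing from every list: it did not
    # report a usable state, so the start should fail.
    return True


if __name__ == "__main__":
    nn1 = (u"nn1", "nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal:50070")
    nn2 = (u"nn2", "nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal:50070")

    # 1st NN start: nn1 is standby, nn2 is still unknown.
    print(should_fail_namenode_start(u"nn1", [], [nn1], [nn2]))
    # 2nd NN start: both are standby and no active NameNode exists yet.
    print(should_fail_namenode_start(u"nn2", [], [nn1, nn2], []))
```

With the states from the two log excerpts, both calls return False, so the wizard would be able to proceed to the ZKFC steps.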

 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
