Date: Wed, 15 Jun 2016 13:56:09 +0000 (UTC)
From: "Jonathan Hurley (JIRA)"
To: issues@ambari.apache.org
Reply-To: dev@ambari.apache.org
Subject: [jira] [Updated] (AMBARI-17236) Namenode start step failed during EU with RetriableException

     [ https://issues.apache.org/jira/browse/AMBARI-17236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hurley updated AMBARI-17236:
-------------------------------------
    Attachment: AMBARI-17236.patch

> Namenode start step failed during EU with RetriableException
> ------------------------------------------------------------
>
>                 Key: AMBARI-17236
>                 URL: https://issues.apache.org/jira/browse/AMBARI-17236
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-17236.patch
>
>
> *Steps*
> # Deploy HDP-2.3.4.0 cluster with Ambari 2.2.0.0 (secure, non-HA cluster with custom service users)
> # Upgrade Ambari to 2.4.0.0-644
> # Register HDP-2.4.2.0 and install the bits
> # Start Express Upgrade
> Observed the error below during start of NameNode:
> {code}
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 414, in <module>
>     NameNode().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 257, in execute
>     method(env)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 679, in restart
>     self.start(env, upgrade_type=upgrade_type)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
>     upgrade_suspended=params.upgrade_suspended, env=env)
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 216, in namenode
>     create_hdfs_directories()
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 283, in create_hdfs_directories
>     mode=0777,
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
>     self.env.run()
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
>     self.run_action(resource, action)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
>     provider_action()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 458, in action_create_on_execute
>     self.action_delayed("create")
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 455, in action_delayed
>     self.get_hdfs_resource_executor().action_delayed(action_name, self)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 246, in action_delayed
>     self._assert_valid()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 230, in _assert_valid
>     self.target_status = self._get_file_status(target)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 291, in _get_file_status
>     list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 191, in run_command
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://os-r6-gmcdns-dlm20todgm10sec-r6-5.openstacklocal:50070/webhdfs/v1/tmp?op=GETFILESTATUS&user.name=cstm-hdfs'' returned status_code=403.
> {
>   "RemoteException": {
>     "exception": "RetriableException",
>     "javaClassName": "org.apache.hadoop.ipc.RetriableException",
>     "message": "NameNode still not started"
>   }
> }
> {code}
> So, the heart of this issue is that, depending on topology and upgrade type, we might not wait for NN to be out of Safe Mode after starting. However, we are always creating directories, regardless of topology/upgrade:
> {code}
> # Always run this on non-HA, or active NameNode during HA.
> if is_active_namenode:
>   create_hdfs_directories()
>   create_ranger_audit_hdfs_directories()
> {code}
> NameNode, in Safe Mode, is read-only and would forbid this anyway, even if it didn't throw a retryable exception:
> {code}
> [hdfs@c6403 root]$ hadoop fs -mkdir /foo
> mkdir: Cannot create directory /foo. Name node is in safe mode.
> {code}
> So, it seems like we need to wait for NN to be out of Safe Mode no matter what.
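
To illustrate the conclusion above, here is a minimal sketch of a "wait for Safe Mode to be off" step that could run before any directory creation. It is not the attached patch and does not use Ambari's resource_management API. The function names `wait_for_safemode_off` and `dfsadmin_safemode_status`, the retry count, and the delay are all hypothetical. The only external behavior assumed is that `hdfs dfsadmin -safemode get` prints a line containing "Safe mode is OFF" once the NameNode has left Safe Mode.

```python
import subprocess
import time


def dfsadmin_safemode_status():
    """Return the output of `hdfs dfsadmin -safemode get`, or "" on failure.

    A non-zero exit (for example, NameNode not yet accepting RPC calls) is
    treated as "status unknown" so the caller simply retries.
    """
    try:
        output = subprocess.check_output(
            ["hdfs", "dfsadmin", "-safemode", "get"],
            stderr=subprocess.STDOUT,
        )
    except subprocess.CalledProcessError:
        return ""
    return output.decode("utf-8", "replace")


def wait_for_safemode_off(get_status, retries=30, delay_seconds=10,
                          sleep=time.sleep):
    """Poll get_status() until it reports that Safe Mode is off.

    get_status is any callable returning the status text; it is injected
    so the loop can be exercised without a running cluster.
    Returns True once Safe Mode is off, False if all retries are used up.
    """
    for attempt in range(retries):
        if "Safe mode is OFF" in get_status():
            return True
        # Do not sleep after the final attempt.
        if attempt < retries - 1:
            sleep(delay_seconds)
    return False
```

With something like this in place, the directory-creation branch quoted above would only be entered after `wait_for_safemode_off(dfsadmin_safemode_status)` returns True, regardless of topology or upgrade type. A False result would be reported as a failed start rather than letting the WebHDFS call hit the 403.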