ambari-user mailing list archives

From Suraj Nayak M <snay...@gmail.com>
Subject Re: All processes are waiting during Cluster install
Date Sun, 13 Jul 2014 19:36:42 GMT
Hi Sumit,

By "I restarted the process" I meant that I restarted the deployment 
from the UI (using the Retry button in the browser).

You were right. Task 10 was stuck at the *mysql-connector-java* 
installation :)

2014-07-13 20:05:32,755 - Repository['HDP-2.1'] {'action': ['create'], 
'mirror_list': None, 'base_url': 
'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.3.0', 
'components': ['HDP', 'main'], 'repo_file_name': 'HDP'}
2014-07-13 20:05:32,761 - File['/etc/yum.repos.d/HDP.repo'] {'content': 
InlineTemplate(...)}
2014-07-13 20:05:32,762 - Package['hive'] {}
2014-07-13 20:05:32,780 - Installing package hive ('/usr/bin/yum -d 0 -e 
0 -y install hive')
2014-07-13 20:08:32,772 - Package['mysql-connector-java'] {}
2014-07-13 20:08:32,802 - Installing package mysql-connector-java 
('/usr/bin/yum -d 0 -e 0 -y install mysql-connector-java')
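A hedged aside (my own addition, not from the thread): you can spot 
long-running or hung steps in these agent logs by diffing consecutive 
timestamps. In the excerpt above, the three-minute jump from the hive 
line to the mysql-connector-java line shows where the time went. A 
minimal sketch over a copy of the excerpt:

```shell
# Flag log steps whose next line arrived more than 60 s later, plus the
# step still in flight at the end of the file. Assumes all timestamps
# fall on the same day (good enough for a quick look).
cat > ./excerpt.log <<'EOF'
2014-07-13 20:05:32,780 - Installing package hive ('/usr/bin/yum -d 0 -e 0 -y install hive')
2014-07-13 20:08:32,772 - Package['mysql-connector-java'] {}
2014-07-13 20:08:32,802 - Installing package mysql-connector-java ('/usr/bin/yum -d 0 -e 0 -y install mysql-connector-java')
EOF

awk -F'[ ,]' '
{
  split($2, t, ":")                      # $2 is HH:MM:SS
  secs = t[1]*3600 + t[2]*60 + t[3]
  if (NR > 1 && secs - prev_secs > 60)
    printf "%ds gap after: %s\n", secs - prev_secs, prev_line
  prev_secs = secs
  prev_line = $0
}
END { print "in flight: " prev_line }
' ./excerpt.log | tee ./gaps.txt
```

Point it at a real /var/lib/ambari-agent/data/output-NNN.txt instead of 
the sample to see which yum install the agent was sitting on.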

I have also noticed that when the network is slow, the install 
succeeds for some components and fails for others. On retry (from the 
UI), the install continues from the point of failure, and the 
previously failed component succeeds. The cycle then repeats until all 
the components are installed. Is there any way I can increase the 
timeout of the Python script? Or could Ambari handle the following 
condition:

  "/*If the error is due to a Python script timeout, restart the process*/" ?
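For what it's worth, the agent task timeout can, as far as I know, be 
raised via the agent.task.timeout property (in seconds) in 
/etc/ambari-server/conf/ambari.properties on the server; the property 
name and default may differ between Ambari versions, so treat this as 
an assumption to verify. A sketch, using a stand-in file so it is 
self-contained:

```shell
# Raise the Ambari agent task timeout. The property name
# agent.task.timeout is an assumption; verify it for your Ambari version.
PROPS=./ambari.properties    # real path: /etc/ambari-server/conf/ambari.properties

# Stand-in file for illustration; on a real server the file already exists.
printf 'agent.task.timeout=900\n' > "$PROPS"

# Bump the timeout from 900 s to 1800 s, adding the line if absent.
if grep -q '^agent.task.timeout=' "$PROPS"; then
  sed -i 's/^agent.task.timeout=.*/agent.task.timeout=1800/' "$PROPS"   # GNU sed
else
  echo 'agent.task.timeout=1800' >> "$PROPS"
fi

grep '^agent.task.timeout=' "$PROPS"
# On a real server, follow with: ambari-server restart
```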

The network was slow for some reason. The installation failed and the 
error below was displayed (screenshot attached).

*Details of the error:*

*ERROR:* Python script has been killed due to timeout.

The file */var/lib/ambari-agent/data/errors-181.txt* does not contain 
any data.

Content of */var/lib/ambari-agent/data/output-181.txt*:

2014-07-14 00:07:01,673 - Package['unzip'] {}
2014-07-14 00:07:01,770 - Skipping installing existent package unzip
2014-07-14 00:07:01,772 - Package['curl'] {}
2014-07-14 00:07:01,872 - Skipping installing existent package curl
2014-07-14 00:07:01,874 - Package['net-snmp-utils'] {}
2014-07-14 00:07:01,966 - Skipping installing existent package 
net-snmp-utils
2014-07-14 00:07:01,967 - Package['net-snmp'] {}
2014-07-14 00:07:02,060 - Skipping installing existent package net-snmp
2014-07-14 00:07:02,064 - Group['hadoop'] {}
2014-07-14 00:07:02,069 - Modifying group hadoop
2014-07-14 00:07:02,141 - Group['users'] {}
2014-07-14 00:07:02,142 - Modifying group users
2014-07-14 00:07:02,222 - Group['users'] {}
2014-07-14 00:07:02,224 - Modifying group users
2014-07-14 00:07:02,306 - User['ambari-qa'] {'gid': 'hadoop', 'groups': 
[u'users']}
2014-07-14 00:07:02,307 - Modifying user ambari-qa
2014-07-14 00:07:02,380 - File['/tmp/changeUid.sh'] {'content': 
StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2014-07-14 00:07:02,385 - Execute['/tmp/changeUid.sh ambari-qa 
/tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa

2>/dev/null'] {'not_if': 'test $(id -u ambari-qa) -gt 1000'}
2014-07-14 00:07:02,454 - Skipping Execute['/tmp/changeUid.sh ambari-qa 
/tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa

2>/dev/null'] due to not_if
2014-07-14 00:07:02,456 - User['hbase'] {'gid': 'hadoop', 'groups': 
[u'hadoop']}
2014-07-14 00:07:02,456 - Modifying user hbase
2014-07-14 00:07:02,528 - File['/tmp/changeUid.sh'] {'content': 
StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2014-07-14 00:07:02,531 - Execute['/tmp/changeUid.sh hbase 
/home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/hadoop/hbase 
2>/dev/null'] {'not_if': 'test $(id -u hbase) -gt 1000'}
2014-07-14 00:07:02,600 - Skipping Execute['/tmp/changeUid.sh hbase 
/home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/hadoop/hbase 
2>/dev/null'] due to not_if
2014-07-14 00:07:02,602 - Group['nagios'] {}
2014-07-14 00:07:02,602 - Modifying group nagios
2014-07-14 00:07:02,687 - User['nagios'] {'gid': 'nagios'}
2014-07-14 00:07:02,689 - Modifying user nagios
2014-07-14 00:07:02,757 - User['oozie'] {'gid': 'hadoop'}
2014-07-14 00:07:02,758 - Modifying user oozie
2014-07-14 00:07:02,826 - User['hcat'] {'gid': 'hadoop'}
2014-07-14 00:07:02,828 - Modifying user hcat
2014-07-14 00:07:02,897 - User['hcat'] {'gid': 'hadoop'}
2014-07-14 00:07:02,898 - Modifying user hcat
2014-07-14 00:07:02,964 - User['hive'] {'gid': 'hadoop'}
2014-07-14 00:07:02,965 - Modifying user hive
2014-07-14 00:07:03,032 - User['yarn'] {'gid': 'hadoop'}
2014-07-14 00:07:03,034 - Modifying user yarn
2014-07-14 00:07:03,099 - Group['nobody'] {}
2014-07-14 00:07:03,100 - Modifying group nobody
2014-07-14 00:07:03,178 - Group['nobody'] {}
2014-07-14 00:07:03,179 - Modifying group nobody
2014-07-14 00:07:03,260 - User['nobody'] {'gid': 'hadoop', 'groups': 
[u'nobody']}
2014-07-14 00:07:03,261 - Modifying user nobody
2014-07-14 00:07:03,330 - User['nobody'] {'gid': 'hadoop', 'groups': 
[u'nobody']}
2014-07-14 00:07:03,332 - Modifying user nobody
2014-07-14 00:07:03,401 - User['hdfs'] {'gid': 'hadoop', 'groups': 
[u'hadoop']}
2014-07-14 00:07:03,403 - Modifying user hdfs
2014-07-14 00:07:03,471 - User['mapred'] {'gid': 'hadoop', 'groups': 
[u'hadoop']}
2014-07-14 00:07:03,473 - Modifying user mapred
2014-07-14 00:07:03,544 - User['zookeeper'] {'gid': 'hadoop'}
2014-07-14 00:07:03,545 - Modifying user zookeeper
2014-07-14 00:07:03,616 - User['storm'] {'gid': 'hadoop', 'groups': 
[u'hadoop']}
2014-07-14 00:07:03,618 - Modifying user storm
2014-07-14 00:07:03,688 - User['falcon'] {'gid': 'hadoop', 'groups': 
[u'hadoop']}
2014-07-14 00:07:03,689 - Modifying user falcon
2014-07-14 00:07:03,758 - User['tez'] {'gid': 'hadoop', 'groups': 
[u'users']}
2014-07-14 00:07:03,760 - Modifying user tez
2014-07-14 00:07:04,073 - Repository['HDP-2.1'] {'action': ['create'], 
'mirror_list': None, 'base_url': 
'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.3.0', 
'components': ['HDP', 'main'], 'repo_file_name': 'HDP'}
2014-07-14 00:07:04,084 - File['/etc/yum.repos.d/HDP.repo'] {'content': 
InlineTemplate(...)}
2014-07-14 00:07:04,086 - Package['oozie'] {}
2014-07-14 00:07:04,177 - Installing package oozie ('/usr/bin/yum -d 0 
-e 0 -y install oozie')
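Since errors-181.txt was empty, the last line of output-181.txt (the 
oozie yum install above) is the step that was in flight when the 
timeout killed the script. A quick way to check that for every task 
log on a host (a sample file is created here so the loop has something 
to read; the real files live under /var/lib/ambari-agent/data/):

```shell
# Show the last logged step of each Ambari task output file on this host.
DATA_DIR=./data              # real path: /var/lib/ambari-agent/data
mkdir -p "$DATA_DIR"

# Sample file for illustration only.
printf '%s\n' \
  "2014-07-14 00:07:04,086 - Package['oozie'] {}" \
  "2014-07-14 00:07:04,177 - Installing package oozie ('/usr/bin/yum -d 0 -e 0 -y install oozie')" \
  > "$DATA_DIR/output-181.txt"

for f in "$DATA_DIR"/output-*.txt; do
  echo "== $f =="
  tail -n 1 "$f"             # the step in flight when the script was killed
done
```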

--
Suraj Nayak

On Sunday 13 July 2014 09:13 PM, Sumit Mohanty wrote:
> By "I restarted the process" do you mean that you restarted the 
> installation?
>
> Can you share the command logs for tasks (e.g. 10, 42, 58, etc.)? 
> These would help debug why the tasks are still active.
>
> If you look at the past requests in the Ambari UI (top left), the 
> task-specific UI will show you the hosts and the local file names on 
> each host. The files are named 
> /var/lib/ambari-agent/data/output-10.txt and 
> /var/lib/ambari-agent/data/errors-10.txt for task id 10.
>
> What I can surmise from the above is that the agents are still stuck 
> executing the older tasks, so they cannot execute the new commands 
> sent by the Ambari server when you retried the installation. I 
> suggest looking at the command logs to see why they are stuck. 
> Restarting the Ambari server may not help; you may need to restart 
> the agents if they are stuck executing tasks.
>
> -Sumit
>
>
> On Sun, Jul 13, 2014 at 8:00 AM, Suraj Nayak M <snayakm@gmail.com 
> <mailto:snayakm@gmail.com>> wrote:
>
>     Hi,
>
>     I am trying to install HDP 2.1 using Ambari on 4 nodes: 2
>     NameNodes and 2 slaves. The install failed due to a Python script
>     timeout. I restarted the process. For the past 2 hrs there has
>     been no progress in the installation. Is it safe to kill the
>     Ambari server and restart the process? How can I terminate the
>     ongoing process in Ambari gracefully?
>
>     Below is the tail of the Ambari server logs.
>
>     20:12:08,530  WARN [qtp527311109-183] HeartBeatHandler:369 -
>     Operation failed - may be retried. Service component host:
>     HIVE_CLIENT, host: slave2.hdp.somedomain.com
>     <http://slave2.hdp.somedomain.com> Action id1-1
>     20:12:08,530  INFO [qtp527311109-183] HeartBeatHandler:375 -
>     Received report for a command that is no longer active.
>     CommandReport{role='HIVE_CLIENT', actionId='1-1', status='FAILED',
>     exitCode=999, clusterName='HDP2_CLUSTER1', serviceName='HIVE',
>     taskId=57, roleCommand=INSTALL, configurationTags=null,
>     customCommand=null}
>     20:12:08,530  WARN [qtp527311109-183] ActionManager:143 - The task
>     57 is not in progress, ignoring update
>     20:12:08,966  WARN [qtp527311109-183] ActionManager:143 - The task
>     26 is not in progress, ignoring update
>     20:12:12,319  WARN [qtp527311109-183] ActionManager:143 - The task
>     58 is not in progress, ignoring update
>     20:12:12,605  WARN [qtp527311109-183] ActionManager:143 - The task
>     42 is not in progress, ignoring update
>     20:12:14,872  WARN [qtp527311109-183] ActionManager:143 - The task
>     10 is not in progress, ignoring update
>     20:12:19,039  WARN [qtp527311109-184] ActionManager:143 - The task
>     26 is not in progress, ignoring update
>     20:12:22,382  WARN [qtp527311109-183] ActionManager:143 - The task
>     58 is not in progress, ignoring update
>     20:12:22,655  WARN [qtp527311109-183] ActionManager:143 - The task
>     42 is not in progress, ignoring update
>     20:12:24,919  WARN [qtp527311109-184] ActionManager:143 - The task
>     10 is not in progress, ignoring update
>     20:12:29,086  WARN [qtp527311109-184] ActionManager:143 - The task
>     26 is not in progress, ignoring update
>     20:12:32,576  WARN [qtp527311109-183] ActionManager:143 - The task
>     58 is not in progress, ignoring update
>     20:12:32,704  WARN [qtp527311109-183] ActionManager:143 - The task
>     42 is not in progress, ignoring update
>     20:12:34,955  WARN [qtp527311109-183] ActionManager:143 - The task
>     10 is not in progress, ignoring update
>     20:12:39,132  WARN [qtp527311109-183] ActionManager:143 - The task
>     26 is not in progress, ignoring update
>     20:12:42,629  WARN [qtp527311109-184] ActionManager:143 - The task
>     58 is not in progress, ignoring update
>     20:12:42,754  WARN [qtp527311109-184] ActionManager:143 - The task
>     42 is not in progress, ignoring update
>     20:12:45,137  WARN [qtp527311109-183] ActionManager:143 - The task
>     10 is not in progress, ignoring update
>     20:12:49,320  WARN [qtp527311109-183] ActionManager:143 - The task
>     26 is not in progress, ignoring update
>     20:12:52,962  WARN [qtp527311109-184] ActionManager:143 - The task
>     58 is not in progress, ignoring update
>     20:12:53,093  WARN [qtp527311109-184] ActionManager:143 - The task
>     42 is not in progress, ignoring update
>     20:12:55,184  WARN [qtp527311109-184] ActionManager:143 - The task
>     10 is not in progress, ignoring update
>     20:12:59,366  WARN [qtp527311109-184] ActionManager:143 - The task
>     26 is not in progress, ignoring update
>     20:13:03,013  WARN [qtp527311109-184] ActionManager:143 - The task
>     58 is not in progress, ignoring update
>     20:13:03,257  WARN [qtp527311109-184] ActionManager:143 - The task
>     42 is not in progress, ignoring update
>     20:13:05,231  WARN [qtp527311109-184] ActionManager:143 - The task
>     10 is not in progress, ignoring update
>
>
>     --
>     Thanks
>     Suraj Nayak
>
>

