ambari-user mailing list archives

From Suraj Nayak M <snay...@gmail.com>
Subject Re: All processes are waiting during Cluster install
Date Sun, 13 Jul 2014 20:34:01 GMT
Sid,

Thanks for your suggestion.

*mysql-connector-java* was the initial error. That was solved after a
long wait. (I will try your suggestion in my next install :-) )

Below is a summary of my attempts leading up to the successful install:
*Try-1*: Started the cluster install. A few components failed (the
mysql-connector-java install was still running via the agents).
*Try-2*: Used the Retry option from the UI. All processes were waiting.
After a long time (once the mysql-connector-java install finished), the
waiting processes started. A few components installed successfully;
others failed with the Python script timeout error.
*Try-3*: Used the Retry option from the UI. The previously failed
component installs succeeded, but the Python script timed out again
during the Oozie client install (screenshot attached in my previous
mail).
*Try-4*: Success. (There were some warnings related to JAVA_HOME, which
I am resolving now.)

Can I increase the timeout period of the Python script that kept
failing during the install?
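(Something along these lines is what I have in mind; I am assuming the
relevant property is named agent.task.timeout in
/etc/ambari-server/conf/ambari.properties, with the value in seconds:)

    # Assumption: agent.task.timeout is the per-task timeout in seconds
    # (default 900). Raise it to 30 minutes and restart the server.
    echo "agent.task.timeout=1800" >> /etc/ambari-server/conf/ambari.properties
    ambari-server restart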

--
Suraj Nayak

On Monday 14 July 2014 01:29 AM, Siddharth Wagle wrote:
> Try a "yum clean all" and a "yum install *mysql-connector-java*" from
> the command line on the hosts with any HIVE or OOZIE components.
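>
> Something like this, run on every host carrying a HIVE or OOZIE
> component (the host names below are only placeholders):
>
>         # Run the cleanup and install on each affected host over ssh.
>         for h in master1.hdp slave1.hdp slave2.hdp; do
>             ssh root@$h 'yum clean all && yum -y install mysql-connector-java'
>         done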
>
> Then retry from UI.
>
> -Sid
>
>
> On Sun, Jul 13, 2014 at 12:36 PM, Suraj Nayak M <snayakm@gmail.com> wrote:
>
>     Hi Sumit,
>
>     By "I restarted the process" I meant that I restarted the
>     deployment from the UI (using the Retry button in the browser).
>
>     You were right. Task 10 was stuck at the *mysql-connector-java*
>     installation :)
>
>     2014-07-13 20:05:32,755 - Repository['HDP-2.1'] {'action':
>     ['create'], 'mirror_list': None, 'base_url':
>     'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.3.0',
>     'components': ['HDP', 'main'], 'repo_file_name': 'HDP'}
>     2014-07-13 20:05:32,761 - File['/etc/yum.repos.d/HDP.repo']
>     {'content': InlineTemplate(...)}
>     2014-07-13 20:05:32,762 - Package['hive'] {}
>     2014-07-13 20:05:32,780 - Installing package hive ('/usr/bin/yum
>     -d 0 -e 0 -y install hive')
>     2014-07-13 20:08:32,772 - Package['mysql-connector-java'] {}
>     2014-07-13 20:08:32,802 - Installing package mysql-connector-java
>     ('/usr/bin/yum -d 0 -e 0 -y install mysql-connector-java')
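>
>     (To confirm that the agent is simply waiting on yum at this
>     point, one can check on the host; this is just a sanity check,
>     not something Ambari requires:)
>
>         # Is the package install still running?
>         ps -ef | grep -v grep | grep 'yum.*install'
>         # What has yum completed so far?
>         tail -n 20 /var/log/yum.log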
>
>     I have also noticed that if the network is slow, the install
>     succeeds for some components and fails for others. On retry (from
>     the UI), the install continues from the failure point and the
>     previously failed components succeed. The cycle repeats until all
>     the components are installed. Is there any way I can increase the
>     timeout of the Python script? Or can we have a fix in Ambari for
>     the condition below (a rough sketch follows)?
>
>     "*If the error is due to a Python script timeout, retry the
>     task*"
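>
>         # Pseudo-shell sketch of the behaviour I have in mind, not
>         # actual Ambari code: rerun the install command whenever it
>         # times out, instead of failing the whole task.
>         while ! timeout 600 /usr/bin/yum -d 0 -e 0 -y install oozie; do
>             echo "install timed out or failed, retrying..."
>             sleep 30
>         done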
>
>     The network was slow for some reason. The installation failed and
>     the error below was displayed (screenshot attached).
>
>     *Details of the error:*
>
>     *ERROR:* Python script has been killed due to timeout.
>
>     The file */var/lib/ambari-agent/data/errors-181.txt* does not
>     contain any data.
>
>     Content of */var/lib/ambari-agent/data/output-181.txt*:
>
>     2014-07-14 00:07:01,673 - Package['unzip'] {}
>     2014-07-14 00:07:01,770 - Skipping installing existent package unzip
>     2014-07-14 00:07:01,772 - Package['curl'] {}
>     2014-07-14 00:07:01,872 - Skipping installing existent package curl
>     2014-07-14 00:07:01,874 - Package['net-snmp-utils'] {}
>     2014-07-14 00:07:01,966 - Skipping installing existent package
>     net-snmp-utils
>     2014-07-14 00:07:01,967 - Package['net-snmp'] {}
>     2014-07-14 00:07:02,060 - Skipping installing existent package
>     net-snmp
>     2014-07-14 00:07:02,064 - Group['hadoop'] {}
>     2014-07-14 00:07:02,069 - Modifying group hadoop
>     2014-07-14 00:07:02,141 - Group['users'] {}
>     2014-07-14 00:07:02,142 - Modifying group users
>     2014-07-14 00:07:02,222 - Group['users'] {}
>     2014-07-14 00:07:02,224 - Modifying group users
>     2014-07-14 00:07:02,306 - User['ambari-qa'] {'gid': 'hadoop',
>     'groups': [u'users']}
>     2014-07-14 00:07:02,307 - Modifying user ambari-qa
>     2014-07-14 00:07:02,380 - File['/tmp/changeUid.sh'] {'content':
>     StaticFile('changeToSecureUid.sh'), 'mode': 0555}
>     2014-07-14 00:07:02,385 - Execute['/tmp/changeUid.sh ambari-qa
>     /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa
>     2>/dev/null'] {'not_if': 'test $(id -u ambari-qa) -gt 1000'}
>     2014-07-14 00:07:02,454 - Skipping Execute['/tmp/changeUid.sh
>     ambari-qa
>     /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa
>     2>/dev/null'] due to not_if
>     2014-07-14 00:07:02,456 - User['hbase'] {'gid': 'hadoop',
>     'groups': [u'hadoop']}
>     2014-07-14 00:07:02,456 - Modifying user hbase
>     2014-07-14 00:07:02,528 - File['/tmp/changeUid.sh'] {'content':
>     StaticFile('changeToSecureUid.sh'), 'mode': 0555}
>     2014-07-14 00:07:02,531 - Execute['/tmp/changeUid.sh hbase
>     /home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/hadoop/hbase
>     2>/dev/null'] {'not_if': 'test $(id -u hbase) -gt 1000'}
>     2014-07-14 00:07:02,600 - Skipping Execute['/tmp/changeUid.sh
>     hbase
>     /home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/hadoop/hbase
>     2>/dev/null'] due to not_if
>     2014-07-14 00:07:02,602 - Group['nagios'] {}
>     2014-07-14 00:07:02,602 - Modifying group nagios
>     2014-07-14 00:07:02,687 - User['nagios'] {'gid': 'nagios'}
>     2014-07-14 00:07:02,689 - Modifying user nagios
>     2014-07-14 00:07:02,757 - User['oozie'] {'gid': 'hadoop'}
>     2014-07-14 00:07:02,758 - Modifying user oozie
>     2014-07-14 00:07:02,826 - User['hcat'] {'gid': 'hadoop'}
>     2014-07-14 00:07:02,828 - Modifying user hcat
>     2014-07-14 00:07:02,897 - User['hcat'] {'gid': 'hadoop'}
>     2014-07-14 00:07:02,898 - Modifying user hcat
>     2014-07-14 00:07:02,964 - User['hive'] {'gid': 'hadoop'}
>     2014-07-14 00:07:02,965 - Modifying user hive
>     2014-07-14 00:07:03,032 - User['yarn'] {'gid': 'hadoop'}
>     2014-07-14 00:07:03,034 - Modifying user yarn
>     2014-07-14 00:07:03,099 - Group['nobody'] {}
>     2014-07-14 00:07:03,100 - Modifying group nobody
>     2014-07-14 00:07:03,178 - Group['nobody'] {}
>     2014-07-14 00:07:03,179 - Modifying group nobody
>     2014-07-14 00:07:03,260 - User['nobody'] {'gid': 'hadoop',
>     'groups': [u'nobody']}
>     2014-07-14 00:07:03,261 - Modifying user nobody
>     2014-07-14 00:07:03,330 - User['nobody'] {'gid': 'hadoop',
>     'groups': [u'nobody']}
>     2014-07-14 00:07:03,332 - Modifying user nobody
>     2014-07-14 00:07:03,401 - User['hdfs'] {'gid': 'hadoop', 'groups':
>     [u'hadoop']}
>     2014-07-14 00:07:03,403 - Modifying user hdfs
>     2014-07-14 00:07:03,471 - User['mapred'] {'gid': 'hadoop',
>     'groups': [u'hadoop']}
>     2014-07-14 00:07:03,473 - Modifying user mapred
>     2014-07-14 00:07:03,544 - User['zookeeper'] {'gid': 'hadoop'}
>     2014-07-14 00:07:03,545 - Modifying user zookeeper
>     2014-07-14 00:07:03,616 - User['storm'] {'gid': 'hadoop',
>     'groups': [u'hadoop']}
>     2014-07-14 00:07:03,618 - Modifying user storm
>     2014-07-14 00:07:03,688 - User['falcon'] {'gid': 'hadoop',
>     'groups': [u'hadoop']}
>     2014-07-14 00:07:03,689 - Modifying user falcon
>     2014-07-14 00:07:03,758 - User['tez'] {'gid': 'hadoop', 'groups':
>     [u'users']}
>     2014-07-14 00:07:03,760 - Modifying user tez
>     2014-07-14 00:07:04,073 - Repository['HDP-2.1'] {'action':
>     ['create'], 'mirror_list': None, 'base_url':
>     'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.3.0',
>     'components': ['HDP', 'main'], 'repo_file_name': 'HDP'}
>     2014-07-14 00:07:04,084 - File['/etc/yum.repos.d/HDP.repo']
>     {'content': InlineTemplate(...)}
>     2014-07-14 00:07:04,086 - Package['oozie'] {}
>     2014-07-14 00:07:04,177 - Installing package oozie ('/usr/bin/yum
>     -d 0 -e 0 -y install oozie')
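>
>     (Watching the output file shows whether the task is still making
>     progress while yum downloads:)
>
>         tail -f /var/lib/ambari-agent/data/output-181.txt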
>
>     --
>     Suraj Nayak
>
>
>     On Sunday 13 July 2014 09:13 PM, Sumit Mohanty wrote:
>>     By "I restarted the process" do you mean that you restarted the
>>     installation?
>>
>>     Can you share the command logs for the tasks (e.g. 10, 42, 58)?
>>     These would help in debugging why the tasks are still active.
>>
>>     If you look at the past requests in the Ambari UI (top left),
>>     the task-specific view will show you the hosts and the local log
>>     file names on each host. The files are named
>>     /var/lib/ambari-agent/data/output-10.txt and
>>     /var/lib/ambari-agent/data/errors-10.txt for task id 10.
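>>
>>     For example, to collect the logs for tasks 10, 42 and 58 on a
>>     host (adjust the task ids to yours):
>>
>>         ls -l /var/lib/ambari-agent/data/{output,errors}-{10,42,58}.txt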
>>
>>     What I can surmise from the above is that the agents are still
>>     stuck executing the older tasks, so they cannot execute the new
>>     commands sent by the Ambari Server when you retried the
>>     installation. I suggest looking at the command logs to see why
>>     they are stuck. Restarting the Ambari Server may not help; you
>>     may need to restart the agents if they are stuck executing tasks.
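>>
>>     If you do need to bounce the agents, that would be (run on each
>>     host):
>>
>>         ambari-agent stop
>>         ambari-agent start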
>>
>>     -Sumit
>>
>>
>>     On Sun, Jul 13, 2014 at 8:00 AM, Suraj Nayak M <snayakm@gmail.com
>>     <mailto:snayakm@gmail.com>> wrote:
>>
>>         Hi,
>>
>>         I am trying to install HDP 2.1 using Ambari on 4 nodes: 2
>>         NameNodes and 2 slaves. The install failed due to a Python
>>         script timeout. I restarted the process, but for the past 2
>>         hours there has been no progress in the installation. Is it
>>         safe to kill the Ambari Server and restart the process? How
>>         can I terminate the ongoing process in Ambari gracefully?
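>>
>>         (That is, is it safe to simply run the following and start
>>         over?)
>>
>>             ambari-server stop
>>             ambari-server start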
>>
>>         Below is the tail of the Ambari Server log.
>>
>>         20:12:08,530  WARN [qtp527311109-183] HeartBeatHandler:369 -
>>         Operation failed - may be retried. Service component host:
>>         HIVE_CLIENT, host: slave2.hdp.somedomain.com Action id1-1
>>         20:12:08,530  INFO [qtp527311109-183] HeartBeatHandler:375 -
>>         Received report for a command that is no longer active.
>>         CommandReport{role='HIVE_CLIENT', actionId='1-1',
>>         status='FAILED', exitCode=999, clusterName='HDP2_CLUSTER1',
>>         serviceName='HIVE', taskId=57, roleCommand=INSTALL,
>>         configurationTags=null, customCommand=null}
>>         20:12:08,530  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 57 is not in progress, ignoring update
>>         20:12:08,966  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 26 is not in progress, ignoring update
>>         20:12:12,319  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 58 is not in progress, ignoring update
>>         20:12:12,605  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 42 is not in progress, ignoring update
>>         20:12:14,872  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 10 is not in progress, ignoring update
>>         20:12:19,039  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 26 is not in progress, ignoring update
>>         20:12:22,382  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 58 is not in progress, ignoring update
>>         20:12:22,655  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 42 is not in progress, ignoring update
>>         20:12:24,919  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 10 is not in progress, ignoring update
>>         20:12:29,086  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 26 is not in progress, ignoring update
>>         20:12:32,576  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 58 is not in progress, ignoring update
>>         20:12:32,704  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 42 is not in progress, ignoring update
>>         20:12:34,955  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 10 is not in progress, ignoring update
>>         20:12:39,132  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 26 is not in progress, ignoring update
>>         20:12:42,629  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 58 is not in progress, ignoring update
>>         20:12:42,754  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 42 is not in progress, ignoring update
>>         20:12:45,137  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 10 is not in progress, ignoring update
>>         20:12:49,320  WARN [qtp527311109-183] ActionManager:143 - The
>>         task 26 is not in progress, ignoring update
>>         20:12:52,962  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 58 is not in progress, ignoring update
>>         20:12:53,093  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 42 is not in progress, ignoring update
>>         20:12:55,184  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 10 is not in progress, ignoring update
>>         20:12:59,366  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 26 is not in progress, ignoring update
>>         20:13:03,013  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 58 is not in progress, ignoring update
>>         20:13:03,257  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 42 is not in progress, ignoring update
>>         20:13:05,231  WARN [qtp527311109-184] ActionManager:143 - The
>>         task 10 is not in progress, ignoring update
>>
>>
>>         --
>>         Thanks
>>         Suraj Nayak
>>

