ambari-issues mailing list archives

From "Dmitry Lysnichenko (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMBARI-15389) Intermittent YARN service check failures during and post EU
Date Fri, 11 Mar 2016 17:05:38 GMT
Dmitry Lysnichenko created AMBARI-15389:
-------------------------------------------

             Summary: Intermittent YARN service check failures during and post EU
                 Key: AMBARI-15389
                 URL: https://issues.apache.org/jira/browse/AMBARI-15389
             Project: Ambari
          Issue Type: Bug
            Reporter: Dmitry Lysnichenko
            Assignee: Dmitry Lysnichenko
         Attachments: AMBARI-15389.patch


Build: Ambari 2.2.1.1, #63

The YARN service check reported a failure in a couple of recent Express Upgrade (EU) runs:
a. In one test, the EU ran from HDP 2.3.4.0 to 2.4.0.0 and the YARN service check failed
during the EU itself. Retrying the operation made the service check succeed.

b. In another test, the YARN service check failed when it was run after the EU had completed.
Running it a second time succeeded.

Some corner condition appears to trigger the failure intermittently. In the output below, the
distributed shell application ends with yarnAppState=FINISHED and distributedFinalState=FAILED,
and the command returns 2.
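Since a plain rerun succeeded in both cases, it may help to rerun the check command a few times and record which attempt first passes. The following is a minimal standalone sketch of that idea, not the attached patch and not the resource_management API. The function name run_with_retries and the runner parameter are hypothetical; tries and try_sleep only mirror the argument names visible in the traceback.

```python
import subprocess
import time


def run_with_retries(command, tries=3, try_sleep=5, runner=subprocess.call):
    """Run `command` until it exits 0 or `tries` attempts are used up.

    `runner` is any callable that takes the command and returns an exit
    code; it defaults to subprocess.call and can be swapped for a fake
    in tests. Returns (last_exit_code, attempts_used).
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    code = None
    for attempt in range(1, tries + 1):
        code = runner(command)
        if code == 0:
            return 0, attempt
        # Sleep only between attempts, not after the final failure.
        if attempt < tries:
            time.sleep(try_sleep)
    return code, tries
```

A result such as (0, 2) would match what was observed here: the first attempt fails and the second succeeds.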

{code}
stderr:   /var/lib/ambari-agent/data/errors-822.txt

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/service_check.py", line 142, in <module>
    ServiceCheck().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/service_check.py", line 104, in service_check
    user=params.smokeuser,
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
    result = function(command, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
    tries=tries, try_sleep=try_sleep)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of '/usr/bin/kinit -kt /etc/security/keytabs/smokeuser.headless.keytab
ambari-qa@EXAMPLE.COM; yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command
ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar'
returned 2. ######## Hortonworks #############
This is MOTD message, added for testing in qe infra
16/03/03 02:33:51 INFO impl.TimelineClientImpl: Timeline service address: http://host:8188/ws/v1/timeline/
16/03/03 02:33:51 INFO distributedshell.Client: Initializing Client
16/03/03 02:33:51 INFO distributedshell.Client: Running Client
16/03/03 02:33:51 INFO client.RMProxy: Connecting to ResourceManager at host-9-5.test/127.0.0.254:8050
16/03/03 02:33:53 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=3
16/03/03 02:33:53 INFO distributedshell.Client: Got Cluster node info from ASM
16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host:25454,
nodeAddresshost:8042, nodeRackName/default-rack, nodeNumContainers1
16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host-9-5.test:25454,
nodeAddresshost-9-5.test:8042, nodeRackName/default-rack, nodeNumContainers0
16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host-9-1.test:25454,
nodeAddresshost-9-1.test:8042, nodeRackName/default-rack, nodeNumContainers0
16/03/03 02:33:53 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.083333336,
queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
16/03/03 02:33:53 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
16/03/03 02:33:53 INFO distributedshell.Client: User ACL Info for Queue, queueName=default,
userAcl=SUBMIT_APPLICATIONS
16/03/03 02:33:53 INFO distributedshell.Client: Max mem capabililty of resources in this cluster
10240
16/03/03 02:33:53 INFO distributedshell.Client: Max virtual cores capabililty of resources
in this cluster 1
16/03/03 02:33:53 INFO distributedshell.Client: Copy App Master jar from local filesystem
and add to local environment
16/03/03 02:33:53 INFO distributedshell.Client: Set the environment for the application master
16/03/03 02:33:53 INFO distributedshell.Client: Setting up app master command
16/03/03 02:33:53 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java
-Xmx10m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_memory
10 --container_vcores 1 --num_containers 1 --priority 0 1><LOG_DIR>/AppMaster.stdout
2><LOG_DIR>/AppMaster.stderr
16/03/03 02:33:53 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 290 for ambari-qa
on 127.0.0.235:8020
16/03/03 02:33:53 INFO distributedshell.Client: Got dt for hdfs://host-9-1.test:8020; Kind:
HDFS_DELEGATION_TOKEN, Service: 127.0.0.235:8020, Ident: (HDFS_DELEGATION_TOKEN token 290
for ambari-qa)
16/03/03 02:33:53 INFO distributedshell.Client: Submitting application to ASM
16/03/03 02:33:54 INFO impl.YarnClientImpl: Submitted application application_1456970141888_0011
16/03/03 02:33:55 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:33:56 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:33:57 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:33:58 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:33:59 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:34:00 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:34:01 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:34:02 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:34:03 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:34:04 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=RUNNING, distributedFinalState=UNDEFINED,
appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
16/03/03 02:34:05 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=RUNNING, distributedFinalState=UNDEFINED,
appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
16/03/03 02:34:06 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=RUNNING, distributedFinalState=UNDEFINED,
appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
16/03/03 02:34:07 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=RUNNING, distributedFinalState=UNDEFINED,
appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
16/03/03 02:34:08 INFO distributedshell.Client: Got application report from ASM for, appId=11,
clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235,
appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=FINISHED,
distributedFinalState=FAILED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/,
appUser=ambari-qa
16/03/03 02:34:08 INFO distributedshell.Client: Application did finished unsuccessfully. YarnState=FINISHED,
DSFinalStatus=FAILED. Breaking monitoring loop
16/03/03 02:34:08 ERROR distributedshell.Client: Application failed to complete successfully

stdout:   /var/lib/ambari-agent/data/output-822.txt

2016-03-03 02:33:47,974 - Using hadoop conf dir: /usr/hdp/current/hadoop-client/conf
2016-03-03 02:33:48,013 - Using hadoop conf dir: /usr/hdp/current/hadoop-client/conf
2016-03-03 02:33:48,018 - checked_call['/usr/bin/kinit -kt /etc/security/keytabs/smokeuser.headless.keytab
ambari-qa@EXAMPLE.COM; yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command
ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar']
{'path': '/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin', 'user': 'ambari-qa'}
{code}
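The stderr output above repeats the same application report once per second, which makes the actual state changes hard to spot. The sketch below collapses such a log into its state transitions. It is only an illustration: the function name state_transitions and the SAMPLE text (abbreviated from the log above) are made up for this example, and the regular expressions assume the timestamp and field layout seen in this particular output, including reports wrapped across lines.

```python
import re

# A record starts with a Hadoop-style timestamp such as "16/03/03 02:33:55".
RECORD_START = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ", re.MULTILINE)
# The two fields may be separated by a line wrap, hence \s*.
STATE = re.compile(r"yarnAppState=(\w+),\s*distributedFinalState=(\w+)")


def state_transitions(log_text):
    """Return [(timestamp, yarn_state, final_state)], dropping consecutive repeats."""
    starts = [m.start() for m in RECORD_START.finditer(log_text)]
    starts.append(len(log_text))
    transitions = []
    for begin, end in zip(starts, starts[1:]):
        record = log_text[begin:end]
        match = STATE.search(record)
        if not match:
            continue  # not an application report
        timestamp = RECORD_START.match(record).group(1)
        states = match.groups()
        # Keep a record only when the state pair differs from the previous one.
        if not transitions or transitions[-1][1:] != states:
            transitions.append((timestamp,) + states)
    return transitions


# Abbreviated sample in the same shape as the log above.
SAMPLE = """\
16/03/03 02:33:55 INFO distributedshell.Client: Got application report from ASM for, appId=11,
appQueue=default, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appUser=ambari-qa
16/03/03 02:33:56 INFO distributedshell.Client: Got application report from ASM for, appId=11,
appQueue=default, yarnAppState=ACCEPTED,
distributedFinalState=UNDEFINED, appUser=ambari-qa
16/03/03 02:34:04 INFO distributedshell.Client: Got application report from ASM for, appId=11,
appQueue=default, yarnAppState=RUNNING, distributedFinalState=UNDEFINED,
16/03/03 02:34:08 INFO distributedshell.Client: Got application report from ASM for, appId=11,
appQueue=default, yarnAppState=FINISHED,
distributedFinalState=FAILED, appUser=ambari-qa
"""

if __name__ == "__main__":
    for entry in state_transitions(SAMPLE):
        print(entry)
```

Applied to the full stderr text above, this should reduce the report lines to three transitions: ACCEPTED at 02:33:55, RUNNING at 02:34:04, and FINISHED with final state FAILED at 02:34:08. The application spent about nine seconds in ACCEPTED and about four seconds in RUNNING before failing.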

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
