ambari-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-22644) Node Managers fail to start after Spark2 is patched due to CNF YarnShuffleService
Date Thu, 18 Jan 2018 18:27:03 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330933#comment-16330933 ]

Hudson commented on AMBARI-22644:
---------------------------------

FAILURE: Integrated in Jenkins build Ambari-trunk-Commit #8613 (See [https://builds.apache.org/job/Ambari-trunk-Commit/8613/])
AMBARI-22644 - Node Managers fail to start after Spark2 is patched due (rlevas: [https://gitbox.apache.org/repos/asf?p=ambari.git&a=commit&h=7749e655e74c7bb4e3ada6b92943730c5e1b6e76])
* (edit) ambari-server/src/main/resources/stacks/HDP/2.6/upgrades/config-upgrade.xml
* (edit) ambari-server/src/main/resources/stacks/HDP/2.5/services/YARN/configuration/yarn-site.xml
* (edit) ambari-server/src/main/resources/stacks/HDP/3.0/services/YARN/configuration/yarn-site.xml
* (edit) ambari-server/src/main/resources/common-services/YARN/2.1.0.2.0/package/scripts/params_linux.py
* (edit) ambari-server/src/main/resources/common-services/YARN/3.0.0.3.0/package/scripts/params_linux.py


> Node Managers fail to start after Spark2 is patched due to CNF YarnShuffleService
> ---------------------------------------------------------------------------------
>
>                 Key: AMBARI-22644
>                 URL: https://issues.apache.org/jira/browse/AMBARI-22644
>             Project: Ambari
>          Issue Type: Bug
>    Affects Versions: 2.6.1
>            Reporter: Vivek Sharma
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.6.2
>
>
> *STR*
> # Deploy HDP-2.6.4.0 cluster with Ambari-2.6.1.0-114
> # Apply HBase patch Upgrade on the cluster (this step is optional)
> # Then apply Spark2 patch Upgrade on the cluster
> # Restart Node Managers
> *Result*
> NM restart fails with below error:
> {code}
> 2017-12-10 07:17:02,559 INFO  impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(606)) - NodeManager metrics system shutdown complete.
> 2017-12-10 07:17:02,559 FATAL nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(549)) - Error starting NodeManager
> org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
>         at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:245)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:291)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:546)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:594)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:197)
>         at org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:165)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:348)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:131)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         ... 8 more
> 2017-12-10 07:17:02,562 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> {code}
> The Spark properties are being written out correctly, as per AMBARI-22525.
> Initially, we had defined Spark properties for ATS like this:
> {code}
>     <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
>     <value>{{stack_root}}/${hdp.version}/spark/aux/*</value>
> {code}
> When YARN upgrades without Spark, we run into AMBARI-22525. Seems like the shuffle classes are installed as part of RPM dependencies, but not the SparkATSPlugin.
> So:
> - If we use YARN's version for the Spark classes, then ATS can't find SparkATSPlugin since that is not part of YARN.
> - If we use Spark's version for the classes, then Spark can never upgrade without YARN since NodeManager can't find the new Spark classes.
> However, it seems like shuffle and ATS use different properties. We changed all 3 properties in AMBARI-22525:
> {code}
> yarn.nodemanager.aux-services.spark2_shuffle.classpath
> yarn.nodemanager.aux-services.spark_shuffle.classpath
> yarn.timeline-service.entity-group-fs-store.group-id-plugin-classpath
> {code}
> It seems like what we need to do is change the Spark shuffle classpath properties back to hdp.version, but leave ATS using the new version, since we're guaranteed to have Spark installed on the ATS machine.
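
The trade-off described above can be sketched as a toy model. This is an illustration only, not Ambari or YARN code: the `found` helper, the version labels (`yarn-v1`, `spark-v2`, and so on) and the artifact names are all hypothetical, and each host is reduced to a map from an installed version directory to the Spark artifacts assumed to exist under it.

```python
def found(host, version, artifact):
    """Return True if `artifact` is present under `version` on `host`."""
    return artifact in host.get(version, set())


# Scenario A (this issue): Spark2 is patched, YARN is not.
# Assumption: the NodeManager host only has the shuffle jars that were
# installed alongside YARN, under YARN's own version directory.
nm_host = {"yarn-v1": {"shuffle"}}
shuffle_via_hdp_version = found(nm_host, "yarn-v1", "shuffle")
shuffle_via_spark_version = found(nm_host, "spark-v2", "shuffle")

# Scenario B (AMBARI-22525): YARN is upgraded, Spark is not.
# Assumption: the ATS host has Spark installed, and only Spark's
# directory carries the ATS plugin.
ats_host = {"yarn-v2": {"shuffle"}, "spark-v1": {"shuffle", "ats_plugin"}}
plugin_via_hdp_version = found(ats_host, "yarn-v2", "ats_plugin")
plugin_via_spark_version = found(ats_host, "spark-v1", "ats_plugin")

print("NM shuffle  | hdp.version:", shuffle_via_hdp_version,
      "| Spark version:", shuffle_via_spark_version)
print("ATS plugin  | hdp.version:", plugin_via_hdp_version,
      "| Spark version:", plugin_via_spark_version)
```

Under these assumptions the shuffle classes resolve only through YARN's version and the ATS plugin resolves only through Spark's version, which matches the proposal to treat the shuffle and ATS classpath properties differently.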



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
