spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
Date Sat, 29 Nov 2014 22:14:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228952#comment-14228952 ]

Josh Rosen commented on SPARK-4498:
-----------------------------------

In addition to exploring the "missing DisassociatedEvent" theory, it might also be worthwhile
to brainstorm whether problems at other steps in the cleanup process could cause an application
to fail to be removed. I'm not sure that a single missing DisassociatedEvent could explain
the "blip" behavior observed here, where an entire group of applications fails to be marked
as completed/failed.

In the DisassociatedEvent handler, we index into {{addressToApp}} to determine which app corresponded
to the DisassociatedEvent:

{code}
    case DisassociatedEvent(_, address, _) => {
      // The disconnected client could've been either a worker or an app; remove whichever it was
      logInfo(s"$address got disassociated, removing it.")
      addressToWorker.get(address).foreach(removeWorker)
      addressToApp.get(address).foreach(finishApplication)
      if (state == RecoveryState.RECOVERING && canCompleteRecovery) { completeRecovery() }
    }
{code}

If the {{addressToApp}} entry was missing or pointed at the wrong app, then we wouldn't properly
clean up the app. However, I don't think there should be any problems here, because each
application's actor system should have its own distinct address and Akka's {{Address}} class
properly implements hashCode / equals. Even if drivers run on the same host, their actor systems
should have different port numbers.
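
For illustration, here's a minimal sketch of that reasoning (not code from Spark; it assumes
akka-actor on the classpath, and the string values are hypothetical stand-ins for ApplicationInfo).
Because {{akka.actor.Address}} is a case class with structural equality, two driver actor systems
on the same host but different ports produce distinct {{addressToApp}} keys:

{code}
import akka.actor.Address
import scala.collection.mutable

// Hypothetical stand-in map; the real addressToApp holds ApplicationInfo values.
val addressToApp = new mutable.HashMap[Address, String]()

val a1 = Address("akka.tcp", "sparkDriver", "spark-etl1.chi.shopify.com", 58849)
val a2 = Address("akka.tcp", "sparkDriver", "spark-etl1.chi.shopify.com", 58850)

addressToApp(a1) = "app-A"
addressToApp(a2) = "app-B"

// Same host, different port: the addresses are not equal, so the lookup in the
// DisassociatedEvent handler cannot confuse one driver with the other.
assert(a1 != a2)
assert(addressToApp.get(a1) == Some("app-A"))
assert(addressToApp.get(a2) == Some("app-B"))
{code}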

Continuing along:

{code}
  def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
    if (apps.contains(app)) {
      logInfo("Removing app " + app.id)
{code}

Is there any way that the {{apps}} HashSet could fail to contain {{app}}?  I don't think so:
{{ApplicationInfo}} doesn't override equals/hashCode, but I don't think that's a problem since
we only create one ApplicationInfo per app, so the default object identity comparison should
be fine.  We should probably log an error if we call {{removeApplication}} on an application
that has already been removed, though.  (Also, why do we need the {{apps}} HashSet when we
could just use {{idToApp.values}}?)
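
To make the object-identity point concrete, here's a small sketch ({{FakeAppInfo}} is a
hypothetical stand-in, not the real {{ApplicationInfo}}): with no equals/hashCode override, a
HashSet falls back to reference identity, which is only safe because the Master creates exactly
one instance per application.

{code}
import scala.collection.mutable

// Hypothetical stand-in for ApplicationInfo: no equals/hashCode override, so
// HashSet membership is decided by reference identity.
class FakeAppInfo(val id: String)

val app = new FakeAppInfo("app-1234")
val apps = new mutable.HashSet[FakeAppInfo]()
apps += app

assert(apps.contains(app))                      // same instance: found and removable
assert(!apps.contains(new FakeAppInfo(app.id))) // equal id, different instance: not found
{code}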

> Standalone Master can fail to recognize completed/failed applications
> ---------------------------------------------------------------------
>
>                 Key: SPARK-4498
>                 URL: https://issues.apache.org/jira/browse/SPARK-4498
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Spark Core
>    Affects Versions: 1.1.1, 1.2.0
>         Environment:  - Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
>  - Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
>  - Python 2.7.3
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
>  - 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece
>  - All client code is PySpark
>            Reporter: Harry Brundage
>            Priority: Blocker
>         Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt
>
>
> We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports the application as Running, but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted.
> This happens to many jobs at once, all right around the same time, two or three times a day, where they all get stuck. Before and after this "blip", applications start, get resources, finish, and are marked as finished properly. The "blip" is mostly conjecture on my part; I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers. All the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers.
> Deploy mode:
>  - PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised
>  
> Other factoids:
>  - In most places, we call sc.stop() explicitly before shutting down our driver process
>  - Here's the sum total of spark configuration options we don't set to the default:
> {code}
>     "spark.cores.max": 30
>     "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs"
>     "spark.eventLog.enabled": true
>     "spark.executor.memory": "7g"
>     "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/"
>     "spark.io.compression.codec": "lzf"
>     "spark.ui.killEnabled": true
> {code}
>  - The exception the executors die with is this:
> {code}
> 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
> 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
> 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
> 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
> 14/11/19 19:42:37 INFO Remoting: Starting remoting
> 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@dn13.chi.shopify.com:37682]
> 14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682.
> 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849
> 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
> 	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> 	... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> 	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> 	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> 	at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> 	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> 	at scala.concurrent.Await$.result(package.scala:107)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
> 	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> 	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
> 	... 7 more
> {code}
> Cluster history:
>  - We run Spark versions built from apache/spark#master snapshots. We did not observe this behaviour on {{7eb9cbc273d758522e787fcb2ef68ef65911475f}} (sorry it's so old), but now observe it on {{c6e0c2ab1c29c184a9302d23ad75e4ccd8060242}}. We can try new versions to assist debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


