spark-issues mailing list archives

From "Ryan Williams (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-11161) Viewing the web UI for the first time unpersists a cached RDD
Date Fri, 16 Oct 2015 22:11:05 GMT
Ryan Williams created SPARK-11161:
-------------------------------------

             Summary: Viewing the web UI for the first time unpersists a cached RDD
                 Key: SPARK-11161
                 URL: https://issues.apache.org/jira/browse/SPARK-11161
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, Web UI
    Affects Versions: 1.5.1
            Reporter: Ryan Williams
            Priority: Minor


This one is a real head-scratcher. [Here's a screencast|http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif]:

!http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif!

The three windows, left-to-right, are: 
* a {{spark-shell}} on YARN with dynamic allocation enabled, at rest with one executor. [Here's an example app's environment|https://gist.github.com/ryan-williams/6dd3502d5d0de2f030ac].
* [Spree|https://github.com/hammerlab/spree], opened to the above app's "Storage" tab.
* my YARN resource manager, showing a link to the web UI running on the driver.

At the start, nothing has been run in the shell, and I've not visited the web UI.

I run a simple job in the shell and cache a small RDD that it computes:

{code}
sc.parallelize(1 to 100000000, 100).map(_ % 100 -> 1).reduceByKey(_+_, 100).setName("foo").cache.count
{code}

As the second stage runs, you can see the partitions show up as cached in Spree.

After the job finishes, a few of the requested executors continue to come online, which you can see in
the console at left or the nav bar of Spree in the middle.

Once that has finished, everything is at rest with the RDD "foo" 100% cached.

Then, I click the YARN RM's "ApplicationMaster" link, which loads the web UI on the driver
for the first time.

Immediately, the console prints some activity, including that RDD 2 has been removed:

{code}
15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 172.29.46.15:33156 in memory (size: 1517.0 B, free: 7.2 GB)
15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1517.0 B, free: 12.2 GB)
15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 2
15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 172.29.46.15:33156 in memory (size: 1666.0 B, free: 7.2 GB)
15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1666.0 B, free: 12.2 GB)
15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 1
15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned shuffle 0
15/10/16 21:43:13 INFO storage.BlockManager: Removing RDD 2
15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned RDD 2
{code}

Accordingly, Spree shows that the RDD has been unpersisted, and I can see in the event log
(not pictured in the screencast) that an Unpersist event has made its way through the various
SparkListeners:

{code}
{"Event":"SparkListenerUnpersistRDD","RDD ID":2}
{code}

Simply loading the web UI causes an RDD unpersist event to fire!
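
For anyone trying to confirm the same thing in their own event log, a scan along these lines might help. This is only a minimal sketch: it assumes an uncompressed event log with one JSON object per line, formatted like the excerpt above. The object name {{UnpersistScan}} and the path argument are hypothetical, and the regex is a shortcut standing in for a real JSON parser.

```scala
import scala.io.Source

// Hypothetical helper, not part of Spark: lists the RDD IDs of unpersist
// events found in an event log.
object UnpersistScan {
  // Assumes the single-line layout shown in the excerpt above, with the
  // "Event" field appearing before "RDD ID". A JSON parser would be more robust.
  private val Unpersist =
    """.*"Event"\s*:\s*"SparkListenerUnpersistRDD".*"RDD ID"\s*:\s*(\d+).*""".r

  /** Returns the RDD ID of every unpersist event, in log order. */
  def unpersistedRddIds(lines: Iterator[String]): List[Int] =
    lines.collect { case Unpersist(id) => id.toInt }.toList

  def main(args: Array[String]): Unit = {
    // args(0): path to the event log file (supplied by the caller)
    val source = Source.fromFile(args(0))
    try unpersistedRddIds(source.getLines()).foreach(id => println(s"unpersisted RDD $id"))
    finally source.close()
  }
}
```

Lines that do not match the unpersist pattern are skipped, so the whole log can be passed in unfiltered.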

I can't nail down exactly what's causing this, and I've seen evidence that there are other
sequences of events that can also cause it:

* I've repro'd the above steps ~20 times. The RDD is always unpersisted when the app is dynamically
allocating executors and my first visit to the web UI comes after the RDD is cached.
* Once, I observed the unpersist fire without my visiting the web UI at all. In other runs I waited
a long time before visiting the web UI, to make clear that loading it is what triggers the
unpersist, and it did every time; but apparently there is another, seemingly rare, way for the
unpersist to happen that doesn't involve the web UI.
* I tried a couple of times without dynamic allocation and could not reproduce it.
* I also tried a couple of times with dynamic allocation and a minimum number of executors
higher than 1, and could not reproduce it.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

