spark-dev mailing list archives

From Pedro Rodriguez <ski.rodrig...@gmail.com>
Subject Re: "Spree": Live-updating web UI for Spark
Date Mon, 27 Jul 2015 23:17:55 GMT
+1 to awesome work. I saw it this morning; it solves an annoyance I have had
for a while, and one I had even thought about contributing something for
myself. I am excited to give it a try.

Pedro

On Mon, Jul 27, 2015 at 2:59 PM, Ryan Williams <
ryan.blake.williams@gmail.com> wrote:

> Hi dev@spark, I wanted to quickly ping about Spree
> <http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/>,
> a live-updating web UI for Spark that I released on Friday (along with some
> supporting infrastructure), and mention a couple things that came up while
> I worked on it that are relevant to this list.
>
> This blog post
> <http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/>
> and github <https://github.com/hammerlab/spree/> have lots of info about
> functionality, implementation details, and installation instructions, but
> the tl;dr is:
>
>    - You register a SparkListener called JsonRelay
>    <https://github.com/hammerlab/spark-json-relay> via the
>    spark.extraListeners conf (thanks @JoshRosen!).
>    - That listener ships SparkListenerEvents to a server called slim
>    <https://github.com/hammerlab/slim> that stores them in Mongo.
>       - Really what it stores are a bunch of stats similar to those
>       maintained by JobProgressListener.
>     - A Meteor <https://www.meteor.com/> app displays live-updating views
>    of what’s in Mongo.
>
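[Editor's note: the relay step above can be sketched in plain Scala. All types and names below are stand-ins for illustration, not the real Spark API: the listener's only job is to serialize each event and hand the JSON to a transport, leaving all aggregation to slim on the other end of the socket.]

```scala
// Sketch only: `SparkListenerEvent` and `JobStart` are stand-ins for the
// real Spark classes; `send` stands in for the socket write JsonRelay does.
trait SparkListenerEvent { def name: String }
case class JobStart(jobId: Int) extends SparkListenerEvent {
  val name = "SparkListenerJobStart"
}

class RelaySketch(send: String => Unit) {
  // The real JsonRelay gets here via SparkFirehoseListener.onEvent and
  // serializes with JsonProtocol instead of pattern matching by hand.
  def onEvent(event: SparkListenerEvent): Unit = event match {
    case JobStart(id) => send(s"""{"Event":"${event.name}","Job ID":$id}""")
    case _            => () // other event types ignored in this sketch
  }
}
```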
> Feel free to read about it / try it! The rest of this email is just
> questions about Spark APIs and plans.
> JsonProtocol scoping
>
> The most awkward thing about Spree is that JsonRelay declares itself to
> be in org.apache.spark
> <https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L1>
> so that it can use JsonProtocol.
>
> Will JsonProtocol be private[spark] forever, on purpose, or is it just
> not considered stable enough yet, so you want to discourage direct use? I’m
> relatively impartial at this point since I’ve done the hacky thing and it
> works for my purposes, but thought I’d ask in case there are interesting
> perspectives on the ideal scope for it going forward.
> @DeveloperApi trait SparkListener
>
> Another set of tea leaves I wasn’t sure how to read was the @DeveloperApi-ness
> of SparkListener
> <https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L131-L132>.
> I assumed I was doing something frowny by having JsonRelay implement the
> SparkListener interface. However, I just noticed that I’m actually
> extending SparkFirehoseListener
> <https://github.com/apache/spark/blob/v1.4.1/core/src/main/java/org/apache/spark/SparkFirehoseListener.java>,
> which is *not* @DeveloperApi afaict, so maybe I’m ok there after all?
>
> Are there other SparkListener implementations of note in the wild (seems
> like “no”)? Is that an API that people can and should use externally (seems
> like “yes” to me)? I saw @vanzin recently imply on this list that the
> answers may be “no” and “no”
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Slight-API-incompatibility-caused-by-SPARK-4072-tp13257.html>
> .
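[Editor's note: a rough sketch of why SparkFirehoseListener is convenient here, with stubbed stand-in types rather than the real API: it implements every SparkListener callback by delegating to one generic onEvent, so a relay only has to override a single method.]

```scala
// Stand-ins for the real Spark types; only the delegation shape is the point.
sealed trait Event
case object StageCompleted extends Event
case object TaskEnd extends Event

abstract class FirehoseSketch {
  def onEvent(event: Event): Unit
  // The real class does this for every SparkListener callback.
  final def onStageCompleted(): Unit = onEvent(StageCompleted)
  final def onTaskEnd(): Unit = onEvent(TaskEnd)
}

class Recording extends FirehoseSketch {
  var seen: List[Event] = Nil
  def onEvent(event: Event): Unit = seen = seen :+ event
}
```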
> Augmenting JsonProtocol
>
> JsonRelay does two things that JsonProtocol does not:
>
>    - adds an appId field to all events; this makes it possible/easy for
>    downstream things (slim, in this case) to handle information about
>    multiple Spark applications.
>    - JSON-serializes SparkListenerExecutorMetricsUpdate events. This was
>    added to JsonProtocol in SPARK-9036
>    <https://issues.apache.org/jira/browse/SPARK-9036> (though it’s unused
>    in the Spark repo currently), but I’ll have to leave my version in as long
>    as I want to support Spark <= 1.4.1.
>       - From one perspective, JobProgressListener was sort of “cheating”
>       by using these events that were previously not accessible via
>       JsonProtocol.
>
> It seems like making an effort to let external tools get the same kinds of
> data as the internal listeners is a good principle to try to maintain,
> which is also relevant to the scoping questions about JsonProtocol above.
>
> Should JsonProtocol add appIds to all events itself? Should Spark make it
> easier for downstream things to process events from multiple Spark
> applications? JsonRelay currently pulls the app ID out of the SparkConf
> that it is instantiated with
> <https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L16>;
> it works, but also feels hacky and like maybe I’m doing things I’m not
> supposed to.
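[Editor's note: the appId augmentation can be sketched as a plain string transform. The real JsonRelay merges a field into the json4s value produced by JsonProtocol; this hand-rolled stand-in just splices a field into an already-serialized object.]

```scala
// Tag every serialized event with the application ID so downstream
// consumers (slim, here) can tell multiple applications' streams apart.
def addAppId(eventJson: String, appId: String): String = {
  require(eventJson.startsWith("{") && eventJson.endsWith("}"))
  if (eventJson == "{}") s"""{"appId":"$appId"}"""
  else s"""{"appId":"$appId",${eventJson.tail}"""
}
```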
> Thrift SparkListenerEvent Implementation?
>
> A few months ago I built a first version of this project involving a
> SparkListener called Spear <https://github.com/hammerlab/spear> that
> aggregated stats from SparkListenerEvents *and* wrote those stats to
> Mongo, combining JsonRelay and slim from above.
>
> Spear used a couple of libraries (Rogue
> <https://github.com/foursquare/rogue> and Spindle
> <https://github.com/foursquare/spindle>) to define schemas in thrift,
> generate Scala for those classes, and do all the Mongo querying in a nice,
> type-safe way.
>
> Unfortunately for me, all of the Mongo queries were synchronous in that
> implementation, which led to events being dropped
> <https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40>
> when I tested it on large jobs (thanks a lot to @squito for helping debug
> that). I got spooked and decided all possible work had to be removed from
> the Spark driver, hence JsonRelay doing the most minimal possible thing
> and sending the events to a network address.
>
> I also decided to rewrite my JobProgressListener-equivalent piece in Node
> rather than Scala because I was finding the type-safety a little
> constraining and Rogue and Spindle are no longer maintained.
>
> Anyways, a couple of things I want to call out, given this back-story:
>
>    - There is a bunch of boilerplate related to implementing my Mongo
>    schemas sitting abandoned in Spear, in thrift
>    <https://github.com/hammerlab/spear/blob/5f2affe80fae833a120eb63d51154c3c00ee57b0/src/main/thrift/spark.thrift>
>    and case classes
>    <https://github.com/hammerlab/spear/blob/6f9efa9aab0ce00fe229083ded237199aafd9b74/src/main/scala/org/hammerlab/spear/SparkIDL.scala>
>    .
>       - One thing on my roadmap at the time was to replicate the
>       SparkListenerEvents themselves in thrift and/or case classes.
>       - They seem like things you’d want to be able to send in and
>       out of various services/languages easily.
>       - Writing the raw events to Mongo is a related goal for slim,
>       covered by slim#35 <https://github.com/hammerlab/slim/issues/35>.
>     - It’s probably not super difficult to take something like Spear and
>    make it use non-blocking Mongo queries.
>       - This might make it quick enough to not have to drop events.
>       - It would simplify the infrastructure story by folding slim into
>       the listener that is registered with the driver.
>
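[Editor's note: the non-blocking idea in the last bullets can be sketched as a bounded queue drained off-thread; all names below are assumed for illustration. Posting never blocks the caller, and overflow is counted as drops, which is the same trade-off LiveListenerBus makes on the driver side.]

```scala
import java.util.concurrent.ArrayBlockingQueue

// Sketch: `write` stands in for a Mongo upsert; the listener thread only
// ever calls `post`, which cannot block, so slow writes can't stall the bus.
class AsyncSink[E](capacity: Int)(write: E => Unit) {
  private val queue = new ArrayBlockingQueue[E](capacity)
  @volatile var dropped: Int = 0
  private val worker = new Thread(() => while (true) write(queue.take()))
  worker.setDaemon(true) // don't keep the JVM alive for the drain loop
  worker.start()
  def post(event: E): Unit =
    if (!queue.offer(event)) dropped += 1 // drop rather than block the driver
}
```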
> Possible Upstreaming
>
> I don’t know what the interest level for getting any of this functionality
> (e.g. live-updating web UI, event-stats persistence to Mongo) upstreamed is
> and don’t have time to work on that, so I won’t go into much detail here
> but if anyone’s interested I have some ideas about how some of it could
> happen (e.g. DDP <https://www.meteor.com/ddp> server implementation in
> java/scala serving the web UI; Spree currently involves two JS servers so
> some rewriting of things would probably have to happen), why it might be
> good to do, and why it might be not good or not worth it (e.g. Spark should
> make sure it’s possible and easy to do sophisticated things like this
> outside of the Spark repo, putting more work in the driver process is a bad
> idea, etc.).
>
> OK, that’s my brain dump, I’d love to hear peoples’ thoughts on any/all of
> this, otherwise thanks for the APIs and sorry for having to cheat them a
> bit! :)
>
> -Ryan
>



-- 
Pedro Rodriguez
UCBerkeley 2014 | Computer Science
SnowGeek <http://SnowGeek.org>
pedro-rodriguez.com
ski.rodriguez@gmail.com
208-340-1703
