hadoop-yarn-dev mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Integrating Flink's web UI with YARN Timeline Server
Date Thu, 10 Dec 2015 19:56:02 GMT

> On 10 Dec 2015, at 16:28, Stephan Ewen <sewen@apache.org> wrote:
> 
> Hi!
> 
> We are looking into options to integrate Apache Flink's monitoring web
> frontend with the YARN Timeline Server. Flink has its own web frontend for
> monitoring and analyzing running jobs. The web frontend shows a lot of
> Flink-specific stuff, in addition to task start and end times.
> 
> When Flink runs on YARN, the web frontend server lives as part of the App
> Master, and its metrics are kept only on the App Master. The metrics and
> web frontend are gone once the job finishes and the App Master quits.
> 
> I am wondering now if we could store Flink's monitoring data in the YARN
> Timeline Server and visualize it from there, to make past jobs' data
> accessible.
> 

Stephan: I've been doing exactly that in Scala for Spark

- every app attempt is its own "entity" in ATS; its ID == the attempt ID
- every event to be recorded is serialized to another piece of JSON and queued for posting
- various bits of metadata are added/updated with every post (app id, name, user, start/end/update
times, and a version counter to ease checking for updates)
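The per-attempt entity model above could be sketched like this. This is a hypothetical illustration: the class and field names are mine, not the actual Spark or ATS types.

```java
// Hypothetical sketch of the per-attempt entity described above: the ATS
// entity ID is the attempt ID, and the metadata carries a version counter
// that is bumped on every post so readers can cheaply detect updates.
class AttemptEntity {
    final String entityType = "spark-app";
    final String attemptId;   // ATS entity ID == the app attempt ID
    String appName;
    String user;
    long startTime;
    long endTime;             // 0 while the attempt is still running
    long lastUpdated;
    long version;             // incremented on every post

    AttemptEntity(String attemptId, String appName, String user, long startTime) {
        this.attemptId = attemptId;
        this.appName = appName;
        this.user = user;
        this.startTime = startTime;
    }

    /** Refresh the metadata before each post, bumping the version counter. */
    void touch(long now) {
        lastUpdated = now;
        version++;
    }
}
```

The serialized events themselves would be attached to this entity and posted alongside the metadata.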

There's some batching of posts and requests to ease load and handle outages.

To read the stuff in, I grab the metadata, which is then rendered in the (existing) Spark history
server as an app/attempt (with state=incomplete/complete)
- this is done with a query for the metadata only of entities of the given type (spark-app) and
time interval (since the last incomplete app onwards, inclusive)
- when the user wants the actual history of an attempt, the entire history of that app is retrieved
in a single GET and the JSON events are played back (an O(history) payload and playback process,
which can be fairly expensive on the ATS side as well as for the clients)
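The two read-side requests above look roughly like this against the ATS v1 REST API. The `/ws/v1/timeline` base path is the real v1 endpoint; the host, port, and entity type here are assumptions for illustration.

```java
// Sketch of the two read-side GETs: a cheap metadata-only listing, and the
// expensive full-entity fetch that returns every event for playback.
class AtsReadUrls {
    // Assumed host/port; the base path is the real ATS v1 REST endpoint.
    static final String BASE = "http://timelineserver:8188/ws/v1/timeline";

    /** Metadata-only listing of entities of one type since a start time:
     *  fields=PRIMARYFILTERS,OTHERINFO skips the (large) event payloads. */
    static String listEntities(String entityType, long windowStart) {
        return BASE + "/" + entityType
            + "?fields=PRIMARYFILTERS,OTHERINFO&windowStart=" + windowStart;
    }

    /** Full single-entity fetch, events included: the O(history) GET. */
    static String fullEntity(String entityType, String entityId) {
        return BASE + "/" + entityType + "/" + entityId;
    }
}
```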

The hard parts are:
1. The extra complexity of queuing & buffering posts and then dropping surplus ones if
needed. The ATS client does some of this (2.7.1+), but to avoid OOM in the client I'm
still doing my own.
2. A Kerberos-aware REST client for ATS, as there isn't one in Hadoop itself.
3. Kerberos in general. Obviously.
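Hard part #1 could be sketched as a bounded queue that batches posts and drops the oldest surplus rather than running out of memory. This is only an illustration of the buffering strategy, not the actual ATS client code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Illustrative bounded post queue: batch posts for the timeline server and
// drop the oldest surplus entries instead of letting the client OOM.
class BoundedPostQueue<T> {
    private final ArrayDeque<T> queue = new ArrayDeque<>();
    private final int maxSize;
    private final int batchSize;
    private long dropped;

    BoundedPostQueue(int maxSize, int batchSize) {
        this.maxSize = maxSize;
        this.batchSize = batchSize;
    }

    /** Enqueue a post; when the buffer is full, drop the oldest entry. */
    void offer(T post) {
        if (queue.size() >= maxSize) {
            queue.pollFirst();
            dropped++;
        }
        queue.addLast(post);
    }

    /** Take up to one batch for posting; on an ATS outage, re-offer it later. */
    List<T> nextBatch() {
        List<T> batch = new ArrayList<>();
        while (batch.size() < batchSize && !queue.isEmpty()) {
            batch.add(queue.pollFirst());
        }
        return batch;
    }

    long dropped() { return dropped; }
    int size() { return queue.size(); }
}
```

Dropping from the head keeps the most recent events, which is usually the right trade-off when the server is unreachable for a long stretch.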


The good news for #2, and perhaps #3, is that you can take mine, which is split into a generic
Jersey+SPNEGO+delegation-token-aware client and a timeline client, and use them:

https://github.com/steveloughran/spark/tree/stevel/feature/SPARK-1537-ATS/yarn/src/history/main/scala/org/apache/spark/deploy/history/yarn

You should be able to unwind any Spark lib dependencies, which will primarily be around logging
& scalatest. Do not attempt to write your own Kerberos/token REST client; you will gain nothing.
Take that one, and email me directly with questions.


> 
> I have seen that the Timeline Server allows applications to store some
> generic data. I have not fully understood what it allows, though.
> To illustrate what we are looking for, let me give you a bit of background
> into how Flink's web frontend is structured.
> 
> Flink's web frontend is structured in a very simple way, so that after a
> job is done, no dynamic data or handlers are needed on the server side any
> more. Everything is static files and JSON objects, at specific relative URLs
> 
> (1) A set of static HTML / JS / CSS files that implement the visualization
> 
> (2) Some JSON objects with static data (once the job is complete), under
> pre-defined URLs.
> For example, the path
> "<app-id-root>/jobs/7684be6004e4e955c2a558a9bc463f65/exception" would
> return the static response '{ "root-exception": "java.io.IOException: File
> already exists:/tmp/abzs/2\n\tat
> org.apache.flink.core.fs.local.LocalFileSystem. ...", ...}'
> 
> 
> In some sense, what this would need is a Key/Value store where the key is a
> URL and the value is a JSON object or small file.
> A post-run hook in Flink would issue a set of POST requests to store the
> JSON objects and files under the URLs. That's it. Calling the index.html
> at the specific URL of that job would then run Flink's rendering of the
> job's metrics and times.
> 
> I know it is a bit of a long shot, but would the timeline server support
> something like this?
> 
> Greetings,
> Stephan

