hadoop-yarn-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Li Lu <...@hortonworks.com>
Subject Re: Integrating Flink's web UI with YARN Timeline Server
Date Thu, 10 Dec 2015 22:19:55 GMT
Hi Stephan,

Yes the YARN ATS is definitely worth trying. One thing to notice is that there are continuous
efforts to improve the scalability of ATS. You may want to take a look at some recent efforts
on ATS 1.5 (YARN-4233), where a few new writer APIs are proposed for better scalability. In
general, the new ATS v1.5 APIs allow applications to selectively put some detailed timeline
data onto HDFS, and only read them when required. The only price on using these API is the
application needs to provide a simple plugin that tells ATS which read request is for those
detailed timeline data. The Tez community is actively working on using timeline v1.5 and we’re
targeting 2.8 to add this feature in.

As a in-progress work, the timeline v2 project is in a separate feature branch (feature-YARN-2928).
While the v2 API may not land very soon to the main line codebase, you’re more than welcome
to check this branch out and let us know your suggestions on the APIs.

If there’s any problems please feel free to open a JIRA or ping us. Anyone on those JIRAs
would be fine. Thanks!

Li Lu

On Dec 10, 2015, at 11:56, Steve Loughran <stevel@hortonworks.com<mailto:stevel@hortonworks.com>>
wrote:


On 10 Dec 2015, at 16:28, Stephan Ewen <sewen@apache.org<mailto:sewen@apache.org>>
wrote:

Hi!

We are looking into options to integrate Apache Flink's monitoring web
frontend with the YARN Timeline Server. Flink has its own web frontend for
monitoring and analyzing running jobs. The web frontend shows a lot of
Flink specific stuff, in addition to tast start and end times.

When Flink runs on YARN, the web frontend server lives as part of the App
Master, and its metrics are kept only on the App Master. The metrics and
web frontend are gone once the job finishes and the App Master quits.

I am wondering now is we could store Flink's monitoring data in the YARN
Timeline Server and visualize it from there, to make past job's data
accessible.


Stephan: I've been doing exactly that in Scala for Spark

-every app attempt is its own "entity" in ATS; It's ID == the attempt ID
-every event to be recorded is serailized to another piece of json and queued for posting
-various bits of metadata are added/updated with every post ( app it, name, user, start/end/update
times, and a version counter to ease checking for updates)

There's some batching of posts and requests to ease load and handle outages.

to read the stuff in, I grab the metadata which is then rendered in the (existing) spark history
server as an app/attempt (with state=incomplete/complete)
-this is done with a query for the metadata only of entities of the given type (spark-app),
time interval (since the last incomplete app onwards inclusive),
-when the user wants the actual history of an attempt the entire history of that app is retrieved
in a single GET and then the json events played back (an O(history) payload and playback process,
can be fairly expensive on the ATS side as well as the clients)

Hard parts are
1. The extra complexity to deal with some queuing & buffering of posts and then dropping
surplus packets if needed. the ATS client does some of this (2.7.1+), but to avoid OOM in
the client I'm still doing my own
2. A kerberos aware REST client for ATS, as there isn't one in Hadoop itself.
3. Kerberos in general. Obviously.


The good news for #2 and perhaps #3 is that you can take mine, which is split into a generic
Jersey+SPNEGO+delegation-token-aware client and a timeline client and use them

https://github.com/steveloughran/spark/tree/stevel/feature/SPARK-1537-ATS/yarn/src/history/main/scala/org/apache/spark/deploy/history/yarn

you should be able to unwind any spark lib dependencies, which will primarily be around logging
& scalatest. Do not attempt to write your kerberos/token REST client. You will gain nothing.
Take that one, email me direct for questions.



I have seen that the Timeline Server allows applications to store some
generic data. I have not fully understood what it allows, though.
To illustrate what we are looking for, let me give you a bit of background
into how Flink's web frontend is structured.

Flink's web frontend is structured in a very simple way, so that after a
job is done, no dynamic data or handlers are needed on the server side any
more. Everything is static files and JSON objects, at specific relative URLs

(1) A set of static HTML / JS / CSS files that implement the visualization

(2) Some JSON objects with static data (once the job is complete), under
pre-defined URL.
For example, the path
"<app-id-root>/jobs/7684be6004e4e955c2a558a9bc463f65/exception" would
return the static response '{ "root-exception": "java.io.IOException: File
already exists:/tmp/abzs/2\n\tat
org.apache.flink.core.fs.local.LocalFileSystem. ...", ...}'


In some sense, what this would need is a Key/Value store where the key is a
URL and the value is a JSON object or small file.
A post-run hook in Flink would call a set of POST requests to store the
JSON objects and files under the URLs. That's it. Calling the index.html
then at the specific URL of that job would run Flink's rendering of the
job's metrics and times.

I know it is a bit of a long shot, but would the timeline server support
something like this?

Greetings,
Stephan



Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message