flink-dev mailing list archives

From Fabian Hueske <fhue...@apache.org>
Subject Re: Enhance Flink's monitoring capabilities
Date Sat, 22 Nov 2014 23:03:52 GMT
Hi Nils,

Flink's current monitoring is quite limited and basically restricted to
status updates of the parallel tasks (scheduled, started, finished,
canceled, failed, etc.).
There is also some code lying around to collect system stats such as CPU,
memory, and network utilization. However, it is not used right now, AFAIK.
For a long-running job, it is hard to figure out what is going on
and whether the program is making progress.
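
As a rough illustration of the kind of system stats meant here, the sketch below reads a few values from the Platform MXBeans that Nils mentions in his mail (quoted below). It only uses the standard java.lang.management API; the class name PlatformStatsDemo and the map keys are made up for this example and are not Flink code. As far as I know, network utilization is not covered by these beans.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Flink.
class PlatformStatsDemo {

    // Collects a small snapshot of JVM and OS level statistics.
    static Map<String, Number> snapshot() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        Map<String, Number> stats = new LinkedHashMap<>();
        stats.put("availableProcessors", os.getAvailableProcessors());
        // Negative if the platform cannot report a load average.
        stats.put("systemLoadAverage", os.getSystemLoadAverage());
        stats.put("heapUsedBytes", heap.getUsed());
        // -1 if the maximum heap size is undefined.
        stats.put("heapMaxBytes", heap.getMax());
        stats.put("liveThreads", ManagementFactory.getThreadMXBean().getThreadCount());
        return stats;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Number> entry : snapshot().entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Whether these values are detailed enough depends on the depth of monitoring we are aiming for.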

Having a monitoring infrastructure that makes it easy to add, collect, and
query new metrics would be a great addition to Flink.
From what I know, JMX was explicitly designed for this purpose and seems to
be a good fit. Since it is a Java standard, other tools can easily connect
and retrieve monitoring data.

As a starting point, I would focus on getting an early prototype that uses JMX
to collect a single metric such as the number of tuples processed by a Map
function.
Having such a showcase would help to have a good discussion about how to
implement the monitoring infrastructure.
The question of metrics to collect is orthogonal to that. If we have a good
system to collect and gather stats, these can be added one by one.
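
To make the prototype idea a bit more concrete, here is a minimal sketch of a standard MBean that counts processed tuples and is read back through the platform MBean server. All names (TupleCounterDemo, TupleCounter, countThroughJmx, the ObjectName) are assumptions for illustration, not existing Flink code. A real prototype would call increment() from inside the Map task and would probably talk to a remote MBean server instead of the local one.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical names throughout; none of this is existing Flink code.
class TupleCounterDemo {

    // Standard MBean convention: the management interface is named
    // <implementation class name> + "MBean" and must be public.
    public interface TupleCounterMBean {
        long getTuplesProcessed();
    }

    public static class TupleCounter implements TupleCounterMBean {
        private final AtomicLong processed = new AtomicLong();

        // Would be called once per record by the (assumed) Map task wrapper.
        public void increment() {
            processed.incrementAndGet();
        }

        @Override
        public long getTuplesProcessed() {
            return processed.get();
        }
    }

    // Registers a counter, simulates processing `tuples` records, and reads
    // the value back through the MBean server the way a JMX client would.
    static long countThroughJmx(int tuples) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name =
                new ObjectName("org.example.flink:type=TupleCounter,task=Map");
            TupleCounter counter = new TupleCounter();
            server.registerMBean(counter, name);
            try {
                for (int i = 0; i < tuples; i++) {
                    counter.increment();
                }
                // The getter getTuplesProcessed() is exposed as the
                // attribute "TuplesProcessed".
                return (Long) server.getAttribute(name, "TuplesProcessed");
            } finally {
                server.unregisterMBean(name);
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX registration or lookup failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("tuples processed: " + countThroughJmx(1000));
    }
}
```

While such a program is running, a tool like JConsole can attach to the JVM and show the same attribute, which is the "other tools can easily connect" part.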

Cheers, Fabian


2014-11-21 18:32 GMT+01:00 Nils E <nilsengelbach@gmail.com>:

> Hello everyone,
>
> I am trying to enhance Flink's monitoring capabilities in the style of the
> GSoC 2014 proposal by Rajika Kumarasiri [1].
>
> Short abstract:
> He suggested using the Java standard for this, the Java Management
> Extensions (JMX).
> The idea is to put an MBean server in the JobManager, so that the
> JobManager itself and all TaskManagers in the cluster can register their
> MBeans with this server via RMI.
> Different monitoring levels (none, standard, full) limit the effect on
> system performance.
> The JMX service should be accessible from an improved web component via a
> RESTful API.
> He also suggested the use of the SIGAR[2] JNI library to gather the system
> information.
> In my opinion this point is debatable. Java 7 already ships Platform
> MXBeans[3], which cover the basic system information, so in my
> eyes the use of a JNI library might be a little overkill. But of course
> this depends on the intended depth of monitoring.
>
> So the primary question:
> Which parameters, system properties, utilization figures, and workloads
> should be monitored in your opinion?
>
> Have a nice weekend!
> Nils
>
> [1] https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Rajika-Kumarasiri
> [2] https://support.hyperic.com/display/SIGAR/Home
> [3] https://docs.oracle.com/javase/7/docs/technotes/guides/management/overview.html
>
