flink-dev mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: Enhance Flink's monitoring capabilities
Date Sun, 23 Nov 2014 22:28:55 GMT

On 23 Nov 2014, at 00:03, Fabian Hueske <fhueske@apache.org> wrote:

> Hi Nils,
> Flink's current monitoring is quite limited and basically restricted to
> status updates of the parallel tasks (scheduled, started, finished,
> canceled, failed, etc.).
> There is also some code lying around to collect system stats such as CPU,
> memory, and network utilization. However, it is not used right now, AFAIK.
> In case of a long running job, it is hard to figure out what is going on
> and whether a program makes progress or not.
> Having a monitoring infrastructure that makes it easy to add, collect, and
> query new metrics would be a great addition to Flink.
> From what I know, JMX was explicitly designed for this purpose and seems to
> be a good fit. Since it is a Java standard, other tools can easily connect
> and retrieve monitoring data.
> As a starting point, I would focus on getting an early prototype that uses
> JMX to collect a single metric, such as the number of tuples processed by a
> Map function.
> Having such a showcase would help us have a good discussion about how to
> implement the monitoring infrastructure.
> The question of metrics to collect is orthogonal to that. If we have a good
> system to collect and gather stats, these can be added one by one.
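A minimal sketch of the kind of prototype described in the quoted message, assuming only the standard JMX classes that ship with the JDK. The class names, the `org.example.monitoring` domain, and the `registerAndRead` helper are hypothetical and not part of Flink; the loop in `main` merely stands in for a Map function.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class TupleCounterSketch {

    /** Management interface: JMX exposes the getter as the read-only attribute "TuplesProcessed". */
    public interface TupleCounterMBean {
        long getTuplesProcessed();
    }

    /** Hypothetical metric holder that an operator would increment once per tuple. */
    public static class TupleCounter implements TupleCounterMBean {
        private final AtomicLong tuplesProcessed = new AtomicLong();

        public void increment() {
            tuplesProcessed.incrementAndGet();
        }

        @Override
        public long getTuplesProcessed() {
            return tuplesProcessed.get();
        }
    }

    /** Registers the counter with the platform MBean server and reads the attribute back through JMX. */
    public static long registerAndRead(TupleCounter counter, String objectName) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName(objectName);
            server.registerMBean(counter, name);
            return (Long) server.getAttribute(name, "TuplesProcessed");
        } catch (JMException e) {
            throw new IllegalStateException("JMX registration or lookup failed", e);
        }
    }

    public static void main(String[] args) {
        TupleCounter counter = new TupleCounter();

        // Stand-in for a Map function: transform each tuple and count it.
        for (String tuple : new String[] {"a", "b", "c"}) {
            tuple.toUpperCase();
            counter.increment();
        }

        long seen = registerAndRead(counter, "org.example.monitoring:type=TupleCounter,operator=Map");
        System.out.println("TuplesProcessed = " + seen);
    }
}
```

Once the MBean is registered, external JMX clients such as JConsole should be able to read the same attribute without further code, which is the interoperability argument made in the quoted message.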


I don't have experience with JMX, but I agree with Fabian that the architecture of this monitoring
service is very important and should come first. It should be flexible enough to easily support
the collection of metrics by any operator and the user.

Every task manager needs to expose this service to collect (and aggregate) data, which would then
be gathered at a central instance (e.g. the JobManager). I am not sure at this point, but
it might be worthwhile to think about separating this central monitoring service from the
JobManager in order to reduce JobManager load and have more flexibility, e.g. running it as
a central history server to monitor multiple JobManager instances (for example in YARN setups).
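To make the proposed split concrete, here is a rough in-memory sketch of a central monitoring service that is separate from any JobManager and tracks several of them. Everything in it (the `CentralMonitorSketch` class, the `report` and `total` methods, the string identifiers) is an assumption made for illustration; it ignores transport, persistence, and failure handling.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CentralMonitorSketch {

    // jobManagerId -> metric name -> taskManagerId -> latest locally aggregated value
    private final Map<String, Map<String, Map<String, Long>>> latest = new ConcurrentHashMap<>();

    /** Called by a task manager to publish its current, locally aggregated value for a metric. */
    public void report(String jobManagerId, String taskManagerId, String metric, long value) {
        latest.computeIfAbsent(jobManagerId, k -> new ConcurrentHashMap<>())
              .computeIfAbsent(metric, k -> new ConcurrentHashMap<>())
              .put(taskManagerId, value);
    }

    /** Sums the latest values of one metric over all task managers of one JobManager. */
    public long total(String jobManagerId, String metric) {
        Map<String, Map<String, Long>> byMetric = latest.get(jobManagerId);
        if (byMetric == null) {
            return 0L;
        }
        Map<String, Long> byTaskManager = byMetric.get(metric);
        if (byTaskManager == null) {
            return 0L;
        }
        long sum = 0L;
        for (long value : byTaskManager.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        CentralMonitorSketch monitor = new CentralMonitorSketch();

        // Two task managers belonging to one JobManager, one belonging to another.
        monitor.report("jm-1", "tm-1", "tuplesProcessed", 100);
        monitor.report("jm-1", "tm-2", "tuplesProcessed", 250);
        monitor.report("jm-2", "tm-1", "tuplesProcessed", 40);

        // A later report from the same task manager replaces its earlier value.
        monitor.report("jm-1", "tm-1", "tuplesProcessed", 180);

        System.out.println("jm-1 total: " + monitor.total("jm-1", "tuplesProcessed"));
        System.out.println("jm-2 total: " + monitor.total("jm-2", "tuplesProcessed"));
    }
}
```

Keying the state by JobManager first is what would let a single monitor instance serve several JobManagers, as in the YARN case mentioned above.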

– Ufuk