flink-dev mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: [DISCUSS] FLIP-7 Expose metrics to WebInterface
Date Tue, 02 Aug 2016 14:29:31 GMT
Regarding transfer:
I think objects are fine, as long as they are not user-defined objects. We
can limit it to String and subclasses of Number.
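
A minimal sketch of that restriction, assuming the dump is a plain map from metric name to value. The class and method names (`TransferableValues`, `isTransferable`, `sanitize`) are hypothetical and not part of Flink; any value that is neither a String nor a Number is replaced by its string form so that no user class has to be loaded on the receiving side.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not a Flink class. */
final class TransferableValues {

    /** True if the value may be shipped as-is: a String or any subclass of Number. */
    static boolean isTransferable(Object value) {
        return value instanceof String || value instanceof Number;
    }

    /** Returns a copy of the dump in which every other value is replaced by its string form. */
    static Map<String, Object> sanitize(Map<String, Object> dump) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : dump.entrySet()) {
            Object value = entry.getValue();
            out.put(entry.getKey(), isTransferable(value) ? value : String.valueOf(value));
        }
        return out;
    }
}
```

Note that `instanceof Number` also accepts a user-defined subclass of Number; a stricter variant could check against the JDK's boxed types only.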

Regarding traversal of groups:
I am still thinking here in terms of the paradigm that the metrics should
impact the regular system as little as possible. Shifting work to the
"query/dump" action is good in that sense, unless that means permanent
re-construction of the name.
The metric query endpoint could (should) be a separate actor from the
TaskManager, in my opinion. That also solves the issue of blocking the
TaskManager actor.

BTW: Can the Dumper be simply a special reporter that understands the
component metric groups and does not use scope formats?
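
To make the traversal idea concrete, here is a rough sketch of a dumper that walks a group tree only when it is queried and emits name/value pairs without any scope format. `Group` and `TreeDumper` are hypothetical stand-ins for illustration, not Flink's AbstractMetricGroup or reporter interfaces, and the "." separator is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a component metric group. */
final class Group {
    final String name;
    final List<Group> children = new ArrayList<>();
    final Map<String, Supplier<Object>> metrics = new LinkedHashMap<>();

    Group(String name) {
        this.name = name;
    }
}

/** Hypothetical dumper: nothing is registered up front, all work happens at query time. */
final class TreeDumper {

    static Map<String, Object> dump(Group root) {
        Map<String, Object> out = new LinkedHashMap<>();
        walk(root, root.name, out);
        return out;
    }

    // Depth-first walk; the prefix is the path of group names from the root.
    private static void walk(Group group, String prefix, Map<String, Object> out) {
        for (Map.Entry<String, Supplier<Object>> metric : group.metrics.entrySet()) {
            out.put(prefix + "." + metric.getKey(), metric.getValue().get());
        }
        for (Group child : group.children) {
            walk(child, prefix + "." + child.name, out);
        }
    }
}
```

This version rebuilds every name on each dump, which is exactly the cost mentioned above; caching the prefix per group would avoid it. The plain collections are also not thread-safe, so a real version would have to guard against groups being torn down while the walk is running.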

On Tue, Aug 2, 2016 at 3:50 PM, Chesnay Schepler <chesnay@apache.org> wrote:

> Thank you for your feedback :)
>
> Regarding names:
>
>    The Dumper does not create a MetricSnapshot. The Dumper creates a
>    list of key-value pairs; metric_name:value.
>    A (single) MetricSnapshot exists in the WebRuntimeMonitor, into
>    which the dumped list is inserted.
>
>    So the dumper creates a snapshot but not a MetricSnapshot, and the
>    WebRuntimeMonitor contains a MetricSnapshot which isn't really a
>    snapshot but more a storage.
>
>    The naming isn't the best.
>
>    I'm not sure if "Service" really fits the bill; I associate a
>    service with a separate thread running in the background.
>
> Regarding merging of metrics:
>
>    We are not merging any metrics right now. While Counters are easy to
>    merge, for Gauges we may have to let the user choose in the
>    WebInterface how they should be aggregated.
>
>    This is /not really/ a problem; in the sense that we don't have
>    different versions overwriting each other:
>
>      * JM/TM metrics don't have to be merged
>      * task metrics can be kept on a per subtask/operator level for now
>        (the prototype exposes them as
>        "<subtask_index>_<operator_name>_<metric_name>")
>      * job metrics are currently only gathered on the JM; so no merging
>        here either
>
> Regarding transfer:
>
>    Should we transfer numbers as numbers, or also as strings? I'm
>    concerned about the efficiency of the whole thing; if we send some
>    metrics as strings and some as numbers we have to decide for every
>    metric which option we should take. That's why I was wondering
>    whether to send everything as objects or everything as strings.
>
> Regarding traversal of groups:
>
>    Yes, we would save on startup/teardown time if we traversed the
>    groups instead. However, the dumping itself would become more
>    expensive this way, and since this is done by the TaskManager thread
>    I wanted to keep it as simple as possible.
>
>    Also, there is currently no way to access the metrics contained in a
>    group. We would have to add another method to the
>    AbstractMetricGroup, which I would prefer not to do as it can lead
>    to concurrency issues during teardown.
>
>
>
> On 02.08.2016 15:05, Till Rohrmann wrote:
>
>> The metrics transfer design document looks good to me. Thanks for your
>> work, Chesnay :-)
>>
>> I think the benefit of registering the metrics at the MetricDumper is that
>> we don't have to walk through the hierarchy of metric groups to collect
>> the metric values. Indeed, this comes with increased costs at start-up. But
>> I'm not sure what's the concrete impact on job performance in these cases.
>>
>> Cheers,
>> Till
>>
>> On Tue, Aug 2, 2016 at 8:34 PM, Stephan Ewen <sewen@apache.org> wrote:
>>
>>> Hi!
>>>
>>> Thanks for writing this up. I think it looks quite reasonable (I hope I
>>> understood the design correctly).
>>>
>>> There is one point of confusion left for me, though: the MetricDumper
>>> and MetricSnapshot. I think it is just the names that confuse me here.
>>> It looks like they define a way to query the metrics in the Metric
>>> Registry in a standard schema (independent of the scope formats).
>>> Should the "dumper" maybe be called "MetricsQueryService" or so? (The
>>> query service returns a MetricSnapshot, if I understand correctly.)
>>>
>>> It would be great if the "query service" did not need metrics to be
>>> registered; that saves us some effort during startup / teardown. It looks
>>> as if the query service could just use the root-most component metric
>>> groups to walk the tree of whatever metrics are currently there and put
>>> them into the current snapshot.
>>>
>>> One open question that I have: how do you know how to merge the
>>> metrics from the subtasks, for example in case you want a metric across
>>> subtasks?
>>>
>>> In general, not transferring objects (only strings / numbers) would be
>>> preferable, because the WebMonitor may run in an environment where no
>>> user-code classloader can be used.
>>> It may run in the dispatcher (which must be trusted and cannot execute
>>> user code).
>>>
>>> Greetings,
>>> Stephan
>>>
>>>
>>>
>>> On Thu, Jul 28, 2016 at 3:12 PM, Chesnay Schepler <chesnay@apache.org>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I just created a new FLIP which aims at exposing our metrics to the
>>>> WebInterface.
>>>>
>>>>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-7%3A+Expose+metrics+to+WebInterface
>>>>
>>>> Looking forward to feedback :)
>>>>
>>>> Regards,
>>>> Chesnay Schepler
>>>>
>>>>
>
