reef-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (REEF-1732) Build Metrics System
Date Tue, 05 Jun 2018 00:07:00 GMT


ASF GitHub Bot commented on REEF-1732:

mandyshieh commented on issue #1460: [REEF-1732] Build Metrics System
   I spent the day looking into EventCounters and wanted to share my findings. While I could
get it to collect some data, I don't think it is completely in line with what we set out to
achieve when we started building a metrics system for REEF:
   - First and foremost, EventSource/EventCounter seems to be designed for the user to define
the ETW provider, so it appears to be only for Windows applications. I couldn't find any
examples of how event capturing is used on other operating systems.
   - The above comment mentioned that the viewing tool keeps history. In the examples I found,
PerfView is the primary tool used to view all the events collected over a timespan (is this
what you meant?). The tool collects events at a specified interval and calculates statistics
(mean, standard deviation, etc.) on the execution times of the events captured during that
interval. However, if I wanted to take these values and send them somewhere else, for example
to the AML training service, I haven't found an easy way to do it.
   - EventSource/EventCounter seems pretty much confined to numbers and strings; I'm not sure
there is an easy way to collect metrics of user-defined types.
   There were two original goals of this PR: one was to maintain a time series of the updates
to a single metric, and the other was to support any type of metric that the user specifies.
For the first one, I feel a push model is more suitable, and with help and many iterations
it finally no longer needs additional locks that might hurt performance, while still
preserving the entire history of updates.
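
   To illustrate the push model described above, here is a minimal Java sketch. It is not the
code from the PR: the names MetricRecord, HistoryMetric, push and drain are hypothetical, and
the lock-free behavior is assumed to come from java.util.concurrent.ConcurrentLinkedQueue
rather than from whatever structure the PR actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

/** One timestamped update of a metric; T may be any user-defined type. */
final class MetricRecord<T> {
  final long timestampMillis;
  final T value;

  MetricRecord(long timestampMillis, T value) {
    this.timestampMillis = timestampMillis;
    this.value = value;
  }
}

/** Hypothetical push-model metric that keeps every update instead of only the latest. */
final class HistoryMetric<T> {
  private final String name;
  // ConcurrentLinkedQueue is a non-blocking (CAS-based) queue, so push() takes no lock.
  private final ConcurrentLinkedQueue<MetricRecord<T>> history = new ConcurrentLinkedQueue<>();

  HistoryMetric(String name) {
    this.name = name;
  }

  String getName() {
    return name;
  }

  /** Called by the code that owns the metric whenever the value changes. */
  void push(T value) {
    history.add(new MetricRecord<>(System.currentTimeMillis(), value));
  }

  /** Removes and returns all records pushed so far, in insertion order. */
  List<MetricRecord<T>> drain() {
    List<MetricRecord<T>> out = new ArrayList<>();
    MetricRecord<T> record;
    while ((record = history.poll()) != null) {
      out.add(record);
    }
    return out;
  }
}
```

   In this sketch the producer calls push() on every update, and a collector periodically
calls drain() to pick up the full history since the last collection, so no update is lost
between collection intervals.
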
   Finally, to the last point, I feel this would probably be a sink implementation, where
the metrics collected by MetricsService would be sent somewhere that is queryable by the user.
Let me know what you think.

> Build Metrics System
> --------------------
>                 Key: REEF-1732
>                 URL:
>             Project: REEF
>          Issue Type: New Feature
>          Components: IMRU, REEF
>            Reporter: Julia
>            Assignee: Julia
>            Priority: Major
>         Attachments: IMRU Metrics System.docx
> The IMRU metrics system provides metrics data so that it can be shown to the user for
> monitoring or diagnosis. The goal is to build an E2E flow with simple/basic metrics data.
> We can then add more data later.
> * IMetricsProvider - there are multiple sources of metrics data:
>   1. Task metrics. These are specific to IMRU tasks, such as the current iteration and
>      progress. Each task can send its task state back to the driver and let the driver
>      aggregate it. Alternatively, as UpdateTask knows the current iteration and progress,
>      to start with we can just get the task status from the update task. The task metrics
>      can be provided by a task function such as IUpdateFunction and sent to the driver by
>      the task host as a TaskMessage with the heartbeat.
>   2. Driver metrics - for the IMRU driver, this can be system state such as
>      WaitingForEvaluator or TasksRunning, the current retry number, etc. Those driver
>      states are maintained inside IMRU
>   3. IMRUDriver will implement IMetricsProvider and supply metrics data.
> * IMetricsSink - the metrics data will be output somewhere so that it can be consumed by
>   a monitoring tool. An interface IMetricsSink will be defined to sink metrics data. An
>   implementation of the interface can store the data to remote storage. Multiple sinks
>   can be injected.
> * MetricsManager - it schedules a timer to get metrics from the IMetricsProviders and
>   outputs the metrics data through the IMetricsSinks.
> Attached file shows the diagram of the design.
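
The provider/sink/manager flow in the description above can be sketched roughly as follows.
This is only an illustration in Java, not the design from the attached document: the
interface names come from the description, but the method names (getMetrics, sink, start,
collectOnce), the Map-based metric representation, and the use of ScheduledExecutorService
as the timer are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** A source of metrics, e.g. the driver or a task (method name is hypothetical). */
interface IMetricsProvider {
  Map<String, Object> getMetrics();
}

/** A destination for metrics, e.g. remote storage (method name is hypothetical). */
interface IMetricsSink {
  void sink(Map<String, Object> metrics);
}

/** Polls every provider on a timer and forwards the merged snapshot to every sink. */
final class MetricsManager implements AutoCloseable {
  private final List<IMetricsProvider> providers;
  private final List<IMetricsSink> sinks;
  private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

  MetricsManager(List<IMetricsProvider> providers, List<IMetricsSink> sinks) {
    this.providers = new ArrayList<>(providers);
    this.sinks = new ArrayList<>(sinks);
  }

  /** Starts periodic collection; the first run happens after one full period. */
  void start(long periodMillis) {
    timer.scheduleAtFixedRate(this::collectOnce, periodMillis, periodMillis,
        TimeUnit.MILLISECONDS);
  }

  /** One collection pass: gather from all providers, then output to all sinks. */
  void collectOnce() {
    Map<String, Object> snapshot = new LinkedHashMap<>();
    for (IMetricsProvider provider : providers) {
      snapshot.putAll(provider.getMetrics());
    }
    for (IMetricsSink sink : sinks) {
      sink.sink(snapshot);
    }
  }

  /** Stops the timer thread. */
  @Override
  public void close() {
    timer.shutdownNow();
  }
}
```

In this sketch a driver-side provider might return entries such as its system state or
current retry number, and several sinks can be passed to the constructor to mirror the
"multiple sinks can be injected" requirement. Note that this is a pull model driven by the
timer, in contrast to the push model discussed in the comment above.
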
