airflow-dev mailing list archives

From Sergei Iakhnin <lle...@gmail.com>
Subject Re: Task Instance Anomaly Detection
Date Wed, 15 Nov 2017 15:38:25 GMT
I use the TICK stack - https://github.com/influxdata/. You can read more in
our paper - https://www.biorxiv.org/content/early/2017/09/08/185736
Basically, Telegraf collects metrics (including StatsD metrics from Airflow;
Airflow would benefit from emitting more of these) and sends them to InfluxDB.
Kapacitor runs anomaly-detection rules on top, and Chronograf and Grafana
handle visualization. If the resolution is automatable (service restarts,
etc.), I have an agent that uses SaltStack's HTTP API to communicate with a
configuration management server, which takes action to fix the issue. If the
issue is not automatable, I send notifications via email and Slack.
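
To make the idea concrete, here is a minimal sketch in Python of the kind of
rule such a pipeline might apply to a task metric (for example, task duration
in seconds). It is an illustration only, not the actual Kapacitor rules
mentioned above: the function names `is_anomalous` and `route_alert`, the
3-sigma threshold, and the sample numbers are all assumptions made for the
example.

```python
import statistics
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, sigmas: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `sigmas` sample standard
    deviations away from the mean of `history`.

    Hypothetical helper: the threshold and the choice of a plain
    z-score are assumptions, not taken from the original setup.
    """
    if len(history) < 2:
        # Not enough data to estimate spread; do not alert.
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # All past values identical: any deviation counts as a change.
        return latest != mean
    return abs(latest - mean) / stdev > sigmas


def route_alert(automatable: bool) -> str:
    """Mirror the two paths described above: automated remediation
    when a fix can be scripted, otherwise a human notification.
    The return values are placeholder labels for illustration."""
    return "remediate" if automatable else "notify"


if __name__ == "__main__":
    # Made-up durations (seconds) of previous runs of one task.
    past_durations = [60.0, 62.0, 58.0, 61.0, 59.0]
    for latest in (61.0, 120.0):
        if is_anomalous(past_durations, latest):
            print(f"duration {latest}s looks anomalous ->", route_alert(automatable=False))
        else:
            print(f"duration {latest}s looks normal")
```

A fixed sigma threshold is the simplest possible choice; it catches a task
that is still succeeding but has suddenly become much slower or faster, which
is the use case raised in the original question. Metrics with trends or
seasonality would likely need a windowed or more robust baseline.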


On Wed, Nov 15, 2017 at 4:23 PM Andrew Maguire <andrewm4894@gmail.com>
wrote:

> Hi All,
>
> Just wondering what some of the best options are for more advanced
> alerting and anomaly detection on task metrics within Airflow.
>
> Currently we have a job that sends metrics for each task run to Anodot
> <https://www.anodot.com/> which is a really cool tool.
>
> However, as our DAGs tend to have many tasks and I'm sending about 6 or so
> metrics for each DAG run from the Airflow database, I've blown through the
> 50k monthly metrics our Anodot licence covers.
>
> So just wondering what might be a more native way to do task monitoring in
> Airflow if there is one.
>
> The main use case here is to catch cases where, even though a job is still
> running, its behaviour has changed significantly, which may be a sign of
> something that needs investigation.
>
> Cheers,
> Andy
>
-- 

Sergei
