airflow-dev mailing list archives

From Germain TANGUY <germain.tan...@dailymotion.com.INVALID>
Subject Data lineage feedback
Date Thu, 06 Jun 2019 13:19:20 GMT
Hello everyone,
We were wondering if some of you are using a data lineage solution?

We are aware of the experimental inlets/outlets feature with Apache Atlas<https://airflow.readthedocs.io/en/stable/lineage.html>;
does anyone have feedback to share?
Does anyone have experience with other solutions outside Airflow (as not all workflows are
necessarily Airflow DAGs)?

In my current company, we have hundreds of DAGs that run every day, many of which depend on
data built by another DAG (DAGs are often chained through sensors on partitions or files in
buckets, not trigger_dag). When one of the DAGs fails, downstream DAGs also start failing
once their retries expire. Similarly, when we discover a bug in data, we want to mark that
data as tainted, so the challenge lies in determining the impacted downstream DAGs (possibly
having to convert periodicities) and then clearing them.

Rebuilding from scratch is not ideal, but we haven't found something that suits our needs, so
our idea is to implement the following:

  1.  Build a status table that describes the state of each artifact produced by our DAGs (valid,
estimated, tainted, etc.); we think this can be done through Airflow's "on_success_callback"
(a sketch follows after this list).
  2.  Create a graph of our artifact class model and the pipelines producing their (time-based)
instances, so that an Airflow sensor can easily query the status of the parent artifacts
through an API. We would use this sensor before running the task that creates each artifact
(see the sensor sketch below).
  3.  From this we can handle the failure use case: we can create an API that takes a DAG
and an execution date as input and returns the list of tasks to clear and the DagRuns to start
downstream.
  4.  From this we can handle the tainted/backfill use case: we can build an on-the-fly @once
DAG which will update the data status table to taint all the downstream data sources built
from a corrupted one and then clear all dependent DAGs (they then wait to be reprocessed).
A sketch of the downstream-impact computation follows below as well.
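
To make item 1 concrete, here is a minimal sketch of the status-table update through
"on_success_callback". The lineage_db client, the DAG name and the bash command are
hypothetical placeholders; only the callback contract (a context dict) comes from Airflow itself.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

import lineage_db  # hypothetical client for our artifact status table


def mark_artifact_valid(context):
    """Record the artifact produced by this task as 'valid' in the status table."""
    lineage_db.upsert_status(
        dag_id=context["dag"].dag_id,
        task_id=context["task"].task_id,
        execution_date=context["execution_date"],
        status="valid",
    )


dag = DAG(
    dag_id="build_sales_report",  # hypothetical DAG
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

build_report = BashOperator(
    task_id="build_report",
    bash_command="spark-submit build_report.py {{ ds }}",  # hypothetical job
    on_success_callback=mark_artifact_valid,
    dag=dag,
)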
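
For item 2, a sketch of the sensor we have in mind, assuming Airflow 1.10-style imports;
the ARTIFACT_API endpoint and its JSON payload are hypothetical, not an existing service.

import requests

from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

ARTIFACT_API = "http://lineage.internal/api/v1/artifacts"  # hypothetical endpoint


class ArtifactStatusSensor(BaseSensorOperator):
    """Wait until every parent artifact of `artifact_name` is 'valid' for the run date."""

    @apply_defaults
    def __init__(self, artifact_name, *args, **kwargs):
        super(ArtifactStatusSensor, self).__init__(*args, **kwargs)
        self.artifact_name = artifact_name

    def poke(self, context):
        resp = requests.get(
            "{}/{}/parents".format(ARTIFACT_API, self.artifact_name),
            params={"date": context["ds"]},
        )
        resp.raise_for_status()
        parents = resp.json()  # e.g. [{"name": "raw_events", "status": "valid"}, ...]
        return all(p["status"] == "valid" for p in parents)

We would put this sensor in front of the task that builds each artifact, so a tainted or
missing parent simply holds the run instead of letting it fail.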
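
For items 3 and 4, a sketch of the downstream-impact computation; the artifact graph and the
artifact-to-DAG mapping are hypothetical examples kept in networkx, while DAG.clear() is the
stock Airflow call we would rely on to re-run a date range.

import networkx as nx

from airflow.models import DagBag

# Hypothetical graph: an edge A -> B means artifact B is built from artifact A.
artifact_graph = nx.DiGraph()
artifact_graph.add_edge("raw_events", "sessionized_events")
artifact_graph.add_edge("sessionized_events", "daily_kpis")
producing_dag = {
    "raw_events": "ingest_events",
    "sessionized_events": "sessionize",
    "daily_kpis": "compute_kpis",
}


def impacted_dags(corrupted_artifact):
    """Return the dag_ids that consume the corrupted artifact, directly or not."""
    downstream = nx.descendants(artifact_graph, corrupted_artifact)
    return sorted({producing_dag[a] for a in downstream})


def clear_downstream(corrupted_artifact, start_date, end_date):
    """Clear every impacted DAG for the given date range so it gets re-run."""
    dag_bag = DagBag()
    for dag_id in impacted_dags(corrupted_artifact):
        dag = dag_bag.get_dag(dag_id)
        # Cleared task instances are rescheduled; the status sensor above then
        # holds them until their parent artifacts are valid again.
        dag.clear(start_date=start_date, end_date=end_date)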

Any experience shared will be much appreciated.

Thank you!

Germain T.