airflow-dev mailing list archives

From Jason Rich <>
Subject Re: Data lineage feedback
Date Thu, 06 Jun 2019 13:45:14 GMT
Great questions. This makes two of us thinking about data lineage inside and/or outside of Airflow.

Thank you for the questions, Germain.


> On Jun 6, 2019, at 9:19 AM, Germain TANGUY <>
> Hello everyone,
> We were wondering whether some of you are using a data lineage solution?
> We are aware of the experimental inlets/outlets with Apache Atlas<>; does anyone have feedback to share?
> Does anyone have experience with other solutions outside Airflow (as not all our workflows are Airflow DAGs)?
> In my current company, we have hundreds of DAGs that run every day, many of which depend on data built by another DAG (DAGs are often chained through sensors on partitions or files in buckets, not trigger_dag). When one DAG fails, downstream DAGs also start failing once their retries expire; similarly, when we discover a bug in data, we want to mark that data as tainted. The challenge lies in determining the impacted downstream DAGs (possibly having to convert periodicities) and then clearing them.
> Rebuilding from scratch is not ideal, but we haven't found anything that suits our needs, so our idea is to implement the following:
>  1.  Build a status table that describes the state of each artifact produced by our DAGs (valid, estimated, tainted, etc.); we think this can be done through Airflow's "on_success_callback".
>  2.  Create a graph of our artifact class model and the pipelines producing their (time-based) instances, so that an Airflow sensor can easily query the status of the parent artifacts through an API. We would use this sensor before running the task that creates each artifact.
>  3.  From this we can handle the failure use-case: we can create an API that takes a DAG and an execution date as input and returns the list of tasks to clear and the DagRuns to start.
>  4.  From this we can handle the tainted/backfill use-case: we can build an on-the-fly @once DAG which will update the data status table to taint all the downstream data sources built from the corrupted one, and then clear all DAGs dependent on the corrupted one (which then wait to be reprocessed).
> Any experience shared will be much appreciated.
> Thank you!
> Germain T.
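For steps 1 and 2, the core pieces are a status table written by a success callback and a readiness check that a sensor's poke() would perform. A minimal in-memory sketch, assuming a hypothetical status table and the illustrative names `mark_valid` and `parents_ready` (a real version would back this with a database or REST API and wire `mark_valid` into Airflow's on_success_callback):

```python
# Minimal in-memory sketch of steps 1-2: a status table keyed by
# (artifact, execution_date), a success-callback-style writer, and the
# check an Airflow sensor's poke() would perform. All names are
# illustrative; a real version would use a database or an API.

STATUS = {}  # (artifact, execution_date) -> "valid" | "estimated" | "tainted"

def mark_valid(artifact, execution_date):
    # Intended to run from a task's on_success_callback once the
    # task producing `artifact` has finished.
    STATUS[(artifact, execution_date)] = "valid"

def parents_ready(parent_artifacts, execution_date):
    # Sensor poke logic: only let the producing task run once every
    # parent artifact is marked "valid" for this execution date.
    return all(
        STATUS.get((p, execution_date)) == "valid"
        for p in parent_artifacts
    )
```

A sensor subclassing BaseSensorOperator would call something like `parents_ready` from its poke() method and return the resulting boolean.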
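For steps 3 and 4, the common operation is a transitive-downstream traversal of the artifact graph: given a corrupted artifact, find everything built from it, so it can be tainted and the corresponding DAG runs cleared. A rough sketch, with a hypothetical hard-coded dependency graph standing in for the real artifact model:

```python
from collections import deque

# Hypothetical artifact dependency graph: each key maps to the
# artifacts built directly from it (edges follow the data flow).
DOWNSTREAM = {
    "raw_events": ["daily_sessions"],
    "daily_sessions": ["weekly_report", "user_features"],
    "user_features": ["churn_model"],
}

def impacted_artifacts(corrupted):
    """Breadth-first walk returning every artifact transitively built
    from `corrupted` -- the set a @once taint-and-clear DAG would mark
    as tainted before clearing the DAGs that produce them."""
    seen = set()
    queue = deque([corrupted])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

The failure API of step 3 would run the same traversal, then map each impacted artifact back to its producing DAG and execution dates (converting periodicities where parent and child schedules differ) to build the clear list.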
