airflow-dev mailing list archives

From Teresa Martyny <teresa.mart...@omadahealth.com>
Subject Re: Data lineage feedback
Date Sat, 15 Jun 2019 13:21:42 GMT
Hi Germain,
Triggers didn't meet our needs because most of our DAGs were waiting for
more than one data dependency.

We were using sensors for DAGs that depended on other DAGs until about a
month ago. The resource demands of managing all those sensors forced us to
use pooling, and we sometimes ended up with deadlocks. We tried reschedule
mode to address the resource consumption, but it has a bug that produces
negative try numbers, which kept taking our whole system down.
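
For context, the pattern we moved away from looked roughly like this, one
sensor per upstream dependency (all names here are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.sensors.external_task_sensor import ExternalTaskSensor

    dag = DAG(
        dag_id='downstream_dag',          # illustrative name
        start_date=datetime(2019, 1, 1),
        schedule_interval='@daily',
    )

    # With hundreds of DAGs, one of these per dependency adds up fast,
    # hence the shared pool to cap how many run at once.
    wait_for_upstream = ExternalTaskSensor(
        task_id='wait_for_upstream_dag',
        external_dag_id='upstream_dag',   # the DAG we depend on
        external_task_id='final_task',    # its terminal task
        mode='reschedule',                # free the worker slot between pokes
        pool='dag_sensors',               # shared pool for all DAG sensors
        dag=dag,
    )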

Then we added a feature to Airflow that allows DAGs to have upstream DAG
dependencies, the same way that tasks have upstream task dependencies.
When a DAG checks whether it should run based on its schedule, it also
checks whether its upstream DAG dependencies are complete. This has
allowed us to remove all of our DAG sensors and their pools (we still pool
shared resources like our OLAP database).
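
A minimal sketch of the core check, not the actual patch (upstream_dag_ids
is a hypothetical DAG-level setting):

    from airflow.models import DagRun
    from airflow.utils.state import State

    def upstream_dags_complete(upstream_dag_ids, execution_date):
        """True only if every upstream DAG succeeded for this date."""
        for dag_id in upstream_dag_ids:
            successful = DagRun.find(dag_id=dag_id,
                                     execution_date=execution_date,
                                     state=State.SUCCESS)
            if not successful:
                return False
        return True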

We are planning on offering to contribute that feature back to Airflow at
some point.

Some nuances: DAG runs are never generated for a DAG whose upstream DAG
dependencies are never met. This is OK for us because we have monitoring
on all of our deliverables and on failed tasks, so we still get insight
into the branches that never ran. When a DAG fails and other DAGs depend
on it, and the failed DAG later reruns and succeeds, the dependent DAGs
kick off a DAG run as expected; the prior run is simply missing for those
DAGs.

The end result is that our Airflow instance is doing a lot less work, and
when something fails we only have to clear that one task to get things
going again, instead of hundreds of downstream tasks.

We considered checking the data directly (hitting the DB or the S3 bucket)
to assess completion, but that would have required us to add an indicator
of data-load completeness and to make a bunch of requests, which felt like
overkill.
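
Concretely, that approach would have looked something like this sketch,
with every load writing a marker object and every consumer polling for it
(the bucket and key names are made up):

    from datetime import datetime

    from airflow import DAG
    from airflow.sensors.s3_key_sensor import S3KeySensor

    with DAG('marker_poll_example', start_date=datetime(2019, 1, 1),
             schedule_interval='@daily') as dag:
        # Poll for a completion marker the loading job would have to write.
        wait_for_marker = S3KeySensor(
            task_id='wait_for_load_marker',
            bucket_name='example-data-lake',       # made-up bucket
            bucket_key='loads/{{ ds }}/_SUCCESS',  # made-up marker path
            aws_conn_id='aws_default',
        )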

It's been a big relief since we implemented it. We threw the feature into
our own codebase while we decided whether we liked it. We can prioritize
the ticket we have to move it into a fork if that would benefit folks.

Hang in there,
Teresa

On Thu, Jun 6, 2019, 6:19 AM Germain TANGUY
<germain.tanguy@dailymotion.com.invalid> wrote:

> Hello everyone,
> We were wondering if some of you are using a data lineage solution?
>
> We are aware of the experimental inlets/outlets with Apache Atlas
> <https://airflow.readthedocs.io/en/stable/lineage.html>; does anyone have
> feedback to share?
> Does anyone have experience with other solutions outside Airflow (as not
> all workflows are necessarily Airflow DAGs)?
>
> In my current company, we have hundreds of DAGs that run every day, many
> of which depend on data built by another DAG (DAGs are often chained
> through sensors on partitions or files in buckets, not trigger_dag). When
> one of the DAGs fails, downstream DAGs will also start failing once their
> retries expire. Similarly, when we discover a bug in the data, we want to
> mark that data as tainted, so the challenge lies in determining the
> impacted downstream DAGs (possibly having to convert between
> periodicities) and then clearing them.
>
> Rebuilding from scratch is not ideal, but we haven't found anything that
> suits our needs, so our idea is to implement the following:
>
>   1.  Build a status table that describes the state of each artifact
> produced by our DAGs (valid, estimated, tainted, etc.); we think this can
> be done through Airflow's "on_success_callback" (see the sketch after
> this list).
>   2.  Create a graph of our artifact class model and the pipelines
> producing their (time-based) instances, so that an Airflow sensor can
> easily query the status of the parent artifacts through an API. We would
> use this sensor before running the task that creates each artifact.
>   3.  From this we can handle the failure use case: we can create an API
> that takes a DAG and an execution date as input and returns the list of
> tasks to clear and DAG runs to start downstream.
>   4.  From this we can handle the tainted/backfill use case: we can build
> an on-the-fly @once DAG which will update the data status table to taint
> all the downstream data sources built from the corrupted one and then
> clear all dependent DAGs (which then wait to be reprocessed).
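>
> A rough sketch of what step 1 could look like (the table, columns, and
> connection are hypothetical; sqlite3 stands in for the real status store):
>
>     import sqlite3  # stand-in for the real status database
>
>     def mark_artifact_valid(context):
>         """on_success_callback: record this run's artifact as valid.
>
>         Assumes an artifact_status table keyed on
>         (dag_id, task_id, execution_date) already exists.
>         """
>         conn = sqlite3.connect('artifact_status.db')
>         conn.execute(
>             """INSERT OR REPLACE INTO artifact_status
>                (dag_id, task_id, execution_date, status)
>                VALUES (?, ?, ?, 'valid')""",
>             (context['dag'].dag_id,
>              context['task'].task_id,
>              str(context['execution_date'])),
>         )
>         conn.commit()
>         conn.close()
>
>     # Attached via default_args so every task reports its status:
>     # default_args = {'on_success_callback': mark_artifact_valid}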
>
> Any experience shared will be much appreciated.
>
> Thank you!
>
> Germain T.
>

