airflow-dev mailing list archives

From: Bolke de Bruin <bdbr...@gmail.com>
Subject: Re: Data lineage feedback
Date: Sat, 15 Jun 2019 11:21:38 GMT
Hi All,

Lineage is a big subject, so it’s hard to cover all angles. I’ll describe what we are doing
(or attempting to do) at ING Bank.

For us it’s important to capture every step of what happens with data. This is required from
a regulatory, security, and model performance perspective. We are a big company with a lot of
legacy, and thus we deploy several technologies that capture this to a certain extent. The main
technologies are Apache Atlas and IBM’s IGC. Federation happens (or rather, will happen) through
Egeria.

Capturing events can be done by push or pull. I see many companies opt for pull first, but
eventually shift to a hybrid scenario: push to capture as quickly as possible, pull to
close gaps and to further enrich. Storing lineage data is often done in a graph database.
I do see companies sometimes reinvent the wheel here. A lot of thought has been put into Apache
Atlas, but its documentation and UI are a bit lacking (hence our involvement with Amundsen).
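
To make the push side a bit more concrete, here is a minimal sketch of registering a dataset
entity in Atlas over its v2 REST API. The host, credentials, type name and attributes are purely
illustrative; a real deployment would use a concrete or custom type and proper authentication.

    import requests

    # Illustrative only: host, credentials and the entity type are placeholders.
    ATLAS_ENTITY_URL = "http://atlas-host:21000/api/atlas/v2/entity"

    entity = {
        "entity": {
            # A real setup would use a concrete type (e.g. hive_table) or a custom one.
            "typeName": "DataSet",
            "attributes": {
                "qualifiedName": "warehouse.orders@prod",
                "name": "orders",
                "description": "pushed from a pipeline step",
            },
        }
    }

    response = requests.post(ATLAS_ENTITY_URL, json=entity, auth=("admin", "admin"))
    response.raise_for_status()
    print(response.json())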

My team started with Apache Atlas and with push, as Apache Atlas delivers many connectors out
of the box (Hive, HDFS, Kylin, etc.; there is a custom, somewhat immature one for Spark). As soon
as you go outside of this environment, lineage is gone; since Airflow orchestrates the workflows,
it can fill much of this gap. This is where Airflow’s inlets and outlets come in.
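
As a rough sketch of what that looks like with the experimental lineage support Germain linked
(Airflow 1.10 era); the DAG and paths are placeholders, and the exact API may shift as the
feature matures:

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.lineage.datasets import File

    dag = DAG(dag_id="orders_lineage", start_date=datetime(2019, 6, 1),
              schedule_interval="@daily")

    # What the task reads (inlet) and what it produces (outlet); names are templated.
    raw_orders = File("/data/raw/orders/{{ ds }}")
    clean_path = File("/data/clean/orders/{{ ds }}")

    clean_orders = BashOperator(
        task_id="clean_orders",
        bash_command="echo cleaning orders",
        dag=dag,
        inlets={"datasets": [raw_orders]},
        outlets={"datasets": [clean_path]},
    )

If I remember the docs correctly, pointing the [lineage] backend in airflow.cfg at the Atlas
backend is all it then takes for these declarations to be shipped to Atlas, without the task
itself having to know about it.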

Inlets and outlets are not a mature concept yet, but they are a way to track lineage inside
Airflow and, with the right backend, outside of it as well. Backends can be Apache Atlas, but
also, for example, Lyft’s Amundsen. Using inlets/outlets might seem a bit awkward at first, but
it can actually speed up development of DAGs and generalize some patterns. You could, for
example, create a DAG that, independent of the table it is given, removes PII data based on the
metadata associated with it. Inlets and outlets also allow a different paradigm in Airflow,
which I think we discussed as "dataflow" in the past, where you set dependencies on data rather
than on tasks.
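
Continuing the sketch above, the dataflow idea could look roughly like this: a downstream task
inherits the outlets of its direct upstream tasks automatically and only acts on the metadata it
receives. How the resolved inlets are accessed inside the callable (task.inlets, qualified_name)
is my assumption from the experimental code, so treat it as such:

    from airflow.operators.python_operator import PythonOperator

    def scrub_pii(**context):
        # Assumption: the resolved inlets are available on the task object at runtime;
        # each is a lineage dataset whose qualified name (and declared attributes)
        # carries the metadata a generic task can act on.
        for dataset in context["task"].inlets:
            print("would scrub PII from", dataset.qualified_name)

    scrub = PythonOperator(
        task_id="scrub_pii",
        python_callable=scrub_pii,
        provide_context=True,
        dag=dag,
        inlets={"auto": True},  # pick up the outlets of direct upstream tasks
    )

    clean_orders >> scrub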

I hope this helps. The inlets and outlets feature can really use some help and use cases driving
it. The metadata part of it can be a bit daunting at first, but I really believe that when we get
this right it can ease a lot of development and will take Airflow to the next level of where
data is going.

Cheers
Bolke


Sent from my iPad

> On 6 Jun 2019, at 15:45, Jason Rich <jasonrich85@icloud.com.invalid> wrote:
> 
> Great questions. That makes two of us thinking about data lineage inside and/or outside of Airflow.
> 
> Thank you for the questions, Germain.
> 
> Cheers,
> Jason
> 
>> On Jun 6, 2019, at 9:19 AM, Germain TANGUY <germain.tanguy@dailymotion.com.invalid> wrote:
>> 
>> Hello everyone,
>> We were wondering if some of you are using a data lineage solution?
>> 
>> We are aware of the experimental inlets/outlets with Apache Atlas <https://airflow.readthedocs.io/en/stable/lineage.html>; does someone have feedback to share?
>> Does someone have experience with other solutions outside Airflow (as not all workflows are necessarily an Airflow DAG)?
>> 
>> In my current company, we have hundreds of DAGs that run every day, many of which depend on data built by another DAG (DAGs are often chained through sensors on partitions or files in buckets, not trigger_dag). When one of the DAGs fails, downstream DAGs will also start failing once their retries expire; similarly, when we discover a bug in the data, we want to mark that data as tainted, so the challenge lies in determining the impacted downstream DAGs (possibly having to convert periodicities) and then clearing them.
>> 
>> Rebuilding from scratch is not ideal, but we haven't found something that suits our needs, so our idea is to implement the following:
>> 
>> 1.  Build a status table that describes the state of each artefact produced by our DAGs (valid, estimated, tainted, etc.); we think this can be done through Airflow's "on_success_callback".
>> 2.  Create a graph of our artefact class model and the pipelines producing their (time-based) instances, so that an Airflow sensor can easily know, through an API, what the status of the parent artefacts is. We would use this sensor before running the task that creates each artefact.
>> 3.  From this we can handle the failure use case: we can create an API that takes a DAG and an execution date as input and returns the list of tasks to clear and the DAGRuns to start downstream.
>> 4.  From this we can handle the tainted/backfill use case: we can build an on-the-fly @once DAG which will update the data status table to taint all the downstream data sources built from a corrupted one, and then clear all dependent DAGs from the corrupted one (which then wait to be reprocessed).
>> 
>> Any experience shared will be much appreciated.
>> 
>> Thank you!
>> 
>> Germain T.
> 
