airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kristian Jones <>
Subject Triggering backfills as part of a pipeline?
Date Fri, 06 Apr 2018 23:26:27 GMT

We are using Airflow to build up a DWH from many source systems but I have a couple of edge
cases that I’m not quite sure how to proceed with.

Each table from a source system is supplied as a CSV. With each set of files delivered (daily)
we receive a manifest/description file containing a list of files that have been sent. The
number of files delivered could vary as supplying systems become un-available for reporting
periods, connectivity or for some other outage reason, or when no “transactions”/events
occurred. Due to an outage style event, A CSV can contain data that covers “transactions”/events
for a date range not just a single day (not normally, but possible).

Additionally correction extracts could be supplied that traditionally would result in an update
but due to some technology choices, updates become difficult, so ideally we are looking to
follow the same patterns described here :

We have some code to re-partition each of extract by an event/transaction date before they
can be consumed by any follow-on workflows/DAGs.

I think what I’m trying to do is trigger a follow on DAG/subDAG for each daily partition
generated (but until we've inspected the data, the date range of snapshots/partitions to be
generated isn't known), this seems to be analogous to kicking off a backfill, but I’m not
sure how/if this can be done as part of the actual pipeline. We can put this logic into our
PySpark scripts but it seems we would loose the nice features of the UI describing each stage
of the ETL pipeline and MI regarding each task or to detect failure etc.

I suppose what I'm asking is whether there is an Airflow idiomatic way of doing this. We would
like to retain the ability to perform a backfill and know that we have deterministic behaviour
given a set of inputs (a manifest file + set of source files)

Hopefully this makes enough sense to people reading.

Thanks in advance.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message