airflow-dev mailing list archives

From Gerard Toonstra <>
Subject Re: ETL best practices for airflow
Date Wed, 19 Oct 2016 06:17:56 GMT
Thanks Max,

I think it always helps to see what issues new people run into when they
start using software.

Some of it was also taken from the Nov. 2015 video on best practices on
this page:


I made some more progress yesterday, but ran into issue 137. I think I
solved it with depends_on_past,
but I'm going to rely on the LatestOnlyOperator instead (it's a better fit)
and then refine things
from there.



On Tue, Oct 18, 2016 at 6:02 PM, Maxime Beauchemin <> wrote:

> This is an amazing thread to follow! I'm really interested to watch best
> practices documentation emerge out of the community.
> Gerard, I enjoyed reading your docs and would love to see this grow. I've
> been meaning to write a series of blog posts on the subject for quite some
> time. It seems like you have a really good start. We could integrate this
> as a "Best Practice" section to our current documentation once we build
> consensus about the content.
> Laura, please post on this mailing list once the talk is up as a video, I'd
> love to watch it.
> A related best practice I'd like to write about is the idea of applying
> some concepts of functional programming to ETL. The idea is to use immutable
> datasets/datablocks systematically as sources for your computations, so
> that every task instance sources from immutable datasets that are persisted
> in your backend. That makes it possible to guarantee that re-running any
> chunk of ETL at different points in time leads to the exact same
> result. It also usually means that you need to (1) do incremental loads, and
> (2) "snapshot" your dimension/referential/small tables in time to make sure
> that running the ETL from 26 days ago sources from the dimension snapshot
> as it was back then and yields the exact same result.
> Anyhow, it's a complex and important subject I should probably write about
> in a structured way sometime.
> Max
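
To make the idea above a bit more concrete, here is a minimal sketch in plain Python (no Airflow involved). Everything in it is an assumption for illustration: the in-memory `store` dict stands in for a real backend, and `write_partition`, `build_daily_revenue`, and the table names `orders`, `fx_rates`, and `daily_revenue` are hypothetical. It only shows the two points from the message: each run reads and overwrites one date partition (incremental load), and the dimension is snapshotted per date so a re-run sees it as it was back then.

```python
from datetime import date


def write_partition(store, table, ds, rows):
    """Return a new store in which partition (table, ds) holds exactly `rows`.

    The partition is replaced wholesale, never appended to, so re-running a
    load for the same date cannot create duplicates.
    """
    updated = dict(store)
    updated[(table, ds)] = tuple(rows)
    return updated


def build_daily_revenue(store, ds):
    """Compute revenue for `ds` from that day's partitions only."""
    orders = store[("orders", ds)]         # incremental load for ds
    rates = dict(store[("fx_rates", ds)])  # dimension snapshot as of ds
    total = sum(amount * rates[currency] for currency, amount in orders)
    return write_partition(store, "daily_revenue", ds, [(ds.isoformat(), total)])


day1, day2 = date(2016, 10, 17), date(2016, 10, 18)

store = {}
store = write_partition(store, "orders", day1, [("EUR", 10), ("USD", 20)])
store = write_partition(store, "fx_rates", day1, [("EUR", 2.0), ("USD", 1.0)])
store = build_daily_revenue(store, day1)
first_run = store[("daily_revenue", day1)]

# A newer dimension snapshot arrives the next day with different rates.
store = write_partition(store, "fx_rates", day2, [("EUR", 3.0), ("USD", 1.0)])

# Re-running day1 still reads the day1 snapshot, so the result is unchanged.
store = build_daily_revenue(store, day1)
assert store[("daily_revenue", day1)] == first_run
```

In a real pipeline the same shape would presumably map to one task instance per execution date that overwrites its own partition, but that mapping is not something this thread spells out.
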
> On Mon, Oct 17, 2016 at 6:12 PM, Boris Tyukin <>
> wrote:
