airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gerard Toonstra <>
Subject Re: ETL best practices for airflow
Date Mon, 17 Oct 2016 20:32:45 GMT
Hi all,

Today I was trying to work out a very basic example and very quickly ran
into an hour of trying to solve a problem that ought to be really easy.
I didn't expect that.  I posted about this on and someone helped
me out there.

All the simple database operators (mysql, postgres, mssql, etc) run a
simple query against a database which I attempted to test. But I couldn't
work out how
to pass a date window to my sql script based on execution date.  Then I
noticed that the "parameters"  dict isn't being templated.

So I couldn't do:

oper_1 = PostgresOperator(
    parameters={"window_start_date": "{{ ds }}", "window_end_date": "{{
tomorrow_ds }}"},

But instead I'd have to work that into the SQL template looking something
like this (without specifying parameters dict at all):

      o.create_dtm >= '{{ ds }}'
AND   o.create_dtm <  '{{ tomorrow_ds }}'

What I don't like about this approach is that if you have other parameters
coming from somewhere else, you end up with different ways of resolving
these variables:

oper_1 = PostgresOperator(

and resolving client_id in postgres like this:

      o.create_dtm >= '{{ ds }}'
AND   o.create_dtm <  '{{ tomorrow_ds }}'
AND   client_id = %(client_id)s

Which mixes up variable substitution methods in the same query, which looks
really ugly.

If there's a really nice way to access "ds" and "tomorrow_ds" when the
operator is inited, then this can become the method of choice.
One obvious way to do that is to also template "parameters" for all the
database operators:

    template_fields = ('sql','parameters')

The reason why I prefer the specific substitution method for databases is
that it provides a single way to do this. The macro methods don't work well
in this
case, because that particular method of coding is subject to SQL injection
attacks and shouldn't be taught as a good way of doing things. Another
reason is that
the regular substitution methods for the database of choice allow for unit
testing code in other frameworks beyond airflow and easier code reuse
should another
scheduler be chosen in the future.

Should I create a bug or a PR for this?



On Mon, Oct 17, 2016 at 12:40 AM, Boris Tyukin <>

> I really look forward to it, Gerard! I've read what you you wrote so far
> and I really liked it - please keep up the great job!
> I am hoping to see some best practices for the design of incremental loads
> and using timestamps from source database systems (not being on UTC so
> still confused about it in Airflow). Also practical use of subdags and
> dynamic generation of tasks using some external metadata (maybe describe in
> details something similar that wepay did
> On Sun, Oct 16, 2016 at 5:23 PM, Gerard Toonstra <>
> wrote:
> > Hi all,
> >
> > About a year ago, I contributed the HTTPOperator/Sensor and I've been
> > tracking airflow since. Right now it looks like we're going to adopt
> > airflow at the company I'm currently working at.
> >
> > In preparation for that, I've done a bit of research work how airflow
> > pipelines should fit together, how important ETL principles are covered
> and
> > decided to write this up on a documentation site. The airflow
> documentation
> > site contains everything on how all airflow works and the constructs that
> > you have available to build pipelines, but it can still be a challenge
> for
> > newcomers to figure out how to put those constructs together to use it
> > effectively.
> >
> > The articles I found online don't go into a lot of detail either. Airflow
> > is built around an important philosophy towards ETL and there's a risk
> that
> > newcomers simply pick up a really great tool and start off in the wrong
> way
> > when using it.
> >
> >
> > This weekend, I set off to write some documentation to try to fill this
> > gap. It starts off with a generic understanding of important ETL
> principles
> > and I'm currently working on a practical step-by-step example that
> adheres
> > to these principles with DAG implementations in airflow; i.e. showing how
> > it can all fit together.
> >
> > You can find the current version here:
> >
> >
> >
> >
> > Looking forward to your comments. If you have better ideas how I can make
> > this contribution, don't hesitate to contact me with your suggestions.
> >
> > Best regards,
> >
> > Gerard
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message