airflow-dev mailing list archives

From Maxime Beauchemin <maximebeauche...@gmail.com>
Subject Re: Airflow best practices
Date Mon, 20 Mar 2017 15:52:39 GMT
Gerard, this is an outstanding resource! We need to make sure people can
discover it easily.

I scanned the docs and tend to agree with everything mentioned, though I'm
unclear on whether and how to fold some/most of this into the Airflow docs. I
guess the theoretical boundary might be that the Airflow docs describe
Airflow's features and operation, and your doc is more about "how to make
good use" of these features, which is more fuzzy and opinionated but oh so
useful!

Personally, I'd welcome the "ETL principles" and "Gotcha" sections almost
as-is in the main docs under a new "Best practice" section, though I'm not
sure what others think.

Max

On Sat, Mar 18, 2017 at 1:58 AM, Gerard Toonstra <gtoonstra@gmail.com>
wrote:

> I've mailed about this before and this is a great opportunity for another
> shameless plug.
>
> I consider Airflow much more than a scheduler: there's a philosophy
> behind it that users should understand, and understanding it makes using
> Airflow much more effective. This site is already reachable from the wiki
> and shows an end-to-end example of how one could do ETL work:
>
> https://gtoonstra.github.io/etl-with-airflow/
>
> If anyone wants to contribute, let me know. If someone finds a better place
> for it instead of my personal github, I'm very
> much willing to put it in a more central location somewhere.
>
>
> It would also be great to receive comments on the current material. You
> can't really write a true "best practices" document if there's no one
> around to establish consensus with :).
>
> Rgds,
>
> Gerard
>
>
>
> On Fri, Mar 17, 2017 at 6:23 PM, siddharth anand <sanand@apache.org>
> wrote:
>
> > FYI, we have some best practices in Confluence and possibly in other places
> > as well. I'd recommend adding to that rather than relying on email. Email
> > can be used to mail the link :-)
> > -s
> >
> > On Fri, Mar 17, 2017 at 9:33 AM, Maxime Beauchemin <
> > maximebeauchemin@gmail.com> wrote:
> >
> > > Forwarding an email that should have been on this mailing list:
> > >
> > > ---------- Forwarded message ----------
> > > From: Maxime Beauchemin <xxxxxxxxxxxx@gmail.com>
> > > Date: Fri, Mar 17, 2017 at 8:53 AM
> > > Subject: Re: Airflow best practices
> > > To: Shreyas Joshi <shreyasjoshis@github.com>
> > >
> > >
> > > Hi Shreyas,
> > >
> > > Simple Airflow scripts are simply "configuration as code" and probably
> > > don't need to be abstracted out. The DSL is pretty expressive and there's
> > > usually a way to write your script so that it mostly contains code specific
> > > to your pipeline (without much boilerplate).
> > >
> > > For more advanced pipelines and complicated patterns, say dynamically
> > > building pipelines, it makes a lot of sense to create abstractions
> > > (modules, functions, classes, ...). Airflow isn't opinionated as to how you
> > > use its primitives; the only [perhaps odd] requirement is that your DAG
> > > objects should be in global module scope, somewhere in your DAGS_FOLDER, so
> > > that they can be discovered by Airflow's "DAG crawler".
> > >
> > > At Airbnb we have a lot of abstractions that generate Airflow objects. Some
> > > examples of that include our A/B testing framework, common data quality
> > > enforcement patterns (stage the data, run DQ checks, exchange the partition
> > > to production), and pretty much every other Airflow script. People create
> > > the logic they need to build their pipeline, and a lot of it is "as dynamic
> > > as it needs to be". It's pretty common for people to write their own
> > > operators as well, packaged with their modules.
> > >
> > > We should put some more complex examples out somewhere to show people the
> > > kinds of things that can be done, though usually programmers using Airflow
> > > realize quickly the kinds of things they can do. I'm sure you did already!
> > >
> > > Max
> > >
> > > On Fri, Mar 17, 2017 at 6:46 AM, Shreyas Joshi
> > > <shreyasjoshis@github.com> wrote:
> > >
> > > > Hello Maxime,
> > > >
> > > > I am a data engineer at GitHub and we have been using Airflow for the last
> > > > few months. I noticed that in many of the example DAGs the code is simply
> > > > at the module level with no functions etc. Is this a recommended pattern
> > > > with Airflow DAGs? If so, I’d be very curious to know what the rationale
> > > > behind this recommendation is.
> > > >
> > > > Thanks,
> > > > Shreyas
> > >
> >
>
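
A minimal sketch of the pattern described in the forwarded message above: a
factory function generates pipeline objects, and each one is then bound to a
module-level name so that a crawler scanning the module could find it. The
names FakeDag, build_etl_dag, TABLES and DAGS are hypothetical and used only
for illustration. FakeDag is a stand-in so the sketch runs without Airflow
installed; with Airflow you would presumably construct real DAG objects in its
place. The three steps mirror the "stage the data, run DQ checks, exchange the
partition" example from the message.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDag:
    """Hypothetical stand-in for an Airflow DAG object (not the real API)."""
    dag_id: str
    tasks: list = field(default_factory=list)


def build_etl_dag(table: str) -> FakeDag:
    """Factory holding the shared structure; only the table name varies."""
    dag = FakeDag(dag_id=f"etl_{table}")
    # Same stage -> check -> publish pattern for every table.
    for step in ("stage", "dq_check", "exchange_partition"):
        dag.tasks.append(f"{step}_{table}")
    return dag


# Hypothetical list of tables; in practice this might come from configuration.
TABLES = ["orders", "users"]

# Generate one pipeline object per table.
DAGS = {f"etl_{table}": build_etl_dag(table) for table in TABLES}

# Bind each generated object to a module-level name. This is the
# "global module scope" requirement mentioned in the message.
globals().update(DAGS)
```

The abstraction lives in an ordinary function, while the module itself still
ends up with one top-level name per generated pipeline.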
