airflow-dev mailing list archives

From Maxime Beauchemin <>
Subject Re: Article: The Rise of the Data Engineer
Date Wed, 25 Jan 2017 01:37:06 GMT
Glad to hear the article resonated with you! I was just interviewed on
a podcast on this very subject; it should be up sometime this week:

It's less structured than the article, but you can hear me babble about
data engineering and say semi-outrageous things about data scientists if
you have the patience to sit through it :)

I totally agree about SQL; it's the one solid constant in this
ever-changing space.

Screw SCDs!


On Tue, Jan 24, 2017 at 3:39 PM, Rob Goretsky <>

> Maxime,
> Just wanted to thank you for writing this article - much like the original
> articles by Jeff Hammerbacher and DJ Patil coining the term "Data
> Scientist", I feel this article stands as a great explanation of what the
> title of "Data Engineer" means today.  As someone who has been working in
> this role since before the title existed, I found that many of the points
> here rang true about how the technology and tools have evolved.
> I started my career working with graphical ETL tools (Informatica) and
> could never shake the feeling that I could get a lot more done, with a more
> maintainable set of processes, if I could just write reusable functions in
> any programming language and then keep them in a shared library.  Instead,
> what the GUI tools forced upon us were massive Wiki documents laying out
> 'the 9 steps you need to follow perfectly in order to build a proper
> Informatica workflow', which developers had to painfully follow step by
> step, rather than being able to encapsulate the things that didn't
> change in one central 'function' and pass in parameters for the things that
> varied from the defaults.
> I also spent a lot of time early in my career trying to design data
> warehouse tables using the Kimball methodology with star schemas and all
> dimensions extracted out to separate dimension tables.  As columnar storage
> formats with compression became available (Vertica/Parquet/etc), I started
> gravitating more towards the idea that I could just store the raw string
> dimension data in the fact table directly, denormalized, but it always felt
> like I was breaking the 'purist' rules on how to design data warehouse
> schemas 'the right way'.  So in that regard, thanks for validating my
> feeling that it's OK to keep denormalized dimension data directly in fact
> tables - it definitely makes our queries easier to write, and as you
> mentioned, has the added benefit of helping you avoid all of that SCD fun!
> We're about to put Airflow into production at my company ( for a
> handful of DAGs to start, so it will be running alongside our existing
> Informatica server running 500+ workflows nightly.  But I can already see
> the writing on the wall - it's really hard for us to find talented
> engineers with Informatica experience along with more general computer
> engineering backgrounds (many seem to have specialized in purely
> Informatica), so our newer engineers come in with strong Python/SQL
> backgrounds and have been gravitating towards building newer jobs in
> Airflow...
> One item that I think deserves addition to this article is the continuing
> prevalence of SQL.  Many technologies have changed, but SQL has persisted
> (pun intended?).  We went through a phase for a few years where it looked
> like the tide was turning to MapReduce, Pig, or other languages for
> accessing and aggregating data.  But now it seems even the "NoSQL" data
> stores have added SQL layers on top, and we have more SQL engines for
> Hadoop than I can count.  SQL is easy to learn but tougher to master, so
> to me the two main languages in any modern Data Engineer's toolbelt are SQL
> and a scripting language (Python/Ruby).  I think it's amazing that with
> so much change in every aspect of how we do data warehousing, SQL has stood
> the test of time...
> Anyways, thanks again for writing this up, I'll definitely be sharing it
> with my team!
> -Rob
> On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <> wrote:
> > Hey I just published an article about the "Data Engineer" role in modern
> > organizations and thought it could be of interest to this community.
> >
> >
> > 91be18f1e603#.5rkm4htnf
> >
> > Max
> >
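
The point in the thread about keeping raw dimension strings directly in the
fact table, and thereby sidestepping slowly changing dimensions (SCDs), can be
illustrated with a small sketch. This is only an illustration under stated
assumptions, not anything taken from the article or from either poster's
warehouse: the table names (dim_customer, fact_orders_star, fact_orders_flat),
the columns, and the sample rows are all hypothetical, and SQLite stands in for
a columnar store purely so the example is self-contained.

```python
import sqlite3

# In-memory SQLite database; all tables and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star-schema style: the fact row points at a dimension row by key.
cur.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT)")
cur.execute("CREATE TABLE fact_orders_star (order_id INTEGER, customer_id INTEGER, amount REAL)")

# Denormalized style: the fact row carries the raw dimension string itself.
cur.execute(
    "CREATE TABLE fact_orders_flat "
    "(order_id INTEGER, customer_id INTEGER, region TEXT, amount REAL)"
)

# First order, placed while the customer is in the 'East' region.
cur.execute("INSERT INTO dim_customer VALUES (1, 'East')")
cur.execute("INSERT INTO fact_orders_star VALUES (100, 1, 25.0)")
cur.execute("INSERT INTO fact_orders_flat VALUES (100, 1, 'East', 25.0)")

# The customer moves. The dimension row is overwritten in place,
# which is the situation SCD techniques exist to handle.
cur.execute("UPDATE dim_customer SET region = 'West' WHERE customer_id = 1")

# Second order, placed after the move.
cur.execute("INSERT INTO fact_orders_star VALUES (101, 1, 40.0)")
cur.execute("INSERT INTO fact_orders_flat VALUES (101, 1, 'West', 40.0)")

star_sql = """
    SELECT d.region, SUM(f.amount)
    FROM fact_orders_star AS f
    JOIN dim_customer AS d ON d.customer_id = f.customer_id
    GROUP BY d.region
    ORDER BY d.region
"""
flat_sql = """
    SELECT region, SUM(amount)
    FROM fact_orders_flat
    GROUP BY region
    ORDER BY region
"""

star_totals = cur.execute(star_sql).fetchall()
flat_totals = cur.execute(flat_sql).fetchall()

print(star_totals)  # [('West', 65.0)]: the earlier order is re-attributed
print(flat_totals)  # [('East', 25.0), ('West', 40.0)]: history is preserved
```

In this sketch the denormalized query needs no join and keeps the value that
was true when each order was recorded. The trade-off is that the string is
repeated on every fact row, which is where the columnar compression mentioned
in the thread would come in.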
