airflow-dev mailing list archives

From Rob Goretsky <robert.goret...@gmail.com>
Subject Re: Article: The Rise of the Data Engineer
Date Tue, 24 Jan 2017 23:39:33 GMT
Maxime,
Just wanted to thank you for writing this article - much like the original
articles by Jeff Hammerbacher and DJ Patil coining the term "Data
Scientist", I feel this article stands as a great explanation of what the
title of "Data Engineer" means today. As someone who has been working in
this role since before the title existed, many of the points here rang true
about how the technology and tools have evolved.

I started my career working with graphical ETL tools (Informatica) and
could never shake the feeling that I could get a lot more done, with a more
maintainable set of processes, if I could just write reusable functions in
any programming language and keep them in a shared library. Instead, what
the GUI tools forced upon us were massive wiki documents laying out 'the 9
steps you need to follow perfectly in order to build a proper Informatica
workflow', which developers would painfully follow along with, rather than
being able to encapsulate the things that didn't change in one central
'function' and pass in parameters for the things that varied from the
defaults.
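To make that concrete, here's a minimal sketch of the idea (all names are
hypothetical, just for illustration): one reusable function encapsulates the
steps that never change, and parameters cover only what varies from the
defaults.

```python
# Hypothetical sketch: a shared-library "load" helper instead of a
# nine-step wiki checklist. Table names and defaults are made up.
from dataclasses import dataclass


@dataclass
class LoadConfig:
    source_table: str
    target_table: str
    truncate_first: bool = True  # the default most workflows share


def build_load_sql(cfg: LoadConfig) -> list:
    """Return the SQL statements for a standard staged load."""
    statements = []
    if cfg.truncate_first:
        statements.append(f"TRUNCATE TABLE {cfg.target_table}")
    statements.append(
        f"INSERT INTO {cfg.target_table} SELECT * FROM {cfg.source_table}"
    )
    return statements


# Each new workflow becomes a one-line call with only its variations:
print(build_load_sql(LoadConfig("stg_orders", "dw_orders")))
```

The unchanging logic lives in one place, so fixing it once fixes every
workflow - the kind of reuse the GUI tools made hard.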

I also spent a lot of time early in my career trying to design data
warehouse tables using the Kimball methodology, with star schemas and all
dimensions extracted out to separate dimension tables. As columnar storage
formats with compression became available (Vertica/Parquet/etc.), I started
gravitating more towards the idea that I could just store the raw string
dimension data in the fact table directly, denormalized, but it always felt
like I was breaking the 'purist' rules on how to design data warehouse
schemas 'the right way'. So in that regard, thanks for validating my
feeling that it's OK to keep denormalized dimension data directly in fact
tables - it definitely makes our queries easier to write and, as you
mentioned, has the added benefit of helping you avoid all of that
slowly-changing-dimension (SCD) fun!
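A tiny example of the contrast, using sqlite3 just for illustration (the
table and column names here are made up): the star schema needs a join and
a dimension table to maintain, while the denormalized fact table captures
the dimension value as it was at load time, with no SCD bookkeeping.

```python
import sqlite3

# Illustrative schemas only - names are invented for this sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Kimball-style: the fact table holds a surrogate key into a dimension table.
cur.execute("CREATE TABLE dim_team (team_key INTEGER PRIMARY KEY, team_name TEXT)")
cur.execute("CREATE TABLE fact_sales_star (team_key INTEGER, amount REAL)")
cur.execute("INSERT INTO dim_team VALUES (1, 'Yankees')")
cur.execute("INSERT INTO fact_sales_star VALUES (1, 100.0)")

# Denormalized: the raw string dimension value lives in the fact table itself.
cur.execute("CREATE TABLE fact_sales_flat (team_name TEXT, amount REAL)")
cur.execute("INSERT INTO fact_sales_flat VALUES ('Yankees', 100.0)")

# The star schema query requires a join...
star = cur.execute(
    "SELECT d.team_name, SUM(f.amount) FROM fact_sales_star f "
    "JOIN dim_team d ON f.team_key = d.team_key GROUP BY d.team_name"
).fetchall()

# ...the denormalized query does not, and the stored string is simply the
# value at load time, so there is no SCD history to manage.
flat = cur.execute(
    "SELECT team_name, SUM(amount) FROM fact_sales_flat GROUP BY team_name"
).fetchall()

print(star, flat)
```

Both queries return the same rows; the denormalized one is just easier to
write, which was exactly the point.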

We're about to put Airflow into production at my company (MLB.com) for a
handful of DAGs to start, so it will be running alongside our existing
Informatica server, which runs 500+ workflows nightly. But I can already
see the writing on the wall - it's really hard for us to find talented
engineers with Informatica experience along with more general computer
engineering backgrounds (many seem to have specialized purely in
Informatica) - so our newer engineers come in with strong Python/SQL
backgrounds and have been gravitating towards building newer jobs in
Airflow.

One item that I think deserves a mention in the article is the continuing
prevalence of SQL. Many technologies have changed, but SQL has persisted
(pun intended?). We went through a phase for a few years where it looked
like the tide was turning towards MapReduce, Pig, or other languages for
accessing and aggregating data. But now it seems even the "NoSQL" data
stores have added SQL layers on top, and we have more SQL engines for
Hadoop than I can count. SQL is easy to learn but tough to master, so to me
the two main languages in any modern Data Engineer's toolbelt are SQL and a
scripting language (Python/Ruby). I think it's amazing that, with so much
change in every aspect of how we do data warehousing, SQL has stood the
test of time.

Anyway, thanks again for writing this up - I'll definitely be sharing it
with my team!

-Rob

On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> Hey I just published an article about the "Data Engineer" role in modern
> organizations and thought it could be of interest to this community.
>
> https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf
>
> Max
>
