airflow-dev mailing list archives

From Brian Van Klaveren <>
Subject Re: Article: The Rise of the Data Engineer
Date Wed, 25 Jan 2017 15:48:02 GMT
There's also MonetDB and Greenplum, depending on your data size, which
support columnar tables if you want to get your feet wet. If your data is
actually more array-like, you might try out SciDB.
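For instance, a minimal sketch of creating a column-oriented table in
Greenplum (assuming a reachable cluster and psycopg2; the DSN, table, and
column names are made up for illustration):

import psycopg2

# Greenplum extends PostgreSQL, so the standard Postgres driver works.
con = psycopg2.connect("host=gp-master dbname=dw user=etl")  # hypothetical DSN
cur = con.cursor()

# Append-only, column-oriented storage with per-column compression --
# this is the "columnar table" option mentioned above.
cur.execute("""
    CREATE TABLE fact_events (
        event_id   bigint,
        event_type text,
        revenue    numeric
    )
    WITH (appendonly=true, orientation=column, compresstype=zlib)
    DISTRIBUTED BY (event_id)
""")
con.commit()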

Per this email thread, it almost sounds like a Slack team/Discourse for
data engineering might be useful.

> On Jan 25, 2017, at 7:28 AM, Rob Goretsky <> wrote:
> @Gerard - I mentioned Vertica just as one of the first examples of a system
> that offers columnar storage.  You might actually see a significant benefit
> using columnar storage with even a smaller table, as small as a few GB.
> Columnar storage works well if you have wide fact tables with many columns
> and often query just a few of those columns.  The downside to columnar
> storage is that if you often SELECT *, or select many of the columns from
> the table at once, it will actually be slower than if you had stored the
> data in traditional 'row-based' storage.  Also, updates and deletes can be
> slower with columnar storage, so it works best if you have wide,
> INSERT-only fact tables.  That said, I think there are better options than
> Vertica on the market today for getting your feet wet with columnar
> storage.  If AWS is an option for you, then Redshift offers this out of the
> box and would let you run your POC for as little as $0.25 an hour.
> Parquet is basically columnar storage for Hadoop.  Other, more traditional
> data warehouse vendors like Netezza and Teradata also offer columnar
> storage as an option ...
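> To make the column-projection point concrete, here's a minimal sketch with
> pyarrow (the table and column names are invented for illustration):
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # A wide "fact" table: many columns, but most queries touch only a few.
> fact = pa.Table.from_pydict({
>     "order_id": [1, 2, 3],
>     "revenue": [9.99, 24.50, 5.00],
>     "customer_name": ["acme", "globex", "acme"],
>     "region": ["us", "eu", "us"],
> })
> pq.write_table(fact, "fact.parquet")  # columnar, compressed per column
>
> # Reading only the columns you need skips the rest of the file entirely;
> # the SELECT * equivalent (columns=None) must reassemble every column.
> slim = pq.read_table("fact.parquet", columns=["order_id", "revenue"])
> print(slim.to_pydict())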
>> On Wed, Jan 25, 2017 at 9:16 AM, Boris Tyukin <> wrote:
>> Max, really really nice post and I like your style of writing - please
>> continue sharing your experience and inspiring many of us working in more
>> traditional environments ;) I shared your post with our leadership and
>> hopefully we will have data engineers on our team soon! As far as UI vs.
>> coding, I am not sure I fully agree. If we look at software development
>> history, we see times when programming was the only answer and required
>> hardcore professionals like you, but then commercial applications arrived
>> that were very visual and lowered the skill set required. Informatica,
>> SSIS and others became hugely popular, and many people swear they save
>> time if you know how to use them. I am pretty sure we will see new tools
>> in the Big Data arena as well (AtScale is one example) that make things
>> easier for less skilled developers and users.
>> It is also good timing for me, as my company is evaluating the Informatica
>> Big Data Management add-on (which competes with Talend Big Data) - I am
>> not sold yet on why we would need it if we can do much more with Python,
>> Spark and Hive. But the key point the Informatica folks make is to lower
>> the skill requirements for developers and to leverage existing Informatica
>> and SQL skills. I think this is important because it is exactly why SQL is
>> still a huge player in the Big Data world - people love SQL, they can do a
>> lot with SQL, and they want to use the SQL experience they've built over
>> their careers.
>> The dimensional modelling question you raised is also very interesting but
>> very arguable. I had thought about it before and still have not come to
>> believe that flat tables are the way to go. You said yourself that there
>> is still a place for a highly accurate (certified) enterprise-wide
>> warehouse, and one still needs to spend a lot of time thinking about use
>> cases and designing to support them. I am not sure I like the abundance of
>> denormalized tables in the Big Data world, but I do see your point about
>> SCDs and all the pain of maintaining a traditional star-schema DW. But
>> dimensional modelling is not really about maintenance or making life
>> easier for ETL developers - IMHO it is about structuring data to simplify
>> business and data analytics. It is about a rigorous process to conform
>> data from multiple source systems. It is about data quality and trust.
>> Finally, it is about a better-performing DW (by the nature of RDBMSs,
>> which are very good at joining tables by foreign keys) - though the last
>> benefit is less relevant in Hadoop, since we can reprocess or query data
>> more efficiently.
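>> As a toy illustration of that join-by-foreign-key point (a sketch using
>> sqlite3 from the standard library; the schema is invented):
>>
>> import sqlite3
>>
>> con = sqlite3.connect(":memory:")
>> # Star schema: a narrow fact table referencing a conformed dimension.
>> con.executescript("""
>>     CREATE TABLE dim_customer (
>>         customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
>>     CREATE TABLE fact_sales (
>>         customer_key INTEGER REFERENCES dim_customer, revenue REAL);
>>     INSERT INTO dim_customer VALUES (1, 'Acme', 'US'), (2, 'Globex', 'EU');
>>     INSERT INTO fact_sales VALUES (1, 9.99), (1, 24.50), (2, 5.00);
>> """)
>> # Analysts reach conformed attributes through the foreign key.
>> for row in con.execute("""
>>         SELECT d.region, SUM(f.revenue)
>>         FROM fact_sales f JOIN dim_customer d USING (customer_key)
>>         GROUP BY d.region"""):
>>     print(row)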
>> Gerard, why would you do that? If you already have the skills with SQL
>> Server and your DWH is tiny (I run a 500 GB DWH in SQL Server on a weak
>> machine), you should be fine with SQL Server. The only issue is that you
>> cannot support fast BI queries. But if you have an enterprise license, you
>> can easily dump your tables into a Tabular in-memory cube, and most of
>> your queries will run in under 2 seconds. Vertica is cool, but the
>> learning curve is pretty steep, and it really shines on big denormalized
>> tables, as join performance is not that good. I work with a large
>> healthcare vendor and they have TB-size tables in their Vertica DB - most
>> of them are flattened out, but they still have dimensions and facts, just
>> fewer than you would normally have with a traditional star-schema design.
>> On Wed, Jan 25, 2017 at 5:57 AM, Gerard Toonstra <> wrote:
>>> You mentioned Vertica and Parquet. Is it recommended to use these newer
>>> tools even when the DWH is not Big Data sized (about 150G)?
>>> So there are a couple of good benefits, but are there any downsides and
>>> disadvantages you have to take into account when comparing Vertica vs.
>>> SQL Server, for example?
>>> If you really recommend Vertica over SQL Server, I'm looking at doing a
>>> PoC here to see where it goes...
>>> Rgds,
>>> Gerard
>>> On Wed, Jan 25, 2017 at 12:39 AM, Rob Goretsky <> wrote:
>>>> Maxime,
>>>> Just wanted to thank you for writing this article - much like the
>>>> original articles by Jeff Hammerbacher and DJ Patil coining the term
>>>> "Data Scientist", I feel this article stands as a great explanation of
>>>> what the title of "Data Engineer" means today.  As someone who has been
>>>> working in this role since before the title existed, many of the points
>>>> here rang true about how the technology and tools have evolved.
>>>> I started my career working with graphical ETL tools (Informatica) and
>>>> could never shake the feeling that I could get a lot more done, with a
>>>> more maintainable set of processes, if I could just write reusable
>>>> functions in any programming language and then keep them in a shared
>>>> library.  Instead, what the GUI tools forced upon us were massive Wiki
>>>> documents laying out 'the 9 steps you need to follow perfectly in order
>>>> to build a proper Informatica workflow' that developers would painfully
>>>> follow along with, rather than being able to encapsulate the things that
>>>> didn't change in one central 'function' and pass in parameters for the
>>>> things that varied from the defaults.
>>>> I also spent a lot of time early in my career trying to design data
>>>> warehouse tables using the Kimball methodology, with star schemas and
>>>> all dimensions extracted out to separate dimension tables.  As columnar
>>>> storage formats with compression became available (Vertica/Parquet/etc.),
>>>> I started gravitating more towards the idea that I could just store the
>>>> raw string dimension data in the fact table directly, denormalized, but
>>>> it always felt like I was breaking the 'purist' rules on how to design
>>>> data warehouse schemas 'the right way'.  So in that regard, thanks for
>>>> validating my feeling that it's OK to keep denormalized dimension data
>>>> directly in fact tables - it definitely makes our queries easier to
>>>> write, and as you mentioned, has the added benefit of helping you avoid
>>>> all of that SCD fun!
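>>>> For instance, what I mean by keeping the dimension strings inline (a
>>>> hypothetical sketch; the column names are made up):
>>>>
>>>> import sqlite3
>>>>
>>>> con = sqlite3.connect(":memory:")
>>>> # Denormalized fact table: dimension attributes stored inline as raw
>>>> # strings, so queries need no dimension joins, and there are no SCD
>>>> # updates -- each row keeps the attribute values as of load time.
>>>> con.executescript("""
>>>>     CREATE TABLE fact_sales (customer_name TEXT, region TEXT, revenue REAL);
>>>>     INSERT INTO fact_sales VALUES
>>>>         ('Acme', 'US', 9.99), ('Acme', 'US', 24.50), ('Globex', 'EU', 5.00);
>>>> """)
>>>> for row in con.execute(
>>>>         "SELECT region, SUM(revenue) FROM fact_sales GROUP BY region"):
>>>>     print(row)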
>>>> We're about to put Airflow into production at my company for a handful
>>>> of DAGs to start, so it will be running alongside our existing
>>>> Informatica server running 500+ workflows nightly.  But I can already
>>>> see the writing on the wall - it's really hard for us to find talented
>>>> engineers with Informatica experience along with more general computer
>>>> engineering backgrounds (many seem to have specialized purely in
>>>> Informatica) - so our newer engineers come in with strong Python/SQL
>>>> backgrounds and have been gravitating towards building newer jobs in
>>>> Airflow...
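>>>> (For reference, a minimal sketch of the kind of DAG we're starting
>>>> with - the task names and commands are just placeholders:)
>>>>
>>>> from datetime import datetime
>>>>
>>>> from airflow import DAG
>>>> from airflow.operators.bash_operator import BashOperator
>>>>
>>>> default_args = {"owner": "etl", "start_date": datetime(2017, 1, 1)}
>>>>
>>>> # One nightly DAG standing in for an Informatica workflow.
>>>> dag = DAG("nightly_load", default_args=default_args,
>>>>           schedule_interval="@daily")
>>>>
>>>> extract = BashOperator(task_id="extract", bash_command="echo extract",
>>>>                        dag=dag)
>>>> load = BashOperator(task_id="load", bash_command="echo load", dag=dag)
>>>> extract.set_downstream(load)  # load runs only after extract succeeds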
>>>> One item that I think deserves addition to this article is the
>>>> continuing prevalence of SQL.  Many technologies have changed, but SQL
>>>> has persisted (pun intended?).  We went through a phase for a few years
>>>> where it looked like the tide was turning to MapReduce, Pig, or other
>>>> languages for accessing and aggregating data.  But now it seems even the
>>>> "NoSQL" data stores have added SQL layers on top, and we have more SQL
>>>> engines for Hadoop than I can count.  SQL is easy to learn but tougher
>>>> to master, so to me the two main languages in any modern Data Engineer's
>>>> toolbelt are SQL and a scripting language (Python/Ruby).  I think it's
>>>> amazing that with so much change in every aspect of how we do data
>>>> warehousing, SQL has stood the test of time...
>>>> Anyways, thanks again for writing this up - I'll definitely be sharing
>>>> it with my team!
>>>> -Rob
>>>> On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <> wrote:
>>>>> Hey, I just published an article about the "Data Engineer" role in
>>>>> modern organizations and thought it could be of interest to this
>>>>> community.
>>>>> 91be18f1e603#.5rkm4htnf
>>>>> Max
