airflow-dev mailing list archives

From Gerard Toonstra <gtoons...@gmail.com>
Subject Re: ETL best practices for airflow
Date Sat, 22 Oct 2016 23:07:08 GMT
Hi all,

So I worked out a full pipeline for a toy data warehouse on Postgres:

https://gtoonstra.github.io/etl-with-airflow/fullexample.html

https://github.com/gtoonstra/etl-with-airflow/tree/master/examples/full-example

It demonstrates pretty much all of the listed principles for ETL work,
except for alerting and monitoring.
Still to be done: some work on the DDL and a full code review of the
naming conventions.

Things I ran into:
- Issue 137: max_active_runs is respected on the very first run, but not
  after clearing tasks.
- The parameters of the standard PostgresOperator are not templated, so I
  couldn't use the core operator.
- It's a good idea to specify "depends_on_past" when using sensors;
  otherwise sensors could exhaust the available processing slots.
- I'm looking for a better strategy to process a large backfill when the
  desired schedule is 1 day. Processing 700+ days one at a time takes a
  lot of time and overhead when processing per month is an option.
  Is a duplicate of the DAG with a different interval a better choice, or
  are there strategies to detect this in an operator and use the output
  of that to specify the date window boundaries?
- When pooling is active, scheduling takes a lot more time. Even when the
  pool size is 10 and the number of instances is 7, it takes longer for
  the instances to actually start running.
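
To make the backfill question concrete, here is a minimal sketch of the
"date window boundaries" idea in plain Python. It is only an illustration,
not something taken from the example repository: the function name
month_windows and the half-open [start, end) convention are assumptions,
and how the windows would be handed to an operator is left open.

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Split the half-open range [start, end) into per-month windows.

    Hypothetical helper: returns a list of (lo, hi) date pairs, each
    covering at most one calendar month, with hi exclusive.
    """
    windows = []
    lo = start
    while lo < end:
        # First day of the month following `lo`: jump past the end of the
        # current month (32 days from day 1 always lands in the next
        # month), then snap back to day 1.
        next_month = (lo.replace(day=1) + timedelta(days=32)).replace(day=1)
        # Never run past the end of the requested backfill range.
        hi = min(next_month, end)
        windows.append((lo, hi))
        lo = hi
    return windows


if __name__ == "__main__":
    # Print the monthly windows for a backfill that would otherwise be
    # processed one day at a time.
    for lo, hi in month_windows(date(2014, 11, 1), date(2016, 10, 22)):
        print(lo.isoformat(), "->", hi.isoformat())
```

Each window could then drive one load whose SQL filters on
lo <= event_date < hi, instead of one load per day.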

Looking forward to your comments on how some approaches could be improved.

Rgds,

Gerard


On Wed, Oct 19, 2016 at 8:17 AM, Gerard Toonstra <gtoonstra@gmail.com>
wrote:

>
> Thanks Max,
>
> I think it always helps when new people start using software to see what
> their issues are.
>
> Some of it was also taken from the Nov. 2015 video on best practices on
> this page:
>
> https://www.youtube.com/watch?v=dgaoqOZlvEA&feature=youtu.be
>
> ----
>
> I made some more progress yesterday, but ran into issue 137. I think I
> solved it with depends_on_past, but I'm going to rely on the
> LatestOnlyOperator instead (it's better) and then work something out
> from there.
>
> Rgds,
>
> Gerard
>
>
> On Tue, Oct 18, 2016 at 6:02 PM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
>> This is an amazing thread to follow! I'm really interested to watch best
>> practices documentation emerge out of the community.
>>
>> Gerard, I enjoyed reading your docs and would love to see this grow. I've
>> been meaning to write a series of blog posts on the subject for quite some
>> time. It seems like you have a really good start. We could integrate this
>> as a "Best Practice" section to our current documentation once we build
>> consensus about the content.
>>
>> Laura, please post on this mailing list once the talk is up as a video,
>> I'd love to watch it.
>>
>> A related best practice I'd like to write about is the idea of applying
>> some concepts of functional programming to ETL. The idea is to use
>> immutable datasets/datablocks systematically as sources to your
>> computations, so that any task instance sources from immutable datasets
>> that are persisted in your backend. That satisfies the guarantee that
>> re-running any chunk of ETL at a different point in time leads to the
>> exact same result. It also usually means that you need to 1- do
>> incremental loads, and 2- "snapshot" your dimension/referential/small
>> tables in time, to make sure that running the ETL from 26 days ago
>> sources from the dimension snapshot as it was back then and yields the
>> exact same result.
>>
>> Anyhow, it's a complex and important subject I should probably write about
>> in a structured way sometime.
>>
>> Max
>>
>> On Mon, Oct 17, 2016 at 6:12 PM, Boris Tyukin <boris@boristyukin.com>
>> wrote:
>>
>>
>>
>
