airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rafael Barbosa <rrbarb...@gmail.com>
Subject Initial load best practices
Date Tue, 24 Jul 2018 10:08:26 GMT
I am creating a number of small ETL tasks to pull data into a Redshift
instance. For every task, I want a one-time bootstrapping task to create
the target table (and maybe even the schema) and do the initial load.

Preferably, I would like the DAG to perform this initial load, instead
doing it manually. This would be particularly useful when testing the DAG
in different environments (e.g., local vs production).

I see two ways of doing this:

1. Have a bootstrap operator in the DAG that `create if not exists` the
table, and rely on the DAG start date to backfill the data in the initial
load. E.g. `bootstrap` -> `fetch` -> `load`.
2. Implement an operator that does the initial load if the table does not
exist, or a incremental load if it does. The start date for the initial
upload is a parameter for the operator. Similar to what is described in
this post
<https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-ii-47c4e7cbda71#a102>
in
"parameterize workflow".

How do you typically handle this situation? What are the best practices for
an initial load?

Best,
Rafael

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message