airflow-users mailing list archives

From Jarek Potiuk <ja...@potiuk.com>
Subject Re: Best Practice: dynamic dags with external dependencies
Date Mon, 21 Jun 2021 20:24:36 GMT
I think this is a great approach in general. You could use files
(stored in the same shared volume as DAGs) for that.

However, I'd also point out one more extension (or a different angle) of
that kind of approach.

Some of our users (my team had the same experience) learned that it is
actually easier to generate not the config files but the resulting
DAGs directly. It's surprisingly easy to generate nice-looking,
correct Python code (for example using Jinja templates), and sometimes
(depending on your case) it may be easier to generate the Python code
of the DAGs directly rather than config files that will be read by
pre-defined DAGs. You can even add parsing and validation of the
generated code to your automated CI pipeline.

As counter-intuitive as it feels at first, this has very nice
properties: the logic of the DAG can be more "diverse" (you can, for
example, handle different cases with different templates and choose
them on the fly), the resulting DAG code can be cleaner because it
does not have to handle all the paths, it can be formatted
automatically with "black" (for example), you can generate a variable
number of DAG files this way, and so on. You also do not have to keep
DAG code and DAG config in sync over time (eventually there is JUST
DAG code). Adding configuration to a DAG is actually halfway to making
your workflows "declarative" (you write imperative code but somehow
have to make it follow the "declarative" config). Airflow's premise is
more "imperative" in nature, and generating the code provides a
"shortcut" to its power.
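As an illustration only, here is a rough sketch of such a generator
(the template, the item list and the file names are all made up for
this example, not taken from anyone's actual setup):

from pathlib import Path

from jinja2 import Template

# Hypothetical template for one generated DAG file; in practice this would
# live in its own .jinja2 file under version control.
DAG_TEMPLATE = Template("""\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="{{ dag_id }}",
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(task_id="process", bash_command="echo processing {{ item }}")
""")

items = ["ads", "clicks", "pageviews"]  # in reality: the list returned by the slow API
out_dir = Path("dags/generated")
out_dir.mkdir(parents=True, exist_ok=True)

for item in items:
    # Each rendered file is a plain, static DAG that Airflow parses like any other.
    code = DAG_TEMPLATE.render(dag_id=f"process_{item}", item=item)
    (out_dir / f"process_{item}.py").write_text(code)

Because the output is just .py files, you can run black and an import
check over them in CI before they ever reach the scheduler.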

Just a thought that you might consider.

J.

On Mon, Jun 21, 2021 at 10:23 PM Daniel Standish <dpstandish@gmail.com> wrote:
>
> The only hurdle to overcome with this approach is getting the file into every running container (depending on your infra setup). E.g. if worker 1 picks up the "update config" task and updates a config file locally, it would not be accessible in the scheduler container or worker 2.
>
> Do you have a network drive mounted into every container so that once the config file is updated it is then immediately available to all containers? Or some other solution?
>
> What I have done in this scenario is have the "update config" dag update an airflow variable. Then the dynamic dag reads from that variable to generate the tasks. This avoids the file problem I describe above. It does make a call to the metastore, but in practice that does not seem to be a problem.
>
> Another thing I have thought about is generating the config file during deployments and baking it into the image, but that requires more setup than the variable approach, so I did not go that route.
>
> Having one "config update" dag for all such processes like this seems like a pretty good way to go. But for me right now I update the config variable within the dag that uses the config.
>
> On Mon, Jun 21, 2021 at 12:55 PM Dan Andreescu <dandreescu@wikimedia.org> wrote:
>>
>> Hi, this is a question about best practices, as we build our Airflow instance and establish coding conventions.
>>
>> We have a few jobs that follow this pattern:
>>
>> An external API defines a list of items. Calls to this API are slow, let's say on the order of minutes.
>> For each item in this list, we want to launch a sequence of tasks.
>>
>> So far reading and playing with AirFlow, we figure this might be a good approach:
>>
>> A separate "Generator" DAG calls the API and generates a config file with the list of items.
>> At DAG parsing time, the "Actual" DAG reads the config file and generates the dynamic DAG accordingly.
>>
>> Are there other preferred ways to do this kind of thing?  Thanks in advance!
>>
>>
>> Dan Andreescu
>> Wikimedia Foundation
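
For reference, a minimal sketch of the Variable-based pattern Daniel
describes above (the variable name "item_list" and the task itself are
assumptions made up for the example):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# The "update config" DAG (not shown) would call the slow API and then do
# roughly: Variable.set("item_list", items, serialize_json=True).
# This DAG reads that variable at parse time and fans out one task per item.
items = Variable.get("item_list", default_var=[], deserialize_json=True)

with DAG(
    dag_id="process_items",
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for item in items:
        BashOperator(task_id=f"process_{item}", bash_command=f"echo processing {item}")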



-- 
+48 660 796 129
