airflow-dev mailing list archives

From Boris Tyukin <bo...@boristyukin.com>
Subject Re: Best practices for dynamically generated tasks and dags
Date Fri, 21 Oct 2016 17:22:18 GMT
Thanks Laura, it helps! I was hoping you would reply :) Very good points
about UI / logs / restarts - I think at this point I really like option #2
myself.

I still wonder if people do something creative to generate complex DAGs
outside of the DAG folder, for cases where it takes significant time to poll
metadata/databases to generate all the tasks. I do not know if it is
possible, as I am not strong with Python (actually I have been learning
Python as I am learning Airflow!). The idea is to have an outside Python
script generate a static .py file per DAG and place these generated files
under the Airflow dag_folder once a day or on some schedule. Is anyone doing
this, or am I over-complicating things and #2 should just work?
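
To make the idea above concrete, here is a minimal sketch of such an outside
generator, not a tested Airflow setup. The function names, the
"generated_<dag_id>.py" file naming, and the use of BashOperator with the
Airflow 1.x import path are all assumptions chosen for illustration. The
script only writes text, so it does not need Airflow installed to run.

```python
import os

# Template for the generated DAG file. The import path below is the
# Airflow 1.x one and is an assumption; adjust it for your Airflow version.
HEADER = '''\
# Auto-generated; regenerate instead of editing by hand.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(dag_id={dag_id!r}, start_date=datetime(2016, 10, 1),
          schedule_interval="@daily")
'''

TASK_LINE = "BashOperator(task_id={task_id!r}, bash_command={command!r}, dag=dag)\n"


def render_dag_source(dag_id, tasks):
    """Return Python source for one DAG; `tasks` maps task_id -> shell command."""
    lines = [HEADER.format(dag_id=dag_id), "\n"]
    for task_id, command in sorted(tasks.items()):
        lines.append(TASK_LINE.format(task_id=task_id, command=command))
    return "".join(lines)


def write_dag_file(dag_folder, dag_id, tasks):
    """Write the DAG file via a temp file and rename, so the scheduler
    never parses a half-written file."""
    source = render_dag_source(dag_id, tasks)
    compile(source, dag_id, "exec")  # syntax check only; nothing is imported
    final_path = os.path.join(dag_folder, "generated_{}.py".format(dag_id))
    tmp_path = final_path + ".tmp"  # not a .py file, so it is not picked up as a DAG
    with open(tmp_path, "w") as handle:
        handle.write(source)
    os.replace(tmp_path, final_path)
    return final_path
```

The slow metadata/database polling would happen in whatever code builds the
`tasks` mapping, run from cron or similar once a day, so the scheduler only
ever sees a small static file.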

I think in my case it might take a good minute to parse the metadata files
and some database tables to actually generate the DAG tasks. I also imagine
it will produce a heck of a lot of log records, since the scheduler polls
the DAG folder every minute and the whole process will repeat a minute
later - so it will effectively run non-stop unless I change the Airflow
scheduler settings.
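
One way to keep option #2 while avoiding the repeated slow parse is to cache
the expensive result on disk and let the DAG file read the cache. The sketch
below is only an illustration under that assumption: `load_task_specs`,
`build_specs` and the JSON cache file are hypothetical names, not Airflow
features.

```python
import json
import os
import time


def load_task_specs(cache_path, build_specs, max_age_seconds=24 * 60 * 60):
    """Return cached task specs, rebuilding only when the cache is stale,
    missing, or unreadable. `build_specs` is the slow function that polls
    metadata files and database tables."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path) as handle:
                return json.load(handle)
    except (OSError, ValueError):
        pass  # no usable cache yet: fall through and rebuild
    specs = build_specs()
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(specs, handle)
    os.replace(tmp_path, cache_path)  # swap in the new cache in one step
    return specs
```

A DAG file would call `load_task_specs(...)` at the top and loop over the
result to create its tasks, so most scheduler passes only pay for reading a
small JSON file.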



On Fri, Oct 21, 2016 at 11:39 AM, Laura Lorenz <llorenz@industrydive.com>
wrote:

> We've been evolving from type 1 you describe to a pull/poll version of the
> type 2 you describe. For type 1, it is really hard to tell what's going on
> (all the UI views become useless because they are so huge). Having one big
> dag also means you can't turn off the scheduler for individual parts, and
> the whole DAG fails if one task does, so if you can functionally separate
> them I think that gives you more configuration options. Our biggest DAG now
> is more like 22*10 tasks, which is still too big in our opinions. We
> leverage ExternalTaskSensors to link dags together which is more of a
> pull/poll paradigm, but you could use a TriggerDagRunOperator if you wanted
> more of a push/trigger paradigm which is what I hear you saying in type 2.
>
> To your second question, our DAGs are dynamic based on the results of an
> API call we embed in the DAG and our scheduler is on a 5-second timelapse
> for each attempt to refill the DagBag. I think because of the frequency of
> the scheduler polling the files, because our API call is relatively fast,
> we are working with DAGs that are on a 24 hour schedule_interval, and the
> resultant DAG structure is not too large or complicated, we haven't had any
> issues with that or done anything special. I think it's just the fact of
> the matter that if you give the scheduler a lot of work to do to determine
> the DAG shape, it will take a while.
>
> Laura
>
> On Fri, Oct 21, 2016 at 10:52 AM, Boris Tyukin <boris@boristyukin.com>
> wrote:
>
> > Guys, would you mind chiming in and sharing your experience?
> >
>
