airflow-dev mailing list archives

From Boris Tyukin <bo...@boristyukin.com>
Subject Re: variable scope with dynamic dags
Date Wed, 22 Mar 2017 19:50:55 GMT
Thanks Jeremiah, this is exactly what was bugging me. I am going to rewrite
that code and look at persistent storage. Your explanation helped, thanks!

On Wed, Mar 22, 2017 at 2:29 PM, Jeremiah Lowin <jlowin@apache.org> wrote:

> In vanilla Python, your DAGs will all reference the same object, so when
> your DAG file is parsed and 200 DAGs are created, there will still be only
> one 60MB dict object (I say "vanilla" because there are obviously ways to
> create copies of the object).
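>
> A minimal sketch of the reference semantics (FakeDag is a made-up
> stand-in, not an Airflow class):
>
>     big_dict = {i: 'x' * 100 for i in range(1000)}  # stand-in for the 60MB dict
>
>     class FakeDag(object):
>         def __init__(self, dag_id, lookup):
>             self.dag_id = dag_id
>             self.lookup = lookup  # each instance stores a reference, not a copy
>
>     dags = [FakeDag('foo_%d' % i, big_dict) for i in range(200)]
>
>     # all 200 instances point at the same object; no extra 60MB copies exist
>     assert all(d.lookup is big_dict for d in dags)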
>
> HOWEVER, you should assume that each Airflow TASK is being run in a
> different process, and each process is going to load your DAG file when it
> runs. If resource use is a concern, I suggest you look at out-of-core or
> persistent storage for the object so you don't need to load the whole thing
> every time.
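>
> As a rough sketch with the stdlib shelve module (the path and key here are
> hypothetical):
>
>     import shelve
>
>     my_outer_dict = {'some_key': 42}  # your real 60MB dict goes here
>
>     # one-off step, outside the DAG file: persist the dict to disk
>     store = shelve.open('/tmp/my_outer_dict.db')
>     store.update(my_outer_dict)  # note: shelve keys must be strings
>     store.close()
>
>     # in the DAG file / task: open lazily and read only the key you need,
>     # instead of rebuilding the whole dict on every parse
>     store = shelve.open('/tmp/my_outer_dict.db', flag='r')
>     value = store['some_key']
>     store.close()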
>
> On Wed, Mar 22, 2017 at 11:20 AM Boris Tyukin <boris@boristyukin.com>
> wrote:
>
> > Hi Jeremiah, thanks for the explanation!
> >
> > I am very new to Python, so I was surprised that it works and that my
> > external dictionary object was still accessible to all of the generated
> > DAGs. I think it makes sense, but I would like to confirm one thing that
> > I do not know how to test myself.
> >
> > Do you think that large dictionary object will still be loaded into
> > memory only once even if I generate 200 DAGs that access it? Basically,
> > will they all just use a reference to it, or will each one create a copy
> > of the same 60MB structure?
> >
> > I hope my question makes sense :)
> >
> > On Wed, Mar 22, 2017 at 10:54 AM, Jeremiah Lowin <jlowin@apache.org>
> > wrote:
> >
> > > At the risk of oversimplifying things, your DAG definition file is
> > > loaded *every* time a DAG (or any task in that DAG) is run. Think of it
> > > as a literal Python import of your dag-defining module: any variables
> > > are loaded along with the DAGs, which are then executed. That's why
> > > your dict is always available. This will work with Celery since it
> > > follows the same approach, parsing your DAG file to run each task.
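> > >
> > > One way to see this for yourself is a module-level side effect in the
> > > DAG file (a sketch; the dict built here is a stand-in for your real one):
> > >
> > >     import logging
> > >
> > >     # module-level code runs on *every* parse of this file, i.e. each
> > >     # time the scheduler, webserver, or a worker imports it
> > >     logging.info('dag file parsed; building my_outer_dict')
> > >     my_outer_dict = dict((i, 'x') for i in range(1000))  # stand-in build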
> > >
> > > (By the way, this is why it's critical that all parts of your Airflow
> > > infrastructure have access to the same DAGS_FOLDER)
> > >
> > > Now, it is true that the DagBag loads DAG objects, but think of it as
> > > more of an "index" so that the scheduler/webserver know what DAGs are
> > > available. When it's time to actually run one of those DAGs, the
> > > executor loads it from the underlying source file.
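> > >
> > > You can poke at the "index" side of this yourself (a sketch, assuming
> > > the default DAGS_FOLDER and one of your generated dag_ids):
> > >
> > >     from airflow.models import DagBag
> > >
> > >     dagbag = DagBag()              # parses every file under DAGS_FOLDER
> > >     print dagbag.dags.keys()       # the "index" of known dag_ids
> > >     dag = dagbag.get_dag('foo_0')  # reloads from the source file if stale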
> > >
> > > Jeremiah
> > >
> > > On Wed, Mar 22, 2017 at 8:45 AM Boris Tyukin <boris@boristyukin.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a weird question, but it has been bugging me. I have some code
> > > > like the example below to generate DAGs dynamically, using Max's
> > > > example code from the FAQ.
> > > >
> > > > It works fine, but I have one large dict (let's call it my_outer_dict)
> > > > that takes over 60MB in memory, and I need to access it from all of
> > > > the generated DAGs. Needless to say, I do not want to recreate that
> > > > dict for every DAG, as I want to load it into memory only once.
> > > >
> > > > To my surprise, if I define that dict outside of my DAG definition
> > > > code, I can still access it.
> > > >
> > > > Can someone explain why, and where it is stored? I thought only DAG
> > > > definitions are loaded into the DagBag, not the variables outside
> > > > them.
> > > >
> > > > Is it even good practice, and will it still work if I switch to the
> > > > Celery executor?
> > > >
> > > >
> > > > from airflow import DAG
> > > >
> > > > def get_dag(i):
> > > >     dag_id = 'foo_{}'.format(i)
> > > >     dag = DAG(dag_id)
> > > >     # ...
> > > >     print my_outer_dict
> > > >     return dag
> > > >
> > > > my_outer_dict = {}
> > > > for i in range(10):
> > > >     dag = get_dag(i)
> > > >     globals()[dag.dag_id] = dag
> > > >
> > >
> >
>
