airflow-dev mailing list archives

From Maxime Beauchemin <maximebeauche...@gmail.com>
Subject Re: programmatically creating and airflow quirks
Date Mon, 26 Nov 2018 02:20:03 GMT
The historical reason is that people would check scripts into the repo
that had actual compute or other forms of undesired side effects in module
scope (scripts with no "if __name__ == '__main__':" guard), and Airflow
would just run these scripts while searching for DAGs. So we added this
mitigation patch to confirm that there's something Airflow-related in the
.py file. Not elegant, and confusing at times, but it has probably also
prevented some issues over the years.
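To illustrate the problem (a hypothetical file, not from any real repo): a script like this does real work at module scope, so a crawler that imports it to look for DAGs would execute that work on every crawl:

```python
# hypothetical_report.py -- illustrative sketch of the problem scripts.
# Work done at module scope runs on every import, so a scheduler that
# imports this file while crawling the dags folder executes it each time.

def expensive_compute():
    # Stand-in for real work (queries, API calls, heavy computation).
    return sum(i * i for i in range(1000))

# Module-scope call: runs on import, i.e. on every scheduler crawl.
result = expensive_compute()

# This guard is what the problem scripts were missing; code inside it
# runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    print(result)
```

With the module-scope call moved inside the guard, importing the file would be harmless.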

The solution here is to have a more explicit way of adding DAGs to the
DagBag (instead of the folder-crawling approach). The DagFetcher proposal
addresses this with a central "manifest" file that provides explicit
pointers to all DAGs in the environment.
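As a purely illustrative sketch (the actual DagFetcher proposal may define a different format and fetching mechanism), such a manifest could be as simple as an explicit mapping from DAG id to source location:

```python
# dag_manifest.py -- hypothetical sketch; names and URI scheme are
# made up for illustration, not the DagFetcher proposal's real format.
DAG_MANIFEST = {
    # dag_id -> where the defining file lives
    "daily_etl": "git://repo/team_a/dags/daily_etl.py",
    "model_training": "git://repo/team_b/dags/model_training.py",
}

def dag_sources(manifest=DAG_MANIFEST):
    """Enumerate DAG sources explicitly instead of crawling a folder."""
    return sorted(manifest.items())
```

The scheduler would then load exactly these files, rather than crawling a folder and guessing by file contents.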

Max

On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <beauinmelbourne@gmail.com> wrote:

> In my opinion, this searching for DAGs is not ideal.
>
> We should be explicitly specifying the DAGs to load somewhere.
>
>
> > On 25 Nov 2018, at 10:41 am, Kevin Yang <yrqls21@gmail.com> wrote:
> >
> > I believe that is mostly because we want to skip parsing/loading .py
> > files that don't contain DAG defs, to save time, as the scheduler is
> > going to parse/load the .py files over and over again, and some files
> > can take quite long to load.
> >
> > Cheers,
> > Kevin Y
> >
> > On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhavala@gmail.com> wrote:
> >
> >> happy to report that the “fix” worked. thanks Alex.
> >>
> >> btw, wondering why it was there in the first place? how does it help
> >> (saves time, early termination)?
> >>
> >>
> >>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guziel@airbnb.com> wrote:
> >>>
> >>> Yup.
> >>>
> >>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhavala@gmail.com> wrote:
> >>>
> >>>
> >>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guziel@airbnb.com> wrote:
> >>>>
> >>>> It’s because of this
> >>>>
> >>>> “When searching for DAGs, Airflow will only consider files where the
> >>>> string “airflow” and “DAG” both appear in the contents of the .py file.”
> >>>>
> >>>
> >>> Have not noticed it. From airflow/models.py, in process_file (both in
> >>> 1.9 and 1.10):
> >>> ..
> >>> if not all([s in content for s in (b'DAG', b'airflow')]):
> >>> ..
> >>> is looking for those strings, and if they are not found, it returns
> >>> without loading the DAGs.
> >>>
> >>>
> >>> So having “airflow” and “DAG” dummy strings placed somewhere will
> >>> make it work?
> >>>
> >>>
> >>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhavala@gmail.com> wrote:
> >>>>
> >>>>
> >>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guziel@airbnb.com> wrote:
> >>>>>
> >>>>> I think this is what is going on. The DAGs are picked up from
> >>>>> module-level variables, i.e. if you do
> >>>>> dag = DAG(...)
> >>>>> dag = DAG(...)
> >>>>
> >>>> from my_module import create_dag
> >>>>
> >>>> for file in yaml_files:
> >>>>     dag = create_dag(file)
> >>>>     globals()[dag.dag_id] = dag
> >>>>
> >>>> You notice that create_dag is in a different module. If it is in the
> >>>> same scope (file), it will be fine.
> >>>>
> >>>>>
> >>>>
> >>>>> Only the second dag will be picked up.
> >>>>>
> >>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhavala@gmail.com> wrote:
> >>>>> Hey AirFlow Devs:
> >>>>> In our organization, we build a Machine Learning WorkBench with
> >>>>> AirFlow as an orchestrator of the ML Work Flows, and have wrapped
> >>>>> AirFlow Python operators to customize the behaviour. These work
> >>>>> flows are specified in YAML.
> >>>>>
> >>>>> We drop a DAG loader (written in Python) in the default location
> >>>>> airflow expects the DAG files. This DAG loader reads the specified
> >>>>> YAML files and converts them into airflow DAG objects. Essentially,
> >>>>> we are programmatically creating the DAG objects. In order to
> >>>>> support multiple parsers (yaml, json etc.), we separated the DAG
> >>>>> creation from the loading. But when a DAG is created (in a separate
> >>>>> module) and made available to the DAG loaders, airflow does not pick
> >>>>> it up. As an example, consider that I created a DAG, pickled it, and
> >>>>> will simply unpickle the DAG and give it to airflow.
> >>>>>
> >>>>> However, in the current avatar of airflow, the very creation of the
> >>>>> DAG has to happen in the loader itself. As far as I am concerned,
> >>>>> airflow should not care where and how the DAG object is created, so
> >>>>> long as it is a valid DAG object. The workaround for us is to mix
> >>>>> parser and loader in the same file and drop it in the airflow
> >>>>> default dags folder. During dag_bag creation, this file is loaded up
> >>>>> with the import_modules utility and shows up in the UI. While this
> >>>>> is a solution, it is not clean.
> >>>>>
> >>>>> What do DEVs think about a solution to this problem? Will saving the
> >>>>> DAG to the db and reading it from the db work? Or do some core
> >>>>> changes need to happen in the dag_bag creation? Can dag_bag take a
> >>>>> bunch of "created" DAGs?
> >>>>>
> >>>>> thanks,
> >>>>> -soma
> >>>>
> >>>
> >>
> >>
>
