airflow-dev mailing list archives

From Alex Guziel <alex.guz...@airbnb.com.INVALID>
Subject Re: dag file processing times
Date Mon, 24 Apr 2017 22:07:39 GMT
You can also use reflection in Python to read the modules all the way down.
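
A minimal Python 3 sketch of that idea (it actually imports the file, so it only
works if the DAG file is importable in isolation; the helper name is made up):

    import importlib.util
    import sys

    def modules_pulled_in(dag_file):
        # Import the DAG file under a throwaway name and diff sys.modules
        # to see every module the import dragged in, transitively.
        before = set(sys.modules)
        spec = importlib.util.spec_from_file_location("_probe_dag", dag_file)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return {name: getattr(sys.modules[name], "__file__", None)
                for name in set(sys.modules) - before}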

On Mon, Apr 24, 2017 at 3:05 PM, Dan Davydov <dan.davydov@airbnb.com.invalid>
wrote:

> Was talking with Alex about the DB case offline; for those we could support
> a force refresh arg with an interval param.
>
> Manifests would need to be hierarchical, but I feel like it would spin out
> into a full-blown build system inevitably.
>
> On Mon, Apr 24, 2017 at 3:02 PM, Arthur Wiedmer <arthur.wiedmer@gmail.com>
> wrote:
>
> > What if the DAG actually depends on configuration that only exists in a
> > database and is retrieved by the Python code generating the DAG?
> >
> > Just asking because we have this case in production here. It is slowly
> > changing, so still fits within the Airflow framework, but you cannot just
> > watch a file...
> >
> > Best,
> > Arthur
> >
> > On Mon, Apr 24, 2017 at 2:55 PM, Bolke de Bruin <bdbruin@gmail.com>
> > wrote:
> >
> > > Inotify can work without a daemon. Just fire a call to the API when a file
> > > changes. Just a few lines in bash.
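> > >
> > > Roughly this shape, sketched in Python for illustration (the third-party
> > > watchdog package stands in for raw inotify, and the refresh endpoint is
> > > made up):
> > >
> > >     import requests
> > >     from watchdog.events import FileSystemEventHandler
> > >     from watchdog.observers import Observer
> > >
> > >     class DagFileChanged(FileSystemEventHandler):
> > >         def on_modified(self, event):
> > >             # Ping a hypothetical refresh endpoint with the changed file.
> > >             if not event.is_directory and event.src_path.endswith(".py"):
> > >                 requests.post("http://localhost:8080/api/refresh_dag",
> > >                               json={"file_path": event.src_path})
> > >
> > >     observer = Observer()
> > >     observer.schedule(DagFileChanged(), "/path/to/dags", recursive=True)
> > >     observer.start()
> > >     observer.join()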
> > >
> > > If you bundle your dependencies in a zip, you should be fine with the
> > > above. Or if we start using manifests that list the files that are needed
> > > for a DAG...
> > >
> > >
> > > Sent from my iPhone
> > >
> > > > On 24 Apr 2017, at 22:46, Dan Davydov <dan.davydov@airbnb.com.INVALID>
> > > > wrote:
> > > >
> > > > One idea to solve this is to use a daemon that uses inotify to watch for
> > > > changes in files and then reprocesses just those files. The hard part is
> > > > that without any kind of dependency/build system for DAGs it can be hard
> > > > to tell which DAGs depend on which files.
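> > > >
> > > > A rough way to build that mapping with the stdlib modulefinder (static
> > > > scan, no execution; the function name is made up):
> > > >
> > > >     import os
> > > >     from modulefinder import ModuleFinder
> > > >
> > > >     def dag_files_by_dependency(dag_folder):
> > > >         # Map each imported source file back to the DAG files that pull
> > > >         # it in, so a change to e.g. helpers.py tells us which DAG files
> > > >         # need reprocessing.
> > > >         index = {}
> > > >         for root, _, files in os.walk(dag_folder):
> > > >             for name in files:
> > > >                 if not name.endswith(".py"):
> > > >                     continue
> > > >                 dag_file = os.path.join(root, name)
> > > >                 finder = ModuleFinder()
> > > >                 try:
> > > >                     finder.run_script(dag_file)
> > > >                 except Exception:
> > > >                     continue  # skip files the scanner can't handle
> > > >                 for mod in finder.modules.values():
> > > >                     if getattr(mod, "__file__", None):
> > > >                         index.setdefault(mod.__file__, set()).add(dag_file)
> > > >         return index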
> > > >
> > > > On Mon, Apr 24, 2017 at 1:21 PM, Gerard Toonstra <gtoonstra@gmail.com>
> > > > wrote:
> > > >
> > > >> Hey,
> > > >>
> > > >> I've seen some people complain about DAG file processing times. An issue
> > > >> was raised about this today:
> > > >>
> > > >> https://issues.apache.org/jira/browse/AIRFLOW-1139
> > > >>
> > > >> I attempted to provide a good explanation of what's going on. Feel free
> > > >> to validate and comment.
> > > >>
> > > >>
> > > >> I'm noticing that the file processor is a bit naive in the way it
> > > >> reprocesses DAGs. It doesn't look at the DAG interval, for example, so it
> > > >> looks like it reprocesses all files continuously in one big batch, even
> > > >> if we can determine that the next "schedule" for all of a file's DAGs is
> > > >> in the future?
> > > >>
> > > >>
> > > >> Wondering if a change in the DagFileProcessingManager could optimize
> > > >> things a bit here.
> > > >>
> > > >> In the part where it gets the simple_dags from a file it's currently
> > > >> processing:
> > > >>
> > > >>                for simple_dag in processor.result:
> > > >>                    simple_dags.append(simple_dag)
> > > >>
> > > >> the file_path is in the context, and the simple_dags should be able to
> > > >> provide the next interval date for each DAG in the file.
> > > >>
> > > >> The idea is to add files to a sorted deque by "next_schedule_datetime"
> > > >> (the minimum next interval date), so that when we build the list
> > > >> "files_paths_to_queue", it can remove files that have DAGs that we know
> > > >> won't have a new dagrun for a while.
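> > > >>
> > > >> Very roughly something like this (the next_schedule dict and function
> > > >> name are made up, not the real DagFileProcessingManager internals):
> > > >>
> > > >>     import heapq
> > > >>     from datetime import datetime
> > > >>
> > > >>     def file_paths_due(next_schedule, now=None):
> > > >>         # next_schedule: {file_path: earliest next schedule datetime
> > > >>         # across the DAGs parsed from that file, taken from its
> > > >>         # simple_dags}.
> > > >>         now = now or datetime.utcnow()
> > > >>         heap = [(when, path) for path, when in next_schedule.items()]
> > > >>         heapq.heapify(heap)
> > > >>         due = []
> > > >>         while heap and heap[0][0] <= now:
> > > >>             _, path = heapq.heappop(heap)
> > > >>             due.append(path)
> > > >>         return due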
> > > >>
> > > >> One gotcha to resolve after that is dealing with files getting updated
> > > >> with new DAGs, changed DAG definitions, renames, and different interval
> > > >> schedules.
> > > >>
> > > >> Worth a PR to glance over?
> > > >>
> > > >> Rgds,
> > > >>
> > > >> Gerard
> > > >>
> > >
> >
>
