airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Guziel <alex.guz...@airbnb.com.INVALID>
Subject Re: dag file processing times
Date Mon, 24 Apr 2017 22:27:57 GMT
It wouldn't really be serialization. You would still need to watch all the
dependent code unless you wanted a continuous parse going on.

On Mon, Apr 24, 2017 at 3:19 PM, Bolke de Bruin <bdbruin@gmail.com> wrote:

> That would be close to serialization which you could do with marshmallow
> (which works better than pickle).
>
> B.
>
> Sent from my iPhone
>
> > On 25 Apr 2017, at 00:07, Alex Guziel <alex.guziel@airbnb.com.INVALID>
> wrote:
> >
> > You can also use reflection in Python to read the modules all the way
> down.
> >
> > On Mon, Apr 24, 2017 at 3:05 PM, Dan Davydov <dan.davydov@airbnb.com.
> invalid
> >> wrote:
> >
> >> Was talking with Alex about the DB case offline, for those we could
> support
> >> a force refresh arg with an interval param.
> >>
> >> Manifests would need to be hierarchal but I feel like it would spin out
> >> into a full blown build system inevitably.
> >>
> >> On Mon, Apr 24, 2017 at 3:02 PM, Arthur Wiedmer <
> arthur.wiedmer@gmail.com>
> >> wrote:
> >>
> >>> What if the DAG actually depends on configuration that only exists in a
> >>> database and is retrieved by the Python code generating the DAG?
> >>>
> >>> Just asking because we have this case in production here. It is slowly
> >>> changing, so still fits within the Airflow framework, but you cannot
> just
> >>> watch a file...
> >>>
> >>> Best,
> >>> Arthur
> >>>
> >>> On Mon, Apr 24, 2017 at 2:55 PM, Bolke de Bruin <bdbruin@gmail.com>
> >> wrote:
> >>>
> >>>> Inotify can work without a daemon. Just fire a call to the API when
a
> >>> file
> >>>> changes. Just a few lines in bash.
> >>>>
> >>>> If you bundle you dependencies in a zip you should be fine with the
> >>> above.
> >>>> Or if we start using manifests that list the files that are needed in
> a
> >>>> dag...
> >>>>
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>>> On 24 Apr 2017, at 22:46, Dan Davydov <dan.davydov@airbnb.com.
> >> INVALID>
> >>>> wrote:
> >>>>>
> >>>>> One idea to solve this is to use a daemon that uses inotify to watch
> >>> for
> >>>>> changes in files and then reprocesses just those files. The hard
part
> >>> is
> >>>>> without any kind of dependency/build system for DAGs it can be hard
> >> to
> >>>> tell
> >>>>> which DAGs depend on which files.
> >>>>>
> >>>>> On Mon, Apr 24, 2017 at 1:21 PM, Gerard Toonstra <
> >> gtoonstra@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hey,
> >>>>>>
> >>>>>> I've seen some people complain about DAG file processing times.
An
> >>> issue
> >>>>>> was raised about this today:
> >>>>>>
> >>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1139
> >>>>>>
> >>>>>> I attempted to provide a good explanation what's going on. Feel
free
> >>> to
> >>>>>> validate and comment.
> >>>>>>
> >>>>>>
> >>>>>> I'm noticing that the file processor is a bit naive in the way
it
> >>>>>> reprocesses DAGs. It doesn't look at the DAG interval for example,
> >> so
> >>> it
> >>>>>> looks like it reprocesses all files continuously in one big
batch,
> >>> even
> >>>> if
> >>>>>> we can determine that the next "schedule"  for all its dags
are in
> >> the
> >>>>>> future?
> >>>>>>
> >>>>>>
> >>>>>> Wondering if a change in the DagFileProcessingManager could
optimize
> >>>> things
> >>>>>> a bit here.
> >>>>>>
> >>>>>> In the part where it gets the simple_dags from a file it's currently
> >>>>>> processing:
> >>>>>>
> >>>>>>               for simple_dag in processor.result:
> >>>>>>                   simple_dags.append(simple_dag)
> >>>>>>
> >>>>>> the file_path is in the context and the simple_dags should be
able
> >> to
> >>>>>> provide the next interval date for each dag in the file.
> >>>>>>
> >>>>>> The idea is to add files to a sorted deque by
> >> "next_schedule_datetime"
> >>>> (the
> >>>>>> minimum next interval date), so that when we build the list
> >>>>>> "files_paths_to_queue", it can remove files that have dags that
we
> >>> know
> >>>>>> won't have a new dagrun for a while.
> >>>>>>
> >>>>>> One gotcha to resolve after that is to deal with files getting
> >> updated
> >>>> with
> >>>>>> new dags or changed dag definitions and renames and different
> >> interval
> >>>>>> schedules.
> >>>>>>
> >>>>>> Worth a PR to glance over?
> >>>>>>
> >>>>>> Rgds,
> >>>>>>
> >>>>>> Gerard
> >>>>>>
> >>>>
> >>>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message