airflow-dev mailing list archives

From Arthur Wiedmer <arthur.wied...@gmail.com>
Subject Re: Experiences with 1.8.0
Date Mon, 23 Jan 2017 20:35:03 GMT
Chris,

Just double checking, you mean more than 15 seconds not 15 minutes, right?

Best,
Arthur

On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini <criccomini@apache.org>
wrote:

> Hey all,
>
> I've upgraded on production. Things seem to be working so far (only been an
> hour), but I am seeing this in the scheduler logs:
>
> File Path                            PID    Runtime  Last Runtime  Last Run
> -----------------------------------  -----  -------  ------------  -------------------
> ...
> /etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
> ...
>
> It seems to be taking more than 15 minutes to parse this DAG. Any idea
> what's causing this? Scheduler config:
>
> [scheduler]
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> max_threads = 2
> child_process_log_directory = /var/log/airflow/scheduler
>
> The db.py file, itself, doesn't interact with any outside systems, so I
> would have expected this to not take so long. It does, however,
> programmatically generate many DAGs within the single .py file.
>
> A snippet of the scheduler log is here:
>
> https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65
>
> Note how there are 10-15 second gaps occasionally. Any idea what's going
> on?
>
> Cheers,
> Chris
>
> On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
>
> > I created:
> >
> > - AIRFLOW-791: At start up all running dag_runs are being checked, but not
> >   fixed
> > - AIRFLOW-790: DagRuns do not exist for certain tasks, but don’t get fixed
> > - AIRFLOW-788: Context unexpectedly added to hive conf
> > - AIRFLOW-792: Allow fixing of schedule when wrong start_date / interval
> >   was specified
> >
> > I created AIRFLOW-789 to update UPDATING.md, it is also out as a PR.
> >
> > Please note that I don't consider any of these blockers for a release of
> > 1.8.0 and can be fixed in 1.8.1 - so we are still on track for an RC on Feb
> > 2. However if people are using a restarting scheduler (run_duration is set)
> > and have a lot of running dag runs they won’t like AIRFLOW-791. So a
> > workaround for this would be nice (we just updated dag_runs directly in the
> > database to say ‘finished’ before a certain date, but we are also not using
> > the run_duration).
> >
> > Bolke
> >
> >
> >
> > > On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbruin@gmail.com> wrote:
> > >
> > > Will do. And thanks.
> > >
> > > Adding another issue:
> > >
> > > * Some of our DAGs are not getting scheduled for some unknown reason.
> > > Need to investigate why.
> > >
> > > Related but not root cause:
> > > * Logging is so chatty that it gets really hard to find the real issue
> > >
> > > Bolke.
> > >
> > >> On 20 Jan 2017, at 23:45, Dan Davydov <dan.davydov@airbnb.com.INVALID> wrote:
> > >>
> > >> I'd be happy to lend a hand fixing these issues and hopefully some others
> > >> are too. Do you mind creating jiras for these since you have the full
> > >> context? I have created a JIRA for (1) and have assigned it to myself:
> > >> https://issues.apache.org/jira/browse/AIRFLOW-780
> > >>
> > >> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
> > >>
> > >>> This is to report back on some of the (early) experiences we have with
> > >>> Airflow 1.8.0 (beta 1 at the moment):
> > >>>
> > >>> 1. The UI does not show faulty DAG, leading to confusion for developers.
> > >>> When a faulty dag is placed in the dags folder the UI would report a
> > >>> parsing error. Now it doesn’t due to the separate parsing (but not
> > >>> reporting back errors)
> > >>>
> > >>> 2. The hive hook sets ‘airflow.ctx.dag_id’ in hive
> > >>> We run in a secure environment which requires this variable to be
> > >>> whitelisted if it is modified (needs to be added to UPDATING.md)
> > >>>
> > >>> 3. DagRuns do not exist for certain tasks, but don’t get fixed
> > >>> Log gets flooded without a suggestion what to do
> > >>>
> > >>> 4. At start up all running dag_runs are being checked, we seemed to have a
> > >>> lot of “left over” dag_runs (couple of thousand)
> > >>> - Checking was logged to INFO -> requires a fsync for every log message
> > >>> making it very slow
> > >>> - Checking would happen at every restart, but dag_runs’ states were not
> > >>> being updated
> > >>> - These dag_runs would never be marked anything else than running for
> > >>> some reason
> > >>> -> Applied work around to update all dag_run in sql before a certain date
> > >>> to -> finished
> > >>> -> need to investigate why dag_runs did not get marked “finished/failed”
> > >>>
> > >>> 5. Our umask is set to 027
> > >>>
> > >>>
> > >
> >
> >
>

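One way to narrow down the slow db.py parse reported above is to time the bare import of the file outside the scheduler. The following is only a minimal sketch using the Python standard library: the helper name time_module_import is hypothetical and not part of Airflow, and it assumes the file can be imported on its own. The scheduler's Runtime column may also include work beyond the import, so the numbers are not directly comparable.

```python
import importlib.util
import time


def time_module_import(path, name="timed_dag_file"):
    """Import the Python file at `path` and return the elapsed seconds.

    Hypothetical helper, not an Airflow API: it only measures how long
    the file's top-level code takes to run.
    """
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # executes the file's top-level code
    return time.perf_counter() - start


if __name__ == "__main__":
    import sys

    # Pass one or more DAG file paths on the command line.
    for dag_file in sys.argv[1:]:
        print(f"{dag_file}: {time_module_import(dag_file):.2f}s")
```

If the bare import is fast, the time is probably being spent elsewhere in the scheduler's handling of the file rather than in the DAG-generation code itself.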
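The workaround Bolke describes for AIRFLOW-791 (updating dag_runs directly in the database before a certain date) can be sketched roughly as below. This is an illustration under stated assumptions, not the statement that was actually run: it uses an in-memory SQLite stand-in table with assumed columns (dag_id, execution_date, state), a hypothetical helper close_stale_dag_runs, and the ‘finished’ label from the thread as the target state. Check the real schema and the state values your Airflow version expects before touching a production database.

```python
import sqlite3


def close_stale_dag_runs(conn, cutoff, new_state="finished"):
    """Set `new_state` on 'running' rows older than `cutoff`.

    Hypothetical helper; returns the number of rows updated.
    """
    cur = conn.execute(
        "UPDATE dag_run SET state = ? "
        "WHERE state = 'running' AND execution_date < ?",
        (new_state, cutoff),
    )
    conn.commit()
    return cur.rowcount


# Stand-in table with assumed columns; the real dag_run table may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [
        ("example", "2016-12-01T00:00:00", "running"),  # stale, gets updated
        ("example", "2017-01-20T00:00:00", "running"),  # after cutoff, untouched
        ("example", "2016-11-01T00:00:00", "failed"),   # not running, untouched
    ],
)
updated = close_stale_dag_runs(conn, "2017-01-01T00:00:00")
```

Restricting the update to rows still marked running keeps runs that already failed or completed out of the change.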