airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arthur Wiedmer <arthur.wied...@gmail.com>
Subject Re: Experiences with 1.8.0
Date Mon, 23 Jan 2017 23:45:14 GMT
Maybe we can start with
" .with_hint(TI, 'USE INDEX (PRIMARY)', dialect_name='mysql')"

and see if other databases exhibit the same query plan issue ?

Best,
Arthur

On Mon, Jan 23, 2017 at 3:27 PM, Chris Riccomini <criccomini@apache.org>
wrote:

> With this patch:
>
> $ git diff
> diff --git a/airflow/jobs.py b/airflow/jobs.py
> index f1de333..9d08e75 100644
> --- a/airflow/jobs.py
> +++ b/airflow/jobs.py
> @@ -544,6 +544,7 @@ class SchedulerJob(BaseJob):
>              .query(
>                  TI.task_id,
>                  func.max(TI.execution_date).label('max_ti'))
> +            .with_hint(TI, 'USE INDEX (PRIMARY)')
>              .filter(TI.dag_id == dag.dag_id)
>              .filter(TI.state == State.SUCCESS)
>              .filter(TI.task_id.in_(dag.task_ids))
>
> The db.py file parses in 90s. About 10x faster.
>
> Th slow_query log is not showing any offending queries anymore. I will open
> a JIRA for this. Note that with_hint stuff I put in is MySQL specific, I
> think. Any one have any other ideas on how to affect the equivalent
> outcome?
>
> On Mon, Jan 23, 2017 at 3:08 PM, Chris Riccomini <criccomini@apache.org>
> wrote:
>
> > OK, it's using `state` instead of PRIMARY. Using PRIMARY with a hint,
> > query takes .47s. Without hint, 10s. Going to try and patch.
> >
> > On Mon, Jan 23, 2017 at 2:57 PM, Chris Riccomini <criccomini@apache.org>
> > wrote:
> >
> >> This inner query takes 10s:
> >>
> >> SELECT task_instance.task_id AS task_id, max(task_instance.execution_
> date)
> >> AS max_ti
> >> FROM task_instance
> >> WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success'
> >> AND task_instance.task_id IN ('t1', 't2') GROUP BY task_instance.task_id
> >>
> >> Explain seems OK:
> >>
> >> +----+-------------+---------------+------+-----------------
> >> -----------------------------------+----------+---------+---
> >> ----+-------+--------------------------+
> >> | id | select_type | table         | type | possible_keys
> >>                      | key      | key_len | ref   | rows  | Extra
> >>          |
> >> +----+-------------+---------------+------+-----------------
> >> -----------------------------------+----------+---------+---
> >> ----+-------+--------------------------+
> >> |  1 | SIMPLE      | task_instance | ref  |
> PRIMARY,ti_dag_state,ti_pool,ti_state_lkp,ti_state
> >> | ti_state | 63      | const | 81898 | Using where; Using index |
> >> +----+-------------+---------------+------+-----------------
> >> -----------------------------------+----------+---------+---
> >> ----+-------+--------------------------+
> >>
> >> A look at the query indicates that it's using the state field as its
> >> index lookup, which isn't good.
> >>
> >> On Mon, Jan 23, 2017 at 2:49 PM, Chris Riccomini <criccomini@apache.org
> >
> >> wrote:
> >>
> >>> It's this query:
> >>>
> >>> SELECT task_instance.task_id AS task_instance_task_id,
> >>> task_instance.dag_id AS task_instance_dag_id,
> task_instance.execution_date
> >>> AS task_instance_execution_date, task_instance.start_date AS
> >>> task_instance_start_date, task_instance.end_date AS
> task_instance_end_date,
> >>> task_instance.duration AS task_instance_duration, task_instance.state
> AS
> >>> task_instance_state, task_instance.try_number AS
> task_instance_try_number,
> >>> task_instance.hostname AS task_instance_hostname,
> task_instance.unixname AS
> >>> task_instance_unixname, task_instance.job_id AS task_instance_job_id,
> >>> task_instance.pool AS task_instance_pool, task_instance.queue AS
> >>> task_instance_queue, task_instance.priority_weight AS
> >>> task_instance_priority_weight, task_instance.operator AS
> >>> task_instance_operator, task_instance.queued_dttm AS
> >>> task_instance_queued_dttm, task_instance.pid AS task_instance_pid
> >>> FROM task_instance, (SELECT task_instance.task_id AS task_id,
> >>> max(task_instance.execution_date) AS max_ti
> >>> FROM task_instance
> >>> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.state =
> >>> 'success' AND task_instance.task_id IN ('t1', 't2', 't3', 't4', 't5',
> 't6')
> >>> GROUP BY task_instance.task_id) AS sq
> >>> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.task_id =
> >>> sq.task_id AND task_instance.execution_date = sq.max_ti
> >>>
> >>> Taking 10-20s
> >>>
> >>> On Mon, Jan 23, 2017 at 2:05 PM, Chris Riccomini <
> criccomini@apache.org>
> >>> wrote:
> >>>
> >>>> Can confirm it's a slow query on task_instance table. Still digging.
> >>>> Unfortunately, the query is truncated in my UI right now:
> >>>>
> >>>> SELECT task_instance.task_id AS task_instance_...
> >>>>
> >>>> On Mon, Jan 23, 2017 at 1:56 PM, Chris Riccomini <
> criccomini@apache.org
> >>>> > wrote:
> >>>>
> >>>>> Digging. Might be a bit.
> >>>>>
> >>>>> On Mon, Jan 23, 2017 at 1:32 PM, Bolke de Bruin <bdbruin@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Slow query log? Db load?
> >>>>>>
> >>>>>> B.
> >>>>>>
> >>>>>> Verstuurd vanaf mijn iPad
> >>>>>>
> >>>>>> > Op 23 jan. 2017 om 21:59 heeft Chris Riccomini <
> >>>>>> criccomini@apache.org> het volgende geschreven:
> >>>>>> >
> >>>>>> > Note: 6.5 million TIs in the task_instance table.
> >>>>>> >
> >>>>>> > On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini <
> >>>>>> criccomini@apache.org>
> >>>>>> > wrote:
> >>>>>> >
> >>>>>> >> Hey Bolke,
> >>>>>> >>
> >>>>>> >> Re: system usage, it's pretty quiet <5% CPU usage.
Mem is almost
> >>>>>> all free
> >>>>>> >> as well.
> >>>>>> >>
> >>>>>> >> I am thinking that this is DB related, given that it's
pausing
> when
> >>>>>> >> executing an update. Was looking at the update_state
method in
> >>>>>> models.py,
> >>>>>> >> which logs right before the 15s pause.
> >>>>>> >>
> >>>>>> >> Cheers,
> >>>>>> >> Chris
> >>>>>> >>
> >>>>>> >> On Mon, Jan 23, 2017 at 12:51 PM, Bolke de Bruin <
> >>>>>> bdbruin@gmail.com>
> >>>>>> >> wrote:
> >>>>>> >>
> >>>>>> >>> Hi Chris,
> >>>>>> >>>
> >>>>>> >>> What is the load on your machine? (CPU/IO/MEM)
It seems that the
> >>>>>> executor
> >>>>>> >>> is faster in checking the state than the TaskInstance
is able to
> >>>>>> exit
> >>>>>> >>> itself. No, I don’t consider it normal, but it
was sometimes
> >>>>>> reported (
> >>>>>> >>> https://github.com/apache/incubator-airflow/pull/1821)
though
> >>>>>> not really
> >>>>>> >>> replicable as of yet.
> >>>>>> >>>
> >>>>>> >>> Parsing seems exceptionally slow, it might be worth
looking at
> it
> >>>>>> with a
> >>>>>> >>> debugger. Very faint guess might be that something
with the
> >>>>>> multiprocessing
> >>>>>> >>> part could do something with memory that is costly,
but then we
> >>>>>> need to
> >>>>>> >>> know more about what is running on the system.
Never clue of
> >>>>>> system metrics
> >>>>>> >>> could be helpful here.
> >>>>>> >>>
> >>>>>> >>> Bolke
> >>>>>> >>>
> >>>>>> >>>> On 23 Jan 2017, at 21:34, Chris Riccomini <
> criccomini@apache.org
> >>>>>> >
> >>>>>> >>> wrote:
> >>>>>> >>>>
> >>>>>> >>>> Also, seeing this in EVERY task that runs:
> >>>>>> >>>>
> >>>>>> >>>> [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING
- State of
> this
> >>>>>> >>>> instance has been externally set to queued.
Taking the poison
> >>>>>> pill. So
> >>>>>> >>>> long.
> >>>>>> >>>> [2017-01-23 20:26:13,841] {jobs.py:2051} INFO
- Task exited
> with
> >>>>>> return
> >>>>>> >>> code 0
> >>>>>> >>>>
> >>>>>> >>>>
> >>>>>> >>>> All successful tasks are showing this at the
end of their logs.
> >>>>>> Is this
> >>>>>> >>>> normal?
> >>>>>> >>>>
> >>>>>> >>>> On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini
<
> >>>>>> >>> criccomini@apache.org>
> >>>>>> >>>> wrote:
> >>>>>> >>>>
> >>>>>> >>>>> Hey all,
> >>>>>> >>>>>
> >>>>>> >>>>> I've upgraded on production. Things seem
to be working so far
> >>>>>> (only
> >>>>>> >>> been
> >>>>>> >>>>> an hour), but I am seeing this in the scheduler
logs:
> >>>>>> >>>>>
> >>>>>> >>>>> File Path
> >>>>>> >>> PID
> >>>>>> >>>>> Runtime    Last Runtime    Last Run
> >>>>>> >>>>> ------------------------------------------------------------
> >>>>>> ------
> >>>>>> >>> -----
> >>>>>> >>>>> ---------  --------------  -------------------
> >>>>>> >>>>> ...
> >>>>>> >>>>> /etc/airflow/dags/dags/elt/el/db.py
> >>>>>> >>> 24793
> >>>>>> >>>>> 43.41s     986.63s         2017-01-23T20:04:09
> >>>>>> >>>>> ...
> >>>>>> >>>>>
> >>>>>> >>>>> It seems to be taking more than 15 minutes
to parse this DAG.
> >>>>>> Any idea
> >>>>>> >>>>> what's causing this? Scheduler config:
> >>>>>> >>>>>
> >>>>>> >>>>> [scheduler]
> >>>>>> >>>>> job_heartbeat_sec = 5
> >>>>>> >>>>> scheduler_heartbeat_sec = 5
> >>>>>> >>>>> max_threads = 2
> >>>>>> >>>>> child_process_log_directory = /var/log/airflow/scheduler
> >>>>>> >>>>>
> >>>>>> >>>>> The db.py file, itself, doesn't interact
with any outside
> >>>>>> systems, so I
> >>>>>> >>>>> would have expected this to not take so
long. It does,
> however,
> >>>>>> >>>>> programmatically generate many DAGs within
the single .py
> file.
> >>>>>> >>>>>
> >>>>>> >>>>> A snippet of the scheduler log is here:
> >>>>>> >>>>>
> >>>>>> >>>>> https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669
> >>>>>> e8ffd65
> >>>>>> >>>>>
> >>>>>> >>>>> Note how there are 10-15 second gaps occasionally.
Any idea
> >>>>>> what's
> >>>>>> >>> going
> >>>>>> >>>>> on?
> >>>>>> >>>>>
> >>>>>> >>>>> Cheers,
> >>>>>> >>>>> Chris
> >>>>>> >>>>>
> >>>>>> >>>>> On Sun, Jan 22, 2017 at 4:42 AM, Bolke
de Bruin <
> >>>>>> bdbruin@gmail.com>
> >>>>>> >>> wrote:
> >>>>>> >>>>>
> >>>>>> >>>>>> I created:
> >>>>>> >>>>>>
> >>>>>> >>>>>> - AIRFLOW-791: At start up all running
dag_runs are being
> >>>>>> checked, but
> >>>>>> >>>>>> not fixed
> >>>>>> >>>>>> - AIRFLOW-790: DagRuns do not exist
for certain tasks, but
> >>>>>> don’t get
> >>>>>> >>> fixed
> >>>>>> >>>>>> - AIRFLOW-788: Context unexpectedly
added to hive conf
> >>>>>> >>>>>> - AIRFLOW-792: Allow fixing of schedule
when wrong
> start_date /
> >>>>>> >>> interval
> >>>>>> >>>>>> was specified
> >>>>>> >>>>>>
> >>>>>> >>>>>> I created AIRFLOW-789 to update UPDATING.md,
it is also out
> as
> >>>>>> a PR.
> >>>>>> >>>>>>
> >>>>>> >>>>>> Please note that I don't consider any
of these blockers for a
> >>>>>> release
> >>>>>> >>> of
> >>>>>> >>>>>> 1.8.0 and can be fixed in 1.8.1 - so
we are still on track
> for
> >>>>>> an RC
> >>>>>> >>> on Feb
> >>>>>> >>>>>> 2. However if people are using a restarting
scheduler
> >>>>>> (run_duration
> >>>>>> >>> is set)
> >>>>>> >>>>>> and have a lot of running dag runs
they won’t like
> >>>>>> AIRFLOW-791. So a
> >>>>>> >>>>>> workaround for this would be nice (we
just updated dag_runs
> >>>>>> directly
> >>>>>> >>> in the
> >>>>>> >>>>>> database to say ‘finished’ before
a certain date, but we are
> >>>>>> also not
> >>>>>> >>> using
> >>>>>> >>>>>> the run_duration).
> >>>>>> >>>>>>
> >>>>>> >>>>>> Bolke
> >>>>>> >>>>>>
> >>>>>> >>>>>>
> >>>>>> >>>>>>
> >>>>>> >>>>>>> On 20 Jan 2017, at 23:55, Bolke
de Bruin <bdbruin@gmail.com
> >
> >>>>>> wrote:
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> Will do. And thanks.
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> Adding another issue:
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> * Some of our DAGs are not getting
scheduled for some
> unknown
> >>>>>> reason.
> >>>>>> >>>>>>> Need to investigate why.
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> Related but not root cause:
> >>>>>> >>>>>>> * Logging is so chatty that it
gets really hard to find the
> >>>>>> real
> >>>>>> >>> issue
> >>>>>> >>>>>>>
> >>>>>> >>>>>>> Bolke.
> >>>>>> >>>>>>>
> >>>>>> >>>>>>>> On 20 Jan 2017, at 23:45, Dan
Davydov <
> >>>>>> dan.davydov@airbnb.com
> >>>>>> >>> .INVALID>
> >>>>>> >>>>>> wrote:
> >>>>>> >>>>>>>>
> >>>>>> >>>>>>>> I'd be happy to lend a hand
fixing these issues and
> >>>>>> hopefully some
> >>>>>> >>>>>> others
> >>>>>> >>>>>>>> are too. Do you mind creating
jiras for these since you
> have
> >>>>>> the
> >>>>>> >>> full
> >>>>>> >>>>>>>> context? I have created a JIRA
for (1) and have assigned it
> >>>>>> to
> >>>>>> >>> myself:
> >>>>>> >>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-780
> >>>>>> >>>>>>>>
> >>>>>> >>>>>>>> On Fri, Jan 20, 2017 at 1:01
AM, Bolke de Bruin <
> >>>>>> bdbruin@gmail.com>
> >>>>>> >>>>>> wrote:
> >>>>>> >>>>>>>>
> >>>>>> >>>>>>>>> This is to report back
on some of the (early) experiences
> >>>>>> we have
> >>>>>> >>> with
> >>>>>> >>>>>>>>> Airflow 1.8.0 (beta 1 at
the moment):
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 1. The UI does not show
faulty DAG, leading to confusion
> for
> >>>>>> >>>>>> developers.
> >>>>>> >>>>>>>>> When a faulty dag is placed
in the dags folder the UI
> would
> >>>>>> report
> >>>>>> >>> a
> >>>>>> >>>>>>>>> parsing error. Now it doesn’t
due to the separate parising
> >>>>>> (but not
> >>>>>> >>>>>>>>> reporting back errors)
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 2. The hive hook sets ‘airflow.ctx.dag_id’
in hive
> >>>>>> >>>>>>>>> We run in a secure environment
which requires this
> variable
> >>>>>> to be
> >>>>>> >>>>>>>>> whitelisted if it is modified
(needs to be added to
> >>>>>> UPDATING.md)
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 3. DagRuns do not exist
for certain tasks, but don’t get
> >>>>>> fixed
> >>>>>> >>>>>>>>> Log gets flooded without
a suggestion what to do
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 4. At start up all running
dag_runs are being checked, we
> >>>>>> seemed to
> >>>>>> >>>>>> have a
> >>>>>> >>>>>>>>> lot of “left over”
dag_runs (couple of thousand)
> >>>>>> >>>>>>>>> - Checking was logged to
INFO -> requires a fsync for
> every
> >>>>>> log
> >>>>>> >>>>>> message
> >>>>>> >>>>>>>>> making it very slow
> >>>>>> >>>>>>>>> - Checking would happen
at every restart, but dag_runs’
> >>>>>> states were
> >>>>>> >>>>>> not
> >>>>>> >>>>>>>>> being updated
> >>>>>> >>>>>>>>> - These dag_runs would
never er be marked anything else
> than
> >>>>>> >>> running
> >>>>>> >>>>>> for
> >>>>>> >>>>>>>>> some reason
> >>>>>> >>>>>>>>> -> Applied work around
to update all dag_run in sql
> before a
> >>>>>> >>> certain
> >>>>>> >>>>>> date
> >>>>>> >>>>>>>>> to -> finished
> >>>>>> >>>>>>>>> -> need to investigate
why dag_runs did not get marked
> >>>>>> >>>>>> “finished/failed”
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>> 5. Our umask is set to
027
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>>>
> >>>>>> >>>>>>>
> >>>>>> >>>>>>
> >>>>>> >>>>>>
> >>>>>> >>>>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message