airflow-dev mailing list archives

From Chris Riccomini <criccom...@apache.org>
Subject Re: Experiences with 1.8.0
Date Wed, 25 Jan 2017 20:08:50 GMT
Hey all,

I have sent in a PR and JIRA here:

https://github.com/apache/incubator-airflow/pull/2021
https://issues.apache.org/jira/browse/AIRFLOW-807

Please have a look.

EDIT: I see Arthur just did haha.

Cheers,
Chris

On Tue, Jan 24, 2017 at 9:41 PM, Chris Riccomini <criccomini@apache.org>
wrote:

> @Max, ran both ANALYZE/OPTIMIZE. Didn't help. EXPLAIN still tries to use
> the `state` index. :(
>
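> For reference, a minimal sketch (hypothetical connection string, and an
> illustrative query rather than the scheduler's actual SQL) of checking which
> index MySQL picks for a state-filtered query against the metadata DB:
>
>     # sketch only: DSN, table and filter are illustrative
>     from sqlalchemy import create_engine, text
>
>     engine = create_engine("mysql://airflow:airflow@localhost/airflow")
>     with engine.connect() as conn:
>         plan = conn.execute(text(
>             "EXPLAIN SELECT * FROM task_instance "
>             "WHERE dag_id = :dag AND state = 'queued'"
>         ), {"dag": "example_dag"})
>         for row in plan:
>             print(row)
>
> If the plan keeps choosing the `state` index even after ANALYZE/OPTIMIZE, an
> index hint (FORCE INDEX) on that query would at least confirm whether the
> index choice is the culprit.
>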
> On Tue, Jan 24, 2017 at 5:54 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
>
>> I have looked into the issue and it is harmless. What happens is that a
>> TaskInstance writes “success” to the database and the monitoring catches
>> this change before the process has exited. It wrongly reports the old state
>> (i.e. queued) because self.task_instance is not updated. I have opened
>> AIRFLOW-798, but do not consider it a blocker for 1.8.0.
>>
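>> As an aside, a minimal generic sketch (plain SQLAlchemy on SQLite, not
>> Airflow's actual code or schema) of the stale-object effect described above:
>> the row is already 'success' in the database, but an object loaded earlier
>> in another session keeps reporting the old state until it is refreshed.
>>
>>     from sqlalchemy import Column, Integer, String, create_engine
>>     from sqlalchemy.ext.declarative import declarative_base
>>     from sqlalchemy.orm import sessionmaker
>>
>>     Base = declarative_base()
>>
>>     class TI(Base):                          # stand-in for a task instance row
>>         __tablename__ = "ti"
>>         id = Column(Integer, primary_key=True)
>>         state = Column(String(20))
>>
>>     engine = create_engine("sqlite:///stale_demo.db")   # throwaway demo db
>>     Base.metadata.create_all(engine)
>>     Session = sessionmaker(bind=engine)
>>
>>     writer, monitor = Session(), Session()
>>     writer.query(TI).delete()                # clean slate on re-runs
>>     writer.add(TI(id=1, state="queued"))
>>     writer.commit()
>>
>>     cached = monitor.query(TI).get(1)        # monitor's in-memory copy: 'queued'
>>     writer.query(TI).get(1).state = "success"
>>     writer.commit()                          # the task process writes 'success'
>>
>>     print(cached.state)                      # still 'queued': object not updated
>>     monitor.refresh(cached)                  # re-read the row from the database
>>     print(cached.state)                      # now 'success'
>>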
>> - Bolke
>>
>>
>> > On 24 Jan 2017, at 10:38, Bolke de Bruin <bdbruin@gmail.com> wrote:
>> >
>> > Hey Chris,
>> >
>> > Could you dive into the below a bit more? I don’t like that the
>> > LocalTaskJob is saying the external state is set to queued, although it
>> > may just be that the monitoring does not take the queued state into
>> > account, which it should (still, I am wondering why it happens after the
>> > task has finished - maybe db locking interferes). I also see it with my
>> > tasks, so I will dive in myself as well.
>> >
>> > Bolke
>> >
>> >> On 23 Jan 2017, at 21:34, Chris Riccomini <criccomini@apache.org> wrote:
>> >>
>> >> Also, seeing this in EVERY task that runs:
>> >>
>> >> [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this instance has been externally set to queued. Taking the poison pill. So long.
>> >> [2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with return code 0
>> >>
>> >>
>> >> All successful tasks are showing this at the end of their logs. Is this
>> >> normal?
>> >>
>> >> On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini <criccomini@apache.org> wrote:
>> >>
>> >>> Hey all,
>> >>>
>> >>> I've upgraded on production. Things seem to be working so far (only
>> >>> been an hour), but I am seeing this in the scheduler logs:
>> >>>
>> >>> File Path                             PID    Runtime    Last Runtime    Last Run
>> >>> ------------------------------------  -----  ---------  --------------  -------------------
>> >>> ...
>> >>> /etc/airflow/dags/dags/elt/el/db.py   24793  43.41s     986.63s         2017-01-23T20:04:09
>> >>> ...
>> >>>
>> >>> It seems to be taking more than 15 minutes to parse this DAG. Any idea
>> >>> what's causing this? Scheduler config:
>> >>>
>> >>> [scheduler]
>> >>> job_heartbeat_sec = 5
>> >>> scheduler_heartbeat_sec = 5
>> >>> max_threads = 2
>> >>> child_process_log_directory = /var/log/airflow/scheduler
>> >>>
>> >>> The db.py file itself doesn't interact with any outside systems, so I
>> >>> would have expected it not to take so long. It does, however,
>> >>> programmatically generate many DAGs within the single .py file.
>> >>>
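>> >>> For context, the file follows roughly this pattern (a simplified sketch
>> >>> with made-up table names and ids, not the real db.py): one module builds
>> >>> many DAG objects in a loop and registers them in globals() so the
>> >>> scheduler picks them all up from the single file.
>> >>>
>> >>>     # sketch only: the table list and commands are illustrative
>> >>>     from datetime import datetime
>> >>>
>> >>>     from airflow import DAG
>> >>>     from airflow.operators.bash_operator import BashOperator
>> >>>
>> >>>     TABLES = ["accounts", "orders", "events"]   # in reality, many more
>> >>>
>> >>>     for table in TABLES:
>> >>>         dag = DAG(
>> >>>             dag_id="el_db_%s" % table,
>> >>>             start_date=datetime(2017, 1, 1),
>> >>>             schedule_interval="@hourly",
>> >>>         )
>> >>>         BashOperator(
>> >>>             task_id="extract_load",
>> >>>             bash_command="echo extract and load %s" % table,
>> >>>             dag=dag,
>> >>>         )
>> >>>         globals()[dag.dag_id] = dag      # expose each DAG at module level
>> >>>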
>> >>> A snippet of the scheduler log is here:
>> >>>
>> >>> https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65
>> >>>
>> >>> Note how there are 10-15 second gaps occasionally. Any idea what's
>> >>> going on?
>> >>>
>> >>> Cheers,
>> >>> Chris
>> >>>
>> >>> On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
>> >>>
>> >>>> I created:
>> >>>>
>> >>>> - AIRFLOW-791: At start up all running dag_runs are being checked, but
>> >>>>   not fixed
>> >>>> - AIRFLOW-790: DagRuns do not exist for certain tasks, but don’t get
>> >>>>   fixed
>> >>>> - AIRFLOW-788: Context unexpectedly added to hive conf
>> >>>> - AIRFLOW-792: Allow fixing of schedule when wrong start_date / interval
>> >>>>   was specified
>> >>>>
>> >>>> I created AIRFLOW-789 to update UPDATING.md; it is also out as a PR.
>> >>>>
>> >>>> Please note that I don't consider any of these blockers for a release of
>> >>>> 1.8.0; they can be fixed in 1.8.1, so we are still on track for an RC on
>> >>>> Feb 2. However, if people are using a restarting scheduler (run_duration
>> >>>> is set) and have a lot of running dag runs, they won’t like AIRFLOW-791.
>> >>>> So a workaround for this would be nice (we just updated dag_runs directly
>> >>>> in the database to say ‘finished’ before a certain date, but we are also
>> >>>> not using the run_duration).
>> >>>>
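>> >>>> For anyone who needs the same workaround, a minimal sketch (hypothetical
>> >>>> connection string and cutoff date; the message above says ‘finished’,
>> >>>> while ‘success’ is used here as the terminal value - adjust to whatever
>> >>>> your dag_run states should end up as, and back up the metadata DB first):
>> >>>>
>> >>>>     from sqlalchemy import create_engine, text
>> >>>>
>> >>>>     engine = create_engine("mysql://airflow:airflow@localhost/airflow")
>> >>>>     with engine.begin() as conn:     # runs the UPDATE in a transaction
>> >>>>         conn.execute(text(
>> >>>>             "UPDATE dag_run SET state = 'success' "
>> >>>>             "WHERE state = 'running' AND execution_date < :cutoff"
>> >>>>         ), {"cutoff": "2017-01-01"})
>> >>>>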
>> >>>> Bolke
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbruin@gmail.com> wrote:
>> >>>>>
>> >>>>> Will do. And thanks.
>> >>>>>
>> >>>>> Adding another issue:
>> >>>>>
>> >>>>> * Some of our DAGs are not getting scheduled for some unknown reason.
>> >>>>>   Need to investigate why.
>> >>>>>
>> >>>>> Related but not root cause:
>> >>>>> * Logging is so chatty that it gets really hard to find the real issue
>> >>>>>
>> >>>>> Bolke.
>> >>>>>
>> >>>>>> On 20 Jan 2017, at 23:45, Dan Davydov <dan.davydov@airbnb.com.INVALID> wrote:
>> >>>>>>
>> >>>>>> I'd be happy to lend a hand fixing these issues and hopefully some
>> >>>>>> others are too. Do you mind creating jiras for these since you have
>> >>>>>> the full context? I have created a JIRA for (1) and have assigned it
>> >>>>>> to myself:
>> >>>>>> https://issues.apache.org/jira/browse/AIRFLOW-780
>> >>>>>>
>> >>>>>> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
>> >>>>>>
>> >>>>>>> This is to report back on some of the (early) experiences we have
>> >>>>>>> with Airflow 1.8.0 (beta 1 at the moment):
>> >>>>>>>
>> >>>>>>> 1. The UI does not show a faulty DAG, leading to confusion for
>> >>>>>>> developers. When a faulty dag is placed in the dags folder, the UI
>> >>>>>>> used to report a parsing error. Now it doesn’t, due to the separate
>> >>>>>>> parsing (which does not report errors back).
>> >>>>>>>
>> >>>>>>> 2. The hive hook sets ‘airflow.ctx.dag_id’ in hive.
>> >>>>>>> We run in a secure environment which requires this variable to be
>> >>>>>>> whitelisted if it is modified (needs to be added to UPDATING.md); see
>> >>>>>>> the note after this list.
>> >>>>>>>
>> >>>>>>> 3. DagRuns do not exist for certain tasks, but don’t get fixed.
>> >>>>>>> The log gets flooded without a suggestion of what to do.
>> >>>>>>>
>> >>>>>>> 4. At start up all running dag_runs are being checked; we seemed to
>> >>>>>>> have a lot of “left over” dag_runs (a couple of thousand).
>> >>>>>>> - Checking was logged to INFO -> requires an fsync for every log
>> >>>>>>>   message, making it very slow
>> >>>>>>> - Checking would happen at every restart, but dag_runs’ states were
>> >>>>>>>   not being updated
>> >>>>>>> - These dag_runs would never be marked anything else than running
>> >>>>>>>   for some reason
>> >>>>>>> -> Applied a workaround to update all dag_runs in SQL before a
>> >>>>>>>    certain date to ‘finished’
>> >>>>>>> -> Need to investigate why dag_runs did not get marked
>> >>>>>>>    “finished/failed”
>> >>>>>>>
>> >>>>>>> 5. Our umask is set to 027
>> >>>>>>>
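>> >>>>>>> (Regarding 2: with HiveServer2's SQL standard based authorization,
>> >>>>>>> whitelisting typically means appending a pattern for the Airflow
>> >>>>>>> context variables to the hiveconf whitelist, e.g. something like
>> >>>>>>>
>> >>>>>>>     hive.security.authorization.sqlstd.confwhitelist.append=airflow\.ctx\..*
>> >>>>>>>
>> >>>>>>> in the Hive configuration. The exact property and pattern are an
>> >>>>>>> assumption here and depend on how the cluster is secured.)
>> >>>>>>>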
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >
>>
>>
>
