airflow-commits mailing list archives

From "Tyrone Hinderson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AIRFLOW-366) SchedulerJob gets locked up when when child processes attempt to log to single file
Date Tue, 26 Jul 2016 21:59:20 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394620#comment-15394620 ]

Tyrone Hinderson edited comment on AIRFLOW-366 at 7/26/16 9:58 PM:
-------------------------------------------------------------------

+1

I have observed this as well. At some point the scheduler will log `Starting {} scheduler jobs` (line ~754) and then go quiet.
Meanwhile, a child scheduler process can repeatedly be observed sleeping:

{code}
# ps jx
  PPID    PID   PGID    SID TTY       TPGID STAT   UID   TIME COMMAND
     5     94      1      1 ?            -1 Rl       0 4409:23 /usr/bin/python /usr/local/bin/airflow scheduler
    94   6776      1      1 ?            -1 S        0   0:00 /usr/bin/python /usr/local/bin/airflow scheduler   <<< this one
  7021   7030   7030   7021 ?          7030 R+       0   0:00 ps jx
{code}

These processes pop up occasionally, but are typically short-lived. Killing the stuck child process
causes the scheduler to spring back to life, declaring:

{code}
{jobs.py:741} INFO - Done queuing tasks, calling the executor's heartbeat
{jobs.py:744} INFO - Loop took: {} seconds
{code}

I wonder whether this is the same [issue|https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.o6nhckkbb]
for which `num_runs` is recommended as a workaround.


> SchedulerJob gets locked up when when child processes attempt to log to single file
> -----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-366
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-366
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>            Reporter: Greg Neiheisel
>
> After running the scheduler for a while (usually after 1 - 5 hours) it will eventually lock up, and nothing will get scheduled.
> A `SchedulerJob` will end up getting stuck in the `while` loop around line 730 of `airflow/jobs.py`.
> From what I can tell this is related to logging from within a forked process using Python's multiprocessing module.
> The job will fork off some child processes to process the DAGs, but one (or more) will end up getting stuck and not terminating, resulting in the while loop getting hung up. You can `kill -9 PID` the child process manually, and the loop will end and the scheduler will go on its way, until it happens again.
> The issue is due to usage of the logging module from within the child processes. From what I can tell, logging to a file from multiple processes is not supported by the multiprocessing module, but it is supported with Python multithreading, using some sort of locking mechanism.
> I think a child process will somehow inherit a logger that is locked right when it is forked, resulting in the process completely locking up.
> I went in and commented out all the logging statements that could possibly be hit by the child process (jobs.py, models.py), and was able to keep the scheduler alive.
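
A minimal sketch of the hazard described above, assuming CPython's default fork start method on Linux (the file path, logger name, and timings are illustrative, not Airflow code):

{code}
# Illustrative only: a process forked while another thread holds a logging
# handler's lock can deadlock on its first log call in the child.
import logging
import multiprocessing
import threading
import time

logging.basicConfig(filename="/tmp/demo.log", level=logging.INFO)
log = logging.getLogger("demo")

def chatty():
    # Keep the FileHandler's internal lock busy from a background thread.
    while True:
        log.info("background noise")

def child_task():
    # If the fork happened while the handler lock was held, this call blocks
    # forever in the child -- there is no thread left to release the lock.
    log.info("hello from child")

if __name__ == "__main__":
    threading.Thread(target=chatty, daemon=True).start()
    time.sleep(0.1)
    for _ in range(100):
        p = multiprocessing.Process(target=child_task)
        p.start()
        p.join(timeout=5)
        if p.is_alive():
            print("child %d appears hung, like the sleeping process above" % p.pid)
            p.terminate()
            break
{code}

The child inherits the handler's lock in whatever state it was in at fork time; if it happened to be held, the child's first log call never returns, which matches the quietly sleeping child process in the `ps` output above.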



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
