airflow-commits mailing list archives

From "Kevin Gao (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AIRFLOW-401) scheduler gets stuck without a trace
Date Wed, 09 Nov 2016 18:24:58 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651649#comment-15651649 ]

Kevin Gao edited comment on AIRFLOW-401 at 11/9/16 6:24 PM:
------------------------------------------------------------

We seem to be running into a similar issue on both 1.7.0 and 1.7.1.3 when using the LocalExecutor.
I assume this is expected behavior, but it was definitely an unfortunate surprise.
My current thinking is that we should switch to the CeleryExecutor so that the
scheduler is independent of the executors. In case it helps anyone, my investigation
notes are below.

From what I can tell, the scheduler may have run through its last iteration for the life
of the process, but is waiting for its child processes to complete (the local executors executing
the long-running tasks). As a result, no further tasks can be scheduled until the
long-running task has completed.
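
To illustrate the blocking pattern, here is a minimal Python sketch. It is not Airflow's code: {{run_and_wait}} and the sleeping child processes are hypothetical stand-ins for the scheduler and its task processes. The point is only that a parent which waits on every child stays blocked for as long as the slowest child runs.

```python
import subprocess
import sys
import time


def run_and_wait(durations):
    """Start one child per duration, then block until all of them exit.

    Returns the number of seconds between starting the first child and
    the last child being waited on. Hypothetical helper, for illustration.
    """
    started = time.monotonic()
    children = [
        # Each child stands in for one task process; it only sleeps.
        subprocess.Popen([sys.executable, "-c", f"import time; time.sleep({d})"])
        for d in durations
    ]
    for child in children:
        # wait() blocks, so the parent can do no other work in the meantime.
        child.wait()
    return time.monotonic() - started
```

For example, {{run_and_wait([0.1, 0.1, 5.0])}} keeps the parent busy for at least five seconds, even though two of the three children finished almost immediately.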

Relevant configs:
{code:ini}
[core]
executor = LocalExecutor
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = False
max_active_runs_per_dag = 16

[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
{code}

The scheduler is run using upstart for {{-n 5}} iterations.

Some symptoms:
- No logs being produced by scheduler
- The scheduler appears to be blocked on a long-running task
- 31 of the 32 airflow child processes are listed as defunct
- Killing the long-running tasks allows the scheduler to become "unstuck". The scheduler then
seems to finish its final iteration, and is then respawned by upstart.
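
For anyone who wants to check for the defunct-children symptom without reading {{pstree}} output by eye, here is a small Python sketch. It assumes the text comes from {{ps -eo pid,ppid,stat,comm}} on Linux, where a {{STAT}} value starting with {{Z}} marks a zombie (defunct) process. The function name and the sample rows are made up for illustration.

```python
# Hypothetical sample in the format of `ps -eo pid,ppid,stat,comm`.
SAMPLE_PS_OUTPUT = """\
  PID  PPID STAT COMMAND
 4984     1 Sl   airflow
 4990  4984 Z    airflow <defunct>
 4991  4984 Z    airflow <defunct>
 4992  4984 S    airflow
 5086  4992 S    airflow
"""


def count_defunct_children(ps_output, parent_pid):
    """Return (defunct, total) for the direct children of parent_pid."""
    total = 0
    defunct = 0
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue
        ppid, stat = fields[1], fields[2]
        if int(ppid) != parent_pid:
            continue
        total += 1
        if stat.startswith("Z"):
            # Z = zombie: the process has exited but has not been reaped.
            defunct += 1
    return defunct, total
```

In practice the first argument would be the captured output of {{ps}}, and the second the PID of the root scheduler process.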

Here is the output from {{pstree}}:
{code}
─airflow,4984 usr/local/bin/airflow scheduler -n 5
   ├─(airflow,4990)
   ├─(airflow,4991)
   ├─airflow,4992 usr/local/bin/airflow scheduler -n 5
   │   └─airflow,5086 /usr/local/bin/airflow run dag_name 2016-11-09T01:20:00 --local -sd DAGS_FOLDER/dag_name.py
   │       └─airflow,5092 /usr/local/bin/airflow run dag_name dag_name 2016-11-09T01:20:00 --job_id 582112 --raw -sd DAGS_FOLDER/dag_name.py
   │           └─bash,5102 /tmp/airflowtmpOyW_H1/dag_nameRf_OMJ
   │               ├─sudo,5105 -u someuser node /path/to/some_script.js
   │               │   └─node,5107 /path/to/some_script.js
   │               │       ├─{node},5109
   │               │       ├─{node},5110
   │               │       ├─{node},5111
   │               │       ├─{node},5112
   │               │       ├─{node},5113
   │               │       ├─{node},5114
   │               │       ├─{node},5115
   │               │       └─{node},5116
   │               └─sudo,5106 -u someuser tee -a /var/log/some/log/file.log
   │                   └─tee,5108 -a /var/log/some/log/file.log
   ├─(airflow,4993)
   ├─(airflow,4994)
   ├─(airflow,4995)
   ├─(airflow,4996)
   ├─(airflow,4997)
   ├─(airflow,4998)
   ├─(airflow,4999)
   ├─(airflow,5000)
   ├─(airflow,5001)
   ├─(airflow,5002)
   ├─(airflow,5003)
   ├─(airflow,5004)
   ├─(airflow,5005)
   ├─(airflow,5006)
   ├─(airflow,5007)
   ├─(airflow,5008)
   ├─(airflow,5009)
   ├─(airflow,5010)
   ├─(airflow,5011)
   ├─(airflow,5012)
   ├─(airflow,5013)
   ├─(airflow,5014)
   ├─(airflow,5015)
   ├─(airflow,5016)
   ├─(airflow,5017)
   ├─(airflow,5018)
   ├─(airflow,5019)
   ├─(airflow,5020)
   ├─(airflow,5021)
   └─{airflow},5029
{code}

Running strace on process 4992 shows that it is waiting for its child process to terminate: {{wait4(5086,}}.
Running strace on process 4984, the root process, shows that it is also blocked, presumably
waiting for a child process to change state: {{futex(0x7f7ac5efc000, FUTEX_WAIT, 0, NULL}}.
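
As a rough illustration of the difference between a blocking wait and a non-blocking one, here is a minimal Python sketch for Linux. On Linux, {{os.waitpid}} typically shows up in strace as {{wait4}}. The function {{poll_then_block}} is hypothetical and is not taken from Airflow.

```python
import os
import time


def poll_then_block(child_seconds):
    """Fork a sleeping child, poll it once with WNOHANG, then block on it.

    Returns (pid seen by the non-blocking poll, whether the blocking wait
    reaped the child, the child's exit status).
    """
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for a long-running task.
        try:
            time.sleep(child_seconds)
        finally:
            os._exit(0)
    # Non-blocking check: reports pid 0 while the child is still running.
    polled_pid, _ = os.waitpid(pid, os.WNOHANG)
    # Blocking wait: the parent is stuck here until the child terminates.
    reaped_pid, status = os.waitpid(pid, 0)
    return polled_pid, reaped_pid == pid, os.WEXITSTATUS(status)
```

The blocking call is the situation process 4992 appears to be in: nothing after it runs until the child exits.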

Here is more complete strace output that I captured on a previous occasion when the scheduler was
hung in this state (so the PIDs differ from those above):
{code}
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0x7f9eb8000ce0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9ec9141000, FUTEX_WAKE, 1)    = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9ec9141000, FUTEX_WAKE, 1)    = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9ec9141000, FUTEX_WAKE, 1)    = 0
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGCHLD (Child exited) @ 0 (0) ---
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bc5840, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL

########################################################
# At this point I manually killed the long running task: sudo kill 25116 #
########################################################

futex(0x7f9ec913f000, FUTEX_WAKE, 1)    = 1
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "\0\0\0c", 4)                   = 4
read(6, "\200\2U!DAG0_NAME"..., 99) = 99
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "\0\0\0h", 4)                   = 4
read(6, "\200\2U&DAG1_NAME"..., 104) = 104
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "\0\0\0V", 4)                   = 4
read(6, "\200\2U\24DAG2_NAME"..., 86) = 86
select(7, [6], NULL, NULL, {0, 0})      = 1 (in [6], left {0, 0})
read(6, "\0\0\0f", 4)                   = 4
read(6, "\200\2U%DAG3_NAME"..., 102) = 102
select(7, [6], NULL, NULL, {0, 0})      = 0 (Timeout)
munmap(0x7f9ec9138000, 32)              = 0
close(9)                                = 0
munmap(0x7f9ec913a000, 32)              = 0
close(8)                                = 0
munmap(0x7f9ec9139000, 32)              = 0
gettimeofday({1478679655, 152452}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
gettimeofday({1478679655, 153818}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 40) = 40
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)                = 5
read(3, "redacted"..., 41) = 41
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 43) = 43
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)                = 5
read(3, "redacted"..., 90) = 90
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 420) = 420
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)              = 5
read(3, "redacted"..., 535) = 535
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 195) = 195
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)                = 5
read(3, "redacted"..., 44) = 44
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
write(3, "redacted"..., 41) = 41
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
read(3, "redacted", 5)                = 5
read(3, "redacted"..., 42) = 42
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0x7f9eb8000d00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17aff40, FUTEX_WAKE_PRIVATE, 1) = 1
wait4(25019, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25019
wait4(25021, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25021
wait4(25030, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25030
wait4(25020, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25020
wait4(25015, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25015
wait4(25017, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25017
wait4(25006, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25006
wait4(25012, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25012
wait4(25033, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25033
wait4(25007, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25007
wait4(25008, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25008
wait4(25009, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25009
wait4(25031, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25031
wait4(25026, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25026
wait4(25013, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25013
wait4(25011, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25011
wait4(25010, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25010
wait4(25022, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25022
wait4(25018, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25018
wait4(25029, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25029
wait4(25025, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25025
wait4(25024, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25024
wait4(25027, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25027
wait4(25034, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25034
wait4(25028, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25028
wait4(25023, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25023
wait4(25014, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25014
wait4(25032, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25032
wait4(25035, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25035
wait4(25037, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25037
wait4(25036, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25036
wait4(25016, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25016
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f9ec8da2cb0}, {0x5570f0, [], SA_RESTORER, 0x7f9ec8da2cb0}, 8) = 0
rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x7f9ec8da2cb0}, {0x5570f0, [], SA_RESTORER, 0x7f9ec8da2cb0}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL, [], SA_RESTORER, 0x7f9ec8da2cb0}, {0x5570f0, [], SA_RESTORER, 0x7f9ec8da2cb0}, 8) = 0
exit_group(0)                           = ?
{code}
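
In the trace above, 32 {{wait4}} calls return once the long-running task is killed, which matches {{parallelism = 32}}. To pull that kind of summary out of a saved trace, a small Python sketch like the following may help. It assumes the line formats shown above; the function name and the regular expression are illustrative and may need adjusting for other strace versions.

```python
import re

# Matches lines such as: wait4(25019, [...], WNOHANG, NULL) = 25019
WAIT4_RE = re.compile(r"^wait4\((\d+), .*\) = (\d+)$")

# A few lines copied from the trace above, used as sample input.
SAMPLE_TRACE = r"""
--- SIGCHLD (Child exited) @ 0 (0) ---
futex(0x7f9ec913e000, FUTEX_WAIT, 0, NULL) = ? ERESTARTSYS (To be restarted)
wait4(25019, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25019
wait4(25021, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 25021
exit_group(0)                           = ?
"""


def summarize_strace(text):
    """Return (number of SIGCHLD lines, list of PIDs reaped by wait4)."""
    sigchld_count = 0
    reaped = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("--- SIGCHLD"):
            sigchld_count += 1
            continue
        match = WAIT4_RE.match(line)
        # A wait4 call that returns the PID it asked about reaped that child.
        if match and match.group(1) == match.group(2):
            reaped.append(int(match.group(1)))
    return sigchld_count, reaped
```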


was (Author: sudowork@gmail.com):
We seem to be running into a similar issue on both versions 1.7.0 and 1.7.1.3. I'm wondering
if this behavior is expected, and we're just using the airflow scheduler + LocalExecutor incorrectly.
From what I can tell, the scheduler may have run through its last iteration for the life of
the process, but is waiting for its child processes to complete (the local executors executing
the long-running tasks). As a result, no further tasks can be scheduled until the
long-running task has completed. My current thinking is that we should switch to the CeleryExecutor
so that the scheduler is independent of the executors.


> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck.png, scheduler_stuck_7hours.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the
> scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks
> like it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the scheduler service.
> But again, after running some tasks, it gets stuck. I've tried both the Celery and Local executors,
> but the same issue occurs. I am using the -n 3 parameter when starting the scheduler.
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
