airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rick Otten (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AIRFLOW-401) scheduler gets stuck without a trace
Date Mon, 17 Jul 2017 13:28:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089818#comment-16089818
] 

Rick Otten edited comment on AIRFLOW-401 at 7/17/17 1:27 PM:
-------------------------------------------------------------

According to the logs, ever since we set it up, it restarts itself regularly.  Before we had
a lot of tasks, it was more often than since we added a lot of tasks.  This is what the logs
look like during a routine restart (which cleans up all of the dead children):


{code:plain text}
Jul 15 12:54:38 myserver airflow[4467]: [2017-07-15 12:54:38,253] {jobs.py:1313} INFO - Exited
execute loop
Jul 15 12:54:43 myserver systemd[1]: airflow-scheduler.service: Service hold-off time over,
scheduling restart.
Jul 15 12:54:43 myserver systemd[1]: Stopped Airflow scheduler daemon.
Jul 15 12:54:43 myserver systemd[1]: Started Airflow scheduler daemon.
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,364] {__init__.py:57} INFO -
Using executor LocalExecutor
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,450] {driver.py:120} INFO -
Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,476] {driver.py:120} INFO -
Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
Jul 15 12:54:44 myserver airflow[20289]:   ____________       _____________
Jul 15 12:54:44 myserver airflow[20289]:  ____    |__( )_________  __/__  /________      __
Jul 15 12:54:44 myserver airflow[20289]: ____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| /
/
Jul 15 12:54:44 myserver airflow[20289]: ___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
Jul 15 12:54:44 myserver airflow[20289]:  _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
Jul 15 12:54:44 myserver airflow[20289]:  
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,777] {jobs.py:1264} INFO - Starting
the scheduler
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1280} INFO - Processing
files using up to 4 processes at a time
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1282} INFO - Running
execute loop for -1 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1284} INFO - Processing
each file at most 100 times
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1286} INFO - Process
each file at most once every 120 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1288} INFO - Checking
for new files in /home/airflow/airflow/dags every 300 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1291} INFO - Searching
for files in /home/airflow/airflow/dags
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,779] {jobs.py:1294} INFO - There
are 20 files in /home/airflow/airflow/dags
Jul 15 12:54:44 myserver airflow[20289]: /home/airflow/airflow/lib/python3.5/site-packages/airflow/www/app.py:23:
FlaskWTFDeprecationWarning: "flask_wtf.CsrfProtect" has been renamed to "CSRFProtect" and
will be removed in 1.0.
Jul 15 12:54:44 myserver airflow[20289]:   csrf = CsrfProtect()
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,994] {jobs.py:1356} INFO - Resetting
state for orphaned tasks
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,008] {jobs.py:1366} INFO - Resetting
backup_clarivoypg 2017-07-14 06:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,018] {jobs.py:1366} INFO - Resetting
daily_web_response_processor 2017-07-14 09:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,023] {jobs.py:1366} INFO - Resetting
ping_pgbouncers 2017-07-15 09:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,027] {jobs.py:1407} INFO - Heartbeating
the process manager
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,032] {dag_processing.py:627}
INFO - Started a process (PID: 20462) to generate tasks for /home/airflow/airflow/dags/my_daily.py
- logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_daily.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,036] {dag_processing.py:627}
INFO - Started a process (PID: 20463) to generate tasks for /home/airflow/airflow/dags/my_hourly.py
- logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_hourly.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,042] {dag_processing.py:627}
INFO - Started a process (PID: 20464) to generate tasks for /home/airflow/airflow/dags/my_odd_job.py
- logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_odd_job.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,047] {dag_processing.py:627}
INFO - Started a process (PID: 20465) to generate tasks for /home/airflow/airflow/dags/my_other_odd_job.py
- logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_other_odd_job.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,049] {jobs.py:1443} INFO - Heartbeating
the executor
{code}



was (Author: rotten):
According to the logs, ever since we set it up, it restarts itself regularly.  Before we had
a lot of tasks, it was more often than since we added a lot of tasks.  This is what the logs
look like during a routine restart (which cleans up all of the dead children):


{code:plain text}
Jul 15 12:54:38 myserver airflow[4467]: [2017-07-15 12:54:38,253] {jobs.py:1313} INFO - Exited
execute loop
Jul 15 12:54:43 myserver systemd[1]: airflow-scheduler.service: Service hold-off time over,
scheduling restart.
Jul 15 12:54:43 myserver systemd[1]: Stopped Airflow scheduler daemon.
Jul 15 12:54:43 myserver systemd[1]: Started Airflow scheduler daemon.
Jul 15 12:54:43 myserver dd.collector[1540]: WARNING (disk.py:106): Unable to get disk metrics
for /sys/kernel/debug/tracing: [Errno 13] Permission denied: '/sys/kernel/debug/tracing'
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,364] {__init__.py:57} INFO -
Using executor LocalExecutor
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,450] {driver.py:120} INFO -
Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,476] {driver.py:120} INFO -
Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
Jul 15 12:54:44 myserver airflow[20289]:   ____________       _____________
Jul 15 12:54:44 myserver airflow[20289]:  ____    |__( )_________  __/__  /________      __
Jul 15 12:54:44 myserver airflow[20289]: ____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| /
/
Jul 15 12:54:44 myserver airflow[20289]: ___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
Jul 15 12:54:44 myserver airflow[20289]:  _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
Jul 15 12:54:44 myserver airflow[20289]:  
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,777] {jobs.py:1264} INFO - Starting
the scheduler
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1280} INFO - Processing
files using up to 4 processes at a time
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1282} INFO - Running
execute loop for -1 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1284} INFO - Processing
each file at most 100 times
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1286} INFO - Process
each file at most once every 120 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1288} INFO - Checking
for new files in /home/airflow/airflow/dags every 300 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1291} INFO - Searching
for files in /home/airflow/airflow/dags
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,779] {jobs.py:1294} INFO - There
are 20 files in /home/airflow/airflow/dags
Jul 15 12:54:44 myserver airflow[20289]: /home/airflow/airflow/lib/python3.5/site-packages/airflow/www/app.py:23:
FlaskWTFDeprecationWarning: "flask_wtf.CsrfProtect" has been renamed to "CSRFProtect" and
will be removed in 1.0.
Jul 15 12:54:44 myserver airflow[20289]:   csrf = CsrfProtect()
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,994] {jobs.py:1356} INFO - Resetting
state for orphaned tasks
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,008] {jobs.py:1366} INFO - Resetting
backup_clarivoypg 2017-07-14 06:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,018] {jobs.py:1366} INFO - Resetting
daily_web_response_processor 2017-07-14 09:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,023] {jobs.py:1366} INFO - Resetting
ping_pgbouncers 2017-07-15 09:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,027] {jobs.py:1407} INFO - Heartbeating
the process manager
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,032] {dag_processing.py:627}
INFO - Started a process (PID: 20462) to generate tasks for /home/airflow/airflow/dags/my_daily.py
- logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_daily.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,036] {dag_processing.py:627}
INFO - Started a process (PID: 20463) to generate tasks for /home/airflow/airflow/dags/my_hourly.py
- logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_hourly.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,042] {dag_processing.py:627}
INFO - Started a process (PID: 20464) to generate tasks for /home/airflow/airflow/dags/my_odd_job.py
- logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_odd_job.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,047] {dag_processing.py:627}
INFO - Started a process (PID: 20465) to generate tasks for /home/airflow/airflow/dags/my_other_odd_job.py
- logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_other_odd_job.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,049] {jobs.py:1443} INFO - Heartbeating
the executor
{code}


> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>              Labels: celery, kombu
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck_7hours.png,
scheduler_stuck.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of
scheduler service is at 100%. No jobs get submitted and everything comes to a halt. Looks
it goes into some kind of infinite loop. 
> The only way I could make it run again is by manually restarting the scheduler service.
But again, after running some tasks it gets stuck. I've tried with both Celery and Local executors
but same issue occurs. I am using the -n 3 parameter while starting scheduler. 
> Scheduler configs,
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message