airflow-commits mailing list archives

From "Rick Otten (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-401) scheduler gets stuck without a trace
Date Mon, 17 Jul 2017 13:11:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089802#comment-16089802 ]

Rick Otten commented on AIRFLOW-401:
------------------------------------

The scheduler restarts itself routinely.  I have no idea why, unless it is to clear these stale
child processes.  At the moment, even though I never run more than 10 tasks at a time, I've
bumped parallelism up to 128.  All 128 slots still get consumed while the database backup task
is running, although it takes a while to use them all up.  This seems like a poor workaround.
The real issue is that whatever handles the child exit codes is not detecting that child
processes have finished.
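
To illustrate the failure mode described above, here is a minimal sketch of a parent process
that hands out a fixed number of slots and only frees a slot once it has noticed that the
child exited.  This is not Airflow's executor code.  SlotPool, submit, reap and free_slots are
hypothetical names made up for this example, and the children are plain Python subprocesses
standing in for tasks.

```python
import subprocess
import sys


class SlotPool:
    """Hypothetical slot tracker for illustration; not Airflow's implementation."""

    def __init__(self, parallelism):
        self.parallelism = parallelism  # maximum number of concurrent children
        self.running = []               # Popen handles the parent believes are alive

    def free_slots(self):
        return self.parallelism - len(self.running)

    def submit(self, args):
        """Start a child if a slot is free; return False when all slots are taken."""
        if self.free_slots() <= 0:
            return False
        self.running.append(subprocess.Popen(args))
        return True

    def reap(self):
        """Collect exit codes of finished children and free their slots."""
        # poll() returns None while the child runs, otherwise its exit code.
        finished = [p for p in self.running if p.poll() is not None]
        self.running = [p for p in self.running if p.returncode is None]
        return [p.returncode for p in finished]


if __name__ == "__main__":
    pool = SlotPool(parallelism=2)
    cmd = [sys.executable, "-c", "pass"]  # stand-in for a short task
    pool.submit(cmd)
    pool.submit(cmd)
    for proc in pool.running:
        proc.wait()                # both children have exited by now
    print(pool.free_slots())       # slots stay occupied until reap() is called
    print(pool.reap())
    print(pool.free_slots())
```

If reap() is never called, or misses some exits, every finished child keeps holding a slot, so
raising parallelism only delays the point at which the slots run out.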


> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>              Labels: celery, kombu
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck_7hours.png, scheduler_stuck.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the
> scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks
> like it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the scheduler service.
> But after running some tasks, it gets stuck again. I've tried both the Celery and Local
> executors, but the same issue occurs. I am using the -n 3 parameter when starting the scheduler.
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
