airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bolke de Bruin <bdbr...@gmail.com>
Subject Re: Tasks Queued but never run
Date Fri, 09 Jun 2017 09:23:05 GMT
I have made PR https://github.com/apache/incubator-airflow/pull/2356 <https://github.com/apache/incubator-airflow/pull/2356>
for this. The issue went a little bit deeper than I expected. 

In the backfills we can loose tasks to execute due to a task
setting its own state to NONE if concurrency limits are reached,
this makes them fall outside of the scope the backfill is
managing hence they will not be executed.

Several bugs are the cause for this. Firstly, the state
reported by the executor was always reported as success, ie.
the return code of the task instance was not propagated.
Next to that, if the executor already has a task instance
in its queue it will silently ignore the task instance
being added. The backfills did not guard against this, thus
tasks could get lost here as well.

This patch introduces CONCURRENCY_REACHED as an executor
state, which will be set if the task exits with EBUSY (16).
This allows the backfill to properly handle these tasks
and reschedule them. Please note that the CeleryExecutor
does not report back on executor states.


Please test the patch and report back if it doesn/does not solve the issue.

Bolke.

> On 8 Jun 2017, at 04:23, Russell Pierce <russell.s.pierce@gmail.com> wrote:
> 
> I hadn't thought of it that way. Given that SubDAGs are scheduled as
> backfills, then they'd inherit the same problem. So, the issue I had is
> version specific. Thanks for pointing that out Bolke. Do you know the
> relevant JIRA Issue off hand?
> 
> On Wed, Jun 7, 2017, 4:28 PM Bolke de Bruin <bdbruin@gmail.com> wrote:
> 
>> It is 1.8.x specific in this case (for backfills).
>> 
>> Sent from my iPhone
>> 
>>> On 7 Jun 2017, at 21:35, Russell Pierce <russell.s.pierce@gmail.com>
>> wrote:
>>> 
>>> Probably more of a configuration constellation issue than version
>> specific
>>> or even an 'issue' per se. As noted, on restart the scheduler reschedules
>>> everything. I had a heavy SubDAG that when rescheduled could produce many
>>> extra tasks and a small fixed number of Celery workers. So, the scheduled
>>> tasks wouldn't be done by the time of the scheduler restart and then the
>>> scheduler would reschedule the SubDAG... debugging hilarity followed from
>>> there.
>>> 
>>>> On Wed, Jun 7, 2017, 10:57 AM Jason Chen <chingchien.chen@gmail.com>
>> wrote:
>>>> 
>>>> I am using Airflow 1.7.1.3 with CeleryExecutor, but not run into this
>>>> issue.
>>>> I am wondering if this issue is only for 1.8.x ?
>>>> 
>>>> On Wed, Jun 7, 2017 at 8:34 AM, Russell Pierce <
>> russell.s.pierce@gmail.com
>>>>> 
>>>> wrote:
>>>> 
>>>>> Depending on how fast you can clear down your queue, -n can be harmful
>>>> and
>>>>> really stack up your celery queue. Keep an eye on your queue depth of
>> you
>>>>> see a ton of messages about the task already having been run.
>>>>> 
>>>>> On Mon, Jun 5, 2017, 9:18 AM Josef Samanek <josef.samanek@kiwi.com>
>>>> wrote:
>>>>> 
>>>>>> Hey. Thanks for the answer. I previously also tried to run scheduler
>> -n
>>>>>> 10, but it was back when I was still using LocalExecutor. And it
did
>>>> not
>>>>>> help. I have not yet tried to do it with CeleryExecutor, so I might.
>>>>>> 
>>>>>> Still, I would prefer to find an actual solution for the underlying
>>>>>> problem, not just a workaround (eventhough a working workaround is
>> also
>>>>>> appreciated).
>>>>>> 
>>>>>> Best regards,
>>>>>> Joe
>>>>>> 
>>>>>> On 2017-06-02 00:10 (+0200), Alex Guziel <alex.guziel@airbnb.com.
>>>>> INVALID>
>>>>>> wrote:
>>>>>>> We've noticed this with celery, relating to this
>>>>>>> https://github.com/celery/celery/issues/3765
>>>>>>> 
>>>>>>> We also use `-n 5` option on the scheduler so it restarts every
5
>>>> runs,
>>>>>>> which will reset all queued tasks.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Alex
>>>>>>> 
>>>>>>> On Thu, Jun 1, 2017 at 2:18 PM, Josef Samanek <
>>>> josef.samanek@gmail.com
>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi!
>>>>>>>> 
>>>>>>>> We have a problem with our airflow. Sometimes, several tasks
get
>>>>> queued
>>>>>>>> but they never get run and remain in Queud state forever.
Other
>>>> tasks
>>>>>> from
>>>>>>>> the same schedule interval run. And next schedule interval
runs
>>>>>> normally
>>>>>>>> too. But these several tasks remain queued.
>>>>>>>> 
>>>>>>>> We are using Airflow 1.8.1. Currently with CeleryExecutor
and
>>>> redis,
>>>>>> but
>>>>>>>> we had the same problem with LocalExecutor as well (actually
>>>>> switching
>>>>>> to
>>>>>>>> Celery helped quite a bit, the problem now happens way less
often,
>>>>> but
>>>>>>>> still it happens). We have 18 DAGs total, 13 active. Some
have just
>>>>> 1-2
>>>>>>>> tasks, but some are more complex, like 8 tasks or so and
with
>>>>>> upstreams.
>>>>>>>> There are also ExternalTaskSensor tasks used.
>>>>>>>> 
>>>>>>>> I tried playing around with DAG configurations (limiting
>>>> concurrency,
>>>>>>>> max_active_runs, ...), tried switching off some DAGs completely
>>>> (not
>>>>>> all
>>>>>>>> but most) etc., so far nothing helped. Right now, I am not
really
>>>>> sure,
>>>>>>>> what else to try to identify a solve the issue.
>>>>>>>> 
>>>>>>>> I am getting a bit desperate, so I would really appreciate
any help
>>>>>> with
>>>>>>>> this. Thank you all in advance!
>>>>>>>> 
>>>>>>>> Joe
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message