airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Davydov <dan.davy...@airbnb.com.INVALID>
Subject Re: Make Scheduler More Centralized
Date Sat, 18 Mar 2017 00:27:32 GMT
I'm not convinced that this would add *that* much more load, we could
probably change this functionality now if we wanted to. Just my two cents.

On Thu, Mar 16, 2017 at 4:06 PM, Rui Wang <rui.wang@airbnb.com> wrote:

> Thanks all your comments!
>
> Then looks like we should focus on scalability of scheduler now rather
> than adding more load on it. I will give up this centralized idea now.
>
> On Tue, Mar 14, 2017 at 3:08 PM, Rui Wang <rui.wang@airbnb.com> wrote:
>
>> Hi,
>> The design doc below I created is trying to make airflow scheduler more
>> centralized. Briefly speaking, I propose moving state change of
>> TaskInstance to scheduler. You can see the reasons for this change below.
>>
>>
>> Could you take a look and comment if you see anything does not make sense?
>>
>> -Rui
>>
>> ------------------------------------------------------------
>> --------------------------------------
>> Current The state of TaskInstance is changed by both scheduler and
>> worker. On worker side, worker monitors TaskInstance and changes the state
>> to RUNNING, SUCCESS, if task succeed, or to UP_FOR_RETRY, FAILED if task
>> fail. Worker also does failure email logic and failure callback logic.
>> Proposal The general idea is to make a centralized scheduler and make
>> workers dumb. Worker should not change state of TaskInstance, but just
>> executes what it is assigned and reports the result of the task. Instead,
>> the scheduler should make the decision on TaskInstance state change.
>> Ideally, workers should not even handle the failure emails and callbacks
>> unless the scheduler asks it to do so.
>> Why Worker does not have as much information as scheduler has. There
>> were bugs observed caused by worker when worker gets into trouble but
>> cannot make decision to change task state due to lack of information.
>> Although there is airflow metadata DB, it is still not easy to share all
>> information that scheduler has with workers.
>>
>> We can also ensure a consistent environment. There are slight differences
>> in the chef recipes for the different workers which can cause strange
>> issues when DAGs parse on one but not the other.
>>
>> In the meantime, moving state changes to the scheduler can reduce the
>> complexity of airflow. It especially helps when airflow needs to move to
>> distributed schedulers. In that case state change everywhere by both
>> schedulers and workers are harder to maintain.
>> How to change After lots of discussions, following step will be done:
>>
>> 1. Add a new column to TaskInstance table. Worker will fill this column
>> with the task process exit code.
>>
>> 2. Worker will only set TaskInstance state to RUNNING when it is ready to
>> run task. There was debate on moving RUNNING to scheduler as well. If
>> moving RUNNING to scheduler, either scheduler marks TaskInstance RUNNING
>> before it gets into queue, or scheduler checks the status code in column
>> above, which is updated by worker when worker is ready to run task. In
>> Former case, from user's perspective, it is bad to mark TaskInstance as
>> RUNNING when worker is not ready to run. User could be confused. In the
>> latter case, scheduler could mark task as RUNNING late due to schedule
>> interval. It is still not a good user experience. Since only worker knows
>> when is ready to run task, worker should still deliver this message to user
>> by setting RUNNING state.
>>
>> 3. In any other cases, worker should not change state of TaskInstance,
>> but save defined status code into column above.
>>
>> 4. Worker still handles failure emails and callbacks because there were
>> concern that scheduler could use too much resource to run failure callbacks
>> given unpredictable callback sizes. ( I think ideally scheduler should
>> treat failure callbacks and emails as tasks, and assign such tasks to
>> workers after TaskInstance state changes correspondingly). Eventually this
>> logic will be moved to the workers once there is support for multiple
>> distributed schedulers.
>>
>> 5. In scheduler's loop, scheduler should check TaskInstance status code,
>> then change state and retry/fail TaskInstance correspondingly.
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message