airflow-commits mailing list archives

From "Stefan Seelmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-2747) Explicit re-schedule of sensors
Date Thu, 12 Jul 2018 20:51:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542191#comment-16542191
] 

Stefan Seelmann commented on AIRFLOW-2747:
------------------------------------------

[~pedromachado] Thanks for the feedback.

I added the contents of the task_fail and task_instance tables above; I hope that makes things clearer.

Regarding the colors:
 * The black bars are executions that requested a reschedule (i.e. the sensor raised an AirflowRescheduleException).
The start_date and end_date are the actual dates the sensor task ran; the reschedule_date
is the date it requested to be rescheduled. I borrowed the layout of the task_reschedule table
from the task_fail table and added the two additional columns.
 * The red bars are failures (which then triggered a retry). Those are recorded in the task_fail
table and are already shown this way in the gantt view today (in master and 1.10).

Regarding start_date before reschedule_date: I cannot see that problem. The start_date of
the next row (with the same sensor task_id) is always after the previous reschedule_date.
Note that the table contains rows for two sensors, s2 and s3.
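
The ordering claim above can be checked mechanically. The following is a minimal sketch, assuming rows shaped like (task_id, start_date, end_date, reschedule_date); the tuple layout, the function name find_ordering_violations, and the sample timestamps are invented for illustration and are not taken from the actual table or the patch. Rows are grouped per task_id first, because rows of s2 and s3 are interleaved.

```python
from collections import defaultdict
from datetime import datetime


def find_ordering_violations(rows):
    """Return (task_id, previous_reschedule_date, next_start_date) for every
    pair of consecutive rows of the same task where the next run started
    before the previously requested reschedule date."""
    by_task = defaultdict(list)
    for row in rows:
        by_task[row[0]].append(row)

    violations = []
    for task_id, task_rows in by_task.items():
        # Compare each row only with the next row of the same task.
        task_rows.sort(key=lambda r: r[1])
        for prev, nxt in zip(task_rows, task_rows[1:]):
            prev_reschedule_date = prev[3]
            next_start_date = nxt[1]
            if next_start_date < prev_reschedule_date:
                violations.append((task_id, prev_reschedule_date, next_start_date))
    return violations


# Invented sample data: two interleaved sensors, as in the discussion.
sample_rows = [
    ("s2", datetime(2018, 7, 12, 12, 0, 0), datetime(2018, 7, 12, 12, 0, 1), datetime(2018, 7, 12, 12, 1, 0)),
    ("s3", datetime(2018, 7, 12, 12, 0, 5), datetime(2018, 7, 12, 12, 0, 6), datetime(2018, 7, 12, 12, 0, 35)),
    ("s3", datetime(2018, 7, 12, 12, 0, 40), datetime(2018, 7, 12, 12, 0, 41), datetime(2018, 7, 12, 12, 1, 10)),
    ("s2", datetime(2018, 7, 12, 12, 1, 2), datetime(2018, 7, 12, 12, 1, 3), datetime(2018, 7, 12, 12, 2, 2)),
]
```

Reading the rows in plain table order can make it look as if a start_date precedes a reschedule_date (here the second s3 row starts before the s2 reschedule date), which is why the check groups by task_id before comparing.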

The way it is visualized (in the gantt view) can be changed. For example, there could be just
one bar from the first start_date to the last end_date, in light green while still in an unfinished
state, and dark green or red once successful or failed. I personally like the multiple bars to
see what happened when.

> Explicit re-schedule of sensors
> -------------------------------
>
>                 Key: AIRFLOW-2747
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2747
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: core, operators
>    Affects Versions: 1.9.0
>            Reporter: Stefan Seelmann
>            Assignee: Stefan Seelmann
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: Screenshot_2018-07-12_14-10-24.png
>
>
> By default sensors block a worker and just sleep between pokes. This is very inefficient,
especially when there are many long-running sensors.
> There is a hacky workaround by setting a small timeout value and a high retry number.
But that has drawbacks:
>  * Errors raised by sensors are hidden and the sensor retries too often
>  * The sensor is retried in a fixed time interval (with optional exponential backoff)
>  * There are many attempts and many log files are generated
>  I'd like to propose an explicit reschedule mechanism:
>  * A new "reschedule" flag for sensors, if set to True it will raise an AirflowRescheduleException
that causes a reschedule.
>  * AirflowRescheduleException contains the (earliest) re-schedule date.
>  * Reschedule requests are recorded in new `task_reschedule` table and visualized in
the Gantt view.
>  * A new TI dependency that checks if a sensor task is ready to be re-scheduled.
> Advantages:
>  * This change is backward compatible. Existing sensors behave like before. But it's
possible to set the "reschedule" flag.
>  * The poke_interval, timeout, and soft_fail parameters are still respected and used
to calculate the next schedule time.
>  * Custom sensor implementations can even define the next sensible schedule date by raising
AirflowRescheduleException themselves.
>  * Existing TimeSensor and TimeDeltaSensor can also be changed to be rescheduled when
the time is reached.
>  * This mechanism can also be used by non-sensor operators (but then the new ReadyToRescheduleDep
has to be added to deps or BaseOperator).
> Design decisions and caveats:
>  * When handling AirflowRescheduleException the `try_number` is decremented. That means
that subsequent runs use the same try number and write to the same log file.
>  * Sensor TI dependency check now depends on `task_reschedule` table. However only the
BaseSensorOperator includes the new ReadyToRescheduleDep for now.
> Open questions and TODOs:
>  * Should a dedicated state `UP_FOR_RESCHEDULE` be used instead of setting the state
back to `NONE`? This would require more changes in scheduler code and especially in the UI,
but the state of a task would be more explicit and more transparent to the user.
>  * Add example/test for a non-sensor operator
>  * Document the new feature
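
The proposed mechanism can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions and does not use Airflow or reproduce the patch: RescheduleRequest stands in for AirflowRescheduleException, ready_to_reschedule for the new TI dependency, and the in-memory list for the `task_reschedule` table. All names, the tick-based loop, and the timeout handling are hypothetical.

```python
from datetime import datetime, timedelta


class RescheduleRequest(Exception):
    """Hypothetical stand-in for AirflowRescheduleException."""

    def __init__(self, reschedule_date):
        super().__init__("reschedule at " + reschedule_date.isoformat())
        self.reschedule_date = reschedule_date


class SensorTimeout(Exception):
    """Raised when the sensor has been waiting longer than its timeout."""


def run_sensor_once(poke, first_start, now, poke_interval, timeout):
    """Poke once. Instead of sleeping, ask to be run again later."""
    if poke():
        return True
    if now - first_start >= timeout:
        raise SensorTimeout()
    # The earliest date at which the next poke makes sense.
    raise RescheduleRequest(now + poke_interval)


def ready_to_reschedule(reschedule_dates, now):
    """Dependency check: runnable if nothing was requested yet or the most
    recently requested reschedule date has been reached."""
    return not reschedule_dates or now >= reschedule_dates[-1]


def simulate(poke, start, poke_interval, timeout, tick=timedelta(seconds=10)):
    """Toy scheduler loop. Returns (number of runs, recorded reschedule dates)."""
    reschedule_dates = []  # plays the role of the task_reschedule table
    now = start
    runs = 0
    while True:
        if ready_to_reschedule(reschedule_dates, now):
            runs += 1
            try:
                run_sensor_once(poke, start, now, poke_interval, timeout)
                return runs, reschedule_dates
            except RescheduleRequest as request:
                reschedule_dates.append(request.reschedule_date)
        now += tick  # no worker slot is held while waiting


def make_poke(succeed_on_call):
    """Build a poke function that returns True from the given call onwards."""
    calls = {"count": 0}

    def poke():
        calls["count"] += 1
        return calls["count"] >= succeed_on_call

    return poke
```

In this sketch the worker is only occupied during run_sensor_once; between runs the scheduler loop merely re-evaluates the dependency check, which is the efficiency gain the proposal aims for.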



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
