airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From siddharth anand <>
Subject New Operator : LatestOnlyOperator
Date Wed, 28 Sep 2016 00:57:49 GMT
@gwax added the LatestOnlyOperator

This is a really nifty operator, so I wanted to let folks know about it. A
lot of people run cron for a mix of workloads. Some jobs map to traditional
ETL workloads (e.g. load hourly data summarization for 2016-10-01T00:00:00Z).
Some are simple cron tasks -- run a database backup every night. In the
latter case, if you miss 3 runs (e.g. your dag is paused or your start date
is a few days/weeks/months/ago), you don't want to make up for lost time
and backfill all of those days. Essentially, running N database backups at
once will take your database down... We'd prefer traditional cron behavior
in these cases, not ETL behavior.

*Enter the LatestOnlyOperator.*

Place this operator upstream of any tasks that you want to skip unless the
Dagrun is the latest. You can place a trigger rule downstream to "end" its
effect. By combining a Trigger Rule with this operator, you can ensure only
portions of your dag honor this "latest only" requirement. Or simply, have
an entire DAG run in "latest only" mode by using the LatestOnlyOperator
alone, i.e. not pairing it with a TriggerRule downstream.

This is a useful pattern that I have been coding around for some time by
using a ShortCircuitOperator with a python callable, where the callable
evaluates the "latest"-ness of the dag run. I suspect we have all been
re-inventing this wheel, which is where Airflow's Operators shine.

Thanks to @gwax for implementing this and sticking with a long and often
delayed review/merge process.


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message