airflow-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [airflow] dstandish commented on issue #6210: [AIRFLOW-5567] BaseReschedulePokeOperator
Date Mon, 25 Nov 2019 07:39:41 GMT
dstandish commented on issue #6210: [AIRFLOW-5567] BaseReschedulePokeOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-558030520
 
 
   > Having a state table will have a fundamental impact on the idempotency of the execution
of the tasks.
   
   It's optional to use such a thing.  Just like it is with XCom.  If you don't use it, nothing
is changed.
   
   > Why would the manual triggering of a dag introduce issues, the execution date will
be equal to the moment that it was triggered. I think it should work as well.
   
   Because execution_date is the run date minus one interval.  So, suppose I want to persist state
with XCom (which I do in many jobs), and I have a daily job running at midnight.  At the end
of each run, we push some value to XCom.  At the start of the next run, we retrieve the last value
and use it somehow. Consider this case:
   * run 1: 12am D1
   * run 2: manually triggered at 8am D1 (exec date is D1 8am; xcom retrieves from run 1)
   * run 3: 12am D2
   * run 4: 12am D3
   * run 5: 12am D4
   
   Outcome:
   * Run 3 will retrieve the XCom from run 1, because run 3's execution date (12am D1) is prior
to run 2's execution date (8am D1).
   * Run 4 retrieves the XCom from run 2 for the same reason: run 2's execution date is later
than run 3's.
   * Run 5 retrieves the XCom from run 4 (finally things are back in order); run 3's XCom is
never retrieved by any run.
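
The ordering above can be reproduced with a small standalone sketch. It does not call Airflow. It assumes the lookup rule is "take the value pushed by the already-completed run with the latest execution date strictly before mine", that scheduled runs fire one day apart, and that a manual run's execution date equals its trigger time. The names `runs`, `scheduled`, and `previous_xcom_source` are made up for illustration.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(days=1)
D1 = datetime(2019, 11, 1)  # arbitrary date standing in for "D1"


def scheduled(run_date):
    # Scheduled runs: execution date is the run date minus one interval.
    return run_date - INTERVAL


# (name, execution_date), listed in the order the runs actually happen.
runs = [
    ("run 1", scheduled(D1)),
    ("run 2", D1 + timedelta(hours=8)),  # manual: exec date == trigger time
    ("run 3", scheduled(D1 + timedelta(days=1))),
    ("run 4", scheduled(D1 + timedelta(days=2))),
    ("run 5", scheduled(D1 + timedelta(days=3))),
]


def previous_xcom_source(index):
    """Name of the completed run whose value this run would pick up (assumed rule)."""
    _, exec_date = runs[index]
    earlier = [(d, n) for n, d in runs[:index] if d < exec_date]
    return max(earlier)[1] if earlier else None


retrieved = {runs[i][0]: previous_xcom_source(i) for i in range(len(runs))}
for name, source in retrieved.items():
    print(name, "reads from", source)
```

Under these assumptions no run ever reads the value pushed by run 3, which is the gap described above.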
   
   The schedule interval edge PR would resolve the execution date ordering problem.  But if
XCom is cleared at the start of a task, it remains problematic as a mechanism for state persistence.
   
   > Since this will introduce such as a fundamental change to the way operators were intended,
being idempotent, I think it would be great to first start an AIP on the topic, so we can
have a clear and structured approach.
   
   An AIP sounds reasonable.  I am just a bit skeptical of the notion that this is some radical
change; I would be shocked if stateful processes were not already an extremely common usage
pattern.  Here the goal would be to provide better support for them out of the box.
   
   Airflow provides great support for a particular kind of "idempotent" task, but surely it
doesn't say this is the only way we can use it! 
   
   Anyway, I have occasionally rambled on dev list about these issues.  I am not sure what
the best solution is.  I wish there could be clearer and more generalized separation between
the concepts of "run date" and "interval of interest", but I am not sure what that should
look like.  But having a simple way to persist state would be of great immediate help to me,
and to this PR, incidentally.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
