airflow-dev mailing list archives

From Jeremiah Lowin <jlo...@apache.org>
Subject Contrib & Dataflow
Date Sat, 04 Feb 2017 18:45:46 GMT
Max made some great points on my dataflow PR, and I wanted to continue the
conversation here so that it is visible to all.

While I think my dataflow implementation contains the basic requirements
for any more complicated extension (but that conversation can wait!), I had
to implement it by adding some very specific "dataflow-only" code to core
Operator logic. In retrospect, that makes me pause (as, I believe, it did
for Max).

After thinking for a few days, what I really want to do is propose a very
small change to core Airflow: change BaseOperator.post_execute(context) to
BaseOperator.post_execute(result, context). I think the pre_execute and
post_execute hooks have generally been an afterthought, but with that
change (which, I think, is reasonable in and of itself) I could implement
dataflow entirely through those hooks.
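
To make the proposed signature concrete, here is a minimal sketch. It does
not import Airflow: the BaseOperator below is a stand-in, and run(),
DataflowMixin, AddOne and the "store" dict are hypothetical names invented
for illustration. It also assumes that "result" is whatever execute()
returns.

```python
# Stand-in classes for illustration only; this is NOT Airflow's real code.
class BaseOperator:
    def pre_execute(self, context):
        pass

    def execute(self, context):
        raise NotImplementedError

    def post_execute(self, result, context):
        # Proposed signature: the hook also receives the result
        # (assumed here to be the return value of execute()).
        pass

    def run(self, context):
        # Hypothetical driver that calls the hooks around execute().
        self.pre_execute(context)
        result = self.execute(context)
        self.post_execute(result, context)
        return result


class DataflowMixin:
    """Hypothetical extension that lives entirely in the hook."""

    store = {}  # stand-in for wherever results would be persisted

    def post_execute(self, result, context):
        DataflowMixin.store[context["task_id"]] = result


class AddOne(DataflowMixin, BaseOperator):
    def execute(self, context):
        return context["value"] + 1


op = AddOne()
returned = op.run({"task_id": "t1", "value": 41})
```

The point of the sketch is that nothing dataflow-specific touches the base
class: the extension only overrides a hook, which is what would let it live
outside core.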

So that brings me to my next point: if the hook is changed, I could happily
drop a reworked dataflow implementation into contrib, rather than core.
That would alleviate some of the pressure for Airflow to officially decide
whether it's the right implementation or not (it is! :) ). I feel like that
would be the optimal situation at the moment.

And that brings me to a broader point: the future of "contrib" and the
Airflow community.
Having contrib in the core Airflow repo has some advantages:
  - standardized access
  - centralized repository for PRs
  - at least a style review (if not unit tests) from the committers
But some big disadvantages as well:
  - very complicated dependency management (presumably, most contrib
    operators need to add an extras_require entry for their specific
    dependencies)
  - no sense of ownership, or even an easy way to raise issues (due to
    the friction of opening JIRA tickets vs. GitHub issues)
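
As a rough illustration of the dependency-management burden, the sketch
below shows the kind of extras table that grows with every contrib
operator. The extra names, the package names and the build_extras() helper
are all hypothetical; only the extras_require keyword of setuptools'
setup() is real.

```python
# Hypothetical extras and dependency names, for illustration only.
CONTRIB_EXTRAS = {
    "docker": ["example-docker-client>=1.0"],
    "dataflow": ["example-dataflow-lib>=0.1"],
}


def build_extras(extras):
    """Return a copy of extras plus an 'all' entry that unions every list."""
    combined = sorted({dep for deps in extras.values() for dep in deps})
    result = {name: list(deps) for name, deps in extras.items()}
    result["all"] = combined
    return result


EXTRAS_REQUIRE = build_extras(CONTRIB_EXTRAS)
# In a setup.py this dict would be passed as:
#     setup(..., extras_require=EXTRAS_REQUIRE)
```

Every new contrib operator with its own dependencies adds another entry
here, which is the coupling the options below try to move out of core.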

One thought is to move the contrib directory to its own repo, which would
keep the advantages but remove the disadvantages from core Airflow. Another
is to encourage individual Airflow repos (Airflow-Docker, Airflow-Dataflow,
Airflow-YourExtensionHere) which could be installed a la carte. That would
leave maintenance up to the original author, but could lead to some
fracturing in the community as discovery becomes difficult.
