airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian De Ruiter <julianderui...@godatadriven.com>
Subject Re: [2.0 spring cleaning] Deprecate contrib folder?
Date Mon, 15 Apr 2019 07:15:13 GMT
The downside of keeping everything in one gigantic codebase is that it also becomes a monster
in terms of dependencies and testing - something that Airflow is already experiencing issues
with. This is exactly also why I initially had troubles contributing to Airflow, as tests
are near impossible to run (and the CI also takes ages) + any hooks I tried to refactor had
crappy sensor/operator classes in contrib that relied on internal workings of the hook. (The
HdfsHook + Sensor classes are a clear example of this).

My experience with contrib is terrible, as the last time I used something from contrib it
broke on the very first example I tried. (Not the first time this happened, mind you.) Things
that do work are often poorly designed in terms of the implementation and/or API, of which
I would argue the HDFS sensor classes are a good example.

Based on these experiences, I tried developing a set of operators/hooks that we developed
for a client as a separate package and released this work recently on PyPI: https://github.com/jrderuiter/airflow-fs.
From the development experience, it was much easier to develop something like this separately
than trying to get this integrated into Airflow (just look at the relevant PRs to see the
discussions that resulted from trying to get it into Airflow core).

With maybe a bit more thought on how to organise work such as this within the larger community,
I think releasing operators/hooks separately in coherent, functionally-oriented packages would
have many benefits, including:

- easier development
- more flexibility to use different approaches/interaces for the task at hand
- the opportunity to release new operators etc separate from the main Airflow release cycle

Best,
Julian

On 14 Apr 2019, at 21:48, Austin Bennett <whatwouldaustindo@gmail.com<mailto:whatwouldaustindo@gmail.com>>
wrote:

It seems naming of contrib is used in practice with what I thought it would
be (as a novice), but incubator not bad either -- yes, it gets messy if
anything can get contributed.  If using for that purpose, maybe there is
some sort of timeframe for graduation or it winds up being removed?

Indeed, contrib is a place with hooks/operators/etc, that might work well
enough, but not wholly 'graduated'.  I have wondered what to do about the
promotion as was already mentioned, in terms of changing imports of dags.
But I guess that get's fixed pretty easily when imports break (and can have
contrib point to the graduated ones in certain places).

Things being part of the airflow repo is nice and shows that the community
overall supports.  Very much in favor of that approach, rather than
wondering whether some other pypi or other code is sufficient.  If I am
using Airflow, I then trust that community and what winds up in that
codebase ( and just one dev list to follow to stay up to date ).


On Sun, Apr 14, 2019 at 12:21 PM Driesprong, Fokko <fokko@driesprong.frl<mailto:fokko@driesprong.frl>>
wrote:

I'm in favor of removing the Contrib folder. It doesn't really add value in
my opinion, and moving the hooks/operators will break the import. While
DAG'ing I always have to look up if the operator is in contrib or not.

Also, I think we should keep the operators and hooks part of the Airflow
package. Having this separately will make the testing of the
hooks/operators much more complicated. That being said, I do think we need
to have more people on the project that "own" certain operators. Maybe keep
a list of the authors as well (or reintroduce the Mention-bot
https://github.com/facebookarchive/mention-bot ? Loved that bot).

Cheers, Fokko


Op za 13 apr. 2019 om 12:36 schreef Jarek Potiuk <Jarek.Potiuk@polidea.com<mailto:Jarek.Potiuk@polidea.com>
:

I think there are quite a few contrib parts that are at least on-par with
regards to code quality,  testing and especially documentation.

And yes, among those are GCP operators we developed which are not only
unit
but also system-tested and we put quite an effort into making
documentation
really useful and well structured ;).

I'd rather move those "graduated" operators/hooks to core and maybe
rename
the "contrib" folder to "incubator" or something like that to indicate
that
those operators are not yet "core-quality" but aspire to become one. That
would make a nice "intro" task for new contributors - to improve one of
the
incubating operators to become "core-ready".

I am quite sceptical myself about AIP-8 and separating out the hooks and
operators. There were already few discussions about that, but splitting
the
operators out might be quite difficult and it will only be possible if
there is some way to quickly test compatibility of those split operators
with various versions of Airflow and set of dependent requirements..
Otherwise it will very quickly become a mess - nobody will know which
version of Airflow is needed to run which operators and there will be
problems if someone will try to run different operators with different
requirements in the same DAG (and different versions of airflow).

Until we manage to isolate operators within the same DAG to potentially
use
different dependencies, this is straight road to dependency-hell.

One solution to that that I have in mind for some time (but this is very
long term) might be to make Airflow Docker-native and run every operator
within it's own separate Docker instance with it's own dependencies. That
would be quite possible to do (we would need to split operators into very
light "proxy" (basically current  __init__() - the part that is executed
within DAG scanning) and heavy "execute" parts (where the operator's
execute-related methods would be run in separate Docker on workers).

J.

On Sat, Apr 13, 2019 at 11:59 AM Felix Uellendall <
felix.uellendall@gmx.de<mailto:felix.uellendall@gmx.de>

wrote:

+1 on deprecating the contrib folder.

Bolk de Bruin the reason the core hooks and operators are properly
tested because, for example I added some more tests to it and I am
"only" a contributor.

So do you really want to split up contributors work and core committers
work? I personally think this is not the right way to go.

It is true that the contrib hooks and operators have not the same level
of code quality but we can do something about it. I am trying to
improve
our test coverage overall and add missing tests.

I don't think an extra step is needed here where we first move properly
tested ones into the core package and then moving new ones from time to
time into it. Wouldn't that mean that we think the code quality of
"contrib" (contributor) in general is worse than the code quality of
committers? Every new contributor who comes along this project would
think that, wouldn't he?

-feluelle

Am 13/04/2019 um 07:51 schrieb Beau Barker:
A separate airflow-contrib repo, on a separate release cadence would
be
my preference.


On 12 Apr 2019, at 11:17 pm, Julian De Ruiter <
julianderuiter@godatadriven.com> wrote:

Isn’t this in contradiction with AIP-8, which is aimed at removing
operators/hooks from the core Airflow package?

Personally I would rather remove hooks/operators from Airflow than
add
even more to the Airflow core. This counts double for the contrib
stuff,
which is often poorly designed and/or tested.

Best,
Julian

On 12 Apr 2019, at 10:23, Bolke de Bruin <bdbruin@gmail.com>
wrote:

That’s perfectly fine to me.

Verstuurd vanaf mijn iPad

Op 12 apr. 2019 om 10:20 heeft Kaxil Naik <kaxilnaik@gmail.com>
het
volgende geschreven:

Ok. How about moving the properly tested and maintained hooks/ops
from
contrib to core?

On Fri, Apr 12, 2019, 09:13 Bolke de Bruin <bdbruin@gmail.com>
wrote:

I disagree. Core signals “properly tested” and maintained. Ie. A
kind of
quality.  I don’t think contrib has that.

Verstuurd vanaf mijn iPad

Op 12 apr. 2019 om 10:03 heeft Kaxil Naik <kaxilnaik@gmail.com>
het
volgende geschreven:
Contrib folder was used when it was used at Airbnb. Currently,
it
doesn't
make any sense and we have equal responsibility to maintain all
the
hooks,
operators, sensors in contrib folder as we do for core.

I would suggest to remove contrib folder and move all hooks,
ops,
and
sensors to the core folder.

Or reorganize the folder structure similar to what was discussed
in
a
mailing thread few months ago.

Regards,
Kaxil



--

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
E: jarek.potiuk@polidea.com



Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message