airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Swast <sw...@google.com.INVALID>
Subject Re: AIP-8 Split Hooks/Operators into Separate Packages
Date Tue, 15 Jan 2019 18:36:54 GMT
I like this proposal to create an
"airflow-backward-compatibility-operators" package, which can be released
independently of "airflow", and then migrate to "micro-packages" after
that. I've updated AIP-8
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303>
to describe detail around creating an
"airflow-backward-compatibility-operators" package.

I'll leave the question of who will own the micro-packages for the
long-term desired state to a later AIP. The gist of that is an owner needs
to be identified for each hook/operator and then a similar migration +
deprecation will need to be done to each new micropackage.

*  •  **Tim Swast*
*  •  *Software Friendliness Engineer
*  •  *Google Cloud Developer Relations
*  •  *Seattle, WA, USA


On Thu, Jan 10, 2019 at 8:59 AM Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> A related though is around the fact that all these libs depend on Airflow
> itself to get the base class they are deriving (BaseHook and BaseOperator
> mostly). It's a bit upside down when the small library depends on a big
> library. That may be ok as is, but pushing the micro-package logic would
> dictate to break down `airflow-core`, which in turns calls for breaking
> down `airflow-scheduler` and `airflow-web`.
>
> All this is super disruptive work (breaks all open PRs), but to me is a
> much better place to be once we get there. Apache-wise that means more
> [though smaller] releases and more coordination. With more packages, we
> have to be good at defining the dependencies supported version ranges. My
> incline would be to have a single repo with multiple `setup.py` within it,
> so that PRs can touch multiple packages.
>
> We don't have to do all of this or all of this at once either, but the
> community should agree on a plan.
>
> Refactoring the hooks and operators out, a set at a time, seems like a
> really good start.
>
> Max
>
> On Thu, Jan 10, 2019 at 8:44 AM Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > That's not what I meant. If I apply what I meant to your example we'd
> have
> > a single package for each hook `airflow-hook-s3` and `airflow-hook-gcs`,
> > and a package for `airflow-operator-s3-to-gcs`. The operator package
> would
> > depend on both hook packages.
> >
> > There's no code or test duplication there. If fancy mocking solutions are
> > defined for tests, they should be exposed in the hook packages and can be
> > reused in the operator packages.
> >
> > Max
> >
> > On Wed, Jan 9, 2019 at 11:00 PM airflowuser
> > <airflowuser@protonmail.com.invalid> wrote:
> >
> >> @Max  I don't see how this is doable.
> >> Consider S3ToGoogleCloudStorageOperator
> >> It users both S3Hook and GoogleCloudStorageHook.
> >>
> >> With your suggestion we have to maintain S3Hook in each separated
> package
> >> per operator/sensor. Which means for example  if new parameter is added
> to
> >> any of the hooks you have to add it in dozes of places (+tests).
> >> This is very inconvenient.
> >>
> >>
> >> Sent with ProtonMail Secure Email.
> >>
> >> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> >> On Wednesday, January 9, 2019 9:29 PM, Maxime Beauchemin <
> >> maximebeauchemin@gmail.com> wrote:
> >>
> >> > If there's a strict policy of having a single hook and a single
> operator
> >> > per package, then the hook package would be the only place where the
> >> > external dependency is defined, and the operator packages would depend
> >> on
> >> > hook package(s). That would follow the "micro package" philosophy and
> >> could
> >> > work pretty well. Every hook and operator can have its own set of
> >> > maintainers, test/CI and release cadence.
> >> >
> >> > There can even be packages composing common operators as in
> >> > "airflow-hadoop-operators", or "airflow-common-databases-operators",
> and
> >> > for backward compatibility, During a transition phase, Airflow could
> >> depend
> >> > on a new "airflow-backward-compatibility-operators", though ultimately
> >> we
> >> > should encourage people to come up with the right packages/operators
> for
> >> > their environment instead.
> >> >
> >> > Max
> >> >
> >> > On Wed, Jan 9, 2019 at 11:12 AM Felix Uellendall
> >> felix.uellendall@gmx.de
> >> > wrote:
> >> >
> >> > > Regardless of how complex this implementation would be I am +1 on
> >> this.
> >> > > From the developer's point of view that the CI would run so much
> >> faster
> >> > > is the biggest plus for me. I think It will only become worse the
> more
> >> > > dependencies we add.
> >> > > From the user's point of view that I am able to choose from multiple
> >> > > packages/repositories only the ones I really want to use and I know
> >> that
> >> > > any contributor of this repo probably can answer my questions
> related
> >> to
> >> > > this hooks/operators is the biggest plus for me.
> >> > > I know there would be a lot to do and this gives me headaches even
> now
> >> > > but in the end I think it would be a great change that is necessary
> in
> >> > > the long run.
> >> > >
> >> > > -   feluelle
> >> > >
> >> > > Am 08/01/2019 um 17:42 schrieb Jarek Potiuk:
> >> > >
> >> > > > While splitting the monolithical Airfllow architecture to pieces
> >> sounds
> >> > > > good, there is one problem that might be difficult to tackle
(or
> >> rather
> >> > > > impossible unless we change architecture of Airflow
> significantly) -
> >> > > > namely
> >> > > > dependencies/requirements.
> >> > > > The way Airflow uses operators is that its operators are already
> >> closely
> >> > > > coupled with Airflow core. Airflow has to parse all the operators
> >> within
> >> > > > the same python interpreter/virtual machine as the core Airflow.
> >> This
> >> > > > means
> >> > > > potentially big problem with dependency/requirement handling
if we
> >> have
> >> > > > multiple packages. There are enough common/shared dependencies
> that
> >> > > > various
> >> > > > operators use even now to cause occasional headaches even now.
We
> >> already
> >> > > > have quite a challenge with handling dependencies of Airflow
and
> its
> >> > > > operators/hooks when they are part of Airflow repo.
> >> > > > Currently the problem is that Ariflow sometimes uses outdated
> >> > > > dependencies
> >> > > > or that some random transient dependencies break Airflow
> >> installation.
> >> > > > But
> >> > > > at we at least have a common dependency list that we work against
> >> for all
> >> > > > operators. Unfortunately if we split, then the problem will be
> >> worse -
> >> > > > very
> >> > > > quickly some contrib operators will require different dependencies
> >> and
> >> > > > will
> >> > > > not be compatible with Airflow or break Airflow's behaviour.
> >> > > > Not mentioning the problem when you want to use hooks from some
> >> other
> >> > > > "area" in your operator. Currently Hooks are the way how you
can
> >> speed up
> >> > > > development of cross-are behaviour. You implement hooks in some
> >> "area"
> >> > > > and
> >> > > > other "areas" are free or even encouraged to use them. For example
> >> > > > exporting from BigQuery to all cloud storages in principle should
> >> depend
> >> > > > on
> >> > > > Hooks for every single Cloud Storage package out there (Google,
> >> Azure,
> >> > > > AWS). This is even worse than the MySqLToHive case described
> >> earlier -
> >> > > > very
> >> > > > quickly we would end up with totally unmanageable mesh of
> >> > > > cross-dependencies.
> >> > > > I think to really make Operators independent from Airflow core,
we
> >> would
> >> > > > need to allow the dependencies to be fully isolated - i.e. to
> allow
> >> > > > operators to have different set of dependencies than the core.
> >> That's
> >> > > > quite
> >> > > > impossible with the current Airflow approach where the same
> >> operator code
> >> > > > is parsed in the core. And the same code is used during execute
in
> >> > > > worker.
> >> > > > And the same code might be used by another operator in form of
> hook.
> >> > > > Unfortunately we are not in the npm
> >> > > > https://npm.github.io/how-npm-works-docs/index.html world (as
> Kamil
> >> > > > Breguła pointed to me today) where the module loaded handles
> >> multiple
> >> > > > versions of the same library in the same process.
> >> > > > One other questions that bothers me - I believe (please correct
me
> >> If I
> >> > > > am
> >> > > > wrong) some of the operators are using some core features of
> >> Airflow and
> >> > > > are even more tied with the Core. For example it is perfectly
fine
> >> for
> >> > > > the
> >> > > > operator to use SQL Alchemy ORM classes of Airflow and run
> >> > > > queries/perform
> >> > > > updates in the metadata database of Airflow, I believe - as far
as
> >> I know
> >> > > > there is a requirement (I saw this somewhere at least) that Celery
> >> or
> >> > > > Kubernetes workers need to be able to open a direct database
> >> connection
> >> > > > to
> >> > > > the metadata database of Airflow and there is nothing to prevent
> the
> >> > > > operators to do it. This in essence means that the operator has
to
> >> depend
> >> > > > on many core dependencies/requirements including sqlalchemy,
> >> > > > postgres/mysql
> >> > > > ....). This can be changed and "forbidden" to use Airflow's core
> >> features
> >> > > > but it might break compatibility (If I am right about it).
> >> > > > We could imagine a different approach - where operator is split
> to a
> >> > > > "Proxy" and "Execute" classes. "Proxy" within Core's interpreter
> >> with
> >> > > > Core's dependencies, and "Executor" within the worker case. Then
> >> each
> >> > > > task
> >> > > > could run in its own Docker image/Pod on Kubernetes with its
own
> >> > > > dependencies. But that looks like big, backwards-incompatible
> >> change and
> >> > > > it
> >> > > > still does not solve cross-dependencies between different "areas".
> >> For
> >> > > > handling cross-area operations we would somehow implement
> >> communication
> >> > > > between different containers - each having own dependencies.
That
> >> would
> >> > > > be
> >> > > > possible in Kubernetes by having single POD with several
> containers
> >> > > > sharing
> >> > > > the common data and communicating. Seems possible.
> >> > > > It's quite an entertaining idea, but it sounds like Airflow 3.0
> >> already
> >> > > > and
> >> > > > one that is not really backwards compatible ;).
> >> > > > J.
> >> > > > On Tue, Jan 8, 2019 at 5:37 PM Tim Swast swast@google.com.invalid
> >> > > > wrote:
> >> > > >
> >> > > > > > I don't see it solving any problem than test speed
(which is a
> >> big one,
> >> > > > > > yes) but doesn't reduce the amount of workload on the
> >> committers.
> >> > > > >
> >> > > > > It's about distributed ownership. For example, I'm not a
> >> committer in
> >> > > > > pandas, but I am the primary maintainer of pandas-gbq. You're
> >> right
> >> > > > > that if
> >> > > >
> >> > > > > the set of committers is the same for all 24 repos, there
isn't
> >> all that
> >> > > > > much benefit beyond testing speed.
> >> > > > >
> >> > > > > > Each sub-project would still have to follow the normal
Apache
> >> voting
> >> > > > > > process.
> >> > > > >
> >> > > > > Presumably the set of people that care about the sub-packages
> >> will be
> >> > > > > smaller. I don't know enough about the Apache voting process
to
> >> know how
> >> > > > > that might affect it.
> >> > > > > Maybe many of the sub-packages can live outside the Apache
org?
> >> Pandas
> >> > > > > keeps the I/O sub-packages in a different org, for example.
> >> > > > >
> >> > > > > > Google could choose to release a airflow-gcp-operators
package
> >> now and
> >> > > > > > tell people to |from gcp.airflow.operators import
> >> SomeNewOperator|.
> >> > > > >
> >> > > > > That's actually part of my motivation for this proposal.
I've
> got
> >> some
> >> > > > > red
> >> > > >
> >> > > > > tape to get through, but ideally the proposed airflow-google
> >> repository
> >> > > > > in
> >> > > >
> >> > > > > AIP-8 would actually live in the GoogleCloudPlatform org.
> >> > > > > Maybe I should decrease the scope of AIP-8 to Google
> >> hooks/operators?
> >> > > > >
> >> > > > > > There is nothing stopping someone /currently/ creating
their
> own
> >> > > > > > operators package.
> >> > > > >
> >> > > > > Hooks still need some support in core, so that connections
can
> be
> >> > > > > configured. Also, the fact that so many operators live in
the
> >> Airflow
> >> > > > > makes
> >> > > >
> >> > > > > it seem like an operator is less supported / a hack if it
> doesn't
> >> live
> >> > > > > there.
> >> > > > >
> >> > > > > > How will we ensure that core changes don't break any
> >> hooks/operators?
> >> > > > > > Pandas does this by running tests in the I/O repos
against
> >> pandas master
> >> > > > > > branch in addition to against supported releases.
> >> > > > >
> >> > > > > > How do we support the logging backends for s3/azure/gcp?
> >> > > > > > I don't see any reason we can't keep doing what we're
already
> >> doing.
> >> > >
> >> > >
> >>
> https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45
> >> > >
> >> > > > > We'd need to adjust the import path for the hook, but so
long as
> >> the
> >> > > > > upload
> >> > > >
> >> > > > > / download method remains stable, it'll work the same. The
> >> sub-package
> >> > > > > will
> >> > > >
> >> > > > > need to ensure it tests the logging code path in addition
to
> >> testing
> >> > > > > DAGs
> >> > > >
> >> > > > > that use the relevant operators.
> >> > > > >
> >> > > > > -   • *Tim Swast
> >> > > > > -   • *Software Friendliness Engineer
> >> > > > > -   • *Google Cloud Developer Relations
> >> > > > > -   • *Seattle, WA, USA
> >> > > > >
> >> > > > > On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor ash@apache.org
> >> > > > > wrote:
> >> > > >
> >> > > > > > Can someone explain to me how having multiple packages
will
> >> work in
> >> > > > > > practice?
> >> > > > > > How will we ensure that core changes don't break any
> >> hooks/operators?
> >> > > > > > How do we support the logging backends for s3/azure/gcp?
> >> > > > > > What would the release process be for the "sub"-packages?
> >> > > > > > There is nothing stopping someone /currently/ creating
their
> own
> >> > > > > > operators package. There is nothing what-so-ever special
about
> >> the
> >> > > > > > |airflow.operators| package namespace, and for example
Google
> >> could
> >> > > > > > choose to release a airflow-gcp-operators package now
and tell
> >> people
> >> > > > > > to
> >> > > >
> >> > > > > > |from gcp.airflow.operators import SomeNewOperator|.
> >> > > > > > My view on this currently is -1 as I don't see it solving
any
> >> problem
> >> > > > > > than test speed (which is a big one, yes) but doesn't
reduce
> >> the amount
> >> > > > > > of workload on the committers - rather it increases
it by
> >> having a more
> >> > > > > > complex release process (each sub-project would still
have to
> >> follow
> >> > > > > > the
> >> > > >
> >> > > > > > normal Apache voting process) and having 24 repos to
check for
> >> PRs
> >> > > > > > rather than just 1.
> >> > > > > > Am I missing something?
> >> > > > > > ("Core" vs "contrib" made sense when Airflow was still
under
> >> Airbnb, we
> >> > > > > > should probably just move everything from contrib out
to core
> >> pre
> >> > > > > > 2.0.0)
> >> > > >
> >> > > > > > -ash
> >> > > > > > airflowuser wrote on 08/01/2019 15:44:
> >> > > > > >
> >> > > > > > > I think the operator should be placed by the source.
> >> > > > > > > If it's MySQLToHiveOperator then it would be placed
in MySQL
> >> package.
> >> > > > > > > The BIG question here is if this serve actual
improvement
> >> like faster
> >> > > > > > > deployment of hook/operators bug-fix to Airflow
users
> (faster
> >> than
> >> > > > > > > actual
> >> > > >
> >> > > > > > Airflow release) or this is mere cosmetic issue.
> >> > > > > >
> >> > > > > > > I assume that this also covers the unnecessary
separation of
> >> core and
> >> > > > > > > contrib.
> >> > > > > > > Sent with ProtonMail Secure Email.
> >> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> >> > > > > > > On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin
<
> >> > > > > > > maximebeauchemin@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Something to think about is how data transfer
operators
> >> like the
> >> > > > > > > > MysqlToHiveOperator usually rely on 2 hooks.
With a
> >> package-specific
> >> > > > > > > > approach that may mean something like an
`airflow-hive`,
> >> > > > > > > > `airflow-mysql`
> >> > > > > >
> >> > > > > > > > and `airflow-mysql-hive` packages, where
the
> >> `airflow-mysql-hive`
> >> > > > > > > > package
> >> > > > > > >
> >> > > > > > > > depends on the two other packages.
> >> > > > > > > > It's just a matter of having a clear strategy,
good naming
> >> > > > > > > > conventions
> >> > > >
> >> > > > > > and
> >> > > > > >
> >> > > > > > > > a nice central place in the docs that centralize
a list of
> >> approved
> >> > > > > > > > packages.
> >> > > > > > > > Max
> >> > > > > > > > On Mon, Jan 7, 2019 at 9:05 AM Tim Swast
> >> swast@google.com.invalid
> >> > > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > > I've created AIP-8:
> >> > >
> >> > >
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
> >> > >
> >> > > > > > > > > To follow-up from the discussion about
splitting
> >> hooks/operators out
> >> > > > > > > > > of the
> >> > > > > > >
> >> > > > > > > > > core Airflow package at
> >> > > > > > > > >
> >> http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<
> >> > > > > > > > > 308670DB-BD2A-4738-81B1-3F6FB312C0C8@apache.org>
> >> > > > > > >
> >> > > > > > > > > I propose packaging based on the target
system, informed
> >> by the
> >> > > > > > > > > existing
> >> > > > > > >
> >> > > > > > > > > hooks in both core and contrib. This
will allow those
> >> with the
> >> > > > > > > > > relevant
> >> > > > > >
> >> > > > > > > > > expertise in each target system to respond
to
> >> contributions / issues
> >> > > > > > > > > without having to follow the flood of
everything
> >> Airflow-related. It
> >> > > > > > > > > will
> >> > > > > > >
> >> > > > > > > > > also decrease the surface area of the
core package,
> >> helping with
> >> > > > > > > > > testability and long-term maintenance.
> >> > > > > > > > >
> >> > > > > > > > > -   • *Tim Swast
> >> > > > > > > > > -   • *Software Friendliness Engineer
> >> > > > > > > > > -   • *Google Cloud Developer Relations
> >> > > > > > > > > -   • *Seattle, WA, USA
> >>
> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message