airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From airflowuser <airflowu...@protonmail.com.INVALID>
Subject Re: AIP-8 Split Hooks/Operators into Separate Packages
Date Thu, 10 Jan 2019 07:00:36 GMT
@Max  I don't see how this is doable.
Consider S3ToGoogleCloudStorageOperator
It users both S3Hook and GoogleCloudStorageHook.

With your suggestion we have to maintain S3Hook in each separated package per operator/sensor.
Which means for example  if new parameter is added to any of the hooks you have to add it
in dozes of places (+tests).
This is very inconvenient.


Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, January 9, 2019 9:29 PM, Maxime Beauchemin <maximebeauchemin@gmail.com>
wrote:

> If there's a strict policy of having a single hook and a single operator
> per package, then the hook package would be the only place where the
> external dependency is defined, and the operator packages would depend on
> hook package(s). That would follow the "micro package" philosophy and could
> work pretty well. Every hook and operator can have its own set of
> maintainers, test/CI and release cadence.
>
> There can even be packages composing common operators as in
> "airflow-hadoop-operators", or "airflow-common-databases-operators", and
> for backward compatibility, During a transition phase, Airflow could depend
> on a new "airflow-backward-compatibility-operators", though ultimately we
> should encourage people to come up with the right packages/operators for
> their environment instead.
>
> Max
>
> On Wed, Jan 9, 2019 at 11:12 AM Felix Uellendall felix.uellendall@gmx.de
> wrote:
>
> > Regardless of how complex this implementation would be I am +1 on this.
> > From the developer's point of view that the CI would run so much faster
> > is the biggest plus for me. I think It will only become worse the more
> > dependencies we add.
> > From the user's point of view that I am able to choose from multiple
> > packages/repositories only the ones I really want to use and I know that
> > any contributor of this repo probably can answer my questions related to
> > this hooks/operators is the biggest plus for me.
> > I know there would be a lot to do and this gives me headaches even now
> > but in the end I think it would be a great change that is necessary in
> > the long run.
> >
> > -   feluelle
> >
> > Am 08/01/2019 um 17:42 schrieb Jarek Potiuk:
> >
> > > While splitting the monolithical Airfllow architecture to pieces sounds
> > > good, there is one problem that might be difficult to tackle (or rather
> > > impossible unless we change architecture of Airflow significantly) -
> > > namely
> > > dependencies/requirements.
> > > The way Airflow uses operators is that its operators are already closely
> > > coupled with Airflow core. Airflow has to parse all the operators within
> > > the same python interpreter/virtual machine as the core Airflow. This
> > > means
> > > potentially big problem with dependency/requirement handling if we have
> > > multiple packages. There are enough common/shared dependencies that
> > > various
> > > operators use even now to cause occasional headaches even now. We already
> > > have quite a challenge with handling dependencies of Airflow and its
> > > operators/hooks when they are part of Airflow repo.
> > > Currently the problem is that Ariflow sometimes uses outdated
> > > dependencies
> > > or that some random transient dependencies break Airflow installation.
> > > But
> > > at we at least have a common dependency list that we work against for all
> > > operators. Unfortunately if we split, then the problem will be worse -
> > > very
> > > quickly some contrib operators will require different dependencies and
> > > will
> > > not be compatible with Airflow or break Airflow's behaviour.
> > > Not mentioning the problem when you want to use hooks from some other
> > > "area" in your operator. Currently Hooks are the way how you can speed up
> > > development of cross-are behaviour. You implement hooks in some "area"
> > > and
> > > other "areas" are free or even encouraged to use them. For example
> > > exporting from BigQuery to all cloud storages in principle should depend
> > > on
> > > Hooks for every single Cloud Storage package out there (Google, Azure,
> > > AWS). This is even worse than the MySqLToHive case described earlier -
> > > very
> > > quickly we would end up with totally unmanageable mesh of
> > > cross-dependencies.
> > > I think to really make Operators independent from Airflow core, we would
> > > need to allow the dependencies to be fully isolated - i.e. to allow
> > > operators to have different set of dependencies than the core. That's
> > > quite
> > > impossible with the current Airflow approach where the same operator code
> > > is parsed in the core. And the same code is used during execute in
> > > worker.
> > > And the same code might be used by another operator in form of hook.
> > > Unfortunately we are not in the npm
> > > https://npm.github.io/how-npm-works-docs/index.html world (as Kamil
> > > Breguła pointed to me today) where the module loaded handles multiple
> > > versions of the same library in the same process.
> > > One other questions that bothers me - I believe (please correct me If I
> > > am
> > > wrong) some of the operators are using some core features of Airflow and
> > > are even more tied with the Core. For example it is perfectly fine for
> > > the
> > > operator to use SQL Alchemy ORM classes of Airflow and run
> > > queries/perform
> > > updates in the metadata database of Airflow, I believe - as far as I know
> > > there is a requirement (I saw this somewhere at least) that Celery or
> > > Kubernetes workers need to be able to open a direct database connection
> > > to
> > > the metadata database of Airflow and there is nothing to prevent the
> > > operators to do it. This in essence means that the operator has to depend
> > > on many core dependencies/requirements including sqlalchemy,
> > > postgres/mysql
> > > ....). This can be changed and "forbidden" to use Airflow's core features
> > > but it might break compatibility (If I am right about it).
> > > We could imagine a different approach - where operator is split to a
> > > "Proxy" and "Execute" classes. "Proxy" within Core's interpreter with
> > > Core's dependencies, and "Executor" within the worker case. Then each
> > > task
> > > could run in its own Docker image/Pod on Kubernetes with its own
> > > dependencies. But that looks like big, backwards-incompatible change and
> > > it
> > > still does not solve cross-dependencies between different "areas". For
> > > handling cross-area operations we would somehow implement communication
> > > between different containers - each having own dependencies. That would
> > > be
> > > possible in Kubernetes by having single POD with several containers
> > > sharing
> > > the common data and communicating. Seems possible.
> > > It's quite an entertaining idea, but it sounds like Airflow 3.0 already
> > > and
> > > one that is not really backwards compatible ;).
> > > J.
> > > On Tue, Jan 8, 2019 at 5:37 PM Tim Swast swast@google.com.invalid
> > > wrote:
> > >
> > > > > I don't see it solving any problem than test speed (which is a big
one,
> > > > > yes) but doesn't reduce the amount of workload on the committers.
> > > >
> > > > It's about distributed ownership. For example, I'm not a committer in
> > > > pandas, but I am the primary maintainer of pandas-gbq. You're right
> > > > that if
> > >
> > > > the set of committers is the same for all 24 repos, there isn't all that
> > > > much benefit beyond testing speed.
> > > >
> > > > > Each sub-project would still have to follow the normal Apache voting
> > > > > process.
> > > >
> > > > Presumably the set of people that care about the sub-packages will be
> > > > smaller. I don't know enough about the Apache voting process to know how
> > > > that might affect it.
> > > > Maybe many of the sub-packages can live outside the Apache org? Pandas
> > > > keeps the I/O sub-packages in a different org, for example.
> > > >
> > > > > Google could choose to release a airflow-gcp-operators package now
and
> > > > > tell people to |from gcp.airflow.operators import SomeNewOperator|.
> > > >
> > > > That's actually part of my motivation for this proposal. I've got some
> > > > red
> > >
> > > > tape to get through, but ideally the proposed airflow-google repository
> > > > in
> > >
> > > > AIP-8 would actually live in the GoogleCloudPlatform org.
> > > > Maybe I should decrease the scope of AIP-8 to Google hooks/operators?
> > > >
> > > > > There is nothing stopping someone /currently/ creating their own
> > > > > operators package.
> > > >
> > > > Hooks still need some support in core, so that connections can be
> > > > configured. Also, the fact that so many operators live in the Airflow
> > > > makes
> > >
> > > > it seem like an operator is less supported / a hack if it doesn't live
> > > > there.
> > > >
> > > > > How will we ensure that core changes don't break any hooks/operators?
> > > > > Pandas does this by running tests in the I/O repos against pandas
master
> > > > > branch in addition to against supported releases.
> > > >
> > > > > How do we support the logging backends for s3/azure/gcp?
> > > > > I don't see any reason we can't keep doing what we're already doing.
> >
> > https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45
> >
> > > > We'd need to adjust the import path for the hook, but so long as the
> > > > upload
> > >
> > > > / download method remains stable, it'll work the same. The sub-package
> > > > will
> > >
> > > > need to ensure it tests the logging code path in addition to testing
> > > > DAGs
> > >
> > > > that use the relevant operators.
> > > >
> > > > -   • *Tim Swast
> > > > -   • *Software Friendliness Engineer
> > > > -   • *Google Cloud Developer Relations
> > > > -   • *Seattle, WA, USA
> > > >
> > > > On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor ash@apache.org
> > > > wrote:
> > >
> > > > > Can someone explain to me how having multiple packages will work
in
> > > > > practice?
> > > > > How will we ensure that core changes don't break any hooks/operators?
> > > > > How do we support the logging backends for s3/azure/gcp?
> > > > > What would the release process be for the "sub"-packages?
> > > > > There is nothing stopping someone /currently/ creating their own
> > > > > operators package. There is nothing what-so-ever special about the
> > > > > |airflow.operators| package namespace, and for example Google could
> > > > > choose to release a airflow-gcp-operators package now and tell people
> > > > > to
> > >
> > > > > |from gcp.airflow.operators import SomeNewOperator|.
> > > > > My view on this currently is -1 as I don't see it solving any problem
> > > > > than test speed (which is a big one, yes) but doesn't reduce the
amount
> > > > > of workload on the committers - rather it increases it by having
a more
> > > > > complex release process (each sub-project would still have to follow
> > > > > the
> > >
> > > > > normal Apache voting process) and having 24 repos to check for PRs
> > > > > rather than just 1.
> > > > > Am I missing something?
> > > > > ("Core" vs "contrib" made sense when Airflow was still under Airbnb,
we
> > > > > should probably just move everything from contrib out to core pre
> > > > > 2.0.0)
> > >
> > > > > -ash
> > > > > airflowuser wrote on 08/01/2019 15:44:
> > > > >
> > > > > > I think the operator should be placed by the source.
> > > > > > If it's MySQLToHiveOperator then it would be placed in MySQL
package.
> > > > > > The BIG question here is if this serve actual improvement like
faster
> > > > > > deployment of hook/operators bug-fix to Airflow users (faster
than
> > > > > > actual
> > >
> > > > > Airflow release) or this is mere cosmetic issue.
> > > > >
> > > > > > I assume that this also covers the unnecessary separation of
core and
> > > > > > contrib.
> > > > > > Sent with ProtonMail Secure Email.
> > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin <
> > > > > > maximebeauchemin@gmail.com> wrote:
> > > > > >
> > > > > > > Something to think about is how data transfer operators
like the
> > > > > > > MysqlToHiveOperator usually rely on 2 hooks. With a package-specific
> > > > > > > approach that may mean something like an `airflow-hive`,
> > > > > > > `airflow-mysql`
> > > > >
> > > > > > > and `airflow-mysql-hive` packages, where the `airflow-mysql-hive`
> > > > > > > package
> > > > > >
> > > > > > > depends on the two other packages.
> > > > > > > It's just a matter of having a clear strategy, good naming
> > > > > > > conventions
> > >
> > > > > and
> > > > >
> > > > > > > a nice central place in the docs that centralize a list
of approved
> > > > > > > packages.
> > > > > > > Max
> > > > > > > On Mon, Jan 7, 2019 at 9:05 AM Tim Swast swast@google.com.invalid
> > > > > > > wrote:
> > > > > >
> > > > > > > > I've created AIP-8:
> >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
> >
> > > > > > > > To follow-up from the discussion about splitting hooks/operators
out
> > > > > > > > of the
> > > > > >
> > > > > > > > core Airflow package at
> > > > > > > > http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<
> > > > > > > > 308670DB-BD2A-4738-81B1-3F6FB312C0C8@apache.org>
> > > > > >
> > > > > > > > I propose packaging based on the target system, informed
by the
> > > > > > > > existing
> > > > > >
> > > > > > > > hooks in both core and contrib. This will allow those
with the
> > > > > > > > relevant
> > > > >
> > > > > > > > expertise in each target system to respond to contributions
/ issues
> > > > > > > > without having to follow the flood of everything Airflow-related.
It
> > > > > > > > will
> > > > > >
> > > > > > > > also decrease the surface area of the core package,
helping with
> > > > > > > > testability and long-term maintenance.
> > > > > > > >
> > > > > > > > -   • *Tim Swast
> > > > > > > > -   • *Software Friendliness Engineer
> > > > > > > > -   • *Google Cloud Developer Relations
> > > > > > > > -   • *Seattle, WA, USA



Mime
View raw message