airflow-commits mailing list archives

From "Jeremiah Lowin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-31) Use standard imports for hooks/operators
Date Tue, 03 May 2016 02:23:13 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267954#comment-15267954 ]

Jeremiah Lowin commented on AIRFLOW-31:
---------------------------------------

Here's a more complete description of the behavior, taken from the PR. We had a larger discussion
about this at the time but it was immediately before the Apache move picked up and I think
it got lost in the shuffle (well, I certainly forgot about it, at least!).

------

This PR replaces Airflow's import machinery with standard Python imports. This has an impact
in four places:

airflow.hooks.*
airflow.operators.*
airflow.contrib.hooks.*
airflow.contrib.operators.*

Operators that could formerly be accessed from these directories must now be accessed explicitly
from submodules:
{code}
from airflow.operators import PigOperator # NO
from airflow.operators.pig_operator import PigOperator # YES
{code}

Old-style imports will still work, but will warn:
{code}
In [4]: airflow.operators.PigOperator
/Users/jlowin/anaconda3/bin/ipython:1: DeprecationWarning: PigOperator: Importing 
PigOperator directly from 'airflow.operators' has been deprecated. Please import 
from 'airflow.operators.[operator_module]' instead. Support for direct imports will 
be dropped entirely in Airflow 2.0.
  #!/bin/bash /Users/jlowin/anaconda3/bin/python.app

Out[4]: pig_operator.PigOperator
{code}
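For illustration, here is one way such a deprecation shim could be implemented -- a hedged sketch, not the PR's actual code, with hypothetical names. The idea is to wrap the package module so that attribute access on a deprecated top-level name emits a DeprecationWarning and lazily resolves the real class from its submodule:

{code}
# Sketch only -- not Airflow's actual shim. Wraps a package module so that
# accessing a deprecated top-level name warns and resolves the real class
# from its submodule.
import importlib
import types
import warnings


class _DeprecatedModule(types.ModuleType):
    def __init__(self, wrapped, deprecated_attrs):
        super(_DeprecatedModule, self).__init__(wrapped.__name__)
        self.__dict__.update(wrapped.__dict__)
        # map of deprecated attribute name -> submodule path that owns it
        self._deprecated_attrs = deprecated_attrs

    def __getattr__(self, name):
        try:
            submodule = self._deprecated_attrs[name]
        except KeyError:
            raise AttributeError(name)
        warnings.warn(
            "%s: Importing %s directly from %r has been deprecated. "
            "Please import from %r instead." % (
                name, name, self.__name__, submodule),
            DeprecationWarning, stacklevel=2)
        return getattr(importlib.import_module(submodule), name)
{code}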
run_unit_tests.sh exports an environment variable, AIRFLOW_USE_NEW_IMPORTS, which turns off
the old-style (deprecated) imports completely; unit tests must therefore use the new style.
I've tried to adjust the tests already, but there are some I can't test locally, so I will
see what Travis has issues with and make adjustments.
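An AIRFLOW_USE_NEW_IMPORTS-style gate can be sketched like this (illustrative only -- the helper name is hypothetical): when the variable is set, the deprecated aliases are simply never installed, so old-style imports fail loudly.

{code}
# Sketch of an environment-variable gate for deprecated aliases.
import os


def register_deprecated_aliases(namespace, aliases):
    if os.environ.get("AIRFLOW_USE_NEW_IMPORTS"):
        return  # new-style imports only; old names stay undefined
    namespace.update(aliases)
{code}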

Plugins are handled similarly. Say a user creates a plugin MyPlugin with some operators.
Currently those operators are available at airflow.operators.\*. With this PR, that behavior
is maintained (through Airflow 2.0), but a new module is also dynamically created that exposes
the operators at airflow.operators.myplugin.\*.
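As a rough sketch of the dynamic-module side (again, names are hypothetical, not the PR's actual code), a plugin's operators can be attached under a generated submodule like this:

{code}
# Illustrative sketch: expose a plugin's operators under a dynamically
# created submodule such as airflow.operators.myplugin.
import sys
import types


def attach_plugin_module(parent_name, plugin_name, operators):
    mod_name = "%s.%s" % (parent_name, plugin_name)
    mod = types.ModuleType(mod_name)
    for op in operators:
        setattr(mod, op.__name__, op)
    sys.modules[mod_name] = mod  # makes "import parent.plugin" work
    parent = sys.modules.get(parent_name)
    if parent is not None:
        setattr(parent, plugin_name, mod)  # attribute access on the parent
    return mod
{code}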

Lastly, I took advantage of the fact that I was touching so many files to add licenses to all
hooks and operators.

> Use standard imports for hooks/operators
> ----------------------------------------
>
>                 Key: AIRFLOW-31
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-31
>             Project: Apache Airflow
>          Issue Type: Improvement
>    Affects Versions: Airflow 2.0
>            Reporter: Jeremiah Lowin
>            Assignee: Jeremiah Lowin
>              Labels: enhancement
>
> (Migrated from https://github.com/airbnb/airflow/issues/1238)
> Currently, Airflow uses a relatively complex import mechanism to import hooks and operators
without polluting the namespace with submodules. I would like to propose that Airflow abandon
that system and use standard Python importing.
> Here are a few major reasons why I think the current system has run its course.
> h3. Polluting namespace
> The biggest advantage of the current system, as I understand it, is that only Operators
appear in the `airflow.operators` namespace.  The submodules that actually contain the operators
do not.
> So for example while `airflow.operators.python_operator.PythonOperator` is a thing, `PythonOperator`
is in the `airflow.operators` namespace but `python_operator` is not.
> I think this sort of namespace pollution was helpful when Airflow was a smaller project,
but as the number of hooks/operators grows -- and especially as the `contrib` hooks/operators
grow -- I'd argue that namespacing is a *good thing*. It provides structure and organization,
and opportunities for documentation (through module docstrings).
> In fact, I'd argue that the current namespace is itself getting quite polluted -- the
only way to know what's available is to use something like IPython tab-completion to browse
an alphabetical list of Operator names, or to load the source file and grok the import definition
(which no one installing from PyPI is likely to do).
> h3. Conditional imports
> There's a second advantage to the current system: any module that fails to import
is silently ignored. That makes it easy to have optional dependencies. For example, if someone
doesn't have `boto` installed, then they don't have an `S3Hook` either. The same goes for `HiveOperator`.
> Again, as Airflow grows and matures, I think this is a little too magic. If my environment
is missing a dependency, I want to hear about it.
> On the other hand, the `contrib` namespace sort of depends on this -- we don't want users
to have to install every single dependency. So I propose that contrib modules all live in
their submodules: `from airflow.contrib.operators.my_operator import MyOperator`. As mentioned
previously, having structure and namespacing is a good thing as the project gets more complex.
> Other ways to handle this include putting "non-standard" dependencies inside the operator/hook
rather than the module (see `HiveOperator`/`HiveHook`), so it can be imported but not used.
Another is judicious use of `try`/`except ImportError`. The simplest is to make people import
things explicitly from submodules.
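The `try`/`except ImportError` option mentioned above can be sketched as follows (the optional-dependency handling here is illustrative, not Airflow's actual code):

{code}
# Sketch: the module always imports, but the hook fails loudly only when
# it is actually used without its optional dependency.
try:
    import boto  # optional dependency; may be absent
except ImportError:
    boto = None


class S3Hook(object):
    def __init__(self):
        if boto is None:
            raise ImportError("boto must be installed to use S3Hook")
{code}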
> h3. Operator dependencies
> Right now, operators can't depend on each other if they aren't in the same file. This
is for the simple reason that there is no guarantee on what order the operators will be loaded.
It all comes down to which dictionary key gets loaded first. One day Operator B could be loaded
after Operator A; the next day it might be loaded before. Consequently, A and B can't depend
on each other. Worse, if a user makes two operators that do depend on each other, they won't
get an error message when one fails to import.
> For contrib modules in particular, this is sort of killer.
> h3. Ease of use
> It's *hard* to set up imports for a new operator. The dictionary-based import instructions
aren't obvious to new users, and errors are silently dismissed, which makes debugging difficult.
> h3. Identity
> Surprisingly, `airflow.operators.SubDagOperator != airflow.operators.subdag_operator.SubDagOperator`.
See #1168.
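The identity problem can be reproduced in isolation: executing the same source file into two separate module objects yields two distinct class objects with the same name. A self-contained illustration (not Airflow code):

{code}
# Illustration: loading the same source twice produces non-identical classes.
import importlib.util
import os
import tempfile

SRC = "class SubDagOperator(object):\n    pass\n"


def load_copy(mod_name, path):
    spec = importlib.util.spec_from_file_location(mod_name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(SRC)
    path = f.name
try:
    a = load_copy("copy_a", path)
    b = load_copy("copy_b", path)
    # same name, same source -- but two different class objects
    assert a.SubDagOperator is not b.SubDagOperator
finally:
    os.unlink(path)
{code}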
> h2. Proposal
> Use standard python importing for hooks/operators/etc.
> - `__init__.py` files use straightforward, standard Python imports
> - major operators are available at `airflow.operators.OperatorName` or `airflow.operators.operator_module.OperatorName`.
> - contrib operators are only available at `airflow.contrib.operators.operator_module.OperatorName`
in order to manage dependencies
> - operator authors are encouraged to use `__all__` to define their module's exports
> Possibly delete namespace afterward
> - in `operators/__init__.py`, run a function at the end of the file which deletes all
modules from the namespace, leaving only the Operator classes. This keeps the namespace clean
but lets people use familiar import mechanisms.
> Possibly use an import function to handle `ImportError` gracefully
> - rewrite `import_module_attrs` to take one module name at a time instead of a dictionary.
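The namespace-cleanup idea above can be sketched as a small helper run at the end of `operators/__init__.py` (illustrative only; the function name is hypothetical):

{code}
# Sketch: drop imported submodules from a package namespace so that only
# the operator classes remain visible.
import types


def prune_submodules(namespace):
    for name, value in list(namespace.items()):
        if isinstance(value, types.ModuleType) and not name.startswith("__"):
            del namespace[name]
    return namespace
{code}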




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
