airflow-dev mailing list archives

From Maxime Beauchemin <maximebeauche...@gmail.com>
Subject Re: separating DAGs and code and handling PYTHONPATH
Date Thu, 02 Jun 2016 21:24:37 GMT
A few related things:
* You can use the `template_searchpath` param of the DAG constructor to add
folders to the jinja searchpath for your DAG. Documented here:
http://pythonhosted.org/airflow/code.html?highlight=template_searchpath#airflow.models.DAG
* Airflow only adds DAGS_FOLDER to your `sys.path`; beyond that, you have to
manage your PYTHONPATH on your own. Note that in the current version,
messing with `sys.path` affects the main thread, meaning that DAGs parsed
after this alteration have a different `sys.path` than the ones before,
which can create some serious, hard-to-debug problems. We're addressing this
issue in the next version, where DAG parsing will be done in subprocesses.
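
The first point can be sketched like this (a minimal example; the dag_id,
start date, and folder path are illustrative, not from this thread):

```python
from datetime import datetime
from airflow import DAG

# Templated fields (e.g. a SqlSensor's `sql`) are resolved against
# every folder in template_searchpath, in order, in addition to the
# DAG file's own folder.
dag = DAG(
    dag_id="example_template_searchpath",
    start_date=datetime(2016, 6, 1),
    template_searchpath=["/home/airflow/airflow/dags/sql"],
)
```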
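For the second point, if you do end up touching `sys.path` yourself, a
guarded insert at least keeps repeated DAG-file parses in one process from
growing the path (the folder name here is purely illustrative):

```python
import sys

def add_to_syspath(folder):
    """Prepend a folder to sys.path only if it isn't there yet, so
    repeated parses in the same process see a consistent path."""
    if folder not in sys.path:
        sys.path.insert(0, folder)

add_to_syspath("/home/airflow/etl/lib")  # illustrative path
```

This does not remove the underlying hazard Max describes (DAGs parsed
before and after the call still see different paths); it only makes the
mutation idempotent.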

Max

On Thu, Jun 2, 2016 at 1:43 AM, Matthias Huschle <
matthias.huschle@paymill.de> wrote:

> Hi Dennis,
>
> the first error is thrown by jinja2.PackageLoader. I think you still have
> to use dot notation in the first argument, as the module itself is under
> the reports path:
>
> In:
>
> "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 212, in get_email_html
> Change:
> env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email', 'templates'))
> To:
> env = jinja2.Environment(loader=jinja2.PackageLoader('reports.gsn_kpi_daily_email', 'templates'))
>
> For the second error, I don't see a cause. You should first check sys.path
> from within the script to see whether etl/lib/ is properly added. It's
> strange that the first error is thrown at runtime of the same module that
> fails to import in the second error. Do you modify sys.path from within
> your scripts?
>
> If I understand your setup correctly, an __init__.py is only necessary in
> reports. I don't think it has any purpose in folders that are directly on
> sys.path. However, the names "lib" and "db_connect" are quite generic. I'd
> consider renaming lib (to sth. like etl_lib), adding just etl/ to sys.path,
> and adding an __init__.py to the lib folder to avoid namespace pollution.
> You'd then have to use "from etl_lib import db_connect", of course.
>
> Hope that helps,
> Matthias
>
>
> 2016-06-01 20:10 GMT+02:00 Dennis O'Brien <dennis@dennisobrien.net>:
>
> > Hi folks
> >
> > I'm looking for some advice here on how others separate their DAGs and the
> > code those DAGs call, and any PYTHONPATH fixups that may be necessary.
> >
> > I have a project that looks like this
> >
> > .
> > ├── airflow
> > │ ├── dags
> > │ │ ├── reports
> > │ │ └── sql
> > │ └── deploy
> > │    └── templates
> > ├── etl
> > │ ├── lib
> >
> > All the DAGs are in airflow/dags
> > The SQL used by SqlSensor tasks is in airflow/dags/sql
> > The python code used by PythonOperator is in airflow/dags/reports and
> > etl/lib
> > Existing etl code is all in etl
> >
> > In ./airflow/dags/etl_gsn_daily_kpi_email.py
> > ```
> > from reports.gsn_kpi_daily_email import send_daily_kpi_email
> > ```
> >
> > I thought I could just import code in airflow/dags/reports from
> > airflow/dags since DAGS_FOLDER is added to sys.path, but after deploying
> > the code I saw an error in the web UI about failing to import the module
> > `reports.gsn_kpi_daily_email`.  So I added __init__.py files in dags and
> > dags/reports with no success.  Then I modified my upstart scripts to fix
> > up the PYTHONPATH.
> >
> > ```
> > env PYTHONPATH=$PYTHONPATH:{{ destination_dir }}/airflow/dags/:{{ destination_dir }}/etl/lib/
> > export PYTHONPATH
> > ```
> >
> > This fixed the error in the web UI but on the next run of the job, I got
> > these tracebacks:
> > ```
> > [2016-06-01 12:14:38,352] {models.py:1286} ERROR - No module named gsn_kpi_daily_email
> > Traceback (most recent call last):
> >   File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 1245, in run
> >     result = task_copy.execute(context=context)
> >   File "/home/airflow/venv/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 66, in execute
> >     return_value = self.python_callable(*self.op_args, **self.op_kwargs)
> >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 223, in send_daily_kpi_email
> >     html = get_email_html(kpi_df)
> >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 212, in get_email_html
> >     env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email', 'templates'))
> >   File "/home/airflow/venv/local/lib/python2.7/site-packages/jinja2/loaders.py", line 224, in __init__
> >     provider = get_provider(package_name)
> >   File "/home/airflow/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 419, in get_provider
> >     __import__(moduleOrReq)
> > ImportError: No module named gsn_kpi_daily_email
> >
> > ...
> >
> > [2016-06-01 12:19:42,556] {models.py:250} ERROR - Failed to import:
> > /home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py
> > Traceback (most recent call last):
> >   File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 247, in process_file
> >     m = imp.load_source(mod_name, filepath)
> >   File "/home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py", line 4, in <module>
> >     from reports.gsn_kpi_daily_email import send_daily_kpi_email
> >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 8, in <module>
> >     from db_connect import get_db_connection_native as get_db_connection
> > ImportError: No module named db_connect
> > ```
> >
> > The first error is strange because the module it can't find,
> > gsn_kpi_daily_email, is in the stack trace.
> >
> > With that second error, db_connect is in etl/lib which I added to the
> > PYTHONPATH.
> >
> > If anyone has advice on how to separate DAG code and other Python code,
> > I'd appreciate any pointers.
> >
> > And some configuration info:
> > airflow[celery,crypto,hive,jdbc,postgres,s3,redis,vertica]==1.7.1.2
> > celery[redis]==3.1.23
> > AWS EC2 m4.large with Ubuntu 14.04 AMI
> > Using CeleryExecutor
> >
> > thanks,
> > Dennis
> >
>
