airflow-dev mailing list archives

From Lance Norskog <lance.nors...@gmail.com>
Subject Re: separating DAGs and code and handling PYTHONPATH
Date Fri, 03 Jun 2016 20:07:26 GMT
About structuring memory use: we have some major chunks of code set up as
web services. We have a separate machine that runs one service (a
Java-based app) and is limited to running 20 instances at once so that we
can't run out of RAM.

Our installation uses a separate Docker container for each Airflow app.
Docker supports resource quotas for containers (via cgroups), though we
have not used them yet. This feature would let us allocate a fixed amount
of memory to each app, so one unruly app cannot crash the whole Airflow
service.
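A minimal sketch of what such a per-container quota could look like with Docker's cgroup-backed flags (the image name, container name, and limit values below are made-up for illustration, not our actual configuration):

```shell
# Cap this container at 2 GB of RAM; setting --memory-swap equal to
# --memory disallows swap on top of the limit. If the container exceeds
# the cap, the kernel OOM-kills only this container, leaving the rest of
# the Airflow service untouched.
docker run -d \
  --memory=2g \
  --memory-swap=2g \
  --name airflow-worker-1 \
  my-airflow-image worker
```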



On Fri, Jun 3, 2016 at 9:41 AM, Dennis O'Brien <dennis@dennisobrien.net>
wrote:

> Thanks very much for the help.
>
> It seems I had two errors happening here.  First, as Matthias pointed out, I
> was doing it wrong with the jinja2.PackageLoader.  (It's always
> embarrassing to email a dev list when the error is somewhere entirely
> different.)  I switched to jinja2.FileSystemLoader and it worked.
>
> My other issue was an out-of-memory problem.  This wasn't obvious from
> the task instance log, but I found it when running the job from the
> command line.  I dialed down the concurrency in airflow.cfg, which fixed
> the problem.  I also deferred some imports so that the DAG file itself
> was not importing so much (the entire pydata stack); the workers do the
> imports when the task runs.
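A minimal sketch of the deferred-import pattern described above (the function name and the `json` stand-in for the pydata stack are illustrative, not from the actual DAG):

```python
# At module (DAG-parse) level, keep imports cheap: the scheduler
# re-parses DAG files constantly, so heavy top-level imports cost
# memory on every parse.

def send_report():
    # Heavy dependency imported only when the worker runs the task.
    import json  # stand-in for pandas/numpy/etc.
    return json.dumps({"status": "ok"})

# Nothing heavy has been imported until the callable actually executes.
print(send_report())
```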
>
> And thanks for the pointers about template_searchpath and the pitfalls of
> sys.path hacks.
>
> I'd still be interested to learn more about how others structure more
> complex rollouts of Airflow.  We're moving from the "proof of concept"
> phase to the "we're doing this" phase, so learning how others are
> configuring and deploying would be really helpful.  Maybe at the next
> meetup. :-)
>
> cheers,
> Dennis
>
>
> On Thu, Jun 2, 2016 at 2:24 PM Maxime Beauchemin <
> maximebeauchemin@gmail.com>
> wrote:
>
> > A few related things:
> > * You can use the `template_searchpath` param of the DAG constructor to
> > add folders to the jinja searchpath for your DAG. Documented here:
> > http://pythonhosted.org/airflow/code.html?highlight=template_searchpath#airflow.models.DAG
> > * Airflow only adds DAGS_FOLDER to your `sys.path`; beyond that you have
> > to manage your PYTHONPATH on your own. Note that in the current version,
> > messing with `sys.path` affects the main thread, meaning that DAGs parsed
> > after this alteration have a different `sys.path` than the ones before,
> > which can create some serious, hard-to-debug problems. We're addressing
> > this issue in the next version, where DAG parsing will be done in
> > subprocesses.
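The `sys.path` pitfall described above can be sketched like this (the path is hypothetical):

```python
import sys

# A DAG file that does this mutates interpreter-wide state:
sys.path.append('/opt/etl/lib')  # hypothetical path

# Every DAG file parsed *after* this point in the same process sees the
# modified path, while DAGs parsed earlier did not -- so whether an
# import succeeds can depend on parse order. Parsing each DAG in its own
# subprocess isolates this state.
print('/opt/etl/lib' in sys.path)
```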
> >
> > Max
> >
> > On Thu, Jun 2, 2016 at 1:43 AM, Matthias Huschle <
> > matthias.huschle@paymill.de> wrote:
> >
> > > Hi Dennis,
> > >
> > > the first error is thrown by jinja2.PackageLoader. I think you still
> > > have to use dot notation in the first argument, as the module itself
> > > is under the reports path:
> > >
> > > In:
> > > "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 212, in get_email_html
> > > Change:
> > > env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email', 'templates'))
> > > To:
> > > env = jinja2.Environment(loader=jinja2.PackageLoader('reports.gsn_kpi_daily_email', 'templates'))
> > >
> > > For the second error I don't see a cause. You should first check
> > > sys.path from within the script to see if etl/lib/ is properly added.
> > > It's strange that the first error is thrown during runtime of the same
> > > module that fails to import in the second error. Do you modify
> > > sys.path from within your scripts?
> > >
> > > If I understand your setup correctly, an __init__.py is only necessary
> > > in reports. I don't think it has any purpose in folders that are
> > > directly in sys.path. However, the names "lib" and "db_connect" are
> > > quite generic. I'd consider renaming lib (sth. like etl_lib), adding
> > > just etl/ to sys.path, and adding an __init__.py to the lib folder to
> > > avoid namespace pollution. You'd have to use "from etl_lib import
> > > db_connect" then, of course.
> > >
> > > Hope that helps,
> > > Matthias
> > >
> > >
> > > 2016-06-01 20:10 GMT+02:00 Dennis O'Brien <dennis@dennisobrien.net>:
> > >
> > > > Hi folks
> > > >
> > > > I'm looking for some advice on how others separate their DAGs from
> > > > the code those DAGs call, and handle any PYTHONPATH fixups that may
> > > > be necessary.
> > > >
> > > > I have a project that looks like this
> > > >
> > > > .
> > > > ├── airflow
> > > > │ ├── dags
> > > > │ │ ├── reports
> > > > │ │ └── sql
> > > > │ └── deploy
> > > > │    └── templates
> > > > ├── etl
> > > > │ ├── lib
> > > >
> > > > All the DAGs are in airflow/dags.
> > > > The SQL used by SqlSensor tasks is in airflow/dags/sql.
> > > > The Python code used by PythonOperator is in airflow/dags/reports
> > > > and etl/lib.
> > > > Existing ETL code is all in etl.
> > > >
> > > > In ./airflow/dags/etl_gsn_daily_kpi_email.py
> > > > ```
> > > > from reports.gsn_kpi_daily_email import send_daily_kpi_email
> > > > ```
> > > >
> > > > I thought I could just import code in airflow/dags/reports from
> > > > airflow/dags since DAGS_FOLDER is added to sys.path, but after
> > > > deploying the code I saw an error in the web UI about failing to
> > > > import the module `reports.gsn_kpi_daily_email`.  So I added
> > > > __init__.py files in dags and dags/reports with no success.  Then I
> > > > modified my upstart scripts to fix up the PYTHONPATH.
> > > >
> > > > ```
> > > > env PYTHONPATH=$PYTHONPATH:{{ destination_dir }}/airflow/dags/:{{
> > > > destination_dir }}/etl/lib/
> > > > export PYTHONPATH
> > > > ```
> > > >
> > > > This fixed the error in the web UI, but on the next run of the job
> > > > I got these tracebacks:
> > > > ```
> > > > [2016-06-01 12:14:38,352] {models.py:1286} ERROR - No module named
> > > > gsn_kpi_daily_email
> > > > Traceback (most recent call last):
> > > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 1245, in run
> > > >     result = task_copy.execute(context=context)
> > > >   File "/home/airflow/venv/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 66, in execute
> > > >     return_value = self.python_callable(*self.op_args, **self.op_kwargs)
> > > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 223, in send_daily_kpi_email
> > > >     html = get_email_html(kpi_df)
> > > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 212, in get_email_html
> > > >     env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email', 'templates'))
> > > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/jinja2/loaders.py", line 224, in __init__
> > > >     provider = get_provider(package_name)
> > > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 419, in get_provider
> > > >     __import__(moduleOrReq)
> > > > ImportError: No module named gsn_kpi_daily_email
> > > >
> > > > ...
> > > >
> > > > [2016-06-01 12:19:42,556] {models.py:250} ERROR - Failed to import:
> > > > /home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py
> > > > Traceback (most recent call last):
> > > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 247, in process_file
> > > >     m = imp.load_source(mod_name, filepath)
> > > >   File "/home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py", line 4, in <module>
> > > >     from reports.gsn_kpi_daily_email import send_daily_kpi_email
> > > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 8, in <module>
> > > >     from db_connect import get_db_connection_native as get_db_connection
> > > > ImportError: No module named db_connect
> > > > ```
> > > >
> > > > The first error is strange because the module it can't find,
> > > > gsn_kpi_daily_email, is in the stack trace.
> > > >
> > > > With the second error, db_connect is in etl/lib, which I added to
> > > > the PYTHONPATH.
> > > >
> > > > If anyone has advice on how to separate DAG code from other Python
> > > > code, I'd appreciate any pointers.
> > > >
> > > > And some configuration info:
> > > > airflow[celery,crypto,hive,jdbc,postgres,s3,redis,vertica]==1.7.1.2
> > > > celery[redis]==3.1.23
> > > > AWS EC2 m4.large with Ubuntu 14.04 AMI
> > > > Using CeleryExecutor
> > > >
> > > > thanks,
> > > > Dennis
> > > >
> > >
> >
>



-- 
Lance Norskog
lance.norskog@gmail.com
Redwood City, CA
