airflow-dev mailing list archives

From "Dennis O'Brien" <den...@dennisobrien.net>
Subject Re: best way to handle version upgrades of libraries used by tasks
Date Mon, 05 Feb 2018 20:11:35 GMT
Hi Andrew,

I think the issue is that each worker has a single airflow entry point
(whatever `which airflow` points to) with an associated environment and set
of installed packages, whether those are managed via conda, virtualenv, or
the system Python.  So the executor would need to know which environment you
want to run the task in.  I don't know how that would be possible with the
LocalExecutor or SequentialExecutor, since both are tied to the original
Python environment.  (Someone correct me if I'm wrong here.  I'm definitely
not an expert on the Airflow internals.)
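
For example, a PythonOperator callable just runs in whatever interpreter the
worker process was started from, so a sketch like this (untested; the DAG id
and dates are made up) would only ever print the worker's own Python and
scikit-learn version, with no way to switch per task:

    import sys
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('env_probe', start_date=datetime(2018, 1, 1),
              schedule_interval=None)

    def show_env():
        # Runs inside the worker's interpreter, i.e. the same environment
        # that `airflow` itself is installed into.
        import sklearn
        print(sys.executable, sklearn.__version__)

    PythonOperator(task_id='show_env', python_callable=show_env, dag=dag)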

The BashOperator will allow you to run any process you want, including one in
a different Python environment, but there is some plumbing overhead required
if you want access to the task context, etc.  The CeleryExecutor (or any of
the executors that support distributed workers), plus a queue, gets around the
issue of the worker environment being tied to the scheduler environment.
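
As a sketch (untested; the queue name, DAG id, and dates are made up), you pin
a task to a queue and then start a worker, in whatever conda environment you
like, that listens only on that queue:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('scoring', start_date=datetime(2018, 1, 1),
              schedule_interval='@daily')

    def score_players():
        # Uses whatever scikit-learn version is installed on the worker
        # that picks this task up from the queue.
        import sklearn
        print('scoring with scikit-learn', sklearn.__version__)

    PythonOperator(task_id='score_players', python_callable=score_players,
                   queue='sklearn_v19', dag=dag)

and on the box that has the matching conda environment:

    airflow worker -q sklearn_v19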

That said, I don't want to discourage you from trying things out.  I am
sure there are some mysteries of Python that might make this possible.  For
example, this project from Armin Ronacher allows modules to use different
versions of installed libraries.  (Warning: I wouldn't use it in production.
I think it was more of a proof of concept.)
https://github.com/mitsuhiko/multiversion

cheers,
Dennis



On Mon, Feb 5, 2018 at 5:06 AM Andrew Maguire <andrewm4894@gmail.com> wrote:

> I am curious about a similar issue.  I'm wondering if we could use
> https://github.com/pypa/pipenv - so each DAG lives in its own folder, say,
> and that folder has a Pipfile.lock, which I think could then bundle the
> required environment into the DAG code folder itself.
>
> I've not used this yet, but it seems interesting...
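>
> Something like this is what I have in mind (totally untested sketch; the
> paths and names are made up): keep a Pipfile next to the DAG and shell out
> to it with pipenv.
>
>     # dags/player_scoring/score_dag.py
>     from datetime import datetime
>     from airflow import DAG
>     from airflow.operators.bash_operator import BashOperator
>
>     dag = DAG('player_scoring', start_date=datetime(2018, 1, 1),
>               schedule_interval='@daily')
>
>     BashOperator(
>         task_id='score',
>         # pipenv picks up the Pipfile.lock sitting next to this DAG file,
>         # so the environment travels with the DAG folder.
>         bash_command=('cd /usr/local/airflow/dags/player_scoring && '
>                       'pipenv run python score.py {{ ds }}'),
>         dag=dag)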
>
> On Mon, Feb 5, 2018 at 7:17 AM Dennis O'Brien <dennis@dennisobrien.net>
> wrote:
>
> > Thanks for the input!  I'll take a look at using queues for this.
> >
> > thanks,
> > Dennis
> >
> > On Tue, Jan 30, 2018 at 4:17 PM Hbw <brian@heisenbergwoodworking.com>
> > wrote:
> >
> > > Run them on different workers by using queues?
> > > That way different workers can have different 3rd party libs while
> > > sharing the same Airflow core.
> > >
> > > B
> > >
> > > Sent from a device with less than stellar autocorrect
> > >
> > > > On Jan 30, 2018, at 9:13 AM, Dennis O'Brien <dennis@dennisobrien.net>
> > > > wrote:
> > > >
> > > > Hi All,
> > > >
> > > > I have a number of jobs that use scikit-learn for scoring players.
> > > > Occasionally I need to upgrade scikit-learn to take advantage of some
> > > > new features.  We have a single conda environment that specifies all
> > > > the dependencies for Airflow as well as for all of our DAGs.  So
> > > > currently upgrading scikit-learn means upgrading it for all DAGs that
> > > > use it, and retraining all models for that version.  It becomes a
> > > > very involved task and I'm hoping to find a better way.
> > > >
> > > > One option is to use BashOperator (or something that wraps
> > > > BashOperator) and have bash use a specific conda environment with
> > > > that version of scikit-learn.  While simple, I don't like the idea of
> > > > limiting task input to the command line.  Still, an option.
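> > > >
> > > > Roughly what I mean (untested sketch; the environment, script, and
> > > > DAG names are made up):
> > > >
> > > >     from datetime import datetime
> > > >     from airflow import DAG
> > > >     from airflow.operators.bash_operator import BashOperator
> > > >
> > > >     dag = DAG('player_scoring_conda', start_date=datetime(2018, 1, 1),
> > > >               schedule_interval='@daily')
> > > >
> > > >     BashOperator(
> > > >         task_id='score_players',
> > > >         # Activate a conda env pinned to the scikit-learn version the
> > > >         # model was trained with, and pass task input on the command
> > > >         # line (here just the execution date).
> > > >         bash_command=('source activate sklearn-0.19 && '
> > > >                       'python /opt/jobs/score_players.py {{ ds }}'),
> > > >         dag=dag)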
> > > >
> > > > Another option is the DockerOperator.  But when I asked around at a
> > > > previous Airflow Meetup, I couldn't find anyone actually using it.
> > > > It also adds some complexity to the build and deploy process, in that
> > > > now I have to maintain docker images for all my environments.  Still,
> > > > not ruling it out.
> > > >
> > > > And the last option I can think of is just heterogeneous workers.
> > > > We are migrating our Airflow infrastructure to AWS ECS (from EC2) and
> > > > plan on having support for separate worker clusters, so this could
> > > > include workers with different conda environments.  I assume as long
> > > > as a few key packages are identical between scheduler and worker
> > > > instances (airflow, redis, celery?) the rest can be whatever.
> > > >
> > > > Has anyone faced this problem and have some advice?  Am I missing any
> > > > simpler options?  Any thoughts much appreciated.
> > > >
> > > > thanks,
> > > > Dennis
> > >
> >
>
