airflow-dev mailing list archives

From Arthur Wiedmer <arthur.wied...@gmail.com>
Subject Re: R Support
Date Thu, 13 Jul 2017 18:16:50 GMT
Hi,

Just to share a little bit about our experience.

The R parallelism packages have a tendency to use as many cores as they see
available by default if you do not set the parallelism explicitly. We have
seen R happily fork and create 64 processes on some machines because the
AWS box was reporting 64 vCPUs. Of course, anything else you try to run on
that Airflow node then has to compete with those processes for CPU.

We recommend at the very least a separate queue/isolated boxes if you go
down that path.
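
As a rough illustration of both points (setting the parallelism explicitly
and using a separate queue), here is a minimal Python sketch that builds the
arguments for a BashOperator-style task. The function name
`capped_rscript_task`, the queue name `r_jobs`, and the script path are
hypothetical. It assumes the R script relies on the `parallel` package, which
initializes its `mc.cores` option from the `MC_CORES` environment variable,
and on OpenMP-backed code that honors `OMP_NUM_THREADS`. A script that calls
`detectCores()` directly would still need the limit set inside the R code.

```python
import shlex


def capped_rscript_task(task_id, script_path, max_cores=4, queue="r_jobs"):
    """Build keyword arguments for a BashOperator-style task that runs an
    R script with an explicit core limit on a dedicated queue.

    All names here (task id, queue, script path) are illustrative only.
    """
    # Always request at least one core, whatever the caller passed in.
    cores = max(1, int(max_cores))

    # Prefix the command with environment variables so the limit applies
    # only to this Rscript process, not to the whole worker.
    command = "MC_CORES={n} OMP_NUM_THREADS={n} Rscript {script}".format(
        n=cores,
        script=shlex.quote(script_path),  # protect paths containing spaces
    )

    return {
        "task_id": task_id,
        "bash_command": command,
        # Route the task to workers listening on a separate queue.
        "queue": queue,
    }


if __name__ == "__main__":
    kwargs = capped_rscript_task("rank_listings", "/opt/jobs/rank.R")
    print(kwargs["bash_command"])
```

The returned dictionary could be passed on as `BashOperator(dag=dag, **kwargs)`.
The `queue` argument only has an effect with an executor that supports
queues, such as the CeleryExecutor, and with workers started to listen on
that queue.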

Best,
Arthur



On Thu, Jul 13, 2017 at 10:45 AM, Andrew Maguire <andrewm4894@gmail.com>
wrote:

> Cool, will bear that in mind. It's really only a handful of small scripts
> that use certain r packages for various reasons.
>
> Cheers,
> Andy
>
> On Thu, 13 Jul 2017, 17:30 Maxime Beauchemin, <maximebeauchemin@gmail.com>
> wrote:
>
> > Operators as an abstraction for something like R tend to be more
> > restrictive than useful. Similarly it's hard to write a useful
> > SparkOperator because it will typically simply fetch an artifact and fire
> > it up, and people have different ways of storing artifacts so there's not
> > much to generalize.
> >
> > Though I could see that if there is a set of common patterns you use R
> > for and want to parameterize and abstract out or "industrialize" then
> > specific operators can be useful. "FetchFromS3andRankROperator" or
> > something like that makes more sense than a generic ROperator(script)
> > which would be a very thin wrapper around BashOperator.
> >
> > These specific operators are usually specific to your environment and can
> > be defined and reused within your DAG repository.
> >
> > I don't want to start a flame war here but there's a bigger question on
> > whether you want to allow running R in production. It's dangerous for
> > many reasons that I won't get into here unless we decide to have this
> > conversation. Regardless, we do use R in production at Airbnb and would
> > recommend using the cgroup features in Airflow and having a dedicated
> > queue of workers to insulate abuse and contain resource utilisation. I'd
> > also recommend publishing a set of internal rules "When is it ok to use R
> > in production" and have engineers do some gatekeeping in source control.
> >
> > You also may want to consider SparkR as a path to productionize R though
> > from my experience data scientists tend to find it too restrictive as it
> > doesn't have the bells, whistles and trumpets the desktop R has.
> >
> > Max
> >
> > On Thu, Jul 13, 2017 at 7:32 AM, Scott Halgrim <scott.halgrim@zapier.com>
> > wrote:
> >
> > > This doesn't really answer your question, but for what it's worth,
> > > virtually our entire pipeline is written in R. We use BashOperators to
> > > run a templated Rscript command.
> > >
> > > On Jul 13, 2017, 6:21 AM -0700, Andrew Maguire <andrewm4894@gmail.com>,
> > > wrote:
> > > > Hey,
> > > >
> > > > I'm sure this has been asked hundreds of times before.
> > > >
> > > > Are there any plans for adding R script operators?
> > > >
> > > > I looked around the contrib part of the code base but couldn't find
> > > > anything.
> > > >
> > > > I found some tickets in the JIRA, but they seemed to be from around
> > > > 2014 and maybe for stuff that has since been removed.
> > > >
> > > > I'm porting lots of jobs over to Airflow and just trying to assess
> > > > whether it's worth redoing them in Python, calling them with bash
> > > > operators, or just leaving them in my cron jobs for now.
> > > >
> > > > I would be happy to help out with testing or reviewing anything in any
> > > > way if there are efforts ongoing.
> > > >
> > > > Cheers,
> > > > Andy
> > >
> >
>
