crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Flume R -- any interest?
Date Tue, 13 Nov 2012 01:37:00 GMT
I actually want to defer this to the Hadoop admins; we just need to create a
procedure for setting up nodes, ideally as simple as possible. Something
like:

1) set up R
2) install.packages(c("rJava", "RProtoBuf", "crunchR"))
3) R CMD javareconf
4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")' to
either the mapred command lines or LD_LIBRARY_PATH...

but it will depend on their versions of Hadoop, the JRE, etc. I hoped Crunch
might have something to hide a lot of that complexity (since it is about
hiding complexity, for the most part :) ). Besides, Hadoop has a way to
ship .so's to the backend, so if Crunch had an API to do something similar,
it is conceivable that the driver might grab the library and ship it too, to
hide that complexity as well. But then there's a host of issues around how to
handle potentially different rJava versions installed on different nodes...
So it increasingly looks like something we might want to defer to sysops,
with an approximate set of requirements.
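
For illustration, the last step of the node setup above could be sketched
roughly as follows. This is a minimal sketch under assumptions, not a tested
procedure: it assumes R and rJava are already installed on the node, it uses
Rscript instead of the here-string form, and the helper names jri_dir and
prepend_ld_library_path are hypothetical (they are not part of crunchR or
Crunch).

```shell
# Sketch only: helper names are hypothetical, not part of crunchR or Crunch.

# Print the directory where rJava keeps libjri.so, as reported by R.
# Assumes Rscript is on PATH; prints nothing if rJava is not installed.
jri_dir() {
  Rscript --vanilla -e 'cat(system.file("jri", package = "rJava"))'
}

# Prepend a directory to LD_LIBRARY_PATH without leaving a stray colon
# when the variable was previously empty or unset.
prepend_ld_library_path() {
  if [ -z "$1" ]; then
    echo "prepend_ld_library_path: empty directory" >&2
    return 1
  fi
  LD_LIBRARY_PATH="${1}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
  export LD_LIBRARY_PATH
}

# Typical use on a node; skipped entirely when R is not present.
if command -v Rscript >/dev/null 2>&1; then
  _jri="$(jri_dir || true)"
  if [ -n "$_jri" ]; then
    prepend_ld_library_path "$_jri"
  fi
fi
```

The same directory could instead be handed to the task JVMs through
-Djava.library.path (for example via mapred.child.java.opts on Hadoop
versions that use that property); which of the two is appropriate depends on
how the cluster locks down its task command lines.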


On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jwills@cloudera.com> wrote:

> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > so java tasks need to be able to load libjri.so from
> > whatever system.file("jri", package="rJava") says.
> >
> > Traditionally, these issues were handled with -Djava.library.path.
> > Apparently there's nothing a java task can do to enable the loadLibrary()
> > command to see the damn library once started. But -Djava.library.path
> > requires the nodes to configure the jvm command line and lock it against
> > modification by the client, which is fine.
> >
> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
> > (again).
> >
> > but... any other suggestions about best practice for configuring crunch
> > to run user's .so's?
> >
>
> Not off the top of my head. I suspect that whatever you come up with will
> become the "best practice." :)
>
> >
> > thanks.
> >
> >
> >
> >
> >
> >
> > > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <josh.wills@gmail.com>
> > > wrote:
> >
> > > I believe that is a safe assumption, at least right now.
> > >
> > >
> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > >
> > > > Question.
> > > >
> > > > So in the Crunch api, initialize() doesn't get an emitter, and
> > > > process() gets an emitter every time.
> > > >
> > > > However, my guess is that any single reincarnation of a DoFn object
> > > > in the backend will always get the same emitter throughout its
> > > > lifecycle. Is that an admissible assumption, or is there currently a
> > > > counterexample?
> > > >
> > > > The problem is that as i implement the two-way pipeline of input and
> > > > emitter data between R and Java, I am bulking these calls together
> > > > for performance reasons. The individual data in these chunks will not
> > > > have emitter function information attached to them in any way. (well,
> > > > it could, but it would be a performance killer, and i bet the emitter
> > > > never changes).
> > > >
> > > > So, thoughts? can i assume the emitter never changes between the
> > > > first and last call to a DoFn instance?
> > > >
> > > > thanks.
> > > >
> > > >
> > > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > > > wrote:
> > > >
> > > > > yes...
> > > > >
> > > > > > i think it worked for me before, although just adding all jars
> > > > > > from the R package distribution would be a somewhat more
> > > > > > appropriate approach -- but it creates a problem with jars in
> > > > > > dependent R packages. I think it would be much easier to just
> > > > > > compile a hadoop-job file and stick it in rather than
> > > > > > cherry-picking individual jars from who knows how many locations.
> > > > > >
> > > > > > i think i used the hadoop job format with the distributed cache
> > > > > > before and it worked... at least with Pig "register jar"
> > > > > > functionality.
> > > > > >
> > > > > > ok i guess i will just try it and see if it works.
> > > > >
> > > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jwills@cloudera.com>
> > > > > > wrote:
> > > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov
> > > > > > > <dlieu.7@gmail.com> wrote:
> > > > > >
> > > > > >> Great! so it is in Crunch.
> > > > > >>
> > > > > > >> does it support hadoop-job jar format or only pure java jars?
> > > > > > >>
> > > > > > >
> > > > > > > I think just pure jars-- you're referring to hadoop-job format as
> > > > > > > having all the dependencies in a lib/ directory within the jar?
> > > > > >
> > > > > >
> > > > > >>
> > > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jwills@cloudera.com>
> > > > > > >> wrote:
> > > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov
> > > > > > >> > <dlieu.7@gmail.com> wrote:
> > > > > >> >
> > > > > > >> >> I think i need functionality to add more jars (or an external
> > > > > > >> >> hadoop-jar) to drive that from an R package. Just setting the
> > > > > > >> >> job jar by class is not enough. I can push the overall job-jar
> > > > > > >> >> as an additional jar to the R package; however, i cannot really
> > > > > > >> >> run the hadoop command line on it, i need to set up the
> > > > > > >> >> classpath through rJava.
> > > > > > >> >>
> > > > > > >> >> A traditional single hadoop job jar is unlikely to work here,
> > > > > > >> >> since we cannot hardcode pipelines in java code but rather have
> > > > > > >> >> to construct them on the fly. (well, we could serialize pipeline
> > > > > > >> >> definitions from R and then replay them in a driver -- but
> > > > > > >> >> that's too cumbersome and more work than it has to be.) There's
> > > > > > >> >> no reason why i shouldn't be able to do a pig-like "register
> > > > > > >> >> jar" or "setJobJar" (mahout-like) when kicking off a pipeline.
> > > > > >> >>
> > > > > >> >
> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > > > > >> >
> > > > > >> >
> > > > > >> >>
> > > > > >> >>
> > > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov
> > > > > > >> >> <dlieu.7@gmail.com> wrote:
> > > > > > >> >> > Ok, sounds very promising...
> > > > > > >> >> >
> > > > > > >> >> > i'll try to start digging on the driver part this week then
> > > > > > >> >> > (Pipeline wrapper in R5).
> > > > > >> >> >
> > > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills
> > > > > > >> >> > <josh.wills@gmail.com> wrote:
> > > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov
> > > > > > >> >> >> <dlieu.7@gmail.com> wrote:
> > > > > > >> >> >>> Ok, cool.
> > > > > > >> >> >>>
> > > > > > >> >> >>> So what state is Crunch in? I take it it is in a fairly
> > > > > > >> >> >>> advanced state. So every api mentioned in the FlumeJava paper
> > > > > > >> >> >>> is working, right? Or is there something specific that is
> > > > > > >> >> >>> not working?
> > > > > > >> >> >>
> > > > > > >> >> >> I think the only thing in the paper that we don't have in a
> > > > > > >> >> >> working state is MSCR fusion. It's mostly just a question of
> > > > > > >> >> >> prioritizing it and getting the work done.
> > > > > >> >> >>
> > > > > >> >> >>>
> > > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills
> > > > > > >> >> >>> <jwills@cloudera.com> wrote:
> > > > > > >> >> >>>> Hey Dmitriy,
> > > > > > >> >> >>>>
> > > > > > >> >> >>>> Got a fork going and looking forward to playing with crunchR
> > > > > > >> >> >>>> this weekend-- thanks!
> > > > > >> >> >>>>
> > > > > >> >> >>>> J
> > > > > >> >> >>>>
> > > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov
> > > > > > >> >> >>>> <dlieu.7@gmail.com> wrote:
> > > > > >> >> >>>>
> > > > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> > > > > >> >> >>>>>
> > > > > > >> >> >>>>> The default profile does not compile the R artifact; the R
> > > > > > >> >> >>>>> profile does. For convenience, it is enabled by supplying
> > > > > > >> >> >>>>> -DR on the mvn command line, e.g.
> > > > > > >> >> >>>>>
> > > > > > >> >> >>>>> mvn install -DR
> > > > > > >> >> >>>>>
> > > > > > >> >> >>>>> there's also a helper that installs the snapshot version of
> > > > > > >> >> >>>>> the package in the crunchR module.
> > > > > > >> >> >>>>>
> > > > > > >> >> >>>>> There are rJava and JRI java dependencies which i did not
> > > > > > >> >> >>>>> find anywhere in public maven repos, so they are installed
> > > > > > >> >> >>>>> into my github maven repo so far. Should compile for 3rd
> > > > > > >> >> >>>>> parties.
> > > > > > >> >> >>>>>
> > > > > > >> >> >>>>> -DR compilation requires R, rJava and, optionally,
> > > > > > >> >> >>>>> RProtoBuf. R doc compilation requires roxygen2 (i think).
> > > > > > >> >> >>>>>
> > > > > > >> >> >>>>> For some reason RProtoBuf fails to import into another
> > > > > > >> >> >>>>> package; i got a weird exception when i put @import
> > > > > > >> >> >>>>> RProtoBuf into crunchR, so RProtoBuf is now in the
> > > > > > >> >> >>>>> "Suggests" category. Down the road that may be a problem
> > > > > > >> >> >>>>> though...
> > > > > > >> >> >>>>>
> > > > > > >> >> >>>>> other than the template, not much else has been done so
> > > > > > >> >> >>>>> far... finding hadoop libraries and adding them to the
> > > > > > >> >> >>>>> package path on initialization via "hadoop classpath"...
> > > > > > >> >> >>>>> adding Crunch jars and their non-"provided" transitives to
> > > > > > >> >> >>>>> crunchR's java part...
> > > > > > >> >> >>>>>
> > > > > > >> >> >>>>> No legal stuff...
> > > > > > >> >> >>>>>
> > > > > > >> >> >>>>> No readmes... complete stealth at this point.
> > > > > >> >> >>>>>
> > > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov
> > > > > > >> >> >>>>> <dlieu.7@gmail.com> wrote:
> > > > > > >> >> >>>>> > Ok, cool. I will try to roll a project template by some
> > > > > > >> >> >>>>> > time next week. we can start with prototyping and
> > > > > > >> >> >>>>> > benchmarking something really simple, such as parallelDo().
> > > > > > >> >> >>>>> >
> > > > > > >> >> >>>>> > My interim goal is to perhaps take some more or less simple
> > > > > > >> >> >>>>> > algorithm from Mahout and demonstrate it can be solved with
> > > > > > >> >> >>>>> > Rcrunch (or whatever name it has to be) in comparable time
> > > > > > >> >> >>>>> > (performance) but with far fewer lines of code. (say one of
> > > > > > >> >> >>>>> > the factorization or clustering things)
> > > > > >> >> >>>>> >
> > > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com>
> > > > > > >> >> >>>>> > wrote:
> > > > > > >> >> >>>>> >> I am not much of an R user but I am interested to see how
> > > > > > >> >> >>>>> >> well we can integrate the two. I would be happy to help.
> > > > > >> >> >>>>> >>
> > > > > >> >> >>>>> >> regards,
> > > > > >> >> >>>>> >> Rahul
> > > > > >> >> >>>>> >>
> > > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > > > > > >> >> >>>>> >>>
> > > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov
> > > > > > >> >> >>>>> >>> <dlieu.7@gmail.com> wrote:
> > > > > > >> >> >>>>> >>>>
> > > > > > >> >> >>>>> >>>> Yep, ok.
> > > > > > >> >> >>>>> >>>>
> > > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven
> > > > > > >> >> >>>>> >>>> project with a java/R code tree (I have been doing that a
> > > > > > >> >> >>>>> >>>> lot lately). Or if you have a template to look at, it
> > > > > > >> >> >>>>> >>>> would be useful too, i guess.
> > > > > > >> >> >>>>> >>>
> > > > > > >> >> >>>>> >>> No, please go right ahead.
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>>
> > > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills
> > > > > > >> >> >>>>> >>>> <josh.wills@gmail.com> wrote:
> > > > > > >> >> >>>>> >>>>>
> > > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to
> > > > > > >> >> >>>>> >>>>> help. Github repo?
> > > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov"
> > > > > > >> >> >>>>> >>>>> <dlieu.7@gmail.com> wrote:
> > > > > > >> >> >>>>> >>>>>
> > > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to trying a JRI/rJava
> > > > > > >> >> >>>>> >>>>>> prototype on top of Crunch for something simple. This
> > > > > > >> >> >>>>> >>>>>> should both save time and prove or disprove whether
> > > > > > >> >> >>>>> >>>>>> Crunch via rJava integration is viable.
> > > > > > >> >> >>>>> >>>>>>
> > > > > > >> >> >>>>> >>>>>> On my part i can try to do it within the Crunch
> > > > > > >> >> >>>>> >>>>>> framework or we can keep it completely separate.
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> -d
> > > > > >> >> >>>>> >>>>>>
> > > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills
> > > > > > >> >> >>>>> >>>>>> <jwills@cloudera.com> wrote:
> > > > > > >> >> >>>>> >>>>>>>
> > > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave
> > > > > > >> >> >>>>> >>>>>>> the talk? Was it Murray Stokely?
> > > > > > >> >> >>>>> >>>>>>>
> > > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov
> > > > > > >> >> >>>>> >>>>>>> <dlieu.7@gmail.com> wrote:
> > > > > > >> >> >>>>> >>>>>>>>
> > > > > > >> >> >>>>> >>>>>>>> Hello,
> > > > > > >> >> >>>>> >>>>>>>>
> > > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience
> > > > > > >> >> >>>>> >>>>>>>> with an R mapping of FlumeJava at one of the recent
> > > > > > >> >> >>>>> >>>>>>>> BARUGs. I think a lot of applications similar to what
> > > > > > >> >> >>>>> >>>>>>>> we do in Mahout could be prototyped using flume R.
> > > > > > >> >> >>>>> >>>>>>>>
> > > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google's
> > > > > > >> >> >>>>> >>>>>>>> implementation of the R mapping, but i am not sure if
> > > > > > >> >> >>>>> >>>>>>>> just a direct mapping from R to Crunch would be
> > > > > > >> >> >>>>> >>>>>>>> sufficient (and, for the most part, efficient).
> > > > > > >> >> >>>>> >>>>>>>> rJava/JRI and jni seem to be pretty terrible
> > > > > > >> >> >>>>> >>>>>>>> performers for doing that directly.
> > > > > > >> >> >>>>> >>>>>>>>
> > > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking that if this project could
> > > > > > >> >> >>>>> >>>>>>>> have a contributed adapter to Mahout's distributed
> > > > > > >> >> >>>>> >>>>>>>> matrices, that would be just a very good synergy.
> > > > > > >> >> >>>>> >>>>>>>>
> > > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising
> > > > > > >> >> >>>>> >>>>>>>> for an open source version of flume R support? Just
> > > > > > >> >> >>>>> >>>>>>>> gauging interest; the Crunch list seems like a natural
> > > > > > >> >> >>>>> >>>>>>>> place to poke.
> > > > > > >> >> >>>>> >>>>>>>>
> > > > > > >> >> >>>>> >>>>>>>> Thanks.
> > > > > > >> >> >>>>> >>>>>>>>
> > > > > > >> >> >>>>> >>>>>>>> -Dmitriy
> > > > > > >> >> >>>>> >>>>>>>
> > > > > > >> >> >>>>> >>>>>>> --
> > > > > > >> >> >>>>> >>>>>>> Director of Data Science
> > > > > > >> >> >>>>> >>>>>>> Cloudera
> > > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>
> > > > > >> >> >>>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> --
> > > > > >> >> >>>> Director of Data Science
> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > > >> >>
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > Director of Data Science
> > > > > >> > Cloudera <http://www.cloudera.com>
> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Director of Data Science
> > > > > > Cloudera <http://www.cloudera.com>
> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
