incubator-crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Flume R -- any interest?
Date Tue, 30 Oct 2012 00:10:39 GMT
On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> I think I need functionality to add more jars (or an external hadoop-jar)
> to drive that from an R package. Just setting the job jar by class is not
> enough. I can push the overall job-jar as an additional jar to the R
> package; however, I cannot really run the hadoop command line on it; I
> need to set up the classpath through RJava.
>
> A traditional single hadoop job jar is unlikely to work here, since we
> cannot hardcode pipelines in Java code but rather have to construct
> them on the fly. (Well, we could serialize pipeline definitions from R
> and then replay them in a driver, but that's too cumbersome and more
> work than it has to be.) There's no reason why I shouldn't be able to
> do a pig-like "register jar" or a mahout-like "setJobJar" when kicking
> off a pipeline.
>

org.apache.crunch.util.DistCache.addJarToDistributedCache?


>
>
> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
> > Ok, sounds very promising...
> >
> > I'll try to start digging into the driver part this week then (Pipeline
> > wrapper in R5).
> >
> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com> wrote:
> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >>> Ok, cool.
> >>>
> >>> So what state is Crunch in? I take it it is in a fairly advanced state.
> >>> So every API mentioned in the FlumeJava paper is working, right? Or
> >>> is there something specific that is not working?
> >>
> >> I think the only thing in the paper that we don't have in a working
> >> state is MSCR fusion. It's mostly just a question of prioritizing it
> >> and getting the work done.
> >>
> >>>
> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com> wrote:
> >>>> Hey Dmitriy,
> >>>>
> >>>> Got a fork going and looking forward to playing with crunchR this
> >>>> weekend-- thanks!
> >>>>
> >>>> J
> >>>>
> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >>>>
> >>>>> Project template https://github.com/dlyubimov/crunchR
> >>>>>
> >>>>> The default profile does not compile the R artifact; the R profile
> >>>>> does. For convenience, it is enabled by supplying -DR to the mvn
> >>>>> command line, e.g.
> >>>>>
> >>>>> mvn install -DR
> >>>>>
> >>>>> there's also a helper that installs the snapshot version of the
> >>>>> package in the crunchR module.
> >>>>>
> >>>>> There are RJava and JRI Java dependencies which I did not find anywhere
> >>>>> in public Maven repos, so they are installed into my GitHub Maven repo
> >>>>> so far. It should compile for third parties.
> >>>>>
> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
> >>>>> compilation requires roxygen2 (i think).
> >>>>>
> >>>>> For some reason RProtoBuf fails to import into another package; I got a
> >>>>> weird exception when I put @import RProtoBuf into crunchR, so
> >>>>> RProtoBuf is now in the "Suggests" category. Down the road that may be
> >>>>> a problem though...
> >>>>>
> >>>>> Other than the template, not much else has been done so far... finding
> >>>>> hadoop libraries and adding them to the package path on initialization
> >>>>> via "hadoop classpath"... adding Crunch jars and their non-"provided"
> >>>>> transitives to crunchR's Java part...
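
A minimal sketch of the "hadoop classpath" step mentioned above, in Java,
assuming the hadoop launcher is on the PATH. The class and method names
(HadoopClasspath, parse, expand, fromHadoopCommand) are hypothetical and are
not part of crunchR or Crunch. Wildcard entries such as lib/* are expanded by
hand here, because that expansion is normally done by the java launcher
rather than by class loaders.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class HadoopClasspath {

    /** Splits the text printed by `hadoop classpath` into its entries. */
    public static List<String> parse(String output, String separator) {
        List<String> entries = new ArrayList<>();
        for (String entry : output.trim().split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Expands a "dir/*" entry to the .jar files in dir; other entries pass through. */
    public static List<String> expand(String entry) {
        List<String> result = new ArrayList<>();
        if (!entry.endsWith("*")) {
            result.add(entry);
            return result;
        }
        File dir = new File(entry.substring(0, entry.length() - 1));
        File[] jars = dir.listFiles((d, name) -> name.endsWith(".jar"));
        if (jars != null) { // null when the directory does not exist
            Arrays.sort(jars);
            for (File jar : jars) {
                result.add(jar.getPath());
            }
        }
        return result;
    }

    /** Runs `hadoop classpath` and returns the expanded entries. */
    public static List<String> fromHadoopCommand() throws IOException, InterruptedException {
        Process process = new ProcessBuilder("hadoop", "classpath")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line);
            }
        }
        if (process.waitFor() != 0) {
            throw new IOException("`hadoop classpath` exited with an error");
        }
        List<String> expanded = new ArrayList<>();
        for (String entry : parse(output.toString(), File.pathSeparator)) {
            expanded.addAll(expand(entry));
        }
        return expanded;
    }
}
```

On the R side, the resulting paths could presumably be handed to rJava's
.jaddClassPath() during package initialization; that wiring is an assumption
and is not shown here.
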
> >>>>>
> >>>>> No legal stuff...
> >>>>>
> >>>>> No readmes... complete stealth at this point.
> >>>>>
> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>>>> wrote:
> >>>>> > Ok, cool. I will try to roll a project template by some time next
> >>>>> > week. We can start with prototyping and benchmarking something really
> >>>>> > simple, such as parallelDo().
> >>>>> >
> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
> >>>>> > name it has to be) in a comparable time (performance) but with much
> >>>>> > fewer lines of code (say, one of the factorization or clustering
> >>>>> > things).
> >>>>> >
> >>>>> >
> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
> >>>>> >> I am not much of an R user, but I am interested to see how well we
> >>>>> >> can integrate the two. I would be happy to help.
> >>>>> >>
> >>>>> >> regards,
> >>>>> >> Rahul
> >>>>> >>
> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> >>>>> >>>
> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>>>> >>> wrote:
> >>>>> >>>>
> >>>>> >>>> Yep, ok.
> >>>>> >>>>
> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
> >>>>> >>>> with a java/R code tree (I have been doing that a lot lately). Or if
> >>>>> >>>> you have a template to look at, it would be useful too, I guess.
> >>>>> >>>
> >>>>> >>> No, please go right ahead.
> >>>>> >>>
> >>>>> >>>>
> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com>
> >>>>> >>>> wrote:
> >>>>> >>>>>
> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
> >>>>> >>>>> repo?
> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com>
> >>>>> >>>>> wrote:
> >>>>> >>>>>
> >>>>> >>>>>> Ok, maybe there's a benefit to trying a JRI/RJava prototype on top
> >>>>> >>>>>> of Crunch for something simple. This should both save time and prove
> >>>>> >>>>>> or disprove whether Crunch via RJava integration is viable.
> >>>>> >>>>>>
> >>>>> >>>>>> On my part I can try to do it within the Crunch framework, or we can
> >>>>> >>>>>> keep it completely separate.
> >>>>> >>>>>>
> >>>>> >>>>>> -d
> >>>>> >>>>>>
> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com>
> >>>>> >>>>>> wrote:
> >>>>> >>>>>>>
> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was
> >>>>> >>>>>>> it Murray Stokely?
> >>>>> >>>>>>>
> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov
> >>>>> >>>>>>> <dlieu.7@gmail.com> wrote:
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> Hello,
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience with an R
> >>>>> >>>>>>>> mapping of FlumeJava at one of the recent BARUGs. I think a lot of
> >>>>> >>>>>>>> applications similar to what we do in Mahout could be prototyped
> >>>>> >>>>>>>> using flume R.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> I did not quite get the details of Google's implementation of the R
> >>>>> >>>>>>>> mapping, but I am not sure if just a direct mapping from R to Crunch
> >>>>> >>>>>>>> would be sufficient (and, for the most part, efficient). RJava/JRI
> >>>>> >>>>>>>> and JNI seem to be pretty terrible performers for doing that
> >>>>> >>>>>>>> directly.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> On top of that, I am thinking that if this project could have a
> >>>>> >>>>>>>> contributed adapter to Mahout's distributed matrices, that would
> >>>>> >>>>>>>> be a very good synergy.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for an open
> >>>>> >>>>>>>> source version of flume R support? Just gauging interest; the
> >>>>> >>>>>>>> Crunch list seems like a natural place to poke.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> Thanks.
> >>>>> >>>>>>>>
> >>>>> >>>>>>>> -Dmitriy
> >>>>> >>>>>>>
> >>>>> >>>>>>>
> >>>>> >>>>>>>
> >>>>> >>>>>>> --
> >>>>> >>>>>>> Director of Data Science
> >>>>> >>>>>>> Cloudera
> >>>>> >>>>>>> Twitter: @josh_wills
> >>>>> >>>
> >>>>> >>>
> >>>>> >>>
> >>>>> >>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Director of Data Science
> >>>> Cloudera <http://www.cloudera.com>
> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
