incubator-crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Flume R -- any interest?
Date Tue, 30 Oct 2012 01:24:38 GMT
On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> Great! so it is in Crunch.
>
> does it support hadoop-job jar format or only pure java jars?
>

I think just pure jars-- you're referring to hadoop-job format as having
all the dependencies in a lib/ directory within the jar?
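
For reference, a minimal sketch of how one might tell the two layouts apart, assuming "hadoop-job format" means dependency jars packed under lib/ inside the jar, as asked above. It is plain Python using only the standard library, not anything from Crunch or Hadoop, and the function names and jar entries are made up for illustration.

```python
import io
import zipfile


def bundled_lib_jars(jar_file):
    """Return the names of jars nested under lib/ inside the given jar.

    An empty list suggests a plain jar; a non-empty list suggests the
    hadoop-job layout (dependencies packed in lib/ within the jar).
    """
    with zipfile.ZipFile(jar_file) as jar:
        return sorted(
            name for name in jar.namelist()
            if name.startswith("lib/") and name.endswith(".jar")
        )


def make_demo_jar(entries):
    """Build an in-memory jar (a zip archive) holding empty entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        for name in entries:
            jar.writestr(name, b"")
    buf.seek(0)
    return buf


# Hypothetical jars, built in memory so the sketch runs anywhere.
plain_jar = make_demo_jar(["com/example/Driver.class"])
job_jar = make_demo_jar(
    ["com/example/Driver.class", "lib/dep-a.jar", "lib/dep-b.jar"]
)

print(bundled_lib_jars(plain_jar))  # []
print(bundled_lib_jars(job_jar))    # ['lib/dep-a.jar', 'lib/dep-b.jar']
```

A path to a real jar on disk can be passed in place of the in-memory buffers.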


>
> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jwills@cloudera.com> wrote:
> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >
> >> I think I need functionality to add more jars (or an external hadoop-jar)
> >> to drive that from an R package. Just setting the job jar by class is not
> >> enough. I can push the overall job-jar as an additional jar to the R package;
> >> however, I cannot really run the hadoop command line on it; I need to set
> >> up the classpath through RJava.
> >>
> >> A traditional single hadoop job jar is unlikely to work here since we
> >> cannot hardcode pipelines in Java code but rather have to construct
> >> them on the fly. (Well, we could serialize pipeline definitions from R
> >> and then replay them in a driver -- but that's too cumbersome and more
> >> work than it has to be.) There's no reason why I shouldn't be able to
> >> do a pig-like "register jar" or a mahout-like "setJobJar" when kicking
> >> off a pipeline.
> >>
> >
> > o.a.c.util.DistCache.addJarToDistributedCache?
> >
> >
> >>
> >>
> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >> wrote:
> >> > Ok, sounds very promising...
> >> >
> >> > i'll try to start digging on the driver part this week then (Pipeline
> >> > wrapper in R5).
> >> >
> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >> >>> Ok, cool.
> >> >>>
> >> >>> So what state is Crunch in? I take it that it is in a fairly advanced
> >> >>> state. So every API mentioned in the FlumeJava paper is working, right?
> >> >>> Or is there something specific that is not working?
> >> >>
> >> >> I think the only thing in the paper that we don't have in a working
> >> >> state is MSCR fusion. It's mostly just a question of prioritizing it
> >> >> and getting the work done.
> >> >>
> >> >>>
> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com> wrote:
> >> >>>> Hey Dmitriy,
> >> >>>>
> >> >>>> Got a fork going and looking forward to playing with crunchR this
> >> >>>> weekend-- thanks!
> >> >>>>
> >> >>>> J
> >> >>>>
> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >> >>>>
> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> >> >>>>>
> >> >>>>> The default profile does not compile the R artifact; the R profile
> >> >>>>> does. For convenience, it is enabled by supplying -DR on the mvn
> >> >>>>> command line, e.g.
> >> >>>>>
> >> >>>>> mvn install -DR
> >> >>>>>
> >> >>>>> There's also a helper that installs the snapshot version of the
> >> >>>>> package in the crunchR module.
> >> >>>>>
> >> >>>>> There are RJava and JRI Java dependencies which I did not find
> >> >>>>> anywhere in public Maven repos, so they are installed into my GitHub
> >> >>>>> Maven repo so far. It should compile for third parties.
> >> >>>>>
> >> >>>>> -DR compilation requires R, RJava and, optionally, RProtoBuf. R doc
> >> >>>>> compilation requires roxygen2 (I think).
> >> >>>>>
> >> >>>>> For some reason RProtoBuf fails to import into another package: I got
> >> >>>>> a weird exception when I put @import RProtoBuf into crunchR, so
> >> >>>>> RProtoBuf is now in the "Suggests" category. Down the road that may be
> >> >>>>> a problem though...
> >> >>>>>
> >> >>>>> Other than the template, not much else has been done so far... finding
> >> >>>>> Hadoop libraries and adding them to the package path on initialization
> >> >>>>> via "hadoop classpath"... adding Crunch jars and their non-"provided"
> >> >>>>> transitives to crunchR's Java part...
> >> >>>>>
> >> >>>>> No legal stuff...
> >> >>>>>
> >> >>>>> No readmes... complete stealth at this point.
> >> >>>>>
> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >> >>>>> wrote:
> >> >>>>> > Ok, cool. I will try to roll a project template some time next week.
> >> >>>>> > We can start with prototyping and benchmarking something really
> >> >>>>> > simple, such as parallelDo().
> >> >>>>> >
> >> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
> >> >>>>> > from Mahout and demonstrate that it can be solved with Rcrunch (or
> >> >>>>> > whatever name it has to be) in comparable time (performance) but with
> >> >>>>> > far fewer lines of code (say, one of the factorization or clustering
> >> >>>>> > things).
> >> >>>>> >
> >> >>>>> >
> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
> >> >>>>> >> I am not much of an R user but I am interested to see how well we can
> >> >>>>> >> integrate the two. I would be happy to help.
> >> >>>>> >>
> >> >>>>> >> regards,
> >> >>>>> >> Rahul
> >> >>>>> >>
> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> >> >>>>> >>>
> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >> >>>>> >>> wrote:
> >> >>>>> >>>>
> >> >>>>> >>>> Yep, ok.
> >> >>>>> >>>>
> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
> >> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
> >> >>>>> >>>> have a template to look at, it would be useful i guess too.
> >> >>>>> >>>
> >> >>>>> >>> No, please go right ahead.
> >> >>>>> >>>
> >> >>>>> >>>>
> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >>>>> >>>>>
> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
> >> >>>>> >>>>> repo?
> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
> >> >>>>> >>>>>
> >> >>>>> >>>>>> Ok, maybe there's a benefit to trying a JRI/RJava prototype on top of
> >> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
> >> >>>>> >>>>>> disprove whether Crunch via RJava integration is viable.
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> On my part I can try to do it within the Crunch framework or we can
> >> >>>>> >>>>>> keep it completely separate.
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> -d
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com>
> >> >>>>> >>>>>> wrote:
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
> >> >>>>> >>>>>>> Murray Stokely?
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >> >>>>> >>>>>>> wrote:
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> Hello,
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
> >> >>>>> >>>>>>>> FlumeJava at one of the recent BARUGs. I think a lot of applications
> >> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> I did not quite get the details of Google's implementation of the R
> >> >>>>> >>>>>>>> mapping, but I am not sure that just a direct mapping from R to Crunch
> >> >>>>> >>>>>>>> would be sufficient (and, for the most part, efficient). RJava/JRI and
> >> >>>>> >>>>>>>> JNI seem to be pretty terrible performers for doing that directly.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> On top of that, I am thinking that if this project could have a
> >> >>>>> >>>>>>>> contributed adapter to Mahout's distributed matrices, that would be
> >> >>>>> >>>>>>>> a very good synergy.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> Is there anyone interested in contributing to or advising on an open
> >> >>>>> >>>>>>>> source version of flume R support? Just gauging interest; the Crunch
> >> >>>>> >>>>>>>> list seems like a natural place to poke.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> Thanks .
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> -Dmitriy
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>> --
> >> >>>>> >>>>>>> Director of Data Science
> >> >>>>> >>>>>>> Cloudera
> >> >>>>> >>>>>>> Twitter: @josh_wills
> >> >>>>> >>>
> >> >>>>> >>>
> >> >>>>> >>>
> >> >>>>> >>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Director of Data Science
> >> >>>> Cloudera <http://www.cloudera.com>
> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
