crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Flume R -- any interest?
Date Tue, 30 Oct 2012 00:46:22 GMT
Great! So it is in Crunch.

Does it support the hadoop-job jar format, or only pure Java jars?

On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jwills@cloudera.com> wrote:
> On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
>> I think I need functionality to add more jars (or an external hadoop-jar)
>> to drive that from an R package. Just setting the job jar by class is not
>> enough. I can push the overall job-jar as an additional jar to the R package;
>> however, I cannot really run the hadoop command line on it. I need to set
>> up the classpath through RJava.
>>
>> A traditional single hadoop job jar is unlikely to work here since we
>> cannot hardcode pipelines in Java code but rather have to construct
>> them on the fly. (Well, we could serialize pipeline definitions from R
>> and then replay them in a driver -- but that's too cumbersome and more
>> work than it has to be.) There's no reason why I shouldn't be able to
>> do a pig-like "register jar" or a mahout-like "setJobJar" when kicking
>> off a pipeline.
>>
>
> o.a.c.util.DistCache.addJarToDistributedCache?
>
>
>>
>>
>> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>> > Ok, sounds very promising...
>> >
>> > I'll try to start digging into the driver part this week then (Pipeline
>> > wrapper in R5).
>> >
>> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>> Ok, cool.
>> >>>
>> >>> So what state is Crunch in? I take it that it is in a fairly advanced
>> >>> state. So every API mentioned in the FlumeJava paper is working, right?
>> >>> Or is there something specific that is not working?
>> >>
>> >> I think the only thing in the paper that we don't have in a working
>> >> state is MSCR fusion. It's mostly just a question of prioritizing it
>> >> and getting the work done.
>> >>
>> >>>
>> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com> wrote:
>> >>>> Hey Dmitriy,
>> >>>>
>> >>>> Got a fork going and looking forward to playing with crunchR this
>> >>>> weekend-- thanks!
>> >>>>
>> >>>> J
>> >>>>
>> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>>
>> >>>>> Project template https://github.com/dlyubimov/crunchR
>> >>>>>
>> >>>>> The default profile does not compile the R artifact; the R profile
>> >>>>> does. For convenience, the R profile is enabled by supplying -DR on
>> >>>>> the mvn command line, e.g.
>> >>>>>
>> >>>>> mvn install -DR
>> >>>>>
>> >>>>> There's also a helper that installs the snapshot version of the
>> >>>>> package in the crunchR module.
>> >>>>>
>> >>>>> There are RJava and JRI Java dependencies which I did not find anywhere
>> >>>>> in public Maven repos, so they are installed into my GitHub Maven repo
>> >>>>> so far. It should compile for third parties.
>> >>>>>
>> >>>>> -DR compilation requires R, RJava and, optionally, RProtoBuf. R doc
>> >>>>> compilation requires roxygen2 (I think).
>> >>>>>
>> >>>>> For some reason RProtoBuf fails to import into another package: I got a
>> >>>>> weird exception when I put @import RProtoBuf into crunchR, so
>> >>>>> RProtoBuf is now in the "Suggests" category. Down the road that may be a
>> >>>>> problem though...
>> >>>>>
>> >>>>> Other than the template, not much else has been done so far... finding
>> >>>>> hadoop libraries and adding them to the package path on initialization
>> >>>>> via "hadoop classpath"... adding Crunch jars and their non-"provided"
>> >>>>> transitives to crunchR's Java part...
>> >>>>>
>> >>>>> No legal stuff...
>> >>>>>
>> >>>>> No readmes... complete stealth at this point.
>> >>>>>
>> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> >>>>> wrote:
>> >>>>> > Ok, cool. I will try to roll a project template by some time next
>> >>>>> > week. We can start with prototyping and benchmarking something really
>> >>>>> > simple, such as parallelDo().
>> >>>>> >
>> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
>> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
>> >>>>> > name it has to be) in comparable time (performance) but with much
>> >>>>> > fewer lines of code (say, one of the factorization or clustering
>> >>>>> > things).
>> >>>>> >
>> >>>>> >
>> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
>> >>>>> >> I am not much of an R user but I am interested to see how well we can
>> >>>>> >> integrate the two. I would be happy to help.
>> >>>>> >>
>> >>>>> >> regards,
>> >>>>> >> Rahul
>> >>>>> >>
>> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> >>>>> >>>
>> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> >>>>> >>> wrote:
>> >>>>> >>>>
>> >>>>> >>>> Yep, ok.
>> >>>>> >>>>
>> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
>> >>>>> >>>> with a java/R code tree (I have been doing that a lot lately). Or if
>> >>>>> >>>> you have a template to look at, it would be useful too, I guess.
>> >>>>> >>>
>> >>>>> >>> No, please go right ahead.
>> >>>>> >>>
>> >>>>> >>>>
>> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com>
>> >>>>> >>>> wrote:
>> >>>>> >>>>>
>> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
>> >>>>> >>>>> repo?
>> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com>
>> >>>>> >>>>> wrote:
>> >>>>> >>>>>
>> >>>>> >>>>>> Ok, maybe there's a benefit to trying a JRI/RJava prototype on top
>> >>>>> >>>>>> of Crunch for something simple. This should both save time and
>> >>>>> >>>>>> prove or disprove whether Crunch via RJava integration is viable.
>> >>>>> >>>>>>
>> >>>>> >>>>>> On my part I can try to do it within the Crunch framework or we
>> >>>>> >>>>>> can keep it completely separate.
>> >>>>> >>>>>>
>> >>>>> >>>>>> -d
>> >>>>> >>>>>>
>> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com>
>> >>>>> >>>>>> wrote:
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was
>> >>>>> >>>>>>> it Murray Stokely?
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov
>> >>>>> >>>>>>> <dlieu.7@gmail.com> wrote:
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Hello,
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience with an R
>> >>>>> >>>>>>>> mapping of FlumeJava at one of the recent BARUGs. I think a lot
>> >>>>> >>>>>>>> of applications similar to what we do in Mahout could be
>> >>>>> >>>>>>>> prototyped using flume R.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> I did not quite get the details of Google's implementation of the
>> >>>>> >>>>>>>> R mapping, but I am not sure if just a direct mapping from R to
>> >>>>> >>>>>>>> Crunch would be sufficient (and, for the most part, efficient).
>> >>>>> >>>>>>>> RJava/JRI and JNI seem to be pretty terrible performers for
>> >>>>> >>>>>>>> doing that directly.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> On top of that, I am thinking that if this project could have a
>> >>>>> >>>>>>>> contributed adapter to Mahout's distributed matrices, that would
>> >>>>> >>>>>>>> be a very good synergy.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for an open
>> >>>>> >>>>>>>> source version of flume R support? Just gauging interest; the
>> >>>>> >>>>>>>> Crunch list seems like a natural place to poke.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Thanks .
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> -Dmitriy
>> >>>>> >>>>>>>
>> >>>>> >>>>>>>
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> --
>> >>>>> >>>>>>> Director of Data Science
>> >>>>> >>>>>>> Cloudera
>> >>>>> >>>>>>> Twitter: @josh_wills
>> >>>>> >>>
>> >>>>> >>>
>> >>>>> >>>
>> >>>>> >>
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Director of Data Science
>> >>>> Cloudera <http://www.cloudera.com>
>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
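
On the "hadoop classpath" step mentioned further down the thread: the command often prints entries such as some/dir/*, which the java launcher expands itself but which may need expanding by hand when the paths are handed to an already running JVM (as with RJava). Below is a minimal sketch of that expansion, assuming the hadoop command is on the PATH. The class and method names are hypothetical and are not part of Crunch or crunchR.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Crunch or crunchR. */
final class HadoopClasspathSketch {

    /** Runs "hadoop classpath" (assumed to be on the PATH) and returns its output. */
    static String queryHadoop() throws IOException, InterruptedException {
        Process p = new ProcessBuilder("hadoop", "classpath")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line.trim());
            }
        }
        p.waitFor();
        return out.toString();
    }

    /**
     * Splits a classpath string on the given separator and replaces each
     * "dir/*" entry with the jar files found directly inside dir.
     * Other entries are passed through unchanged.
     */
    static List<String> expand(String classpath, String separator) throws IOException {
        List<String> entries = new ArrayList<>();
        for (String raw : classpath.trim().split(Pattern.quote(separator))) {
            String entry = raw.trim();
            if (entry.isEmpty()) {
                continue;
            }
            boolean wildcard = entry.endsWith("/*") || entry.endsWith(File.separator + "*");
            if (!wildcard) {
                entries.add(entry);
                continue;
            }
            Path dir = Paths.get(entry.substring(0, entry.length() - 1));
            if (!Files.isDirectory(dir)) {
                continue; // skip wildcard entries whose directory does not exist
            }
            List<String> jars = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.{jar,JAR}")) {
                for (Path jar : stream) {
                    jars.add(jar.toString());
                }
            }
            Collections.sort(jars); // deterministic order within one directory
            entries.addAll(jars);
        }
        return entries;
    }
}
```

A caller would pass the result of queryHadoop() and File.pathSeparator to expand(), then hand each returned path to whatever sets up the JVM classpath (from R, presumably rJava's .jaddClassPath).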
