incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Flume R -- any interest?
Date Tue, 30 Oct 2012 00:04:51 GMT
I think i need functionality to add more jars (or an external hadoop jar)
in order to drive that from an R package. Just setting the job jar by class
is not enough. I can ship the overall job jar as an additional jar in the R
package; however, i cannot really run the hadoop command line on it, so i
need to set up the classpath through RJava.

A traditional single hadoop job jar is unlikely to work here, since we
cannot hardcode pipelines in java code but rather have to construct
them on the fly. (Well, we could serialize pipeline definitions from R
and then replay them in a driver -- but that's too cumbersome and more
work than it has to be.) There's no reason why i shouldn't be able to
do a pig-like "register jar" or a mahout-like "setJobJar" when kicking
off a pipeline.
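
A minimal sketch of what such a "register jar" helper could look like on the
Java side. JarRegistry and all of its method names are hypothetical; nothing
here is an existing Crunch, Mahout, or crunchR API. It only collects jar
paths (including a classpath string of the kind "hadoop classpath" prints)
and exposes them either as a class loader for the driver JVM or as a
comma-separated list of the shape hadoop's -libjars option takes.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Crunch or crunchR.
final class JarRegistry {
    private final List<File> jars = new ArrayList<>();

    // Pig-like "register jar": remember one extra jar for the pipeline.
    JarRegistry register(String path) {
        jars.add(new File(path));
        return this;
    }

    // Add every entry of a classpath string (for example the output of
    // `hadoop classpath`). Entries ending in "*" are expanded to the
    // .jar files found in that directory, if it exists.
    JarRegistry registerClasspath(String classpath) {
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue;
            }
            if (entry.endsWith("*")) {
                File dir = new File(entry.substring(0, entry.length() - 1));
                File[] found = dir.listFiles((d, name) -> name.endsWith(".jar"));
                if (found != null) {
                    Arrays.sort(found); // deterministic order
                    jars.addAll(Arrays.asList(found));
                }
            } else {
                jars.add(new File(entry));
            }
        }
        return this;
    }

    // Comma-separated jar list, the shape the -libjars option expects.
    String toLibJars() {
        List<String> paths = new ArrayList<>();
        for (File jar : jars) {
            paths.add(jar.getPath());
        }
        return String.join(",", paths);
    }

    // Class loader that can see the registered jars in the driver JVM.
    URLClassLoader toClassLoader(ClassLoader parent) {
        URL[] urls = new URL[jars.size()];
        try {
            for (int i = 0; i < urls.length; i++) {
                urls[i] = jars.get(i).toURI().toURL();
            }
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad jar path", e);
        }
        return new URLClassLoader(urls, parent);
    }
}
```

This only covers the driver side. Getting the same jars onto the task nodes
would still need whatever mechanism the pipeline uses to ship them with the
job, which is the part being asked for above.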


On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> Ok, sounds very promising...
>
> i'll try to start digging on the driver part this week then (Pipeline
> wrapper in R5).
>
> On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com> wrote:
>> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> Ok, cool.
>>>
>>> So what state is Crunch in? I take it it is in a fairly advanced state.
>>> So every api mentioned in the FlumeJava paper is working, right? Or
>>> is there something specific that is not working?
>>
>> I think the only thing in the paper that we don't have in a working
>> state is MSCR fusion. It's mostly just a question of prioritizing it
>> and getting the work done.
>>
>>>
>>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com> wrote:
>>>> Hey Dmitriy,
>>>>
>>>> Got a fork going and looking forward to playing with crunchR this weekend--
>>>> thanks!
>>>>
>>>> J
>>>>
>>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>
>>>>> Project template https://github.com/dlyubimov/crunchR
>>>>>
>>>>> The default profile does not compile the R artifact; the R profile
>>>>> does. For convenience, it is enabled by supplying -DR on the mvn
>>>>> command line, e.g.
>>>>>
>>>>> mvn install -DR
>>>>>
>>>>> there's also a helper that installs the snapshot version of the
>>>>> package in the crunchR module.
>>>>>
>>>>> There are RJava and JRI java dependencies which i did not find anywhere
>>>>> in public maven repos, so they are installed into my github maven repo
>>>>> for now. It should compile for 3rd parties.
>>>>>
>>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
>>>>> compilation requires roxygen2 (i think).
>>>>>
>>>>> For some reason RProtoBuf fails to import into another package, got a
>>>>> weird exception when i put @import RProtoBuf into crunchR, so
>>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
>>>>> problem though...
>>>>>
>>>>> other than the template, not much else has been done so far... finding
>>>>> hadoop libraries and adding them to the package path on initialization
>>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
>>>>> transitives to the crunchR's java part...
>>>>>
>>>>> No legal stuff...
>>>>>
>>>>> No readmes... complete stealth at this point.
>>>>>
>>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>> > Ok, cool. I will try to roll project template by some time next week.
>>>>> > we can start with prototyping and benchmarking something really
>>>>> > simple, such as parallelDo().
>>>>> >
>>>>> > My interim goal is to perhaps take some more or less simple algorithm
>>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
>>>>> > name it has to be) in a comparable time (performance) but with much
>>>>> > fewer lines of code. (say one of factorization or clustering things)
>>>>> >
>>>>> >
>>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
>>>>> >> I am not much of an R user but I am interested to see how well we can
>>>>> >> integrate the two. I would be happy to help.
>>>>> >>
>>>>> >> regards,
>>>>> >> Rahul
>>>>> >>
>>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>>>>> >>>
>>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>> Yep, ok.
>>>>> >>>>
>>>>> >>>> I imagine it has to be an R module so I can set up a maven project
>>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
>>>>> >>>> have a template to look at, it would be useful i guess too.
>>>>> >>>
>>>>> >>> No, please go right ahead.
>>>>> >>>
>>>>> >>>>
>>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
>>>>> >>>>> repo?
>>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
>>>>> >>>>>> Crunch for something simple. This should both save time and prove or
>>>>> >>>>>> disprove if Crunch via RJava integration is viable.
>>>>> >>>>>>
>>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
>>>>> >>>>>> it completely separate.
>>>>> >>>>>>
>>>>> >>>>>> -d
>>>>> >>>>>>
>>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com> wrote:
>>>>> >>>>>>>
>>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
>>>>> >>>>>>> Murray Stokely?
>>>>> >>>>>>>
>>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> Hello,
>>>>> >>>>>>>>
>>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
>>>>> >>>>>>>> flume java at one of the recent BARUGs. I think a lot of applications
>>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
>>>>> >>>>>>>>
>>>>> >>>>>>>> I did not quite get the details of Google's implementation of the R
>>>>> >>>>>>>> mapping, but i am not sure if just a direct mapping from R to Crunch
>>>>> >>>>>>>> would be sufficient (and, for the most part, efficient). RJava/JRI and
>>>>> >>>>>>>> jni seem to be pretty terrible performers for doing that directly.
>>>>> >>>>>>>>
>>>>> >>>>>>>>
>>>>> >>>>>>>> on top of it, I am thinking that if this project could have a
>>>>> >>>>>>>> contributed adapter to Mahout's distributed matrices, that would be
>>>>> >>>>>>>> just a very good synergy.
>>>>> >>>>>>>>
>>>>> >>>>>>>> Is there anyone interested in contributing/advising for an open source
>>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
>>>>> >>>>>>>> like a natural place to poke.
>>>>> >>>>>>>>
>>>>> >>>>>>>> Thanks .
>>>>> >>>>>>>>
>>>>> >>>>>>>> -Dmitriy
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> --
>>>>> >>>>>>> Director of Data Science
>>>>> >>>>>>> Cloudera
>>>>> >>>>>>> Twitter: @josh_wills
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
