crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Flume R -- any interest?
Date Mon, 29 Oct 2012 17:17:01 GMT
Ok, sounds very promising...

I'll try to start digging into the driver part this week then (Pipeline
wrapper in R5).

On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com> wrote:
> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> Ok, cool.
>>
>> So what state is Crunch in? I take it that it is fairly advanced.
>> So every API mentioned in the FlumeJava paper is working, right? Or
>> is there something specific that is not working yet?
>
> I think the only thing in the paper that we don't have in a working
> state is MSCR fusion. It's mostly just a question of prioritizing it
> and getting the work done.
>
>>
>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com> wrote:
>>> Hey Dmitriy,
>>>
>>> Got a fork going and looking forward to playing with crunchR this weekend--
>>> thanks!
>>>
>>> J
>>>
>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>
>>>> Project template https://github.com/dlyubimov/crunchR
>>>>
>>>> The default profile does not compile the R artifact; the R profile
>>>> does. For convenience, it is enabled by supplying -DR on the mvn
>>>> command line, e.g.
>>>>
>>>> mvn install -DR
>>>>
>>>> There's also a helper that installs the snapshot version of the
>>>> package in the crunchR module.
>>>>
>>>> There are RJava and JRI Java dependencies which I did not find anywhere
>>>> in public Maven repos, so they are installed into my GitHub Maven repo
>>>> for now. It should compile for third parties.
>>>>
>>>> -DR compilation requires R, RJava and, optionally, RProtoBuf. R doc
>>>> compilation requires roxygen2 (I think).
>>>>
>>>> For some reason RProtoBuf fails to import into another package: I got a
>>>> weird exception when I put @import RProtoBuf into crunchR, so
>>>> RProtoBuf is now in the "Suggests" category. Down the road that may be a
>>>> problem though...
>>>>
>>>> Other than the template, not much else has been done so far... finding
>>>> the Hadoop libraries and adding them to the package path on initialization
>>>> via "hadoop classpath"... adding the Crunch jars and their non-"provided"
>>>> transitive dependencies to crunchR's Java part...
>>>>
>>>> No legal stuff...
>>>>
>>>> No readmes... complete stealth at this point.
>>>>
>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>> wrote:
>>>> > Ok, cool. I will try to roll a project template some time next week.
>>>> > We can start with prototyping and benchmarking something really
>>>> > simple, such as parallelDo().
>>>> >
>>>> > My interim goal is to perhaps take some more or less simple algorithm
>>>> > from Mahout and demonstrate that it can be solved with Rcrunch (or
>>>> > whatever name it ends up having) in comparable time (performance) but
>>>> > with much fewer lines of code (say, one of the factorization or
>>>> > clustering things).
>>>> >
>>>> >
>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
>>>> >> I am not much of an R user but I am interested to see how well we can
>>>> >> integrate the two. I would be happy to help.
>>>> >>
>>>> >> regards,
>>>> >> Rahul
>>>> >>
>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>>>> >>>
>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Yep, ok.
>>>> >>>>
>>>> >>>> I imagine it has to be an R module so I can set up a maven project
>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
>>>> >>>> have a template to look at, it would be useful i guess too.
>>>> >>>
>>>> >>> No, please go right ahead.
>>>> >>>
>>>> >>>>
>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com>
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
>>>> >>>>> repo?
>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>
>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
>>>> >>>>>> Crunch for something simple. This should both save time and prove or
>>>> >>>>>> disprove if Crunch via RJava integration is viable.
>>>> >>>>>>
>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
>>>> >>>>>> it completely separate.
>>>> >>>>>>
>>>> >>>>>> -d
>>>> >>>>>>
>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com>
>>>> >>>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
>>>> >>>>>>> Murray Stokely?
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>> >>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hello,
>>>> >>>>>>>>
>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot of applications
>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
>>>> >>>>>>>>
>>>> >>>>>>>> I did not quite get the details of Google implementation of R
>>>> >>>>>>>> mapping, but i am not sure if just a direct mapping from R to Crunch
>>>> >>>>>>>> would be sufficient (and, for most part, efficient). RJava/JRI and
>>>> >>>>>>>> jni seem to be a pretty terrible performer to do that directly.
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
>>>> >>>>>>>> good synergy.
>>>> >>>>>>>>
>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
>>>> >>>>>>>> like a natural place to poke.
>>>> >>>>>>>>
>>>> >>>>>>>> Thanks.
>>>> >>>>>>>>
>>>> >>>>>>>> -Dmitriy
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> --
>>>> >>>>>>> Director of Data Science
>>>> >>>>>>> Cloudera
>>>> >>>>>>> Twitter: @josh_wills
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
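
The thread mentions finding the Hadoop libraries via "hadoop classpath" and adding them to the package path on initialization. The following is only a rough sketch of that idea in Python, not code from crunchR: the function names expand_classpath and hadoop_classpath are hypothetical, and it assumes the hadoop launcher is on PATH and prints a path-separator-delimited list that may contain Java-style "dir/*" wildcard entries.

```python
import glob
import os
import subprocess


def expand_classpath(raw, sep=os.pathsep):
    """Split a classpath string and expand 'dir/*' entries into jar paths.

    Hypothetical helper for illustration; not part of crunchR or Hadoop.
    """
    entries = []
    for item in raw.strip().split(sep):
        if not item:
            continue  # skip empty segments such as a trailing separator
        if item.endswith("*"):
            # In a Java classpath, a trailing '*' stands for the jars in that directory.
            entries.extend(sorted(glob.glob(item + ".jar")))
        else:
            entries.append(item)
    return entries


def hadoop_classpath():
    """Run `hadoop classpath` and return its entries as a list.

    Assumes the `hadoop` launcher script is available on PATH.
    """
    result = subprocess.run(
        ["hadoop", "classpath"], capture_output=True, text=True, check=True
    )
    return expand_classpath(result.stdout)
```

On a machine with Hadoop installed, hadoop_classpath() would give the list of entries that a package initializer could hand to the embedded JVM; how crunchR actually wires this up is not described in the thread.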
