incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <>
Subject Re: Flume R -- any interest?
Date Wed, 24 Oct 2012 20:28:13 GMT
Project template

The default profile does not compile the R artifact; the R profile
does. For convenience, the R profile is enabled by passing -DR on the
mvn command line, e.g.

mvn install -DR

There's also a helper that installs the snapshot version of the
package in the crunchR module.

There are RJava and JRI Java dependencies which I did not find anywhere
in public Maven repos, so they are installed in my GitHub Maven repo
for now. It should compile for third parties.

Compiling with -DR requires R, RJava and, optionally, RProtoBuf.
Compiling the R docs requires roxygen2 (I think).

For some reason RProtoBuf fails to import into another package: I got a
weird exception when I put @import RProtoBuf into crunchR, so
RProtoBuf is now in the "Suggests" category. Down the road that may be
a problem, though...

Other than the template, not much else has been done so far: finding
the Hadoop libraries and adding them to the package path on
initialization via "hadoop classpath", and adding the Crunch jars and
their non-"provided" transitive dependencies to crunchR's Java part.
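
To illustrate the "hadoop classpath" step, here is a rough sketch in
Java. It is not the crunchR code (which does this from the R side on
package initialization); the class and method names are made up for the
example. It assumes the hadoop launcher is on the PATH and that its
output is a single classpath string using the platform path separator.
Wildcard entries such as lib/* are passed through unchanged.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of crunchR or Crunch.
public class HadoopClasspath {

    /** Splits a raw classpath string into its entries, dropping blank ones. */
    public static List<String> parse(String raw, String separator) {
        List<String> entries = new ArrayList<>();
        for (String part : raw.trim().split(Pattern.quote(separator))) {
            String entry = part.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Runs "hadoop classpath" and returns the entries it prints (needs Java 9+). */
    public static List<String> fromHadoopCommand() throws IOException, InterruptedException {
        Process p = new ProcessBuilder("hadoop", "classpath")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (p.waitFor() != 0) {
            throw new IOException("'hadoop classpath' exited with a non-zero status");
        }
        return parse(out, File.pathSeparator);
    }

    public static void main(String[] args) throws Exception {
        // Print one entry per line; a caller would add these to its own classpath.
        for (String entry : fromHadoopCommand()) {
            System.out.println(entry);
        }
    }
}
```

Keeping the parsing separate from the process call means the splitting
logic can be checked without a Hadoop install.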

No legal stuff...

No readmes... complete stealth at this point.

On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <> wrote:
> Ok, cool. I will try to roll project template by some time next week.
> we can start with prototyping and benchmarking something really
> simple, such as parallelDo().
> My interim goal is to perhaps take some more or less simple algorithm
> from Mahout and demonstrate it can be solved with Rcrunch (or whatever
> name it has to be) in a comparable time (performance) but with much
> fewer lines of code. (say one of factorization or clustering things)
> On Wed, Oct 17, 2012 at 10:24 PM, Rahul <> wrote:
>> I am not much of R user but I am interested to see how well we can integrate
>> the two. I would be happy to help.
>> regards,
>> Rahul
>> On 18-10-2012 04:04, Josh Wills wrote:
>>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <>
>>> wrote:
>>>> Yep, ok.
>>>> I imagine it has to be an R module so I can set up a maven project
>>>> with java/R code tree (I have been doing that a lot lately). Or if you
>>>> have a template to look at, it would be useful i guess too.
>>> No, please go right ahead.
>>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <>
>>>>> I'd like it to be separate at first, but I am happy to help. Github
>>>>> repo?
>>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <>
>>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
>>>>>> Crunch for something simple. This should both save time and prove
>>>>>> or disprove if Crunch via RJava integration is viable.
>>>>>> On my part i can try to do it within Crunch framework or we can keep
>>>>>> it completely separate.
>>>>>> -d
>>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <>
>>>>>> wrote:
>>>>>>> I am an avid R user and would be into it-- who gave the talk?
>>>>>>> Was it Murray Stokely?
>>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <>
>>>>>> wrote:
>>>>>>>> Hello,
>>>>>>>> I was pretty excited to learn of Google's experience of R
>>>>>>>> mapping of flume java on one of recent BARUGs. I think a lot
>>>>>>>> of applications similar to what we do in Mahout could be
>>>>>>>> prototyped using flume R.
>>>>>>>> I did not quite get the details of Google implementation of R
>>>>>>>> mapping, but i am not sure if just a direct mapping from R to
>>>>>>>> Crunch would be sufficient (and, for most part, efficient).
>>>>>>>> RJava/JRI and jni seem to be a pretty terrible performer to do
>>>>>>>> that directly.
>>>>>>>> on top of it, I am thinking if this project could have an
>>>>>>>> adapter to Mahout's distributed matrices, that would be just a
>>>>>>>> very good synergy.
>>>>>>>> Is there anyone interested in contributing/advising for open
>>>>>>>> version of flume R support? Just gauging interest, Crunch list
>>>>>>>> seems like a natural place to poke.
>>>>>>>> Thanks.
>>>>>>>> -Dmitriy
>>>>>>> --
>>>>>>> Director of Data Science
>>>>>>> Cloudera
>>>>>>> Twitter: @josh_wills
