incubator-crunch-dev mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Crunch R first milestone
Date Sun, 18 Nov 2012 17:37:38 GMT
Dmitriy,

Just sent you a pull request based on playing with the code on OS X. It
contains a README about my experience getting things working.

Unfortunately, I haven't succeeded in getting crunchR loaded; I'm running
into some issues with RProtoBuf on OS X. I'll give it another go this week on
my Linux machine at work.

J


On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> Josh,
>
> ok the following commit
>
> ==============
> commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
> Date:   Sat Nov 17 12:29:27 2012 -0800
>
>     example1 succeeds
>
> ====================
>
> runs example 1 for me successfully in a fully distributed way, which is the
> first step (the map-only part) of the word count.
>
> (I think there's a hiccup somewhere, because the output also seems to
> contain some empty lines, so the strsplit() part is perhaps set up somewhat
> incorrectly, but that's not the point right now):
>
> ====Example1.R===========
>
> library(crunchR)
>
> pipeline <- crunchR.MRPipeline$new("test-pipeline")
>
> inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
>
> outputPCol <- inputPCol$parallelDo(
> function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> )
>
> outputPCol$writeTextFile("/crunchr-examples/output")
>
> result <- pipeline$run()
>
> if ( !result$succeeded() ) stop ("pipeline failed.")
>
> ========================================
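
The empty lines in the output of Example1.R are consistent with how strsplit() treats adjacent separators: splitting on a single non-alphanumeric character yields an empty string between two consecutive separators. A minimal base-R sketch, assuming the empty lines come from those empty tokens (the `tokenize` helper is hypothetical, not part of crunchR):

```r
# Hypothetical helper, not part of crunchR: split a line into lowercase words.
tokenize <- function(line) {
  # "+" collapses a run of separators into a single split point.
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # A leading separator still yields one "" token, so drop empty strings.
  words[nzchar(words)]
}

line <- "Hello, world!  Crunch R"
raw <- strsplit(tolower(line), "[^[:alnum:]]")[[1]]  # pattern used in Example1.R
clean <- tokenize(line)

print(raw)    # contains "" entries where separators are adjacent
print(clean)  # words only
```

Whether emit() accepts a zero-length vector (for a line with no words) is not something this sketch checks; that would need testing against crunchR itself.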
>
> I think the R-Java communication should now support multiple DoFns, and
> they will be properly executed, synchronized, and shut down even if they
> emit in the cleanup phase.
>
> This example assumes a lot of defaults (such as RTypes, which for a DoFn
> default to a character vector singleton in and a character vector out). It
> also obviously supports only text in, text out at this point.
>
>
> To run, install the package and upload the test input (test-prep.sh).
> Assuming you have compiled the Maven part, the R package snapshot can be
> installed by running "install-snapshot-rpkg.sh".
>
> You also need to make sure your backend tasks can see the JRI library. There
> are multiple ways to do it, I guess, but for testing purposes the following
> just works for me in my mapred-site:
>
> <property>
>    <name>mapred.child.java.opts</name>
>    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
>    <final>false</final>
> </property>
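
To find the value for java.library.path on a given node, the rJava install directory can be queried from R. A small sketch, assuming rJava was installed with JRI enabled; `jri_java_opt` is a hypothetical helper name used only for illustration:

```r
# Hypothetical helper: build the JVM option string from a JRI directory.
jri_java_opt <- function(jri_dir) {
  paste0("-Djava.library.path=", jri_dir)
}

# system.file() returns "" when the package or the subdirectory is not found.
jri_dir <- system.file("jri", package = "rJava")

if (nzchar(jri_dir)) {
  cat(jri_java_opt(jri_dir), "\n")
} else {
  cat("rJava/jri not found in this R library\n")
}
```

The printed option is what would go into the `<value>` element above; the path has to be valid on every task node, not just on the machine where this is run.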
>
>
> I think at this point you guys might help me by reviewing this stuff,
> asking questions, and suggesting how to go about incorporating other
> types of DoFn, and perhaps a way to complete the word count example. Running
> comparative benchmarks against a Java-only word count would also show how
> much overhead we are suffering here.
>
> I use StatET in Eclipse. Although it is a huge step forward, the process is
> still extremely tedious, since I don't know the unit testing frameworks in R
> well (so I just scribble some stuff on the side to unit-test this and that)
> and the integration test cycle takes significant time.
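
For unit tests on the R side, testthat is one commonly used framework. A minimal sketch, assuming testthat is installed, and testing a hypothetical `tokenize` helper (not part of crunchR) in place of real DoFn code:

```r
# Hypothetical function under test: split a line into lowercase words.
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]  # drop empty tokens
}

# Run the tests only when testthat is available in this R library.
if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  test_that("tokenize lowercases and drops empty tokens", {
    expect_equal(tokenize("Hello, world!"), c("hello", "world"))
    expect_length(tokenize(""), 0)
  })
}
```

Keeping the function passed to parallelDo() as a named, plain R function like this lets it be tested locally without going through the integration cycle.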
>
> Which is why any help and suggestions are very welcome!
>
> I will definitely add support for reading/writing sequence files and
> Protobufs, as well as Mahout DRMs.
>
>
> Thanks.
> -Dmitriy
>
