incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <>
Subject Re: Crunch R first milestone
Date Sat, 17 Nov 2012 20:49:11 GMT

OK, the following commit

commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
Author: Dmitriy Lyubimov <>
Date:   Sat Nov 17 12:29:27 2012 -0800

    example1 succeeds


runs example 1 for me successfully in a fully distributed way, which is the
first step (the map-only part) of the word count.

(I think there's a hiccup somewhere here, because in the output I also seem
to see some empty lines, so the strsplit() part is perhaps set up somewhat
incorrectly, but that's not the point right now):



pipeline <- crunchR.MRPipeline$new("test-pipeline")

inputPCol <- pipeline$readTextFile("/crunchr-examples/input")

outputPCol <- inputPCol$parallelDo(
  function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
)

result <- pipeline$run()

if ( !result$succeeded() ) stop ("pipeline failed.")
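
On the empty lines in the output: one likely cause (an assumption, not a confirmed diagnosis) is that the pattern "[^[:alnum:]]" splits on every single non-alphanumeric character, so a run such as ", " yields an empty token between the comma and the space. A minimal sketch in base R of a splitter that avoids this; tokenize() is a hypothetical helper name, not part of crunchR:

```r
# Hypothetical helper (not part of crunchR): split a line into lowercase words.
tokenize <- function(line) {
  # "+" treats a run of non-alphanumeric characters as one delimiter.
  tokens <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # A delimiter at the very start of the line still yields a leading "",
  # so drop any empty tokens that remain.
  tokens[nzchar(tokens)]
}

# The original pattern keeps an empty token between "," and " ".
original <- strsplit(tolower("Hello, World"), "[^[:alnum:]]")[[1]]
print(original)
print(tokenize("Hello, World"))
```

Inside the DoFn this would amount to emitting tokenize(line) instead of the raw strsplit() result.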


I think the R-Java communication should now support multiple DoFns: they
should be executed, synchronized and shut down properly even if they emit
in the cleanup phase.

This example relies on a lot of defaults (such as the RTypes, which by default
are a singleton character vector in and a character vector out for a DoFn).
It also obviously supports only text in and text out at this point.
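
To make the default contract above concrete (one length-1 character vector in, a character vector out), here is a rough local harness in plain R that runs a DoFn over a few lines without Hadoop. Everything in it is hypothetical scaffolding: runDoFnLocally() and its stub emit() are not crunchR API, and it assumes the DoFn looks up emit() by name in its enclosing environment.

```r
# Hypothetical local stand-in for a map-only run; none of this is crunchR API.
runDoFnLocally <- function(doFn, lines) {
  collected <- character(0)
  # Local stub for emit(): append whatever the DoFn emits to the collector.
  emit <- function(values) {
    stopifnot(is.character(values))   # default output type: character vector
    collected <<- c(collected, values)
    invisible(NULL)
  }
  # Re-home the DoFn so that its call to emit() resolves to the stub above.
  environment(doFn) <- environment()
  for (line in lines) {
    # Default input type: a singleton character vector.
    stopifnot(is.character(line), length(line) == 1L)
    doFn(line)
  }
  collected
}

# The DoFn from example 1, unchanged.
splitWords <- function(line) emit(strsplit(tolower(line), "[^[:alnum:]]")[[1]])
emitted <- runDoFnLocally(splitWords, c("Hello, World", "foo bar"))
print(emitted)
```

Something like this might shorten the test cycle for DoFn logic, though it says nothing about the R-Java side.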

To run, install the package and upload the test input (
Assuming you have compiled the maven part, the R package snapshot could be
installed by running "".

You also need to make sure your backend tasks can see the JRI library. There
are multiple ways to do that, I guess, but for testing purposes the
following just works for me in my mapred-site:



I think at this point you guys could help me by reviewing this stuff,
asking questions, and suggesting how to go about incorporating other
types of DoFn and perhaps how to complete the word count example. Running
comparative benchmarks against a Java-only word count would also help, to see
how much overhead we are suffering here.

I use StatET in Eclipse. Although it is a huge step forward, the process is
still extremely tedious, since I don't know the unit testing frameworks in R
well (so I just scribble some stuff on the side to unit-test this and that)
and the integration test cycle takes a significant amount of time.

Which is why any help and suggestions are very welcome!

I will definitely add support for reading/writing sequence files and
Protobufs, as well as Mahout DRMs.

