crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Crunch R first milestone
Date Sun, 18 Nov 2012 18:13:42 GMT
Thank you, Josh. Your insights are greatly appreciated.

RProtoBuf has a bug with the <<- operator. I have already contacted the
authors and they confirmed it; however, it is not clear when they are going
to fix it.

(code to reproduce:
> library(RProtoBuf)
> a <<- "A"
causes an error)

Actually, RProtoBuf is not used right now. I will move it back into the
"recommended" realm if it makes things easier.

For me, the hardest part was actually getting the JVM + Hadoop to see the
JRI library. I am still not sure about the best course of action here, but
there is definitely more than one way.

Also, my apologies for the code style. It is probably the ugliest code I've
ever written, but I will tidy it up once it is past the proof-of-concept stage.

-d


On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <josh.wills@gmail.com> wrote:

> Dmitriy,
>
> Just sent you a pull request based on playing with the code on OS X. It
> contains a README about my experience getting things working.
>
> Unfortunately, I haven't succeeded in getting crunchR loaded, I'm running
> into some issues w/RProtoBuf on OS X. I'll give it another go this week on
> my Linux machine at work.
>
> J
>
>
> On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> > Josh,
> >
> > ok the following commit
> >
> > ==============
> > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> > Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
> > Date:   Sat Nov 17 12:29:27 2012 -0800
> >
> >     example1 succeeds
> >
> > ====================
> >
> > runs example 1 for me successfully in a fully distributed way. This is
> > the first (map-only) step of the word count.
> >
> > (I think there's a hiccup somewhere here, because in the output I also
> > seem to see some empty lines, so the strsplit() part is perhaps set up
> > somewhat incorrectly, but that's not the point right now):
> >
> > ====Example1.R===========
> >
> > library(crunchR)
> >
> > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> >
> > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
> >
> > outputPCol <- inputPCol$parallelDo(
> > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> > )
> >
> > outputPCol$writeTextFile("/crunchr-examples/output")
> >
> > result <- pipeline$run()
> >
> > if ( !result$succeeded() ) stop ("pipeline failed.")
> >
> > ========================================
> >
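On the empty lines in the output: one possible cause (an assumption, not
verified against crunchR itself) is that strsplit() with the single-character
pattern "[^[:alnum:]]" returns an empty string between two adjacent
delimiters, such as a comma followed by a space. A minimal sketch in plain R;
the tokenize() helper is hypothetical and not part of crunchR:

```r
# Hypothetical helper, not part of crunchR: split a line into lowercase
# words and drop the empty strings that strsplit() can leave behind.
tokenize <- function(line) {
  # "+" collapses runs of non-alphanumeric characters into one delimiter.
  tokens <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # A leading delimiter still yields one empty first token, so filter it out.
  tokens[nzchar(tokens)]
}

line <- "Hello, world!"

# Pattern from Example1.R: the "," and " " are split separately, which
# leaves an empty string between "hello" and "world".
print(strsplit(tolower(line), "[^[:alnum:]]")[[1]])

# With the helper, only non-empty tokens remain.
print(tokenize(line))
print(tokenize("  leading spaces"))
```

If that is indeed the cause, the DoFn in the example could become
function(line) emit( tokenize(line) ), assuming emit() takes a character
vector as it does above.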
> > I think the R-Java communication should now support multiple DoFns: they
> > will be properly executed, synchronized, and shut down even if they emit
> > in the cleanup phase.
> >
> > This example relies on a lot of defaults (such as the RTypes, which by
> > default are a character vector singleton in and a character vector out
> > for a DoFn). It also obviously supports only text in, text out at this
> > point.
> >
> >
> > To run, install the package and upload the test input (test-prep.sh).
> > Assuming you have compiled the Maven part, the R package snapshot can be
> > installed by running "install-snapshot-rpkg.sh".
> >
> > You also need to make sure your backend tasks can see the JRI library.
> > There are multiple ways to do it, I guess, but for the purposes of
> > testing, the following just works for me in my mapred-site:
> >
> > <property>
> >    <name>mapred.child.java.opts</name>
> >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
> >    <final>false</final>
> > </property>
> >
> >
> > I think at this point you guys could help me by reviewing that stuff,
> > asking questions, and making suggestions on how to go about incorporating
> > other types of DoFn, and perhaps on a way to complete the word count
> > example. It might also be worth running comparative benchmarks against a
> > Java-only word count to see how much overhead we are suffering here.
> >
> > I use StatET in Eclipse. Although it is a huge step forward, the process
> > is still extremely tedious, since I don't know the unit testing frameworks
> > in R well (so I just scribble some stuff on the side to unit-test this and
> > that) and the integration test cycle is fairly long.
> >
> > Which is why any help and suggestions are very welcome!
> >
> > I will definitely add support for reading/writing sequence files and
> > Protobufs, as well as Mahout DRMs.
> >
> >
> > Thanks.
> > -Dmitriy
> >
>
