incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Crunch R first milestone
Date Thu, 22 Nov 2012 22:13:10 GMT
OK, I guess I am going to work on the next milestone, which is PTableType
serialization support between the R and Java sides.

Once I am done with that, I guess I will be able to add the other APIs and
complete the word count example fairly easily.

Example1.R in its current state works.


On Wed, Nov 21, 2012 at 12:11 PM, Josh Wills <jwills@cloudera.com> wrote:

> I'm going to play with this again over the break-- BTW, did you see Renjin?
> I somehow missed this, but it looks interesting.
>
> http://code.google.com/p/renjin/
>
>
> On Sun, Nov 18, 2012 at 11:44 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <josh.wills@gmail.com> wrote:
> >
> > > Dmitriy,
> > >
> > > Just sent you a pull request based on playing with the code on OS X. It
> > > contains a README about my experience getting things working.
> > >
> >
> > Are you sure it is the doxygen package? I thought it was the roxygen2
> > package.
> >
> > Actually, there currently seems to be no established best practice for
> > R5 classes + roxygen2 (and the guy ignores @import order of files, too).
> > Hence the hacks with file names.
> >
> >
> > > Unfortunately, I haven't succeeded in getting crunchR loaded, I'm
> > > running into some issues w/RProtoBuf on OS X. I'll give it another go
> > > this week on my Linux machine at work.
> > >
> > OK, I removed @import RProtoBuf, so you should be able to install w/o it.
> > Maven still compiles the protoc stuff, though.
> >
> > >
> > > J
> > >
> > >
> > > On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > >
> > > > Josh,
> > > >
> > > > ok the following commit
> > > >
> > > > ==============
> > > > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> > > > Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
> > > > Date:   Sat Nov 17 12:29:27 2012 -0800
> > > >
> > > >     example1 succeeds
> > > >
> > > > ====================
> > > >
> > > > runs example 1 for me successfully in a fully distributed way, which is
> > > > the first step (the map-only part) of the word count.
> > > >
> > > > (I think there's a hiccup somewhere here, because in the output I also
> > > > seem to see some empty lines, so the strsplit() part is perhaps set up
> > > > somewhat incorrectly, but that's not the point right now):
> > > >
> > > > ====Example1.R===========
> > > >
> > > > library(crunchR)
> > > >
> > > > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> > > >
> > > > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
> > > >
> > > > outputPCol <- inputPCol$parallelDo(
> > > > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> > > > )
> > > >
> > > > outputPCol$writeTextFile("/crunchr-examples/output")
> > > >
> > > > result <- pipeline$run()
> > > >
> > > > if ( !result$succeeded() ) stop ("pipeline failed.")
> > > >
> > > > ========================================
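
On the empty lines mentioned above: one plausible cause (an assumption, not verified against crunchR) is that splitting on a single non-alphanumeric character yields an empty string whenever two delimiters are adjacent, such as a comma followed by a space. A minimal base-R sketch of the effect and one possible fix, where `tokenize` is a hypothetical helper name:

```r
# Base R only; no crunchR calls. `tokenize` is a hypothetical helper name.
line <- "Hello, world!"

# Pattern from Example1.R: every single delimiter character splits, so ", "
# leaves an empty token between "hello" and "world".
naive <- strsplit(tolower(line), "[^[:alnum:]]")[[1]]

# Treat a run of delimiters as one separator, then drop any empty token
# left by a delimiter at the very start of the line.
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]
}

cleaned <- tokenize(line)
```

Inside the DoFn this would read `function(line) emit(tokenize(line))`. Whether emit() accepts a zero-length vector for blank input lines is not something this thread confirms.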
> > > >
> > > > I think the R-Java communication should now support multiple DoFns, and
> > > > they will be properly executed, synchronized, and shut down even if they
> > > > emit in the cleanup phase.
> > > >
> > > > This example assumes a lot of defaults (such as RTypes, which by default
> > > > are a singleton character vector in and a character vector out for a
> > > > DoFn). It also obviously uses text in, text out only at this point.
> > > >
> > > >
> > > > To run, install the package and upload the test input (test-prep.sh).
> > > > Assuming you have compiled the Maven part, the R package snapshot can be
> > > > installed by running "install-snapshot-rpkg.sh".
> > > >
> > > > You also need to make sure your backend tasks can see the JRI library.
> > > > There are multiple ways to do it, I guess, but for the purposes of
> > > > testing, the following just works for me in my mapred-site:
> > > >
> > > > <property>
> > > >    <name>mapred.child.java.opts</name>
> > > >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
> > > >    <final>false</final>
> > > > </property>
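
The JRI directory in the value above is where rJava happened to be installed on that machine. A small sketch for finding the equivalent path on another node, assuming rJava is installed there and keeps its `jri` directory inside the package (as the path above suggests):

```r
# Locate the jri directory inside the installed rJava package.
# system.file() returns "" when the package or directory is not found.
jri_dir <- system.file("jri", package = "rJava")

if (nzchar(jri_dir)) {
  # Print the JVM option to place in mapred.child.java.opts.
  cat(sprintf("-Djava.library.path=%s\n", jri_dir))
} else {
  message("rJava (or its jri directory) was not found in this R library")
}
```

This only reports the path for the R library visible to the current user; each task node would need rJava at the path that ends up in the config.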
> > > >
> > > >
> > > > I think at this point you guys could help me by reviewing this stuff,
> > > > asking questions, and suggesting how to go about incorporating other
> > > > types of DoFn and perhaps a way to complete the word count example, and
> > > > perhaps by running comparative benchmarks against a Java-only word count
> > > > to see how much overhead we are suffering here.
> > > >
> > > > I use StatET in Eclipse. Although it is a huge step forward, the process
> > > > is still extremely tedious, since I don't know the unit testing
> > > > frameworks in R well (so I just scribble some stuff on the side to
> > > > unit-test this and that) and the integration test cycle is long enough.
> > > >
> > > > Which is why any help and suggestions are very welcome!
> > > >
> > > > I will definitely add support for reading/writing sequence files and
> > > > Protobufs, as well as Mahout DRMs.
> > > >
> > > >
> > > > Thanks.
> > > > -Dmitriy
> > > >
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
