incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Crunch R first milestone
Date Sun, 18 Nov 2012 19:08:33 GMT
Question: is the Crunch 0.4.0 release available through a Maven repository?
How did you install it into your local repo?


On Sun, Nov 18, 2012 at 10:30 AM, Josh Wills <jwills@cloudera.com> wrote:

> On Sun, Nov 18, 2012 at 10:13 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > Thank you, Josh. Your insights are greatly appreciated.
> >
> > RProtoBuf has a bug with the <<- operator. I already contacted the authors
> > and they confirmed it; however, it is not clear when they are going to
> > fix it.
> >
> > (code to reproduce:
> >> library(RProtoBuf)
> >> a <<- "A"
> > causes an error)
> >
> > Actually, RProtoBuf is not used right now. I will move it back into the
> > "recommended" realm if that makes things easier.
> >
> > For me, the hardest part was actually getting the JVM + Hadoop to see the
> > JRI library. I am still not sure about the best course of action here, but
> > there is definitely more than one way.
> >
> > Also, my apologies for the code styling. It is probably the ugliest code I've
> > ever written, but I will tidy it up once past the proof-of-concept stage.
>
> No judgements, man. You should have seen the first rev of Crunch. ;-)
>
> >
> > -d
> >
> >
> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <josh.wills@gmail.com> wrote:
> >
> >> Dmitriy,
> >>
> >> Just sent you a pull request based on playing with the code on OS X. It
> >> contains a README about my experience getting things working.
> >>
> >> Unfortunately, I haven't succeeded in getting crunchR loaded; I'm running
> >> into some issues with RProtoBuf on OS X. I'll give it another go this week
> >> on my Linux machine at work.
> >>
> >> J
> >>
> >>
> >> On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >>
> >> > Josh,
> >> >
> >> > ok the following commit
> >> >
> >> > ==============
> >> > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> >> > Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
> >> > Date:   Sat Nov 17 12:29:27 2012 -0800
> >> >
> >> >     example1 succeeds
> >> >
> >> > ====================
> >> >
> >> > runs example 1 successfully for me in a fully distributed way, which is
> >> > the first step (the map-only part) of the word count.
> >> >
> >> > (I think there's a hiccup somewhere here, because in the output I also
> >> > seem to see some empty lines, so the strsplit() part is perhaps set up
> >> > somewhat incorrectly, but that's not the point right now):
> >> >
> >> > ====Example1.R===========
> >> >
> >> > library(crunchR)
> >> >
> >> > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> >> >
> >> > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
> >> >
> >> > outputPCol <- inputPCol$parallelDo(
> >> > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> >> > )
> >> >
> >> > outputPCol$writeTextFile("/crunchr-examples/output")
> >> >
> >> > result <- pipeline$run()
> >> >
> >> > if ( !result$succeeded() ) stop ("pipeline failed.")
> >> >
> >> > ========================================
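
On the empty lines mentioned above: one plausible cause is that the pattern "[^[:alnum:]]" treats every single non-alphanumeric character as a separator, so adjacent separators (a comma followed by a space, say) yield empty strings that then get emitted. A minimal base-R sketch of that effect and one possible fix follows. The `tokenize` helper and the sample line are made up for illustration and are not part of crunchR.

```r
# Illustration only: `tokenize` is a hypothetical helper, not part of crunchR.
line <- "Hello, world!  It's 2012."

# Pattern from Example1.R: each non-alphanumeric character is its own
# separator, so adjacent separators (", " or "  ") produce empty strings.
naive <- strsplit(tolower(line), "[^[:alnum:]]")[[1]]

# Treat a run of separators as a single delimiter, then drop any remaining
# empty strings (for example, the one produced by a leading separator).
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]
}

print(naive)
print(tokenize(line))
```

If this is indeed the cause, passing `tokenize(line)` to emit() inside the DoFn should avoid the empty output lines, though that has not been checked against crunchR itself.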
> >> >
> >> > I think the R-Java communication should now support multiple DoFns, and
> >> > they will be properly executed, synchronized, and shut down even if they
> >> > emit in the cleanup phase.
> >> >
> >> > This example assumes a lot of defaults (such as RTypes, which by default
> >> > are a character vector singleton in and a character vector out for a
> >> > DoFn). It also obviously uses only text in and text out at this point.
> >> >
> >> >
> >> > To run, install the package and upload the test input (test-prep.sh).
> >> > Assuming you have compiled the Maven part, the R package snapshot can be
> >> > installed by running "install-snapshot-rpkg.sh".
> >> >
> >> > You also need to make sure your backend tasks see the JRI library. There
> >> > are multiple ways to do it, I guess, but for the purposes of testing the
> >> > following just works for me in my mapred-site:
> >> >
> >> > <property>
> >> >    <name>mapred.child.java.opts</name>
> >> >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
> >> >    <final>false</final>
> >> > </property>
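
The path in that property is specific to one machine. As a rough sketch, the value could be derived from wherever rJava is installed instead, assuming rJava was built with JRI so that its installed package contains a "jri" directory. `jri_java_opts` is a hypothetical helper written for this note, not something crunchR or rJava provides.

```r
# Hypothetical helper (not part of crunchR or rJava): build the JVM option
# that points java.library.path at a JRI directory. By default it asks R
# where the installed rJava package keeps its "jri" directory.
jri_java_opts <- function(jri_dir = system.file("jri", package = "rJava")) {
  # system.file() returns "" when the package or directory is not found.
  if (!nzchar(jri_dir)) {
    stop("JRI directory not found; is rJava installed with JRI support?")
  }
  paste0("-Djava.library.path=", jri_dir)
}

# With an explicit, made-up directory:
opts <- jri_java_opts("/path/to/R/library/rJava/jri")
print(opts)
```

Calling `jri_java_opts()` with no argument on a task node would give the string to put in the <value> element, provided R and rJava live at the same location on every node.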
> >> >
> >> >
> >> > I think at this point you guys could help me by reviewing this stuff,
> >> > asking questions, and suggesting how to go about incorporating other
> >> > types of DoFn and perhaps how to complete the word count example, and
> >> > perhaps by running comparative benchmarks against a Java-only word count
> >> > to see how much overhead we are suffering here.
> >> >
> >> > I use StatET in Eclipse. Although it is a huge step forward, the process
> >> > is still extremely tedious, since I don't know the unit testing
> >> > frameworks in R well (so I just scribble some stuff on the side to
> >> > unit-test this and that) and the integration test cycle is significant
> >> > enough.
> >> >
> >> > Which is why any help and suggestions are very welcome!
> >> >
> >> > I will definitely add support for reading/writing sequence files and
> >> > Protobufs, as well as Mahout DRMs.
> >> >
> >> >
> >> > Thanks.
> >> > -Dmitriy
> >> >
> >>
>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
>
