incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Crunch R first milestone
Date Sun, 18 Nov 2012 19:28:49 GMT
Ah, indeed, I see it there. OK, I will add it to the POM repositories, thanks.
On Nov 18, 2012 11:20 AM, "Josh Wills" <jwills@cloudera.com> wrote:

> On Sun, Nov 18, 2012 at 11:08 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> > Question: is the Crunch 0.4.0 release available through a Maven repository?
> > How have you installed it into your local repo?
> >
>
> It should be-- I think Matthias published the Maven artifacts on Friday. Of
> course, I might have just had it installed locally b/c I was testing the
> release. :)
>
>
> >
> >
> > On Sun, Nov 18, 2012 at 10:30 AM, Josh Wills <jwills@cloudera.com> wrote:
> >
> > > On Sun, Nov 18, 2012 at 10:13 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > > > Thank you, Josh. Your insights are greatly appreciated.
> > > >
> > > > > RProtoBuf has a bug with the <<- operator. I have already contacted the
> > > > > authors and they confirmed it; however, it is not clear when they are
> > > > > going to fix it.
> > > >
> > > > (code to reproduce:
> > > >> library(RProtoBuf)
> > > >> a <<- "A"
> > > > causes an error)
> > > >
> > > > > Actually, RProtoBuf is not used right now. I will move it into the
> > > > > "recommended" realm again if it makes things easier.
> > > >
> > > > > For me, the hardest part was actually getting the JVM + Hadoop to see
> > > > > the JRI library. I am still not sure about the best course of action
> > > > > here, but there is definitely more than one way.
> > > >
> > > > > Also, my apologies for the code styling; it is probably the ugliest code
> > > > > I've ever written, but I will tidy it up once past the proof-of-concept
> > > > > stage.
> > >
> > > No judgements, man. You should have seen the first rev of Crunch. ;-)
> > >
> > > >
> > > > -d
> > > >
> > > >
> > > > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <josh.wills@gmail.com>
> > > wrote:
> > > >
> > > > >> Dmitriy,
> > > > >>
> > > > >> Just sent you a pull request based on playing with the code on OS X. It
> > > > >> contains a README about my experience getting things working.
> > > > >>
> > > > >> Unfortunately, I haven't succeeded in getting crunchR loaded; I'm running
> > > > >> into some issues w/RProtoBuf on OS X. I'll give it another go this week on
> > > > >> my Linux machine at work.
> > > >>
> > > >> J
> > > >>
> > > >>
> > > > >> On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > >>
> > > >> > Josh,
> > > >> >
> > > >> > ok the following commit
> > > >> >
> > > >> > ==============
> > > >> > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> > > >> > Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
> > > >> > Date:   Sat Nov 17 12:29:27 2012 -0800
> > > >> >
> > > >> >     example1 succeeds
> > > >> >
> > > >> > ====================
> > > >> >
> > > > >> > runs example 1 for me successfully in a fully distributed way, which is
> > > > >> > the first step (the map-only part) of the word count.
> > > >> >
> > > > >> > (I think there's a hiccup somewhere here, because in the output I also
> > > > >> > seem to see some empty lines, so the strsplit() part is perhaps set up
> > > > >> > somewhat incorrectly, but that's not the point right now):
> > > >> >
> > > >> > ====Example1.R===========
> > > >> >
> > > >> > library(crunchR)
> > > >> >
> > > >> > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> > > >> >
> > > >> > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
> > > >> >
> > > > >> > outputPCol <- inputPCol$parallelDo(
> > > > >> >   function(line) emit( strsplit(tolower(line), "[^[:alnum:]]")[[1]] )
> > > > >> > )
> > > >> >
> > > >> > outputPCol$writeTextFile("/crunchr-examples/output")
> > > >> >
> > > >> > result <- pipeline$run()
> > > >> >
> > > >> > if ( !result$succeeded() ) stop ("pipeline failed.")
> > > >> >
> > > >> > ========================================
> > > >> >
> > > > >> > I think the R-Java communication should now support multiple DoFns, and
> > > > >> > they will be properly executed, synchronized, and shut down even if they
> > > > >> > emit in the cleanup phase.
> > > >> >
> > > > >> > This example assumes a lot of defaults (such as RTypes, which by default
> > > > >> > are a character vector singleton in and a character vector out for a
> > > > >> > DoFn). It also obviously uses text in, text out only at this point.
> > > >> >
> > > >> >
> > > > >> > To run, install the package and upload the test input (test-prep.sh).
> > > > >> > Assuming you have compiled the Maven part, the R package snapshot can be
> > > > >> > installed by running "install-snapshot-rpkg.sh".
> > > >> >
> > > > >> > You also need to make sure your backend tasks can see the JRI library.
> > > > >> > There are multiple ways to do it, I guess, but for the purposes of
> > > > >> > testing the following just works for me in my mapred-site:
> > > >> >
> > > > >> > <property>
> > > > >> >    <name>mapred.child.java.opts</name>
> > > > >> >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
> > > > >> >    <final>false</final>
> > > > >> > </property>
> > > >> >
> > > >> >
> > > > >> > I think at this point you guys might help me by reviewing that stuff,
> > > > >> > asking questions, and suggesting how to go about incorporating other
> > > > >> > types of DoFn, and perhaps a way to complete the word count example,
> > > > >> > perhaps running comparative benchmarks against a Java-only word count to
> > > > >> > see how much overhead we seem to be suffering here.
> > > >> >
> > > > >> > I use StatET in Eclipse. Although it is a huge step forward, the process
> > > > >> > is still extremely tedious, since I don't know the unit testing
> > > > >> > frameworks in R well (so I just scribble some stuff on the side to
> > > > >> > unit-test this and that) and the integration test cycle is significant
> > > > >> > enough.
> > > >> >
> > > >> > Which is why any help and suggestions are very welcome!
> > > >> >
> > > > >> > I will definitely add support for reading/writing sequence files and
> > > > >> > Protobufs, as well as Mahout DRMs.
> > > >> >
> > > >> >
> > > >> > Thanks.
> > > > >> > -Dmitriy
> > > >> >
> > > >>
> > >
> > >
> > >
> > > --
> > > Director of Data Science
> > > Cloudera
> > > Twitter: @josh_wills
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
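
A note on the empty lines mentioned alongside Example1.R above: one plausible cause (an assumption, not confirmed anywhere in the thread) is that strsplit() with the pattern "[^[:alnum:]]" splits on every single non-alphanumeric character, so two adjacent separators such as ", " leave an empty token between them. The sketch below uses only base R; tokenize() is a hypothetical helper name, not part of crunchR.

```r
# Hypothetical helper (not part of crunchR): split a line into lowercase
# words and drop any empty tokens that strsplit() leaves behind.
tokenize <- function(line) {
  # "+" treats a run of separators as one split point
  tokens <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # a separator at the start of the line still yields a leading "",
  # so filter out zero-length strings as well
  tokens[nzchar(tokens)]
}

# The pattern from Example1.R: "," and " " are adjacent, so an empty
# token appears between "hello" and "world".
naive <- strsplit(tolower("Hello, world"), "[^[:alnum:]]")[[1]]
print(naive)

# With the hypothetical helper the empty token is gone.
print(tokenize("Hello, world"))
print(tokenize("  leading space"))
```

Inside the DoFn from Example1.R, the result of such a helper would presumably be passed to emit() in place of the raw strsplit() output; that wiring is untested here.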

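
A note on the mapred.child.java.opts setting quoted above: the JRI directory is hard-coded there to one user's R library. As a minimal sketch, assuming rJava keeps its JRI native library in the "jri" directory of the installed package (which matches the path in the quoted config), the option string could be derived from R itself. jri_java_opt() is a hypothetical helper name, and the path in the usage line is a placeholder.

```r
# Hypothetical helper: build the JVM option that points java.library.path
# at a JRI directory. By default it asks R where rJava's "jri" directory
# is; system.file() returns "" when the package or directory is missing.
jri_java_opt <- function(jri_dir = system.file("jri", package = "rJava")) {
  if (!nzchar(jri_dir)) {
    stop("could not locate the JRI directory; is rJava installed?")
  }
  paste0("-Djava.library.path=", jri_dir)
}

# Usage with an explicit (placeholder) directory, so that this runs
# even on a machine without rJava:
cat(jri_java_opt("/path/to/R/library/rJava/jri"), "\n")
```

The printed string is what would go into the <value> element of the property; the path has to be valid on every task node, not only on the machine where the helper is run.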