crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Crunch R first milestone
Date Sun, 18 Nov 2012 19:34:06 GMT
oh, it is already in Maven Central.. cool


On Sun, Nov 18, 2012 at 11:28 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> ah, indeed i see it there, ok i will add it to pom repos, thanks
> On Nov 18, 2012 11:20 AM, "Josh Wills" <jwills@cloudera.com> wrote:
>
>> On Sun, Nov 18, 2012 at 11:08 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>> > Question: is the Crunch 0.4.0 release available thru a maven repository?
>> > How have you installed it into your local repo?
>> >
>>
>> It should be-- I think Matthias published the Maven artifacts on Friday. Of
>> course, I might have just had it installed locally b/c I was testing the
>> release. :)
>>
>>
>> >
>> >
>> > On Sun, Nov 18, 2012 at 10:30 AM, Josh Wills <jwills@cloudera.com> wrote:
>> >
>> > > On Sun, Nov 18, 2012 at 10:13 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> > > > Thank you, Josh. Your insights are greatly appreciated.
>> > > >
>> > > > RProtoBuf has a bug with the <<- operator. I already contacted the authors
>> > > > and they confirmed it; however, it is not clear when they are going to
>> > > > fix it.
>> > > >
>> > > > (code to reproduce:
>> > > > > library(RProtoBuf)
>> > > > > a <<- "A"
>> > > > causes an error)
>> > > >
>> > > > Actually, RProtoBuf is not used right now. I will move it into the
>> > > > "recommended" realm again if it makes things easier.
>> > > >
>> > > > For me, the hardest part was actually getting the JVM + Hadoop to see the
>> > > > JRI library. I am still not sure about the best course of action here, but
>> > > > there is definitely more than one way.
>> > > >
>> > > > Also, my apologies for the code styling; it is probably the ugliest code
>> > > > I've ever written, but I will tidy it up once past the proof-of-concept
>> > > > stage.
>> > >
>> > > No judgements, man. You should have seen the first rev of Crunch. ;-)
>> > >
>> > > >
>> > > > -d
>> > > >
>> > > >
>> > > > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <josh.wills@gmail.com> wrote:
>> > > >
>> > > >> Dmitriy,
>> > > >>
>> > > >> Just sent you a pull request based on playing with the code on OS X. It
>> > > >> contains a README about my experience getting things working.
>> > > >>
>> > > >> Unfortunately, I haven't succeeded in getting crunchR loaded, I'm running
>> > > >> into some issues w/RProtoBuf on OS X. I'll give it another go this week on
>> > > >> my Linux machine at work.
>> > > >>
>> > > >> J
>> > > >>
>> > > >>
>> > > >> On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> > > >>
>> > > >> > Josh,
>> > > >> >
>> > > >> > ok the following commit
>> > > >> >
>> > > >> > ==============
>> > > >> > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
>> > > >> > Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
>> > > >> > Date:   Sat Nov 17 12:29:27 2012 -0800
>> > > >> >
>> > > >> >     example1 succeeds
>> > > >> >
>> > > >> > ====================
>> > > >> >
>> > > >> > runs example 1 for me successfully in a fully distributed way, which is
>> > > >> > the first step (a map-only thing) for the word count.
>> > > >> >
>> > > >> > (I think there's a hiccup somewhere here, because in the output I also seem
>> > > >> > to see some empty lines, so the strsplit() part is perhaps set up somewhat
>> > > >> > incorrectly here, but that's not the point right now):
>> > > >> >
>> > > >> > ====Example1.R===========
>> > > >> >
>> > > >> > library(crunchR)
>> > > >> >
>> > > >> > pipeline <- crunchR.MRPipeline$new("test-pipeline")
>> > > >> >
>> > > >> > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
>> > > >> >
>> > > >> > outputPCol <- inputPCol$parallelDo(
>> > > >> >   function(line) emit(strsplit(tolower(line), "[^[:alnum:]]")[[1]])
>> > > >> > )
>> > > >> >
>> > > >> > outputPCol$writeTextFile("/crunchr-examples/output")
>> > > >> >
>> > > >> > result <- pipeline$run()
>> > > >> >
>> > > >> > if ( !result$succeeded() ) stop ("pipeline failed.")
>> > > >> >
>> > > >> > ========================================
>> > > >> >
>> > > >> > I think R-Java communication should now support multiple DoFns OK, and they
>> > > >> > will be properly shut down and executed and synchronized even if they emit
>> > > >> > in the cleanup phase.
>> > > >> >
>> > > >> > This example assumes a lot of defaults (such as RTypes, which by default are
>> > > >> > a character vector singleton in and a character vector out for a DoFn). Also,
>> > > >> > it obviously uses text in, text out only at this point.
>> > > >> >
>> > > >> >
>> > > >> > To run, install the package and upload the test input (test-prep.sh).
>> > > >> > Assuming you have compiled the maven part, the R package snapshot can be
>> > > >> > installed by running "install-snapshot-rpkg.sh".
>> > > >> >
>> > > >> > You also need to make sure your backend tasks see the JRI library. There are
>> > > >> > multiple ways to do it, I guess, but for the purposes of testing the
>> > > >> > following just works for me in my mapred-site:
>> > > >> >
>> > > >> > <property>
>> > > >> >    <name>mapred.child.java.opts</name>
>> > > >> >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
>> > > >> >    <final>false</final>
>> > > >> > </property>
>> > > >> >
>> > > >> >
>> > > >> > I think at this point you guys might help me by reviewing that stuff,
>> > > >> > asking questions, and making suggestions on how to go about incorporating
>> > > >> > other types of doFn and perhaps a way to complete the word count example,
>> > > >> > perhaps running comparative benchmarks against a java-only word count to
>> > > >> > see how much overhead we seem to be suffering here.
>> > > >> >
>> > > >> > I use StatET in Eclipse. Although it is a huge step forward, the process is
>> > > >> > still extremely tedious since I don't know the unit testing frameworks in R
>> > > >> > well (so I just scribble some stuff on the side to unit-test this and that),
>> > > >> > and the integration test running cycle is significant enough.
>> > > >> >
>> > > >> > Which is why any help and suggestions are very welcome!
>> > > >> >
>> > > >> > I will definitely add support for reading/writing sequence files and
>> > > >> > Protobufs, as well as Mahout DRMs.
>> > > >> >
>> > > >> >
>> > > >> > Thanks.
>> > > >> > -Dmitriy
>> > > >> >
>> > > >>
>> > >
>> > >
>> > >
>> > > --
>> > > Director of Data Science
>> > > Cloudera
>> > > Twitter: @josh_wills
>> > >
>> >
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
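
On the empty lines mentioned alongside Example1.R in the quoted message: one plausible cause (an assumption, not something confirmed in the thread) is that the pattern "[^[:alnum:]]" matches a single character, so two adjacent separators such as ", " leave an empty string between them. A minimal base-R sketch of a tokenizer that avoids this is below; split_words is a hypothetical helper name, and it does not use crunchR, so it can be tried in a plain R session.

```r
# Hypothetical helper (not part of crunchR): split a line into lowercase
# words without producing empty tokens.
split_words <- function(line) {
  # "+" treats a run of non-alphanumeric characters as one separator.
  tokens <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # A separator at the very start of the line still yields a leading "",
  # so drop any zero-length tokens explicitly.
  tokens[nzchar(tokens)]
}

# The pattern from Example1.R, for comparison:
strsplit(tolower("Hello, world"), "[^[:alnum:]]")[[1]]
# [1] "hello" ""      "world"

split_words("Hello, world")
# [1] "hello" "world"
```

If this is indeed the cause, the function passed to parallelDo in Example1.R could call emit(split_words(line)) instead of emitting the raw strsplit() result.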
