incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Crunch R first milestone
Date Thu, 22 Nov 2012 00:51:21 GMT
Nope. Just looked thru the slides.

I doubt, though, that it has reached a significant level of compatibility
with packages on CRAN, especially those heavily invested in C/Rcpp code.
And even if it did, I still don't see how it could avoid using JNI for that
stuff, as well as loading some native libraries (though it could probably
support that more elegantly than JRI does out of the box -- but then JRI
could easily be modified to do the same).

I have been doing Java since 1998 (and C stuff even longer) and have
accumulated a fairly big gripe against some aspects of the JVM, so "use of
state of the art JVM and GC" doesn't actually sound like such a big merit
to me (hopefully my future employers are not reading this, since I sell
my Java skills as a major part of my skill portfolio :). I won't go
into details, but my fascination with R stems in part from its highly
performant native solvers. I really think they would outdo some, if not
most, of our block solver code in Java, given a chance to parallelize them
the same way.

R is not without problems. R is pretty, should I say, quirky as a language,
and multithreading support is basically non-existent for practical
purposes. If we are looking for smart language semantics, I'd be the first
to admit R is not the place to look. But, much like Matlab, R is largely
not used for superfashionable language semantics; it's used for something
else. To me, the value of that something lies largely in its wide range of
available methods and its dataset-oriented data types and operations (i.e.
it is very conducive to rapid prototyping and to hiding complexity in the
ML world). Sorry for reiterating these well-beaten points.

Renjin seems to promise to cure some of those OOO ills (not really that
important to me on the ML side of things), but my concern is that it has a
long way to go to match native R on both the compatibility and performance
fronts (which are important to me). Just being able to load and run embedded
scripts over the base package without making JNI calls is not exactly what
I am after in this exercise. I am looking for unconstrained use. Of course,
that's only a first impression, and of course I could be wrong.

I will be happy to hear your thoughts.


On Wed, Nov 21, 2012 at 12:11 PM, Josh Wills <jwills@cloudera.com> wrote:

> I'm going to play with this again over the break-- BTW, did you see Renjin?
> I somehow missed this, but it looks interesting.
>
> http://code.google.com/p/renjin/
>
>
> On Sun, Nov 18, 2012 at 11:44 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills <josh.wills@gmail.com> wrote:
> >
> > > Dmitriy,
> > >
> > > Just sent you a pull request based on playing with the code on OS X. It
> > > contains a README about my experience getting things working.
> > >
> >
> > Are you sure it is the doxygen package? I thought it was the roxygen2
> > package.
> >
> > Actually, there currently seems to be no established best practice for
> > R5 classes + roxygen2 (and it ignores the @import order of files, too).
> > Hence the hacks with file names.
> >
> > > Unfortunately, I haven't succeeded in getting crunchR loaded; I'm
> > > running into some issues w/RProtoBuf on OS X. I'll give it another go
> > > this week on my Linux machine at work.
> > >
> > OK, I removed @import RProtoBuf, so you should be able to install
> > without it. Maven still compiles the protoc stuff, though.
> >
> > >
> > > J
> > >
> > >
> > > On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > >
> > > > Josh,
> > > >
> > > > ok the following commit
> > > >
> > > > ==============
> > > > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> > > > Author: Dmitriy Lyubimov <dlyubimov@inadco.com>
> > > > Date:   Sat Nov 17 12:29:27 2012 -0800
> > > >
> > > >     example1 succeeds
> > > >
> > > > ====================
> > > >
> > > > runs example 1 successfully for me in a fully distributed way, which
> > > > is the first step (the map-only part) of the word count.
> > > >
> > > > (I think there's a hiccup somewhere here, because the output also
> > > > seems to contain some empty lines, so the strsplit() part is perhaps
> > > > set up somewhat incorrectly, but that's not the point right now):
> > > >
> > > > ====Example1.R===========
> > > >
> > > > library(crunchR)
> > > >
> > > > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> > > >
> > > > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
> > > >
> > > > outputPCol <- inputPCol$parallelDo(
> > > > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> > > > )
> > > >
> > > > outputPCol$writeTextFile("/crunchr-examples/output")
> > > >
> > > > result <- pipeline$run()
> > > >
> > > > if ( !result$succeeded() ) stop ("pipeline failed.")
> > > >
> > > > ========================================
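
On the empty entries mentioned before Example1.R: in R, strsplit() with the
pattern "[^[:alnum:]]" treats every single non-alphanumeric character as a
separator, so a comma followed by a space produces an empty token between
them. Below is a minimal sketch of one possible fix in plain R (no cluster
needed). The name tokenize is a hypothetical helper, not part of crunchR.

```r
# Hypothetical helper (not part of crunchR): split a line into lowercase words.
tokenize <- function(line) {
  # "+" collapses a run of separators into a single split point.
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  # A separator at the start of the line still yields a leading "",
  # so drop any empty strings that remain.
  words[nzchar(words)]
}

# Pattern from Example1.R versus the adjusted one, on the same input.
original <- strsplit(tolower("Hello, world!"), "[^[:alnum:]]")[[1]]
fixed <- tokenize("Hello, world!")
print(original)
print(fixed)
```

Assuming emit() accepts a character vector as it does in Example1.R, the DoFn
body could presumably become function(line) emit(tokenize(line)). The thread
does not say whether a helper defined outside the closure is visible on the
backend, so inlining the two lines inside the function may be the safer option.
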
> > > >
> > > > I think R-Java communication should now support multiple DoFns, and
> > > > they will be properly executed, synchronized, and shut down even if
> > > > they emit in the cleanup phase.
> > > >
> > > > This example assumes a lot of defaults (such as the RTypes, which for
> > > > a DoFn default to a singleton character vector in and a character
> > > > vector out). It also obviously uses only text in, text out at this
> > > > point.
> > > >
> > > >
> > > > To run, install the package and upload the test input (test-prep.sh).
> > > > Assuming you have compiled the Maven part, the R package snapshot can
> > > > be installed by running "install-snapshot-rpkg.sh".
> > > >
> > > > You also need to make sure your backend tasks see the JRI library.
> > > > There are multiple ways to do that, I guess, but for testing purposes
> > > > the following just works for me in my mapred-site:
> > > >
> > > > <property>
> > > >    <name>mapred.child.java.opts</name>
> > > >    <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
> > > >    <final>false</final>
> > > > </property>
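
The directory in the <value> element above is specific to one machine. Here
is a small sketch of how the path could be looked up from R instead of typed
by hand, assuming rJava is installed with its JRI component in the R library
that the tasks use. system.file() returns an empty string when the package
or directory is missing, so the snippet runs either way.

```r
# Locate the "jri" directory inside the installed rJava package, if any.
jri_dir <- system.file("jri", package = "rJava")

if (nzchar(jri_dir)) {
  # Print the JVM option in the form used for mapred.child.java.opts.
  cat(sprintf("-Djava.library.path=%s\n", jri_dir))
} else {
  message("rJava with JRI was not found in this R library")
}
```

The printed path is only useful if the same directory exists on every node
that runs tasks, which this sketch does not check.
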
> > > >
> > > >
> > > > I think at this point you guys could help me by reviewing this stuff,
> > > > asking questions, and suggesting how to go about incorporating other
> > > > types of DoFn and perhaps how to complete the word count example, and
> > > > perhaps by running comparative benchmarks against a Java-only word
> > > > count to see how much overhead we are suffering here.
> > > >
> > > > I use StatET in Eclipse. Although it is a huge step forward, the
> > > > process is still extremely tedious, since I don't know the unit
> > > > testing frameworks in R well (so I just scribble some stuff on the
> > > > side to unit-test this and that) and the integration test cycle takes
> > > > a significant amount of time.
> > > >
> > > > Which is why any help and suggestions are very welcome!
> > > >
> > > > I will definitely add support for reading/writing sequence files and
> > > > Protobufs, as well as Mahout DRMs.
> > > >
> > > >
> > > > Thanks.
> > > > -Dmitriy
> > > >
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
