incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Flume R -- any interest?
Date Sun, 18 Nov 2012 18:19:49 GMT
Well, I figured out a way not to need it. So no, total node shutdown for the
R side is not implemented, but that is actually fine.

Basically, on the backend the two-way pipeline is never explicitly shut
down, which is fine as long as DoFn cleanup is synchronized.

The problem, of course, is that in general we can't assume any particular
lifecycle for a DoFn and have to assume they may spring to life, and clean
up, in arbitrary order.

The solution is simply to flush the pipelines once a DoFn cleanup is
encountered, and to wait for a cleanup receipt from the DoFn's doppelganger
on the R side before exiting the cleanup of the Java-side DoFn.

Once Crunch has run cleanup on every DoFn, this guarantees that all R
processing is also flushed and the queues are empty. It may be a little
less optimal than a single-stage cleanup of everything, but hopefully
cleanup is a small part of the work compared to process(). (Actually, in
Mahout SSVD the cleanup emissions are often just as big as, or even larger
than, all the process() emissions, so I took care that those scenarios work
just as well.)
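For reference, the flush-and-wait handshake described above can be sketched
in a few lines of plain Java. This is a toy model only: RSidePipe, the
"CLEANUP" sentinel, and the receipt latch are made-up names standing in for
the real crunchR two-way pipeline, not actual crunchR classes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CleanupFlushSketch {
    // Stand-in for the Java<->R pipeline owned by one DoFn.
    static class RSidePipe implements Runnable {
        final BlockingQueue<String> toR = new ArrayBlockingQueue<>(16);
        final CountDownLatch cleanupReceipt = new CountDownLatch(1);
        volatile int processed = 0;

        // Simulates the R-side doppelganger draining the queue.
        public void run() {
            try {
                String msg;
                while (!(msg = toR.take()).equals("CLEANUP")) {
                    processed++;              // R-side process() of one datum
                }
                cleanupReceipt.countDown();   // R side finished its own cleanup
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        void emit(String datum) throws InterruptedException { toR.put(datum); }

        // Java-side DoFn.cleanup(): flush, then block on the receipt.
        void cleanup() throws InterruptedException {
            toR.put("CLEANUP");
            if (!cleanupReceipt.await(5, TimeUnit.SECONDS))
                throw new IllegalStateException("no cleanup receipt from R side");
        }
    }

    public static void main(String[] args) throws Exception {
        RSidePipe pipe = new RSidePipe();
        Thread rSide = new Thread(pipe);
        rSide.start();
        for (int i = 0; i < 100; i++) pipe.emit("datum-" + i);
        pipe.cleanup();                       // returns only after R has drained
        rSide.join();
        System.out.println(pipe.processed);   // prints 100
    }
}
```

The point of the receipt is exactly the guarantee described above: when the
Java-side cleanup() returns, the queues are provably empty, so Crunch can
close outputs safely.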



On Sun, Nov 18, 2012 at 9:38 AM, Josh Wills <josh.wills@gmail.com> wrote:

> Curious-- did you figure out a hack to make this work, or is this still an
> open issue?
>
>
> On Fri, Nov 16, 2012 at 3:08 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > Or RTNode? I guess I am not sure what the difference is.
> >
> > Bottom line, I need to run some task startup routines (e.g. establishing
> > exchange queues between the task and R) and also a last-thing cleanup
> > before the MR task exits and _before all outputs are closed_ (a kind of
> > "flush all" thing).
> >
> > Thanks.
> > -d
> >
> >
> > On Fri, Nov 16, 2012 at 3:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >
> > > How do I hook into CrunchTaskContext to do a task cleanup (as opposed
> > > to a DoFn etc.)?
> > >
> > >
> > > On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> > >wrote:
> > >
> > >> no, it is fully distributed testing.
> > >>
> > >> It is ok, StatET handles log4j logging for me so I see the logs. I was
> > >> wondering if any end-to-end diagnostics are already embedded in Crunch,
> > >> but reporting backend errors to the front end is notoriously hard (and
> > >> sometimes impossible) with hadoop, so I assume it doesn't make sense to
> > >> report client-only problems through an exception while everything else
> > >> still requires checking isSucceeded().
> > >>
> > >>
> > >>
> > >> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <jwills@cloudera.com>
> > wrote:
> > >>
> > >>> Are you running this using LocalJobRunner? Does calling
> > >>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
> > >>> settle a debate I'm having w/Matthias. ;-)
> > >>>
> > >>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> > >>> wrote:
> > >>> > I see the error in the logs, but Pipeline.run() never throws
> > >>> > anything; isSucceeded() subsequently returns false. Is there any way
> > >>> > to extract the client-side problem, rather than just being able to
> > >>> > state that the job failed? Or is that the only diagnostics by design?
> > >>> >
> > >>> > ============
> > >>> > 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
> > >>> > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> > >>> > does not exist: hdfs://localhost:11010/crunchr-example/input
> > >>> >   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
> > >>> >   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
> > >>> >   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
> > >>> >   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
> > >>> >   at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
> > >>> >   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
> > >>> >   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
> > >>> >   at java.security.AccessController.doPrivileged(Native Method)
> > >>> >   at javax.security.auth.Subject.doAs(Subject.java:396)
> > >>> >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> > >>> >   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
> > >>> >   at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
> > >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
> > >>> >   at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
> > >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
> > >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
> > >>> >   at java.lang.Thread.run(Thread.java:662)
> > >>> >
> > >>> >
> > >>> > On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com
> > >
> > >>> wrote:
> > >>> >
> > >>> >> for hadoop nodes i guess yet another option to soft-link the .so
> > into
> > >>> >> hadoop's native lib folder
> > >>> >>
> > >>> >>
> > >>> >> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <
> > dlieu.7@gmail.com
> > >>> >wrote:
> > >>> >>
> > >>> >>> I actually want to defer this to hadoop admins; we just need to
> > >>> >>> create a procedure for setting up nodes. Ideally as simple as
> > >>> >>> possible, something like:
> > >>> >>>
> > >>> >>> 1) set up R
> > >>> >>> 2) install.packages(c("rJava","RProtoBuf","crunchR"))
> > >>> >>> 3) R CMD javareconf
> > >>> >>> 4) add the result of R --vanilla <<< 'system.file("jri",
> > >>> >>> package="rJava")' to either the mapred command lines or
> > >>> >>> LD_LIBRARY_PATH...
> > >>> >>>
> > >>> >>> but it will depend on their versions of hadoop, jre etc. I hoped
> > >>> crunch
> > >>> >>> might have something to hide a lot of that complexity (since it
> is
> > >>> about
> > >>> >>> hiding complexities, for the most part :)  ) besides hadoop has a
> > >>> way to
> > >>> >>> ship .so's to the backend so if crunch had an api to do something
> > >>> similar
> > >>> >>> it is conceivable that driver might yank and ship it too to hide
> > that
> > >>> >>> complexity as well. But then there's a host of issues how to
> handle
> > >>> >>> potentially different rJava versions installed on different
> > nodes...
> > >>> So, it
> > >>> >>> increasingly looks like something we might want to defer to
> sysops
> > >>> to do
> > >>> >>> with approximate set of requirements .
> > >>> >>>
> > >>> >>>
> > >>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jwills@cloudera.com
> >
> > >>> wrote:
> > >>> >>>
> > >>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <
> > >>> dlieu.7@gmail.com>
> > >>> >>>> wrote:
> > >>> >>>>
> > >>> >>>> > so java tasks need to be able to load libjri.so from
> > >>> >>>> > whatever system.file("jri", package="rJava") says.
> > >>> >>>> >
> > >>> >>>> > Traditionally, these issues were handled with -Djava.library.path.
> > >>> >>>> > Apparently there's nothing a java task can do, once started, to
> > >>> >>>> > enable a loadLibrary() call to see the library. But
> > >>> >>>> > -Djava.library.path requires the nodes to configure the jvm
> > >>> >>>> > command line and lock it against modification by the client,
> > >>> >>>> > which is fine.
> > >>> >>>> >
> > >>> >>>> > I also discovered that LD_LIBRARY_PATH actually works with jre
> > >>> >>>> > 1.6 (again).
> > >>> >>>> >
> > >>> >>>> > but... any other suggestions about best practices for
> > >>> >>>> > configuring crunch to run users' .so's?
> > >>> >>>> >
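The constraint discussed in this message can be demonstrated with a few
lines of plain Java: System.loadLibrary() consults java.library.path, which
the JVM reads at startup, so the JRI directory must be on the task command
line or inherited via LD_LIBRARY_PATH before the JVM launches. This is an
illustrative sketch, not crunchR code; "jri" is the native library name
rJava ships (libjri.so on Linux).

```java
public class LoadJriSketch {
    public static void main(String[] args) {
        // Fixed at JVM startup; cannot be usefully changed from inside the task.
        System.out.println("java.library.path = "
                + System.getProperty("java.library.path"));
        try {
            System.loadLibrary("jri"); // resolves libjri.so via java.library.path
            System.out.println("libjri loaded");
        } catch (UnsatisfiedLinkError e) {
            // Expected on any node where the JRI directory was not on
            // java.library.path / LD_LIBRARY_PATH before JVM start.
            System.out.println("libjri not found: " + e.getMessage());
        }
    }
}
```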
> > >>> >>>>
> > >>> >>>> Not off the top of my head. I suspect that whatever you come up
> > >>> with will
> > >>> >>>> become the "best practice." :)
> > >>> >>>>
> > >>> >>>> >
> > >>> >>>> > thanks.
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> >
> > >>> >>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <
> > josh.wills@gmail.com
> > >>> >
> > >>> >>>> wrote:
> > >>> >>>> >
> > >>> >>>> > > I believe that is a safe assumption, at least right now.
> > >>> >>>> > >
> > >>> >>>> > >
> > >>> >>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <
> > >>> dlieu.7@gmail.com
> > >>> >>>> >
> > >>> >>>> > > wrote:
> > >>> >>>> > >
> > >>> >>>> > > > Question.
> > >>> >>>> > > >
> > >>> >>>> > > > So in the Crunch API, initialize() doesn't get an emitter,
> > >>> >>>> > > > and process() gets an emitter every time.
> > >>> >>>> > > >
> > >>> >>>> > > > However, my guess is that any single reincarnation of a
> > >>> >>>> > > > DoFn object in the backend will always be getting the same
> > >>> >>>> > > > emitter through its lifecycle. Is that an admissible
> > >>> >>>> > > > assumption, or is there currently a counterexample?
> > >>> >>>> > > >
> > >>> >>>> > > > The problem is that, as I implement the two-way pipeline
> > >>> >>>> > > > of input and emitter data between R and Java, I am batching
> > >>> >>>> > > > these calls together for performance reasons. The individual
> > >>> >>>> > > > data in these chunks will not have any emitter information
> > >>> >>>> > > > attached to them. (Well, they could, but that would be a
> > >>> >>>> > > > performance killer, and I bet the emitter never changes.)
> > >>> >>>> > > >
> > >>> >>>> > > > So, thoughts? Can I assume the emitter never changes between
> > >>> >>>> > > > the first and last calls to a DoFn instance?
> > >>> >>>> > > >
> > >>> >>>> > > > thanks.
> > >>> >>>> > > >
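The batching scheme this question is about can be sketched as follows. It
is a hypothetical stand-in under the "emitter never changes" assumption:
the Emitter interface and BatchingDoFn class here are minimal substitutes
for the real Crunch interfaces, not crunchR code.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {
    interface Emitter<T> { void emit(T value); }

    static class BatchingDoFn {
        private final int batchSize;
        private final List<String> buffer = new ArrayList<>();
        private Emitter<String> cachedEmitter;   // captured on first call

        BatchingDoFn(int batchSize) { this.batchSize = batchSize; }

        void process(String input, Emitter<String> emitter) {
            if (cachedEmitter == null) cachedEmitter = emitter;
            else if (cachedEmitter != emitter)   // the assumption being tested
                throw new IllegalStateException("emitter changed mid-lifecycle");
            buffer.add(input.toUpperCase());     // stand-in for the R-side work
            if (buffer.size() >= batchSize) flush();
        }

        void cleanup() { flush(); }              // last chance to drain the batch

        private void flush() {
            for (String s : buffer) cachedEmitter.emit(s);
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        Emitter<String> emitter = out::add;      // one emitter for the lifecycle
        BatchingDoFn fn = new BatchingDoFn(3);
        for (String s : new String[]{"a", "b", "c", "d"}) fn.process(s, emitter);
        fn.cleanup();
        System.out.println(out);                 // prints [A, B, C, D]
    }
}
```

If the assumption ever failed, each datum in a chunk would need its emitter
identity attached, which is the performance killer the message describes.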
> > >>> >>>> > > >
> > >>> >>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <
> > >>> >>>> dlieu.7@gmail.com>
> > >>> >>>> > > > wrote:
> > >>> >>>> > > >
> > >>> >>>> > > > > yes...
> > >>> >>>> > > > >
> > >>> >>>> > > > > i think it worked for me before, although just adding
> all
> > >>> jars
> > >>> >>>> from R
> > >>> >>>> > > > > package distribution would be a little bit more
> > appropriate
> > >>> >>>> approach
> > >>> >>>> > > > > -- but it creates a problem with jars in dependent R
> > >>> packages. I
> > >>> >>>> > think
> > >>> >>>> > > > > it would be much easier to just compile a hadoop-job
> file
> > >>> and
> > >>> >>>> stick
> > >>> >>>> > it
> > >>> >>>> > > > > in rather than doing cherry-picking of individual jars
> > from
> > >>> who
> > >>> >>>> knows
> > >>> >>>> > > > > how many locations.
> > >>> >>>> > > > >
> > >>> >>>> > > > > i think i used the hadoop job format with distributed
> > cache
> > >>> >>>> before
> > >>> >>>> > and
> > >>> >>>> > > > > it worked... at least with Pig "register jar"
> > functionality.
> > >>> >>>> > > > >
> > >>> >>>> > > > > ok i guess i will just try if it works.
> > >>> >>>> > > > >
> > >>> >>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <
> > >>> jwills@cloudera.com
> > >>> >>>> >
> > >>> >>>> > > wrote:
> > >>> >>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
> > >>> >>>> > dlieu.7@gmail.com
> > >>> >>>> > > >
> > >>> >>>> > > > > wrote:
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >> Great! so it is in Crunch.
> > >>> >>>> > > > > >>
> > >>> >>>> > > > > >> does it support hadoop-job jar format or only pure
> java
> > >>> jars?
> > >>> >>>> > > > > >>
> > >>> >>>> > > > > >
> > >>> >>>> > > > > > I think just pure jars-- you're referring to
> hadoop-job
> > >>> format
> > >>> >>>> as
> > >>> >>>> > > > having
> > >>> >>>> > > > > > all the dependencies in a lib/ directory within the
> jar?
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >>
> > >>> >>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
> > >>> >>>> jwills@cloudera.com>
> > >>> >>>> > > > > wrote:
> > >>> >>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
> > >>> >>>> > > > dlieu.7@gmail.com>
> > >>> >>>> > > > > >> wrote:
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> >> I think i need functionality to add more jars (or
> > >>> external
> > >>> >>>> > > > > hadoop-jar)
> > >>> >>>> > > > > >> >> to drive that from an R package. Just setting job
> > jar
> > >>> by
> > >>> >>>> class
> > >>> >>>> > is
> > >>> >>>> > > > not
> > >>> >>>> > > > > >> >> enough. I can push overall job-jar as an
> additional
> > >>> jar to
> > >>> >>>> R
> > >>> >>>> > > > package;
> > >>> >>>> > > > > >> >> however, i cannot really run hadoop command line
> on
> > >>> it, i
> > >>> >>>> need
> > >>> >>>> > to
> > >>> >>>> > > > set
> > >>> >>>> > > > > >> >> up classpath thru RJava.
> > >>> >>>> > > > > >> >>
> > >>> >>>> > > > > >> >> Traditional single hadoop job jar will unlikely
> work
> > >>> here
> > >>> >>>> since
> > >>> >>>> > > we
> > >>> >>>> > > > > >> >> cannot hardcode pipelines in java code but rather
> > >>> have to
> > >>> >>>> > > construct
> > >>> >>>> > > > > >> >> them on the fly. (well, we could serialize
> pipeline
> > >>> >>>> definitions
> > >>> >>>> > > > from
> > >>> >>>> > > > > R
> > >>> >>>> > > > > >> >> and then replay them in a driver -- but that's too
> > >>> >>>> cumbersome
> > >>> >>>> > and
> > >>> >>>> > > > > more
> > >>> >>>> > > > > >> >> work than it has to be.) There's no reason why i
> > >>> shouldn't
> > >>> >>>> be
> > >>> >>>> > > able
> > >>> >>>> > > > to
> > >>> >>>> > > > > >> >> do pig-like "register jar" or "setJobJar"
> > >>> (mahout-like)
> > >>> >>>> when
> > >>> >>>> > > > kicking
> > >>> >>>> > > > > >> >> off a pipeline.
> > >>> >>>> > > > > >> >>
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> >>
> > >>> >>>> > > > > >> >>
> > >>> >>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy
> Lyubimov <
> > >>> >>>> > > > > dlieu.7@gmail.com>
> > >>> >>>> > > > > >> >> wrote:
> > >>> >>>> > > > > >> >> > Ok, sounds very promising...
> > >>> >>>> > > > > >> >> >
> > >>> >>>> > > > > >> >> > i'll try to start digging on the driver part
> this
> > >>> week
> > >>> >>>> then
> > >>> >>>> > > > > (Pipeline
> > >>> >>>> > > > > >> >> > wrapper in R5).
> > >>> >>>> > > > > >> >> >
> > >>> >>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
> > >>> >>>> > > > josh.wills@gmail.com
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >> >> wrote:
> > >>> >>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy
> > Lyubimov <
> > >>> >>>> > > > > dlieu.7@gmail.com
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> >> wrote:
> > >>> >>>> > > > > >> >> >>> Ok, cool.
> > >>> >>>> > > > > >> >> >>>
> > >>> >>>> > > > > >> >> >>> So what state is Crunch in? I take it is in a
> > >>> fairly
> > >>> >>>> > advanced
> > >>> >>>> > > > > state.
> > >>> >>>> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper
> > is
> > >>> >>>> working ,
> > >>> >>>> > > > > right?
> > >>> >>>> > > > > >> Or
> > >>> >>>> > > > > >> >> >>> there's something that is not working
> > >>> specifically?
> > >>> >>>> > > > > >> >> >>
> > >>> >>>> > > > > >> >> >> I think the only thing in the paper that we
> don't
> > >>> have
> > >>> >>>> in a
> > >>> >>>> > > > > working
> > >>> >>>> > > > > >> >> >> state is MSCR fusion. It's mostly just a
> question
> > >>> of
> > >>> >>>> > > > prioritizing
> > >>> >>>> > > > > it
> > >>> >>>> > > > > >> >> >> and getting the work done.
> > >>> >>>> > > > > >> >> >>
> > >>> >>>> > > > > >> >> >>>
> > >>> >>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
> > >>> >>>> > > > jwills@cloudera.com
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >> >> wrote:
> > >>> >>>> > > > > >> >> >>>> Hey Dmitriy,
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>> Got a fork going and looking forward to
> playing
> > >>> with
> > >>> >>>> > crunchR
> > >>> >>>> > > > > this
> > >>> >>>> > > > > >> >> weekend--
> > >>> >>>> > > > > >> >> >>>> thanks!
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>> J
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy
> > >>> Lyubimov <
> > >>> >>>> > > > > >> dlieu.7@gmail.com>
> > >>> >>>> > > > > >> >> wrote:
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>>> Project template
> > >>> >>>> https://github.com/dlyubimov/crunchR
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> Default profile does not compile R artifact
> .
> > R
> > >>> >>>> profile
> > >>> >>>> > > > > compiles R
> > >>> >>>> > > > > >> >> >>>>> artifact. for convenience, it is enabled by
> > >>> >>>> supplying -DR
> > >>> >>>> > > to
> > >>> >>>> > > > > mvn
> > >>> >>>> > > > > >> >> >>>>> command line, e.g.
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> mvn install -DR
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> there's also a helper that installs the
> > snapshot
> > >>> >>>> version
> > >>> >>>> > of
> > >>> >>>> > > > the
> > >>> >>>> > > > > >> >> >>>>> package in the crunchR module.
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> There's RJava and JRI java dependencies
> which
> > i
> > >>> did
> > >>> >>>> not
> > >>> >>>> > > find
> > >>> >>>> > > > > >> anywhere
> > >>> >>>> > > > > >> >> >>>>> in public maven repos; so it is installed
> into
> > >>> my
> > >>> >>>> github
> > >>> >>>> > > > maven
> > >>> >>>> > > > > >> repo
> > >>> >>>> > > > > >> >> so
> > >>> >>>> > > > > >> >> >>>>> far. Should compile for 3rd party.
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and
> > >>> optionally,
> > >>> >>>> > > RProtoBuf.
> > >>> >>>> > > > R
> > >>> >>>> > > > > Doc
> > >>> >>>> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import
> into
> > >>> >>>> another
> > >>> >>>> > > > package,
> > >>> >>>> > > > > >> got a
> > >>> >>>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf
> > >>> into
> > >>> >>>> > crunchR,
> > >>> >>>> > > so
> > >>> >>>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category.
> Down
> > >>> the
> > >>> >>>> road
> > >>> >>>> > that
> > >>> >>>> > > > may
> > >>> >>>> > > > > >> be a
> > >>> >>>> > > > > >> >> >>>>> problem though...
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> other than the template, not much else has
> > been
> > >>> done
> > >>> >>>> so
> > >>> >>>> > > > far...
> > >>> >>>> > > > > >> >> finding
> > >>> >>>> > > > > >> >> >>>>> hadoop libraries and adding it to the
> package
> > >>> path on
> > >>> >>>> > > > > >> initialization
> > >>> >>>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars
> > >>> and its
> > >>> >>>> > > > > >> non-"provided"
> > >>> >>>> > > > > >> >> >>>>> transitives to the crunchR's java part...
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> No legal stuff...
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> No readmes... complete stealth at this
> point.
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy
> > >>> Lyubimov <
> > >>> >>>> > > > > >> >> dlieu.7@gmail.com>
> > >>> >>>> > > > > >> >> >>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project
> > template
> > >>> by
> > >>> >>>> some
> > >>> >>>> > > time
> > >>> >>>> > > > > next
> > >>> >>>> > > > > >> >> week.
> > >>> >>>> > > > > >> >> >>>>> > we can start with prototyping and
> > benchmarking
> > >>> >>>> > something
> > >>> >>>> > > > > really
> > >>> >>>> > > > > >> >> >>>>> > simple, such as parallelDo().
> > >>> >>>> > > > > >> >> >>>>> >
> > >>> >>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some
> more
> > >>> or
> > >>> >>>> less
> > >>> >>>> > > simple
> > >>> >>>> > > > > >> >> algorithm
> > >>> >>>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be
> solved
> > >>> with
> > >>> >>>> > Rcrunch
> > >>> >>>> > > > (or
> > >>> >>>> > > > > >> >> whatever
> > >>> >>>> > > > > >> >> >>>>> > name it has to be) in a comparable time
> > >>> >>>> (performance)
> > >>> >>>> > but
> > >>> >>>> > > > > with
> > >>> >>>> > > > > >> much
> > >>> >>>> > > > > >> >> >>>>> > fewer lines of code. (say one of
> > >>> factorization or
> > >>> >>>> > > > clustering
> > >>> >>>> > > > > >> >> things)
> > >>> >>>> > > > > >> >> >>>>> >
> > >>> >>>> > > > > >> >> >>>>> >
> > >>> >>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
> > >>> >>>> > > rsharma@xebia.com
> > >>> >>>> > > > >
> > >>> >>>> > > > > >> wrote:
> > >>> >>>> > > > > >> >> >>>>> >> I am not much of R user but I am
> interested
> > >>> to
> > >>> >>>> see how
> > >>> >>>> > > > well
> > >>> >>>> > > > > we
> > >>> >>>> > > > > >> can
> > >>> >>>> > > > > >> >> >>>>> integrate
> > >>> >>>> > > > > >> >> >>>>> >> the two. I would be happy to help.
> > >>> >>>> > > > > >> >> >>>>> >>
> > >>> >>>> > > > > >> >> >>>>> >> regards,
> > >>> >>>> > > > > >> >> >>>>> >> Rahul
> > >>> >>>> > > > > >> >> >>>>> >>
> > >>> >>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy
> > >>> >>>> Lyubimov <
> > >>> >>>> > > > > >> >> dlieu.7@gmail.com>
> > >>> >>>> > > > > >> >> >>>>> >>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>
> > >>> >>>> > > > > >> >> >>>>> >>>> Yep, ok.
> > >>> >>>> > > > > >> >> >>>>> >>>>
> > >>> >>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I
> > >>> can set
> > >>> >>>> up a
> > >>> >>>> > > > maven
> > >>> >>>> > > > > >> >> project
> > >>> >>>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been
> doing
> > >>> that a
> > >>> >>>> lot
> > >>> >>>> > > > > lately).
> > >>> >>>> > > > > >> Or
> > >>> >>>> > > > > >> >> if you
> > >>> >>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be
> > >>> useful i
> > >>> >>>> > guess
> > >>> >>>> > > > > too.
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>> No, please go right ahead.
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>>>
> > >>> >>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh
> > >>> Wills <
> > >>> >>>> > > > > >> >> josh.wills@gmail.com>
> > >>> >>>> > > > > >> >> >>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first,
> but
> > >>> I am
> > >>> >>>> happy
> > >>> >>>> > > to
> > >>> >>>> > > > > help.
> > >>> >>>> > > > > >> >> Github
> > >>> >>>> > > > > >> >> >>>>> >>>>> repo?
> > >>> >>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy
> > >>> Lyubimov" <
> > >>> >>>> > > > > >> dlieu.7@gmail.com
> > >>> >>>> > > > > >> >> >
> > >>> >>>> > > > > >> >> >>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a
> > >>> JRI/RJava
> > >>> >>>> > > > prototype
> > >>> >>>> > > > > on
> > >>> >>>> > > > > >> >> top of
> > >>> >>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This
> > should
> > >>> both
> > >>> >>>> save
> > >>> >>>> > > > time
> > >>> >>>> > > > > and
> > >>> >>>> > > > > >> >> prove or
> > >>> >>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava
> > integration
> > >>> is
> > >>> >>>> > viable.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within
> > >>> Crunch
> > >>> >>>> > > framework
> > >>> >>>> > > > > or we
> > >>> >>>> > > > > >> >> can keep
> > >>> >>>> > > > > >> >> >>>>> >>>>>> it completely separate.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> -d
> > >>> >>>> > > > > >> >> >>>>> >>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh
> > >>> Wills <
> > >>> >>>> > > > > >> >> jwills@cloudera.com>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be
> into
> > >>> it--
> > >>> >>>> who
> > >>> >>>> > gave
> > >>> >>>> > > > the
> > >>> >>>> > > > > >> >> talk? Was
> > >>> >>>> > > > > >> >> >>>>> it
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM,
> > Dmitriy
> > >>> >>>> > Lyubimov <
> > >>> >>>> > > > > >> >> >>>>> dlieu.7@gmail.com>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Hello,
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of
> > >>> Google's
> > >>> >>>> > > experience
> > >>> >>>> > > > > of R
> > >>> >>>> > > > > >> >> mapping
> > >>> >>>> > > > > >> >> >>>>> of
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent
> BARUGs. I
> > >>> think
> > >>> >>>> a
> > >>> >>>> > lot
> > >>> >>>> > > of
> > >>> >>>> > > > > >> >> applications
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout
> could
> > >>> be
> > >>> >>>> > > prototyped
> > >>> >>>> > > > > using
> > >>> >>>> > > > > >> >> flume R.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of
> > >>> Google
> > >>> >>>> > > > > implementation
> > >>> >>>> > > > > >> of
> > >>> >>>> > > > > >> >> R
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> mapping,
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct
> > >>> mapping
> > >>> >>>> from
> > >>> >>>> > R
> > >>> >>>> > > to
> > >>> >>>> > > > > >> Crunch
> > >>> >>>> > > > > >> >> would
> > >>> >>>> > > > > >> >> >>>>> be
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part,
> > >>> efficient).
> > >>> >>>> > > > RJava/JRI
> > >>> >>>> > > > > and
> > >>> >>>> > > > > >> >> jni
> > >>> >>>> > > > > >> >> >>>>> seem to
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to
> do
> > >>> that
> > >>> >>>> > > directly.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> >>>>>>>> on top of it, I am thinking whether this
> > >>> project
> > >>> >>>> > could
> > >>> >>>> > > > > have a
> > >>> >>>> > > > > >> >> >>>>> contributed
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed
> > >>> matrices,
> > >>> >>>> that
> > >>> >>>> > > would
> > >>> >>>> > > > > be
> > >>> >>>> > > > > >> >> just a
> > >>> >>>> > > > > >> >> >>>>> very
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> > >>> >>>> > > contributing/advising
> > >>> >>>> > > > > for
> > >>> >>>> > > > > >> open
> > >>> >>>> > > > > >> >> >>>>> source
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just
> > >>> gauging
> > >>> >>>> > interest,
> > >>> >>>> > > > > Crunch
> > >>> >>>> > > > > >> >> list
> > >>> >>>> > > > > >> >> >>>>> seems
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks .
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>>
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> --
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera
> > >>> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>>
> > >>> >>>> > > > > >> >> >>>>> >>
> > >>> >>>> > > > > >> >> >>>>>
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>>
> > >>> >>>> > > > > >> >> >>>> --
> > >>> >>>> > > > > >> >> >>>> Director of Data Science
> > >>> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > >>> >>>> > > > > >> >> >>>> Twitter: @josh_wills <
> > >>> http://twitter.com/josh_wills>
> > >>> >>>> > > > > >> >>
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> >
> > >>> >>>> > > > > >> > --
> > >>> >>>> > > > > >> > Director of Data Science
> > >>> >>>> > > > > >> > Cloudera <http://www.cloudera.com>
> > >>> >>>> > > > > >> > Twitter: @josh_wills <
> http://twitter.com/josh_wills>
> > >>> >>>> > > > > >>
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >
> > >>> >>>> > > > > >
> > >>> >>>> > > > > > --
> > >>> >>>> > > > > > Director of Data Science
> > >>> >>>> > > > > > Cloudera <http://www.cloudera.com>
> > >>> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >>> >>>> > > > >
> > >>> >>>> > > >
> > >>> >>>> > >
> > >>> >>>> >
> > >>> >>>>
> > >>> >>>>
> > >>> >>>>
> > >>> >>>> --
> > >>> >>>> Director of Data Science
> > >>> >>>> Cloudera <http://www.cloudera.com>
> > >>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > >>> >>>>
> > >>> >>>
> > >>> >>>
> > >>> >>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Director of Data Science
> > >>> Cloudera
> > >>> Twitter: @josh_wills
> > >>>
> > >>
> > >>
> > >
> >
>
