crunch-dev mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Flume R -- any interest?
Date Sun, 18 Nov 2012 17:38:26 GMT
Curious-- did you figure out a hack to make this work, or is this still an
open issue?


On Fri, Nov 16, 2012 at 3:08 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> Or RTNode? I guess I am not sure what the difference is.
>
> Bottom line, I need to do some task startup routines (e.g. establish
> exchange queues between the task and R) and also a last-thing cleanup before
> the MR task exits and _before all outputs are closed_ (kind of a "flush all"
> thing).
>
> Thanks.
> -d
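A sketch of the per-task hooks on a plain DoFn that line up with the startup / "flush all" requirement above (and with the CrunchTaskContext question quoted below); the class name and String types here are only illustrative:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    public class RTaskFn extends DoFn<String, String> {

      @Override
      public void initialize() {
        // per-task startup, e.g. establish the exchange queues between the task and R
      }

      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input);
      }

      @Override
      public void cleanup(Emitter<String> emitter) {
        // runs once per task before its outputs are closed -- the "flush all" point
      }
    }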
>
>
> On Fri, Nov 16, 2012 at 3:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > How do I hook into CrunchTaskContext to do a task cleanup (as opposed to
> > a DoFn etc.)?
> >
> >
> > On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >
> >> No, it is fully distributed testing.
> >>
> >> It is ok, StatET handles log4j logging for me, so I see the logs. I was
> >> wondering if any end-to-end diagnostics are already embedded in Crunch,
> >> but reporting backend errors to the front end is notoriously hard (and
> >> sometimes impossible) with Hadoop, so I assume it doesn't make sense to
> >> report client-only problems through an exception while everything else
> >> still requires checking isSucceeded().
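A sketch of what the client-side checks look like (the driver class and output path are made up, and the success-check method name may differ by Crunch version -- the thread refers to it as isSucceeded()):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.PipelineResult;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class CrunchRDriver {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(CrunchRDriver.class, new Configuration());
        pipeline.enableDebug();  // the debug switch Josh mentions below

        PCollection<String> lines =
            pipeline.readTextFile("hdfs://localhost:11010/crunchr-example/input");
        pipeline.writeTextFile(lines, "/crunchr-example/output");

        // run() does not throw for backend failures; the job logs (e.g. the
        // InvalidInputException quoted below) carry the actual cause
        PipelineResult result = pipeline.run();
        if (!result.succeeded()) {
          System.err.println("pipeline failed -- check the job logs");
        }
      }
    }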
> >>
> >>
> >>
> >> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <jwills@cloudera.com>
> >> wrote:
> >>
> >>> Are you running this using LocalJobRunner? Does calling
> >>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
> >>> settle a debate I'm having w/Matthias. ;-)
> >>>
> >>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> wrote:
> >>> > I see the error in the logs, but Pipeline.run() never throws
> >>> > anything; isSucceeded() subsequently returns false. Is there any way
> >>> > to extract the client-side problem rather than just being able to
> >>> > state that the job failed? Or is that OK and the only diagnostics by
> >>> > design?
> >>> >
> >>> > ============
> >>> > 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
> >>> > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
> >>> > does not exist: hdfs://localhost:11010/crunchr-example/input
> >>> >   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
> >>> >   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
> >>> >   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
> >>> >   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
> >>> >   at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
> >>> >   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
> >>> >   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
> >>> >   at java.security.AccessController.doPrivileged(Native Method)
> >>> >   at javax.security.auth.Subject.doAs(Subject.java:396)
> >>> >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> >>> >   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
> >>> >   at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
> >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
> >>> >   at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
> >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
> >>> >   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
> >>> >   at java.lang.Thread.run(Thread.java:662)
> >>> >
> >>> >
> >>> > On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> For Hadoop nodes, I guess yet another option is to soft-link the .so
> >>> >> into Hadoop's native lib folder.
> >>> >>
> >>> >>
> >>> >> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >> wrote:
> >>> >>
> >>> >>> I actually want to defer this to Hadoop admins; we just need to
> >>> >>> create a procedure for setting up nodes, ideally as simple as
> >>> >>> possible. Something like:
> >>> >>>
> >>> >>> 1) set up R
> >>> >>> 2) install.packages(c("rJava", "RProtoBuf", "crunchR"))
> >>> >>> 3) R CMD javareconf
> >>> >>> 4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")'
> >>> >>>    to either the mapred command lines or LD_LIBRARY_PATH...
> >>> >>>
> >>> >>> But it will depend on their versions of Hadoop, the JRE, etc. I
> >>> >>> hoped Crunch might have something to hide a lot of that complexity
> >>> >>> (since it is about hiding complexities, for the most part :) ).
> >>> >>> Besides, Hadoop has a way to ship .so's to the backend, so if Crunch
> >>> >>> had an API to do something similar, it is conceivable that the
> >>> >>> driver might grab and ship it too and hide that complexity as well.
> >>> >>> But then there's a host of issues around how to handle potentially
> >>> >>> different rJava versions installed on different nodes... So it
> >>> >>> increasingly looks like something we might want to defer to sysops,
> >>> >>> with an approximate set of requirements.
> >>> >>>
> >>> >>>
> >>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jwills@cloudera.com>
> >>> >>> wrote:
> >>> >>>
> >>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> wrote:
> >>> >>>>
> >>> >>>> > So Java tasks need to be able to load libjri.so from wherever
> >>> >>>> > system.file("jri", package="rJava") says it is.
> >>> >>>> >
> >>> >>>> > Traditionally, these issues were handled with -Djava.library.path.
> >>> >>>> > Apparently there's nothing a Java task can do to make the
> >>> >>>> > loadLibrary() call see the damn library once the JVM has started.
> >>> >>>> > But -Djava.library.path requires the nodes to configure the JVM
> >>> >>>> > command line and lock it against modifications by the client,
> >>> >>>> > which is fine.
> >>> >>>> >
> >>> >>>> > I also discovered that LD_LIBRARY_PATH actually works with JRE 1.6
> >>> >>>> > (again).
> >>> >>>> >
> >>> >>>> > But... any other suggestions about best practices for configuring
> >>> >>>> > Crunch to run a user's .so's?
> >>> >>>> >
> >>> >>>>
> >>> >>>> Not off the top of my head. I suspect that whatever you come up
> >>> >>>> with will become the "best practice." :)
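For reference, the two knobs discussed above map onto the usual Hadoop 1.x task-JVM properties; a sketch (the jri directory below is illustrative -- in practice it is whatever system.file("jri", package="rJava") returns on each node):

    import org.apache.hadoop.conf.Configuration;

    public class NativeLibConf {
      public static Configuration configure(Configuration conf) {
        // option 1: put the jri directory on the task JVM's java.library.path
        conf.set("mapred.child.java.opts",
            "-Xmx512m -Djava.library.path=/usr/lib/R/site-library/rJava/jri");
        // option 2: export LD_LIBRARY_PATH into the task environment
        // (works again with JRE 1.6, as noted above)
        conf.set("mapred.child.env",
            "LD_LIBRARY_PATH=/usr/lib/R/site-library/rJava/jri");
        return conf;
      }
    }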
> >>> >>>>
> >>> >>>> >
> >>> >>>> > thanks.
> >>> >>>> >
> >>> >>>> >
> >>> >>>> >
> >>> >>>> >
> >>> >>>> >
> >>> >>>> >
> >>> >>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <josh.wills@gmail.com>
> >>> >>>> > wrote:
> >>> >>>> >
> >>> >>>> > > I believe that is a safe assumption, at least right now.
> >>> >>>> > >
> >>> >>>> > >
> >>> >>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > wrote:
> >>> >>>> > >
> >>> >>>> > > > Question.
> >>> >>>> > > >
> >>> >>>> > > > So in the Crunch API, initialize() doesn't get an emitter, and
> >>> >>>> > > > process() gets an emitter every time.
> >>> >>>> > > >
> >>> >>>> > > > However, my guess is that any single reincarnation of a DoFn
> >>> >>>> > > > object in the backend will always be getting the same emitter
> >>> >>>> > > > throughout its lifecycle. Is that an admissible assumption, or
> >>> >>>> > > > is there currently a counterexample to it?
> >>> >>>> > > >
> >>> >>>> > > > The problem is that as I implement the two-way pipeline of
> >>> >>>> > > > input and emitter data between R and Java, I am bulking these
> >>> >>>> > > > calls together for performance reasons. Each individual datum
> >>> >>>> > > > in these chunks of data will not have emitter function
> >>> >>>> > > > information attached to it in any way. (Well, it could, but it
> >>> >>>> > > > would be a performance killer, and I bet the emitter never
> >>> >>>> > > > changes.)
> >>> >>>> > > >
> >>> >>>> > > > So, thoughts? Can I assume the emitter never changes between
> >>> >>>> > > > the first and last call to a DoFn instance?
> >>> >>>> > > >
> >>> >>>> > > > thanks.
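A sketch of the batching pattern described above, leaning on the assumption (which Josh confirms above) that the same emitter is handed to every process() call within one task; the class, types and chunk size are made up:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    public class BatchingRFn extends DoFn<String, String> {
      private final List<String> buffer = new ArrayList<String>();
      private transient Emitter<String> cachedEmitter;

      @Override
      public void process(String input, Emitter<String> emitter) {
        cachedEmitter = emitter;  // cache once; assumed stable for the task
        buffer.add(input);
        if (buffer.size() >= 1000) {
          flush();                // ship a whole chunk at a time
        }
      }

      @Override
      public void cleanup(Emitter<String> emitter) {
        flush();                  // drain what's left before outputs close
      }

      private void flush() {
        // in crunchR this is where a chunk would round-trip through R;
        // here the buffered values are simply emitted
        for (String value : buffer) {
          cachedEmitter.emit(value);
        }
        buffer.clear();
      }
    }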
> >>> >>>> > > >
> >>> >>>> > > >
> >>> >>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > wrote:
> >>> >>>> > > >
> >>> >>>> > > > > Yes...
> >>> >>>> > > > >
> >>> >>>> > > > > I think it worked for me before, although just adding all
> >>> >>>> > > > > jars from the R package distribution would be a somewhat
> >>> >>>> > > > > more appropriate approach -- but it creates a problem with
> >>> >>>> > > > > jars in dependent R packages. I think it would be much
> >>> >>>> > > > > easier to just compile a hadoop-job file and stick it in
> >>> >>>> > > > > rather than cherry-picking individual jars from who knows
> >>> >>>> > > > > how many locations.
> >>> >>>> > > > >
> >>> >>>> > > > > I think I used the hadoop-job format with the distributed
> >>> >>>> > > > > cache before and it worked... at least with Pig's "register
> >>> >>>> > > > > jar" functionality.
> >>> >>>> > > > >
> >>> >>>> > > > > OK, I guess I will just try whether it works.
> >>> >>>> > > > >
> >>> >>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jwills@cloudera.com>
> >>> >>>> > > > > wrote:
> >>> >>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > > > wrote:
> >>> >>>> > > > > >
> >>> >>>> > > > > >> Great! So it is in Crunch.
> >>> >>>> > > > > >>
> >>> >>>> > > > > >> Does it support the hadoop-job jar format or only pure
> >>> >>>> > > > > >> Java jars?
> >>> >>>> > > > > >>
> >>> >>>> > > > > >
> >>> >>>> > > > > > I think just pure jars -- you're referring to the
> >>> >>>> > > > > > hadoop-job format as having all the dependencies in a
> >>> >>>> > > > > > lib/ directory within the jar?
> >>> >>>> > > > > >
> >>> >>>> > > > > >
> >>> >>>> > > > > >>
> >>> >>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jwills@cloudera.com>
> >>> >>>> > > > > >> wrote:
> >>> >>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > > >> > wrote:
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >> I think I need functionality to add more jars (or an
> >>> >>>> > > > > >> >> external hadoop-jar) and to drive that from an R
> >>> >>>> > > > > >> >> package. Just setting the job jar by class is not
> >>> >>>> > > > > >> >> enough. I can push the overall job-jar as an additional
> >>> >>>> > > > > >> >> jar into the R package; however, I cannot really run
> >>> >>>> > > > > >> >> the hadoop command line on it -- I need to set up the
> >>> >>>> > > > > >> >> classpath through rJava.
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >> A traditional single hadoop job jar will not likely
> >>> >>>> > > > > >> >> work here, since we cannot hardcode pipelines in Java
> >>> >>>> > > > > >> >> code but rather have to construct them on the fly.
> >>> >>>> > > > > >> >> (Well, we could serialize pipeline definitions from R
> >>> >>>> > > > > >> >> and then replay them in a driver -- but that's too
> >>> >>>> > > > > >> >> cumbersome and more work than it has to be.) There's no
> >>> >>>> > > > > >> >> reason why I shouldn't be able to do a Pig-like
> >>> >>>> > > > > >> >> "register jar" or setJobJar (Mahout-like) when kicking
> >>> >>>> > > > > >> >> off a pipeline.
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
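Roughly how that would look from the driver side -- a sketch assuming a (Configuration, File) overload (check the DistCache javadoc for the exact signatures; the jar path is made up):

    import java.io.File;

    import org.apache.crunch.util.DistCache;
    import org.apache.hadoop.conf.Configuration;

    public class ShipJars {
      public static void register(Configuration conf) throws Exception {
        // ship the jar to the tasks via the distributed cache
        DistCache.addJarToDistributedCache(conf, new File("/opt/crunchR/lib/crunchR-job.jar"));
      }
    }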
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> wrote:
> >>> >>>> > > > > >> >> > Ok, sounds very promising...
> >>> >>>> > > > > >> >> >
> >>> >>>> > > > > >> >> > I'll try to start digging into the driver part this
> >>> >>>> > > > > >> >> > week then (the Pipeline wrapper in R5).
> >>> >>>> > > > > >> >> >
> >>> >>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com>
> >>> >>>> > > > > >> >> > wrote:
> >>> >>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> >> wrote:
> >>> >>>> > > > > >> >> >>> Ok, cool.
> >>> >>>> > > > > >> >> >>>
> >>> >>>> > > > > >> >> >>> So what state is Crunch in? I take it it is in a
> >>> >>>> > > > > >> >> >>> fairly advanced state. So every API mentioned in
> >>> >>>> > > > > >> >> >>> the FlumeJava paper is working, right? Or is there
> >>> >>>> > > > > >> >> >>> something that specifically is not working?
> >>> >>>> > > > > >> >> >>
> >>> >>>> > > > > >> >> >> I think the only thing in the paper that we don't
> >>> >>>> > > > > >> >> >> have in a working state is MSCR fusion. It's mostly
> >>> >>>> > > > > >> >> >> just a question of prioritizing it and getting the
> >>> >>>> > > > > >> >> >> work done.
> >>> >>>> > > > > >> >> >>
> >>> >>>> > > > > >> >> >>>
> >>> >>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com>
> >>> >>>> > > > > >> >> >>> wrote:
> >>> >>>> > > > > >> >> >>>> Hey Dmitriy,
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>> Got a fork going and looking forward to playing
> >>> >>>> > > > > >> >> >>>> with crunchR this weekend -- thanks!
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>> J
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> >>>> wrote:
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>>> Project template: https://github.com/dlyubimov/crunchR
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> The default profile does not compile the R
> >>> >>>> > > > > >> >> >>>>> artifact; the R profile does. For convenience, it
> >>> >>>> > > > > >> >> >>>>> is enabled by supplying -DR on the mvn command
> >>> >>>> > > > > >> >> >>>>> line, e.g.
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> mvn install -DR
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> There's also a helper that installs the snapshot
> >>> >>>> > > > > >> >> >>>>> version of the package in the crunchR module.
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> There are rJava and JRI java dependencies which I
> >>> >>>> > > > > >> >> >>>>> did not find anywhere in the public maven repos,
> >>> >>>> > > > > >> >> >>>>> so they are installed into my github maven repo
> >>> >>>> > > > > >> >> >>>>> so far. Should compile for third parties.
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> -DR compilation requires R, rJava and, optionally,
> >>> >>>> > > > > >> >> >>>>> RProtoBuf. R doc compilation requires roxygen2 (I
> >>> >>>> > > > > >> >> >>>>> think).
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into
> >>> >>>> > > > > >> >> >>>>> another package -- I got a weird exception when I
> >>> >>>> > > > > >> >> >>>>> put @import RProtoBuf into crunchR, so RProtoBuf
> >>> >>>> > > > > >> >> >>>>> is now in the "Suggests" category. Down the road
> >>> >>>> > > > > >> >> >>>>> that may be a problem, though...
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> Other than the template, not much else has been
> >>> >>>> > > > > >> >> >>>>> done so far... finding the hadoop libraries and
> >>> >>>> > > > > >> >> >>>>> adding them to the package path on initialization
> >>> >>>> > > > > >> >> >>>>> via "hadoop classpath"... adding the Crunch jars
> >>> >>>> > > > > >> >> >>>>> and their non-"provided" transitives to crunchR's
> >>> >>>> > > > > >> >> >>>>> java part...
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> No legal stuff...
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> >>>>> wrote:
> >>> >>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll a project template
> >>> >>>> > > > > >> >> >>>>> > by some time next week. We can start with
> >>> >>>> > > > > >> >> >>>>> > prototyping and benchmarking something really
> >>> >>>> > > > > >> >> >>>>> > simple, such as parallelDo().
> >>> >>>> > > > > >> >> >>>>> >
> >>> >>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or
> >>> >>>> > > > > >> >> >>>>> > less simple algorithm from Mahout and
> >>> >>>> > > > > >> >> >>>>> > demonstrate it can be solved with Rcrunch (or
> >>> >>>> > > > > >> >> >>>>> > whatever name it has to be) in a comparable
> >>> >>>> > > > > >> >> >>>>> > time (performance) but with much fewer lines of
> >>> >>>> > > > > >> >> >>>>> > code (say, one of the factorization or
> >>> >>>> > > > > >> >> >>>>> > clustering things).
> >>> >>>> > > > > >> >> >>>>> >
> >>> >>>> > > > > >> >> >>>>> >
> >>> >>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com>
> >>> >>>> > > > > >> >> >>>>> > wrote:
> >>> >>>> > > > > >> >> >>>>> >> I am not much of an R user, but I am
> >>> >>>> > > > > >> >> >>>>> >> interested to see how well we can integrate
> >>> >>>> > > > > >> >> >>>>> >> the two. I would be happy to help.
> >>> >>>> > > > > >> >> >>>>> >>
> >>> >>>> > > > > >> >> >>>>> >> regards,
> >>> >>>> > > > > >> >> >>>>> >> Rahul
> >>> >>>> > > > > >> >> >>>>> >>
> >>> >>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> >>>>> >>> wrote:
> >>> >>>> > > > > >> >> >>>>> >>>>
> >>> >>>> > > > > >> >> >>>>> >>>> Yep, ok.
> >>> >>>> > > > > >> >> >>>>> >>>>
> >>> >>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module, so I can
> >>> >>>> > > > > >> >> >>>>> >>>> set up a maven project with a java/R code
> >>> >>>> > > > > >> >> >>>>> >>>> tree (I have been doing that a lot lately).
> >>> >>>> > > > > >> >> >>>>> >>>> Or if you have a template to look at, that
> >>> >>>> > > > > >> >> >>>>> >>>> would be useful too, I guess.
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>> No, please go right ahead.
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>>>
> >>> >>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com>
> >>> >>>> > > > > >> >> >>>>> >>>> wrote:
> >>> >>>> > > > > >> >> >>>>> >>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I
> >>> >>>> > > > > >> >> >>>>> >>>>> am happy to help. Github repo?
> >>> >>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> >>>>> >>>>> wrote:
> >>> >>>> > > > > >> >> >>>>> >>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>> Ok, maybe there's a benefit to trying a
> >>> >>>> > > > > >> >> >>>>> >>>>>> JRI/rJava prototype on top of Crunch for
> >>> >>>> > > > > >> >> >>>>> >>>>>> something simple. This should both save
> >>> >>>> > > > > >> >> >>>>> >>>>>> time and prove or disprove whether Crunch
> >>> >>>> > > > > >> >> >>>>> >>>>>> integration via rJava is viable.
> >>> >>>> > > > > >> >> >>>>> >>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>> On my part, I can try to do it within the
> >>> >>>> > > > > >> >> >>>>> >>>>>> Crunch framework, or we can keep it
> >>> >>>> > > > > >> >> >>>>> >>>>>> completely separate.
> >>> >>>> > > > > >> >> >>>>> >>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>> -d
> >>> >>>> > > > > >> >> >>>>> >>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com>
> >>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
> >>> >>>> > > > > >> >> >>>>> >>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it
> >>> >>>> > > > > >> >> >>>>> >>>>>>> -- who gave the talk? Was it Murray
> >>> >>>> > > > > >> >> >>>>> >>>>>>> Stokely?
> >>> >>>> > > > > >> >> >>>>> >>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >>> >>>> > > > > >> >> >>>>> >>>>>>> wrote:
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> Hello,
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> experience with an R mapping of FlumeJava
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> at one of the recent BARUGs. I think a
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> lot of applications similar to what we do
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> in Mahout could be prototyped using such
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> a "flume R".
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> Google's implementation of the R mapping,
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> but I am not sure a direct mapping from R
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> to Crunch would be sufficient (and, for
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> the most part, efficient). rJava/JRI and
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> JNI seem to be pretty terrible performers
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> for doing that directly.
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> On top of that, I am thinking that if
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> this project could have a contributed
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices,
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> that would be a very good synergy.
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> contributing/advising for an open source
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> interest; the Crunch list seems like a
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> natural place to poke.
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks.
> >>> >>>> > > > > >> >> >>>>> >>>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
> >>> >>>> > > > > >> >> >>>>> >>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>>
> >>> >>>> > > > > >> >> >>>>> >>>>>>> --
> >>> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
> >>> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera
> >>> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>>
> >>> >>>> > > > > >> >> >>>>> >>
> >>> >>>> > > > > >> >> >>>>>
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>>
> >>> >>>> > > > > >> >> >>>> --
> >>> >>>> > > > > >> >> >>>> Director of Data Science
> >>> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> >>> >>>> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>> >>>> > > > > >> >>
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> >
> >>> >>>> > > > > >> > --
> >>> >>>> > > > > >> > Director of Data Science
> >>> >>>> > > > > >> > Cloudera <http://www.cloudera.com>
> >>> >>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>> >>>> > > > > >>
> >>> >>>> > > > > >
> >>> >>>> > > > > >
> >>> >>>> > > > > >
> >>> >>>> > > > > > --
> >>> >>>> > > > > > Director of Data Science
> >>> >>>> > > > > > Cloudera <http://www.cloudera.com>
> >>> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>> >>>> > > > >
> >>> >>>> > > >
> >>> >>>> > >
> >>> >>>> >
> >>> >>>>
> >>> >>>>
> >>> >>>>
> >>> >>>> --
> >>> >>>> Director of Data Science
> >>> >>>> Cloudera <http://www.cloudera.com>
> >>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>
> >>>
> >>>
> >>>
> >>> --
> >>> Director of Data Science
> >>> Cloudera
> >>> Twitter: @josh_wills
> >>>
> >>
> >>
> >
>
