incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Flume R -- any interest?
Date Fri, 16 Nov 2012 23:04:52 GMT
How do I hook into CrunchTaskContext to do task-level cleanup (as opposed to
cleanup in a DoFn, etc.)?


On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> no it is fully distributed testing.
>
> It is ok, StatEt handles log4j logging for me so I see the logs. I was
> wondering if any end-to-end diagnostics are already embedded in Crunch, but
> reporting backend errors to the front end is notoriously hard (and sometimes
> impossible) with Hadoop, so I assume it doesn't make sense to report
> client-only stuff through an exception while the other stuff still requires
> checking isSucceeded().
>
>
>
> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Are you running this using LocalJobRunner? Does calling
>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
>> settle a debate I'm having w/Matthias. ;-)
>>
>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>> > I see the error in the logs but Pipeline.run() has never thrown anything.
>> > isSucceeded() subsequently returns false. Is there any way to extract the
>> > client-side problem rather than just being able to state that the job
>> > failed? Or is that OK and the only diagnostics by design?
>> >
>> > ============
>> > 68124 [Thread-8] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  -
>> > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
>> > does not exist: hdfs://localhost:11010/crunchr-example/input
>> > at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
>> > at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
>> > at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
>> > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
>> > at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>> > at java.security.AccessController.doPrivileged(Native Method)
>> > at javax.security.auth.Subject.doAs(Subject.java:396)
>> > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>> > at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>> > at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
>> > at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
>> > at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
>> > at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
>> > at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
>> > at java.lang.Thread.run(Thread.java:662)
>> >
>> >
>> > On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>> >
>> >> For hadoop nodes, I guess yet another option is to soft-link the .so
>> >> into hadoop's native lib folder.
>> >>
>> >>
>> >> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >wrote:
>> >>
>> >>> I actually want to defer this to hadoop admins; we just need to create a
>> >>> procedure for setting up nodes, ideally as simple as possible. Something
>> >>> like:
>> >>>
>> >>> 1) set up R
>> >>> 2) install.packages(c("rJava","RProtoBuf","crunchR"))
>> >>> 3) R CMD javareconf
>> >>> 4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")'
>> >>> to either the mapred command lines or LD_LIBRARY_PATH...
>> >>>
>> >>> But it will depend on their versions of hadoop, jre etc. I hoped crunch
>> >>> might have something to hide a lot of that complexity (since it is about
>> >>> hiding complexities, for the most part :) ). Besides, hadoop has a way to
>> >>> ship .so's to the backend, so if crunch had an api to do something similar
>> >>> it is conceivable that the driver might yank and ship it too, to hide that
>> >>> complexity as well. But then there's a host of issues around how to handle
>> >>> potentially different rJava versions installed on different nodes... So it
>> >>> increasingly looks like something we might want to defer to sysops to do
>> >>> with an approximate set of requirements.
>> >>>
>> >>>
>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jwills@cloudera.com>
>> wrote:
>> >>>
>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >
>> >>>> wrote:
>> >>>>
>> >>>> > so java tasks need to be able to load libjri.so from
>> >>>> > whatever system.file("jri", package="rJava") says.
>> >>>> >
>> >>>> > Traditionally, these issues were handled with -Djava.library.path.
>> >>>> > Apparently there's nothing a java task can do to enable the
>> >>>> > loadLibrary() command to see the damn library once started. But
>> >>>> > -Djava.library.path requires the nodes to configure the jvm command
>> >>>> > line and lock it from modifications by the client, which is fine.
>> >>>> >
>> >>>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
>> >>>> > (again).
>> >>>> >
>> >>>> > but... any other suggestions about best practice for configuring
>> >>>> > crunch to run user's .so's?
>> >>>> >
>> >>>>
>> >>>> Not off the top of my head. I suspect that whatever you come up with
>> >>>> will become the "best practice." :)
>> >>>>
>> >>>> >
>> >>>> > thanks.
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <josh.wills@gmail.com>
>> >>>> wrote:
>> >>>> >
>> >>>> > > I believe that is a safe assumption, at least right now.
>> >>>> > >
>> >>>> > >
>> >>>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com
>> >>>> >
>> >>>> > > wrote:
>> >>>> > >
>> >>>> > > > Question.
>> >>>> > > >
>> >>>> > > > So in the Crunch api, initialize() doesn't get an emitter, and
>> >>>> > > > process() gets an emitter every time.
>> >>>> > > >
>> >>>> > > > However, my guess is that any single reincarnation of a DoFn object
>> >>>> > > > in the backend will always be getting the same emitter through its
>> >>>> > > > lifecycle. Is that an admissible assumption, or is there currently
>> >>>> > > > a counterexample to that?
>> >>>> > > >
>> >>>> > > > The problem is that as i implement the two-way pipeline of input and
>> >>>> > > > emitter data between R and Java, I am bulking these calls together
>> >>>> > > > for performance reasons. Each individual datum in these chunks of
>> >>>> > > > data will not have emitter function information attached to it in
>> >>>> > > > any way (well, it could, but it would be a performance killer and i
>> >>>> > > > bet the emitter never changes).
>> >>>> > > >
>> >>>> > > > So, thoughts? Can i assume the emitter never changes between the
>> >>>> > > > first and last call to a DoFn instance?
>> >>>> > > >
>> >>>> > > > thanks.
>> >>>> > > >
>> >>>> > > >
>> >>>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>> > > >
>> >>>> > > > > yes...
>> >>>> > > > >
>> >>>> > > > > i think it worked for me before, although just adding all jars from R
>> >>>> > > > > package distribution would be a little bit more appropriate approach
>> >>>> > > > > -- but it creates a problem with jars in dependent R packages. I think
>> >>>> > > > > it would be much easier to just compile a hadoop-job file and stick it
>> >>>> > > > > in rather than doing cherry-picking of individual jars from who knows
>> >>>> > > > > how many locations.
>> >>>> > > > >
>> >>>> > > > > i think i used the hadoop job format with distributed cache before and
>> >>>> > > > > it worked... at least with Pig "register jar" functionality.
>> >>>> > > > >
>> >>>> > > > > ok i guess i will just try if it works.
>> >>>> > > > >
>> >>>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jwills@cloudera.com> wrote:
>> >>>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>> > > > > >
>> >>>> > > > > >> Great! so it is in Crunch.
>> >>>> > > > > >>
>> >>>> > > > > >> does it support hadoop-job jar format or only pure java jars?
>> >>>> > > > > >>
>> >>>> > > > > >
>> >>>> > > > > > I think just pure jars-- you're referring to hadoop-job format as
>> >>>> > > > > > having all the dependencies in a lib/ directory within the jar?
>> >>>> > > > > >
>> >>>> > > > > >
>> >>>> > > > > >>
>> >>>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jwills@cloudera.com> wrote:
>> >>>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>> > > > > >> >
>> >>>> > > > > >> >> I think i need functionality to add more jars (or external hadoop-jar)
>> >>>> > > > > >> >> to drive that from an R package. Just setting job jar by class is not
>> >>>> > > > > >> >> enough. I can push overall job-jar as an additional jar to R package;
>> >>>> > > > > >> >> however, i cannot really run hadoop command line on it, i need to set
>> >>>> > > > > >> >> up classpath thru RJava.
>> >>>> > > > > >> >>
>> >>>> > > > > >> >> Traditional single hadoop job jar is unlikely to work here since we
>> >>>> > > > > >> >> cannot hardcode pipelines in java code but rather have to construct
>> >>>> > > > > >> >> them on the fly. (well, we could serialize pipeline definitions from R
>> >>>> > > > > >> >> and then replay them in a driver -- but that's too cumbersome and more
>> >>>> > > > > >> >> work than it has to be.) There's no reason why i shouldn't be able to
>> >>>> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
>> >>>> > > > > >> >> off a pipeline.
>> >>>> > > > > >> >>
>> >>>> > > > > >> >
>> >>>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
>> >>>> > > > > >> >
>> >>>> > > > > >> >
>> >>>> > > > > >> >>
>> >>>> > > > > >> >>
>> >>>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>> > > > > >> >> > Ok, sounds very promising...
>> >>>> > > > > >> >> >
>> >>>> > > > > >> >> > i'll try to start digging on the driver part this week then (Pipeline
>> >>>> > > > > >> >> > wrapper in R5).
>> >>>> > > > > >> >> >
>> >>>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com> wrote:
>> >>>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>> > > > > >> >> >>> Ok, cool.
>> >>>> > > > > >> >> >>>
>> >>>> > > > > >> >> >>> So what state is Crunch in? I take it is in a fairly advanced state.
>> >>>> > > > > >> >> >>> So every api mentioned in the FlumeJava paper is working, right? Or
>> >>>> > > > > >> >> >>> there's something that is not working specifically?
>> >>>> > > > > >> >> >>
>> >>>> > > > > >> >> >> I think the only thing in the paper that we don't have in a working
>> >>>> > > > > >> >> >> state is MSCR fusion. It's mostly just a question of prioritizing it
>> >>>> > > > > >> >> >> and getting the work done.
>> >>>> > > > > >> >> >>
>> >>>> > > > > >> >> >>>
>> >>>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com> wrote:
>> >>>> > > > > >> >> >>>> Hey Dmitriy,
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>> Got a fork going and looking forward to playing with crunchR this
>> >>>> > > > > >> >> >>>> weekend-- thanks!
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>> J
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> Default profile does not compile R artifact. R profile compiles R
>> >>>> > > > > >> >> >>>>> artifact. for convenience, it is enabled by supplying -DR to mvn
>> >>>> > > > > >> >> >>>>> command line, e.g.
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> mvn install -DR
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> there's also a helper that installs the snapshot version of the
>> >>>> > > > > >> >> >>>>> package in the crunchR module.
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i did not find anywhere
>> >>>> > > > > >> >> >>>>> in public maven repos; so it is installed into my github maven repo so
>> >>>> > > > > >> >> >>>>> far. Should compile for 3rd party.
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
>> >>>> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into another package, got a
>> >>>> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
>> >>>> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
>> >>>> > > > > >> >> >>>>> problem though...
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> other than the template, not much else has been done so far... finding
>> >>>> > > > > >> >> >>>>> hadoop libraries and adding it to the package path on initialization
>> >>>> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
>> >>>> > > > > >> >> >>>>> transitives to the crunchR's java part...
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> No legal stuff...
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >>>> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template by some time next week.
>> >>>> > > > > >> >> >>>>> > we can start with prototyping and benchmarking something really
>> >>>> > > > > >> >> >>>>> > simple, such as parallelDo().
>> >>>> > > > > >> >> >>>>> >
>> >>>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
>> >>>> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
>> >>>> > > > > >> >> >>>>> > name it has to be) in a comparable time (performance) but with much
>> >>>> > > > > >> >> >>>>> > fewer lines of code. (say one of factorization or clustering things)
>> >>>> > > > > >> >> >>>>> >
>> >>>> > > > > >> >> >>>>> >
>> >>>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
>> >>>> > > > > >> >> >>>>> >> I am not much of an R user but I am interested to see how well we can
>> >>>> > > > > >> >> >>>>> >> integrate the two. I would be happy to help.
>> >>>> > > > > >> >> >>>>> >>
>> >>>> > > > > >> >> >>>>> >> regards,
>> >>>> > > > > >> >> >>>>> >> Rahul
>> >>>> > > > > >> >> >>>>> >>
>> >>>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> >>>> > > > > >> >> >>>>> >>> wrote:
>> >>>> > > > > >> >> >>>>> >>>>
>> >>>> > > > > >> >> >>>>> >>>> Yep, ok.
>> >>>> > > > > >> >> >>>>> >>>>
>> >>>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
>> >>>> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
>> >>>> > > > > >> >> >>>>> >>>> have a template to look at, it would be useful i guess too.
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>> No, please go right ahead.
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>>>
>> >>>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >>>> > > > > >> >> >>>>> >>>>>
>> >>>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
>> >>>> > > > > >> >> >>>>> >>>>> repo?
>> >>>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
>> >>>> > > > > >> >> >>>>> >>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
>> >>>> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
>> >>>> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
>> >>>> > > > > >> >> >>>>> >>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
>> >>>> > > > > >> >> >>>>> >>>>>> it completely separate.
>> >>>> > > > > >> >> >>>>> >>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> -d
>> >>>> > > > > >> >> >>>>> >>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com>
>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
>> >>>> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> >>>> > > > > >> >> >>>>> >>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>> wrote:
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> Hello,
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
>> >>>> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot of applications
>> >>>> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google implementation of R mapping,
>> >>>> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping from R to Crunch would be
>> >>>> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient). RJava/JRI and jni seem to
>> >>>> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that directly.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
>> >>>> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
>> >>>> > > > > >> >> >>>>> >>>>>>>> good synergy.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
>> >>>> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
>> >>>> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> Thanks.
>> >>>> > > > > >> >> >>>>> >>>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>>
>> >>>> > > > > >> >> >>>>> >>>>>>> --
>> >>>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
>> >>>> > > > > >> >> >>>>> >>>>>>> Cloudera
>> >>>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>>
>> >>>> > > > > >> >> >>>>> >>
>> >>>> > > > > >> >> >>>>>
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>>
>> >>>> > > > > >> >> >>>> --
>> >>>> > > > > >> >> >>>> Director of Data Science
>> >>>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
>> >>>> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>>> > > > > >> >>
>> >>>> > > > > >> >
>> >>>> > > > > >> >
>> >>>> > > > > >> >
>> >>>> > > > > >> > --
>> >>>> > > > > >> > Director of Data Science
>> >>>> > > > > >> > Cloudera <http://www.cloudera.com>
>> >>>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>>> > > > > >>
>> >>>> > > > > >
>> >>>> > > > > >
>> >>>> > > > > >
>> >>>> > > > > > --
>> >>>> > > > > > Director of Data Science
>> >>>> > > > > > Cloudera <http://www.cloudera.com>
>> >>>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>>> > > > >
>> >>>> > > >
>> >>>> > >
>> >>>> >
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Director of Data Science
>> >>>> Cloudera <http://www.cloudera.com>
>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>>>
>> >>>
>> >>>
>> >>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills
>>
>
>
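
On the native-library question in the quoted thread (java tasks needing to see libjri.so via -Djava.library.path or LD_LIBRARY_PATH), a small diagnostic can be run on a node to check what the JVM actually sees. This is only a sketch using standard JDK calls; the class name NativeLibCheck and the default library name "jri" are illustrative assumptions, not part of Crunch or crunchR.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Crunch or crunchR): reports whether a
// native library is visible to this JVM.
public class NativeLibCheck {

    // Lists files on java.library.path whose name matches the platform
    // mapping of libName (e.g. "jri" maps to "libjri.so" on Linux).
    static List<File> findOnLibraryPath(String libName) {
        String mapped = System.mapLibraryName(libName);
        List<File> hits = new ArrayList<>();
        String path = System.getProperty("java.library.path", "");
        for (String dir : path.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, mapped);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    // Attempts the same call a task would make; a failed lookup surfaces
    // as UnsatisfiedLinkError rather than a checked exception.
    static boolean canLoad(String libName) {
        try {
            System.loadLibrary(libName);
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String lib = args.length > 0 ? args[0] : "jri"; // assumed default
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
        System.out.println("LD_LIBRARY_PATH   = " + System.getenv("LD_LIBRARY_PATH"));
        System.out.println("candidates        = " + findOnLibraryPath(lib));
        System.out.println("loadable          = " + canLoad(lib));
    }
}
```

Running it once with and once without LD_LIBRARY_PATH (or -Djava.library.path) set to the directory reported by system.file("jri", package="rJava") should show whether the setting reaches the JVM on that node.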
