incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Flume R -- any interest?
Date Tue, 30 Oct 2012 01:32:27 GMT
yes...

i think it worked for me before, although just adding all jars from R
package distribution would be a little bit more appropriate approach
-- but it creates a problem with jars in dependent R packages. I think
it would be much easier to just compile a hadoop-job file and stick it
in rather than doing cherry-picking of individual jars from who knows
how many locations.

i think i used the hadoop job format with distributed cache before and
it worked... at least with Pig "register jar" functionality.

ok i guess i will just try if it works.
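
For readers following along, here is a minimal sketch of the hadoop-job layout discussed in this thread (dependency jars kept under lib/ inside a single jar), using only java.util.jar. The names JobJarSketch, packJobJar and listBundledJars are made up for illustration and are not part of Crunch or Hadoop. Whether the distributed-cache route honors the nested lib/ directory is the open question here, so this only shows how such a file could be assembled and inspected.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Hypothetical helper, not a Crunch or Hadoop API.
final class JobJarSketch {

    // Writes a new jar at jobJar that stores each dependency jar as lib/<file name>.
    // Assumes the dependency file names are unique; a duplicate name would fail.
    static void packJobJar(Path jobJar, List<Path> dependencyJars) throws IOException {
        try (OutputStream out = Files.newOutputStream(jobJar);
             JarOutputStream jar = new JarOutputStream(out)) {
            for (Path dep : dependencyJars) {
                jar.putNextEntry(new JarEntry("lib/" + dep.getFileName()));
                Files.copy(dep, jar);   // copy the dependency's bytes into the entry
                jar.closeEntry();
            }
        }
    }

    // Returns the sorted names of all lib/*.jar entries found inside jobJar.
    static List<String> listBundledJars(Path jobJar) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jobJar.toFile())) {
            jar.stream()
               .map(JarEntry::getName)
               .filter(n -> n.startsWith("lib/") && n.endsWith(".jar"))
               .sorted()
               .forEach(names::add);
        }
        return names;
    }
}
```

A real job jar would also carry the job's own classes at the top level; the sketch leaves those out to keep the lib/ convention visible.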

On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jwills@cloudera.com> wrote:
> On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
>> Great! so it is in Crunch.
>>
>> does it support hadoop-job jar format or only pure java jars?
>>
>
> I think just pure jars-- you're referring to hadoop-job format as having
> all the dependencies in a lib/ directory within the jar?
>
>
>>
>> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jwills@cloudera.com> wrote:
>> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >
>> >> I think i need functionality to add more jars (or external hadoop-jar)
>> >> to drive that from an R package. Just setting job jar by class is not
>> >> enough. I can push overall job-jar as an additional jar to R package;
>> >> however, i cannot really run hadoop command line on it, i need to set
>> >> up classpath thru RJava.
>> >>
>> >> Traditional single hadoop job jar will unlikely work here since we
>> >> cannot hardcode pipelines in java code but rather have to construct
>> >> them on the fly. (well, we could serialize pipeline definitions from R
>> >> and then replay them in a driver -- but that's too cumbersome and more
>> >> work than it has to be.) There's no reason why i shouldn't be able to
>> >> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
>> >> off a pipeline.
>> >>
>> >
>> > o.a.c.util.DistCache.addJarToDistributedCache?
>> >
>> >
>> >>
>> >>
>> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >> > Ok, sounds very promising...
>> >> >
>> >> > i'll try to start digging on the driver part this week then (Pipeline
>> >> > wrapper in R5).
>> >> >
>> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >> >>> Ok, cool.
>> >> >>>
>> >> >>> So what state is Crunch in? I take it it is in a fairly advanced state.
>> >> >>> So every api mentioned in the FlumeJava paper is working, right? Or
>> >> >>> there's something that is not working specifically?
>> >> >>
>> >> >> I think the only thing in the paper that we don't have in a working
>> >> >> state is MSCR fusion. It's mostly just a question of prioritizing it
>> >> >> and getting the work done.
>> >> >>
>> >> >>>
>> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com> wrote:
>> >> >>>> Hey Dmitriy,
>> >> >>>>
>> >> >>>> Got a fork going and looking forward to playing with crunchR this weekend--
>> >> >>>> thanks!
>> >> >>>>
>> >> >>>> J
>> >> >>>>
>> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >> >>>>
>> >> >>>>> Project template https://github.com/dlyubimov/crunchR
>> >> >>>>>
>> >> >>>>> Default profile does not compile R artifact. R profile compiles R
>> >> >>>>> artifact. for convenience, it is enabled by supplying -DR to mvn
>> >> >>>>> command line, e.g.
>> >> >>>>>
>> >> >>>>> mvn install -DR
>> >> >>>>>
>> >> >>>>> there's also a helper that installs the snapshot version of the
>> >> >>>>> package in the crunchR module.
>> >> >>>>>
>> >> >>>>> There's RJava and JRI java dependencies which i did not find anywhere
>> >> >>>>> in public maven repos; so it is installed into my github maven repo so
>> >> >>>>> far. Should compile for 3rd party.
>> >> >>>>>
>> >> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
>> >> >>>>> compilation requires roxygen2 (i think).
>> >> >>>>>
>> >> >>>>> For some reason RProtoBuf fails to import into another package, got a
>> >> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
>> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
>> >> >>>>> problem though...
>> >> >>>>>
>> >> >>>>> other than the template, not much else has been done so far... finding
>> >> >>>>> hadoop libraries and adding it to the package path on initialization
>> >> >>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
>> >> >>>>> transitives to the crunchR's java part...
>> >> >>>>>
>> >> >>>>> No legal stuff...
>> >> >>>>>
>> >> >>>>> No readmes... complete stealth at this point.
>> >> >>>>>
>> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >> >>>>> > Ok, cool. I will try to roll project template by some time next week.
>> >> >>>>> > we can start with prototyping and benchmarking something really
>> >> >>>>> > simple, such as parallelDo().
>> >> >>>>> >
>> >> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
>> >> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
>> >> >>>>> > name it has to be) in a comparable time (performance) but with much
>> >> >>>>> > fewer lines of code. (say one of factorization or clustering things)
>> >> >>>>> >
>> >> >>>>> >
>> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
>> >> >>>>> >> I am not much of R user but I am interested to see how well we can
>> >> >>>>> >> integrate the two. I would be happy to help.
>> >> >>>>> >>
>> >> >>>>> >> regards,
>> >> >>>>> >> Rahul
>> >> >>>>> >>
>> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> >> >>>>> >>>
>> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >> >>>>> >>>>
>> >> >>>>> >>>> Yep, ok.
>> >> >>>>> >>>>
>> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
>> >> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
>> >> >>>>> >>>> have a template to look at, it would be useful i guess too.
>> >> >>>>> >>>
>> >> >>>>> >>> No, please go right ahead.
>> >> >>>>> >>>
>> >> >>>>> >>>>
>> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >>>>> >>>>>
>> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
>> >> >>>>> >>>>> repo?
>> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
>> >> >>>>> >>>>>
>> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
>> >> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
>> >> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
>> >> >>>>> >>>>>>
>> >> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
>> >> >>>>> >>>>>> it completely separate.
>> >> >>>>> >>>>>>
>> >> >>>>> >>>>>> -d
>> >> >>>>> >>>>>>
>> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com> wrote:
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
>> >> >>>>> >>>>>>> Murray Stokely?
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> Hello,
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
>> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot of applications
>> >> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> I did not quite get the details of Google implementation of R mapping,
>> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping from R to Crunch would be
>> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient). RJava/JRI and jni seem to
>> >> >>>>> >>>>>>>> be a pretty terrible performer to do that directly.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
>> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
>> >> >>>>> >>>>>>>> good synergy.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
>> >> >>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
>> >> >>>>> >>>>>>>> like a natural place to poke.
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> Thanks .
>> >> >>>>> >>>>>>>>
>> >> >>>>> >>>>>>>> -Dmitriy
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>>
>> >> >>>>> >>>>>>> --
>> >> >>>>> >>>>>>> Director of Data Science
>> >> >>>>> >>>>>>> Cloudera
>> >> >>>>> >>>>>>> Twitter: @josh_wills
>> >> >>>>> >>>
>> >> >>>>> >>>
>> >> >>>>> >>>
>> >> >>>>> >>
>> >> >>>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> --
>> >> >>>> Director of Data Science
>> >> >>>> Cloudera <http://www.cloudera.com>
>> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >>
>> >
>> >
>> >
>> > --
>> > Director of Data Science
>> > Cloudera <http://www.cloudera.com>
>> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
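
The quoted Oct 24 message mentions finding the Hadoop libraries via "hadoop classpath" and adding them to the package's Java path at initialization. As a rough sketch of that step, the code below takes the text printed by `hadoop classpath` (assumed to be already captured as a string) and turns it into a flat list of concrete paths. It assumes entries are separated by the platform path separator and that an entry ending in `*` is a directory wildcard. HadoopClasspathSketch and expand are illustrative names, and the idea that the R side would want explicit jar paths rather than wildcards is an assumption, not something stated in the thread.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of crunchR, Crunch or Hadoop.
final class HadoopClasspathSketch {

    // Splits a classpath string and replaces each "dir/*" entry with the
    // *.jar files found directly in that directory (sorted for stable order).
    // Plain entries (directories or single jars) are passed through unchanged.
    static List<String> expand(String classpath) throws IOException {
        List<String> result = new ArrayList<>();
        for (String entry : classpath.trim().split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            if (entry.endsWith("*")) {
                Path dir = Paths.get(entry.substring(0, entry.length() - 1));
                if (!Files.isDirectory(dir)) {
                    continue;   // wildcard over a missing directory contributes nothing
                }
                List<String> jars = new ArrayList<>();
                // Only matches names ending in lowercase ".jar".
                try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.jar")) {
                    for (Path jar : stream) {
                        jars.add(jar.toString());
                    }
                }
                Collections.sort(jars);
                result.addAll(jars);
            } else {
                result.add(entry);
            }
        }
        return result;
    }
}
```

The resulting list could then be handed to whatever classpath call the R package uses on load; that wiring is left out because the thread does not describe it.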
