incubator-crunch-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Flume R -- any interest?
Date Tue, 13 Nov 2012 01:41:21 GMT
For hadoop nodes, I guess yet another option is to soft-link the .so
(libjri.so) into hadoop's native lib folder.
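
Whichever option a node uses (soft link, -Djava.library.path, or LD_LIBRARY_PATH), it can help to check from inside the JVM whether the library is actually visible before calling loadLibrary(). Below is a minimal sketch of such a check. The class name NativeLibCheck and the helper findLibrary are hypothetical names made up for illustration; they are not part of Crunch, rJava or hadoop. It only looks at the directories in a search path and does not load anything.

```java
import java.io.File;

// Hypothetical helper (not part of Crunch, rJava or Hadoop): looks for a
// native library file in a path-separator-delimited list of directories,
// such as the value of the java.library.path system property.
public class NativeLibCheck {

    // Returns the first matching library file found in searchPath, or null.
    // libName is the short name passed to System.loadLibrary(), e.g. "jri".
    static File findLibrary(String searchPath, String libName) {
        if (searchPath == null || searchPath.isEmpty()) {
            return null;
        }
        // Maps "jri" to the platform file name, e.g. "libjri.so" on Linux.
        String fileName = System.mapLibraryName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.library.path");
        File hit = findLibrary(path, "jri");
        if (hit == null) {
            System.out.println("jri not found on java.library.path: " + path);
        } else {
            System.out.println("jri found at " + hit);
        }
    }
}
```

Running this with the same JVM options as the task JVM would show whether the soft link or path setting took effect on that node. That assumes the node setup is what feeds java.library.path, which depends on the hadoop and jre versions in use.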


On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> I actually want to defer this to hadoop admins; we just need to create a
> procedure for setting up nodes, ideally as simple as possible. Something
> like
>
> 1) set up R
> 2) install.packages(c("rJava","RProtoBuf","crunchR"))
> 3) R CMD javareconf
> 4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")' to
> either the mapred command lines or LD_LIBRARY_PATH...
>
> but it will depend on their versions of hadoop, jre etc. I hoped crunch
> might have something to hide a lot of that complexity (since it is about
> hiding complexities, for the most part :)  ). Besides, hadoop has a way to
> ship .so's to the backend, so if crunch had an api to do something similar,
> it is conceivable that the driver might grab and ship it too, to hide that
> complexity as well. But then there's a host of issues around how to handle
> potentially different rJava versions installed on different nodes... So it
> increasingly looks like something we might want to defer to sysops, with an
> approximate set of requirements.
>
>
> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>>
>> > so java tasks need to be able to load libjri.so from
>> > whatever system.file("jri", package="rJava") says.
>> >
>> > Traditionally, these issues were handled with -Djava.library.path.
>> > Apparently there's nothing a java task can do to make loadLibrary()
>> > see the damn library once the jvm has started. But -Djava.library.path
>> > requires the nodes to configure the jvm command line and lock it against
>> > modification by the client, which is fine.
>> >
>> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
>> > (again).
>> >
>> > but... any other suggestions about best practice for configuring crunch
>> > to run user's .so's?
>> >
>>
>> Not off the top of my head. I suspect that whatever you come up with will
>> become the "best practice." :)
>>
>> >
>> > thanks.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <josh.wills@gmail.com>
>> > wrote:
>> >
>> > > I believe that is a safe assumption, at least right now.
>> > >
>> > >
>> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > wrote:
>> > >
>> > > > Question.
>> > > >
>> > > > So in the Crunch api, initialize() doesn't get an emitter, and
>> > > > process() gets an emitter every time.
>> > > >
>> > > > However, my guess is that any single reincarnation of a DoFn object in
>> > > > the backend will always be getting the same emitter through its
>> > > > lifecycle. Is it an admissible assumption, or is there currently a
>> > > > counterexample to that?
>> > > >
>> > > > The problem is that as i implement the two-way pipeline of input and
>> > > > emitter data between R and Java, I am bulking these calls together for
>> > > > performance reasons. Each individual datum in these chunks of data will
>> > > > not have emitter function information attached to it in any way (well,
>> > > > it could, but it would be a performance killer, and i bet the emitter
>> > > > never changes).
>> > > >
>> > > > So, thoughts? Can i assume the emitter never changes between the first
>> > > > and last call to a DoFn instance?
>> > > >
>> > > > thanks.
>> > > >
>> > > >
>> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > yes...
>> > > > >
>> > > > > i think it worked for me before, although just adding all jars from
>> > > > > the R package distribution would be a somewhat more appropriate
>> > > > > approach -- but it creates a problem with jars in dependent R
>> > > > > packages. I think it would be much easier to just compile a
>> > > > > hadoop-job file and stick it in rather than cherry-picking
>> > > > > individual jars from who knows how many locations.
>> > > > >
>> > > > > i think i used the hadoop job format with distributed cache before
>> > > > > and it worked... at least with Pig "register jar" functionality.
>> > > > >
>> > > > > ok i guess i will just try if it works.
>> > > > >
>> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jwills@cloudera.com>
>> > > > > wrote:
>> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > >> Great! so it is in Crunch.
>> > > > > >>
>> > > > > >> does it support hadoop-job jar format or only pure java jars?
>> > > > > >>
>> > > > > >
>> > > > > > I think just pure jars-- you're referring to hadoop-job format as
>> > > > > > having all the dependencies in a lib/ directory within the jar?
>> > > > > >
>> > > > > >
>> > > > > >>
>> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jwills@cloudera.com>
>> > > > > >> wrote:
>> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > > >> > wrote:
>> > > > > >> >
>> > > > > >> >> I think i need functionality to add more jars (or an external
>> > > > > >> >> hadoop-jar) to drive that from an R package. Just setting the
>> > > > > >> >> job jar by class is not enough. I can push the overall job-jar
>> > > > > >> >> as an additional jar to the R package; however, i cannot really
>> > > > > >> >> run the hadoop command line on it, i need to set up the
>> > > > > >> >> classpath through RJava.
>> > > > > >> >>
>> > > > > >> >> A traditional single hadoop job jar is unlikely to work here
>> > > > > >> >> since we cannot hardcode pipelines in java code but rather have
>> > > > > >> >> to construct them on the fly. (well, we could serialize pipeline
>> > > > > >> >> definitions from R and then replay them in a driver -- but
>> > > > > >> >> that's too cumbersome and more work than it has to be.) There's
>> > > > > >> >> no reason why i shouldn't be able to do a pig-like "register
>> > > > > >> >> jar" or "setJobJar" (mahout-like) when kicking off a pipeline.
>> > > > > >> >>
>> > > > > >> >
>> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >>
>> > > > > >> >>
>> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > > >> >> wrote:
>> > > > > >> >> > Ok, sounds very promising...
>> > > > > >> >> >
>> > > > > >> >> > i'll try to start digging on the driver part this week then
>> > > > > >> >> > (Pipeline wrapper in R5).
>> > > > > >> >> >
>> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com>
>> > > > > >> >> > wrote:
>> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > > >> >> >> wrote:
>> > > > > >> >> >>> Ok, cool.
>> > > > > >> >> >>>
>> > > > > >> >> >>> So what state is Crunch in? I take it it is in a fairly
>> > > > > >> >> >>> advanced state. So every api mentioned in the FlumeJava paper
>> > > > > >> >> >>> is working, right? Or is there something specific that is not
>> > > > > >> >> >>> working?
>> > > > > >> >> >>
>> > > > > >> >> >> I think the only thing in the paper that we don't have in a
>> > > > > >> >> >> working state is MSCR fusion. It's mostly just a question of
>> > > > > >> >> >> prioritizing it and getting the work done.
>> > > > > >> >> >>
>> > > > > >> >> >>>
>> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com>
>> > > > > >> >> >>> wrote:
>> > > > > >> >> >>>> Hey Dmitriy,
>> > > > > >> >> >>>>
>> > > > > >> >> >>>> Got a fork going and looking forward to playing with crunchR
>> > > > > >> >> >>>> this weekend-- thanks!
>> > > > > >> >> >>>>
>> > > > > >> >> >>>> J
>> > > > > >> >> >>>>
>> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > > >> >> >>>> wrote:
>> > > > > >> >> >>>>
>> > > > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> The default profile does not compile the R artifact. The R
>> > > > > >> >> >>>>> profile compiles the R artifact. For convenience, it is
>> > > > > >> >> >>>>> enabled by supplying -DR to the mvn command line, e.g.
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> mvn install -DR
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> there's also a helper that installs the snapshot version of
>> > > > > >> >> >>>>> the package in the crunchR module.
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> There are RJava and JRI java dependencies which i did not find
>> > > > > >> >> >>>>> anywhere in public maven repos, so they are installed into my
>> > > > > >> >> >>>>> github maven repo so far. Should compile for 3rd parties.
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> -DR compilation requires R, RJava and, optionally, RProtoBuf.
>> > > > > >> >> >>>>> R doc compilation requires roxygen2 (i think).
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into another package;
>> > > > > >> >> >>>>> i got a weird exception when i put @import RProtoBuf into
>> > > > > >> >> >>>>> crunchR, so RProtoBuf is now in the "Suggests" category. Down
>> > > > > >> >> >>>>> the road that may be a problem though...
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> other than the template, not much else has been done so far...
>> > > > > >> >> >>>>> finding hadoop libraries and adding them to the package path
>> > > > > >> >> >>>>> on initialization via "hadoop classpath"... adding Crunch jars
>> > > > > >> >> >>>>> and their non-"provided" transitives to crunchR's java part...
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> No legal stuff...
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> No readmes... complete stealth at this point.
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > > >> >> >>>>> wrote:
>> > > > > >> >> >>>>> > Ok, cool. I will try to roll a project template by some time
>> > > > > >> >> >>>>> > next week. we can start with prototyping and benchmarking
>> > > > > >> >> >>>>> > something really simple, such as parallelDo().
>> > > > > >> >> >>>>> >
>> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or less simple
>> > > > > >> >> >>>>> > algorithm from Mahout and demonstrate it can be solved with
>> > > > > >> >> >>>>> > Rcrunch (or whatever name it has to be) in a comparable time
>> > > > > >> >> >>>>> > (performance) but with far fewer lines of code. (say one of
>> > > > > >> >> >>>>> > the factorization or clustering things)
>> > > > > >> >> >>>>> >
>> > > > > >> >> >>>>> >
>> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
>> > > > > >> >> >>>>> >> I am not much of an R user but I am interested to see how well
>> > > > > >> >> >>>>> >> we can integrate the two. I would be happy to help.
>> > > > > >> >> >>>>> >>
>> > > > > >> >> >>>>> >> regards,
>> > > > > >> >> >>>>> >> Rahul
>> > > > > >> >> >>>>> >>
>> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > > >> >> >>>>> >>> wrote:
>> > > > > >> >> >>>>> >>>>
>> > > > > >> >> >>>>> >>>> Yep, ok.
>> > > > > >> >> >>>>> >>>>
>> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven
>> > > > > >> >> >>>>> >>>> project with java/R code tree (I have been doing that a lot
>> > > > > >> >> >>>>> >>>> lately). Or if you have a template to look at, it would be
>> > > > > >> >> >>>>> >>>> useful too, i guess.
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>> No, please go right ahead.
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>>>
>> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com>
>> > > > > >> >> >>>>> >>>> wrote:
>> > > > > >> >> >>>>> >>>>>
>> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help.
>> > > > > >> >> >>>>> >>>>> Github repo?
>> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com>
>> > > > > >> >> >>>>> >>>>> wrote:
>> > > > > >> >> >>>>> >>>>>
>> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to trying a JRI/RJava prototype on top
>> > > > > >> >> >>>>> >>>>>> of Crunch for something simple. This should both save time and
>> > > > > >> >> >>>>> >>>>>> prove or disprove whether Crunch via RJava integration is viable.
>> > > > > >> >> >>>>> >>>>>>
>> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within the Crunch framework, or we
>> > > > > >> >> >>>>> >>>>>> can keep it completely separate.
>> > > > > >> >> >>>>> >>>>>>
>> > > > > >> >> >>>>> >>>>>> -d
>> > > > > >> >> >>>>> >>>>>>
>> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com>
>> > > > > >> >> >>>>> >>>>>> wrote:
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk?
>> > > > > >> >> >>>>> >>>>>>> Was it Murray Stokely?
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > > > >> >> >>>>> >>>>>>> wrote:
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> Hello,
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience with an R
>> > > > > >> >> >>>>> >>>>>>>> mapping of FlumeJava at one of the recent BARUGs. I think a lot
>> > > > > >> >> >>>>> >>>>>>>> of applications similar to what we do in Mahout could be
>> > > > > >> >> >>>>> >>>>>>>> prototyped using flume R.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google's implementation of
>> > > > > >> >> >>>>> >>>>>>>> the R mapping, but i am not sure if just a direct mapping from
>> > > > > >> >> >>>>> >>>>>>>> R to Crunch would be sufficient (and, for the most part,
>> > > > > >> >> >>>>> >>>>>>>> efficient). RJava/JRI and jni seem to be pretty terrible
>> > > > > >> >> >>>>> >>>>>>>> performers for doing that directly.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> On top of it, I am thinking that if this project could have a
>> > > > > >> >> >>>>> >>>>>>>> contributed adapter to Mahout's distributed matrices, that
>> > > > > >> >> >>>>> >>>>>>>> would be just a very good synergy.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for an open
>> > > > > >> >> >>>>> >>>>>>>> source version of flume R support? Just gauging interest; the
>> > > > > >> >> >>>>> >>>>>>>> Crunch list seems like a natural place to poke.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> Thanks.
>> > > > > >> >> >>>>> >>>>>>>>
>> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>>
>> > > > > >> >> >>>>> >>>>>>> --
>> > > > > >> >> >>>>> >>>>>>> Director of Data Science
>> > > > > >> >> >>>>> >>>>>>> Cloudera
>> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>>
>> > > > > >> >> >>>>> >>
>> > > > > >> >> >>>>>
>> > > > > >> >> >>>>
>> > > > > >> >> >>>>
>> > > > > >> >> >>>>
>> > > > > >> >> >>>> --
>> > > > > >> >> >>>> Director of Data Science
>> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
>> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> > > > > >> >>
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > --
>> > > > > >> > Director of Data Science
>> > > > > >> > Cloudera <http://www.cloudera.com>
>> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> > > > > >>
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Director of Data Science
>> > > > > > Cloudera <http://www.cloudera.com>
>> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
