incubator-crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Flume R -- any interest?
Date Tue, 13 Nov 2012 01:29:07 GMT
On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> so java tasks need to be able to load libjri.so from
> whatever system.file("jri", package="rJava") says.
>
> Traditionally, these issues were handled with -Djava.library.path.
> Apparently there's nothing a java task can do to make loadLibrary()
> see the damn library once the JVM has started. But -Djava.library.path
> requires the nodes to configure the jvm command line and lock it against
> modification by the client, which is fine.
>
> I also discovered that LD_LIBRARY_PATH actually works with jre 1.6 (again).
>
> but... any other suggestions about best practice configuring crunch to run
> user's .so's?
>

Not off the top of my head. I suspect that whatever you come up with will
become the "best practice." :)
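
One thing that might be worth trying is System.load(), which takes an absolute path and so does not consult java.library.path at all. Below is a rough Java sketch under stated assumptions, not something tested against JRI: NativeLibLoader and its method names are made up for illustration, and the directory is assumed to be whatever system.file("jri", package="rJava") reports on the node. Two caveats: libjri.so most likely links against libR.so, which the dynamic linker still has to find on its own (so LD_LIBRARY_PATH may be needed anyway), and it is not checked here whether the JRI classes issue their own loadLibrary() call, which would still go through java.library.path.

```java
import java.io.File;

/** Hypothetical helper: loads a native library from an explicit directory. */
public final class NativeLibLoader {
    private NativeLibLoader() {}

    /** Maps e.g. "jri" to the platform file name (libjri.so on Linux) inside dir. */
    public static File resolve(String dir, String libName) {
        return new File(dir, System.mapLibraryName(libName)).getAbsoluteFile();
    }

    /**
     * Tries to load the library by absolute path, bypassing java.library.path.
     * Returns false if the file is missing or the linker rejects it.
     */
    public static boolean tryLoad(String dir, String libName) {
        File lib = resolve(dir, libName);
        if (!lib.isFile()) {
            return false;
        }
        try {
            System.load(lib.getAbsolutePath());
            return true;
        } catch (UnsatisfiedLinkError e) {
            // Typically a missing dependent library such as libR.so.
            return false;
        }
    }
}
```

On a task this would presumably be called once, for example from a DoFn's initialize(), before any JRI class is touched.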

>
> thanks.
>
>
>
>
>
>
> On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <josh.wills@gmail.com> wrote:
>
> > I believe that is a safe assumption, at least right now.
> >
> >
> > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >
> > > Question.
> > >
> > > > So in the Crunch api, initialize() doesn't get an emitter, and process()
> > > > gets the emitter every time.
> > > >
> > > > However, my guess is that any single reincarnation of a DoFn object in
> > > > the backend will always be getting the same emitter throughout its
> > > > lifecycle. Is that an admissible assumption, or is there currently a
> > > > counterexample?
> > >
> > > > The problem is that as i implement the two-way pipeline of input and
> > > > emitter data between R and Java, I am bulking these calls together for
> > > > performance reasons. The individual datums in these chunks of data will
> > > > not have emitter function information attached to them in any way.
> > > > (well, they could but it would be a performance killer and i bet the
> > > > emitter never changes).
> > > >
> > > > So, thoughts? can i assume the emitter never changes between the first
> > > > and last call to a DoFn instance?
> > >
> > > thanks.
> > >
> > >
> > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > wrote:
> > >
> > > > yes...
> > > >
> > > > > i think it worked for me before, although just adding all jars from R
> > > > > package distribution would be a little bit more appropriate approach
> > > > > -- but it creates a problem with jars in dependent R packages. I think
> > > > > it would be much easier to just compile a hadoop-job file and stick it
> > > > > in rather than doing cherry-picking of individual jars from who knows
> > > > > how many locations.
> > > > >
> > > > > i think i used the hadoop job format with distributed cache before and
> > > > > it worked... at least with Pig "register jar" functionality.
> > > >
> > > > ok i guess i will just try if it works.
> > > >
> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <jwills@cloudera.com> wrote:
> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > >
> > > > >> Great! so it is in Crunch.
> > > > >>
> > > > >> does it support hadoop-job jar format or only pure java jars?
> > > > >>
> > > > >
> > > > > > I think just pure jars-- you're referring to hadoop-job format as
> > > > > > having all the dependencies in a lib/ directory within the jar?
> > > > >
> > > > >
> > > > >>
> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <jwills@cloudera.com> wrote:
> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > > >> >
> > > > > >> >> I think i need functionality to add more jars (or external hadoop-jar)
> > > > > >> >> to drive that from an R package. Just setting job jar by class is not
> > > > > >> >> enough. I can push overall job-jar as an additional jar to R package;
> > > > > >> >> however, i cannot really run hadoop command line on it, i need to set
> > > > > >> >> up classpath thru RJava.
> > > > > >> >>
> > > > > >> >> Traditional single hadoop job jar will unlikely work here since we
> > > > > >> >> cannot hardcode pipelines in java code but rather have to construct
> > > > > >> >> them on the fly. (well, we could serialize pipeline definitions from R
> > > > > >> >> and then replay them in a driver -- but that's too cumbersome and more
> > > > > >> >> work than it has to be.) There's no reason why i shouldn't be able to
> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
> > > > > >> >> off a pipeline.
> > > > >> >>
> > > > >> >
> > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > > > >> >
> > > > >> >
> > > > >> >>
> > > > >> >>
> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > > >> >> > Ok, sounds very promising...
> > > > > >> >> >
> > > > > >> >> > i'll try to start digging on the driver part this week then
> > > > > >> >> > (Pipeline wrapper in R5).
> > > > > >> >> >
> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <josh.wills@gmail.com> wrote:
> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > >> >> >>> Ok, cool.
> > > > >> >> >>>
> > > > > >> >> >>> So what state is Crunch in? I take it is in a fairly advanced
> > > > > >> >> >>> state. So every api mentioned in the FlumeJava paper is working,
> > > > > >> >> >>> right? Or there's something that is not working specifically?
> > > > >> >> >>
> > > > > >> >> >> I think the only thing in the paper that we don't have in a
> > > > > >> >> >> working state is MSCR fusion. It's mostly just a question of
> > > > > >> >> >> prioritizing it and getting the work done.
> > > > >> >> >>
> > > > >> >> >>>
> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <jwills@cloudera.com> wrote:
> > > > > >> >> >>>> Hey Dmitriy,
> > > > > >> >> >>>>
> > > > > >> >> >>>> Got a fork going and looking forward to playing with crunchR this
> > > > > >> >> >>>> weekend-- thanks!
> > > > >> >> >>>>
> > > > >> >> >>>> J
> > > > >> >> >>>>
> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > >> >> >>>>
> > > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> > > > >> >> >>>>>
> > > > > >> >> >>>>> Default profile does not compile R artifact. R profile compiles R
> > > > > >> >> >>>>> artifact. for convenience, it is enabled by supplying -DR to mvn
> > > > > >> >> >>>>> command line, e.g.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> mvn install -DR
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> there's also a helper that installs the snapshot version of the
> > > > > >> >> >>>>> package in the crunchR module.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i did not find
> > > > > >> >> >>>>> anywhere in public maven repos; so it is installed into my github
> > > > > >> >> >>>>> maven repo so far. Should compile for 3rd party.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into another package,
> > > > > >> >> >>>>> got a weird exception when i put @import RProtoBuf into crunchR,
> > > > > >> >> >>>>> so RProtoBuf is now in "Suggests" category. Down the road that may
> > > > > >> >> >>>>> be a problem though...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> other than the template, not much else has been done so far...
> > > > > >> >> >>>>> finding hadoop libraries and adding it to the package path on
> > > > > >> >> >>>>> initialization via "hadoop classpath"... adding Crunch jars and
> > > > > >> >> >>>>> its non-"provided" transitives to the crunchR's java part...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> No legal stuff...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> No readmes... complete stealth at this point.
> > > > >> >> >>>>>
> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template by some time next
> > > > > >> >> >>>>> > week. we can start with prototyping and benchmarking something
> > > > > >> >> >>>>> > really simple, such as parallelDo().
> > > > > >> >> >>>>> >
> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or less simple
> > > > > >> >> >>>>> > algorithm from Mahout and demonstrate it can be solved with
> > > > > >> >> >>>>> > Rcrunch (or whatever name it has to be) in a comparable time
> > > > > >> >> >>>>> > (performance) but with much fewer lines of code. (say one of
> > > > > >> >> >>>>> > factorization or clustering things)
> > > > >> >> >>>>> >
> > > > >> >> >>>>> >
> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <rsharma@xebia.com> wrote:
> > > > > >> >> >>>>> >> I am not much of R user but I am interested to see how well we
> > > > > >> >> >>>>> >> can integrate the two. I would be happy to help.
> > > > >> >> >>>>> >>
> > > > >> >> >>>>> >> regards,
> > > > >> >> >>>>> >> Rahul
> > > > >> >> >>>>> >>
> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > >> >> >>>>> >>>>
> > > > >> >> >>>>> >>>> Yep, ok.
> > > > >> >> >>>>> >>>>
> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven
> > > > > >> >> >>>>> >>>> project with java/R code tree (I have been doing that a lot
> > > > > >> >> >>>>> >>>> lately). Or if you have a template to look at, it would be
> > > > > >> >> >>>>> >>>> useful i guess too.
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>> No, please go right ahead.
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>>
> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com> wrote:
> > > > > >> >> >>>>> >>>>>
> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help.
> > > > > >> >> >>>>> >>>>> Github repo?
> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
> > > > >> >> >>>>> >>>>>
> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on
> > > > > >> >> >>>>> >>>>>> top of Crunch for something simple. This should both save
> > > > > >> >> >>>>> >>>>>> time and prove or disprove if Crunch via RJava integration
> > > > > >> >> >>>>> >>>>>> is viable.
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we
> > > > > >> >> >>>>> >>>>>> can keep it completely separate.
> > > > >> >> >>>>> >>>>>>
> > > > >> >> >>>>> >>>>>> -d
> > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com> wrote:
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the
> > > > > >> >> >>>>> >>>>>>> talk? Was it Murray Stokely?
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
Hello,
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
I was pretty excited to learn of Google's
> > experience
> > > > of R
> > > > >> >> mapping
> > > > >> >> >>>>> of
> > > > >> >> >>>>> >>>>>>>>
flume java on one of recent BARUGs. I think a
> lot
> > of
> > > > >> >> applications
> > > > >> >> >>>>> >>>>>>>>
similar to what we do in Mahout could be
> > prototyped
> > > > using
> > > > >> >> flume R.
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
I did not quite get the details of Google
> > > > implementation
> > > > >> of
> > > > >> >> R
> > > > >> >> >>>>> >>>>>>>>
mapping,
> > > > >> >> >>>>> >>>>>>>>
but i am not sure if just a direct mapping from
> R
> > to
> > > > >> Crunch
> > > > >> >> would
> > > > >> >> >>>>> be
> > > > >> >> >>>>> >>>>>>>>
sufficient (and, for most part, efficient).
> > > RJava/JRI
> > > > and
> > > > >> >> jni
> > > > >> >> >>>>> seem to
> > > > >> >> >>>>> >>>>>>>>
be a pretty terrible performer to do that
> > directly.
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
on top of it, I am thinknig if this project
> could
> > > > have a
> > > > >> >> >>>>> contributed
> > > > >> >> >>>>> >>>>>>>>
adapter to Mahout's distributed matrices, that
> > would
> > > > be
> > > > >> >> just a
> > > > >> >> >>>>> very
> > > > >> >> >>>>> >>>>>>>>
good synergy.
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
Is there anyone interested in
> > contributing/advising
> > > > for
> > > > >> open
> > > > >> >> >>>>> source
> > > > >> >> >>>>> >>>>>>>>
version of flume R support? Just gauging
> interest,
> > > > Crunch
> > > > >> >> list
> > > > >> >> >>>>> seems
> > > > >> >> >>>>> >>>>>>>>
like a natural place to poke.
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
Thanks .
> > > > >> >> >>>>> >>>>>>>>
> > > > >> >> >>>>> >>>>>>>>
-Dmitriy
> > > > >> >> >>>>> >>>>>>>
> > > > >> >> >>>>> >>>>>>>
> > > > >> >> >>>>> >>>>>>>
> > > > >> >> >>>>> >>>>>>> --
> > > > > >> >> >>>>> >>>>>>> Director of Data Science
> > > > > >> >> >>>>> >>>>>>> Cloudera
> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>>
> > > > >> >> >>>>> >>
> > > > >> >> >>>>>
> > > > >> >> >>>>
> > > > >> >> >>>>
> > > > >> >> >>>>
> > > > >> >> >>>> --
> > > > >> >> >>>> Director of Data Science
> > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Director of Data Science
> > > > >> > Cloudera <http://www.cloudera.com>
> > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Director of Data Science
> > > > > Cloudera <http://www.cloudera.com>
> > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > >
> > >
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
