Date: Mon, 29 Oct 2012 17:46:22 -0700
Subject: Re: Flume R -- any interest?
From: Dmitriy Lyubimov
To: crunch-dev@incubator.apache.org

Great! So it is in Crunch. Does it support hadoop-job jar format or only pure java jars?

On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills wrote:

> On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov wrote:
>
>> I think i need functionality to add more jars (or external hadoop-jar)
>> to drive that from an R package. Just setting job jar by class is not
>> enough. I can push overall job-jar as an additional jar to R package;
>> however, i cannot really run hadoop command line on it, i need to set
>> up classpath thru RJava.
>>
>> Traditional single hadoop job jar will unlikely work here since we
>> cannot hardcode pipelines in java code but rather have to construct
>> them on the fly. (well, we could serialize pipeline definitions from R
>> and then replay them in a driver -- but that's too cumbersome and more
>> work than it has to be.) There's no reason why i shouldn't be able to
>> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
>> off a pipeline.
>>
>
> o.a.c.util.DistCache.addJarToDistributedCache?
>
>>
>> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov wrote:
>> > Ok, sounds very promising...
>> >
>> > i'll try to start digging on the driver part this week then (Pipeline
>> > wrapper in R5).
>> >
>> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills wrote:
>> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov wrote:
>> >>> Ok, cool.
>> >>>
>> >>> So what state is Crunch in? I take it is in a fairly advanced state.
>> >>> So every api mentioned in the FlumeJava paper is working, right? Or
>> >>> there's something that is not working specifically?
>> >>
>> >> I think the only thing in the paper that we don't have in a working
>> >> state is MSCR fusion. It's mostly just a question of prioritizing it
>> >> and getting the work done.
>> >>
>> >>>
>> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills wrote:
>> >>>> Hey Dmitriy,
>> >>>>
>> >>>> Got a fork going and looking forward to playing with crunchR this weekend--
>> >>>> thanks!
>> >>>>
>> >>>> J
>> >>>>
>> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov wrote:
>> >>>>
>> >>>>> Project template https://github.com/dlyubimov/crunchR
>> >>>>>
>> >>>>> Default profile does not compile R artifact. R profile compiles R
>> >>>>> artifact. for convenience, it is enabled by supplying -DR to mvn
>> >>>>> command line, e.g.
>> >>>>>
>> >>>>> mvn install -DR
>> >>>>>
>> >>>>> there's also a helper that installs the snapshot version of the
>> >>>>> package in the crunchR module.
>> >>>>>
>> >>>>> There's RJava and JRI java dependencies which i did not find anywhere
>> >>>>> in public maven repos; so it is installed into my github maven repo so
>> >>>>> far. Should compile for 3rd party.
>> >>>>>
>> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
>> >>>>> compilation requires roxygen2 (i think).
>> >>>>>
>> >>>>> For some reason RProtoBuf fails to import into another package, got a
>> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
>> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
>> >>>>> problem though...
>> >>>>>
>> >>>>> other than the template, not much else has been done so far... finding
>> >>>>> hadoop libraries and adding it to the package path on initialization
>> >>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
>> >>>>> transitives to the crunchR's java part...
>> >>>>>
>> >>>>> No legal stuff...
>> >>>>>
>> >>>>> No readmes... complete stealth at this point.
>> >>>>>
>> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> >>>>> wrote:
>> >>>>> > Ok, cool. I will try to roll project template by some time next week.
>> >>>>> > we can start with prototyping and benchmarking something really
>> >>>>> > simple, such as parallelDo().
>> >>>>> >
>> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
>> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
>> >>>>> > name it has to be) in a comparable time (performance) but with much
>> >>>>> > fewer lines of code. (say one of factorization or clustering things)
>> >>>>> >
>> >>>>> >
>> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul wrote:
>> >>>>> >> I am not much of R user but I am interested to see how well we can
>> >>>>> >> integrate the two. I would be happy to help.
>> >>>>> >>
>> >>>>> >> regards,
>> >>>>> >> Rahul
>> >>>>> >>
>> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> >>>>> >>>
>> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> >>>>> >>> wrote:
>> >>>>> >>>>
>> >>>>> >>>> Yep, ok.
>> >>>>> >>>>
>> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
>> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
>> >>>>> >>>> have a template to look at, it would be useful i guess too.
>> >>>>> >>>
>> >>>>> >>> No, please go right ahead.
>> >>>>> >>>
>> >>>>> >>>>
>> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <josh.wills@gmail.com>
>> >>>>> >>>> wrote:
>> >>>>> >>>>>
>> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
>> >>>>> >>>>> repo?
>> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" wrote:
>> >>>>> >>>>>
>> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
>> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
>> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
>> >>>>> >>>>>>
>> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
>> >>>>> >>>>>> it completely separate.
>> >>>>> >>>>>>
>> >>>>> >>>>>> -d
>> >>>>> >>>>>>
>> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <jwills@cloudera.com>
>> >>>>> >>>>>> wrote:
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
>> >>>>> >>>>>>> Murray Stokely?
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> >>>>> >>>>>>> wrote:
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Hello,
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
>> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot of applications
>> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> I did not quite get the details of Google implementation of R
>> >>>>> >>>>>>>> mapping,
>> >>>>> >>>>>>>> but i am not sure if just a direct mapping from R to Crunch would be
>> >>>>> >>>>>>>> sufficient (and, for most part, efficient). RJava/JRI and jni seem to
>> >>>>> >>>>>>>> be a pretty terrible performer to do that directly.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
>> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
>> >>>>> >>>>>>>> good synergy.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
>> >>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
>> >>>>> >>>>>>>> like a natural place to poke.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Thanks.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> -Dmitriy
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> --
>> >>>>> >>>>>>> Director of Data Science
>> >>>>> >>>>>>> Cloudera
>> >>>>> >>>>>>> Twitter: @josh_wills
>> >>>>
>> >>>> --
>> >>>> Director of Data Science
>> >>>> Cloudera
>> >>>> Twitter: @josh_wills
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
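
The thread mentions finding the Hadoop libraries via "hadoop classpath" and adding them to the package's Java classpath on initialization. A minimal sketch of that step in plain Java follows. It is not taken from crunchR or Crunch: the class name HadoopClasspath and both methods are hypothetical, and it assumes a Unix-style, colon-separated classpath string whose "dir/*" wildcard entries have to be expanded into individual jar files before being handed to a class loader.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of crunchR or Crunch). */
final class HadoopClasspath {

    private HadoopClasspath() {
    }

    /**
     * Splits the text printed by `hadoop classpath` into individual entries.
     * Assumes ':' as the separator (Unix-like systems) and drops blank entries.
     */
    static List<String> parse(String output) {
        List<String> entries = new ArrayList<>();
        for (String raw : output.trim().split(":")) {
            String entry = raw.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /**
     * Expands a wildcard entry such as "/usr/lib/hadoop/lib/*" into the jar
     * files found directly in that directory, sorted by path. Any other entry
     * is returned unchanged as a single-element list.
     */
    static List<String> expand(String entry) {
        List<String> result = new ArrayList<>();
        if (!entry.endsWith("/*")) {
            result.add(entry);
            return result;
        }
        File dir = new File(entry.substring(0, entry.length() - 2));
        File[] files = dir.listFiles();
        if (files == null) {
            return result; // directory is missing or unreadable
        }
        Arrays.sort(files);
        for (File f : files) {
            if (f.isFile() && f.getName().toLowerCase().endsWith(".jar")) {
                result.add(f.getPath());
            }
        }
        return result;
    }
}
```

On the R side, the resulting paths could then be added to the JVM classpath one by one, for example through rJava's .jaddClassPath, though how crunchR actually does this is not stated in the thread.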