Date: Sat, 24 Nov 2012 10:54:49 -0800
Subject: Re: Crunch R first milestone
From: Dmitriy Lyubimov
To: crunch-dev@incubator.apache.org

On Sat, Nov 24, 2012 at 10:29 AM, Josh Wills wrote:

> Hey Dmitriy,
>
> I'm up and running w/Example1.R on my Linux machine-- very cool! My Mac is
> having some sort of issue w/creating /tmp/crunch* directories that I need
> to sort out.
>
> In the example you sent of the broken chaining of DoFns, why didn't the
> first line (quoted below) require a PType?

Because the implementation assumes a default type, which is a character
vector, as shown below. Also, if it detects that a key type was specified
explicitly, it automatically returns a PTable instead of a PCollection.
Further, a PTable's emits automatically assume an emit(key,value) invocation
for conciseness of notation (instead of Java's Pair.of(key,value)), and
PCollections assume just emit(value).

parallelDo = function ( FUN_PROCESS, FUN_INITIALIZE=NULL, FUN_CLEANUP=NULL,
    valueType=crunchR.RStrings$new(), keyType) {
  if (missing(keyType)) {
    # no key type given: the result is a PCollection
    .parallelDo.PCollection(FUN_PROCESS,FUN_INITIALIZE,FUN_CLEANUP,valueType)
  } else {
    # explicit key type: the result is a PTable
    .parallelDo.PTable(FUN_PROCESS,FUN_INITIALIZE,FUN_CLEANUP,keyType,valueType)
  }
},

> Is there a shortcut for the case
> when the PType of the child is the same as the PType of the parent?
>

er... no. It kind of always assumes RStrings (which assumes PType, but the
corresponding R type is multi-emit, i.e.
you can emit a vector once and internally it will translate into a bunch of
calls of emit(String)). This is a notion that I made specifically for R, since
R operates with vectors and lists, so I can emit just one vector type and
declare it a multi-emit. It is not clear to me whether this notion will have
a benefit. Obviously, you can still emit an R character vector as a single
value, too, but you would have to select a different RType there to signal
your intent.

Word count is a good example where a multi-emit RType serves you well: you
output the result of split[[1]], which is a character vector, as one R call
emit(split...), but it translates into a bunch of individual emits (the
variant I had before this last one with PTable, or the commented-out one
here):

> # wordsPCol <- inputPCol$parallelDo(
> #   function(line) emit( strsplit(tolower(line),"[^[:alnum:]]+")[[1]] )
> # )
>
> Josh
>
>
>
> On Fri, Nov 23, 2012 at 1:59 PM, Dmitriy Lyubimov wrote:
>
> > ok, support for PTable emission of (key,value) pairs works in the latest
> > commit.
> >
> > My current problem is that composition of doFunctions doesn't work,
> > probably because of the sequence of cleanup() calls.
> > I have to figure out:
> >
> > =============
> > this composition of 2 functions (PCollection, PTable) is a problem
> >
> > # wordsPCol <- inputPCol$parallelDo(
> > #   function(line) emit( strsplit(tolower(line),"[^[:alnum:]]+")[[1]] )
> > # )
> > #
> > # wordsPTab <- wordsPCol$parallelDo(function(word) emit(word,1),
> > #   keyType = crunchR.RString$new(),
> > #   valueType = crunchR.RUint32$new())
> >
> > but this equivalent works:
> > wordsPTab <- inputPCol$parallelDo(
> >   function(line) {
> >     words <- strsplit(tolower(line),"[^[:alnum:]]+")[[1]]
> >     sapply(words, function(x) emit(x,1))
> >   },
> >   keyType = crunchR.RString$new(),
> >   valueType = crunchR.RUint32$new()
> > )
> >
> >
> >
> > On Thu, Nov 22, 2012 at 2:13 PM, Dmitriy Lyubimov wrote:
> >
> > > Ok, I guess I am going to work on the next milestone, which is
> > > PTableType serialization support between the R and Java sides.
> > >
> > > Once I am done with that, I guess I will be able to add other API and
> > > complete the word count example fairly easily.
> > >
> > > Example1.R in its current state works.
> > >
> > >
> > > On Wed, Nov 21, 2012 at 12:11 PM, Josh Wills wrote:
> > >
> > >> I'm going to play with this again over the break-- BTW, did you see
> > >> Renjin?
> > >> I somehow missed this, but it looks interesting.
> > >>
> > >> http://code.google.com/p/renjin/
> > >>
> > >>
> > >> On Sun, Nov 18, 2012 at 11:44 AM, Dmitriy Lyubimov wrote:
> > >>
> > >> > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills wrote:
> > >> >
> > >> > > Dmitrity,
> > >> > >
> > >> > > Just sent you a pull request based on playing with the code on OS X.
> > >> > > It contains a README about my experience getting things working.
> > >> > >
> > >> >
> > >> > Are you sure it is the doxygen package? I thought it was the roxygen2
> > >> > package?
> > >> >
> > >> > Actually there currently seems to be no best practice for R5 classes
> > >> > + roxygen2 (and the guy ignores @import order of files, too).
> > >> > Hence the hacks with file names.
> > >> >
> > >> >
> > >> > > Unfortunately, I haven't succeeded in getting crunchR loaded, I'm
> > >> > > running into some issues w/RProtoBuf on OS X. I'll give it another
> > >> > > go this week on my Linux machine at work.
> > >> > >
> > >> > ok, I removed @import RProtoBuf, you should be able to install w/o it.
> > >> > Maven still compiles protoc stuff though.
> > >> >
> > >> > >
> > >> > > J
> > >> > >
> > >> > >
> > >> > > On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov
> > >> > > <dlieu.7@gmail.com> wrote:
> > >> > >
> > >> > > > Josh,
> > >> > > >
> > >> > > > ok the following commit
> > >> > > >
> > >> > > > ==============
> > >> > > > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
> > >> > > > Author: Dmitriy Lyubimov
> > >> > > > Date:   Sat Nov 17 12:29:27 2012 -0800
> > >> > > >
> > >> > > >     example1 succeeds
> > >> > > >
> > >> > > > ====================
> > >> > > >
> > >> > > > runs example 1 for me successfully in a fully distributed way,
> > >> > > > which is the first step (map-only thing) for the word count.
> > >> > > >
> > >> > > > (I think there's a hiccup somewhere here because in the output I
> > >> > > > also seem to see some empty lines, so the strsplit() part is
> > >> > > > perhaps set up somewhat incorrectly here, but it's not the point
> > >> > > > right now):
> > >> > > >
> > >> > > > ====Example1.R===========
> > >> > > >
> > >> > > > library(crunchR)
> > >> > > >
> > >> > > > pipeline <- crunchR.MRPipeline$new("test-pipeline")
> > >> > > >
> > >> > > > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
> > >> > > >
> > >> > > > outputPCol <- inputPCol$parallelDo(
> > >> > > >   function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
> > >> > > > )
> > >> > > >
> > >> > > > outputPCol$writeTextFile("/crunchr-examples/output")
> > >> > > >
> > >> > > > result <- pipeline$run()
> > >> > > >
> > >> > > > if ( !result$succeeded() ) stop ("pipeline failed.")
> > >> > > >
> > >> > > > ========================================
> > >> > > >
> > >> > > > I think R-Java communication should now support multiple doFns,
> > >> > > > and they will be properly executed, synchronized and shut down
> > >> > > > even if they emit in the cleanup phase.
> > >> > > >
> > >> > > > This example assumes a lot of defaults (such as RTypes, which are
> > >> > > > by default character vector singleton in and character vector out
> > >> > > > for a DoFn). It also obviously uses text in, text out only at
> > >> > > > this point.
> > >> > > >
> > >> > > >
> > >> > > > To run, install the package and upload the test input
> > >> > > > (test-prep.sh). Assuming you have compiled the Maven part, the R
> > >> > > > package snapshot can be installed by running
> > >> > > > "install-snapshot-rpkg.sh".
> > >> > > >
> > >> > > > You also need to make sure your backend tasks see the JRI library.
> > >> > > > There are multiple ways to do it I guess, but for the purposes of
> > >> > > > testing the following just works for me in my mapred-site:
> > >> > > >
> > >> > > > <property>
> > >> > > >   <name>mapred.child.java.opts</name>
> > >> > > >   <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
> > >> > > >   <final>false</final>
> > >> > > > </property>
> > >> > > >
> > >> > > > I think at this point you guys might help me by reviewing that
> > >> > > > stuff, asking questions and making suggestions on how to go about
> > >> > > > incorporating other types of doFn, and perhaps a way to complete
> > >> > > > the word count example, perhaps running comparative benchmarks
> > >> > > > against a Java-only word count to see how much overhead we seem
> > >> > > > to be suffering here.
> > >> > > >
> > >> > > > I use StatET in Eclipse. Although it is a huge step forward, the
> > >> > > > process is still extremely tedious since I don't know the unit
> > >> > > > testing framework in R well (so I just scribble some stuff on the
> > >> > > > side to unit-test this and that) and the integration test running
> > >> > > > cycle is significant enough.
> > >> > > >
> > >> > > > Which is why any help and suggestions are very welcome!
> > >> > > >
> > >> > > > I will definitely add support for reading/writing sequence files
> > >> > > > and Protobufs, as well as Mahout DRMs.
> > >> > > >
> > >> > > >
> > >> > > > Thanks.
> > >> > > > -Dmitrity
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Director of Data Science
> > >> Cloudera
> > >> Twitter: @josh_wills
> > >>
> > >
> >
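
For anyone who wants to poke at the two conventions described at the top of this reply without a cluster, here is a minimal base-R sketch: dispatch on missing(keyType), and multi-emit of a character vector. Everything in it (parallel_do_sketch, emit_pair, emit_multi, the in-memory input) is a hypothetical stand-in made up for illustration, not crunchR code or its API. In particular, how emit() is made visible to the user function here is an assumption.

```r
# Hypothetical in-memory stand-in for the parallelDo dispatch in this thread.
# Not crunchR code: no Hadoop, no RTypes, just the two emit conventions.
parallel_do_sketch <- function(input, FUN_PROCESS, keyType) {
  out <- list()

  # Table-style emit: one (key, value) pair per call.
  emit_pair <- function(key, value) {
    out[[length(out) + 1]] <<- list(key = key, value = value)
    invisible(NULL)
  }

  # Collection-style multi-emit: each vector element becomes its own record.
  emit_multi <- function(values) {
    for (v in values) out[[length(out) + 1]] <<- v
    invisible(NULL)
  }

  # Like the quoted parallelDo, pick the flavor by whether keyType was given.
  emit <- if (missing(keyType)) emit_multi else emit_pair

  # Make the user function see this emit() (assumption: the real package
  # does its own wiring, which is not shown in the thread).
  environment(FUN_PROCESS) <- list2env(list(emit = emit),
                                       parent = environment(FUN_PROCESS))
  for (record in input) FUN_PROCESS(record)
  out
}

lines <- c("Hello, world", "hello again")

# No keyType: multi-emit yields one record per word.
words <- parallel_do_sketch(lines, function(line)
  emit(strsplit(tolower(line), "[^[:alnum:]]+")[[1]]))

# keyType supplied (any value works in this sketch): (word, 1) pairs.
pairs <- parallel_do_sketch(unlist(words), function(word) emit(word, 1),
                            keyType = "string")
```

The second call chains on the output of the first, which is the shape of the composition that currently fails in the real pipeline. The sketch says nothing about why that fails, since it has no cleanup() phase at all.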