Mailing-List: contact crunch-dev-help@incubator.apache.org; run by ezmlm
Reply-To: crunch-dev@incubator.apache.org
MIME-Version: 1.0
Date: Sun, 18 Nov 2012 11:34:06 -0800
Subject: Re: Crunch R first milestone
From: Dmitriy Lyubimov
To: crunch-dev@incubator.apache.org
Content-Type: multipart/alternative; boundary=f46d042ef4a586a88704ceca1512

--f46d042ef4a586a88704ceca1512
Content-Type: text/plain; charset=ISO-8859-1

oh, it is already in the central.. cool

On Sun, Nov 18, 2012 at 11:28 AM, Dmitriy Lyubimov wrote:

> ah, indeed i see it there, ok i will add it to pom repos, thanks
> On Nov 18, 2012 11:20 AM, "Josh Wills" wrote:
>
>> On Sun, Nov 18, 2012 at 11:08 AM, Dmitriy Lyubimov wrote:
>>
>> > Question: is the Crunch 0.4.0 release available thru a maven repository?
>> > How have you installed it into your local repo?
>> >
>>
>> It should be-- I think Matthias published the Maven artifacts on Friday. Of
>> course, I might have just had it installed locally b/c I was testing the
>> release. :)
>>
>> >
>> > On Sun, Nov 18, 2012 at 10:30 AM, Josh Wills wrote:
>> >
>> > > On Sun, Nov 18, 2012 at 10:13 AM, Dmitriy Lyubimov wrote:
>> > > > Thank you, Josh. Your insights are greatly appreciated.
>> > > >
>> > > > RProtoBuf has a bug with <<- operator. I already contacted the authors
>> > > > and they confirmed it however it is not clear when they are going to
>> > > > fix it.
>> > > >
>> > > > (code to reproduce:
>> > > >> library(RProtoBuf)
>> > > >> a <<- "A"
>> > > > causes an error)
>> > > >
>> > > > Actually RProtoBuf is not used right now. I will move it into
>> > > > "recommended" realm again if it makes things easier.
>> > > >
>> > > > For me, the hardest part was actually getting the JVM + Hadoop to see
>> > > > the JRI library. I am still not sure about the best course of action
>> > > > here, but there is definitely more than one way.
>> > > >
>> > > > Also, my apologies for the code styling; it is probably the ugliest
>> > > > code I've ever written, but I will tidy it up once past the
>> > > > proof-of-concept stage.
>> > >
>> > > No judgements, man. You should have seen the first rev of Crunch. ;-)
>> > >
>> > > >
>> > > > -d
>> > > >
>> > > >
>> > > > On Sun, Nov 18, 2012 at 9:37 AM, Josh Wills wrote:
>> > > >
>> > > >> Dmitriy,
>> > > >>
>> > > >> Just sent you a pull request based on playing with the code on OS X.
>> > > >> It contains a README about my experience getting things working.
>> > > >>
>> > > >> Unfortunately, I haven't succeeded in getting crunchR loaded; I'm
>> > > >> running into some issues w/RProtoBuf on OS X. I'll give it another go
>> > > >> this week on my Linux machine at work.
>> > > >>
>> > > >> J
>> > > >>
>> > > >>
>> > > >> On Sat, Nov 17, 2012 at 12:49 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> > > >>
>> > > >> > Josh,
>> > > >> >
>> > > >> > ok the following commit
>> > > >> >
>> > > >> > ==============
>> > > >> > commit 67605360838f810fa5ddf99abb3ef2962d3f05e3
>> > > >> > Author: Dmitriy Lyubimov
>> > > >> > Date: Sat Nov 17 12:29:27 2012 -0800
>> > > >> >
>> > > >> > example1 succeeds
>> > > >> >
>> > > >> > ====================
>> > > >> >
>> > > >> > runs example 1 for me successfully in a fully distributed way, which
>> > > >> > is the first step (the map-only part) of the word count.
>> > > >> >
>> > > >> > (I think there's a hiccup somewhere here because in the output I
>> > > >> > also seem to see some empty lines, so the strsplit() part is perhaps
>> > > >> > set up somewhat incorrectly, but that's not the point right now):
>> > > >> >
>> > > >> > ====Example1.R===========
>> > > >> >
>> > > >> > library(crunchR)
>> > > >> >
>> > > >> > pipeline <- crunchR.MRPipeline$new("test-pipeline")
>> > > >> >
>> > > >> > inputPCol <- pipeline$readTextFile("/crunchr-examples/input")
>> > > >> >
>> > > >> > outputPCol <- inputPCol$parallelDo(
>> > > >> > function(line) emit( strsplit(tolower(line),"[^[:alnum:]]")[[1]] )
>> > > >> > )
>> > > >> >
>> > > >> > outputPCol$writeTextFile("/crunchr-examples/output")
>> > > >> >
>> > > >> > result <- pipeline$run()
>> > > >> >
>> > > >> > if ( !result$succeeded() ) stop ("pipeline failed.")
>> > > >> >
>> > > >> > ========================================
>> > > >> >
>> > > >> > I think R-Java communication should now support multiple DoFns, and
>> > > >> > they will be properly executed, synchronized, and shut down even if
>> > > >> > they emit in the cleanup phase.
>> > > >> >
>> > > >> > This example assumes a lot of defaults (such as RTypes, which by
>> > > >> > default are character vector singleton in and character vector out
>> > > >> > for a DoFn). It also obviously uses text in, text out only at this
>> > > >> > point.
>> > > >> >
>> > > >> >
>> > > >> > To run, install the package and upload the test input (test-prep.sh).
>> > > >> > Assuming you have compiled the maven part, the R package snapshot
>> > > >> > can be installed by running "install-snapshot-rpkg.sh".
>> > > >> >
>> > > >> > You also need to make sure your backend tasks see the JRI library.
>> > > >> > There are multiple ways to do it, I guess, but for the purposes of
>> > > >> > testing the following just works for me in my mapred-site:
>> > > >> >
>> > > >> > <property>
>> > > >> >   <name>mapred.child.java.opts</name>
>> > > >> >   <value>-Djava.library.path=/home/dmitriy/R/x86_64-pc-linux-gnu-library/2/rJava/jri</value>
>> > > >> >   <final>false</final>
>> > > >> > </property>
>> > > >> >
>> > > >> > I think at this point you guys might help me by reviewing that
>> > > >> > stuff, asking questions, and making suggestions on how to go about
>> > > >> > incorporating other types of DoFn and perhaps a way to complete the
>> > > >> > word count example, perhaps running comparative benchmarks against a
>> > > >> > java-only word count to see how much overhead we seem to be
>> > > >> > suffering here.
>> > > >> >
>> > > >> > I use StatET in Eclipse. Although it is a huge step forward, the
>> > > >> > process is still extremely tedious since I don't know the unit
>> > > >> > testing frameworks in R well (so I just scribble some stuff on the
>> > > >> > side to unit-test this and that) and the integration test cycle is
>> > > >> > long enough to matter.
>> > > >> >
>> > > >> > Which is why any help and suggestions are very welcome!
>> > > >> >
>> > > >> > I will definitely add support for reading/writing sequence files and
>> > > >> > Protobufs, as well as Mahout DRMs.
>> > > >> >
>> > > >> >
>> > > >> > Thanks.
>> > > >> > -Dmitriy
>> > > >> >
>> > > >>
>> > >
>> > >
>> > >
>> > > --
>> > > Director of Data Science
>> > > Cloudera
>> > > Twitter: @josh_wills
>> > >
>> >
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills
>

--f46d042ef4a586a88704ceca1512--
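
On the empty output lines mentioned for Example1.R: one likely cause is that strsplit() with the single-character pattern "[^[:alnum:]]" yields an empty string between any two adjacent separators (for example a comma followed by a space). The base-R sketch below shows the effect and one possible fix. The helper name `tokenize` is hypothetical and not part of crunchR; whether this accounts for every empty line in the actual job output is an assumption.

```r
# Splitting on a single non-alphanumeric character leaves an empty token
# between adjacent separators such as ", ".
strsplit(tolower("Hello, world"), "[^[:alnum:]]")[[1]]
# [1] "hello" ""      "world"

# Hypothetical helper (not part of crunchR): split on runs of separators,
# then drop any remaining empty token (e.g. one caused by a leading separator).
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]
}

tokenize("Hello, world")
# [1] "hello" "world"
```

In the pipeline, the anonymous function passed to parallelDo could emit the result of such a helper instead of the raw strsplit() output.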