Message-ID: <583355c00802041558k26b25a32ib5b1f396609ea44b@mail.gmail.com>
Date: Tue, 5 Feb 2008 08:58:41 +0900
From: "Khalil Honsali" <k.honsali@gmail.com>
To: core-user@hadoop.apache.org
Subject: Re: Groovy integration details

I am not in a position to decide this, but it'll be clearer if the code is
available (as contrib first?)...

K. Honsali

On 05/02/2008, Ted Dunning wrote:
>
> The system as it stands supports the following major features:
>
> - map-reduce programs can be constructed for interactive, local use or
>   Hadoop-based execution
>
> - map-reduce programs are functions that can be nested and composed
>
> - inputs to map-reduce programs can be strings, lists of strings, local
>   files, or HDFS files
>
> - outputs are stored in HDFS
>
> - outputs can be consumed by multiple other functions
>
> The current minor(ish) limitations include:
>
> - combiners, partition functions, and sorting aren't supported yet
>
> - you can't pass conventional Java Mappers or Reducers to the framework
>
> - only one input file can be given
>
> - the system doesn't clean up after itself
>
> These are all easily addressed and should be fixed over the next week or
> two.
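[Editor's note: the feature "map-reduce programs are functions that can be nested and composed" can be illustrated with a small local simulation. The sketch below is plain Python, not Ted's unreleased Groovy bridge; the `mr` factory and its `out` callback signature are hypothetical stand-ins for the behavior his list describes.]

```python
from collections import defaultdict

def mr(mapper, reducer):
    """Build a map-reduce 'function' from a mapper(key, value, out)
    and a reducer(key, values, out). Runs locally over (key, value)
    pairs; a real bridge would also dispatch to a Hadoop cluster."""
    def run(pairs):
        shuffle = defaultdict(list)
        def map_out(k, v):
            shuffle[k].append(v)          # collect mapper output by key
        for k, v in pairs:
            mapper(k, v, map_out)
        result = {}
        def reduce_out(k, v):
            result[k] = v                 # collect reducer output
        for k, values in shuffle.items():
            reducer(k, values, reduce_out)
        return result
    return run

# word count expressed as one such composable function
word_count = mr(
    lambda key, line, out: [out(w, 1) for w in line.split()],
    lambda word, counts, out: out(word, sum(counts)),
)

counts = word_count(enumerate(["a b a", "b c"]))
# counts maps each word to its total; the dict could in turn be fed
# to another mr(...) function, which is the composition Ted describes
```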
>
> The major limitations include:
>
> - only one script can be specified
>
> - additional jars cannot be submitted
>
> - no explicit group/co-group syntactic sugar is provided
>
> These will take a bit longer to resolve. I hope to incorporate jar-building
> code similar to that used by the streaming system to address most of this.
> The group/co-group support is just a matter of a bit of work.
>
> Pig is very different from this Groovy integration. The Pig team is trying
> to build a new relational-algebra language; I am just trying to write
> map-reduce programs. They explicitly do not want to support general coding
> of functions, except in a very limited way or via integration of Java code,
> while that is my primary goal. The other big difference is that my system
> is simple enough that I was able to implement it with a week of coding
> (after a few weeks of noodling about how to make it possible at all).
>
>
> On 2/4/08 3:28 PM, "Khalil Honsali" wrote:
>
> > sorry for the unclarity,
> >
> > - I think I understand that Groovy is already usable and stable, but
> > requires some testing? What other things are required?
> > - what is the next step, i.e., the roadmap if any? What evolution /
> > growth direction?
> > - I haven't tried Pig, but it also seems to support submitting a
> > function to be transformed to map/reduce, though Pig is higher level?
> >
> > PS:
> > - maybe Groovy requires another mailing-list thread ...
> >
> > K. Honsali
> >
> >
> > On 05/02/2008, Ted Dunning wrote:
> >>
> >>
> >> Did you mean who, what, when, where and how?
> >>
> >> Who is me. I am the only author so far.
> >>
> >> What is a groovy/java program that supports running groovy/hadoop
> >> scripts.
> >>
> >> When is nearly now.
> >>
> >> Where is everywhere (this is the internet).
> >>
> >> How is an open question. I think that Doug's suggested evolution of
> >> Jira with patches -> contrib -> sub-project is appropriate.
> >>
> >>
> >> On 2/4/08 2:59 PM, "Khalil Honsali" wrote:
> >>
> >>> Hi all, Mr. Dunning;
> >>>
> >>> I am interested in the Groovy idea, especially for processing text; I
> >>> think it can be a good open-source alternative to Google's Sawzall.
> >>>
> >>> Please let me know the 5 Ws of the matter if possible.
> >>>
> >>> K. Honsali
> >>>
> >>> On 05/02/2008, Miles Osborne wrote:
> >>>>
> >>>> sorry, I meant Groovy
> >>>>
> >>>> Miles
> >>>>
> >>>> On 04/02/2008, Tarandeep Singh wrote:
> >>>>>
> >>>>> On Feb 4, 2008 2:40 PM, Miles Osborne wrote:
> >>>>>> How stable is the code? I could quite easily set some undergraduate
> >>>>>> project to do something with it, for example process query logs
> >>>>>>
> >>>>>
> >>>>> I started learning and using Hadoop a few days back. The program
> >>>>> that I have is similar to word count, except that it processes a
> >>>>> query log in a special format. I have another program that reads the
> >>>>> output of this program and computes the top N keywords. I want to
> >>>>> make it one program (a single map-reduce).
> >>>>>
> >>>>> -Taran
> >>>>>
> >>>>>> Miles
> >>>>>>
> >>>>>>
> >>>>>> On 04/02/2008, Ted Dunning wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> This is a great opportunity for me to talk about the Groovy
> >>>>>>> support that I have just gotten running. I am looking for friendly
> >>>>>>> testers, as this code is definitely not ready for full release.
> >>>>>>>
> >>>>>>> The program you need in groovy is this:
> >>>>>>>
> >>>>>>> // define the map-reduce function by specifying map and reduce
> >>>>>>> // functions
> >>>>>>> logCount = Hadoop.mr(
> >>>>>>>     {key, value, out, report -> out.collect(value.split()[0], 1)},
> >>>>>>>     {keyword, counts, out, report ->
> >>>>>>>         sum = 0
> >>>>>>>         counts.each { sum += it }
> >>>>>>>         out.collect(keyword, sum)
> >>>>>>>     })
> >>>>>>>
> >>>>>>> // apply the function to an input file and collect the results
> >>>>>>> // in a map
> >>>>>>> results = [:]
> >>>>>>> logCount(inputFileEitherLocallyOnHDFS).eachLine { line ->
> >>>>>>>     parts = line.split("\t")
> >>>>>>>     results[parts[0]] = parts[1].toInteger()
> >>>>>>> }
> >>>>>>>
> >>>>>>> // sort the entries in the map by descending count and print
> >>>>>>> // the results
> >>>>>>> for (x in results.entrySet().sort( {-it.value} )) {
> >>>>>>>     println x
> >>>>>>> }
> >>>>>>>
> >>>>>>> // delete the temporary results
> >>>>>>> Hadoop.cleanup(results)
> >>>>>>>
> >>>>>>> The important points here are:
> >>>>>>>
> >>>>>>> 1) the groovy binding lets you express the map-reduce part of your
> >>>>>>> program simply.
> >>>>>>>
> >>>>>>> 2) collecting the results is trivial ... You don't have to worry
> >>>>>>> about where or how the results are kept. You would use the same
> >>>>>>> code to read a local file as to read the results of the map-reduce
> >>>>>>> computation.
> >>>>>>>
> >>>>>>> 3) because of (2), you can do some computation locally (the sort)
> >>>>>>> and some in parallel (the counting). You could easily translate
> >>>>>>> the sort to a hadoop call as well.
> >>>>>>>
> >>>>>>> I know that this doesn't quite answer the question because my
> >>>>>>> groovy-hadoop bridge isn't available yet, but it hopefully will
> >>>>>>> spark some interest.
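[Editor's note: since the groovy-hadoop bridge (`Hadoop.mr`, `Hadoop.cleanup`) was unreleased, here is a plain-Python simulation of the same flow — count the first field of each log line, then sort descending locally. The tab-separated sample log is invented for illustration.]

```python
from collections import defaultdict

def log_count(lines):
    """Simulate the logCount map-reduce: map emits (keyword, 1) from the
    first tab-separated field; reduce sums the 1s per keyword."""
    counts = defaultdict(int)
    for line in lines:
        keyword = line.split("\t")[0]   # map phase: extract the keyword
        counts[keyword] += 1            # reduce phase: sum per keyword
    return dict(counts)

# a made-up log in 'keyword<TAB>source<TAB>dateId' form
log = [
    "hadoop\tsiteA\t20080204",
    "groovy\tsiteB\t20080204",
    "hadoop\tsiteC\t20080205",
]
results = log_count(log)

# local post-processing, as in the Groovy script: descending by count
for keyword, n in sorted(results.items(), key=lambda kv: -kv[1]):
    print(keyword, n)
```

The point being illustrated is Ted's item (3): the counting is the part that would run on the cluster, while the sort-and-print stays local.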
> >>>>>>>
> >>>>>>> The question I would like to pose to the community is this:
> >>>>>>>
> >>>>>>> What is the best way to proceed with code like this that is not
> >>>>>>> ready for prime time, but is ready for others to contribute to and
> >>>>>>> possibly also use? Should I follow the Jaql and Cascading course
> >>>>>>> and build a separate repository and web site, or should I try to
> >>>>>>> add this as a contrib package like streaming? Or should I just
> >>>>>>> hand out source by hand for a little while to get feedback?
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Can someone guide me on how to write a program using the hadoop
> >>>>>>>> framework that analyzes the log files and finds the most
> >>>>>>>> frequently occurring keywords? The log file has the format -
> >>>>>>>>
> >>>>>>>> keyword source dateId
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Tarandeep
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> The University of Edinburgh is a charitable body, registered in
> >>>>>> Scotland, with registration number SC005336.
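[Editor's note: Tarandeep's goal of folding his two jobs (count, then top-N) into one program can be sketched locally as follows. This is illustrative Python, not a Hadoop job; in a real single map-reduce, the count would happen in the reducer and the top-N selection in a final single reducer or in the driver. The sample log lines are invented.]

```python
from collections import Counter

def top_keywords(log_lines, n):
    """Count the first whitespace-separated field of each
    'keyword source dateId' line and return the n most frequent."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return counts.most_common(n)

log = [
    "hadoop siteA 20080204",
    "groovy siteB 20080204",
    "hadoop siteC 20080205",
    "pig    siteA 20080205",
]
top = top_keywords(log, 2)   # 'hadoop' comes first with count 2
```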