Message-ID: <583355c00802041558k26b25a32ib5b1f396609ea44b@mail.gmail.com>
Date: Tue, 5 Feb 2008 08:58:41 +0900
From: "Khalil Honsali" <k.honsali@gmail.com>
To: core-user@hadoop.apache.org
Subject: Re: Groovy integration details

I am not in a position to decide this, but it'll be clearer if the code is
available (as contrib first?)...

K. Honsali

On 05/02/2008, Ted Dunning wrote:
>
> The system as it stands supports the following major features:
>
> - map-reduce programs can be constructed for interactive, local use or
>   Hadoop-based execution
>
> - map-reduce programs are functions that can be nested and composed
>
> - inputs to map-reduce programs can be strings, lists of strings, local
>   files, or HDFS files
>
> - outputs are stored in HDFS
>
> - outputs can be consumed by multiple other functions
>
> The current minor(ish) limitations include:
>
> - combiners, partition functions, and sorting aren't supported yet
>
> - you can't pass conventional Java Mappers or Reducers to the framework
>
> - only one input file can be given
>
> - the system doesn't clean up after itself
>
> These are all easily addressed and should be fixed over the next week or
> two.
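[Editor's note: the feature "map-reduce programs are functions that can be nested and composed" can be illustrated with a small local simulation. The sketch below is plain Python, not Ted's unreleased Groovy bridge; the `mr` factory and its `out` callback signature are hypothetical stand-ins for the behavior his list describes.]

```python
from collections import defaultdict

def mr(mapper, reducer):
    """Build a map-reduce 'function' from a mapper(key, value, out)
    and a reducer(key, values, out). Runs locally over (key, value)
    pairs; a real bridge would also dispatch to a Hadoop cluster."""
    def run(pairs):
        shuffle = defaultdict(list)
        def map_out(k, v):
            shuffle[k].append(v)          # collect mapper output by key
        for k, v in pairs:
            mapper(k, v, map_out)
        result = {}
        def reduce_out(k, v):
            result[k] = v                 # collect reducer output
        for k, values in shuffle.items():
            reducer(k, values, reduce_out)
        return result
    return run

# word count expressed as one such composable function
word_count = mr(
    lambda key, line, out: [out(w, 1) for w in line.split()],
    lambda word, counts, out: out(word, sum(counts)),
)

counts = word_count(enumerate(["a b a", "b c"]))
# counts maps each word to its total; the dict could in turn be fed
# to another mr(...) function, which is the composition Ted describes
```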
>
> The major limitations include:
>
> - only one script can be specified
>
> - additional jars cannot be submitted
>
> - no explicit group/co-group syntactic sugar is provided
>
> These will take a bit longer to resolve. I hope to incorporate jar-building
> code similar to that used by the streaming system to address most of this.
> The group/co-group support is just a matter of a bit of work.
>
> Pig is very different from this Groovy integration. The Pig team is trying
> to build a new relational-algebra language; I am just trying to write
> map-reduce programs. They explicitly do not want to support general coding
> of functions, except in a very limited way or via integration of Java code,
> while that is my primary goal. The other big difference is that my system
> is simple enough that I was able to implement it with a week of coding
> (after a few weeks of noodling about how to make it possible at all).
>
>
> On 2/4/08 3:28 PM, "Khalil Honsali" wrote:
>
> > sorry for the unclarity,
> >
> > - I think I understand that Groovy is already usable and stable, but
> > requires some testing? What other things are required?
> > - what is the next step, i.e., the roadmap if any? What evolution /
> > growth direction?
> > - I haven't tried Pig, but it also seems to support submitting a
> > function to be transformed to map/reduce, though Pig is higher level?
> >
> > PS:
> > - maybe Groovy requires another mailing-list thread ...
> >
> > K. Honsali
> >
> >
> > On 05/02/2008, Ted Dunning wrote:
> >>
> >>
> >> Did you mean who, what, when, where and how?
> >>
> >> Who is me. I am the only author so far.
> >>
> >> What is a groovy/java program that supports running groovy/hadoop
> >> scripts.
> >>
> >> When is nearly now.
> >>
> >> Where is everywhere (this is the internet).
> >>
> >> How is an open question. I think that Doug's suggested evolution of
> >> Jira with patches -> contrib -> sub-project is appropriate.
> >>
> >>
> >> On 2/4/08 2:59 PM, "Khalil Honsali" wrote:
> >>
> >>> Hi all, Mr. Dunning;
> >>>
> >>> I am interested in the Groovy idea, especially for processing text; I
> >>> think it can be a good open-source alternative to Google's Sawzall.
> >>>
> >>> Please let me know the 5 Ws of the matter if possible.
> >>>
> >>> K. Honsali
> >>>
> >>> On 05/02/2008, Miles Osborne wrote:
> >>>>
> >>>> sorry, I meant Groovy
> >>>>
> >>>> Miles
> >>>>
> >>>> On 04/02/2008, Tarandeep Singh wrote:
> >>>>>
> >>>>> On Feb 4, 2008 2:40 PM, Miles Osborne wrote:
> >>>>>> How stable is the code? I could quite easily set some undergraduate
> >>>>>> project to do something with it, for example process query logs
> >>>>>>
> >>>>>
> >>>>> I started learning and using Hadoop a few days back. The program
> >>>>> that I have is similar to word count, except that it processes a
> >>>>> query log in a special format. I have another program that reads the
> >>>>> output of this program and computes the top N keywords. I want to
> >>>>> make it one program (a single map-reduce).
> >>>>>
> >>>>> -Taran
> >>>>>
> >>>>>> Miles
> >>>>>>
> >>>>>>
> >>>>>> On 04/02/2008, Ted Dunning wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> This is a great opportunity for me to talk about the Groovy
> >>>>>>> support that I have just gotten running. I am looking for friendly
> >>>>>>> testers, as this code is definitely not ready for full release.
> >>>>>>>
> >>>>>>> The program you need in groovy is this:
> >>>>>>>
> >>>>>>> // define the map-reduce function by specifying map and reduce
> >>>>>>> // functions
> >>>>>>> logCount = Hadoop.mr(
> >>>>>>>     {key, value, out, report -> out.collect(value.split()[0], 1)},
> >>>>>>>     {keyword, counts, out, report ->
> >>>>>>>         sum = 0
> >>>>>>>         counts.each { sum += it }
> >>>>>>>         out.collect(keyword, sum)
> >>>>>>>     })
> >>>>>>>
> >>>>>>> // apply the function to an input file and collect the results
> >>>>>>> // in a map
> >>>>>>> results = [:]
> >>>>>>> logCount(inputFileEitherLocallyOnHDFS).eachLine { line ->
> >>>>>>>     parts = line.split("\t")
> >>>>>>>     results[parts[0]] = parts[1].toInteger()
> >>>>>>> }
> >>>>>>>
> >>>>>>> // sort the entries in the map by descending count and print
> >>>>>>> // the results
> >>>>>>> for (x in results.entrySet().sort( {-it.value} )) {
> >>>>>>>     println x
> >>>>>>> }
> >>>>>>>
> >>>>>>> // delete the temporary results
> >>>>>>> Hadoop.cleanup(results)
> >>>>>>>
> >>>>>>> The important points here are:
> >>>>>>>
> >>>>>>> 1) the groovy binding lets you express the map-reduce part of your
> >>>>>>> program simply.
> >>>>>>>
> >>>>>>> 2) collecting the results is trivial ... You don't have to worry
> >>>>>>> about where or how the results are kept. You would use the same
> >>>>>>> code to read a local file as to read the results of the map-reduce
> >>>>>>> computation.
> >>>>>>>
> >>>>>>> 3) because of (2), you can do some computation locally (the sort)
> >>>>>>> and some in parallel (the counting). You could easily translate
> >>>>>>> the sort to a hadoop call as well.
> >>>>>>>
> >>>>>>> I know that this doesn't quite answer the question because my
> >>>>>>> groovy-hadoop bridge isn't available yet, but it hopefully will
> >>>>>>> spark some interest.
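[Editor's note: since the groovy-hadoop bridge (`Hadoop.mr`, `Hadoop.cleanup`) was unreleased, here is a plain-Python simulation of the same flow — count the first field of each log line, then sort descending locally. The tab-separated sample log is invented for illustration.]

```python
from collections import defaultdict

def log_count(lines):
    """Simulate the logCount map-reduce: map emits (keyword, 1) from the
    first tab-separated field; reduce sums the 1s per keyword."""
    counts = defaultdict(int)
    for line in lines:
        keyword = line.split("\t")[0]   # map phase: extract the keyword
        counts[keyword] += 1            # reduce phase: sum per keyword
    return dict(counts)

# a made-up log in 'keyword<TAB>source<TAB>dateId' form
log = [
    "hadoop\tsiteA\t20080204",
    "groovy\tsiteB\t20080204",
    "hadoop\tsiteC\t20080205",
]
results = log_count(log)

# local post-processing, as in the Groovy script: descending by count
for keyword, n in sorted(results.items(), key=lambda kv: -kv[1]):
    print(keyword, n)
```

The point being illustrated is Ted's item (3): the counting is the part that would run on the cluster, while the sort-and-print stays local.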
> >>>>>>>
> >>>>>>> The question I would like to pose to the community is this:
> >>>>>>>
> >>>>>>> What is the best way to proceed with code like this that is not
> >>>>>>> ready for prime time, but is ready for others to contribute to and
> >>>>>>> possibly also use? Should I follow the Jaql and Cascading course
> >>>>>>> and build a separate repository and web site, or should I try to
> >>>>>>> add this as a contrib package like streaming? Or should I just
> >>>>>>> hand out source by hand for a little while to get feedback?
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Can someone guide me on how to write a program using the hadoop
> >>>>>>>> framework that analyzes the log files and finds the most
> >>>>>>>> frequently occurring keywords? The log file has the format -
> >>>>>>>>
> >>>>>>>> keyword source dateId
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Tarandeep
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> The University of Edinburgh is a charitable body, registered in
> >>>>>> Scotland, with registration number SC005336.
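[Editor's note: Tarandeep's goal of folding his two jobs (count, then top-N) into one program can be sketched locally as follows. This is illustrative Python, not a Hadoop job; in a real single map-reduce, the count would happen in the reducer and the top-N selection in a final single reducer or in the driver. The sample log lines are invented.]

```python
from collections import Counter

def top_keywords(log_lines, n):
    """Count the first whitespace-separated field of each
    'keyword source dateId' line and return the n most frequent."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return counts.most_common(n)

log = [
    "hadoop siteA 20080204",
    "groovy siteB 20080204",
    "hadoop siteC 20080205",
    "pig    siteA 20080205",
]
top = top_keywords(log, 2)   # 'hadoop' comes first with count 2
```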