hadoop-common-user mailing list archives

From "Khalil Honsali" <k.hons...@gmail.com>
Subject Re: Groovy integration details
Date Mon, 04 Feb 2008 23:58:41 GMT
I am not in a position to decide this, but it'll be clearer if the code is
available (as contrib first?)...

K. Honsali

On 05/02/2008, Ted Dunning <tdunning@veoh.com> wrote:
>
>
> The system as it stands supports the following major features:
>
> - map-reduce programs can be constructed for interactive, local use or
> hadoop based execution
>
> - map-reduce programs are functions that can be nested and composed (see
> the sketch after this list)
>
> - inputs to map-reduce programs can be strings, lists of strings, local
> files or HDFS files
>
> - outputs are stored in HDFS
>
> - outputs can be consumed by multiple other functions
>
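The sketch promised in the composition bullet above: a hypothetical illustration against the Hadoop.mr API that Ted shows later in this thread. The bridge itself is unreleased, so every call here is an assumption inferred from his example below.

    // Hypothetical only: Hadoop.mr is the unreleased groovy-hadoop bridge;
    // signatures are inferred from Ted's example later in this thread.

    // count how often each leading word occurs
    wordCount = Hadoop.mr(
        {key, value, out, report -> out.collect(value.split(/\s+/)[0], 1)},
        {word, counts, out, report ->
           sum = 0
           counts.each { sum += it }
           out.collect(word, sum)
        })

    // keep only words that occur more than once; the input lines are the
    // "word<TAB>count" output of wordCount
    frequent = Hadoop.mr(
        {key, value, out, report ->
           parts = value.split("\t")
           if ((parts[1] as int) > 1) out.collect(parts[0], parts[1])},
        {word, counts, out, report -> counts.each { out.collect(word, it) }})

    // composition: the HDFS output of one function feeds the next
    frequent(wordCount("words.txt")).eachLine { println it }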
> The current minor(ish) limitations include:
>
> - combiners, partition functions and sorting aren't supported yet
>
> - you can't pass conventional java Mappers or Reducers to the framework
>
> - only one input file can be given
>
> - the system doesn't clean up afterwards
>
> These are all easily addressed and should be fixed over the next week or
> two.
>
> The major limitations include:
>
> - only one script can be specified
>
> - additional jars cannot be submitted
>
> - no explicit group/co-group syntactic sugar is provided
>
> These will take a bit longer to resolve.  I hope to incorporate jar-building
> code similar to that used by the streaming system to address most of this.
> The group/co-group stuff is just a matter of a bit of work.
>
> Pig is very different from this Groovy integration.  They are trying to
> build a new relational algebra language.  I am just trying to write
> map-reduce programs.  They explicitly do not want to support general
> coding of functions except in a very limited way or via integration of
> Java code, while that is my primary goal.  The other big difference is
> that my system is simple enough that I was able to implement it with a
> week of coding (after a few weeks of noodling about how to make it
> possible at all).
>
>
>
> On 2/4/08 3:28 PM, "Khalil Honsali" <k.honsali@gmail.com> wrote:
>
> > sorry for the lack of clarity,
> >
> > - I think I understand that Groovy is already usable and stable, but
> > requires some testing?  What other things are required?
> > - what is the next step, i.e., the roadmap, if any?  What evolution /
> > growth direction?
> > - I haven't tried Pig, but it also seems to support submitting a function
> > to be transformed to map/reduce, though Pig is higher level?
> >
> > PS:
> >  -  maybe Groovy requires another mailinglist thread ...
> >
> > K. Honsali
> >
> >
> >
> > On 05/02/2008, Ted Dunning <tdunning@veoh.com> wrote:
> >>
> >>
> >> Did you mean who, what, when, where and how?
> >>
> >> Who is me.  I am the only author so far.
> >>
> >> What is a groovy/java program that supports running groovy/hadoop
> >> scripts.
> >>
> >> When is nearly now.
> >>
> >> Where is everywhere (this is the internet).
> >>
> >> How is an open question.  I think that Doug's suggested evolution of
> >> Jira with patches -> contrib -> sub-project is appropriate.
> >>
> >>
> >> On 2/4/08 2:59 PM, "Khalil Honsali" <k.honsali@gmail.com> wrote:
> >>
> >>> Hi all, Mr. Dunning;
> >>>
> >>> I am interested in the Groovy idea, especially for processing text; I
> >>> think it can be a good open-source alternative to Google's Sawzall.
> >>>
> >>> Please let me know the 5-Ws of the matter if possible.
> >>>
> >>> K. Honsali
> >>>
> >>> On 05/02/2008, Miles Osborne <miles@inf.ed.ac.uk> wrote:
> >>>>
> >>>> sorry, I meant Groovy
> >>>>
> >>>> Miles
> >>>>
> >>>> On 04/02/2008, Tarandeep Singh <tarandeep@gmail.com> wrote:
> >>>>>
> >>>>>> On Feb 4, 2008 2:40 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
> >>>>>> How stable is the code?  I could quite easily set some undergraduate
> >>>>>> project to do something with it, for example processing query logs.
> >>>>>>
> >>>>>
> >>>>> I started learning and using hadoop a few days back.  The program
> >>>>> that I have is similar to word count, except that it processes a
> >>>>> query log in a special format.  I have another program that reads
> >>>>> the output of this program and computes the top N keywords.  I want
> >>>>> to make them one program (a single map-reduce).
> >>>>>
> >>>>> -Taran
> >>>>>
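One way to fold those two passes into a single map-reduce job: count keywords in one pass, then select the top N on the client side, since the per-keyword totals are usually small. Below is a hedged sketch in Groovy against the org.apache.hadoop.mapred API of that era; the class names, paths, and the whitespace-delimited log format are assumptions.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.*
    import org.apache.hadoop.mapred.*

    // Map: one log line "keyword source dateId" -> (keyword, 1)
    class KeywordMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1)
        void map(LongWritable offset, Text line,
                 OutputCollector<Text, IntWritable> out, Reporter reporter) {
            def fields = line.toString().split(/\s+/)
            if (fields && fields[0]) out.collect(new Text(fields[0]), ONE)
        }
    }

    // Reduce: (keyword, [1, 1, ...]) -> (keyword, total)
    class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        void reduce(Text keyword, Iterator<IntWritable> counts,
                    OutputCollector<Text, IntWritable> out, Reporter reporter) {
            int sum = 0
            counts.each { sum += it.get() }
            out.collect(keyword, new IntWritable(sum))
        }
    }

    // Driver: a single counting pass; the top-N step is then a local sort
    // over the (small) totals, as in Ted's script further down.
    def conf = new JobConf(KeywordMapper)
    conf.setJobName("keyword-count")
    conf.setOutputKeyClass(Text)
    conf.setOutputValueClass(IntWritable)
    conf.setMapperClass(KeywordMapper)
    conf.setReducerClass(SumReducer)
    FileInputFormat.setInputPaths(conf, new Path("querylogs"))       // assumed path
    FileOutputFormat.setOutputPath(conf, new Path("keyword-counts")) // assumed path
    JobClient.runJob(conf)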
> >>>>>> Miles
> >>>>>>
> >>>>>>
> >>>>>> On 04/02/2008, Ted Dunning <tdunning@veoh.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> This is a great opportunity for me to talk about the Groovy support
> >>>>>>> that I have just gotten running.  I am looking for friendly testers
> >>>>>>> as this code is definitely not ready for full release.
> >>>>>>>
> >>>>>>> The program you need in groovy is this:
> >>>>>>>
> >>>>>>> // define the map-reduce function by specifying map and reduce
> >>>>>>> // functions
> >>>>>>> logCount = Hadoop.mr(
> >>>>>>>    {key, value, out, report -> out.collect(value.split(/\s+/)[0], 1)},
> >>>>>>>    {keyword, counts, out, report ->
> >>>>>>>       sum = 0;
> >>>>>>>       counts.each { sum += it }
> >>>>>>>       out.collect(keyword, sum)
> >>>>>>>    })
> >>>>>>>
> >>>>>>> // apply the function to an input file and collect the results in a map
> >>>>>>> results = [:]
> >>>>>>> logCount(inputFileEitherLocallyOnHDFS).eachLine {
> >>>>>>>     line ->
> >>>>>>>       parts = line.split("\t")
> >>>>>>>       results[parts[0]] = parts[1]
> >>>>>>> }
> >>>>>>>
> >>>>>>> // sort the entries in the map by descending count and print the results
> >>>>>>> for (x in results.entrySet().sort( {-it.value} )) {
> >>>>>>>    println x
> >>>>>>> }
> >>>>>>>
> >>>>>>> // delete the temporary results
> >>>>>>> Hadoop.cleanup(results)
> >>>>>>>
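For readers new to the model, here is a local dry run of what those two closures compute, in plain Groovy outside Hadoop entirely; the sample log lines are invented.

    // Local simulation of the script's map and reduce closures.
    // Hypothetical sample lines in "keyword source dateId" form.
    def lines = ["hadoop search 20080204",
                 "hadoop search 20080205",
                 "pig    grunt  20080204"]

    // map step: emit (keyword, 1) per line
    def pairs = lines.collect { line -> [line.split(/\s+/)[0], 1] }

    // shuffle + reduce step: sum the ones per keyword
    def results = [:]
    pairs.each { kv ->
        def (keyword, count) = kv
        results[keyword] = (results[keyword] ?: 0) + count
    }

    assert results == [hadoop: 2, pig: 1]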
> >>>>>>> The important points here are:
> >>>>>>>
> >>>>>>> 1) the groovy binding lets you express the map-reduce part of your
> >>>>>>> program simply.
> >>>>>>>
> >>>>>>> 2) collecting the results is trivial ... you don't have to worry
> >>>>>>> about where or how the results are kept.  You would use the same
> >>>>>>> code to read a local file as to read the results of the map-reduce
> >>>>>>> computation (see the small illustration after these points).
> >>>>>>>
> >>>>>>> 3) because of (2), you can do some computation locally (the sort)
> >>>>>>> and some in parallel (the counting).  You could easily translate
> >>>>>>> the sort to a hadoop call as well.
> >>>>>>>
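The small illustration promised under point 2: standard Groovy eachLine reading tab-separated counts from a local file; presumably the object the bridge returns for HDFS results honors the same contract. The file name is made up.

    // "counts.txt" is an assumed local file of "keyword<TAB>count" lines,
    // i.e. the same shape as the map-reduce output.
    def results = [:]
    new File("counts.txt").eachLine { line ->
        def parts = line.split("\t")
        results[parts[0]] = parts[1] as int
    }
    println results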
> >>>>>>> I know that this doesn't quite answer the question because my
> >>>>>>> groovy-hadoop bridge isn't available yet, but it hopefully will
> >>>>>>> spark some interest.
> >>>>>>>
> >>>>>>> The question I would like to pose to the community is this:
> >>>>>>>
> >>>>>>>   What is the best way to proceed with code like this that is not
> >>>>>>> ready for prime time, but is ready for others to contribute to and
> >>>>>>> possibly also use?  Should I follow the Jaql and Cascading course
> >>>>>>> and build a separate repository and web site, or should I try to
> >>>>>>> add this as a contrib package like streaming?  Or should I just
> >>>>>>> hand out source by hand for a little while to get feedback?
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Can someone guide me on how to write a program using the hadoop
> >>>>>>>> framework that analyzes the log files and finds the most
> >>>>>>>> frequently occurring keywords?  The log file has the format:
> >>>>>>>>
> >>>>>>>> keyword source dateId
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Tarandeep
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>
> >>
>
>
