hadoop-common-user mailing list archives

From "Tarandeep Singh" <tarand...@gmail.com>
Subject Re: hadoop: how to find top N frequently occurring words
Date Mon, 04 Feb 2008 22:48:18 GMT
On Feb 4, 2008 2:40 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
> How stable is the code?  I could quite easily set an undergraduate project
> to do something with it, for example processing query logs.
>

I started learning and using Hadoop a few days ago. The program I have is
similar to word count, except that it processes a query log in a special
format. A second program then reads the output of the first and computes
the top N keywords. I want to combine them into one program (a single
MapReduce job), along the lines of the sketch below.
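
One way to fold both steps into a single job (an untested sketch; it uses
the newer org.apache.hadoop.mapreduce API rather than the current
org.apache.hadoop.mapred one, assumes a single reduce task, and N and the
class names are placeholders):

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNKeywords {

  // Map: take the first field of each "keyword source dateId" line
  // and emit (keyword, 1).
  public static class KeywordMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text keyword = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\\s+");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        keyword.set(fields[0]);
        context.write(keyword, ONE);
      }
    }
  }

  // Reduce: sum the counts, but only remember the N largest so far and
  // emit them once all keys have been seen. Requires
  // job.setNumReduceTasks(1) so a single reducer sees every keyword.
  // Note: ties on count overwrite each other in this TreeMap; a real job
  // would key on (count, keyword).
  public static class TopNReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int N = 10; // placeholder

    private final TreeMap<Integer, String> top =
        new TreeMap<Integer, String>();

    @Override
    protected void reduce(Text keyword, Iterable<IntWritable> counts,
        Context context) {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      top.put(sum, keyword.toString());
      if (top.size() > N) {
        top.remove(top.firstKey()); // evict the smallest count
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      // Emit in descending order of count.
      for (Integer count : top.descendingKeySet()) {
        context.write(new Text(top.get(count)), new IntWritable(count));
      }
    }
  }
}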

-Taran

> Miles
>
>
> On 04/02/2008, Ted Dunning <tdunning@veoh.com> wrote:
> >
> >
> > This is a great opportunity for me to talk about the Groovy support that
> > I have just gotten running.  I am looking for friendly testers, as this
> > code is definitely not ready for full release.
> >
> > The program you need in groovy is this:
> >
> > // define the map-reduce function by specifying map and reduce closures
> > logCount = Hadoop.mr(
> >    // map: emit (keyword, 1) for the first field of each input line
> >    {key, value, out, report -> out.collect(value.split()[0], 1)},
> >    // reduce: sum the counts for each keyword
> >    {keyword, counts, out, report ->
> >       sum = 0
> >       counts.each { sum += it }
> >       out.collect(keyword, sum)
> >    })
> >
> > // apply the function to an input file (local or HDFS path) and collect
> > // the results in a map
> > results = [:]
> > logCount(inputFile).eachLine { line ->
> >    parts = line.split("\t")
> >    results[parts[0]] = parts[1] as int   // count as an int so it sorts
> > }
> >
> > // sort the entries in the map by descending count and print the results
> > for (x in results.entrySet().sort { -it.value }) {
> >    println x
> > }
> >
> > // delete the temporary results
> > Hadoop.cleanup(results)
> >
> > The important points here are:
> >
> > 1) The Groovy binding lets you express the map-reduce part of your
> > program simply.
> >
> > 2) Collecting the results is trivial ... you don't have to worry about
> > where or how the results are kept. You would use the same code to read a
> > local file as to read the results of the map-reduce computation.
> >
> > 3) Because of (2), you can do some computation locally (the sort) and
> > some in parallel (the counting). You could easily translate the sort to
> > a Hadoop call as well, as in the sketch below.
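> >
> > That sort step could become a second map-reduce pass along these lines
> > (an untested sketch in plain Java rather than the Groovy bridge, using
> > the same newer org.apache.hadoop.mapreduce API as the sketch above; it
> > assumes the counting job wrote tab-separated "keyword<TAB>count" lines):
> >
> > import java.io.IOException;
> >
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.LongWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.io.WritableComparable;
> > import org.apache.hadoop.io.WritableComparator;
> > import org.apache.hadoop.mapreduce.Mapper;
> >
> > public class SortByCount {
> >
> >   // Map: invert "keyword<TAB>count" lines to (count, keyword) so the
> >   // shuffle sorts by count; the default (identity) reducer then writes
> >   // the pairs back out in sorted order. Use a single reduce task for a
> >   // total order across the whole output.
> >   public static class InvertMapper
> >       extends Mapper<LongWritable, Text, IntWritable, Text> {
> >     @Override
> >     protected void map(LongWritable offset, Text line, Context context)
> >         throws IOException, InterruptedException {
> >       String[] parts = line.toString().split("\t");
> >       context.write(new IntWritable(Integer.parseInt(parts[1])),
> >           new Text(parts[0]));
> >     }
> >   }
> >
> >   // Register with job.setSortComparatorClass(...) to sort counts in
> >   // descending rather than the default ascending order.
> >   public static class DescendingIntComparator extends WritableComparator {
> >     public DescendingIntComparator() {
> >       super(IntWritable.class, true);
> >     }
> >
> >     @Override
> >     @SuppressWarnings("rawtypes")
> >     public int compare(WritableComparable a, WritableComparable b) {
> >       return -super.compare(a, b);
> >     }
> >   }
> > }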
> >
> > I know that this doesn't quite answer the question, because my
> > groovy-hadoop bridge isn't available yet, but hopefully it will spark
> > some interest.
> >
> > The question I would like to pose to the community is this:
> >
> >   What is the best way to proceed with code like this that is not ready
> > for prime time, but is ready for others to contribute to and possibly
> > also use? Should I follow the Jaql and Cascading course and build a
> > separate repository and web site, or should I try to add this as a
> > contrib package like streaming? Or should I just distribute the source
> > by hand for a little while to get feedback?
> >
> >
> > On 2/4/08 2:04 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Can someone guide me on how to write a program, using the Hadoop
> > > framework, that analyzes log files and finds the most frequently
> > > occurring keywords? The log file has the format:
> > >
> > > keyword source dateId
> > >
> > > Thanks,
> > > Tarandeep
> >
> >
>
>
> --
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
