hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: hadoop: how to find top N frequently occurring words
Date Mon, 04 Feb 2008 23:20:21 GMT

Did you mean who, what, when, where and how?

Who is me.  I am the only author so far.

What is a groovy/java program that supports running groovy/hadoop scripts.

When is nearly now.

Where is everywhere (this is the internet).

How is an open question.  I think that Doug's suggested evolution of Jira
with patches -> contrib -> sub-project is appropriate.


On 2/4/08 2:59 PM, "Khalil Honsali" <k.honsali@gmail.com> wrote:

> Hi all, Mr. Dunning,
> 
> I am interested in the Groovy idea, especially for processing text; I think
> it can be a good open-source alternative to Google's Sawzall.
> 
> Please let me know the 5-Ws of the matter if possible.
> 
> K. Honsali
> 
> On 05/02/2008, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>> 
>> sorry, I meant Groovy
>> 
>> Miles
>> 
>> On 04/02/2008, Tarandeep Singh <tarandeep@gmail.com> wrote:
>>> 
>>> On Feb 4, 2008 2:40 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>>>> How stable is the code?  I could quite easily set an undergraduate project
>>>> to do something with it, for example processing query logs.
>>>> 
>>> 
>>> I started learning and using Hadoop a few days back.  The program that I
>>> have is similar to word count, except that it processes a query log in a
>>> special format.  I have another program that reads the output of the first
>>> and computes the top N keywords.  I want to combine them into one program
>>> (a single map-reduce).
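>>>
>>> As a rough local sketch of what the second program does today (the file
>>> name and the tab-separated "keyword, count" output format are hypothetical):
>>>
>>> // read the count job's output and keep only the N largest entries;
>>> // "counts/part-00000" is a stand-in for the real output path
>>> def N = 10
>>> def counts = [:]
>>> new File("counts/part-00000").eachLine { line ->
>>>     def parts = line.split("\t")
>>>     counts[parts[0]] = parts[1].toInteger()
>>> }
>>> // sort descending by count and print the first N entries
>>> counts.entrySet().sort { -it.value }.take(N).each {
>>>     println "${it.key}\t${it.value}"
>>> }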
>>> 
>>> -Taran
>>> 
>>>> Miles
>>>> 
>>>> 
>>>> On 04/02/2008, Ted Dunning <tdunning@veoh.com> wrote:
>>>>> 
>>>>> 
>>>>> This is a great opportunity for me to talk about the Groovy support that I
>>>>> have just gotten running.  I am looking for friendly testers, as this code
>>>>> is definitely not ready for full release.
>>>>> 
>>>>> The program you need in Groovy is this:
>>>>>
>>>>> // define the map-reduce function by specifying map and reduce functions
>>>>> logCount = Hadoop.mr(
>>>>>    // map: emit (keyword, 1) for each tab-separated log line
>>>>>    {key, value, out, report -> out.collect(value.split("\t")[0], 1)},
>>>>>    // reduce: sum the counts collected for each keyword
>>>>>    {keyword, counts, out, report ->
>>>>>       sum = 0
>>>>>       counts.each { sum += it }
>>>>>       out.collect(keyword, sum)
>>>>>    })
>>>>>
>>>>> // apply the function to an input file and collect the results in a map
>>>>> results = [:]
>>>>> logCount(inputFileEitherLocallyOnHDFS).eachLine {
>>>>>     line ->
>>>>>       parts = line.split("\t")
>>>>>       results[parts[0]] = parts[1].toInteger()
>>>>> }
>>>>>
>>>>> // sort the entries in the map by descending count and print the results
>>>>> for (x in results.entrySet().sort( {-it.value} )) {
>>>>>    println x
>>>>> }
>>>>>
>>>>> // delete the temporary results
>>>>> Hadoop.cleanup(results)
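>>>>>
>>>>> (To make that concrete: given a hypothetical log line with the fields
>>>>> "hadoop", "www" and "20080204" separated by tabs, the map emits
>>>>> ("hadoop", 1), and the reduce sums all of the 1s collected under
>>>>> "hadoop" into a single count.)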
>>>>> 
>>>>> The important points here are:
>>>>>
>>>>> 1) The Groovy binding lets you express the map-reduce part of your program
>>>>> simply.
>>>>>
>>>>> 2) Collecting the results is trivial ... you don't have to worry about where
>>>>> or how the results are kept.  You would use the same code to read a local
>>>>> file as to read the results of the map-reduce computation.
>>>>>
>>>>> 3) Because of (2), you can do some computation locally (the sort) and some
>>>>> in parallel (the counting).  You could easily translate the sort to a
>>>>> Hadoop call as well (sketched below).
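>>>>>
>>>>> For example, the sort could become a second pass through the same bridge
>>>>> (a sketch only, since the bridge is unreleased; it assumes the framework
>>>>> sorts keys numerically between map and reduce):
>>>>>
>>>>> // swap count and keyword so the framework's sort-by-key orders the
>>>>> // output; negate the count so larger counts come first
>>>>> sortByCount = Hadoop.mr(
>>>>>    {key, value, out, report ->
>>>>>       parts = value.split("\t")
>>>>>       out.collect(-parts[1].toInteger(), parts[0])
>>>>>    },
>>>>>    {negCount, keywords, out, report ->
>>>>>       keywords.each { out.collect(it, -negCount) }
>>>>>    })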
>>>>> 
>>>>> I know that this doesn't quite answer the question because my groovy-hadoop
>>>>> bridge isn't available yet, but it hopefully will spark some interest.
>>>>> 
>>>>> The question I would like to pose to the community is this:
>>>>> 
>>>>>   What is the best way to proceed with code like this that is not ready
>>>>> for prime time, but is ready for others to contribute to and possibly also
>>>>> use?  Should I follow the Jaql and Cascading course and build a separate
>>>>> repository and web site, or should I try to add this as a contrib package
>>>>> like streaming?  Or should I just hand out source by hand for a little while
>>>>> to get feedback?
>>>>> 
>>>>> 
>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Can someone guide me on how to write a program using the Hadoop framework
>>>>>> that analyzes the log files and finds the top N most frequently occurring
>>>>>> keywords?  The log file has the format:
>>>>>> 
>>>>>> keyword source dateId
>>>>>> 
>>>>>> Thanks,
>>>>>> Tarandeep
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> 
>>>> The University of Edinburgh is a charitable body, registered in Scotland,
>>>> with registration number SC005336.
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
>> 

