hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Groovy integration details
Date Mon, 04 Feb 2008 23:39:06 GMT

The system as it stands supports the following major features:

- map-reduce programs can be constructed for interactive, local use or for
Hadoop-based execution

- map-reduce programs are functions that can be nested and composed

- inputs to map-reduce programs can be strings, lists of strings, local
files or HDFS files

- outputs are stored in HDFS

- outputs can be consumed by multiple other functions
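To make the nesting/composition point concrete, here is a minimal sketch in plain Java rather than the Groovy binding. All names (`LocalMapReduce`, `Mapper`, `Reducer`, `run`) are invented for illustration and are not the real API; it only simulates, in one process, the idea that a map-reduce program is an ordinary function whose output can feed another step:

```java
import java.util.*;

// A minimal local simulation (illustrative only, not the real bridge API):
// a map-reduce "program" is just a function from input records to a result
// map, so one step's output can be consumed by other functions.
public class LocalMapReduce {
    interface Mapper { void map(String value, Map<String, List<Integer>> out); }
    interface Reducer { int reduce(String key, List<Integer> values); }

    // Run the map phase, group by key (the "shuffle"), then reduce.
    static Map<String, Integer> run(List<String> input, Mapper m, Reducer r) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String value : input) m.map(value, grouped);
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, vs) -> result.put(k, r.reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        List<String> log = List.of("foo\tsiteA\t1", "bar\tsiteB\t2", "foo\tsiteC\t3");
        // Map: emit (keyword, 1); Reduce: sum the ones -- a keyword count.
        Map<String, Integer> counts = run(log,
            (value, out) -> out.computeIfAbsent(value.split("\t")[0],
                                                k -> new ArrayList<>()).add(1),
            (key, values) -> values.stream().mapToInt(Integer::intValue).sum());
        System.out.println(counts);  // {bar=1, foo=2}
    }
}
```

Because `run()` returns a plain map, the result of one "job" can be handed directly to another function, which is the composition property listed above.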

The current minor(ish) limitations include:

- combiners, partition functions and sorting aren't supported yet

- you can't pass conventional Java Mappers or Reducers to the framework

- only one input file can be given

- the system doesn't clean up afterwards

These are all easily addressed and should be fixed over the next week or
two.
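For readers unfamiliar with the first unsupported item: a combiner is just a reduce-style function applied to each mapper's local output before it crosses the network in the shuffle. A minimal sketch in plain Java (illustrative only; `CombinerSketch` and `combine` are invented names, not part of this Groovy bridge):

```java
import java.util.*;

// Illustrative sketch: a combiner pre-aggregates one mapper's output
// locally, so fewer (key, value) pairs are shuffled to the reducers.
public class CombinerSketch {
    // Collapse a mapper's emitted keys into (key, partialCount) pairs.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String key : mapOutputKeys)
            partial.merge(key, 1, Integer::sum);
        return partial;
    }

    public static void main(String[] args) {
        // One mapper emitted four pairs; after combining, only two pairs
        // (bar=1, foo=3) need to be shuffled.
        List<String> emitted = List.of("foo", "foo", "bar", "foo");
        System.out.println(combine(emitted));  // {bar=1, foo=3}
    }
}
```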

The major limitations include:

- only one script can be specified

- additional jars cannot be submitted

- no explicit group/co-group syntactic sugar is provided

These will take a bit longer to resolve.  I hope to incorporate jar-building
code similar to that used by the streaming system to address most of this.
The group/co-group support is just a matter of a bit of work.
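For reference, group/co-group sugar would cover the pattern Pig calls COGROUP: grouping two datasets by a shared key and collecting the matching records from each side. A sketch of the intended semantics in plain Java, under the simplifying assumption that each side is already keyed (`CoGroupSketch` and `coGroup` are invented names for illustration):

```java
import java.util.*;

// Illustrative sketch of co-group semantics: for every key appearing in
// either input, yield the list of matching records from each side.
public class CoGroupSketch {
    static Map<String, List<List<String>>> coGroup(Map<String, List<String>> left,
                                                   Map<String, List<String>> right) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        Set<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        for (String k : keys)
            out.put(k, List.of(left.getOrDefault(k, List.of()),
                               right.getOrDefault(k, List.of())));
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> clicks = Map.of("foo", List.of("click1", "click2"));
        Map<String, List<String>> views  = Map.of("foo", List.of("view1"),
                                                  "bar", List.of("view2"));
        // Keys missing from one side pair with an empty list, not a dropped row.
        System.out.println(coGroup(clicks, views));
        // {bar=[[], [view2]], foo=[[click1, click2], [view1]]}
    }
}
```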

Pig is very different from this Groovy integration.  The Pig team is trying
to build a new relational algebra language, while I am just trying to write
map-reduce programs.  They explicitly do not want to support general coding
of functions, except in a very limited way or via integration of Java code,
whereas that is my primary goal.  The other big difference is that my system
is simple enough that I was able to implement it with a week of coding
(after a few weeks of noodling about how to make it possible at all).



On 2/4/08 3:28 PM, "Khalil Honsali" <k.honsali@gmail.com> wrote:

> sorry for the lack of clarity,
> 
> - I think I understand that Groovy is already usable and stable, but
> requires some testing?  What other things are required?
> - what is the next step, i.e., roadmap if any, what evolution / growth
> direction?
> - I haven't tried Pig, but it also seems to support submitting a function
> to be transformed to map/reduce, though Pig is higher level?
> 
> PS:
>  -  maybe Groovy requires another mailing list thread ...
> 
> K. Honsali
> 
> 
> 
> On 05/02/2008, Ted Dunning <tdunning@veoh.com> wrote:
>> 
>> 
>> Did you mean who, what, when, where and how?
>> 
>> Who is me.  I am the only author so far.
>> 
>> What is a groovy/java program that supports running groovy/hadoop scripts
>> 
>> When is nearly now.
>> 
>> Where is everywhere (this is the internet)
>> 
>> How is an open question.  I think that Doug's suggested evolution of Jira
>> with patches -> contrib -> sub-project is appropriate.
>> 
>> 
>> On 2/4/08 2:59 PM, "Khalil Honsali" <k.honsali@gmail.com> wrote:
>> 
>>> Hi all, Mr. Dunning;
>>> 
>>> I am interested in the Groovy idea, especially for processing text, I
>>> think it can be a good opensource alternative to Google's Sawzall.
>>> 
>>> Please let me know the 5-Ws of the matter if possible.
>>> 
>>> K. Honsali
>>> 
>>> On 05/02/2008, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>>>> 
>>>> sorry, I meant Groovy
>>>> 
>>>> Miles
>>>> 
>>>> On 04/02/2008, Tarandeep Singh <tarandeep@gmail.com> wrote:
>>>>> 
>>>>> On Feb 4, 2008 2:40 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>>>>>> How stable is the code?  I could quite easily set some undergraduate
>>>>>> project to do something with it, for example process query logs
>>>>>> 
>>>>> 
>>>>> I started learning and using hadoop a few days back.  The program that
>>>>> I have is similar to word count, except that it processes a query log
>>>>> in a special format.  I have another program that reads the output of
>>>>> this program and computes the top N keywords.  I want to make it one
>>>>> program (a single map-reduce).
>>>>> 
>>>>> -Taran
>>>>> 
>>>>>> Miles
>>>>>> 
>>>>>> 
>>>>>> On 04/02/2008, Ted Dunning <tdunning@veoh.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> This is a great opportunity for me to talk about the Groovy support
>>>>>>> that I have just gotten running.  I am looking for friendly testers
>>>>>>> as this code is definitely not ready for full release.
>>>>>>> 
>>>>>>> The program you need in groovy is this:
>>>>>>> 
>>>>>>> // define the map-reduce function by specifying map and reduce
>>>>>>> // functions
>>>>>>> logCount = Hadoop.mr(
>>>>>>>    {key, value, out, report -> out.collect(value.split()[0], 1)},
>>>>>>>    {keyword, counts, out, report ->
>>>>>>>       sum = 0
>>>>>>>       counts.each { sum += it }
>>>>>>>       out.collect(keyword, sum)
>>>>>>>    })
>>>>>>> 
>>>>>>> // apply the function to an input file and collect the results in a map
>>>>>>> results = [:]
>>>>>>> logCount(inputFileEitherLocallyOnHDFS).eachLine {
>>>>>>>     line ->
>>>>>>>       parts = line.split("\t")
>>>>>>>       results[parts[0]] = parts[1] as int
>>>>>>> }
>>>>>>> 
>>>>>>> // sort the entries in the map by descending count and print the
>>>>>>> // results
>>>>>>> for (x in results.entrySet().sort( {-it.value} )) {
>>>>>>>    println x
>>>>>>> }
>>>>>>> 
>>>>>>> // delete the temporary results
>>>>>>> Hadoop.cleanup(results)
>>>>>>> 
>>>>>>> The important points here are:
>>>>>>> 
>>>>>>> 1) the groovy binding lets you express the map-reduce part of your
>>>>>>> program simply.
>>>>>>> 
>>>>>>> 2) collecting the results is trivial ... You don't have to worry
>>>>>>> about where or how the results are kept.  You would use the same code
>>>>>>> to read a local file as to read the results of the map-reduce
>>>>>>> computation.
>>>>>>> 
>>>>>>> 3) because of (2), you can do some computation locally (the sort) and
>>>>>>> some in parallel (the counting).  You could easily translate the sort
>>>>>>> to a hadoop call as well.
>>>>>>> 
>>>>>>> I know that this doesn't quite answer the question because my
>>>>>>> groovy-hadoop bridge isn't available yet, but it hopefully will spark
>>>>>>> some interest.
>>>>>>> 
>>>>>>> The question I would like to pose to the community is this:
>>>>>>> 
>>>>>>>   What is the best way to proceed with code like this that is not
>>>>>>> ready for prime time, but is ready for others to contribute to and
>>>>>>> possibly also use?  Should I follow the Jaql and Cascading course and
>>>>>>> build a separate repository and web site, or should I try to add this
>>>>>>> as a contrib package like streaming?  Or should I just hand out
>>>>>>> source by hand for a little while to get feedback?
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/4/08 2:04 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Can someone guide me on how to write a program using the hadoop
>>>>>>>> framework that analyzes the log files and finds the top most
>>>>>>>> frequently occurring keywords.  The log file has the format -
>>>>>>>> 
>>>>>>>> keyword source dateId
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Tarandeep
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 

