hadoop-common-dev mailing list archives

From Chris K Wensel <ch...@wensel.net>
Subject Re: ongoing · Wide Finder 2
Date Thu, 01 May 2008 22:34:32 GMT
Or Cascading (+Groovy).

I should have a release of my Groovy Cascading builder out by this weekend...

def APACHE_COMMON_REGEX = /^([^ ]*) +[^ ]* +[^ ]* +\[([^]]*)\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$/
def APACHE_COMMON_GROUPS = [1, 2, 3, 4, 5, 6]
def APACHE_COMMON_FIELDS = ["ip", "time", "method", "url", "status", "size"]

def URL_PATTERN = /\/ongoing\/When\/\d\d\dx\/\d\d\d\d\/\d\d\/\d\d\/[^ .]+/

def cascading = new Cascading()
def builder = cascading.builder();

Flow flow = builder.flow("widefinder") {
     source(input, scheme: text())

     // parse apache log
     regexParser(pattern: APACHE_COMMON_REGEX, groups: APACHE_COMMON_GROUPS,
                 declared: APACHE_COMMON_FIELDS)

     // throw away tuples that don't match
     filter(arguments:["url"], pattern:URL_PATTERN)

     // throw away unused fields
     project(arguments:["url"])

     group(groupBy:["url"])

     // creates 'count' field, by default
     count()

     // group/sort on 'count', reverse the sort order
     group(["count"], reverse: true)

     sink(output, delete: true)
   }

flow.complete() // execute the flow
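
To make the intent of the flow concrete, here is a rough plain-Groovy sketch of the same steps (parse, filter, count, sort descending) run over a few in-memory lines. It does not use Cascading at all. The sample log lines are invented for illustration, and treating the URL filter as a whole-field match is an assumption on my part, not a statement of how the builder's filter() behaves.

```groovy
// Plain-Groovy sketch of what the flow computes; no Cascading involved.
def APACHE_COMMON_REGEX = /^([^ ]*) +[^ ]* +[^ ]* +\[([^]]*)\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$/
def URL_PATTERN = /\/ongoing\/When\/\d\d\dx\/\d\d\d\d\/\d\d\/\d\d\/[^ .]+/

// Hypothetical sample input, made up for this example.
def lines = [
  '10.0.0.1 - - [01/May/2008:10:00:00 -0700] "GET /ongoing/When/200x/2008/05/01/Wide-Finder-2 HTTP/1.1" 200 1234',
  '10.0.0.2 - - [01/May/2008:10:00:01 -0700] "GET /ongoing/When/200x/2008/05/01/Wide-Finder-2 HTTP/1.1" 200 1234',
  '10.0.0.3 - - [01/May/2008:10:00:02 -0700] "GET /ongoing/When/200x/2007/09/20/Wide-Finder HTTP/1.1" 200 2345',
  '10.0.0.4 - - [01/May/2008:10:00:03 -0700] "GET /favicon.ico HTTP/1.1" 404 0'
]

def counts = [:].withDefault { 0 }

lines.each { line ->
  // parse apache log; skip lines that are not in common log format
  def m = line =~ APACHE_COMMON_REGEX
  if (!m.matches()) return

  // group 4 is the "url" field
  def url = m.group(4)

  // throw away urls that don't match (assumption: whole-field match)
  if (!(url ==~ URL_PATTERN)) return

  // group by url and count
  counts[url] += 1
}

// sort on count, highest first
def ranked = counts.sort { a, b -> b.value <=> a.value }

ranked.each { url, n -> println "${n}\t${url}" }
```

The Cascading version expresses the same pipeline declaratively, so it can be planned onto Hadoop jobs instead of looping over a list in memory.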


On May 1, 2008, at 2:12 PM, Doug Cutting wrote:

> Anyone want to play?  The goal is to find a small program that  
> quickly computes some statistics over 45GB of log data on a 32-core  
> box.  Hadoop seems like a good candidate.  Streaming?  Pig?  Java?
>
> http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2
>
> Doug

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/




