hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Equivalent of cmdline head or tail?
Date Fri, 07 Mar 2008 01:02:12 GMT


First, reducers have a close method, which is called after the last call to
reduce.  That means combiners and reducers can hold on to the best N values
as they go and write them out in close, so first N is pretty easy.
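
A minimal sketch of that idea, in plain Java rather than against the Hadoop
API, so it runs on its own.  TopNCollector, add, and close are hypothetical
names: add stands in for reduce (called once per value, writes nothing) and
close stands in for the reducer's close method (called once at the end, which
is where the output would be written).  It assumes the values are longs and
that larger means better.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a reducer; not a Hadoop class. */
final class TopNCollector {
    private final int n;
    // Min-heap: the smallest retained value is at the head, so it is cheap to evict.
    private final PriorityQueue<Long> heap = new PriorityQueue<>();

    TopNCollector(int n) {
        this.n = n;
    }

    /** Plays the role of reduce(): remembers the value if it is among the N largest so far. */
    void add(long value) {
        if (n <= 0) {
            return; // nothing to keep
        }
        if (heap.size() < n) {
            heap.add(value);
        } else if (value > heap.peek()) {
            heap.poll();      // drop the current smallest
            heap.add(value);  // keep the new, larger value
        }
    }

    /** Plays the role of close(): returns the retained values, largest first. */
    List<Long> close() {
        List<Long> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder());
        return out;
    }
}
```

Memory stays bounded at N values no matter how much input goes through add.
For bottom N, the same shape should work with the comparisons reversed.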

Second, you can define the sort order so that the reduce can just process the
first N items and then quit.  I don't know whether the combiner sees its
input in order.  If it does, then you can prune at both levels to minimize
data transfer.
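
The sort-order approach can be sketched the same way, again in plain Java
with hypothetical names (FirstN, run).  The explicit sort below only stands
in for the ordering the framework would apply before the reduce sees the
data; the point is the loop, which stops after N items.  How a real job would
stop early, or just ignore the rest of its input, is left open here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of "sort descending, take the first N, then quit". */
final class FirstN {
    static List<Long> run(List<Long> keys, int n) {
        // Stand-in for the framework's sort with a descending comparator.
        List<Long> sorted = new ArrayList<>(keys);
        sorted.sort(Collections.reverseOrder());

        List<Long> out = new ArrayList<>();
        for (Long key : sorted) {
            if (out.size() >= n) {
                break; // first N already emitted; nothing later can qualify
            }
            out.add(key);
        }
        return out;
    }
}
```

Because the input arrives best-first, no heap or other state is needed beyond
a count of how many items have been emitted.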


On 3/6/08 10:40 AM, "Jimmy Wan" <jimmy@indeed.com> wrote:

> I've got some jobs where I'd like to just pull out the top N or bottom N
> values.
> 
> It seems like I can't do this from the map or combine phases (due to not
> having enough data), but I could aggregate this data during the reduce
> phase. The problem I have is that I won't know when to actually write them
> out until I've gone through the entire set, at which point reduce isn't
> called anymore.
> 
> It's easy enough to post-process with some combination of sort, head, and
> tail, but I was wondering if I was missing something obvious.

