hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Partitioning reduce output by date
Date Wed, 19 Mar 2008 20:39:04 GMT
Otis Gospodnetic wrote:
> That "numPartitions" corresponds to the number of reduce tasks.  What I need is partitioning
> that corresponds to the number of unique dates (yyyy-mm-dd) processed by the Mapper and not
> the number of reduce tasks.  I don't know the number of distinct dates in the input ahead
> of time, though, so I cannot just specify the same number of reduces.
> I *can* get the number of unique dates by keeping track of dates in map().  I was going
> to take this approach and use this number in the getPartition(....) method, but apparently
> getPartition(...) is called as each input row is processed by map() call.  This causes a problem
> for me, as I know the total number of unique dates only after *all* of the input is processed
> by map().

The number of partitions is indeed the number of reduces.  If you were 
to compute it during map, then each map might generate a different 
number.  Each map must partition into the same space, so that all 
partition 0 data can go to one reduce, partition 1 to another, and so on.
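
To illustrate why the partition space has to be fixed up front, here is a minimal sketch in plain Java with no Hadoop dependency. The class and method names (DatePartitionSketch, partitionFor) are hypothetical, and the arithmetic is an assumption modelled on a typical hash partitioner rather than code from this thread. Because the result depends only on the date string and the configured number of reduces, every map task computes the same partition for the same date.

```java
class DatePartitionSketch {
    // Hypothetical helper: mask off the sign bit of the hash, then take the
    // remainder modulo the number of reduces. The result depends only on the
    // date and numReduces, never on what a particular map task has seen.
    static int partitionFor(String date, int numReduces) {
        return (date.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) {
        int numReduces = 4; // assumed value; fixed by the job, not by the data
        String[] dates = {"2008-03-17", "2008-03-18", "2008-03-19"};
        for (String d : dates) {
            System.out.println(d + " -> partition " + partitionFor(d, numReduces));
        }
    }
}
```

Several dates may land in the same partition, which is fine: one reduce can then handle more than one date.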

I think Ted pointed you in the right direction: your Partitioner should 
partition by the hash of the date, then your OutputFormat should start 
writing a new file each time the date changes.  That will give you a 
unique file per date.
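
The "new file each time the date changes" idea can be sketched as follows. This is plain Java rather than a real OutputFormat: the class name DateRollingWriterSketch, the "part-" file naming, and the in-memory map standing in for output files are all assumptions made for illustration. It also assumes the date is the reduce key, so that records reach the writer grouped by date.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DateRollingWriterSketch {
    // Stand-in for output files: hypothetical file name -> lines written to it.
    final Map<String, List<String>> files = new LinkedHashMap<>();
    private String currentDate = null;
    private List<String> currentFile = null;

    // Assumes records arrive grouped by date. When the date differs from the
    // previous record's date, switch to the file for the new date.
    void write(String date, String value) {
        if (!date.equals(currentDate)) {
            currentDate = date;
            currentFile = files.computeIfAbsent("part-" + date, k -> new ArrayList<>());
        }
        currentFile.add(value);
    }

    public static void main(String[] args) {
        DateRollingWriterSketch w = new DateRollingWriterSketch();
        w.write("2008-03-18", "a");
        w.write("2008-03-18", "b");
        w.write("2008-03-19", "c");
        System.out.println(w.files.keySet());
    }
}
```

A real implementation would close the previous file and open a new one at the point where this sketch switches lists. Since all records for a given date go to the same reduce, no two reduces would write a file for the same date.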

