hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Partitioning reduce output by date
Date Thu, 20 Mar 2008 23:00:19 GMT
Thank you, Doug and Ted, this pointed me in the right direction, which lead to a custom OutputFormat
and a RecordWriter that opens and closes the DataOutputStream based on the current key (if
current key diff from previous key, close previous output and open a new one, then write....)

As for partitioning, that worked, too.  My getPartition method now has:

            int dateHash = startDate.hashCode();
            if (dateHash < 0)
                dateHash = -dateHash;
            int partitionID = dateHash % numPartitions;
            return partitionID;

Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Doug Cutting <cutting@apache.org>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 19, 2008 4:39:04 PM
Subject: Re: Partitioning reduce output by date

Otis Gospodnetic wrote:
> That "numPartitions" corresponds to the number of reduce tasks.  What I need is partitioning
that corresponds to the number of unique dates (yyyy-mm-dd) processed by the Mapper and not
the number of reduce tasks.  I don't know the number of distinct dates in the input ahead
of time, though, so I cannot just specify the same number of reduces.
> I *can* get the number of unique dates by keeping track of dates in map().  I was going
to take this approach and use this number in the getPartition(....) method, but apparently
getPartition(...) is called as each input row is processed by map() call.  This causes a problem
for me, as I know the total number of unique dates only after *all* of the input is processed
by map().

The number of partitions is indeed the number of reduces.  If you were 
to compute it during map, then each map might generate a different 
number.  Each map must partition into the same space, so that all 
partition 0 data can go to one reduce, partition 1 to another, and so on.

I think Ted pointed you in the right direction: your Partitioner should 
partition by the hash of the date, then your OutputFormat should start 
writing a new file each time the date changes.  That will give you a 
unique file per date.


View raw message