hadoop-common-user mailing list archives

From "Runping Qi" <runp...@yahoo-inc.com>
Subject RE: Partitioning reduce output by date
Date Thu, 20 Mar 2008 23:09:51 GMT

If you want to output data to different files based on the date or any
other part of the value, you may want to check


> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Thursday, March 20, 2008 4:00 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Partitioning reduce output by date
> Thank you, Doug and Ted, this pointed me in the right direction, which
> led to a custom OutputFormat and a RecordWriter that opens and closes a
> DataOutputStream based on the current key (if the current key differs from
> the previous key, close the previous output and open a new one, then
> write to the new stream).
> As for partitioning, that worked, too.  My getPartition method now does:
>             // mask the sign bit: plain negation fails for Integer.MIN_VALUE
>             int dateHash = startDate.hashCode() & Integer.MAX_VALUE;
>             int partitionID = dateHash % numPartitions;
>             return partitionID;
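The same hashing can be sketched as a small standalone method; the class and method names below are hypothetical (this is not the Hadoop Partitioner API, just the partition arithmetic on its own):

```java
// Standalone sketch of date-hash partitioning. Masking the hash with
// Integer.MAX_VALUE clears the sign bit, which also handles the
// Integer.MIN_VALUE hash code that plain negation leaves negative.
public class DatePartitioning {
    // Maps a date string (e.g. "2008-03-20") to a partition in [0, numPartitions).
    public static int partitionForDate(String date, int numPartitions) {
        return (date.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because the partition depends only on the date string, every record for a given date lands in the same reduce, which is what makes the one-file-per-date OutputFormat work.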
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> ----- Original Message ----
> From: Doug Cutting <cutting@apache.org>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, March 19, 2008 4:39:04 PM
> Subject: Re: Partitioning reduce output by date
> Otis Gospodnetic wrote:
> > That "numPartitions" corresponds to the number of reduce tasks.  What I
> > need is partitioning that corresponds to the number of unique dates
> > (yyyy-mm-dd) processed by the Mapper and not the number of reduce tasks.  I
> > don't know the number of distinct dates in the input ahead of time,
> > though, so I cannot just specify the same number of reduces.
> >
> > I *can* get the number of unique dates by keeping track of dates in
> > map().  I was going to take this approach and use this number in the
> > getPartition(...) method, but apparently getPartition(...) is called as
> > each input row is processed by a map() call.  This causes a problem for me,
> > as I know the total number of unique dates only after *all* of the input
> > is processed by map().
> The number of partitions is indeed the number of reduces.  If you were
> to compute it during map, then each map might generate a different
> number.  Each map must partition into the same space, so that all
> partition 0 data can go to one reduce, partition 1 to another, and so on.
> I think Ted pointed you in the right direction: your Partitioner should
> partition by the hash of the date, then your OutputFormat should start
> writing a new file each time the date changes.  That will give you a
> unique file per date.
> Doug
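The "start a new file each time the date changes" idea Doug describes can be sketched in plain Java (this is not the Hadoop RecordWriter API; `PerDateWriter` is a hypothetical name, and a StringBuilder stands in for a real output file). Since keys arrive sorted within a reduce, the writer only needs to switch outputs when the date changes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of per-date output switching: records for the same date arrive
// consecutively, so we swap the active output only on a date change.
public class PerDateWriter {
    private final Map<String, StringBuilder> outputs =
        new LinkedHashMap<String, StringBuilder>();
    private String currentDate;
    private StringBuilder current;

    public void write(String date, String record) {
        if (!date.equals(currentDate)) {
            // date changed: a real OutputFormat would close the previous
            // stream here and open a new file named after the date
            currentDate = date;
            current = outputs.get(date);
            if (current == null) {
                current = new StringBuilder();
                outputs.put(date, current);
            }
        }
        current.append(record).append('\n');
    }

    public Map<String, StringBuilder> outputs() { return outputs; }
}
```

Combined with the date-hash Partitioner, each reduce sees a contiguous run of records per date, so each date ends up in exactly one output file.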
