hadoop-mapreduce-user mailing list archives

From Trevor Adams <trevorad...@gmail.com>
Subject Re: Is there a way to insure that different jobs have the same number of reducers
Date Thu, 30 Jun 2011 01:11:43 GMT
Landing in the exact same bucket is possible; landing on the exact same
machine (if that is what you had in mind) probably is not. The partitioner
splits the map output among the reducers, so keys that map to the same
partition are handled by the same reducer. If you can partition the data so
that the output of one step 1 reducer maps to a single bucket in step 2 and
is not split, then all of that data goes to one reducer. Doing it this way
requires some property of the data that carries over from the step 1
reducer through the step 2 mapper. Most cases, I would assume, do not have
that property.
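
As a rough illustration of the bucket assignment, here is a minimal sketch
of the arithmetic used by Hadoop's default HashPartitioner, written without
any Hadoop dependencies so it runs standalone. The class name
PartitionSketch, the helper names, and the reducer count of 20 are
hypothetical. It assumes the key's hashCode() is stable across jobs and JVMs
(true for Integer and for IntWritable, which hashes to its int value) and
that both jobs are configured with the same number of reduce tasks, for
example via Job.setNumReduceTasks.

```java
import java.util.Locale;

// Sketch only: mirrors the default hash partitioning rule, not a Hadoop class.
final class PartitionSketch {

    // Same arithmetic as HashPartitioner.getPartition:
    // clear the sign bit of the hash, then take it modulo the reducer count.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Name of the reducer output file for a partition, in the
    // part-r-NNNNN form mentioned in the original question.
    static String partFileFor(int partition) {
        return String.format(Locale.ROOT, "part-r-%05d", partition);
    }

    public static void main(String[] args) {
        int numReduceTasks = 20; // hypothetical; must be identical in both jobs
        for (int key : new int[] {12, 32, 52, -7}) {
            int p = partitionFor(key, numReduceTasks);
            System.out.println("key " + key + " -> reducer " + p
                    + " -> " + partFileFor(p));
        }
    }
}
```

The partition number depends only on the key's hash and the reducer count,
not on which machine runs the task, so a re-executed reduce attempt for a
given partition is assigned the same keys.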


On Wed, Jun 29, 2011 at 9:05 PM, Steve Lewis <lordjoe2000@gmail.com> wrote:

> I am trying to run an application that generates the Cartesian product of
> two potentially large data sets. In reality I only need the Cartesian
> product of values in the set with a particular integer key. I am
> considering a design
> where the first mappers run through the values of set A emitting that
> integer as a key and the item as a value. The reducers are simple identity
> reducers.
> In the second job the mappers run through set B emitting values with a key
> and the item as a value. The reducers read the output of the first job to
> run through the values of A.
> One issue is that, assuming the same hashing partitioner is used and there
> are the same number of reducers, a specific reducer, say reducer 12,
> will receive the same keys in both jobs, and thus part-r-00012 from the
> first job is the only file reducer 12 will need to read.
> Can I guarantee (without restricting the number of reducers to a smaller
> number than the cluster will support) that this condition is met - namely
> that the keys in the second job hit the same reducer number as the first
> job? What about restarts and failures?
> BTW, is there any way to find out the size of a cluster?
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
