Please ignore my question below. I made a mistake in my calculation. The
map side of these joins does not perform a cross product of the data. It just
emits each row once, using the join key as the row key.
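
To illustrate with a toy example: below is a minimal Python sketch of a
reduce-side join, not Hive's actual implementation. The names map_phase and
reduce_phase and the two tiny tables are made up for illustration. It only
shows that the map output holds one record per input row, so its size grows
with the sum of the two tables rather than their product.

```python
from collections import defaultdict

def map_phase(table_name, rows, key_index=0):
    """Emit one (join_key, (table_name, row)) pair per input row."""
    for row in rows:
        yield row[key_index], (table_name, row)

def reduce_phase(shuffled):
    """Pair up rows from the two tables that share a join key."""
    for key, tagged in shuffled.items():
        left = [r for t, r in tagged if t == "a"]
        right = [r for t, r in tagged if t == "b"]
        for l in left:
            for r in right:
                yield key, l, r

# Hypothetical sample data: first field is the join key.
table_a = [(1, "x"), (2, "y"), (3, "z")]
table_b = [(1, "p"), (2, "q"), (4, "r")]

# Map side: every row is emitted exactly once, tagged with its source table.
emitted = list(map_phase("a", table_a)) + list(map_phase("b", table_b))

# Shuffle: group the tagged rows by join key.
shuffled = defaultdict(list)
for key, tagged_row in emitted:
    shuffled[key].append(tagged_row)

joined = list(reduce_phase(shuffled))

print(len(emitted))  # 6 intermediate records = 3 + 3, not 3 * 3
print(len(joined))   # 2 joined rows (keys 1 and 2 match)
```

By the same reasoning, two 500 GB tables should produce intermediate data on
the order of 1 TB, give or take tagging overhead, column pruning and
compression, which I have not measured.
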
Thanks,
Safdar
On Mon, May 7, 2012 at 12:31 AM, Ali Safdar Kureishy <
safdar.kureishy@gmail.com> wrote:
> Hi,
>
> I'm setting up a Hadoop cluster and would like to understand how much disk
> space I should expect to need with joins.
>
> Let's assume that I have 2 tables, each of about 500 GB. Since the tables
> are large, these will all be reduce-side joins. As far as I understand such
> joins, the data generated is a cross product of the two tables.
> Am I wrong?
>
> In other words, for a reduce-side join in Hive involving 2 such tables,
> would I need to accommodate 500 GB * 500 GB = *250000* GB of
> *intermediate* (map-side output) data before the reducer(s) kick in on my
> cluster? Or am I missing something? That seems ridiculously high, so I hope
> I'm mistaken.
>
> But if the above IS accurate, what are the ways to reduce this consumption
> for the same kind of join in Hive?
>
> Thanks,
> Safdar
>
