hive-user mailing list archives

From Ali Safdar Kureishy <safdar.kurei...@gmail.com>
Subject Storage requirements for intermediate (map-side-output) data during Hive joins
Date Mon, 07 May 2012 07:31:39 GMT
Hi,

I'm setting up a Hadoop cluster and would like to understand how much disk
space I should expect to need with joins.

Let's assume that I have 2 tables of about 500 GB each. Since the tables
are large, any join between them will be a reduce-side join. As far as I
understand, the data generated by such a join is the cross product of the
sizes of the two tables. Am I wrong?

In other words, for a reduce-side join in Hive involving 2 such tables,
would I need to accommodate 500 GB * 500 GB = *250000* GB of
*intermediate* (map-side output) data before the reducer(s) kick in on my
cluster? Or am I missing something? That seems ridiculously high, so I hope
I'm mistaken.
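
To make the two possibilities concrete, here is a rough sketch of the
arithmetic. The first function is the cross-product figure I'm worried
about. The second is only my assumption of what the alternative would look
like: if each mapper emits every input row once, tagged with its source
table, then map-side output would scale with the sum of the table sizes
rather than the product. The function names and the overhead_factor
parameter are made up for illustration and are not Hive settings.

```python
def cross_product_estimate_gb(left_gb, right_gb):
    """Estimate from the question: the two table sizes multiplied together."""
    return left_gb * right_gb


def additive_estimate_gb(left_gb, right_gb, overhead_factor=1.0):
    """Assumed alternative: every input row is written to map output once.

    overhead_factor is a hypothetical multiplier for key/tag and
    serialization overhead (above 1.0) or compression (below 1.0).
    """
    return (left_gb + right_gb) * overhead_factor


if __name__ == "__main__":
    left_gb, right_gb = 500, 500
    print("cross-product estimate (GB):", cross_product_estimate_gb(left_gb, right_gb))
    print("additive estimate (GB):", additive_estimate_gb(left_gb, right_gb))
```

Which of these two is closer to what Hive actually writes during the map
phase is exactly what I'd like to confirm.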

But if the above IS accurate, what are the ways to reduce this consumption
for the same kind of join in Hive?

Thanks,
Safdar
