hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ali Safdar Kureishy <>
Subject Storage requirements for intermediate (map-side-output) data during Hive joins
Date Mon, 07 May 2012 07:31:39 GMT

I'm setting up a Hadoop cluster and would like to understand how much disk
space I should expect to need with joins.

Let's assume that I have 2 tables, each of about 500 GB. Since the tables
are large, these will all be reduce-side joins. As far as I know about such
joins, the data generated is a cross product of the size of the two tables.
Am I wrong?

In other words, for a reduce-side join in Hive involving 2 such tables,
would I need to accommodate for 500 GB * 500 GB = *250000 *GB of *
intermediate* (map-side output) data before the reducer(s) kick-in in my
cluster? Or am I missing something? That seems rediculously high, so I hope
I'm mistaken.

But if the above IS accurate, what are the ways to reduce this consumption
for the same kind of join in Hive?


View raw message