hive-user mailing list archives

From Bejoy Ks <bejoy...@yahoo.com>
Subject Re: Storage requirements for intermediate (map-side-output) data during Hive joins
Date Mon, 07 May 2012 07:47:59 GMT
Hi Safdar
     A map-side join uses memory on the Hive client to build hash tables. Such joins don't
go through the key-value shuffle, as there is no reduce phase involved for these jobs.
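
To illustrate the idea, here is a minimal sketch of a hash join done entirely on the "map" side. It is not Hive's implementation: the function name and the (key, value) row layout are made up for the example, and it assumes the smaller table fits in memory.

```python
from collections import defaultdict


def map_side_hash_join(small_rows, large_rows):
    """Join two iterables of (key, value) rows using an in-memory hash table.

    Hypothetical helper for illustration only; assumes small_rows fits in memory.
    """
    # Build phase: load the small table into a hash table keyed by the join key.
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)

    # Probe phase: stream the large table and emit matches directly.
    # Nothing is shuffled or sorted, and no reducer is needed.
    joined = []
    for key, value in large_rows:
        for small_value in table.get(key, []):
            joined.append((key, small_value, value))
    return joined


small = [(1, "a"), (2, "b")]
large = [(1, "x"), (3, "y"), (1, "z")]
result = map_side_hash_join(small, large)
```

Rows from the large table whose key is missing from the hash table (key 3 here) are simply dropped, which corresponds to an inner join.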

Regards
Bejoy KS


________________________________
 From: Ali Safdar Kureishy <safdar.kureishy@gmail.com>
To: user@hive.apache.org 
Sent: Monday, May 7, 2012 1:08 PM
Subject: Re: Storage requirements for intermediate (map-side-output) data during Hive joins
 

Please ignore my question below. I made a mistake with my calculation. The map-side joins
do not perform a cross-product of the data. They just emit the data using the join-key as
the row key.
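
As a rough back-of-the-envelope sketch of the corrected estimate: since each input row is emitted once, keyed by the join key, the intermediate data grows with the sum of the table sizes rather than their product. The function name and the overhead_factor parameter below are assumptions for illustration; real numbers depend on column pruning, filters, and map-output compression.

```python
def estimate_map_output_gb(table_sizes_gb, overhead_factor=1.0):
    """Rough estimate of intermediate (map-side output) data for a reduce-side join.

    Hypothetical helper: each row is emitted once with its join key, so the
    estimate is the sum of the input sizes times an assumed overhead factor.
    """
    return sum(table_sizes_gb) * overhead_factor


# Two tables of about 500 GB each, as in the original question.
estimate = estimate_map_output_gb([500, 500])
```

With two 500 GB tables and no overhead this gives about 1000 GB, not 250000 GB.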


Thanks,
Safdar




On Mon, May 7, 2012 at 12:31 AM, Ali Safdar Kureishy <safdar.kureishy@gmail.com> wrote:

>Hi,
>
>I'm setting up a Hadoop cluster and would like to understand how much disk space I should
>expect to need with joins.
>
>Let's assume that I have 2 tables, each of about 500 GB. Since the tables are large, these
>will all be reduce-side joins. As far as I know about such joins, the data generated is a
>cross product of the size of the two tables. Am I wrong?
>
>In other words, for a reduce-side join in Hive involving 2 such tables, would I need to
>accommodate 500 GB * 500 GB = 250000 GB of intermediate (map-side output) data before
>the reducer(s) kick in in my cluster? Or am I missing something? That seems ridiculously high,
>so I hope I'm mistaken.
>
>But if the above IS accurate, what are the ways to reduce this consumption for the same
>kind of join in Hive?
>
>Thanks,
>Safdar