hive-user mailing list archives

From Ali Safdar Kureishy <safdar.kurei...@gmail.com>
Subject Re: Storage requirements for intermediate (map-side-output) data during Hive joins
Date Tue, 08 May 2012 06:58:06 GMT
Hi Bejoy,

Thanks, I see. I was asking because I wanted to know how much total
storage space I would need on the cluster for the given data in the tables.

Are you saying that for 2 tables of 500 GB each (spread across the
cluster), there would be a need for intermediate storage of 250000 GB? Or
are you saying that this is the sum total of all the data *processing* that
happens, without that much ever actually being stored? I'm guessing you
were referring to the latter, because the former seems unscalable.
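
To make the per-task numbers concrete, here is a rough sketch of the
arithmetic, assuming the 64 MB per-task figure from your reply and binary
units (1 GB = 1024 MB). The function name is just for illustration, and this
only counts input splits; it says nothing about how large the map output
itself ends up being.

```python
# Back-of-the-envelope split arithmetic for the two tables in this thread.
# Assumptions (not taken from Hive or Hadoop configuration): binary units
# and one 64 MB input split per map task.

MB_PER_GB = 1024

table_a_mb = 500 * MB_PER_GB
table_b_mb = 500 * MB_PER_GB
split_mb = 64


def map_task_count(input_mb, split_mb):
    """Number of map tasks if each task reads one split (ceiling division)."""
    return -(-input_mb // split_mb)


total_input_mb = table_a_mb + table_b_mb
tasks = map_task_count(total_input_mb, split_mb)

print("total input (GB):", total_input_mb // MB_PER_GB)
print("map tasks:", tasks)
print("input per task (MB):", split_mb)
```

Under those assumptions the 1000 GB of input is spread over about 16000 map
tasks of 64 MB each, so no single task ever reads anything close to the
full data set.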

Regards,
Safdar


On Mon, May 7, 2012 at 10:44 AM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:

> Hi Ali
>
>       The 500*500 gigs of data is actually processed by multiple tasks
> across multiple nodes. With default settings, each task processes 64 MB of
> data. So you don't need *250000* GB of temp space on a node at all. A
> few gigs of free space is more than enough for any MR task.
>
> Regards
> Bejoy KS
>
>   ------------------------------
> *From:* Ali Safdar Kureishy <safdar.kureishy@gmail.com>
> *To:* user@hive.apache.org
> *Sent:* Monday, May 7, 2012 1:01 PM
> *Subject:* Storage requirements for intermediate (map-side-output) data
> during Hive joins
>
> Hi,
>
> I'm setting up a Hadoop cluster and would like to understand how much disk
> space I should expect to need with joins.
>
> Let's assume that I have 2 tables, each of about 500 GB. Since the tables
> are large, these will all be reduce-side joins. As far as I know about such
> joins, the data generated is a cross product of the size of the two tables.
> Am I wrong?
>
> In other words, for a reduce-side join in Hive involving 2 such tables,
> would I need to accommodate 500 GB * 500 GB = *250000* GB of
> *intermediate* (map-side output) data before the reducer(s) kick in in my
> cluster? Or am I missing something? That seems ridiculously high, so I hope
> I'm mistaken.
>
> But if the above IS accurate, what are the ways to reduce this consumption
> for the same kind of join in Hive?
>
> Thanks,
> Safdar
>
>
>
