hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bejoy Ks <>
Subject Re: Storage requirements for intermediate (map-side-output) data during Hive joins
Date Tue, 08 May 2012 09:10:54 GMT
Hi Ali

     Sorry my short response may have got you confused. Let us assume you are doing a LeftOuterJoin
on two tables 'A' and 'B' on a column 'id' (table are large so that only reduce side joins
are only possible) then from my understanding this is how it should happen (explanation based
on a simple join)
- Two sets of mappers would be there one set to process table A, one set to process table
B. The mapper semits it output with and as the keys and the required columns from
the two tables as values.
- On the reducer the values for common keys from both the mappers come together. if it is
like A LEFT OUTER JOIN B, and if there are n matching values in the values List for table
B, then there will be n rows in the output. (matching based on equality conditions provided
in ON clause)

Now if table A has 100 columns and table B has 150 columns, but in the final result you
just need 25 columns. In that case the data is filtered in map itself and the map output has
less data volume (map output gets spilled to lfs ) . So the amount of data that goes into
local disk is based mostly on your join query than the input data. Now the intermediate output
is a well optimized and serialized form for hadoop - Writables ,so it occupies lesser memory compared
to actual input data in hdfs. The map output tmp dir in a node has only the data processed
by that node, so if you take the whole data volume only part of the same will be on every
node. That is one of the advantages of distributed processing.

To add on, you can still optimize the size of intermediate output by enabling map output compression.

Bejoy KS

 From: Ali Safdar Kureishy <>
To: Bejoy Ks <> 
Cc: "" <> 
Sent: Tuesday, May 8, 2012 12:28 PM
Subject: Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Hi Bejoy,

Thanks....I see...I was asking because I wanted to know how much total storage space I would
need on the cluster for the given data in the tables.

Are you saying that for 2 tables of 500 Gb each (spread across the cluster), there would be
a need for intermediate storage of 250000 GB? Or are you saying that it is the sum total of
all data processing that happens, but is not actually stored? I'm guessing you were referring
to the latter, because the former seems unscalable.


On Mon, May 7, 2012 at 10:44 AM, Bejoy Ks <> wrote:

Hi Ali
>      The 500*500 Gigs of data is actually processed by multiple tasks across multiple
nodes. In default settings a task will process 64Mb of data per task. So you don't need 250000 GB
temp space in a node at all . A few gigs of free space is more than enough for any MR task
>Bejoy KS
> From: Ali Safdar Kureishy <>
>Sent: Monday, May 7, 2012 1:01 PM
>Subject: Storage requirements for intermediate (map-side-output) data during Hive joins
>I'm setting up a Hadoop cluster and would like to understand how much disk space I should
expect to need with joins.
>Let's assume that I have 2 tables, each of about 500 GB. Since the tables are large, these
will all be reduce-side joins. As far as I know about such joins, the data generated is a
cross product of the size of the two tables. Am I wrong?
>In other words, for a reduce-side join in Hive involving 2 such tables, would I need to
accommodate for 500 GB * 500 GB = 250000 GB of intermediate (map-side output) data before
the reducer(s) kick-in in my cluster? Or am I missing something? That seems rediculously high,
so I hope I'm mistaken.
>But if the above IS accurate, what are the ways to reduce this consumption for the same
kind of join in Hive?
View raw message