hive-user mailing list archives

From Abhishek <>
Subject Cartesian Product in HIVE
Date Sat, 29 Sep 2012 03:56:16 GMT
Hi all,

I have a use case where we are doing a Cartesian product of two tables:
One table with 990k rows
Second table with 20k rows

The query is a Cartesian product of just two columns, so the result comes to around 20 billion rows.

It processes around 5 billion rows per hour, so the whole job takes around 4 hours.
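
As a rough check of the figures above, the arithmetic can be sketched in Python. The variable names are only illustrative, and the throughput is the approximate rate observed, not a precise measurement.

```python
# Figures quoted in the message above.
left_rows = 990_000        # rows in the first table
right_rows = 20_000        # rows in the second table
rows_per_hour = 5e9        # observed throughput, roughly 5 billion rows per hour

total_rows = left_rows * right_rows        # size of the Cartesian product
hours = total_rows / rows_per_hour         # estimated run time in hours
rows_per_second = rows_per_hour / 3600     # the same throughput, per second

print(f"total rows:      {total_rows:,}")          # 19,800,000,000
print(f"estimated hours: {hours:.2f}")             # 3.96
print(f"rows per second: {rows_per_second:,.0f}")  # 1,388,889
```

So the single reducer is currently handling on the order of 1.4 million output rows per second.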

I have overridden some of the properties in Hive:

Set io.sort.mb=512
Set mapred.reduce.tasks=17
Set io.sort.factor=256
Set mapred.child.jvm.opts=-Xmx2048mb
Set hive.exec.parallel=true
Set mapred.tasks.reuse.num.tasks=-1
Set hive.mapred.reduce.speculative.execution=false

How can I optimize it to get better results?

Even though I have set the number of reduce tasks to 17, only one reducer is spawned for the query.
Did I do something wrong?

My cluster configuration is:
20 slave nodes running CDH3u5
240 map slots
120 reduce slots
Block size: 128 MB
Memory per slave node: 96 GB

How can I make the query perform better?

How can I increase the number of rows processed by a reducer at a time, or per second?

Any help is greatly appreciated.
