hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Felix.徐 <ygnhz...@gmail.com>
Subject Performance difference between tuning reducer num and partition table
Date Fri, 28 Jun 2013 15:40:40 GMT
Hi all,

Here is the scenario, suppose I have 2 tables A and B, I would like to
perform a simple join on them,

We can do it like this:

INSERT OVERWRITE TABLE C
SELECT .... FROM A JOIN B on A.id=B.id

In order to speed up this query since table A and B have lots of data,
another approach is :

Say I partition table A and B into 10 partitions respectively, and write
the query like this

INSERT OVERWRITE TABLE C PARTITION(pid=1)
SELECT .... FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

then I run this query 10 times concurrently (pid ranges from 1 to 10)

And my question is that , in my observation of some more complex queries,
the second solution is about 15% faster than the first solution,
is it simply because the setting of reducer num is not optimal?
If the resource is not a limit and it is possible to set the proper reducer
nums in the first solution , can they achieve the same performance? Is
there any other fact that can cause performance difference between
them(non-partition VS partition+concurrent) besides the job parameter
issues?

Thanks!

Mime
View raw message