hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dima Datsenko <di...@microsoft.com>
Subject RE: Effecient partitions usage in join
Date Thu, 22 Nov 2012 15:06:45 GMT
Hi Benny,

The udf solution sounds like a plan. Much better than generating hive query with hardcoded
partition out of table B. Can you please provide a sample of what you’re doing there?

Thanks,
Dima

From: Bennie Schut [mailto:bschut@ebuddy.com]
Sent: יום ה 22 נובמבר 2012 16:28
To: user@hive.apache.org
Cc: Dima Datsenko
Subject: RE: Effecient partitions usage in join

Unfortunately at the moment partition pruning is a bit limited in hive. When hive creates
the query plan it decides what partitions to use. So if you put hardcoded list of partition_id
items in the where clause it will know what to do. In the case of a join (or a subquery) it
would have to run the query before it can know what it can prune.  There are obvious solutions
to this but they are simply not implemented at the moment.
Generally speaking people try to work around this by not normalizing the data. So if you plan
on doing a clean star schema with a calendar table then do yourself a favor and but the actual
date in the fact table and not a meaningless key.
It’s also good to realize you can (in some special cases) work around it by using udf’s.
I’ve used it once by creating a udf which produced the current date which I flagged as deterministic
(ugly I know). This causes the planner to run the udf during planning and use the result as
if it’s a constant and thus partition pruning works again. It’s currently the only way
I know to select x days of data with partition pruning working.


From: Dima Datsenko [mailto:dimad@microsoft.com]
Sent: Thursday, November 22, 2012 2:56 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Effecient partitions usage in join

Hi Guys,

I wonder if you could help me.

I have a huge Hive table partitioned by some field. It has thousands of partitions.
Now I have another small table containing tens of partitions id. I’d like to get the data
only from those partitions.

However when I run
Select * from A join B on (A.partition_id = B.partition_id),
It reads all data from A, then from B and on reduce stage performs join.

I tried /*+ MAPJOIN*/ it ran faster sparing reduce operation, but still read the whole A table.

Is there a more efficient way to perform the query w/o reading the whole A content?


Thanks
Dima

Mime
View raw message