hive-user mailing list archives

From "Bejoy KS" <bejoy...@yahoo.com>
Subject Re: When/how to use partitions and buckets usefully?
Date Mon, 23 Apr 2012 15:31:50 GMT
Partitions are good when you want to run your queries on a subset of the whole data, so the
choice of partition column depends on your queries. One point to take care of is that every
partition has enough data, since many small partitions mean many small files and extra overhead.
Partitioning takes effect when you filter on the partition column in a WHERE clause.
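
A minimal HiveQL sketch of the partitioning advice, assuming the visit_stats table from the question quoted below. The column types, the MAP column name visit_info, and the partition column name visit_date (used instead of "date" to avoid clashing with the DATE keyword) are assumptions for illustration, not the poster's actual schema.

```sql
-- Hypothetical schema: types and the visit_info column are assumptions.
CREATE TABLE visit_stats (
  member_id  INT,
  visit_info MAP<STRING, INT>
)
PARTITIONED BY (visit_date STRING);

-- The WHERE clause filters on the partition column, so Hive only needs
-- to read the data stored under the matching partition.
SELECT member_id, COUNT(*) AS visits
FROM visit_stats
WHERE visit_date = '2012-04-23'
GROUP BY member_id;
```

A query without a filter on visit_date still scans every partition, so partitioning on the date only pays off for queries that restrict the date.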

Buckets are good for sampling and joins like bucketed map joins.
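
A rough HiveQL sketch of both uses, again with assumed names and types. The table name visit_stats_bucketed, the bucket count of 32, and the column types are illustrative choices only. The two SET properties apply to Hive releases of this era; hive.enforce.bucketing was removed in later versions, where bucketing is always enforced.

```sql
-- Hypothetical schemas: both tables are bucketed on the join key.
-- For a bucketed map join the bucket counts must be equal or multiples
-- of each other.
CREATE TABLE member_map (
  member_id INT,
  gender    STRING,
  age       INT
)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

CREATE TABLE visit_stats_bucketed (
  member_id  INT,
  visit_info MAP<STRING, INT>
)
PARTITIONED BY (visit_date STRING)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

-- Make inserts respect the bucket definition, and let the optimizer
-- use the bucketing when it plans the map join.
SET hive.enforce.bucketing = true;
SET hive.optimize.bucketmapjoin = true;

-- Bucketed map join: each mapper only loads the matching bucket of m.
SELECT /*+ MAPJOIN(m) */ v.visit_date, m.gender, COUNT(*) AS visits
FROM visit_stats_bucketed v
JOIN member_map m ON v.member_id = m.member_id
GROUP BY v.visit_date, m.gender;

-- Sampling: read one bucket out of 32 instead of the whole table.
SELECT *
FROM member_map TABLESAMPLE(BUCKET 1 OUT OF 32 ON member_id);
```

The tables have to be populated with INSERT ... SELECT while bucketing is enforced; files loaded directly into the table directory are not rearranged into buckets.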

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Ruben de Vries <ruben.devries@hyves.nl>
Date: Mon, 23 Apr 2012 17:19:00 
To: user@hive.apache.org<user@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: When/how to use partitions and buckets usefully?

It seems there's enough information to be found on how to set up and use partitions and buckets.
But I'm more interested in how to figure out when and what columns you should be partitioning
and bucketing to increase performance?!

In my case I have 2 tables: visit_stats (member_id, date and some MAP cols which give me
info about the visits) and member_map (member_id, gender, age).

Usually I group by date and then one of the other cols, so I assume that partitioning on date
is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can
that be fixed by bucketing both tables? Or just one of them?

Maybe some ppl have written good blogs on this subject but I can't really seem to find them!?

Any help would be appreciated, thanks in advance :)
