hive-user mailing list archives

From KayVajj <>
Subject Question regarding cluster by multiple columns
Date Mon, 27 Jan 2014 04:03:21 GMT

I'm studying bucketed tables as an option for my storage. What would be a
use case where it is useful to cluster by multiple columns?

I'm trying to solve the problem of optimizing a join between two tables.

Let's say Table A has columns (id, country, .....) and Table B has columns
(Id, country....)

Note: A country could have multiple Ids.

Single column clustering

Suppose I cluster both tables by the Id column into 8 buckets.

Table A would have files FileA1, FileA2..FileA8

And similarly Table B would have FileB1..FileB8

In the case of a join on column Id, I would imagine FileA1 would be joined
with FileB1, FileA2 with FileB2, and so on, with the filter on country
applied in each join. This would avoid the need to compare FileA1 with any
file other than FileB1, which is where I see a performance gain.
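
To make this concrete, here is a minimal Python sketch of the pairing I have in mind. It is a simulation, not Hive code: it assumes a row's bucket is the hash of the clustering column modulo the bucket count, and the helper names (bucket_of, bucketize, bucket_join) and the sample rows are made up for illustration.

```python
NUM_BUCKETS = 8

def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    # Assumption: bucket = hash(clustering column) mod bucket count.
    # For small non-negative ints, Python's hash() is the int itself.
    return hash(row_id) % num_buckets

def bucketize(rows, num_buckets=NUM_BUCKETS):
    # rows are (id, country) tuples; returns one list of rows per bucket,
    # standing in for FileA1..FileA8 or FileB1..FileB8.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets

def bucket_join(a_buckets, b_buckets, country):
    # Join bucket i of A only with bucket i of B, filtering on country.
    out = []
    for a_rows, b_rows in zip(a_buckets, b_buckets):
        b_by_id = {}
        for b_id, b_country in b_rows:
            b_by_id.setdefault(b_id, []).append(b_country)
        for a_id, a_country in a_rows:
            if a_country != country:
                continue
            for b_country in b_by_id.get(a_id, []):
                if b_country == country:
                    out.append((a_id, country))
    return out

# Hypothetical sample data.
table_a = [(1, "US"), (2, "US"), (9, "IN"), (10, "US")]
table_b = [(1, "US"), (9, "IN"), (10, "US"), (11, "IN")]

matches = bucket_join(bucketize(table_a), bucketize(table_b), "US")
print(sorted(matches))
```

Because equal ids always hash to the same bucket index in both tables, no match is lost by skipping the other seven files.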

Multiple Column Clustering

How would clustering on two columns, Id and country, play out in this scenario?
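
To show what I am unsure about, here is a small Python sketch. It assumes the bucket is derived from a combined hash of both columns. The combining rule (31 * id + string hash), the deterministic str_hash helper, and the function name combined_bucket are my own assumptions for illustration, not confirmed Hive behavior.

```python
NUM_BUCKETS = 8

def str_hash(s):
    # Deterministic 32-bit string hash (assumed; Python's built-in
    # hash() for str varies between processes, so it is avoided here).
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def combined_bucket(row_id, country, num_buckets=NUM_BUCKETS):
    # Assumption: both clustering columns feed a single hash value,
    # which is then reduced modulo the bucket count.
    h = (31 * row_id + str_hash(country)) & 0x7FFFFFFF
    return h % num_buckets

# Same (id, country) pair: always the same bucket.
same_pair = combined_bucket(1, "US") == combined_bucket(1, "US")

# Same id, different country: the buckets can differ.
us_bucket = combined_bucket(1, "US")
in_bucket = combined_bucket(1, "IN")
print(same_pair, us_bucket, in_bucket)
```

If this assumption holds, rows sharing only an Id are no longer guaranteed to sit in the same bucket, so pairing FileA1 with FileB1 would seem safe only when the join is on both Id and country.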

Your inputs are very much appreciated.

