hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Varadharajan Mukundan <srinath...@gmail.com>
Subject Hive Buckets and Select queries
Date Tue, 29 Dec 2015 07:00:01 GMT
Hi All,

Say i have a table with below schema:

CREATE TABLE foo (id INT) CLUSTERED BY (id) INTO 8 BUCKETS STORED AS ORC;

and when we issue the following query, its doing a "Full table scan"

SELECT * FROM foo WHERE id=<some integer>

After doing some searching on the net, i found that "table sample" seems to
be one of the ways to resolve this and the query would be written in this
manner:

SELECT * FROM foo TABLE SAMPLE(BUCKET <bucket id of some_integer: say 2> of
8) where id=<some_integer>

I was expecting the above query to read only the 2nd bucket but to my
surprise it did full table scan again. I understand that partitioning in
hive is the way to go for such queries by i have two points on why we need
such filtering techniques for buckets as well.

1. Partitioning may not be suitable / preferred when there are lots of
partitions (high cardinality columns)
2. Buckets just works out of the box, without specifying things like
partitioning keys etc.. in the queries like "insert clauses".

I was wondering if there are any technical constraints on why we were not
able to restrict the scan only to that bucket for such pointed queries?

-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message