hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lefty Leverenz <leftylever...@gmail.com>
Subject Re: Hive Buckets and Select queries
Date Thu, 31 Dec 2015 08:55:08 GMT
cc:  user@hive.apache.org

-- Lefty



On Mon, Dec 28, 2015 at 11:00 PM, Varadharajan Mukundan <
srinathsmn@gmail.com> wrote:

> Hi All,
>
> Say i have a table with below schema:
>
> CREATE TABLE foo (id INT) CLUSTERED BY (id) INTO 8 BUCKETS STORED AS ORC;
>
> and when we issue the following query, its doing a "Full table scan"
>
> SELECT * FROM foo WHERE id=<some integer>
>
> After doing some searching on the net, i found that "table sample" seems to
> be one of the ways to resolve this and the query would be written in this
> manner:
>
> SELECT * FROM foo TABLE SAMPLE(BUCKET <bucket id of some_integer: say 2> of
> 8) where id=<some_integer>
>
> I was expecting the above query to read only the 2nd bucket but to my
> surprise it did full table scan again. I understand that partitioning in
> hive is the way to go for such queries by i have two points on why we need
> such filtering techniques for buckets as well.
>
> 1. Partitioning may not be suitable / preferred when there are lots of
> partitions (high cardinality columns)
> 2. Buckets just works out of the box, without specifying things like
> partitioning keys etc.. in the queries like "insert clauses".
>
> I was wondering if there are any technical constraints on why we were not
> able to restrict the scan only to that bucket for such pointed queries?
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>

Mime
View raw message