hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Namit Jain <nj...@facebook.com>
Subject Re: Selecting data based on the clustered columns
Date Thu, 16 Jul 2009 14:19:29 GMT
Right now, bucketing information is not used in a lot of places - it is only used in sampling.
For eg:

If your query was:

Select .. From Posts(tablesample 1 out of 256) a;

Then only the first bucket will be scanned.

Your query can be optimized, but currently it is not. Can you file a jira on that ?
It will help us prioritize this.



-namit


On 7/16/09 3:25 AM, "Deepak A" <deepak.a@gmail.com> wrote:

Hi,

I have the following table in Hive
Posts(Id, UserId, PostDate, ...) CLUSTERED BY (UserId) SORTED BY (PostDate) INTO 256 BUCKETS;

Since the data is hash partitioned based on the 'UserId' column, buckets were created based
on the hash value of 'UserId'.

Now, when I issue a Select query to fetch all the posts by a particular 'UserId ' (say, Select
count(Id) from Posts where UserId=1), does it scan only the bucket to which 'UserId' is hashed
to?. But, when I run this query, I could see all the buckets being searched for the UserId.

Moreover, I see that's there is a way to sample the table based on the buckets. Why can't
hive automatically figure out the bucket to which UserId is hashed to and search only in that
bucket?

Can someone clarify me on this?

Thanks,
Deepak


Mime
View raw message