hive-user mailing list archives

From Rahul Channe <drah...@googlemail.com>
Subject Re: Inserting data in hive bucket
Date Sat, 20 Aug 2016 22:53:20 GMT
Hi Mich,

I want to know if we can drop the data of a particular bucket in Hive.

On Friday, August 19, 2016, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Hash partitioning (Bucketing) does not make much sense for YYYY/MM/DD/32
> as pointed out.
>
> With (mod 32) there are at most 32 offsets, in the range 0-31. With
> YYYY/MM/DD you have to account for hash collisions as well. The set of
> inputs is potentially large (and not known until we have encountered them
> all), and if you want to spread them evenly (after all, that is what hash
> partitioning is all about) then I think day of the month makes more sense.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 19 August 2016 at 23:15, Gopal Vijayaraghavan <gopalv@apache.org> wrote:
>
>>
>> > We are bucketing by date so we will have max 32 buckets
>>
>> If you do want to look up specifically by date, you could just create day
>> partitions and never partition by month.
>>
>> FYI, in a modern version of Hive
>>
>> select count(1) from table where YEAR(dt) = 2016 and MONTH(dt) = 12
>>
>> does prune it on the client side.
>>
>> On a different note, 31 buckets is a bad idea (32 is ok), because 31
>> (32-1) is the magic multiplier in String hashes, which hurts "yyyymmdd"
>> keys, and 50% of your buckets end up with 0 data.
>>
>> http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6
>>
>>
>> Use yyyymmdd as a number instead and you'll get the same number back as
>> the hashcode, so the bucket won't be stable as months change
>> (20160816 % 32 == 16 and 20160716 % 32 == 12).
>>
>> The only way to have buckets correspond to a day is to extract
>> day_of_month as an int and bucket on it with 32 - then bucket1=1,
>> bucket2=2 etc. up to bucket31=31, and bucket0 stays empty (with 31
>> buckets, bucket0 would be 31).
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
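As a rough illustration of Gopal's point about 31 versus 32 buckets for
"yyyymmdd" string keys, here is a small Python sketch. It assumes the string
hash behaves like Java's String.hashCode() (multiplier 31, signed 32-bit
wrap-around) and that the bucket is (hash & Integer.MAX_VALUE) % num_buckets.
Both are assumptions about the Hive version in use, and the helper names are
made up for this example.

```python
from datetime import date, timedelta


def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode(): h = 31*h + c, wrapped to signed 32 bits."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_for(key: str, num_buckets: int) -> int:
    # Assumed bucketing rule: clear the sign bit, then take the remainder.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_buckets


def buckets_used(keys, num_buckets: int) -> set:
    """Return the set of bucket ids that receive at least one key."""
    return {bucket_for(k, num_buckets) for k in keys}


# Every day of August 2016 as a "yyyymmdd" string.
august_2016 = [
    (date(2016, 8, 1) + timedelta(days=i)).strftime("%Y%m%d") for i in range(31)
]

used_31 = buckets_used(august_2016, 31)
used_32 = buckets_used(august_2016, 32)
print(len(used_31), "of 31 buckets used")
print(len(used_32), "of 32 buckets used")
```

Under these assumptions, for this one month of keys, the bucket with 31
buckets is decided by the last digit of the date alone, so most buckets stay
empty. Running the same check against your own keys and bucket count is
probably more useful than the exact numbers here.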

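The numeric case from the quoted message can be checked the same way. This
sketch assumes, as Gopal describes, that an int key hashes to its own value,
so the bucket is just value % num_buckets. The function names are
hypothetical, and day_bucket stands in for extracting day_of_month before
bucketing.

```python
def int_bucket(value: int, num_buckets: int) -> int:
    # Assumption: an int column hashes to itself, so bucket = value mod N.
    return value % num_buckets


def day_bucket(yyyymmdd: int, num_buckets: int = 32) -> int:
    # Bucket on the day of month only (the last two digits of yyyymmdd).
    return int_bucket(yyyymmdd % 100, num_buckets)


# Same day of month, different months: the raw yyyymmdd buckets differ.
print(int_bucket(20160816, 32))  # 16
print(int_bucket(20160716, 32))  # 12

# Bucketing on the day of month gives the same bucket for both dates.
print(day_bucket(20160816), day_bucket(20160716))  # 16 16
```

The first pair matches the arithmetic in the message. The second shows why
bucketing on the day of month keeps a given day in the same bucket from one
month to the next.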