hive-user mailing list archives

From Rahul Channe <drah...@googlemail.com>
Subject Re: Inserting data in hive bucket
Date Mon, 22 Aug 2016 13:26:18 GMT
Thank you for the responses.

On Sunday, August 21, 2016, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Hi Rahul,
>
> I don't believe you can drop a particular bucket in Hive.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 20 August 2016 at 23:53, Rahul Channe <drahulc@googlemail.com> wrote:
>
>> Hi Mich,
>>
>> I want to know if we can drop the data of a particular bucket in Hive.
>>
>> On Friday, August 19, 2016, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>
>>> Hash partitioning (Bucketing) does not make much sense for YYYY/MM/DD/32
>>> as pointed out.
>>>
>>> With (mod 32), the number of distinct bucket offsets is at most 32,
>>> i.e. in the range 0-31. With YYYY/MM/DD you have to account for hash
>>> collisions as well. The set of inputs is potentially large (and not
>>> fully known until we have encountered them all), and if you want to
>>> spread them evenly (after all, that is what hash partitioning is all
>>> about) then I think day of the month makes more sense.
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 19 August 2016 at 23:15, Gopal Vijayaraghavan <gopalv@apache.org>
>>> wrote:
>>>
>>>>
>>>> > We are bucketing by date so we will have max 32 buckets
>>>>
>>>> If you do want to look up specifically by date, you could just create day
>>>> partitions and never partition by month.
>>>>
>>>> FYI, in a modern version of Hive
>>>>
>>>> select count(1) from table where YEAR(dt) = 2016 and MONTH(dt) = 12
>>>>
>>>> does prune it on the client side.
>>>>
>>>> On a different note, 31 buckets is a bad idea (32 is ok), because for
>>>> String hashes (32-1), i.e. 31, is the magic number, which hurts
>>>> "yyyymmdd", and 50% of your buckets have 0 data.
>>>>
>>>> http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6
>>>>
>>>>
>>>> Use yyyymmdd as a number and you'll get the same number back as the
>>>> hashcode, so it won't be stable as months change (20160816 % 32 == 16
>>>> and 20160716 % 32 == 12).
>>>>
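>>>> The only way to have buckets correspond to a day is to store
>>>> day_of_month as an int and bucket on it with 32; then bucket1 = day 1,
>>>> bucket2 = day 2, ..., bucket31 = day 31, and bucket0 stays empty (with
>>>> 31 buckets, day 31 would land in bucket0 instead).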
>>>>
>>>> Cheers,
>>>> Gopal
>>>>
>>>>
>>>>
>>>
>
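
The bucket arithmetic discussed in the thread can be checked with a small Python sketch. It is an emulation, not Hive code, and it rests on assumptions that are not stated in the thread: that the Hive version in use assigns a row to bucket (hash & Integer.MAX_VALUE) % num_buckets, that an INT column hashes to its own value, and that a STRING column uses the same 31-based polynomial hash as Java's String.hashCode(). All function names below are hypothetical.

```python
from datetime import date, timedelta

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def bucket_for_hash(hash_code: int, num_buckets: int) -> int:
    """Assumed bucketing rule: (hash & Integer.MAX_VALUE) % num_buckets."""
    return (hash_code & INT_MAX) % num_buckets


def int_hash(value: int) -> int:
    """Assumption: an INT column hashes to its own value (kept to 32 bits)."""
    return value & 0xFFFFFFFF


def string_hash(text: str) -> int:
    """31-based polynomial hash with 32-bit wraparound.

    Returns the unsigned 32-bit pattern; its low 31 bits match what Java's
    String.hashCode() would give, which is all the bucketing rule uses.
    """
    h = 0
    for ch in text:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


def days_of_2016():
    """Yield every calendar day of 2016."""
    d = date(2016, 1, 1)
    while d.year == 2016:
        yield d
        d += timedelta(days=1)


def buckets_used(hashes, num_buckets):
    """Sorted list of the distinct buckets that receive at least one row."""
    return sorted({bucket_for_hash(h, num_buckets) for h in hashes})


if __name__ == "__main__":
    # yyyymmdd stored as an int: the same day of month moves between buckets.
    print(bucket_for_hash(int_hash(20160816), 32))
    print(bucket_for_hash(int_hash(20160716), 32))

    # day_of_month stored as an int with 32 buckets: bucket number == day.
    print([bucket_for_hash(int_hash(day), 32) for day in (1, 2, 31)])

    # "yyyymmdd" stored as a string: count the buckets that get any data.
    strings = [d.strftime("%Y%m%d") for d in days_of_2016()]
    for n in (31, 32):
        used = buckets_used((string_hash(s) for s in strings), n)
        print(n, "buckets declared,", len(used), "buckets used")
```

Running the script prints the two integer cases quoted above (16 and 12), followed by how many of the 31 or 32 declared buckets a year of "yyyymmdd" strings would actually reach under these assumptions.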
