kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: How to manage yearly range partition efficiently
Date Thu, 08 Jun 2017 21:59:45 GMT
On Wed, Jun 7, 2017 at 7:44 PM, Jason Heo <jason.heo.sde@gmail.com> wrote:

> Hi.
>
> This is a partition strategy of my table.
>
> PARTITION BY HASH (...) PARTITIONS 40,
> RANGE (ymd) (
>     PARTITION VALUES < "2015",
>     PARTITION "2015" <= VALUES < "2016",
>     PARTITION "2016" <= VALUES < "2017",
>     PARTITION "2017" <= VALUES
> )
>
> My concern is that how to manage RANGE(ymd) partitions for greather than
> 2017.
>
> plan 1) using a cron job, add 2018 partition at the end of 2017, add 2019
> partition at the end of 2018, ...
>     - pros: no unused partitions
>     - cons: problems arise if next year's partition is not created by
> mistake
> plan 2) add all upcoming 10 years' partitions
>     - pros: can reduce risks
>     - cons: 400 partitions (40*10 years) are created but they has no data
>
> I prefer to plan 2) but I'm wondering what many unnecessarily partitions
> lead to problems.
>

The empty partitions will increase heartbeat traffic on your nodes, etc.
It's not major, but does add a bit of overhead. They will also participate
in any queries you run which are not able to prune partitions based on
range. Even though they have no data, there would be unnecessary
Spark/impala tasks/fragments running against the empty partitions, etc,
which may impact performance and concurrency.

I think I'd suggest plan 1, plus also put it on several people's calendars
to verify :) Alternatively, something like in 2017 add the partitions for
2018 and 2019, so you always maintain one extra year ahead and you are less
likely to "not notice" if the new one is not created in time.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Mime
View raw message