kudu-user mailing list archives

From Mauricio Aristizabal <mauri...@impact.com>
Subject Re: Best way to merge range partitions
Date Wed, 01 May 2019 20:31:10 GMT
Sorry William, yes, 600 is not an official recommendation. I've just adopted
it as a soft target to stay close to, given the error you get when creating a
table with more ("NonRecoverableException: The requested number of tablets is
over the maximum permitted at creation time (580). Additional tablets may be
added by adding range partitions to the table post-creation.")

But the question still stands, as I would like to stay comfortably under the
limits in https://kudu.apache.org/docs/known_issues.html and we'll be
adding ~100 big tables when all is said and done (on a 30-node cluster, at
least for now).
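
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. It assumes 3x replication, replicas spread evenly across the 30
tablet servers, and 18 months approximated as 540 days; none of those are
measured values. The function names and the 60-day "hot" window in the merged
layout are placeholders I made up for illustration.

```python
import math


def range_count(total_days, range_days):
    """Range partitions needed to cover total_days at a given width."""
    return math.ceil(total_days / range_days)


def tablets(range_partitions, hash_buckets):
    """Each range partition is split into hash_buckets tablets."""
    return range_partitions * hash_buckets


def replicas_per_tserver(tablet_count, replication=3, tservers=30):
    """Average replicas per tablet server, assuming an even spread."""
    return tablet_count * replication / tservers


TOTAL_DAYS = 18 * 30   # ~18 months of retained data (approximation)
HASH_BUCKETS = 25      # hash partitions per range, as in the original post

# Layout A: every range partition is 2 weeks wide.
uniform = tablets(range_count(TOTAL_DAYS, 14), HASH_BUCKETS)

# Layout B: last ~2 months stay at 2 weeks, older data merged to 2-month ranges.
HOT_DAYS = 60          # assumed hot window, not a number from this thread
merged = tablets(
    range_count(HOT_DAYS, 14) + range_count(TOTAL_DAYS - HOT_DAYS, 60),
    HASH_BUCKETS,
)

if __name__ == "__main__":
    for name, count in (("uniform", uniform), ("merged", merged)):
        print(name, count, "tablets per table,",
              replicas_per_tserver(count), "replicas per tserver per table")
```

Multiplying the per-table figure by the number of tables gives the total to
compare against the per-server replica guidance.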

Thanks for validating that's the right approach at the moment.
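
For reference, the copy-out / drop / re-add / copy-back sequence could be
scripted roughly as below. This is only a sketch that builds the Impala
statements as strings; it does not run them. The table name, column name,
side-table suffix and integer range bounds are all hypothetical, and it
assumes the table is range-partitioned on a single integer column (e.g. a day
number).

```python
def merge_statements(table, column, ranges, tmp_suffix="_merge_tmp"):
    """Build Impala SQL to merge contiguous Kudu range partitions.

    ranges is an ordered list of (lower_inclusive, upper_exclusive) bounds.
    All names here are illustrative; adapt them to the real schema.
    """
    if len(ranges) < 2:
        raise ValueError("need at least two ranges to merge")
    # The ranges must be contiguous: each upper bound equals the next lower.
    for (_, upper), (lower, _) in zip(ranges, ranges[1:]):
        if upper != lower:
            raise ValueError(f"ranges are not contiguous at {upper} / {lower}")

    lo, hi = ranges[0][0], ranges[-1][1]
    tmp = table + tmp_suffix

    # 1. Copy the cold rows out to a Parquet side table.
    stmts = [
        f"CREATE TABLE {tmp} STORED AS PARQUET AS "
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} < {hi}"
    ]
    # 2. Drop the small range partitions (this deletes their rows in Kudu).
    stmts += [
        f"ALTER TABLE {table} DROP RANGE PARTITION {a} <= VALUES < {b}"
        for a, b in ranges
    ]
    # 3. Add one wide range partition covering the same bounds.
    stmts.append(f"ALTER TABLE {table} ADD RANGE PARTITION {lo} <= VALUES < {hi}")
    # 4. Copy the rows back in, then drop the side table.
    stmts.append(f"INSERT INTO {table} SELECT * FROM {tmp}")
    stmts.append(f"DROP TABLE {tmp}")
    return stmts


if __name__ == "__main__":
    for s in merge_statements("events", "day", [(0, 14), (14, 28), (28, 42)]):
        print(s + ";")
```

I'd verify the side table's row count before running the DROP statements,
since dropping a range partition deletes its rows, and any ingest into that
range will fail until the new partition has been added.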

This would be a very nice feature to add IMHO.  Conceptually it seems
relatively simple, especially if the agreed limitation is that tablets for
the ranges being merged go read-only and ingest ops on them fail (we'd be
merging cold data partitions, so that's fine).

On Wed, May 1, 2019 at 12:55 PM William Berkeley <wdberkeley@cloudera.com>
wrote:

> Where's the 600 tablet count recommendation sourced from? Is that
> pre-replication and per-tserver, so there's 1800 replicas per tablet
> server? We recommend 1000-2000 replicas per server.
> As for your strategy for merging range partitions, I think it's the best
> available at this point.
> -Will
> On Tue, Apr 30, 2019 at 1:23 PM Mauricio Aristizabal <mauricio@impact.com>
> wrote:
>> I'm doing the delicate dance of maximizing ingest by having enough
>> current hash partitions (say 25), minimizing query runtime by having range
>> partitions that roughly match most report runs (say 2 weeks), while keeping
>> tablet count not far above the 600 recommended, and supporting at least 18
>> months of data.
>> I'm thinking of a strategy of routinely merging older cold data range
>> partitions into bigger ones (say 2 months instead of 2 weeks), and leverage
>> the reduced overall tablet count to increase the hash buckets.
>> It would be really nice if there was a Kudu CLI 'merge_range_partition'
>> command (ranges would need to be contiguous).  It would greatly simplify
>> optimization of time-series data structures.
>> So instead I'm planning on copying the range partitions' data to a
>> parquet side table, dropping the partitions, creating a single one, and
>> copying the data back in.
>> Any better approach I can use currently?
>> Using CDH 5.15 Impala 2.13 Kudu 1.7
>> Thanks in advance,
>> -m
>> --
>> Mauricio Aristizabal
>> Architect - Data Pipeline
>> mauricio@impact.com | 323 309 4260
>> https://impact.com
>> <https://www.linkedin.com/company/impact-martech/>
>> <https://www.facebook.com/ImpactMarTech/>
>> <https://twitter.com/impactmartech>
>> <https://www.youtube.com/c/impactmartech>

Mauricio Aristizabal
Architect - Data Pipeline
mauricio@impact.com | 323 309 4260
