kudu-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Number of buckets
Date Mon, 19 Jun 2017 22:04:52 GMT
Hi Pavel,

On Mon, Jun 19, 2017 at 6:12 AM, Pavel Martynov <mr.xkurt@gmail.com> wrote:

> Hi!
>
> I can't find any generic recommendations to choose a number of buckets in
> single-level hash partitioning.
>
> All that I found:
> * "For large tables, prefer to use roughly 10 partitions per server in the
> cluster".
> https://impala.incubator.apache.org/docs/build/html/topics/impala_kudu.html#kudu_partitioning__kudu_hash_partitioning
> BTW, why 10? Looks like magic number for me :).
>

Best I can tell that's as generic as it can get! :)


> * Some recommendations:
> https://kudu.apache.org/docs/known_issues.html#_scale
>

This page goes into more detail on how partitioning works in Kudu:
http://kudu.apache.org/docs/schema_design.html#partitioning


>
> My use case: accumulate up to 500GB-1TB of day data and run some
> aggregation with Spark on that data at day end.
>
> What values should the number of buckets depend on? A number of servers,
> a number of disks (I use HDDs without any RAID), a number of CPU cores?
>

For the total number of tablets, a good starting point is somewhere between
10 per host and the number of CPUs per host, in each case multiplied by the
number of hosts, as long as it doesn't bust the other limits. Basically you
want the data distributed enough that you can scan efficiently in parallel,
but not so much that it starts to affect insert speed (because the clients
buffer edits per tablet up to a certain size).
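
As a rough illustration of that arithmetic, here is a minimal Python sketch.
The function name, its parameters, and the 5-host / 16-core example are
made up for illustration, and it reads the rule of thumb as "10 per host" on
the low end and "CPUs per host" on the high end. It does not check the
other limits mentioned above.

```python
def tablet_count_range(num_hosts, cpus_per_host, min_per_host=10):
    """Return (low, high) starting bounds for the total number of tablets.

    Hypothetical helper: low end is `min_per_host` tablets per host,
    high end is one tablet per CPU core per host.
    """
    low = min_per_host * num_hosts
    high = cpus_per_host * num_hosts
    # If a host has fewer CPUs than `min_per_host`, the two bounds cross,
    # so return them in ascending order.
    return (min(low, high), max(low, high))


# Example cluster (assumed numbers): 5 hosts with 16 cores each.
low, high = tablet_count_range(num_hosts=5, cpus_per_host=16)
print(low, high)  # 50 80
```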

Do you want to query only that day's data, or all of the data, at the end
of the day? If the former, it would be preferable to add range partitioning
by time on top of the hash partitioning, in which case the number of buckets
should be closer to the number of hosts.
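
To see why the bucket count can drop in that case, here is a small Python
sketch of how the tablet count grows when hash and range partitioning are
combined. It assumes one range partition per day and a bucket count equal to
the number of hosts; the function name and the example numbers are
hypothetical, and the right values depend on your cluster.

```python
def hash_plus_range_layout(num_hosts, num_days):
    """Estimate tablet counts for hash partitioning combined with daily ranges.

    Assumptions (not from Kudu itself): one range partition per day, and
    the number of hash buckets set equal to the number of hosts.
    """
    buckets = num_hosts
    return {
        "hash_buckets": buckets,
        # A query restricted to one day only touches that day's buckets.
        "tablets_per_day": buckets,
        # Every range partition is split into `buckets` tablets.
        "total_tablets": buckets * num_days,
    }


# Example (assumed numbers): 5 hosts, 30 daily range partitions.
layout = hash_plus_range_layout(num_hosts=5, num_days=30)
print(layout["tablets_per_day"], layout["total_tablets"])  # 5 150
```

The total grows with every new range partition, so it is the total, not the
per-day count, that has to stay within the scale limits.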


>
> Any suggestions?
>
> --
> with best regards, Pavel Martynov
>
