accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Krishmin Rai <>
Subject Re: Number of partitions for sharded table
Date Tue, 30 Oct 2012 20:38:03 GMT
Thanks, Adam… that's exactly what I was looking for, and gives me a lot to think about.


On Oct 30, 2012, at 4:08 PM, Adam Fuchs wrote:

> Krishmin,
> There are a few extremes to keep in mind when choosing a manual partitioning strategy:
> 1. Parallelism and balance at ingest time. You need to find a happy medium between too
few partitions (not enough parallelism) and too many partitions (tablet server resource contention
and inefficient indexes). Probably at least one partition per tablet server being actively
written to is good, and you'll want to pre-split so they can be distributed evenly. Ten partitions
per tablet server is probably not too many -- I wouldn't expect to see contention at that
> 2. Parallelism and balance at query time. At query time, you'll be selecting a subset
of all of the partitions that you've ever ingested into. This subset should be bounded similarly
to the concern addressed in #1, but the bounds could be looser depending on the types of queries
you want to run. Lower latency queries would tend to favor only a few partitions per node.
> 3. Growth over time in partition size. Over time, you want partitions to be bounded to
less than about 10-100GB. This has to do with limiting the maximum amount of time that a major
compaction will take, and impacts availability and performance in the extreme cases. At the
same time, you want partitions to be as large as possible so that their indexes are more efficient.
> One strategy to optimize partition size would be to keep using each partition until it
is "full", then make another partition. Another would be to allocate a certain number of partitions
per day, and then only put data in those partitions during that day. These strategies are
also elastic, and can be tweaked as the cluster grows.
> In all of these cases, you will need a good load balancing strategy. The default strategy
of evening up the number of partitions per tablet server is probably not sufficient, so you
may need to write your own tablet load balancer that is aware of your partitioning strategy.
> Cheers,
> Adam
> On Tue, Oct 30, 2012 at 3:06 PM, Krishmin Rai <> wrote:
> Hi All,
>   We're working with an index table whose row is a shardId (an integer, like the wiki-search
or IndexedDoc examples). I was just wondering what the right strategy is for choosing a number
of partitions, particularly given a cluster that could potentially grow.
>   If I simply set the number of shards equal to the number of slave nodes, additional
nodes would not improve query performance (at least over the data already ingested). But starting
with more partitions than slave nodes would result in multiple tablets per tablet server…
I'm not really sure how that would impact performance, particularly given that all queries
against the table will be batchscanners with an infinite range.
>   Just wondering how others have addressed this problem, and if there are any performance
rules of thumb regarding the ratio of tablets to tablet servers.
> Thanks!
> Krishmin

View raw message