accumulo-user mailing list archives

From Adam Fuchs
Subject Re: What is the optimal number of tablets for a large table?
Date Wed, 14 Oct 2015 01:15:27 GMT
Here are a few other factors to consider:
1. Tablets may not be used uniformly. If there is a temporal element to the
row key then writes and reads may be skewed to go to a portion of the
tablets. If some tables are big but more archival in nature then they will
skew the stats as well. It's usually good to estimate things like CPU and
RAM usage based on active tablets rather than total tablets, so plan
accordingly.
2. Compactions take longer as tablets grow. Accumulo tends to have a nice,
fairly well-bounded write amplification factor (number of times a key/value
pair goes through major compaction), even with large tablets. However, a
compaction of a 200GB+ tablet can take hours, which makes performance and
availability hard to predict. It's nice to have background
operations split up into manageably small tasks (incidentally, this is
something LevelDB solves by essentially compacting fixed-size blocks rather
than variable-size files). Assuming you have 20TB disk on a beefy node, you
may want 500-1000 tablets on that machine, which is probably much more than
the number of available cores.
3. Query and indexing patterns, such as the wikisearch-style inverted
index, may drive you closer to one tablet per thread, but with range
queries this becomes less important.
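
To make the arithmetic in point 2 concrete, here is a rough sketch of how the
500-1000 figure falls out of a 20TB node if you aim for tablets of roughly
20-40GB each, alongside the threads-based estimate from the original question
and the active-tablet idea from point 1. The function names, the 20-40GB
target sizes, and the core, thread, and active-fraction values are all
assumptions for illustration only; nothing here is an Accumulo API or setting.

```python
# Back-of-the-envelope tablet sizing. All names and numbers are illustrative
# assumptions, not Accumulo configuration properties.

TB = 10**12
GB = 10**9


def tablets_for_disk(disk_bytes, target_tablet_bytes):
    """Tablets needed so that each holds about target_tablet_bytes."""
    # Integer ceiling division, so a partial tablet still counts as one.
    return -(-disk_bytes // target_tablet_bytes)


def tablets_by_threads(cores, worker_threads):
    """Per-server estimate from the question: the lesser of cores and threads."""
    return min(cores, worker_threads)


def active_tablets(total_tablets, active_fraction):
    """Tablets expected to see reads/writes (hypothetical fraction you measure)."""
    return round(total_tablets * active_fraction)


if __name__ == "__main__":
    disk = 20 * TB
    for target_gb in (20, 40):
        count = tablets_for_disk(disk, target_gb * GB)
        print(f"{target_gb}GB tablets on 20TB: {count}")

    # Hypothetical node: 32 cores, 64 worker threads, 10% of tablets active.
    print("threads-based estimate:", tablets_by_threads(32, 64))
    print("active tablets out of 1000:", active_tablets(1000, 0.10))
```

With these assumed numbers the size-based estimate is well above the
threads-based one, which is the gap point 2 describes; the active-tablet count
is the figure you would then use for CPU and RAM planning.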


On Fri, Oct 9, 2015 at 6:53 PM, Jeff Kubina wrote:

> I read the following from the Accumulo manual on tablet merging:
>> Over time, a table can get very large, so large that it has hundreds of
>> thousands of split points. Once there are enough tablets to spread a table
>> across the entire cluster, additional splits may not improve performance,
>> and may create unnecessary bookkeeping.
> So would the optimal number of tablets for a very large table be close to
> the total tservers times the total cores of the machine (or the worker
> threads the tservers are configured to use, whichever is less)?
