accumulo-user mailing list archives

From Eric Newton
Subject Re: Efficient Tablet Merging [SEC=UNOFFICIAL]
Date Wed, 02 Oct 2013 13:05:10 GMT
The most efficient way is kind of scary.  If this is a production system, I
would not recommend it.

First, find out the size your merged (10x) tablets would be.  Let's say it's
10G.  Set your split threshold (table.split.threshold) to 10G.  Then merge
all of the old tablets into one tablet.  This will dump thousands of files
into a single tablet, but it will soon split out again into the nice 10G
tablets you are looking for.  The system will probably be unusable during
this operation.
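
For a rough sense of scale, the arithmetic below takes the figures from the
original question (120K tablets of about 1G each on 56 nodes) and the 10G
target used as an example above.  It assumes every tablet is close to the
stated size and that the whole table ends up at the target size, neither of
which will hold exactly on a real cluster.

```python
# Back-of-the-envelope tablet counts.
# Assumption: every tablet is roughly the stated size (real tablets vary).
current_tablets = 120_000   # from the original question
current_size_gb = 1         # from the original question
target_size_gb = 10         # example split threshold from this reply
nodes = 56                  # from the original question

# Total data volume implied by the current tablet count and size.
total_gb = current_tablets * current_size_gb

# Tablet count if the same data were held in tablets of the target size.
target_tablets = total_gb // target_size_gb

print(f"tablets now: {current_tablets} (~{current_tablets // nodes} per node)")
print(f"tablets at {target_size_gb}G: {target_tablets} (~{target_tablets // nodes} per node)")
```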

The more conservative way is to specify the merge in single steps (the
master will only coordinate a single merge on a table at a time anyhow).
You can do it by range or by size; I would do it by size, especially if
you are aging off your old data.
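
If you do go by range, the individual merge steps can be generated rather
than typed by hand.  The following is only a sketch: the function name
merge_commands is made up for illustration, the split naming
(yyyymmdd-nnnn) and the "merge -b ... -e ..." form are taken from the
original question, and it assumes 840 splits per day numbered 0000 to 0839.
Check how your Accumulo version treats the -b and -e boundary rows before
running anything it prints.

```python
def merge_commands(day, n_splits=840, group=10):
    """Build one shell 'merge' command per group of consecutive splits.

    day      -- date prefix of the split names, e.g. "20130901"
    n_splits -- splits per day (assumed to be numbered 0000..n_splits-1)
    group    -- how many consecutive splits to fold into each merge
    """
    cmds = []
    for start in range(0, n_splits, group):
        # The last group may be shorter if n_splits is not a multiple of group.
        end = min(start + group, n_splits) - 1
        cmds.append(f"merge -b {day}-{start:04d} -e {day}-{end:04d}")
    return cmds


if __name__ == "__main__":
    # Print the commands for one day; run them one at a time, since only
    # one merge per table is coordinated at a time anyway.
    for cmd in merge_commands("20130901"):
        print(cmd)
```

Each printed line is meant to be run on its own in the Accumulo shell
against the table in question.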

Compacting the data won't have any effect on the speed of the merge.


On Tue, Oct 1, 2013 at 11:58 PM, Dickson, Matt MR wrote:

> I have a table for which we create splits of the form yyyymmdd-nnnn, where
> nnnn ranges from 0000 to 0840.  The bulk of our data is loaded for the
> current date, with no data loaded for days older than 3 days, so from my
> understanding it would be wise to merge splits older than 3 days in order
> to reduce the overall tablet count.  It would still be optimal to
> maintain some distribution of tablets for a day across the cluster, so I'm
> looking at merging splits in increments of 10, e.g. merge -b 20130901-0000
> -e 20130901-0009, therefore reducing 840 splits per day to 84.
>
> Currently we have 120K tablets (size 1G) on a cluster of 56 nodes, and our
> ingest has slowed as the data quantity and tablet count have grown.
> Initially we were achieving 200-300K, now 50-100K.
>
> My question is, what is the best way to do this merge?  Should we use the
> merge command with the size option set at something like 5G, or maybe use
> the compaction command?
>
> From my tests this process could take some time, so I'm keen to understand
> the most efficient approach.
>
> Thanks in advance,
> Matt Dickson
