accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Slater, David M." <>
Subject RE: Straggler problem in Accumulo BatchScans
Date Wed, 21 Aug 2013 23:47:09 GMT
Hi James,

I already had the data sharded into 7 partitions, and that works well to distribute the data
into 7 tablets. (I have 2 GB tablet sizes, with about 1.2 TB of data, so there are numerous
tablets per server). The difficulty is that Accumulo seems to decide for itself where each
tablet goes. When I only had 10 GB of data, it nicely divided into 7 tablets, one on each
node. However, with dozens of tablets per tablet server, it assigns tablets to tablet servers
orthogonally to my presplits.

Is there a way to force Accumulo to keep specific ranges on specific nodes? If not, I suppose
that I could have 4x or more shards per tablet server to ensure that it was more uniformly


From: James Hughes []
Sent: Wednesday, August 21, 2013 7:29 PM
Subject: Re: Straggler problem in Accumulo BatchScans

Each tablet is hosted by one tablet server, and there's no way around that.  (This is actually
quite reasonably; otherwise, we would receive duplicate results from multiple tablet servers.)
One strategy to deal with imbalanced data is to add a random partition prefix to your row
Ids.  This does complicate building queries, but in general, you'll be able to leverage all
of your nodes.  I did some testing with the nodes of such random shard ids, and it seems like
having 1-2x as many shards as tablet servers worked pretty well.  (I'd suggest 2x in case
you ever grow your cloud.)
In particular, if you can reingest your data, prepend a random "01-14~" to the beginning of
each row Id, and see if that helps.  After that, you can "help" Accumulo decide where it should
split tablets with addSplits 01 02 <etc> 14 from the Accumulo shell (or programmatically
with the addSplits).  After that, you can make sure that your 14+ splits are distributed across
the 7 nodes in a reasonable way.
I hope that helps,


On Wed, Aug 21, 2013 at 7:09 PM, Slater, David M. <<>>
Hey, I have a 7 node network running accumulo 1.4.1 and hadoop 1.0.4.

When I run large BatchScanner operations, the number of tablets scanned per node is not uniform,
leading to the overloaded nodes taking much longer to finish than the others. For queries
that require all of the scans to finish before returning, this is a major latency issue. What
are some practical means of load-balancing this to reduce delay?

Is it possible for tablets to be hosted on multiple tablet servers, up to the replication
factor of the underlying hdfs? Are there reasons this might be an undesirable design?

Thanks in advance,

View raw message