accumulo-user mailing list archives

From Eric Newton <>
Subject Re: Improving ingest performance [SEC=UNCLASSIFIED]
Date Wed, 24 Jul 2013 12:35:52 GMT
Assuming that 5 billion records means 5 billion Key/Values, this is nearly
100K K-V/sec/node, which isn't so bad.  If the key/values are small and
uniformly distributed, 200K is closer to the rate you can expect given
decent drives.
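
The per-node figure can be checked with a quick calculation. This is only a sketch using the numbers from the quoted message (5 billion records, 70 minutes, 12 nodes) and assuming one Key/Value per record:

```python
# Numbers taken from the quoted message; one Key/Value per record is assumed.
records = 5_000_000_000
minutes = 70
nodes = 12

cluster_rate = records / (minutes * 60)   # Key/Values per second, whole cluster
per_node_rate = cluster_rate / nodes      # Key/Values per second, per node

print(f"cluster:  {cluster_rate:,.0f} K-V/sec")
print(f"per node: {per_node_rate:,.0f} K-V/sec")
```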

If you are only concerned with ingest, crank up the size of the in-memory
map (tserver.memory.maps.max) and increase the major compaction ratio
(table.compaction.major.ratio) from the default of 3 to 5, or even as high
as 10.  This will reduce the number of re-writes of your data.  If you
don't care about possible data loss, turn off the write-ahead log on your
table, or reduce the replication factor for the write-ahead log.
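
As a rough illustration, the settings above could be turned into Accumulo shell "config" commands like this.  The table name "mytable", the 4G map size, and the helper function are assumptions made up for the example, not recommendations; the right map size depends on how much memory you can give the tablet server.

```python
def config_commands(table, props):
    """Render Accumulo shell 'config' commands for the given properties.

    Hypothetical helper: table.* properties are set per table with -t,
    anything else is set system-wide.
    """
    cmds = []
    for name, value in props.items():
        scope = f"-t {table} " if name.startswith("table.") else ""
        cmds.append(f"config {scope}-s {name}={value}")
    return cmds


# Illustrative values only (assumptions, not tuned for any real cluster).
ingest_props = {
    "tserver.memory.maps.max": "4G",        # larger in-memory map
    "table.compaction.major.ratio": "5",    # default is 3
    "table.walog.enabled": "false",         # only if data loss is acceptable
}

for cmd in config_commands("mytable", ingest_props):
    print(cmd)
```

The printed lines can be pasted into the Accumulo shell.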

Make sure your table is pre-split, if possible, to maximize parallel
performance during initial ingest.  Aim for 10-50 tablets per server.
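
One way to come up with split points is sketched below.  It assumes row IDs start with a uniformly distributed fixed-width hex prefix (for example a hash), which may not match your keys; the function name and the choice of 20 tablets per server are just for illustration.

```python
def hex_split_points(num_tablets, width=4):
    """Return num_tablets - 1 evenly spaced hex split points.

    Assumes row IDs begin with a uniformly distributed, zero-padded,
    lowercase hex prefix of the given width.
    """
    space = 16 ** width
    return [format(i * space // num_tablets, f"0{width}x")
            for i in range(1, num_tablets)]


servers = 12
tablets_per_server = 20          # somewhere in the 10-50 range
splits = hex_split_points(servers * tablets_per_server)

print(len(splits), "split points, first few:", splits[:3])
```

The resulting list can be written one split per line to a file and handed to the shell's addsplits command, or passed to the table operations API.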

If the latency of waiting for the data to be prepped does not bother you,
it is almost always more efficient to use bulk ingest.  Can you wait 30
minutes to queue up enough data, and then another 5-15 minutes for the
map/reduce job to produce the RFiles?

There's a fair amount of overhead to starting a mapper.  You may want to
experiment with larger map jobs.
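
To get a feel for how much work each mapper is doing now, here is a back-of-the-envelope estimate from the numbers in the quoted message (2600 mappers, 288 running at once, 70 minutes total).  It assumes all 288 slots stay busy for the whole job, so treat the results as rough.

```python
# Numbers from the quoted message; full slot utilization is an assumption.
total_mappers = 2600
concurrent_mappers = 288
records = 5_000_000_000
job_minutes = 70

waves = total_mappers / concurrent_mappers                # rounds of mappers
records_per_mapper = records / total_mappers
avg_mapper_minutes = job_minutes * concurrent_mappers / total_mappers

print(f"waves of mappers:   {waves:.2f}")
print(f"records per mapper: {records_per_mapper:,.0f}")
print(f"avg mapper runtime: {avg_mapper_minutes:.2f} min")
```

Halving the number of mappers would double the records each one handles and halve the number of startups.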


On Wed, Jul 24, 2013 at 2:26 AM, Dickson, Matt MR <> wrote:

> Hi,
> I'm trying to improve ingest performance on a 12 node test cluster.
> Currently I'm loading 5 billion records in approximately 70 minutes which
> seems excessive.  Monitoring the job there are 2600 map jobs (there is no
> reduce stage, just the mapper) with 288 running at any one time.  The
> performance seems slowest in the early stages of the job, prior to minor or
> major compactions occurring.  Each server has 48 GB memory and currently the
> Accumulo settings are based on the 3GB settings in the example config
> directory, i.e. tserver.memory.maps.max = 1GB,
> and  All other settings on the table are
> default.
> Questions.
> 1. What is Accumulo doing in the initial stage of a load and which
> configurations should I focus on to improve this?
> 2. At what ingest rate should I consider using the bulk ingest process
> with rfiles?
> Thanks
> Matt