accumulo-user mailing list archives

From Christopher <>
Subject Re: Improving ingest performance [SEC=UNCLASSIFIED]
Date Thu, 25 Jul 2013 02:16:49 GMT
Eric, all this info would be great as a FAQ on the website. :)

Christopher L Tubbs II

On Wed, Jul 24, 2013 at 8:35 AM, Eric Newton <> wrote:
> Assuming that 5 billion records means 5 billion Key/Values, this is nearly
> 100K K-V/sec/node, which isn't so bad.  If the key/values are small and
> uniformly distributed, 200K is closer to the rate you can expect given
> decent drives.
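As a rough check of the per-node figure above, the arithmetic can be reproduced from the numbers in Matt's original message (5 billion records, roughly 70 minutes, 12 nodes). This is only a back-of-the-envelope sketch; it assumes one Key/Value per record and an evenly loaded cluster, neither of which is confirmed in the thread.

```python
# Figures taken from the thread: 5 billion Key/Values, ~70 minutes, 12 nodes.
# Assumption: one Key/Value per record and an even spread across nodes.
records = 5_000_000_000
minutes = 70
nodes = 12

cluster_rate = records / (minutes * 60)  # Key/Values per second, whole cluster
per_node_rate = cluster_rate / nodes     # Key/Values per second, per node

# How long the same load would take at the 200K K-V/sec/node rate
# mentioned above as achievable for small, uniformly distributed data.
minutes_at_200k = records / (200_000 * nodes) / 60

print(f"cluster:  {cluster_rate:,.0f} K-V/sec")
print(f"per node: {per_node_rate:,.0f} K-V/sec")
print(f"at 200K K-V/sec/node: about {minutes_at_200k:.0f} minutes")
```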
> If you are only concerned with ingest, crank up the size of the in-memory
> map and increase the compaction ratio from 3 to 5 (or even as high as 10).
> This will reduce the number of re-writes of your data.  If you don't care
> about possible data loss, turn off the write-ahead log on your table, or
> reduce the replication factor for the write-ahead log.
> Make sure your table is pre-split, if possible, to maximize parallel
> performance during initial ingest.  Aim for 10-50 tablets per server.
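To illustrate the pre-split advice, here is a minimal sketch that generates evenly spaced split points for the low end of the suggested range (10 tablets per server on 12 servers). It assumes row IDs start with a uniformly distributed, fixed-width lowercase hex prefix, which is purely hypothetical; real split points should follow the actual key distribution. The helper name `hex_split_points` and the file name `splits.txt` are made up for the example.

```python
def hex_split_points(num_tablets, width=4):
    """Return num_tablets - 1 evenly spaced split points over a
    fixed-width hex keyspace (hypothetical row-ID layout)."""
    if num_tablets < 2:
        return []
    space = 16 ** width
    if num_tablets > space:
        raise ValueError("keyspace too small for that many tablets")
    return [format(i * space // num_tablets, "0{}x".format(width))
            for i in range(1, num_tablets)]

servers = 12
tablets_per_server = 10  # low end of the 10-50 range suggested above
splits = hex_split_points(servers * tablets_per_server)

# One split point per line, a layout that a splits file can use.
with open("splits.txt", "w") as f:
    f.write("\n".join(splits) + "\n")

print(len(splits), "split points, first few:", splits[:3])
```

A file like this could then be applied before ingest starts, for example with the Accumulo shell's `addsplits` command and its splits-file option; check the shell help for the exact flags in your version.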
> If the latency of waiting for the data to be prepped does not bother you, it
> is almost always more efficient to use bulk ingest.   Can you wait 30
> minutes to queue up enough data, and then another 5-15 for the map/reduce
> job to produce the RFiles?
> There's a fair amount of overhead to starting a mapper.  You may want to
> experiment with larger map jobs.
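A rough way to reason about mapper startup overhead, using the numbers from Matt's message (2600 map tasks, 288 running at once). This sketch assumes all task slots stay busy and that work divides evenly, which ignores stragglers; the alternative task count of 576 is an arbitrary example, not a recommendation from the thread.

```python
import math

records = 5_000_000_000  # from the original message
concurrent_tasks = 288   # map tasks running at any one time


def plan(map_tasks):
    """Return (records per task, number of scheduling waves) for a task count."""
    records_per_task = records / map_tasks
    waves = math.ceil(map_tasks / concurrent_tasks)
    return records_per_task, waves


current = plan(2600)  # the job as described
larger = plan(576)    # hypothetical: fewer, larger map tasks (two full waves)

for label, (per_task, waves) in (("current", current), ("larger", larger)):
    print(f"{label}: {per_task:,.0f} records per task, {waves} waves")
```

Fewer waves means fewer mapper startups per slot, at the cost of coarser-grained work units if a task fails and has to be rerun.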
> -Eric
> On Wed, Jul 24, 2013 at 2:26 AM, Dickson, Matt MR
> <> wrote:
>> Hi,
>> I'm trying to improve ingest performance on a 12 node test cluster.
>> Currently I'm loading 5 billion records in approximately 70 minutes which
>> seems excessive.  Monitoring the job there are 2600 map jobs (there is no
>> reduce stage, just the mapper) with 288 running at any one time.  The
>> performance seems slowest in the early stages of the job, prior to minor or
>> major compactions occurring.  Each server has 48 GB memory and currently the
>> accumulo settings are based on the 3GB settings in the example config
>> directory, i.e. tserver.memory.maps.max = 1GB,
>> and  All other settings on the table are
>> default.
>> Questions.
>> 1. What is Accumulo doing in the initial stage of a load and which
>> configurations should I focus on to improve this?
>> 2. At what ingest rate should I consider using the bulk ingest process
>> with rfiles?
>> Thanks
>> Matt
>> IMPORTANT: This email remains the property of the Department of Defence
>> and is subject to the jurisdiction of section 70 of the Crimes Act 1914. If
>> you have received this email in error, you are requested to contact the
>> sender and delete the email.
