Subject: Re: Advice on increasing ingest rate
From: Adam Fuchs
To: user@accumulo.apache.org
Date: Tue, 8 Apr 2014 17:35:51 -0400

Mike,

What version of Accumulo are you using, how many tablets do you have, and how
many threads are you using for the minor and major compaction pools? Also, how
big are the keys and values that you are using?

Here are a few settings that may help you (see the sketch after this list for
one way to apply them):

1. WAL replication factor (tserver.wal.replication). This defaults to 3
   replicas (the HDFS default), but setting it to 2 will give you a
   performance boost without a huge hit to reliability.

2. Ingest buffer size (tserver.memory.maps.max), also known as the in-memory
   map size. Increasing this generally improves the efficiency of minor
   compactions and reduces the number of major compactions required down the
   line. 4-8 GB is not unreasonable.

3. Make sure your WAL settings are such that the size of a log
   (tserver.walog.max.size) multiplied by the number of active logs
   (table.compaction.minor.logs.threshold) is greater than the in-memory map
   size. You probably want to accomplish this by bumping up the number of
   active logs.

4. Increase the buffer size on the BatchWriter that the clients use. This can
   be done with the setBatchWriterOptions method on AccumuloOutputFormat.
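To make that concrete, here is a rough, untested sketch of applying points 1-4
with the 1.5/1.6-era Java API. The instance name, zookeepers, credentials,
table name, and the specific sizes are placeholders you would adjust for your
cluster:

import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.mapreduce.Job;

public class IngestTuningSketch {

  public static void apply(Job job) throws Exception {
    // Placeholder connection details -- substitute your own.
    Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("ingestUser", new PasswordToken("secret"));

    // 1. Drop WAL replication from the HDFS default of 3 to 2.
    conn.instanceOperations().setProperty("tserver.wal.replication", "2");

    // 2. Bigger in-memory map. Needs enough tserver memory (native maps if it
    //    exceeds the JVM heap) and typically takes effect after a tserver restart.
    conn.instanceOperations().setProperty("tserver.memory.maps.max", "4G");

    // 3. Keep walog size * active logs above the in-memory map size:
    //    here 1G * 6 = 6G > 4G.
    conn.instanceOperations().setProperty("tserver.walog.max.size", "1G");
    conn.tableOperations().setProperty("myTable",
        "table.compaction.minor.logs.threshold", "6");

    // 4. Give each mapper's BatchWriter a bigger buffer.
    BatchWriterConfig bwConfig = new BatchWriterConfig();
    bwConfig.setMaxMemory(64L * 1024 * 1024);   // 64 MB of buffered mutations per mapper
    bwConfig.setMaxLatency(2, TimeUnit.MINUTES);
    bwConfig.setMaxWriteThreads(4);
    AccumuloOutputFormat.setBatchWriterOptions(job, bwConfig);
  }
}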
Cheers,
Adam


On Tue, Apr 8, 2014 at 4:47 PM, Mike Hugo <mike@piragua.com> wrote:

> Hello,
>
> We have an ingest process that operates via MapReduce, processing a large
> set of XML files and inserting mutations based on that data into a set of
> tables.
>
> On a 5 node cluster (each node has 64 GB RAM, 20 cores, and ~600 GB SSD) I
> get 400k inserts per second with 20 mapper tasks running concurrently.
> Increasing the number of concurrent mapper tasks to 40 doesn't have any
> effect (besides causing a little more backup in compactions).
>
> I've increased table.compaction.major.ratio and raised the number of
> concurrent compactions allowed for both minor and major compactions, but
> each of those had only a negligible impact on ingest rates.
>
> Any advice on other settings I can tweak to get things to move more
> quickly? Or is 400k/second a reasonable ingest rate? Are we at a point
> where we should consider generating RFiles like the bulk ingest example?
>
> Thanks in advance for any advice.
>
> Mike
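P.S. On the RFile question: if the live BatchWriter path tops out, the bulk
ingest route looks roughly like the sketch below (loosely modeled on the bulk
ingest example; the table name and HDFS paths are placeholders, and the job's
mappers/reducers would have to emit Keys in sorted order):

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class BulkIngestSketch {

  public static void run(Configuration conf, Connector connector) throws Exception {
    // Write sorted Key/Value pairs straight to RFiles instead of live mutations.
    // Input format, mapper, and reducer setup are elided here.
    Job job = Job.getInstance(conf, "xml-to-rfiles");
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/ingest/files"));

    if (job.waitForCompletion(true)) {
      // Hand the finished files to the tablet servers in one step.
      connector.tableOperations().importDirectory(
          "myTable",              // destination table (placeholder)
          "/tmp/ingest/files",    // directory of RFiles produced by the job
          "/tmp/ingest/failures", // empty directory for files that fail to import
          false);                 // setTime: keep the timestamps written in the files
    }
  }
}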