accumulo-user mailing list archives

From Josh Elser
Subject Re: Running Continuous Ingest on small cluster
Date Thu, 20 Dec 2012 02:02:33 GMT
Thanks for the great info, Eric!

On 12/19/2012 8:59 PM, Eric Newton wrote:
> It depends on the number of drives you have, too.
> I run ingesters and scanners on every slave node, a single batchwalker
> on the master node.
> You want at least 100K for outgoing buffers for your ingester for each
> slave node you have.
> A large in-memory map is probably less useful than block index to get
> your query performance to be faster.
> Once you start getting to 5G / tablet, the number of files per tablet
> causes merging minor compactions, which cuts your performance in half.
> Increasing the number of files will reduce query performance, so that
> gives you a basic way to control ingest vs query performance.
> Pre-splitting to 20 tablets/server will give you the sweet-spot for
> ingest performance; add more if you have more drives.  It allows for
> more parallel writes during minor compactions.
> If you have more than 4 drives per node, try doubling the number of
> ingesters you run.
> I like to tweak everything until I get the system load on each machine
> to be roughly the number of real cores after 12 hours.  This is hard to
> do without a spindle for every two CPUs.
> It's important to watch for failed/failing hardware.  You can sample the
> outgoing write buffers of the ingesters (using netstat).  If you see a
> node constantly having data queued going to it, try taking it out of the
> cluster.  You can do the same for datanodes, too.  At dozens of nodes,
> this is not really important.  When you get to hundreds, it becomes much
> more likely that a single node will flake out after a day of abuse.
> -Eric
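
The netstat check Eric describes can be sketched roughly as follows. This is only an illustration, assuming the Linux `netstat -tn` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and IPv4 addresses. The function names, the sample addresses, and the ports are made up for the example and are not part of the continuous ingest code.

```python
from collections import defaultdict


def send_queue_by_host(netstat_output):
    """Sum the Send-Q column of `netstat -tn` text per remote host."""
    totals = defaultdict(int)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip banner and header lines
        send_q = int(fields[2])
        host = fields[4].rsplit(":", 1)[0]  # drop the port
        totals[host] += send_q
    return dict(totals)


def constantly_queued(samples, min_bytes=1):
    """Return hosts with at least min_bytes queued in every sample."""
    hosts = None
    for sample in samples:
        busy = {h for h, q in send_queue_by_host(sample).items()
                if q >= min_bytes}
        hosts = busy if hosts is None else hosts & busy
    return hosts or set()


# Hypothetical capture taken on an ingester node.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:41000          10.0.0.11:9997          ESTABLISHED
tcp        0  65160 10.0.0.5:41002          10.0.0.12:9997          ESTABLISHED
tcp        0  12000 10.0.0.5:41004          10.0.0.12:9997          ESTABLISHED
"""

if __name__ == "__main__":
    print(send_queue_by_host(SAMPLE))
    print(constantly_queued([SAMPLE, SAMPLE]))
```

In practice the samples would come from running netstat on each ingester every so often. A host that shows up in every sample is a candidate for the "try taking it out of the cluster" test, not proof of bad hardware.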
> On Wed, Dec 19, 2012 at 8:10 PM, Josh Elser wrote:
>     In playing around with the continuous ingest collection of code
>     (ingest, walkers, batchwalkers, scanners and agitators), I found
>     myself blindly guessing at how many of each of these processes I
>     should use.
>     Are there some generic thoughts as to what might be an ideal
>     saturation point for N tservers?
>     I initially split my hosts 4 ways and ran (N/4) of each process
>     (ingest, walkers, batchwalkers, and scanners), ratcheting down the
>     number of threads for ingest and batchwalkers (to avoid saturating CPU
>     and memory). Should I try to balance (query threads * query clients)
>     + (ingest threads * ingest clients) against the available threads
>     per host and adjust the BatchWriter send buffers similarly in regard
>     to memory available?
>     I appreciate anyone's insight.
>     - Josh
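
For a rough feel of the numbers in this thread, here is a small back-of-the-envelope sketch. It combines the thread budget from the original question with the two rules of thumb from Eric's reply (about 100K of outgoing buffer per slave node for each ingester, and 20 tablets per server). Reading "100K" as 100 KiB is an assumption, and the function names and example values are hypothetical.

```python
def client_thread_demand(query_threads, query_clients,
                         ingest_threads, ingest_clients):
    """(query threads * query clients) + (ingest threads * ingest clients)."""
    return query_threads * query_clients + ingest_threads * ingest_clients


def ingester_buffer_bytes(slave_nodes, per_node_bytes=100 * 1024):
    """Outgoing buffer for one ingester: ~100K per slave node (assumed KiB)."""
    return slave_nodes * per_node_bytes


def presplit_tablets(tablet_servers, tablets_per_server=20):
    """Target tablet count when pre-splitting to 20 tablets per server."""
    return tablet_servers * tablets_per_server


if __name__ == "__main__":
    nodes = 10  # hypothetical cluster size
    print("thread demand:", client_thread_demand(4, 2, 3, 2))
    print("buffer bytes per ingester:", ingester_buffer_bytes(nodes))
    tablets = presplit_tablets(nodes)
    # N tablets need N - 1 split points.
    print("tablets:", tablets, "split points:", tablets - 1)
```

These figures are only starting points. As Eric notes, the real tuning target is the observed system load on each machine after a long run.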
