accumulo-user mailing list archives

From Mike Hugo <m...@piragua.com>
Subject Re: Advice on increasing ingest rate
Date Wed, 09 Apr 2014 20:45:38 GMT
On Tue, Apr 8, 2014 at 5:35 PM, David Medinets <david.medinets@gmail.com> wrote:

> 20 cores and just one SSD? Is there a standard recommendation for a core
> to SSD ratio?
>
> Other questions:
>
> How are you sharding your data (i.e., what does your row look like)?
>

we do something like the entity-attribute / graph tables example
from the Accumulo manual


> Are you pre-splitting the table?
>

no


> How many tablets are ingesting at the same time?
>

4


> Are you writing from the map-reduce job directly to Accumulo or writing to
> RFiles first?
>

directly to accumulo


> Are the Accumulo nodes and the Hadoop nodes on the same servers?
>

yes


>  Do you see the server load spike during ingest?
>



> How much memory are you allocating to the tservers?
>

2GB


> How large are the entries on average?
> What are the largest entries?
> Does the data skew towards large entries?
>

entries are small.  probably <1k for key/value combined in most instances


> Are you querying at the same time as ingesting?
>

no


>
>
> On Tue, Apr 8, 2014 at 5:35 PM, Adam Fuchs <afuchs@apache.org> wrote:
>
>> Mike,
>>
>> What version of Accumulo are you using, how many tablets do you have, and
>> how many threads are you using for minor and major compaction pools? Also,
>> how big are the keys and values that you are using?
>>
>> Here are a few settings that may help you:
>> 1. WAL replication factor (tserver.wal.replication). This defaults to 3
>> replicas (the HDFS default), but if you set it to 2 it will give you a
>> performance boost without a huge hit to reliability.
>> 2. Ingest buffer size (tserver.memory.maps.max), also known as the
>> in-memory map size. Increasing this generally improves the efficiency of
>> minor compactions and reduces the number of major compactions that will be
>> required down the line. 4-8 GB is not unreasonable.
>> 3. Make sure your WAL settings are such that the size of a log
>> (tserver.walog.max.size) multiplied by the number of active logs
>> (table.compaction.minor.logs.threshold) is greater than the in-memory map
>> size. You probably want to accomplish this by bumping up the number of
>> active logs.
>> 4. Increase the buffer size on the BatchWriter that the clients use. This
>> can be done with the setBatchWriterOptions method on the
>> AccumuloOutputFormat.
>>
>> Cheers,
>> Adam
>>
>>
>>
>> On Tue, Apr 8, 2014 at 4:47 PM, Mike Hugo <mike@piragua.com> wrote:
>>
>>> Hello,
>>>
>>> We have an ingest process that operates via Map Reduce, processing a
>>> large set of XML files and inserting mutations based on that data into a
>>> set of tables.
>>>
>>> On a 5 node cluster (each node has 64G ram, 20 cores, and ~600GB SSD) I
>>> get 400k inserts per second with 20 mapper tasks running concurrently.
>>> Increasing the number of concurrent mapper tasks to 40 doesn't have any
>>> effect (besides causing a little more backup in compactions).
>>>
>>> I've increased the table.compaction.major.ratio and increased the number
>>> of concurrent compactions allowed for both minor and major compaction, but
>>> each of those had only a negligible impact on ingest rates.
>>>
>>> Any advice on other settings I can tweak to get things to move more
>>> quickly?  Or is 400k/second a reasonable ingest rate?  Are we at a point
>>> where we should consider generating RFiles like the bulk ingest example?
>>>
>>> Thanks in advance for any advice.
>>>
>>> Mike
>>>
>>
>>
>

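The sizing rule in item 3 of Adam's list quoted above (log size multiplied by the number of active logs should exceed the in-memory map size) comes down to simple arithmetic. Below is a minimal sketch of that calculation. The class name `WalSizing`, the helper `minActiveLogs`, and the 4 GiB / 1 GiB figures are illustrative assumptions, not values taken from this cluster or from Accumulo's defaults.

```java
public class WalSizing {

    /**
     * Returns the smallest number of active write-ahead logs such that
     * (number of logs * walMaxBytes) is strictly greater than mapMaxBytes.
     */
    static long minActiveLogs(long mapMaxBytes, long walMaxBytes) {
        if (mapMaxBytes < 0 || walMaxBytes <= 0) {
            throw new IllegalArgumentException(
                "mapMaxBytes must be >= 0 and walMaxBytes must be > 0");
        }
        // Integer division rounds down; adding 1 guarantees a strict inequality.
        return mapMaxBytes / walMaxBytes + 1;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long mapMax = 4 * gib; // assumed value for tserver.memory.maps.max
        long walMax = 1 * gib; // assumed value for tserver.walog.max.size

        // Candidate lower bound for table.compaction.minor.logs.threshold.
        System.out.println(minActiveLogs(mapMax, walMax)); // prints 5
    }
}
```

With these assumed sizes, a threshold of 5 or more active logs would satisfy the rule. Plugging in the actual configured sizes gives the corresponding lower bound for a real cluster.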