accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: Bulk Ingest
Date Fri, 17 Jun 2016 03:37:37 GMT
Can you create RFiles outside of Accumulo and then import those?
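
If the files are built outside of Accumulo, the rows can be grouped by tablet before anything is written, which is what Josh's point 2 (quoted below) asks for. Here is a minimal sketch in plain Java. TabletGrouper, tabletFor, groupByTablet and the "<default>" marker are hypothetical names, not part of the Accumulo API. It assumes a tablet covers the rows after the previous split up to and including its own end row, and it compares rows as Java Strings, which only approximates the byte ordering Accumulo uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the Accumulo API.
final class TabletGrouper {
    // Assumed marker for the last tablet, which has no end row.
    // It must not collide with a real split point.
    static final String DEFAULT_TABLET = "<default>";

    // Returns the end row of the tablet that would hold this row:
    // the smallest split point that is >= the row, or the marker
    // for the last tablet when the row sorts after every split.
    static String tabletFor(TreeSet<String> splits, String row) {
        String endRow = splits.ceiling(row);
        return endRow == null ? DEFAULT_TABLET : endRow;
    }

    // Groups rows by the tablet they fall into, so that each group
    // can be written to its own file.
    static Map<String, List<String>> groupByTablet(TreeSet<String> splits, List<String> rows) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String row : rows) {
            groups.computeIfAbsent(tabletFor(splits, row), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```

With splits "g" and "p", the rows a, b, g, h, z fall into three groups: [a, b, g], [h] and [z]. In a real job the split points would be read from the table and each group would become one file.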

On Thu, Jun 16, 2016 at 10:24 PM, Josh Elser <josh.elser@gmail.com> wrote:

> There are two big things that are required to really scale up bulk
> loading. Sadly (I guess) they are both things you would need to
> implement on your own:
>
> 1) Avoid lots of small files. Target files as large as you can,
> relative to your ingest latency requirements and your max file size (set on
> your instance or table).
>
> 2) Avoid having to import one file to multiple tablets. Remember that the
> majority of the metadata update for Accumulo is updating the tablet row
> with the new file. When you have one file which spans many tablets, you
> now create N metadata updates instead of just one. When you create the
> files, take into account the split points of your table, and use them to
> try to target one file per tablet.
>
>
> Roshan Punnoose wrote:
>
>> We are trying to perform bulk ingest at scale and wanted to get some
>> quick thoughts on how to increase performance and stability. One of the
>> problems we have is that we sometimes import thousands of small files,
>> and I don't believe there is a good way around this in the architecture
>> as of yet. Already I have run into an RPC timeout issue because the
>> import process is taking longer than 5m. We also hit another issue: we
>> have so many files after a bulk import that we have had to bump
>> tserver.scan.files.open.max to 1K.
>>
>> Here are some other configs that we have been toying with:
>> - master.fate.threadpool.size: 20
>> - master.bulk.threadpool.size: 20
>> - master.bulk.timeout: 20m
>> - tserver.bulk.process.threads: 20
>> - tserver.bulk.assign.threads: 20
>> - tserver.bulk.timeout: 20m
>> - tserver.compaction.major.concurrent.max: 20
>> - tserver.scan.files.open.max: 1200
>> - tserver.server.threads.minimum: 64
>> - table.file.max: 64
>> - table.compaction.major.ratio: 20
>>
>> (HDFS)
>> - dfs.namenode.handler.count: 100
>> - dfs.datanode.handler.count: 50
>>
>> Just want to get any quick ideas for performing bulk ingest at scale.
>> Thanks guys
>>
>> p.s. This is on Accumulo 1.6.5
>>
>
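
To put a number on the "N metadata updates" in point 2 above, one could count how many tablets a file would touch before importing it. The sketch below is only an approximation under stated assumptions: SpanCounter and tabletsSpanned are hypothetical names (not Accumulo API), rows are compared as Java Strings rather than bytes, and a file is assumed to overlap every tablet between the ones holding its first and last row.

```java
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the Accumulo API.
final class SpanCounter {
    // Index of the tablet holding a row: the number of split points
    // that sort strictly before the row. A row equal to a split point
    // belongs to the tablet that ends at that split.
    static int tabletIndex(TreeSet<String> splits, String row) {
        return splits.headSet(row, false).size();
    }

    // Number of tablets a file with the given first and last row would
    // overlap, i.e. a rough count of the metadata updates its import causes.
    static int tabletsSpanned(TreeSet<String> splits, String firstRow, String lastRow) {
        if (firstRow.compareTo(lastRow) > 0) {
            throw new IllegalArgumentException("firstRow must not sort after lastRow");
        }
        return tabletIndex(splits, lastRow) - tabletIndex(splits, firstRow) + 1;
    }
}
```

With splits "g" and "p", a file covering rows a through g spans 1 tablet, a through h spans 2, and a through z spans 3. Any file that reports more than 1 is a candidate for being split along the table's split points before import.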
