accumulo-dev mailing list archives

From Josh Elser
Subject Re: Fwd: Data authorization/visibility limit in Accumulo
Date Mon, 11 Apr 2016 03:32:22 GMT
Dylan Hutchison wrote:
>>> 2. What is the most effective way to ingest data, if we're receiving data
>>> with a size of >1 TB on a daily basis?
>>
>> If latency is not a primary concern, creating Accumulo RFiles and
>> performing bulk ingest/bulk loading is by far the most efficient way to
>> get data into Accumulo. This is often done by a MapReduce job to
>> process your incoming data, create Accumulo RFiles and then bulk load these
>> files into Accumulo. If you have a low-latency requirement for getting data
>> into Accumulo, waiting for a MapReduce job to complete may take too long to
>> meet your required latencies.
>
> If you need a lower latency, you still have the option of parallel ingest
> via normal BatchWriters.  Assuming good load balancing and the same number
> of ingestors as tablet servers, you should easily obtain ingest rates of
> 100k entries/sec/node.  With significant effort, some have pushed this to
> 400k entries/sec/node.
> Josh, do we have numbers on bulk ingest rates?  I'm curious what the best
> rates ever achieved are.
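
For a rough sense of scale, the 1 TB/day figure from the original question
can be set against the 100k and 400k entries/sec/node rates quoted above.
This is only a back-of-envelope sketch: the 100-byte average entry size is
an assumption (the thread gives no entry size), the helper names are made
up for illustration, and it ignores replication, compactions and load skew.

```python
import math

BYTES_PER_TB = 10**12          # decimal terabyte
SECONDS_PER_DAY = 24 * 60 * 60


def required_entries_per_sec(daily_bytes, avg_entry_bytes):
    """Sustained entries/sec needed to keep up with a daily volume."""
    return daily_bytes / avg_entry_bytes / SECONDS_PER_DAY


def nodes_needed(daily_bytes, avg_entry_bytes, per_node_rate):
    """Minimum tablet servers if each sustains per_node_rate entries/sec."""
    needed = required_entries_per_sec(daily_bytes, avg_entry_bytes)
    return math.ceil(needed / per_node_rate)


# ASSUMPTION: 100 bytes per key/value entry; adjust for your own data.
AVG_ENTRY_BYTES = 100

rate = required_entries_per_sec(BYTES_PER_TB, AVG_ENTRY_BYTES)
print(f"required: {rate:,.0f} entries/sec")
print("nodes at 100k/sec/node:", nodes_needed(BYTES_PER_TB, AVG_ENTRY_BYTES, 100_000))
print("nodes at 400k/sec/node:", nodes_needed(BYTES_PER_TB, AVG_ENTRY_BYTES, 400_000))
```

Under that assumed entry size, 1 TB/day is a fairly modest sustained rate
for BatchWriter ingest; smaller entries raise the entry count
proportionally.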

Hrm. Not that I'm aware of. Generally, a bulk import is some ZooKeeper 
operations (via FATE) and a few metadata updates per file (~3? I'm not 
actually sure). Maybe I'm missing something?

My hunch is that you'd run into HDFS issues in generating the data to 
import before you'd run into Accumulo limits. Eventually, compactions 
might bog you down too (depending on how you generated the data). I'm 
not sure if we even have a bulk-import benchmark (akin to continuous 
