accumulo-dev mailing list archives

From Dylan Hutchison <dhutc...@cs.washington.edu>
Subject Re: Fwd: Data authorization/visibility limit in Accumulo
Date Mon, 11 Apr 2016 05:07:47 GMT
On Sun, Apr 10, 2016 at 8:32 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Dylan Hutchison wrote:
>
>>>> 2. What is the most effective way to ingest data, if we're receiving
>>>> data with the size of >1 TB on a daily basis?
>>>
>>> If latency is not a primary concern, creating Accumulo RFiles and
>>> performing bulk ingest/bulk loading is by far the most efficient way to
>>> get data into Accumulo. This is often done by a MapReduce job to
>>> process your incoming data, create Accumulo RFiles and then bulk load
>>> these files into Accumulo. If you have a low-latency requirement for
>>> getting data into Accumulo, waiting for a MapReduce job to complete may
>>> take too long to meet your required latencies.
>>
>> If you need a lower latency, you still have the option of parallel ingest
>> via normal BatchWriters.  Assuming good load balancing and the same number
>> of ingestors as tablet servers, you should easily obtain ingest rates of
>> 100k entries/sec/node.  With significant effort, some have pushed this to
>> 400k entries/sec/node.
>>
>> Josh, do we have numbers on bulk ingest rates?  I'm curious what the best
>> rates ever achieved are.
>>
>
> Hrm. Not that I'm aware of. Generally, a bulk import is some ZooKeeper
> operations (via FATE) and a few metadata updates per file (~3? I'm not
> actually sure). Maybe I'm missing something?
>
> My hunch is that you'd run into HDFS issues in generating the data to
> import before you'd run into Accumulo limits. Eventually, compactions might
> bog you down too (depending on how you generated the data). I'm not sure if
> we even have a bulk-import benchmark (akin to continuous ingest).
>

Good point: this does depend on the original data source.  If the data
source is itself the output of a MapReduce job, then MapReducing to RFiles
is free (in the best case).  If the data source is a 1 TB file on disk, then
it is hard to say whether MapReduce->BulkImport or a BatchWriter-based ingest
is faster without empirical evidence from both approaches.
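
For concreteness, the bulk-load step itself is just one client call once the
RFiles exist.  A minimal sketch, untested -- the instance name, ZooKeepers,
credentials, table name, and HDFS paths below are placeholders:

  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.client.security.tokens.PasswordToken;

  public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
      // Placeholder connection details -- substitute your own instance,
      // ZooKeeper quorum, and credentials.
      Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
          .getConnector("user", new PasswordToken("secret"));

      // Assumes the RFiles were already written to HDFS, e.g. by a MapReduce
      // job whose output format is AccumuloFileOutputFormat.  The failures
      // directory must exist and be empty; files that cannot be assigned to
      // tablets are moved there.
      conn.tableOperations().importDirectory(
          "mytable",            // destination table
          "/tmp/bulk/files",    // directory of RFiles to load
          "/tmp/bulk/failures", // directory for files that fail to load
          false);               // setTime=false keeps timestamps from the files
    }
  }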

Fikri, it sounds like the conclusion is that you should determine your
latency requirement and try whichever method is easiest to get started with
and still meets that requirement.  Then you can measure performance and keep
that solution if it works, or try another option if not.  You can report
your numbers and experience back to us =)
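
P.S. In case it helps to see the other path concretely, here is a minimal
BatchWriter sketch (again untested; connection details, table name, and the
tuning numbers are placeholders).  The parallelism mentioned above comes from
running one such client per ingest node, with good load balancing across the
tablet servers:

  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.client.security.tokens.PasswordToken;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.accumulo.core.data.Value;

  public class BatchWriterSketch {
    public static void main(String[] args) throws Exception {
      // Placeholder connection details -- substitute your own.
      Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
          .getConnector("user", new PasswordToken("secret"));

      // Starting-point settings, not tuned recommendations: a bigger client
      // buffer and more write threads generally help throughput.
      BatchWriterConfig cfg = new BatchWriterConfig()
          .setMaxMemory(64 * 1024 * 1024)  // 64 MB client-side buffer
          .setMaxWriteThreads(8);          // concurrent sends to tablet servers

      BatchWriter bw = conn.createBatchWriter("mytable", cfg);
      try {
        for (int i = 0; i < 1_000_000; i++) {
          Mutation m = new Mutation(String.format("row_%08d", i));
          m.put("cf", "cq", new Value(("val" + i).getBytes()));
          bw.addMutation(m);
        }
      } finally {
        bw.close();  // flushes any mutations still buffered on the client
      }
    }
  }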
