accumulo-dev mailing list archives

From Russ Weeks <rwe...@newbrightidea.com>
Subject Re: Fwd: Data authorization/visibility limit in Accumulo
Date Mon, 11 Apr 2016 17:20:14 GMT
> Eventually, compactions might bog you down too (depending on how you
> generated the data)

Yes, I've found that this is very important. If you're using a hash-based
partitioner to distribute the work in either Spark or M/R, it's easy to end
up with each tablet server responsible for every split of every RFile. You
then wind up in this weird situation where the importDirectory call takes
almost as long as generating the RFiles in the first place! I *think* it's
because importDirectory is waiting on a bunch of compactions to reorganize
the data, but I'm not 100% sure.
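
One rough, untested sketch of the alternative: range-partition the map
output by the table's current split points, so each reducer writes RFiles
covering a single tablet's range. This assumes the Accumulo 1.x MapReduce
classes (AccumuloFileOutputFormat, RangePartitioner); the driver class name,
reducer count, and paths below are placeholders.

    import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
    import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class BulkIngestDriver {                    // hypothetical driver
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "generate-rfiles");
        job.setJarByClass(BulkIngestDriver.class);
        // set your input format, a Mapper that emits (Text row, Text value),
        // and a Reducer that emits Accumulo (Key, Value) pairs here

        // map output keyed by row so rows can be range-partitioned
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // write Accumulo RFiles instead of plain HDFS files
        job.setOutputFormatClass(AccumuloFileOutputFormat.class);
        AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk/files"));

        // splits.txt: the table's split points, one Base64-encoded row per
        // line (e.g. dumped from TableOperations.listSplits() beforehand)
        job.setPartitionerClass(RangePartitioner.class);
        RangePartitioner.setSplitFile(job, "/tmp/bulk/splits.txt");
        job.setNumReduceTasks(101);   // placeholder: number of splits + 1

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

With the output partitioned like that, each imported file should map onto
(roughly) one tablet, instead of every tablet server having to deal with
every file.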

-Russ



On Sun, Apr 10, 2016 at 10:08 PM Dylan Hutchison <dhutchis@cs.washington.edu>
wrote:

> On Sun, Apr 10, 2016 at 8:32 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
> > Dylan Hutchison wrote:
> >
> >>>> 2. What is the most effective way to ingest data, if we're
> >>>> receiving data with the size of >1 TB on a daily basis?
> >>>
> >>> If latency is not a primary concern, creating Accumulo RFiles and
> >>> performing bulk ingest/bulk loading is by far the most efficient way
> >>> to get data into Accumulo. This is often done by a MapReduce job to
> >>> process your incoming data, create Accumulo RFiles, and then bulk
> >>> load these files into Accumulo. If you have a low-latency requirement
> >>> for getting data into Accumulo, waiting for a MapReduce job to
> >>> complete may take too long to meet your required latencies.
> >>>
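
(For reference, the bulk-load step itself is one call once the RFiles are
sitting in HDFS. A minimal sketch against the Accumulo 1.x client API; the
instance name, ZooKeeper host, credentials, table, and directories are
placeholders.)

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Instance;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BulkLoad {
      public static void main(String[] args) throws Exception {
        Instance inst = new ZooKeeperInstance("myInstance", "zkhost:2181");
        Connector conn = inst.getConnector("user", new PasswordToken("pass"));

        // the failure directory must exist and be empty; files that cannot
        // be assigned to tablets are moved here instead of being imported
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/tmp/bulk/failures"));

        conn.tableOperations().importDirectory(
            "mytable",              // destination table
            "/tmp/bulk/files",      // directory of RFiles from the M/R job
            "/tmp/bulk/failures",   // failure directory
            false);                 // setTime=false: keep timestamps in the files
      }
    }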
> >> If you need a lower latency, you still have the option of parallel
> >> ingest via normal BatchWriters.  Assuming good load balancing and the
> >> same number of ingestors as tablet servers, you should easily obtain
> >> ingest rates of 100k entries/sec/node.  With significant effort, some
> >> have pushed this to 400k entries/sec/node.
> >>
> >> Josh, do we have numbers on bulk ingest rates?  I'm curious what the
> >> best rates ever achieved are.
> >>
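
(And the streaming alternative, for comparison: a minimal BatchWriter sketch
against the Accumulo 1.x client API, again with placeholder instance,
credentials, table, and data. Each ingest process would run its own writer
against the same table.)

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class StreamingIngest {
      public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
            .getConnector("user", new PasswordToken("pass"));

        BatchWriterConfig cfg = new BatchWriterConfig();
        cfg.setMaxMemory(64 * 1024 * 1024);   // buffer this much before flushing
        cfg.setMaxWriteThreads(8);            // parallel sends to tablet servers

        BatchWriter writer = conn.createBatchWriter("mytable", cfg);
        try {
          Mutation m = new Mutation("row-0001");
          m.put("cf", "cq", new Value("some value".getBytes()));
          writer.addMutation(m);
        } finally {
          writer.close();                     // flushes any buffered mutations
        }
      }
    }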
> >
> > Hrm. Not that I'm aware of. Generally, a bulk import is some ZooKeeper
> > operations (via FATE) and a few metadata updates per file (~3? I'm not
> > actually sure). Maybe I'm missing something?
> >
> > My hunch is that you'd run into HDFS issues in generating the data to
> > import before you'd run into Accumulo limits. Eventually, compactions
> > might bog you down too (depending on how you generated the data). I'm
> > not sure if we even have a bulk-import benchmark (akin to continuous
> > ingest).
> >
>
> Good point: this does depend on the original data source.  If the data
> source is itself the output of a MapReduce job, then MapReducing to RFiles
> is free (in the best case).  If the data source is a 1TB file on disk, then
> it is hard to say whether MapReduce->BulkImport or BatchWriter is faster,
> without empirical evidence on both solutions.
>
> Fikri, it sounds like the conclusion is that you should determine your
> latency requirement, then try whichever method is easiest to start with
> and fits that requirement.  Measure performance and keep the solution if
> it works, or seek another option if not.  You can report back your
> numbers and experience to us =)
>
