accumulo-user mailing list archives

From Jeremy Kepner
Subject Re: How does Accumulo compare to HBase
Date Thu, 10 Jul 2014 15:08:18 GMT
If you repeated the experiments you did for the Accumulo-only portion with HBase, what would
you expect the results to be?

On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
> At the risk of derailing the original thread, I'll take a moment to explain
> my methodology. Since each entry can be thought of as a 6-dimensional
> vector (r, cf, cq, vis, ts, val) there's a lot of room for fiddling with
> the specifics of it. YCSB gives you several knobs, but unfortunately, not
> absolutely everything was tunable.
> The things that are configurable:
> - # of rows
> - # of column qualifiers
> - length of value
> - number of operations per client
> - number of clients
> Things that are not configurable:
> - row length (it's a long int)
> - # of column families (constant at one per row)
> - length of column qualifier (basically a one-up counter per row)
> - visibilities
> In all of my experiments, the goal was to keep data size constant. This can
> be approximated by (number of entries * entry size). Number of entries is
> intuitively rows (configurable) * column families (1) * column qualifiers
> per family (configurable), while entry size is key overhead (about 40
> bytes) + configured length of value. So to keep total size constant, we
> have three easy knobs. However, tweaking three values at a time produces
> really messy data where you're not always going to be sure where the
> causality arrow lies. Even doing two at a time can cause issues but then
> the choice is between tweaking two properties of the data, or one property
> of the data and the total size (which is also a relevant attribute).
> Whew. So why did I use two different independent variables between the two
> halves?
> Partly, because I'm not comparing the two tests to each other, so they
> don't have to be duplicative. I ran them on different hardware from each
> other, with different numbers of clients, disks, cores, etc. There are no
> meaningful comparisons to be drawn, so I wanted to remove the temptation to
> compare results against each other. I'll admit that I might be wrong in
> this regard.
> The graphs are not my complete data sets. For the Accumulo v Accumulo
> tests, we have about ten more data points varying rows and data size as
> well. Trying to show three independent variables on a graph was pretty
> painful, so they didn't make it into the presentation. The short version of
> the story is that nothing scaled linearly (some things were better, some
> things were worse) but the general trend lines were approximately what you
> would expect.
> Let me know if you have more questions, but we can probably start a new
> thread for future search posterity! (This applies to everybody).
> Mike
> On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL wrote:
> > Mike Drob put together a great talk at the Accumulo Summit
> > discussing Accumulo
> > performance and HBase performance.  This is exactly the kind of work the
> > entire Hadoop community needs to continue to move forward.
> >
> > I had one question about the talk which I was wondering if someone might
> > be able to shed light on.  In the Accumulo part of the talk the experiments
> > varied #rows while keeping the #cols fixed, while in the Accumulo/HBase
> > part of the talk the experiments varied #cols while keeping #rows fixed?
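
For anyone following the size arithmetic in Mike's quoted message (number of entries * entry size, with roughly 40 bytes of key overhead per entry), here is a minimal Python sketch. The function and constant names are illustrative and not from the talk or from YCSB; the 40-byte overhead is only the approximate figure quoted above, and the last helper is one assumed way to rebalance a knob when the row count changes.

```python
# Approximate per-entry key overhead quoted in the thread ("about 40 bytes").
KEY_OVERHEAD_BYTES = 40
# Column families were fixed at one per row in the described experiments.
COLUMN_FAMILIES = 1


def entry_count(rows, qualifiers_per_family, families=COLUMN_FAMILIES):
    """Number of entries = rows * column families * qualifiers per family."""
    return rows * families * qualifiers_per_family


def total_bytes(rows, qualifiers_per_family, value_length):
    """Approximate data size = number of entries * (key overhead + value length)."""
    entry_size = KEY_OVERHEAD_BYTES + value_length
    return entry_count(rows, qualifiers_per_family) * entry_size


def qualifiers_for_constant_size(target_bytes, rows, value_length):
    """Hypothetical helper: pick qualifiers per family so the total size stays
    near target_bytes when the row count changes (rounded, at least 1)."""
    entry_size = KEY_OVERHEAD_BYTES + value_length
    return max(1, round(target_bytes / (rows * COLUMN_FAMILIES * entry_size)))


if __name__ == "__main__":
    baseline = total_bytes(rows=1000, qualifiers_per_family=10, value_length=60)
    # Doubling the rows means halving the qualifiers to hold the size steady.
    cols = qualifiers_for_constant_size(baseline, rows=2000, value_length=60)
    print(baseline, cols, total_bytes(2000, cols, 60))
```

Changing rows and qualifiers together like this is the "two knobs at a time" trade-off the message describes: total size stays fixed, but two properties of the data move at once.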
