accumulo-user mailing list archives

From Mike Drob <md...@apache.org>
Subject Re: How does Accumulo compare to HBase
Date Thu, 10 Jul 2014 15:02:00 GMT
At the risk of derailing the original thread, I'll take a moment to explain
my methodology. Since each entry can be thought of as a six-dimensional
vector (r, cf, cq, vis, ts, val), there's a lot of room for fiddling with
the specifics of it. YCSB gives you several knobs, but unfortunately not
everything is tunable.

The things that are configurable:
- # of rows
- # of column qualifiers
- length of value
- number of operations per client
- number of clients

Things that are not configurable:
- row length (it's a long int)
- # of column families (constant at one per row)
- length of column qualifier (basically a one-up counter per row)
- visibilities
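
For reference, a rough sketch of how the configurable knobs above might map
onto YCSB core workload properties. The property names (recordcount,
fieldcount, fieldlength, operationcount, threadcount) are standard YCSB
names, but the mapping to each knob is an assumption rather than something
stated in this message, in particular that "number of clients" corresponds
to client threads. The function names and example values are hypothetical.

```python
def ycsb_properties(rows, qualifiers, value_length, operations, threads):
    """Assumed mapping from the knobs listed above to YCSB property names."""
    return {
        "recordcount": rows,           # number of rows
        "fieldcount": qualifiers,      # assumed: one YCSB field per column qualifier
        "fieldlength": value_length,   # length of each value, in bytes
        "operationcount": operations,  # operations performed in a run
        "threadcount": threads,        # assumed: client threads stand in for "clients"
    }


def to_properties_text(props):
    """Render the mapping as key=value lines, as in a workload properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Illustrative values only (not the settings used in the experiments).
example = to_properties_text(ycsb_properties(1000, 10, 100, 5000, 4))
```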

In all of my experiments, the goal was to keep data size constant. This can
be approximated by (number of entries * entry size). Number of entries is
intuitively rows (configurable) * column families (1) * column qualifiers
per family (configurable), while entry size is key overhead (about 40
bytes) + configured length of value. So to keep total size constant, we
have three easy knobs. However, tweaking three values at a time produces
really messy data where you can't always be sure which way the causality
arrow points. Even doing two at a time can cause issues, but then
the choice is between tweaking two properties of the data, or one property
of the data and the total size (which is also a relevant attribute).
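
The arithmetic above can be sketched as follows. This is a minimal
illustration, not the script used for the experiments: the function names
and example figures are made up, and the 40-byte key overhead is the
approximation quoted above.

```python
KEY_OVERHEAD_BYTES = 40  # approximate per-entry key overhead quoted above
COLUMN_FAMILIES = 1      # fixed at one per row in these YCSB runs


def total_data_size(rows, qualifiers_per_family, value_length):
    """Approximate data size in bytes: number of entries * entry size."""
    entries = rows * COLUMN_FAMILIES * qualifiers_per_family
    entry_size = KEY_OVERHEAD_BYTES + value_length
    return entries * entry_size


def rows_for_constant_size(target_bytes, qualifiers_per_family, value_length):
    """Rows needed to stay near target_bytes once the other two knobs are fixed."""
    entry_size = KEY_OVERHEAD_BYTES + value_length
    return target_bytes // (COLUMN_FAMILIES * qualifiers_per_family * entry_size)


# Illustrative numbers only (not from the original experiments).
baseline = total_data_size(rows=1_000_000, qualifiers_per_family=10, value_length=60)

# Doubling the qualifiers per row while holding total size constant
# means the row count has to be halved: two knobs move together.
rows = rows_for_constant_size(baseline, qualifiers_per_family=20, value_length=60)
```

This makes the trade-off concrete: with total size pinned, changing any one
of the three knobs forces a compensating change in another.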

Whew. So why did I use two different independent variables between the two
halves?

Partly, because I'm not comparing the two tests to each other, so they
don't have to be duplicative. I ran them on different hardware from each
other, with different numbers of clients, disks, cores, etc. There are no
meaningful comparisons to be drawn, so I wanted to remove the temptation to
compare results against each other. I'll admit that I might be wrong in
this regard.

The graphs are not my complete data sets. For the Accumulo v Accumulo
tests, we have about ten more data points varying rows and data size as
well. Trying to show three independent variables on a graph was pretty
painful, so they didn't make it into the presentation. The short version of
the story is that nothing scaled linearly (some things were better, some
things were worse) but the general trend lines were approximately what you
would expect.

Let me know if you have more questions, but we can probably start a new
thread for future search posterity! (This applies to everybody).

Mike


On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
kepner@ll.mit.edu> wrote:

> Mike Drob put together a great talk at the Accumulo Summit (
> http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing Accumulo
> performance and HBase performance.  This is exactly the kind of work the
> entire Hadoop community needs to continue to move forward.
>
> I had one question about the talk which I was wondering if someone might
> be able to shed light on.  In the Accumulo part of the talk the experiments
> varied #rows while keeping the #cols fixed, while in the Accumulo/HBase
> part of the talk the experiments varied #cols while keeping #rows fixed?
