accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: How does Accumulo compare to HBase
Date Thu, 10 Jul 2014 18:55:57 GMT
Last year, I used Accumulo's rapid ingest ability to join two data silos
into one dataset. Every field was fully indexed. Having all of the data in
one place allowed cross-referencing queries to be executed. For various
reasons, this kind of query was not possible using the existing technology.
The rapid ingest was important because a new copy of the data silos was
pulled every night.


On Thu, Jul 10, 2014 at 1:55 PM, Sean Busbey <busbey@cloudera.com> wrote:

>
> Chuck,
>
> It would help the community and my own benchmarking efforts if you could
> describe how you think a benchmark might incorporate representations of
> real-world bottlenecks.
>
> Do you think YCSB sufficiently covers the kind of testing you'd prefer?
>
>
> Marc,
>
> Similarly, it would help if you could describe the use case(s) behind your
> statement of interest.
>
> -Sean
>
>
> On Thu, Jul 10, 2014 at 12:11 PM, Marc Parisi <marc@accumulo.net> wrote:
>
>> I care
>>
>>
>> On Thu, Jul 10, 2014 at 11:33 AM, Chuck Adams <chuck.adams@oracle.com>
>> wrote:
>>
>>> Dr. Kepner,
>>>
>>> Who cares how fast you can load data into a non-indexed HBase or
>>> Accumulo database?  What is the strategy to handle user queries against
>>> this corpus?  Run some real world tests which include simultaneous queries,
>>> index maintenance, information life cycle management, during the initial
>>> and incremental data loads.
>>>
>>> The tests you are running do not appear to have any of the real world
>>> bottlenecks that occur for production systems that users rely on for their
>>> business.
>>>
>>> Respectfully,
>>> Chuck Adams
>>>
>>> Vice President Technical Leadership Team
>>> Oracle National Security Group
>>> 1910 Oracle Way
>>> Reston, VA 20190
>>>
>>> Cell:     301.529.9396
>>> Email: chuck.adams@oracle.com
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>> -----Original Message-----
>>> From: Jeremy Kepner [mailto:kepner@ll.mit.edu]
>>> Sent: Thursday, July 10, 2014 11:08 AM
>>> To: user@accumulo.apache.org
>>> Subject: Re: How does Accumulo compare to HBase
>>>
>>> If you repeated the experiments you did for the Accumulo only portion
>>> with HBase what would you expect the results to be?
>>>
>>> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
>>> > At the risk of derailing the original thread, I'll take a moment to
>>> > explain my methodology. Since each entry can be thought of as a 6
>>> > dimensional vector (r, cf, cq, vis, ts, val) there's a lot of room for
>>> > fiddling with the specifics of it. YCSB gives you several knobs, but
>>> > unfortunately, not absolutely everything was tunable.
>>> >
>>> > The things that are configurable:
>>> > - # of rows
>>> > - # of column qualifiers
>>> > - length of value
>>> > - number of operations per client
>>> > - number of clients
>>> >
>>> > Things that are not configurable:
>>> > - row length (it's a long int)
>>> > - # of column families (constant at one per row)
>>> > - length of column qualifier (basically a one-up counter per row)
>>> > - visibilities
>>> >
>>> > In all of my experiments, the goal was to keep data size constant.
>>> > This can be approximated by (number of entries * entry size). Number
>>> > of entries is intuitively rows (configurable) * column families (1) *
>>> > column qualifiers per family (configurable), while entry size is key
>>> > overhead (about 40 bytes) + configured length of value. So to keep
>>> > total size constant,
>>> > we have three easy knobs. However, tweaking three values at a time
>>> > produces really messy data where you're not always going to be sure
>>> > where the causality arrow lies. Even doing two at a time can cause
>>> > issues but then the choice is between tweaking two properties of the
>>> > data, or one property of the data and the total size (which is also a
>>> > relevant attribute).
>>> >
>>> > Whew. So why did I use two different independent variables between the
>>> > two halves?
>>> >
>>> > Partly, because I'm not comparing the two tests to each other, so they
>>> > don't have to be duplicative. I ran them on different hardware from
>>> > each other, with different numbers of clients, disks, cores, etc.
>>> > There are no meaningful comparisons to be drawn, so I wanted to remove
>>> > the temptation to compare results against each other. I'll admit that
>>> > I might be wrong in this regard.
>>> >
>>> > The graphs are not my complete data sets. For the Accumulo v Accumulo
>>> > tests, we have about ten more data points varying rows and data size
>>> > as well. Trying to show three independent variables on a graph was
>>> > pretty painful, so they didn't make it into the presentation. The
>>> > short version of the story is that nothing scaled linearly (some
>>> > things were better, some things were worse) but the general trend
>>> > lines were approximately what you would expect.
>>> >
>>> > Let me know if you have more questions, but we can probably start a
>>> > new thread for future search posterity! (This applies to everybody).
>>> >
>>> > Mike
>>> >
>>> >
>>> > On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
>>> > kepner@ll.mit.edu> wrote:
>>> >
>>> > > Mike Drob put together a great talk at the Accumulo Summit (
>>> > > http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
>>> > > Accumulo performance and HBase performance.  This is exactly the kind
>>> > > of work the entire Hadoop community needs to continue to move
>>> > > forward.
>>> > >
>>> > > I had one question about the talk which I was wondering if someone
>>> > > might be able to shed light on.  In the Accumulo part of the talk
>>> > > the experiments varied #rows while keeping the #cols fixed, while in
>>> > > the Accumulo/HBase part of the talk the experiments varied #cols while
>>> > > keeping #rows fixed?
>>>
>>
>>
>
>
> --
> Sean
>

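The sizing arithmetic quoted in Mike Drob's message (number of entries = rows * column families * column qualifiers per family; entry size = about 40 bytes of key overhead + configured value length) can be sketched roughly as below. This is only an illustration of that approximation: the function names, parameter names, and sample numbers are hypothetical and are not taken from YCSB or from the original experiments.

```python
# Rough sketch of the data-size approximation described in the thread.
# All names and sample numbers here are illustrative assumptions.

KEY_OVERHEAD_BYTES = 40  # "about 40 bytes" per the thread; an approximation


def approx_data_size(rows, qualifiers_per_family, value_length,
                     column_families=1, key_overhead=KEY_OVERHEAD_BYTES):
    """Approximate total bytes as (number of entries * entry size)."""
    entries = rows * column_families * qualifiers_per_family
    entry_size = key_overhead + value_length
    return entries * entry_size


def rows_for_target_size(target_bytes, qualifiers_per_family, value_length,
                         column_families=1, key_overhead=KEY_OVERHEAD_BYTES):
    """Row count that keeps total size near target_bytes when another knob changes."""
    bytes_per_row = (column_families * qualifiers_per_family
                     * (key_overhead + value_length))
    return target_bytes // bytes_per_row


# Hypothetical baseline: 1,000,000 rows, 10 qualifiers per row, 60-byte values.
baseline = approx_data_size(rows=1_000_000, qualifiers_per_family=10,
                            value_length=60)

# Doubling the qualifiers per row means halving the rows to hold size constant.
rows_needed = rows_for_target_size(baseline, qualifiers_per_family=20,
                                   value_length=60)
```

Holding the product constant this way forces two knobs to move together, which is the trade-off the message describes: either two properties of the data change at once, or one property and the total size.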