accumulo-user mailing list archives

From Chuck Adams <chuck.ad...@oracle.com>
Subject RE: How does Accumulo compare to HBase
Date Fri, 11 Jul 2014 11:54:58 GMT
Hi Sean – 

 

Sorry I told Dr. Kepner that no one cares, because obviously some do!

 

I understand that there is a need to perform initial data loads and all products have the
ability to bulk load data for this purpose.  In this case, CPU is the limiting factor due
to parsing and compressing of the data during the load.  Add more CPU (nodes) and data segments
to scale.  

 

Once the data is loaded and indexed to meet the system query SLAs, incremental data load and
aging data off while supporting simultaneous queries, incremental indexing, query concurrency,
ILM, and meeting the query performance SLAs introduce NON-CPU bottlenecks.   Here are a couple
of real-world use cases I see everywhere, every day:

 

1. 24x7 user and system-to-system query access to the corpus

2. 24x7 data stream loading from multiple sources into the corpus

3. 24x7 data roll off into a well-defined ILM strategy

4. SLA for immediate access to new data.  For example, no data staging and bulk load.

5. SLA for query performance.  For example, no queuing.

 

Say hi to Rob Morrow for me.  We miss him!

 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 



Oracle

Innovative technologies enabling the world's best intelligence

Chuck Adams 

Vice President Technical Leadership Team
National Security Group
1910 Oracle Way
Reston, VA 20190

Cell:     301.529.9396  
Email: chuck.adams@oracle.com

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

 

 

From: Sean Busbey [mailto:busbey@cloudera.com] 
Sent: Thursday, July 10, 2014 1:55 PM
To: Accumulo User List
Subject: Re: How does Accumulo compare to HBase

 

 

Chuck,

 

It would help the community and my own benchmarking efforts if you could describe how you
think a benchmark might incorporate representations of real-world bottlenecks.

 

Do you think YCSB sufficiently covers the kind of testing you'd prefer?

 

 

Marc,

 

Similarly, it would help if you could describe the use case(s) behind your statement of interest.

 

-Sean

 

 

On Thu, Jul 10, 2014 at 12:11 PM, Marc Parisi <marc@accumulo.net> wrote:

I care 

 

On Thu, Jul 10, 2014 at 11:33 AM, Chuck Adams <chuck.adams@oracle.com> wrote:

Dr. Kepner,

Who cares how fast you can load data into a non-indexed HBase or Accumulo database?  What
is the strategy to handle user queries against this corpus?  Run some real world tests which
include simultaneous queries, index maintenance, information life cycle management, during
the initial and incremental data loads.

The tests you are running do not appear to have any of the real world bottlenecks that occur
for production systems that users rely on for their business.

Respectfully,
Chuck Adams 

Vice President Technical Leadership Team
Oracle National Security Group
1910 Oracle Way
Reston, VA 20190

Cell:     301.529.9396  
Email: chuck.adams@oracle.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 



-----Original Message-----
From: Jeremy Kepner [mailto:kepner@ll.mit.edu]
Sent: Thursday, July 10, 2014 11:08 AM
To: user@accumulo.apache.org
Subject: Re: How does Accumulo compare to HBase

If you repeated the experiments you did for the Accumulo only portion with HBase what would
you expect the results to be?

On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
> At the risk of derailing the original thread, I'll take a moment to
> explain my methodology. Since each entry can be thought of as a 6
> dimensional vector (r, cf, cq, vis, ts, val) there's a lot of room for
> fiddling with the specifics of it. YCSB gives you several knobs, but
> unfortunately, not absolutely everything was tunable.
>
> The things that are configurable:
> - # of rows
> - # of column qualifiers
> - length of value
> - number of operations per client
> - number of clients
>
> Things that are not configurable:
> - row length (it's a long int)
> - # of column families (constant at one per row)
> - length of column qualifier (basically a one-up counter per row)
> - visibilities
>
> In all of my experiments, the goal was to keep data size constant.
> This can be approximated by (number of entries * entry size). Number
> of entries is intuitively rows (configurable) * column families (1) *
> column qualifiers per family (configurable), while entry size is key
> overhead (about 40 bytes) + configured length of value. So to keep total size constant,
> we have three easy knobs. However, tweaking three values at a time
> produces really messy data where you're not always going to be sure
> where the causality arrow lies. Even doing two at a time can cause
> issues but then the choice is between tweaking two properties of the
> data, or one property of the data and the total size (which is also a relevant attribute).
>
> Whew. So why did I use two different independent variables between the
> two halves?
>
> Partly, because I'm not comparing the two tests to each other, so they
> don't have to be duplicative. I ran them on different hardware from
> each other, with different number of clients, disks, cores, etc.
> There are no meaningful comparisons to be drawn, so I wanted to remove
> the temptation to compare results against each other. I'll admit that
> I might be wrong in this regard.
>
> The graphs are not my complete data sets. For the Accumulo v Accumulo
> tests, we have about ten more data points varying rows and data size
> as well. Trying to show three independent variables on a graph was
> pretty painful, so they didn't make it into the presentation. The
> short version of the story is that nothing scaled linearly (some
> things were better, some things were worse) but the general trend
> lines were approximately what you would expect.
>
> Let me know if you have more questions, but we can probably start a
> new thread for future search posterity! (This applies to everybody).
>
> Mike
>
>
> On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
> kepner@ll.mit.edu> wrote:
>
> > Mike Drob put together a great talk at the Accumulo Summit (
> > http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
> > Accumulo performance and HBase performance.  This is exactly the kind
> > of work the entire Hadoop community needs to continue to move forward.
> >
> > I had one question about the talk which I was wondering if someone
> > might be able to shed light on.  In the Accumulo part of the talk
> > the experiments varied #rows while keeping the #cols fixed, while in
> > the Accumulo/HBase part of the talk the experiments varied #cols
> > while keeping #rows fixed?
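
Editor's note: the sizing estimate Mike Drob describes earlier in the thread (total size is roughly number of entries * entry size, with one column family per row and about 40 bytes of key overhead per entry) can be sketched as below. This is only an illustration of that arithmetic, not code from the talk or from YCSB. The function names and the example numbers are hypothetical, and the 40-byte overhead is the rough figure quoted in the thread, not a measured constant.

```python
# Rough data-size model from the thread:
#   entries    = rows * column families (fixed at 1) * column qualifiers per row
#   entry size = key overhead (about 40 bytes) + configured value length
# All names and example numbers here are illustrative assumptions.

KEY_OVERHEAD_BYTES = 40  # approximate figure quoted in the thread


def entry_count(rows, qualifiers_per_row, families_per_row=1):
    """Number of key/value entries written for the given table shape."""
    return rows * families_per_row * qualifiers_per_row


def estimated_size_bytes(rows, qualifiers_per_row, value_length):
    """Approximate total data size: number of entries * entry size."""
    entry_size = KEY_OVERHEAD_BYTES + value_length
    return entry_count(rows, qualifiers_per_row) * entry_size


def rows_for_target(target_bytes, qualifiers_per_row, value_length):
    """Row count that keeps total size near target_bytes for a given shape."""
    bytes_per_row = qualifiers_per_row * (KEY_OVERHEAD_BYTES + value_length)
    return target_bytes // bytes_per_row


if __name__ == "__main__":
    baseline = estimated_size_bytes(rows=1_000_000, qualifiers_per_row=10,
                                    value_length=100)
    print(baseline)
    # Doubling the qualifiers per row means halving the rows to hold size constant.
    print(rows_for_target(baseline, qualifiers_per_row=20, value_length=100))
```

Holding the estimate constant while changing one knob forces a compensating change in another, which is the trade-off described in the thread: either two properties of the data move together, or one property moves along with the total size.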

 





 

-- 

Sean
