accumulo-user mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: How does Accumulo compare to HBase
Date Thu, 17 Jul 2014 10:37:27 GMT
Hey Ted,

You're right, specialized filters such as ColumnRangeFilter and
ColumnPrefixFilter are (or should be) as efficient as a row scan. Here's
the source of that information:

  http://hadoop-hbase.blogspot.com/2012/01/hbase-intra-row-scanning.html

So the API is not limited; we can do range scans on both RK and CQ.

More general filters such as QualifierFilter will need to scan the whole
row. Here's the source for that:


http://stackoverflow.com/questions/20837458/hbase-qualifier-filter-vs-value-filter-performance
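To make the distinction concrete, here's a toy Python sketch (a model of sorted KeyValues, not actual HBase code; all names are illustrative) of why a column-range filter can seek straight to its first matching cell while a generic qualifier filter must evaluate every cell in the row:

```python
import bisect

# Model one row's cells as a sorted list of column qualifiers,
# the way HBase keeps KeyValues sorted within a row.
qualifiers = [f"cq{i:04d}" for i in range(10_000)]

def column_range_scan(cqs, start, stop):
    """Seek to the first qualifier >= start, stop at stop (exclusive)."""
    i = bisect.bisect_left(cqs, start)  # O(log n) seek, like SEEK_NEXT_USING_HINT
    out = []
    while i < len(cqs) and cqs[i] < stop:
        out.append(cqs[i])
        i += 1
    return out

def qualifier_filter_scan(cqs, predicate):
    """A generic filter has no hint: it must test every cell in the row."""
    return [cq for cq in cqs if predicate(cq)]

# Both return the same cells, but the range scan touches ~10 of them
# while the generic filter touches all 10,000.
wanted = column_range_scan(qualifiers, "cq0100", "cq0110")
assert wanted == qualifier_filter_scan(
    qualifiers, lambda c: "cq0100" <= c < "cq0110")
```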


Now it all makes sense to me, thanks for the challenge! :)

Jianshi





On Thu, Jul 17, 2014 at 4:09 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> W.r.t. HBase filter's performance, can you let us know the source of the
> information ?
> Is performance bad for all filters or some types of filters ?
>
> Cheers
>
> On Jul 16, 2014, at 11:48 PM, Jianshi Huang <jianshi.huang@gmail.com>
> wrote:
>
> I finally decided to make the Storage part agnostic and wrote a HBase
> adapter as well.
>
> I'd like to say that Accumulo's API makes more sense than HBase's.
>
> One example is scan. In Accumulo, a scan can take a Range, which in turn
> takes a startKey and an endKey; a Key is a combination of RK+CF+CQ+CV+TS.
> This worked perfectly and really frees the CQ up to be a magic field in
> schema design.
>
> In HBase, a scan can only take a startRow and an endRow, which both
> indicate the RK part. Scanning through a range of columns requires a
> ColumnRangeFilter AFAIK. (I heard HBase filters' performance sucks; not
> sure why.)
>
> Do any HBase experts disagree? :)
>
> Jianshi
>
>
>
>
> On Fri, Jul 11, 2014 at 9:56 PM, Kepner, Jeremy - 0553 - MITLL <
> kepner@ll.mit.edu> wrote:
>
>> In many communities real applications can be hard to come by.  These
>> communities often sponsor the development of application/technology
>> agnostic benchmarks as a surrogate.  YCSB is one example.  Graph500 is
>> another example.  This allows a technology developer to focus on their own
>> technology and not have to be in the awkward position of having to also
>> benchmark multiple technologies simultaneously.  It's not a perfect system,
>> but in general, a few mediocre benchmarks tend to be a lot better than no
>> benchmarks (too many benchmarks is also a problem).  The benchmarks also
>> help with communicating the work because you can just reference the
>> benchmark.
>>
>> Now the tricky part is educating the customer base in how to interpret
>> the benchmarks.  To a certain degree this is simply the burden of being an
>> informed consumer, but we can do as much as we can to help them.  Using
>> standard benchmarks and showing how they correlate with some applications
>> and don't correlate with other applications is our obligation.
>>
>> On Jul 11, 2014, at 1:05 AM, Josh Elser <josh.elser@gmail.com> wrote:
>>
>> > It's important to remember that YCSB is a benchmark designed to test a
>> specific workload across database systems.
>> >
>> > No benchmark is going to be representative of every "real-life"
>> workload that can be thought of. Understand what the benchmark is showing
>> and draw realistic conclusions about what the results are showing you WRT
>> the problem(s) you care about.
>> >
>> > On 7/10/14, 11:33 AM, Chuck Adams wrote:
>> >> Dr. Kepner,
>> >>
>> >> Who cares how fast you can load data into a non-indexed HBase or
>> Accumulo database?  What is the strategy to handle user queries against
>> this corpus?  Run some real world tests which include simultaneous queries,
>> index maintenance, information life cycle management, during the initial
>> and incremental data loads.
>> >>
>> >> The tests you are running do not appear to have any of the real world
>> bottlenecks that occur for production systems that users rely on for their
>> business.
>> >>
>> >> Respectfully,
>> >> Chuck Adams
>> >>
>> >> Vice President Technical Leadership Team
>> >> Oracle National Security Group
>> >> 1910 Oracle Way
>> >> Reston, VA 20190
>> >>
>> >> Cell:     301.529.9396
>> >> Email: chuck.adams@oracle.com
>> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Jeremy Kepner [mailto:kepner@ll.mit.edu]
>> >> Sent: Thursday, July 10, 2014 11:08 AM
>> >> To: user@accumulo.apache.org
>> >> Subject: Re: How does Accumulo compare to HBase
>> >>
>> >> If you repeated the experiments you did for the Accumulo only portion
>> with HBase what would you expect the results to be?
>> >>
>> >> On Thu, Jul 10, 2014 at 10:02:00AM -0500, Mike Drob wrote:
>> >>> At the risk of derailing the original thread, I'll take a moment to
>> >>> explain my methodology. Since each entry can be thought of as a 6
>> >>> dimensional vector (r, cf, cq, vis, ts, val) there's a lot of room for
>> >>> fiddling with the specifics of it. YCSB gives you several knobs, but
>> >>> unfortunately, not absolutely everything was tunable.
>> >>>
>> >>> The things that are configurable:
>> >>> - # of rows
>> >>> - # of column qualifiers
>> >>> - length of value
>> >>> - number of operations per client
>> >>> - number of clients
>> >>>
>> >>> Things that are not configurable:
>> >>> - row length (it's a long int)
>> >>> - # of column families (constant at one per row)
>> >>> - length of column qualifier (basically a one-up counter per row)
>> >>> - visibilities
>> >>>
>> >>> In all of my experiments, the goal was to keep data size constant.
>> >>> This can be approximated by (number of entries * entry size). Number
>> >>> of entries is intuitively rows (configurable) * column families (1) *
>> >>> column qualifiers per family (configurable), while entry size is key
>> >>> overhead (about 40 bytes) + configured length of value. So to keep
>> >>> total size constant, we have three easy knobs. However, tweaking
>> >>> three values at a time produces really messy data where you're not
>> >>> always going to be sure where the causality arrow lies. Even doing
>> >>> two at a time can cause issues, but then the choice is between
>> >>> tweaking two properties of the data, or one property of the data and
>> >>> the total size (which is also a relevant attribute).
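Mike's sizing formula can be written out directly. This sketch just restates his arithmetic (the 40-byte key overhead is his stated approximation; the specific configurations are made up for illustration), showing how two knobs can be traded off to hold total size constant:

```python
KEY_OVERHEAD = 40  # approximate per-entry key overhead, in bytes

def total_size(rows, cqs_per_row, value_len, cfs=1):
    """Total data size = number of entries * entry size."""
    entries = rows * cfs * cqs_per_row
    return entries * (KEY_OVERHEAD + value_len)

# Two hypothetical configurations with the same total size, reached by
# trading row count against value length (two of the three knobs):
a = total_size(rows=1_000_000, cqs_per_row=10, value_len=100)
b = total_size(rows=2_000_000, cqs_per_row=10, value_len=30)
assert a == b  # both 1.4e9 bytes
```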
>> >>>
>> >>> Whew. So why did I use two different independent variables between the
>> >>> two halves?
>> >>>
>> >>> Partly, because I'm not comparing the two tests to each other, so they
>> >>> don't have to be duplicative. I ran them on different hardware from
>> >>> each other, with different number of clients, disks, cores, etc.
>> >>> There's no meaningful comparisons to be drawn, so I wanted to remove
>> >>> the temptation to compare results against each other. I'll admit that
>> >>> I might be wrong in this regard.
>> >>>
>> >>> The graphs are not my complete data sets. For the Accumulo v Accumulo
>> >>> tests, we have about ten more data points varying rows and data size
>> >>> as well. Trying to show three independent variables on a graph was
>> >>> pretty painful, so they didn't make it into the presentation. The
>> >>> short version of the story is that nothing scaled linearly (some
>> >>> things were better, some things were worse) but the general trend
>> >>> lines were approximately what you would expect.
>> >>>
>> >>> Let me know if you have more questions, but we can probably start a
>> >>> new thread for future search posterity! (This applies to everybody).
>> >>>
>> >>> Mike
>> >>>
>> >>>
>> >>> On Thu, Jul 10, 2014 at 9:26 AM, Kepner, Jeremy - 0553 - MITLL <
>> >>> kepner@ll.mit.edu> wrote:
>> >>>
>> >>>> Mike Drob put together a great talk at the Accumulo Summit (
>> >>>> http://www.slideshare.net/AccumuloSummit/10-30-drob) discussing
>> >>>> Accumulo performance and HBase performance.  This is exactly the kind
>> >>>> of work the entire Hadoop community needs to continue to move
>> forward.
>> >>>>
>> >>>> I had one question about the talk which I was wondering if someone
>> >>>> might be able to shed light on.  In the Accumulo part of the talk,
>> >>>> the experiments varied #rows while keeping #cols fixed, while in
>> >>>> the Accumulo/HBase part of the talk the experiments varied #cols
>> >>>> while keeping #rows fixed?
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
