hbase-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: High variance in results for hbase benchmarking
Date Fri, 04 Mar 2011 15:20:05 GMT
Is your insert path multi-threaded?
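
A single-threaded client waits for each request to return before sending the next, so its throughput is capped at roughly one over the per-request latency. The sketch below is only an illustration of how one might check this, not the benchmark from this thread: `fake_put` is a hypothetical stand-in for a blocking write (it just sleeps, it is not an HBase call), and the thread counts and trial count are assumptions. It measures operations per second over several trials and reports the mean and standard deviation, so both throughput and run-to-run variance are visible.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_put(_key):
    """Hypothetical stand-in for one blocking client write (not an HBase call)."""
    time.sleep(0.002)  # simulate roughly 2 ms of round-trip latency


def ops_per_second(op, n_ops, n_threads):
    """Run op n_ops times on n_threads workers and return the achieved rate."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() forces every submitted call to finish before the timer stops
        list(pool.map(op, range(n_ops)))
    elapsed = time.perf_counter() - start
    return n_ops / elapsed


def summarize(op, n_ops, n_threads, trials=5):
    """Repeat the measurement; return (mean, sample standard deviation)."""
    rates = [ops_per_second(op, n_ops, n_threads) for _ in range(trials)]
    return statistics.mean(rates), statistics.stdev(rates)


if __name__ == "__main__":
    for threads in (1, 8):
        mean, stdev = summarize(fake_put, n_ops=200, n_threads=threads)
        print(f"{threads} thread(s): {mean:.0f} ops/s (stdev {stdev:.0f})")
```

If the multi-threaded rate is several times the single-threaded one, the client rather than the server was probably the bottleneck in the single-threaded run.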

On Thu, Mar 3, 2011 at 10:51 PM, Aditya Sharma <adityadsharma@gmail.com> wrote:

> It was quite variable, as I said earlier, but in one fairly representative
> read-only benchmark, it was 115 READs per second. For a READ + WRITE
> benchmark, it was 90 operations per second (with some primitive caching
> thrown in).
> Aditya
> On Fri, Mar 4, 2011 at 11:54 AM, Ted Dunning <tdunning@maprtech.com> wrote:
>> What kinds of speeds are you seeing?
>> On Thu, Mar 3, 2011 at 10:19 PM, Aditya Sharma <adityadsharma@gmail.com> wrote:
>>> Hi All,
>>> I am working on benchmarking different data stores to find the best fit
>>> for our use case. I would like to know the views and suggestions of the
>>> HBase user and developer community on some of my findings, as the results
>>> I am getting are highly variable.
>>> My HBase setup has two EC2 Large hosts (each one has 7.5 GB memory, 4 CPU
>>> cores etc), on which both the HBase master and slaves reside. HDFS master,
>>> slave, and ZooKeeper instances are also split between these two hosts. I
>>> have three tables with one column family each, and they have 100 million,
>>> 75 million and 500 million rows respectively. The actual data consists of
>>> a String key and Long, String columns. The usual access pattern is to have
>>> GETs on individual keys and periodic batch PUTs.
>>> I ran my benchmark application on HBase for different scenarios to
>>> measure pure GET performance, mixed GET and PUT performance, etc. This
>>> was without enabling the HTable API's writeBuffer or any BloomFilters.
>>> The results I got were quite unimpressive compared to similar
>>> benchmarking done using MySQL, Cassandra, etc. The performance was
>>> anywhere from 40% to 100% worse. So I started using writeBuffers in my
>>> code and also enabled BloomFilters at ROW level. However, I started
>>> seeing a lot of variance in the benchmarking results (though I would not
>>> be too sure about correlating this with BloomFilters/write buffering).
>>> Another cause for concern was that the results were actually worse than
>>> the earlier results.
>>> Since we are using EC2 Large instances, it seems unlikely that the network
>>> or some other virtualization-related resource crunch is affecting our
>>> performance measurement.
>>> What I would like to know is whether this rings a bell for anyone else
>>> here. Could I be missing some configuration knob, so that background
>>> compaction or some such process starts at the wrong time and affects my
>>> benchmarks? Any comments or feedback are welcome.
>>> Thanks,
>>> Aditya
