hbase-user mailing list archives

From Aditya Sharma <adityadsha...@gmail.com>
Subject High variance in results for hbase benchmarking
Date Fri, 04 Mar 2011 06:19:43 GMT
Hi All,

I am benchmarking different data stores to find the best fit for our use
case. I would like to hear the views and suggestions of the HBase user and
developer community on some of my findings, as the results I am getting are
highly variable.

My HBase setup has two EC2 Large hosts (each with 7.5 GB memory, 4 CPU
cores, etc.), on which both the HBase master and slaves reside. The HDFS
master, HDFS slaves and ZooKeeper instances are also split between these two
hosts. I have three tables with one column family each; they have 100
million, 75 million and 500 million rows respectively. The actual data
consists of a String key with Long and String columns. The usual access
pattern is GETs on individual keys with periodic batch PUTs.

I ran my benchmark application on HBase for different scenarios to measure
pure GET performance, mixed GET and PUT performance, etc. This was without
enabling the HTable API's write buffer or any Bloom filters. The results
were quite unimpressive compared to similar benchmarking done using MySQL,
Cassandra, etc.: performance was anywhere from 40% to 100% worse. So I
started using the write buffer in my code and also enabled Bloom filters at
the ROW level. However, I then started seeing a lot of variance in the
benchmarking results (though I would not be too sure about correlating this
with Bloom filters or write buffering). Another cause for concern was that
the results were actually worse than the earlier results.
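
To put a number on "a lot of variance", each scenario can be repeated
several times and the spread summarized across runs. The following is only a
minimal sketch, assuming each run yields a single throughput figure in
operations per second; the function name `summarize_runs` and the sample
values in the usage line are hypothetical and are not measurements from this
benchmark.

```python
import statistics


def summarize_runs(throughputs):
    """Summarize repeated benchmark runs.

    `throughputs` is a list of per-run throughput values (ops/sec).
    At least two runs are needed for a sample standard deviation.
    """
    mean = statistics.mean(throughputs)
    stdev = statistics.stdev(throughputs)  # sample standard deviation
    return {
        "mean": mean,
        "stdev": stdev,
        # Coefficient of variation: spread relative to the mean, which
        # makes runs with different absolute throughput comparable.
        "cv": stdev / mean,
        "min": min(throughputs),
        "max": max(throughputs),
    }


if __name__ == "__main__":
    # Hypothetical placeholder values, not real results.
    print(summarize_runs([90.0, 100.0, 110.0]))
```

Comparing the coefficient of variation before and after enabling the write
buffer and Bloom filters would show whether the spread really grew, or
whether only the mean shifted.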

Since we are using EC2 Large instances, it seems unlikely that a network or
other virtualization-related resource crunch is affecting our performance
measurements.

What I would like to know is whether this rings a bell for anyone else here.
Could I be missing some configuration knob that causes background
compaction or a similar process to start at the wrong time and affect my
benchmarks? Any comments or feedback are welcome.
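
One way to check whether slow periods cluster in time, as they would if a
background process kicked in mid-run, is to bucket per-operation latencies
into fixed time windows and flag the windows that are far slower than the
run as a whole. This is a rough sketch under stated assumptions: the
benchmark client records a `(timestamp_seconds, latency_ms)` pair per
operation, and the name `slow_windows`, the 10-second window and the 3x
threshold are all hypothetical choices, not HBase settings.

```python
import statistics
from collections import defaultdict


def slow_windows(samples, window_s=10.0, factor=3.0):
    """Return start times (seconds) of windows with unusually high latency.

    `samples` is an iterable of (timestamp_seconds, latency_ms) pairs.
    A window is flagged when its median latency exceeds `factor` times
    the median latency of the whole run.
    """
    samples = list(samples)
    if not samples:
        return []
    overall = statistics.median(lat for _, lat in samples)

    # Group latencies by the index of the window their timestamp falls in.
    buckets = defaultdict(list)
    for ts, lat in samples:
        buckets[int(ts // window_s)].append(lat)

    return [
        idx * window_s
        for idx, lats in sorted(buckets.items())
        if statistics.median(lats) > factor * overall
    ]
```

Flagged windows only indicate that something slowed the run at those times.
They would still need to be matched against the region server logs before
blaming compactions or any other background activity.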

