Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Date: Thu, 8 Sep 2011 00:19:07 +0530
From: Arvind Jayaprakash <work@anomalizer.net>
To: user@hbase.apache.org
Subject: Re: HBase Vs CitrusLeaf?
Message-ID: <20110907184907.GB3203@aa>
References: 
 <CAHXz3_H=hFG8EbjPbez8JYczOy6ak=QNh2h4hdxwgH=aRBbFdw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
 <CAHXz3_H=hFG8EbjPbez8JYczOy6ak=QNh2h4hdxwgH=aRBbFdw@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Sep 06, Something Something wrote:
>Anyway, before I spent a lot of time on it, I thought I should check if
>anyone has compared HBase against CitrusLeaf.  If you've, I would greatly
>appreciate it if you would share your experiences.

Disclaimer: I was an early evaluator/tester of citrusleaf about a year
ago when it was in its infancy. Though I am not affiliated with them in
any manner, I might be more benevolent to them than most readers of this
mailing list.

The short answer is that hbase & citrusleaf (called CL in the remainder
of this mail) are very different products.

CL cares a lot more about predictable latencies than hbase does. This is
manifested in two aspects of the design:

* It is heavily optimized for large RAM + SSD usage. While hbase does
a fair job of using RAM, I can say for sure that both the throughput and
latency trends are much better with CL in cases where spinning disks are
not used directly in the read/write path.

* Multiple machines can concurrently/actively handle requests for the
same key, so the loss of one server does not mean that a range of keys
is temporarily unavailable. An hbase cluster does have a partial,
temporary outage when a region server dies. Things don't get back to
normal immediately even when a new server takes over, since not all
region data may be readable from local disk any more. Even if it is, it
won't be readily waiting for you in fast memory.

* A third aspect that is more of a side-effect is that HDFS still has a
SPOF in the form of the namenode, which continues to be a cause for
concern wrt overall uptime guarantees.


Here is where hbase would do much better:

* It is designed for much larger data, to the point where it is natural
for the entire dataset to be much larger than the total available RAM
and for hard disks to be the primary storage medium.

* A bigtable implementation is also designed for both ranged scans and
full table scans. Last I recall, CL was more of a DHT, so ranged scans
are infeasible and doing full scans would qualify as much more than
shooting oneself in the foot.
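
To make the ranged scan point concrete, here is a minimal Python sketch.
It is not hbase or CL code; the function names, the md5-based placement
and the node count are all made up for illustration. A store that keeps
keys sorted answers a range query from one contiguous slice, while a
hash-partitioned store scatters adjacent keys across the cluster.

```python
import bisect
import hashlib

def sorted_range_scan(sorted_keys, start, stop):
    """Return keys in [start, stop) from a sorted list, bigtable-style."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

def hash_node(key, num_nodes):
    """Pick a node for a key by hashing it (hypothetical DHT placement)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

def nodes_touched_by_range(keys, start, stop, num_nodes):
    """Nodes holding at least one key in [start, stop) under hashing."""
    return {hash_node(k, num_nodes) for k in keys if start <= k < stop}

keys = sorted("user%04d" % i for i in range(1000))
hits = sorted_range_scan(keys, "user0100", "user0200")
touched = nodes_touched_by_range(keys, "user0100", "user0200", 8)
```

Here `hits` is a single contiguous slice of the sorted key list, whereas
`touched` spans several nodes. A real DHT would not even know in advance
which keys fall inside the range, so in practice every node has to be
asked.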


And here is where hbase has advantages in principle:

* As others mentioned, there are "textbook" advantages of using an open
source solution.

* hbase has definitely been run both for longer and on larger clusters
than CL possibly has.


While generalizations are dangerous, the one place where C++ code could
shine over java (the JVM really) is that one does not have to fight the
GC. I'd personally be more comfortable with handing off say 48GB of memory to
good C/C++ code than to the JVM. That being said, the folks working on hbase
have been actively addressing this problem to the extent possible in
pure java by allocating memstore data in large fixed-size chunks, which
limits heap fragmentation. Search for "mslab hbase" to learn more about it.
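
As I understand it, the idea behind mslab is to copy many small values
into a few large fixed-size chunks, so that memory is later released in
big contiguous pieces instead of leaving a fragmented heap behind. The
toy Python sketch below only illustrates that idea; it is not hbase
code, and the class name, method names and chunk size are assumptions
chosen for the example.

```python
class ChunkAllocator:
    """Toy slab allocator: small values are copied into big fixed-size chunks."""

    def __init__(self, chunk_size=2 * 1024 * 1024):
        self.chunk_size = chunk_size  # illustrative size, not a tuned value
        self.chunks = [bytearray(chunk_size)]
        self.offset = 0  # next free byte in the newest chunk

    def allocate(self, data):
        """Copy data into the current chunk; return (chunk_index, offset, length)."""
        if len(data) > self.chunk_size:
            raise ValueError("value larger than a chunk")
        if self.offset + len(data) > self.chunk_size:
            # Current chunk cannot fit the value: start a fresh chunk.
            self.chunks.append(bytearray(self.chunk_size))
            self.offset = 0
        index, start = len(self.chunks) - 1, self.offset
        self.chunks[index][start:start + len(data)] = data
        self.offset += len(data)
        return index, start, len(data)

    def read(self, ref):
        """Return a copy of the bytes a reference points at."""
        index, start, length = ref
        return bytes(self.chunks[index][start:start + length])
```

Dropping the allocator drops whole chunks at once, which is the property
that helps the collector. This toy simply rejects values larger than a
chunk; how a real implementation treats them is outside this sketch.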


My conclusion is that the two products address different problem spaces.
So I'd urge you to spend time understanding your access patterns and see
which one they map to more closely. Feel free to contact me off list
if you feel the need to ask anything that is not appropriate for the
mailing list but is relevant to this discussion.