hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: HBase and Cassandra on StackOverflow
Date Tue, 30 Aug 2011 23:55:43 GMT
On Tue, Aug 30, 2011 at 10:42 AM, Joe Pallas <joseph.pallas@oracle.com> wrote:
> On Aug 30, 2011, at 2:47 AM, Andrew Purtell wrote:
>> Better to focus on improving HBase than play whack a mole.
> Absolutely.  So let's talk about improving HBase.  I'm speaking here as someone who
> has been learning about and experimenting with HBase for more than six months.
>> HBase supports replication between clusters (i.e. data centers).
> That's … debatable.  There's replication support in the code, but several times
> in the recent past when someone asked about it on this mailing list, the response was "I
> don't know of anyone actually using it."  My understanding of replication is that you can't
> replicate any existing data, so unless you activated it on day one, it isn't very useful.
> Do I misunderstand?
>> Cassandra does not have strong consistency in the sense that HBase provides. It can
>> provide strong consistency, but at the cost of failing any read if there is insufficient quorum.
>> HBase/HDFS does not have that limitation. On the other hand, HBase has its own and different
>> scenarios where data may not be immediately available. The differences between the systems
>> are nuanced and which to use depends on the use case requirements.
> That's fair enough, although I think your first two sentences nearly contradict each
> other :-).  If you use N=3, W=3, R=1 in Cassandra, you should get similar behavior to HBase/HDFS
> with respect to consistency and availability ("strong" consistency and reads do not fail if
> any one copy is available).
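The quorum arithmetic behind that claim: in a Dynamo-style system, a read is guaranteed to observe the latest write exactly when every read quorum overlaps every write quorum, i.e. R + W > N. A quick sketch of the rule (plain Java, not any Cassandra API; the class and method names are mine):

```java
public class QuorumCheck {
    // Strong consistency holds when any read quorum must intersect
    // any write quorum, which is exactly the condition R + W > N.
    static boolean stronglyConsistent(int n, int w, int r) {
        return r + w > n;
    }

    public static void main(String[] args) {
        // Joe's example: N=3, W=3, R=1 -- every write hits all replicas,
        // so a read from any single live replica is up to date.
        System.out.println(stronglyConsistent(3, 3, 1)); // true
        // Classic majority quorums: N=3, W=2, R=2.
        System.out.println(stronglyConsistent(3, 2, 2)); // true
        // N=3, W=1, R=1 gives only eventual consistency.
        System.out.println(stronglyConsistent(3, 1, 1)); // false
    }
}
```

Note the trade-off embedded in W=3: writes now fail if any one replica is down, which is the availability cost mentioned above.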

This is on the surface true, but there are a few HBase use cases that
Cassandra has a harder time supporting:
- increment counters
- CAS (check-and-set) calls

Some people find these essential to building systems.
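For readers who haven't used those calls: the point is that the increment and the check-and-set happen atomically on the server holding the row. A minimal in-memory model of the semantics (plain Java sketch over a map, not the HBase client API; the class and method names are mine):

```java
import java.util.concurrent.ConcurrentHashMap;

public class AtomicCellOps {
    // Models one column's cells keyed by row, like a single HBase column.
    private final ConcurrentHashMap<String, Long> cells = new ConcurrentHashMap<>();

    // Semantics of an atomic increment: read-modify-write in one step,
    // returning the new value. No client-side read needed.
    long increment(String row, long amount) {
        return cells.merge(row, amount, Long::sum);
    }

    // Semantics of check-and-set: write newValue only if the current
    // value equals expected; returns whether the write happened.
    boolean checkAndPut(String row, Long expected, long newValue) {
        if (expected == null) {
            return cells.putIfAbsent(row, newValue) == null;
        }
        return cells.replace(row, expected, newValue);
    }

    public static void main(String[] args) {
        AtomicCellOps table = new AtomicCellOps();
        System.out.println(table.increment("pageviews", 1));        // 1
        System.out.println(table.increment("pageviews", 5));        // 6
        System.out.println(table.checkAndPut("pageviews", 6L, 100)); // true
        System.out.println(table.checkAndPut("pageviews", 6L, 200)); // false: value is now 100
    }
}
```

Emulating either call client-side in an eventually consistent store requires a read followed by a conditional write, which races with concurrent writers; that is why these are harder to support without server-side atomicity.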

> A more important point, I think, is the one about storage.  HBase uses two different
> kinds of files, data files and logs, but HDFS doesn't know about that and cannot, for example,
> optimize data files for write throughput (and random reads) and log files for low latency
> sequential writes.  (For example, how could performance be improved by adding solid-state
> storage?)

I think "HDFS doesn't know about that and cannot... optimize" is a bit
of an overstatement... While it is TRUE that currently HDFS does not
do anything, there is no reason why it couldn't do something better.
Adding SSD in an intelligent way would be nice.  Probably not for logs.

Will HDFS ever focus on these things?  Probably in the mid-term; I'm
guessing we'll start to see attention on this towards the end of 2012,
or possibly not at all (after all, these things don't help MapReduce, so
why bother?)

If an alternate DFS were able to work on these issues, it could very
quickly differentiate itself from HDFS in terms of HBase support.

>> Cassandra's RandomPartitioner / hash based partitioning means efficient MapReduce
>> or table scanning is not possible, whereas HBase's distributed ordered tree is naturally efficient
>> for such use cases, I believe explaining why Hadoop users often prefer it. This may or may
>> not be a problem for any given use case.
> I don't think you can make a blanket statement that random partitioning makes efficient
> MapReduce impossible (scanning, yes).  Many M/R tasks process entire tables.  Random partitioning
> has definite advantages for some cases, and HBase might well benefit from recognizing that
> and adding some support.
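The scan distinction can be made concrete: under range (ordered) partitioning, a key-range scan touches only the partitions whose ranges intersect it, while under hash partitioning every partition may hold keys from the range, so all of them must be consulted. A toy model (plain Java; the partition count and partitioner functions are illustrative, not either system's real code):

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntUnaryOperator;

public class PartitionScan {
    static final int PARTITIONS = 8;

    // Hash-style partitioner: placement by hash of the key.
    static int hashPartition(int key) {
        return Math.floorMod(Integer.hashCode(key), PARTITIONS);
    }

    // Range-style partitioner: contiguous key ranges per partition,
    // assuming keys are spread over 0..maxKey.
    static int rangePartition(int key, int maxKey) {
        return Math.min(PARTITIONS - 1, key * PARTITIONS / (maxKey + 1));
    }

    // Which partitions does a scan of keys lo..hi have to visit?
    static Set<Integer> partitionsTouched(int lo, int hi, IntUnaryOperator partitioner) {
        Set<Integer> touched = new TreeSet<>();
        for (int k = lo; k <= hi; k++) touched.add(partitioner.applyAsInt(k));
        return touched;
    }

    public static void main(String[] args) {
        int maxKey = 9999;
        // Scan keys 1000..1999: a 10% slice of the keyspace.
        Set<Integer> ranged = partitionsTouched(1000, 1999, k -> rangePartition(k, maxKey));
        Set<Integer> hashed = partitionsTouched(1000, 1999, PartitionScan::hashPartition);
        System.out.println("range partitioning touches " + ranged.size() + " partition(s)"); // 2
        System.out.println("hash partitioning touches " + hashed.size() + " partition(s)");  // 8
    }
}
```

This also illustrates Joe's point: a full-table MapReduce job visits every partition under either scheme, so hash partitioning costs nothing there and in fact spreads load more evenly; it is only range scans that suffer.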
>> Cassandra is no less complex than HBase. All of this complexity is "hidden" in the
>> sense that with Hadoop/HBase the layering is obvious -- HDFS, HBase, etc. -- but the Cassandra
>> internals are no less layered.
> Operationally, however, HBase is more complex.  Admins have to configure and manage
> ZooKeeper, HDFS, and HBase.  Could this be improved?
>> With Cassandra, all RPC is via Thrift with various wrappers, so actually all Cassandra
>> clients are second class in the sense that jbellis means when he states "Non-Java clients
>> are not second-class citizens".
> That's disingenuous.  Thrift exposes all of the Cassandra API to all of the wrappers,
> while HBase clients who want to use all of the HBase API must use Java.  That can be fixed,
> but it is the status quo.
> joe
