hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: HBase and Cassandra on StackOverflow
Date Tue, 30 Aug 2011 09:47:51 GMT
Hi Chris,

Appreciate your answer on the post.

Personally speaking, however, the endless Cassandra vs. HBase discussion is tiresome, and
blog posts or emails in this regard rarely shed any light. Often, Cassandra proponents misstate
their case out of ignorance of HBase or due to commercial or personal agendas. It is difficult
to find clear-eyed analysis among the partisans. I'm not sure it will make any difference
posting a rebuttal to some random thing jbellis says. Better to focus on improving HBase than
play whack-a-mole.

Regarding some of the specific points in that post:

HBase is proven in production deployments larger than the largest publicly reported Cassandra
cluster, ~1K versus 400 or 700 or somesuch. But basically this is the same order of magnitude,
with HBase having a slight edge. I don't see a meaningful difference here. Stating otherwise
is false.

HBase supports replication between clusters (i.e. data centers). I believe, but admit I'm
not super familiar with the Cassandra option here, that the main difference is that HBase provides
a simple mechanism and users must build a replication architecture that works for them, while
Cassandra attempts to hide some of that complexity. I do not know if they succeed there, but
large scale cross data center replication is rarely one size fits all, so I doubt it.

Cassandra does not have strong consistency in the sense that HBase provides. It can provide
strong consistency, but at the cost of failing any read if there is insufficient quorum. HBase/HDFS
does not have that limitation. On the other hand, HBase has its own and different scenarios
where data may not be immediately available. The differences between the systems are nuanced
and which to use depends on the use case requirements.
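To make the quorum point concrete, here is a minimal sketch of a Dynamo-style quorum read. It is a toy model under stated assumptions, not Cassandra's actual code or API: the names `quorum_read` and `QuorumUnavailable` are hypothetical, each replica is modeled as a `(timestamp, value)` pair, and an unreachable replica is modeled as `None`.

```python
class QuorumUnavailable(Exception):
    """Raised when fewer than r replicas answer (hypothetical name)."""


def quorum_read(replicas, r):
    """Return the newest value among the first r replicas that respond.

    replicas: list of (timestamp, value) tuples, or None for a replica
              that is unreachable. This representation is an assumption
              made for illustration only.
    r:        number of replicas that must answer for the read to succeed.
    """
    responses = [rep for rep in replicas if rep is not None]
    if len(responses) < r:
        # Insufficient quorum: the read fails rather than risk stale data.
        raise QuorumUnavailable(f"needed {r} replicas, got {len(responses)}")
    # Pick the value with the highest timestamp among r responses.
    newest = max(responses[:r], key=lambda tv: tv[0])
    return newest[1]
```

With three replicas and r = 2, one unreachable replica is tolerated, but two unreachable replicas cause the read to fail outright, which is the cost described above.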

Cassandra's RandomPartitioner / hash based partitioning means efficient MapReduce or table
scanning is not possible, whereas HBase's distributed ordered tree is naturally efficient
for such use cases, which I believe explains why Hadoop users often prefer it. This may or may
not be a problem for any given use case. Using an ordered partitioner with Cassandra used
to require frequent manual rebalancing to avoid blowing up nodes. I don't know if more recent
versions still have this misfeature.
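The scan-efficiency difference can be illustrated with a toy model. This is only a sketch under assumed parameters (four nodes, single-letter split points, MD5 as the hash); none of these names or values come from either system. With hash placement, keys in a range can land anywhere, so a range scan must ask every node; with ordered placement, only the nodes whose key ranges overlap the scan are contacted.

```python
import bisect
import hashlib

NUM_NODES = 4                    # assumed cluster size for the example
SPLIT_POINTS = ["g", "n", "t"]   # assumed boundaries: node 0 holds keys < "g",
                                 # node 1 holds ["g", "n"), and so on


def hash_node(key):
    """Hash-based placement: adjacent keys usually land on unrelated nodes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_NODES


def ordered_node(key):
    """Ordered placement: each node owns one contiguous key range."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def nodes_for_scan_hashed(start, stop):
    """Keys in [start, stop) may hash to any node, so all must be scanned."""
    return set(range(NUM_NODES))


def nodes_for_scan_ordered(start, stop):
    """Only nodes whose range overlaps [start, stop) need to be scanned."""
    first = ordered_node(start)
    last = bisect.bisect_left(SPLIT_POINTS, stop)
    return set(range(first, last + 1))
```

In this model a scan over ["h", "m") touches one node under ordered placement and all four under hash placement. The flip side, as noted above, is that ordered placement can concentrate load on a few nodes unless ranges are rebalanced.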

Cassandra is no less complex than HBase. Its complexity is merely "hidden": with
Hadoop/HBase the layering is obvious -- HDFS, HBase, etc. -- but the Cassandra internals
are no less layered. An impartial analysis of implementation and algorithms will reveal that
Cassandra's theory of operation in its full detail is substantially more complex. Compare
the BigTable and Dynamo papers and this is clear. There are actually more opportunities for
something to go wrong with Cassandra.

While we are looking at codebases, it should be noted that HBase has substantially more unit

With Cassandra, all RPC is via Thrift with various wrappers, so actually all Cassandra clients
are second class in the sense that jbellis means when he states "Non-Java clients are not
second-class citizens".

The master-slave versus peer-to-peer argument is larger than Cassandra vs. HBase, and not
nearly as one-sided as claimed. The famous (infamous?) global failure of Amazon's S3 in 2008,
a fully peer-to-peer system, due to a single flipped bit in a gossip message demonstrates
how in peer-to-peer systems every node can be a single point of failure. There is no obvious
winner; instead, there is a series of trade-offs. Claiming otherwise is intellectually dishonest. Master-slave
architectures seem easier to operate and reason about in my experience. Of course, I'm partial

I have just scratched the surface.

Best regards,

       - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

>From: Chris Tarnas <cft@email.com>
>To: hbase-user@hadoop.apache.org
>Sent: Tuesday, August 30, 2011 2:02 PM
>Subject: HBase and Cassandra on StackOverflow
>Someone with better knowledge than me might be interested in helping answer this question
>over at StackOverflow: