cassandra-user mailing list archives

From Sal Fuentes <>
Subject Re: HBase vs. Cassandra: new article!
Date Thu, 29 Oct 2009 22:56:15 GMT
Thanks for taking the time to contribute these corrections, Jonathan. I
second Mr. Were's idea of perhaps having a wiki where we can host actual and
current facts as opposed to a blog post (that will hardly be updated as time
passes but will continue to perpetuate) about outdated and perhaps biased

Perhaps this "comparison" should be hosted on so that it won't
be perceived as subjective. Furthermore this should allow HBase devs and
users to contribute factual information as well. Lastly, this also has the
potential to contribute some healthy competition. I would be interested to
hear what you devs and users think about this as well.

On Thu, Oct 29, 2009 at 2:59 PM, Jonathan Ellis <> wrote:

> Okay, here are some corrections.  It's a bit choppy because it's just
> that; a list of corrections.
> Again, this is just trying to address factual errors; I disagree with
> many of the expressed opinions, too. :)
> > Cassandra relies mostly on Key-Value pairs for storage
> No more than hbase does.  Cassandra's columnfamily model does away
> with historical values, and adds supercolumns, but the two have a lot
> more in common with each other than with actual k/v stores.
> > it’s a fact that far more people are using HBase than Cassandra at this
> > moment
> While it's possible that more people are using HBase right now, with
> 90 people in the cassandra irc channel, and 55 in hbase, I'm
> comfortable that Cassandra's community is healthy.
> > despite both being similarly recent
> HBase is roughly 2x as old as Cassandra.
> > HBase values strong consistency and High Availability while Cassandra
> > values Availability and Partitioning tolerance
> HBase actually picks CP.
> > Efficiently running MapReduce on Cassandra, on the other hand, is
> > difficult because all of its keys are in one big “space”, so the MapReduce
> > framework doesn’t know how to split and divide the data natively. There
> > needs to be some hackery in place to handle all of that.
> Writing a hadoop input generator is a Feature, to use the article's
> terminology.  It doesn't have to be hackish; in fact, trunk now has a
> key range splitter that could easily be adapted to Hadoop.
> Quoting an old patchset to "prove" that cassandra can only poorly
> interface to hadoop is weak.
> > Cassandra is only a Ruby gem install away.
> Or a tar download, or a deb package...
> > You still have to do quite a bit of manual configuration
> Other than columnfamily definition (which must also be done for
> hbase), I'm not sure what the author was thinking of here.
> bin/cassandra works out of the box, and (unlike hbase) there is only
> one type of process to deal with, which is a huge win for ops in
> production.
> > in HBase, if a region server is down, writes will be blocked for affected
> > data until the data is redistributed
> (that is why hbase really has CP out of CAP, not CA)
> > Cassandra, however, has an internal method of resolving up-to-dateness
> > issues with vector clocks — a complex but workable solution where basically
> > the latest timestamp wins
> No; Cassandra uses latest-timestamp-wins, which is totally different
> from vector clocks.
> > Another architectural quibble is that Cassandra only supports one table
> > per install. That means you can’t denormalize your data to make it more
> > usable in analytical scenarios.
> Not even a kernel of truth there.  wtf?
> > Cassandra is really more of a Key Value store than a Data Warehouse.
> Again: wtf?
> > Furthermore, schema changes require a cluster restart
> This part is true, for now.  But, misleading since "schema change"
> means "adding CFs or keyspaces," not merely "modifying columns" like
> in traditional dbs.
> > it’s difficult to claim that Cassandra implements the BigTable model
> We never claimed to be a pure bigtable clone.  We don't want to be,
> because of the single points of failures and operational complexity
> involved.
> > Cassandra is optimized for small datacenters (hundreds of nodes)
> > connected by very fast fiber. HBase, being based on research originally
> > published by Google, is happy to handle replication to thousands of
> > planet-strewn nodes across the ’slow’, unpredictable Internet
> Cassandra has multi-datacenter support already.  HBase didn't, last I
> checked.  So this is weird.
> > This first diagram is a model of the Cassandra replication scheme.
> Note that all these steps happen in parallel.
> > it’s impossible to tell when the required number of replicas will be
> > up-to-date. This can be extremely painful in a live situation — when one of
> > your DCs goes down, you often want to know *exactly* when to expect data
> > consistency
> Cassandra provides consistency when R + W > N (read replica count +
> write replica count > replication factor).  If you do writes and reads
> both with QUORUM, for one example, you can expect data consistency as
> soon as there are enough nodes for a quorum (which may not even
> require the DC to be online).  That is not "impossible to tell" at
> all.
> > It’s important to note that Cassandra relies on high-speed fiber between
> > datacenters.
> Simply flat-out wrong.
> > If your writes are taking 1 or 2 ms, that’s fine. But when a DC goes out
> > and you have to revert to a secondary one in China instead of 20 miles away,
> > the incredible latency will lead to write timeouts and highly inconsistent
> > data.
> Sure, "incredible" latency of 100ms or so is bad, but it's not the end
> of the world, and won't cause either write timeouts or inconsistent
> data, assuming that you are in fact using R + W > N.

Salvador Fuentes Jr.
