hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Cassandra vs HBase
Date Tue, 01 Sep 2009 22:35:51 GMT
I think this is kind of an existential question between "Dynamo"-like
storage and "Bigtable"-like storage.

But one that is fairly easily resolved - look at the parent systems.

Bigtable - used by an increasing number of systems at Google.
Underlies more than you'd ever think or realize (iGoogle, Reader, App
Engine, etc.).

Dynamo - no longer in use at Amazon, and Cassandra at Facebook hasn't
expanded its usage as far as I can tell.

In the end, you'll have outages of one kind or another. The question
isn't "which system has fewer SPFs", it's "which system lets me do what
I want to do faster".  Right now you can do the following with
HBase, but not with Cassandra:
- build left-match indexes
- map reduce at millions of (small) rows/sec.
- have strong counter systems (think: hit stats)
- flexible cluster configuration

Some of these are fixable (apparently Cassandra has an
order-preserving partitioner which makes #1 possible, and a map-reduce
proof of concept was written for #2), and others will never be
possible (#3 and #4).
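
For anyone unsure what #1 depends on: a left-match (prefix) index only
works when row keys are stored in sorted order, so that all keys
sharing a prefix sit next to each other. The sketch below is plain
Python and not the HBase or Cassandra API. The names prefix_scan and
index_rows and the sample keys are made up for illustration. It shows
the general idea of seeking to the first candidate key and reading
forward until the prefix stops matching.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with `prefix` from a sorted list of keys.

    Hypothetical helper: it models a sorted keyspace with a Python list,
    which is an assumption made for illustration only.
    """
    # Binary search for the first key >= prefix (the "seek" step).
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    # Keys sharing a prefix are contiguous in sorted order, so stop at the
    # first key that no longer matches.
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


# Made-up index rows, kept sorted the way an ordered store would keep them.
index_rows = sorted([
    "com.example/about",
    "com.example/blog/1",
    "com.example/blog/2",
    "org.apache/hbase",
])

blog_rows = prefix_scan(index_rows, "com.example/blog/")
```

With a hash-style partitioner the keys sharing a prefix are scattered
across the cluster, so there is no contiguous range to read, which is
why an order-preserving partitioner matters here.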

So, gather the info, test both systems out.  I only ask that you test
HBase 0.20, since it is substantially faster than 0.19.

Finally, here at StumbleUpon, our PHP app developers love HBase and
enjoy the flexibility of the schema-less system combined with an
easy-to-understand index model.

-ryan


On Tue, Sep 1, 2009 at 3:12 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
>> They have aspects in common -- java, datastores, apache -- but the
>> differences are pretty acute:
>
> This is a pretty fair summary, IMO.
>
>> + Cassandra does eventual consistency.  HBase does strong consistency.  See
>> http://devblog.streamy.com/2009/08/24/cap-theorem/ for more on this.
>
> As I wrote in the other email, you can still get strong consistency
> with Cassandra.  (But, you can't get row locking: that is a definite
> win for HBase.  In my experience though most apps need locking less
> than they think.)
>
> The big win for Cassandra is that its p2p distribution model -- which
> drives the consistency model -- means there is no single point of
> failure.  SPF can be mitigated by failover but it's really, really
> hard to get all the corner cases right with that approach.  Even
> Google with their 3 year head start and huge engineering resources
> still has trouble with that occasionally.  (See e.g.
> http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179.)
>
>> + Cassandra does not have a natural sharding notion as there is in
>> HBase -- i.e. HBase Regions -- so hooking Cassandra to MapReduce is awkward.
>
> Actually that's not a big deal -- the token ring is known, so you can
> break up at a coarse granularity there, and each node has a sampling
> of the keys stored on it thanks to the way the sstable indexing works,
> so generating hadoop input regions is pretty easy.  Jeff Hodges wrote
> a proof of concept over at
> https://issues.apache.org/jira/browse/CASSANDRA-342.
>
>> + The Cassandra fellas talk of their app being one ball of code only whereas
>> with HBase there is HDFS, ZooKeeper and then HBase itself (Apparently it has
>> less lines of code too).
>
> Opinions may differ, but I still think this is a huge win for troubleshooting.
>
>> Less tangible differences -- or differences that can be addressed through
>> application and development -- would include community, maturity, number and
>> variety of production installs, and features (monitoring, shells, UIs, admin
>> tools, etc.).  On these latter dimensions, HBase would seem to do better but
>> do the research and make your own call.
>
> I agree that HBase does better on some of these metrics right now, but
> I also think Cassandra is accelerating faster. :)
>
> -Jonathan (cassandra committer)
>
