cassandra-user mailing list archives

From Jens Rantil <jens.ran...@tink.se>
Subject Re: Hbase vs Cassandra
Date Mon, 08 Jun 2015 08:57:41 GMT
Hi,

Some minor comments:

> 2.terrible!!! Ambari/cloudera manager rulezzz. Netflix has its own tool
> for Cassandra but it doesn't support vnodes.

Not entirely sure what you mean here, but we ran Cloudera for a while and
Cloudera Manager was buggy and hard to debug. Overall, our experience
wasn't very good. This was definitely also due to us not knowing how all
the Cloudera packages were configured.

> HBase is always consistent. Machine outages lead to inability to read or
> write data on that machine. With Cassandra you can always write.

Sort of true. You can choose the write consistency level and get an
exception if a write doesn't reach that level. However, do note that
Cassandra never rolls back failed writes, which means writes aren't atomic
(as in ACID).
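
To make the non-atomic part concrete, here is a minimal Python sketch. It is
a toy model and not the Cassandra driver API: Replica, WriteFailed and
write_with_consistency are hypothetical names, and a "replica" is just a
dict that may or may not acknowledge in time. It assumes the timeout case,
where some replicas apply the write but too few acknowledge it.

```python
class WriteFailed(Exception):
    """Raised when fewer replicas acknowledged than the caller required."""


class Replica:
    """Toy replica (hypothetical): stores data in a dict."""

    def __init__(self, name, responds=True):
        self.name = name
        self.responds = responds  # False models a replica that times out
        self.data = {}

    def write(self, key, value):
        if not self.responds:
            return False          # no acknowledgement, nothing stored
        self.data[key] = value
        return True


def write_with_consistency(replicas, key, value, required_acks):
    """Send the write to every replica and count acknowledgements.

    If too few replicas acknowledge, raise WriteFailed. Replicas that
    already applied the write keep it: there is no rollback step.
    """
    acks = sum(1 for replica in replicas if replica.write(key, value))
    if acks < required_acks:
        raise WriteFailed(f"got {acks} acks, needed {required_acks}")
    return acks


# Three replicas, two of which never acknowledge; require 2 of 3 acks.
replicas = [Replica("a"), Replica("b", responds=False), Replica("c", responds=False)]
try:
    write_with_consistency(replicas, "user:1", "jens", required_acks=2)
    failed = False
except WriteFailed:
    failed = True
```

In this model the caller sees an exception, yet replica "a" still holds the
value, so a later read could return data from a write that was reported as
failed.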

We chose Cassandra over HBase mostly for its manageability. We are a
small team, and my feeling is that you will want dedicated people taking
care of a Hadoop cluster if you are going down the HBase path. A Cassandra
cluster can be handled by a single engineer and is, in my opinion, easier
to maintain.

Cheers,
Jens

On Mon, Jun 8, 2015 at 9:59 AM, Ajay <ajay.garga@gmail.com> wrote:

> Hi All,
>
> Thanks for all the input. I posted the same question on the HBase forum and
> got more responses.
>
> Posting the consolidated list here.
>
> Our case is that a central team builds and maintains the platform
> (Cassandra as a service). We have a couple of use cases that fit Cassandra,
> like time-series data. But as a platform team, we need to know more
> features and use cases that fit or are best handled in Cassandra. We also
> want to understand the use cases where HBase performs better (we might need
> to offer it as a service too).
>
> *Cassandra:*
>
> 1) From 2013 both can still be relevant:
> http://www.pythian.com/blog/watch-hbase-vs-cassandra/
>
> 2) Here are some use cases from PlanetCassandra.org of companies who chose
> Cassandra over HBase after evaluation, or migrated to Cassandra from HBase.
> The eComNext interview cited on the page touches on time-series data;
> http://planetcassandra.org/hbase-to-cassandra-migration/
>
> 3) From googling, the most commonly cited advantages of Cassandra over
> HBase are that it is easy to deploy, maintain & monitor, and has no single
> point of failure.
>
> 4) From our six months of research and POC experience with Cassandra, CQL
> is pretty limited. Though CQL is targeted at real-time reads and writes,
> there are cases where we need to pull out data differently and we are OK
> with a little more latency. But Cassandra doesn't support that. We need
> MapReduce or Spark for those. Then the debate starts: why Cassandra and why
> not HBase, if we need Hadoop/Spark for MapReduce anyway?
>
> I expected a few more technical features/use cases that are best handled
> by Cassandra (and how it works).
>
> *HBase:*
>
> 1) As for the #4 you might be interested in reading
> https://aphyr.com/posts/294-call-me-maybe-cassandra
> Not sure if there is a comparable article about HBase (anybody know?) but
> it can give you another perspective about what else to keep an eye on
> regarding these systems.
>
> 2) See http://hbase.apache.org/book.html#perf.network.call_me_maybe
>
> 3) http://blog.parsely.com/post/1928/cass/
> *Anyone have any comments on this?*
>
> 4) 1. No killer features compared to HBase.
> 2. Terrible!!! Ambari/cloudera manager rulezzz. Netflix has its own tool
> for Cassandra but it doesn't support vnodes.
> 3. Rumors say it's fast when it works ;) The reason: it can silently drop
> data you try to write.
> 4. Timeseries is a nightmare. The easiest approach is to just replicate
> data to hdfs, partition it by hour/day and run
> spark/scalding/pig/hive/Impala.
>
> 5)  Migrated from Cassandra to HBase.
> Reasons:
> Scan is fast with HBase. It fits better with the time series data model.
> Please look at opentsdb. Cassandra models it with large rows.
> Server side filtering. You can use it to filter some of your time series
> data on the server side.
> HBase has better integration with hadoop in general. We had to write our
> own bulk loader using mapreduce for cassandra. HBase already has a tool
> for that. There is a nice integration with flume and kite.
> High availability didn't matter for us. 10 secs down is fine for our use
> cases. HBase started to support eventually consistent reads.
>
> 6) Coprocessor framework (custom code inside Region Servers and
> MasterServers), which Cassandra is missing, afaik.
>    Coprocessors have been widely used by HBase users (Phoenix SQL, for
> example) since their inception (in 0.92).
> * The HBase security model is more mature and aligns well with Hadoop/HDFS
> security. Cassandra provides just basic authentication/authorization/SSL
> encryption, no Kerberos, no end-to-end data encryption,
> no cell level security.
>
> 7) Another point to add is the new "HBase read high-availability using
> timeline-consistent region replicas" feature from HBase 1.0 onward, which
> brings HBase closer to Cassandra in terms of Read Availability during
> node failures.  You have a choice for Read Availability now.
> https://issues.apache.org/jira/browse/HBASE-10070
>
> 8) HBase can do range scans, and one can attack many problems with range
> scans. Cassandra can't do range scans.
>
> 9) HBase is a distributed, consistent, sorted key value store. The
> "sorted" bit allows for range scans in addition to the point gets that all
> K/V stores support. Nothing more, nothing less.
> It happens to store its data in HDFS by default, and we provide convenient
> input and output formats for map reduce.
>
> *Neutral:*
> 1)
> http://khangaonkar.blogspot.com/2013/09/cassandra-vs-hbase-which-nosql-store-do.html
>
> 2) The fundamental differences that come to mind are:
> * HBase is always consistent. Machine outages lead to inability to read or
> write data on that machine. With Cassandra you can always write.
>
> * Cassandra defaults to a random partitioner, so range scans are not
> possible (by default)
> * HBase has a range partitioner (if you don't want that, the client has to
> prefix the rowkey with a prefix of a hash of the rowkey). The main feature
> that sets HBase apart is range scans.
>
> * HBase is much more tightly integrated with Hadoop/MapReduce/HDFS, etc.
> You can map reduce directly into HFiles and map those into HBase instantly.
>
> * Cassandra has a dedicated company supporting (and promoting) it.
> * Getting started is easier with Cassandra. For HBase you need to run HDFS
> and Zookeeper, etc.
> * I've heard lots of anecdotes about Cassandra working nicely with small
> clusters (< 50 nodes) and quickly degenerating above that.
> * HBase does not have a query language (but you can use Phoenix for full
> SQL support)
> * HBase does not have secondary indexes (having an eventually consistent
> index, similar to what Cassandra has, is easy in HBase, but making it as
> consistent as the rest of HBase is hard)
>
> Thanks
> Ajay
>
>
>>
>> On May 29, 2015, at 12:09 PM, Ajay <ajay.garga@gmail.com> wrote:
>>
>> Hi,
>>
>> I need some info on Hbase vs Cassandra as a data store (in general plus
>> specific to time series data).
>>
>> The comparison in the following helps:
>> 1: features
>> 2: deployment and monitoring
>> 3: performance
>> 4: anything else
>>
>> Thanks
>> Ajay
>>
>>
>


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.rantil@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook <https://www.facebook.com/#!/tink.se> Linkedin
<http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
 Twitter <https://twitter.com/tink>
