hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: HBase and Cassandra on StackOverflow
Date Wed, 31 Aug 2011 01:05:02 GMT
Hi Joe,

> > HBase supports replication between clusters (i.e. data centers).
> 
> That’s … debatable.  There's replication support in the code, but
> several times in the recent past when someone asked about it on this
> mailing list, the response was “I don't know of anyone actually
> using it.”


I believe SU (StumbleUpon) uses it.

Anyway I think this is really the point I was making here:

> > the main difference is HBase provides simple mechanism and the user
> > must build a replication architecture useful for them; while
> > Cassandra attempts to hide some of that complexity

So I don't think either of us is really debating this point, except for this:

> My understanding of replication is that you can't replicate any
> existing data, so unless you activated it on day one, it isn't very
> useful.

That was a design choice. Existing data should be transferred in advance, or as a one-shot
background job, with a utility that chooses on an application-specific basis what is useful
to replicate. There is also a generic utility, provided as a MapReduce job, for this purpose.

> If you use N=3, W=3, R=1 in Cassandra, you should get similar behavior
> to HBase/HDFS with respect to consistency and availability

My understanding is that, in some scenarios, R=1 does not guarantee that you won't see different
versions of the data in different reads. There was an excellent Quora answer on this. I don't
remember it offhand, but perhaps you can find the link to it or someone can provide it to you.
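
To make that kind of scenario concrete, here is a toy model in Python. It is not Cassandra code. It assumes, purely for illustration, that a write which fails to reach all W=3 replicas is not rolled back on the replicas that did accept it. The `Replica` class and the `write` and `read` helpers are hypothetical names invented for this sketch.

```python
class Replica:
    """One copy of a single value, with a version number."""

    def __init__(self):
        self.value = None
        self.version = 0
        self.up = True


def write(replicas, value, version, w):
    """Apply the write to every reachable replica.

    Returns True only if at least `w` replicas acknowledged.
    Assumption for this sketch: a write that misses its consistency
    level is NOT undone on the replicas that already applied it.
    """
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.value, replica.version = value, version
            acks += 1
    return acks >= w


def read(contacted):
    """Return the newest value among the replicas that were contacted."""
    return max(contacted, key=lambda r: r.version).value


# N=3, W=3, R=1
replicas = [Replica() for _ in range(3)]
first_ok = write(replicas, "v1", version=1, w=3)   # all three replicas up

replicas[2].up = False                             # one replica drops out
second_ok = write(replicas, "v2", version=2, w=3)  # only two acks: write fails
replicas[2].up = True                              # replica returns, still at v1

# With R=1 each read contacts a single replica, so the answer depends on which one.
values_seen = {read([replica]) for replica in replicas}
print(first_ok, second_ok, sorted(values_seen))
```

In this model the second write reports failure, yet later R=1 reads return either v1 or v2 depending on which replica is contacted.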

> Random partitioning has definite advantages for some cases, and HBase
> might well benefit from recognizing that and adding some support.

Or just use salted keys? 

Random partitioning in a distributed ordered tree sounds like an impedance mismatch to me.
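
As a rough sketch of what salting could look like (plain Python, not the HBase client API; `NUM_BUCKETS`, `salt_key`, and `scan_ranges` are hypothetical names), a small hash-derived prefix spreads otherwise sequential row keys across a fixed number of buckets. The price is that one logical range scan becomes one scan per bucket.

```python
import hashlib

NUM_BUCKETS = 4  # hypothetical bucket count, fixed when the table is designed


def salt_key(row_key: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Prefix a row key with a stable, hash-derived bucket id."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}-{row_key}"


def scan_ranges(start: str, stop: str, num_buckets: int = NUM_BUCKETS):
    """Expand one logical [start, stop) range into one range per bucket."""
    return [(f"{b:02d}-{start}", f"{b:02d}-{stop}") for b in range(num_buckets)]


# Sequential keys (timestamps here) would normally sort next to each other.
keys = [f"2011-08-31T01:05:{s:02d}" for s in range(60)]
salted = sorted(salt_key(k) for k in keys)
buckets_used = {s.split("-", 1)[0] for s in salted}

print(sorted(buckets_used))
print(scan_ranges("2011-08-31T01:05:00", "2011-08-31T01:06:00"))
```

Rows stay sorted within each bucket, so ordered scans remain possible, which is the trade-off compared with fully random partitioning.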

> HBase uses two different kinds of files, data files and logs, but
> HDFS doesn't know about that and cannot, for example, optimize data
> files for write throughput

You are assuming that HDFS is a shrinkwrapped static thing here, no?

Anyway, your point is valid: in the past, features that HBase requires of HDFS have not received
the level of support in the HDFS developer community that we would have liked. However, this
is now rapidly changing for the better.

> Operationally, however, HBase is more complex.
> Admins have to configure and manage ZooKeeper, HDFS, and HBase.
> Could this be improved?

Sure, there is room for improvement in hiding some of the complexity for evaluators, single-system
developers, or other users who want, e.g., a three-step quickstart.

Personally, I prefer having the ability to tune those layers independently of each other.

And, while complexity may be more "hidden" operationally in the Cassandra case relative to
HBase, when there is a problem on your cluster, I don't know if that buys you anything. I
suppose it depends on the nature of the problem. I do not believe there is a guarantee that
operationally Cassandra is really simpler than HBase when it's 2 am and there is a bug and
nodes are going down.


Best regards,


        - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


>________________________________
>From: Joe Pallas <joseph.pallas@oracle.com>
>To: user@hbase.apache.org
>Sent: Wednesday, August 31, 2011 1:42 AM
>Subject: Re: HBase and Cassandra on StackOverflow
>
>
>On Aug 30, 2011, at 2:47 AM, Andrew Purtell wrote:
>
>> Better to focus on improving HBase than play whack a mole.
>
>Absolutely.  So let's talk about improving HBase.  I'm speaking here as someone who
>has been learning about and experimenting with HBase for more than six months.
>
>> HBase supports replication between clusters (i.e. data centers).
>
>That’s … debatable.  There's replication support in the code, but several times in
>the recent past when someone asked about it on this mailing list, the response was “I don't
>know of anyone actually using it.”  My understanding of replication is that you can't replicate
>any existing data, so unless you activated it on day one, it isn't very useful.  Do I misunderstand?
>
>> Cassandra does not have strong consistency in the sense that HBase provides. It can
>> provide strong consistency, but at the cost of failing any read if there is insufficient quorum.
>> HBase/HDFS does not have that limitation. On the other hand, HBase has its own and different
>> scenarios where data may not be immediately available. The differences between the systems
>> are nuanced and which to use depends on the use case requirements.
>
>That's fair enough, although I think your first two sentences nearly contradict each other
>:-).  If you use N=3, W=3, R=1 in Cassandra, you should get similar behavior to HBase/HDFS
>with respect to consistency and availability ("strong" consistency and reads do not fail if
>any one copy is available).
>
>A more important point, I think, is the one about storage.  HBase uses two different
>kinds of files, data files and logs, but HDFS doesn't know about that and cannot, for example,
>optimize data files for write throughput (and random reads) and log files for low latency
>sequential writes.  (For example, how could performance be improved by adding solid-state
>disk?)
>
>> Cassandra's RandomPartitioner / hash based partitioning means efficient MapReduce
>> or table scanning is not possible, whereas HBase's distributed ordered tree is naturally efficient
>> for such use cases, I believe explaining why Hadoop users often prefer it. This may or may
>> not be a problem for any given use case.
>
>I don't think you can make a blanket statement that random partitioning makes efficient
>MapReduce impossible (scanning, yes).  Many M/R tasks process entire tables.  Random partitioning
>has definite advantages for some cases, and HBase might well benefit from recognizing that
>and adding some support.
>
>> Cassandra is no less complex than HBase. All of this complexity is "hidden" in the
>> sense that with Hadoop/HBase the layering is obvious -- HDFS, HBase, etc. -- but the Cassandra
>> internals are no less layered.
>
>Operationally, however, HBase is more complex.  Admins have to configure and manage ZooKeeper,
>HDFS, and HBase.  Could this be improved?
>
>> With Cassandra, all RPC is via Thrift with various wrappers, so actually all Cassandra
>> clients are second class in the sense that jbellis means when he states "Non-Java clients
>> are not second-class citizens".
>
>That's disingenuous.  Thrift exposes all of the Cassandra API to all of the wrappers,
>while HBase clients who want to use all of the HBase API must use Java.  That can be fixed,
>but it is the status quo.
>
>joe
>
>
>
>