From: David Jeske <davidj@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 22 Nov 2010 16:50:40 -0800
Subject: Re: cassandra vs hbase summary (was facebook messaging)

This is my second attempt at a summary of Cassandra vs HBase consistency and performance for an HBase-acceptable workload. I think these subtleties are tricky to understand, yet it's helpful for the community to understand them. I'm not trying to state my own facts (or opinions) but merely to summarize what I've read.

Again, please correct any facts which are wrong. Thanks for the kind and thoughtful responses!

*1) Cassandra can't replicate the consistency situation of HBase.* Namely, that once a write is finished, the new value will either always appear or never appear.

[In Cassandra] "Provided at least one node receives the write, it will eventually be written to all replicas. A failure to meet the requested ConsistencyLevel is just that; not a failure to write the data itself. Once the write is received by a node, it will eventually reach all replicas; there is no rollback." - Nick Telford [ref]

In Cassandra (N3/W3/R1, N3/W2/R2, or N3/W3/R3), a write can reach a single node and fail to meet the requested write consistency; a readback can then show the old value, but later show the new value once the write that did occur is propagated.

[In HBase] Once a region master accepts a write, it has been flushed to the HDFS log. If the region server goes down while writing: if the write was finished to any copy of the HDFS log, the new region master will accept and propagate the write; if not, the write will never appear.
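To make the Cassandra half of this concrete, here is a minimal sketch of that scenario. It is a toy in-memory model of my own (not the real client API or replication code): a write lands on one of three replicas, the requested W is not met, yet nothing is rolled back, and the value eventually propagates.

import random

# Toy model of point (1): three replicas, each mapping key -> (value, timestamp).
replicas = [dict(), dict(), dict()]

def write(key, value, ts, w_required, reachable):
    """Apply the write to every reachable replica and report whether the
    requested write ConsistencyLevel (W) was met. A failed W does NOT
    undo the copies that already landed -- there is no rollback."""
    for r in reachable:
        replicas[r][key] = (value, ts)
    return len(reachable) >= w_required

def read(key, r_required):
    """Read from R randomly chosen replicas; the newest timestamp wins."""
    chosen = random.sample(range(len(replicas)), r_required)
    versions = [replicas[r].get(key, (None, -1)) for r in chosen]
    return max(versions, key=lambda v: v[1])[0]

def anti_entropy(key):
    """Eventually (hinted handoff / read repair / repair), the newest
    version reaches every replica."""
    newest = max((r.get(key, (None, -1)) for r in replicas), key=lambda v: v[1])
    for r in replicas:
        r[key] = newest

write("k", "old", ts=1, w_required=3, reachable=[0, 1, 2])   # all replicas agree

# Only replica 0 is reachable; a W=3 write fails the ConsistencyLevel...
print(write("k", "new", ts=2, w_required=3, reachable=[0]))  # False
print(read("k", r_required=1))   # "old" or "new", depending on replica chosen
anti_entropy("k")
print(read("k", r_required=1))   # always "new" -- the "failed" write still won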
*2) Cassandra has a less efficient use of memory, particularly for data pinned in memory.* With 3 replicas on Cassandra, each element of data pinned in memory is kept on 3 servers, whereas in HBase only region masters keep the data in memory, so there is only one copy of each data element.

CASSANDRA-1314 provides an opportunity to allow a 'soft master', where reads prefer a particular replica. Combined with disabling read-repair, this should allow more efficient memory usage for data pinned or cached in memory. #1 above is still true, namely that a write may occur only to a node which is not the soft master, and the new value may not appear for a while and then eventually appear. However, with N3/W3/R1, once a write appears at the soft master it will remain, so as long as the soft-master preference can be honored, this comes closer to HBase's consistency.
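A back-of-the-envelope illustration of the memory point (the numbers are mine, purely illustrative; the soft-master behavior is the CASSANDRA-1314 idea described above):

# Illustrative numbers only -- not from the thread.
working_set_gb = 100   # hot data we want served from RAM
rf = 3                 # Cassandra replication factor

def hot_ram_needed_gb(soft_master_reads):
    """Cluster-wide RAM needed to keep the working set hot.

    Default Cassandra: reads (and read-repair) touch any replica, so all
    rf replicas end up caching the hot data. With a CASSANDRA-1314-style
    soft master (reads pinned to one preferred replica, read-repair
    disabled), only one replica per key needs the data hot -- similar to
    HBase, where a region is cached only on the region server owning it.
    """
    copies_hot = 1 if soft_master_reads else rf
    return working_set_gb * copies_hot

print(hot_ram_needed_gb(False))  # 300 GB: default Cassandra behavior
print(hot_ram_needed_gb(True))   # 100 GB: soft-master reads (~ HBase)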
*3) HBase can't match the row-availability situation of Cassandra (N3/W2/R2).* In the face of a single machine failure, if it is a region master, those keys are offline in HBase until a new region master is elected and brought online. In Cassandra, no single node failure causes the data to become unavailable.

*4) Two Cassandra configurations are closest to the consistency situation of HBase, and provide slightly different node-failure characteristics.* (Note: #1 above means Cassandra can't truly reach the same consistency situation as HBase.)

In Cassandra (N3/W3/R1), a node failure will disallow writes to a key range during the replica rebuild, while still allowing reads.

In Cassandra (N3/W2-3/R2), a node failure will allow both reads and writes to continue, while requiring uncached reads to contact two servers. (Requiring a response from two servers may increase common-case latency, but may hide latency from GC spikes, since any two of the three may respond.)

In HBase, if an HDFS node fails, both reads and writes continue; when a region master fails, both reads and writes are stalled until the region master is replaced.
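The failure behavior of the two configurations falls out of simple quorum arithmetic (reads overlap the latest write when R + W > N); a minimal check of the two cases, again just an illustration:

def quorum_properties(n, w, r, nodes_down=0):
    """For n replicas, w write acks and r read responses required:
    can each operation still be served with `nodes_down` replicas
    unavailable, and do reads always overlap the latest write?"""
    return {
        "reads_overlap_writes": r + w > n,        # R + W > N guarantee
        "writes_available": n - nodes_down >= w,
        "reads_available": n - nodes_down >= r,
    }

# N3/W3/R1 with one node down: writes blocked, reads continue.
print(quorum_properties(3, 3, 1, nodes_down=1))
# -> {'reads_overlap_writes': True, 'writes_available': False, 'reads_available': True}

# N3/W2/R2 with one node down: both reads and writes continue.
print(quorum_properties(3, 2, 2, nodes_down=1))
# -> {'reads_overlap_writes': True, 'writes_available': True, 'reads_available': True}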

Was that a better summary? Is it closer to correct?
