Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of tlipcon@gmail.com designates
 209.85.214.172 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:sender:in-reply-to:references:from:date
         :x-google-sender-auth:message-id:subject:to:content-type;
        b=cFjaGZmzxPOoX13ND0FqH3EwyDdSE0M7AD5l/MUUBtTrCtO8TUROnZw0RO3F/Q4V65
         CV3mGVTTZg+ijuNUqmij7pcUw9zLoKDvKm8LOVq4SeWwCfFyGEhcU1qiy6Iso6iw14k2
         UlVMhmMAfdcPvw1skYOm3YQTPlgDwQu+r+XOo=
MIME-Version: 1.0
Sender: tlipcon@gmail.com
In-Reply-To: <AANLkTikpQCg-9RUKJhq_ZWSLqJcdweE1iOP5NTQEXpfE@mail.gmail.com>
References: <AANLkTimtNtZEC31z8bLBZpX5arEfhTQved8x-R2hi713@mail.gmail.com>
 <AANLkTimLjAv_oTpgqG9oxCosjzsniHZPWMk=Pr7YCoqB@mail.gmail.com>
 <AANLkTikho_Twqeu0jWaLsVmjhzyR0BH=RaiCH17h=BT_@mail.gmail.com>
 <AANLkTim4=Cn+hiPj+U0YtmKh_zhvG-QFmAEeaRW1w=Jo@mail.gmail.com>
 <AANLkTikpQCg-9RUKJhq_ZWSLqJcdweE1iOP5NTQEXpfE@mail.gmail.com>
From: Todd Lipcon <todd@lipcon.org>
Date: Sun, 21 Nov 2010 16:16:16 -0800
Message-ID: <AANLkTi=7cNn6v7Ck5j+kfq_rLdFJ-pi4gbOXfmH0TKBC@mail.gmail.com>
Subject: Re: Facebook messaging and choice of HBase over Cassandra - what can
 we learn?
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=000325576e325c27a70495992b74

--000325576e325c27a70495992b74
Content-Type: text/plain; charset=ISO-8859-1

On Sun, Nov 21, 2010 at 2:06 PM, Edward Ribeiro <edward.ribeiro@gmail.com> wrote:

>
> Also I believe saying HBASE is consistent is not true. This can happen:
>> Write to region server. -> Region Server acknowledges client-> write
>> to WAL -> region server fails = write lost
>>
>> I wonder how facebook will reconcile that. :)
>>
>
> Are you sure about that? Client writes to WAL before ack user?
>
> According to these posts[1][2], "if writing the record to the WAL fails the
> whole operation must be considered a failure.", so it would be nonsense to
> acknowledge clients before writing the lifeline. I hope a Cloudera guy can
> explain this...
>
>
[only jumping in because info was requested - those who know me know that I
think Cassandra is a very interesting architecture and a better fit for many
applications than HBase]

You can operate the commit log in two different modes in HBase. One mode is
"deferred log flush", where the region server appends to the commit log but
does not sync() it to HDFS on every write; instead it syncs periodically
(e.g. once a second). This is similar to the innodb_flush_log_at_trx_commit=2
option in MySQL. It performs slightly better since the writer doesn't need
to wait on the commit, but as you noted there's a window where a write may
be acknowledged but then lost. This is an issue of *durability* more so than
consistency.
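
To make that window concrete, here is a toy sketch in Python. It is not
HBase code: ToyLog, run() and the record names are all made up for
illustration, and the "periodic" sync is simulated by a single explicit
call. It only shows which acknowledged writes survive a crash in each mode.

```python
class ToyLog:
    """Toy commit log (hypothetical, not HBase's implementation).

    'buffer' holds appended-but-unsynced records and is lost on a crash;
    'synced' holds records that would survive a crash.
    """

    def __init__(self, deferred):
        self.deferred = deferred
        self.buffer = []
        self.synced = []

    def sync(self):
        # Move everything appended so far into the durable list.
        self.synced.extend(self.buffer)
        self.buffer.clear()

    def write(self, record):
        self.buffer.append(record)
        if not self.deferred:
            # Non-deferred mode: sync before acknowledging the client.
            self.sync()
        return "ack"

    def crash(self):
        # Whatever was only buffered is gone; return it for inspection.
        lost = list(self.buffer)
        self.buffer.clear()
        return lost


def run(deferred):
    """Write five records, with one simulated periodic sync after the third."""
    log = ToyLog(deferred)
    acked = []
    for i in range(5):
        record = f"w{i}"
        if log.write(record) == "ack":
            acked.append(record)
        if i == 2:
            log.sync()  # stands in for the once-a-second background sync
    lost = log.crash()
    return acked, list(log.synced), lost


if __name__ == "__main__":
    for mode in (True, False):
        acked, durable, lost = run(mode)
        print(f"deferred={mode}: acked={acked} durable={durable} lost={lost}")
```

In the deferred run, the two writes made after the last sync are
acknowledged and then lost; in the other run nothing acknowledged is lost.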

In the other mode of operation (default in recent versions of HBase) we do
not acknowledge a write until it has been pushed to the OS buffer on the
entire pipeline of log replicas. Obviously this is slower, but it results in
"no lost data" regardless of any machine failures. Additionally, concurrent
readers do not see written data until these same properties have been
satisfied. So this mode is 100% consistent and 100% durable. In practice,
this affects latency significantly since it adds two extra round trips to
each write, but system throughput is only reduced by 20-30% since the
commits are pipelined (see HDFS-895 for gory details).

I believe Cassandra has similar tuning options about whether to sync every
commit to the log or only do so periodically.
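
For reference, as far as I know the knob on the Cassandra side is the
commitlog_sync setting. The fragment below is a sketch assuming a
cassandra.yaml-based version (0.7 or later; 0.6 has the equivalent
settings in storage-conf.xml). The numeric values are only illustrative,
so check the file shipped with your release.

```yaml
# Periodic mode: writes are acknowledged without waiting for the fsync;
# the commit log is synced every commitlog_sync_period_in_ms.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Batch mode (alternative): writes are not acknowledged until the commit
# log has been fsynced; syncs are grouped within the batch window.
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 50
```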

If you're interested in learning more, feel free to reference this
documentation:
http://hbase.apache.org/docs/r0.89.20100726/acid-semantics.html


> Besides that, you know that WAL is written to HDFS that takes care of
> replication and fault tolerance, right? Of course, even so, there's a
> "window of inconsistency" before the HLog is flushed to disk, but I don't
> think you can dismiss this as not consistent. At most, you may classify it
> as "eventually consistent". :)
>
> [1] http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
> [2]
> http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
>
> E. Ribeiro
>
>

--000325576e325c27a70495992b74--