Return-Path: Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: (qmail 6743 invoked from network); 22 Nov 2010 00:16:33 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 22 Nov 2010 00:16:33 -0000 Received: (qmail 68922 invoked by uid 500); 22 Nov 2010 00:17:03 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 68897 invoked by uid 500); 22 Nov 2010 00:17:03 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 68889 invoked by uid 99); 22 Nov 2010 00:17:03 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 22 Nov 2010 00:17:03 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of tlipcon@gmail.com designates 209.85.214.172 as permitted sender) Received: from [209.85.214.172] (HELO mail-iw0-f172.google.com) (209.85.214.172) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 22 Nov 2010 00:16:58 +0000 Received: by iwn40 with SMTP id 40so7810548iwn.31 for ; Sun, 21 Nov 2010 16:16:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:mime-version:sender:received :in-reply-to:references:from:date:x-google-sender-auth:message-id :subject:to:content-type; bh=ICHt/ROS0S7vp/uAmcDawqZ9ueFG7HH2RAdzl1A/A/M=; b=Ec0Ol+yKebeJdhfgfsu58yLBJrEn0P/Nb034yYZLK7HFIRcs+UIBUJAuDN/S/yHrtb a0q/h5b1v2kPawmKCS9810vkVdgFAj+8uci5fEUwvvWQPHctGOFEgkoyMibG3dLa6iI0 bhUDHi8OleudEXB2D4hmdna2mOTIUop1WmPVo= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:from:date :x-google-sender-auth:message-id:subject:to:content-type; b=cFjaGZmzxPOoX13ND0FqH3EwyDdSE0M7AD5l/MUUBtTrCtO8TUROnZw0RO3F/Q4V65 CV3mGVTTZg+ijuNUqmij7pcUw9zLoKDvKm8LOVq4SeWwCfFyGEhcU1qiy6Iso6iw14k2 UlVMhmMAfdcPvw1skYOm3YQTPlgDwQu+r+XOo= Received: by 10.231.35.11 with SMTP id n11mr5945194ibd.168.1290384996835; Sun, 21 Nov 2010 16:16:36 -0800 (PST) MIME-Version: 1.0 Sender: tlipcon@gmail.com Received: by 10.231.79.132 with HTTP; Sun, 21 Nov 2010 16:16:16 -0800 (PST) In-Reply-To: References: From: Todd Lipcon Date: Sun, 21 Nov 2010 16:16:16 -0800 X-Google-Sender-Auth: Mk64hHxxmc8woa99I-F22o700Ho Message-ID: Subject: Re: Facebook messaging and choice of HBase over Cassandra - what can we learn? To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=000325576e325c27a70495992b74 --000325576e325c27a70495992b74 Content-Type: text/plain; charset=ISO-8859-1 On Sun, Nov 21, 2010 at 2:06 PM, Edward Ribeiro wrote: > > Also I believe saying HBASE is consistent is not true. This can happen: >> Write to region server. -> Region Server acknowledges client-> write >> to WAL -> region server fails = write lost >> >> I wonder how facebook will reconcile that. :) >> > > Are you sure about that? Client writes to WAL before ack user? > > According to these posts[1][2], "if writing the record to the WAL fails the > whole operation must be considered a failure.", so it would be nonsense > acknowledge clients before writing the lifeline. I hope any cloudera guy > explain this... > > [only jumping in because info was requested - those who know me know that I think Cassandra is a very interesting architecture and a better fit for many applications than HBase] You can operate the commit log in two different modes in HBase. One mode is "deferred log flush", where the region server appends but does not sync() the commit log to HDFS on every write, but rather on a periodic basis (eg once a second). This is similar to the innodb_flush_log_at_trx_commit=2 option in MySQL for example. This has slightly better performance obviously since the writer doesn't need to wait on the commit, but as you noted there's a window where a write may be acknowledged but then lost. This is an issue of *durability* moreso than consistency. In the other mode of operation (default in recent versions of HBase) we do not acknowledge a write until it has been pushed to the OS buffer on the entire pipeline of log replicas. Obviously this is slower, but it results in "no lost data" regardless of any machine failures. Additionally, concurrent readers do not see written data until these same properties have been satisfied. So this mode is 100% consistent and 100% durable. In practice, this effects latency significantly since it adds two extra round trips to each write, but system throughput is only reduced by 20-30% since the commits are pipelined (see HDFS-895 for gory details) I believe Cassandra has similar tuning options about whether to sync every commit to the log or only do so periodically. If you're interested in learning more, feel free to reference this documentation: http://hbase.apache.org/docs/r0.89.20100726/acid-semantics.html > Besides that, you know that WAL is written to HDFS that takes care of > replication and fault tolerance, right? Of course, even so, there's a > "window of inconsistency" before the HLog is flushed to disk, but I don't > think you can dismiss this as not consistent. At most, you may classify it > as "eventual consistent". :) > > [1] http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html > [2] > http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html > > E. Ribeiro > > --000325576e325c27a70495992b74 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On Sun, Nov 21, 2010 at 2:06 PM, Edward Ribeiro <edward.ribeiro@gmail.com> wrote:

Also I believe saying HBASE is consistent is not true. This can happen:
Write to region server. -> Region Server acknowledges client-> write<= br> to WAL -> region server fails =3D write lost

I wonder how facebook will reconcile that. :)

Are you sure about that? Client writes to WAL before ack user?

Ac= cording to these posts[1][2], "if writing the record to the WAL fails = the whole operation must be considered a failure.", so it would be non= sense acknowledge clients before writing the lifeline. I hope any cloudera = guy explain this...


[only jumping in because info was requ= ested - those who know me know that I think Cassandra is a very interesting= architecture and a better fit for many applications than HBase]

You can operate the commit log in two different modes i= n HBase. One mode is "deferred log flush", where the region serve= r appends but does not sync() the commit log to HDFS on every write, but ra= ther on a periodic basis (eg once a second). This is similar to the innodb_= flush_log_at_trx_commit=3D2 option in MySQL for example. This has slightly = better performance obviously since the writer doesn't need to wait on t= he commit, but as you noted there's a window where a write may be ackno= wledged but then lost. This is an issue of *durability* moreso than consist= ency.

In the other mode of operation (default in recent versi= ons of HBase) we do not acknowledge a write until it has been pushed to the= OS buffer on the entire pipeline of log replicas. Obviously this is slower= , but it results in "no lost data" regardless of any machine fail= ures. Additionally, concurrent readers do not see written data until these = same properties have been satisfied. So this mode is 100% consistent and 10= 0% durable. In practice, this effects latency significantly since it adds t= wo extra round trips to each write, but system throughput is only reduced b= y 20-30% since the commits are pipelined (see HDFS-895 for gory details)

I believe Cassandra has similar tuning options about wh= ether to sync every commit to the log or only do so periodically.

If you're interested in learning more, feel free to ref= erence this documentation:
http://hbase.apache.org/docs/r0.89.20100726/acid-semantics.html

=A0
Besides that, you k= now that WAL is written to HDFS that takes care of replication and fault to= lerance, right? Of course, even so, there's a "window of inconsist= ency" before the HLog is flushed to disk, but I don't think you ca= n dismiss this as not consistent. At most, you may classify it as "eve= ntual consistent". :)

[1] http://www.larsgeorge.com/2009/10/hbase-ar= chitecture-101-storage.html
[2] htt= p://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html<= /a>

E. Ribeiro


--000325576e325c27a70495992b74--