From: "Dan Hendry"
Subject: RE: per-connection "read-after-my-write" consistency
Date: Sat, 12 Feb 2011 18:02:09 -0500

> So the suggestion is to use at least 3 nodes with RF=3 and CL.QUORUM for both writes and reads where high consistency is required, right?

 

Yes, this is the typical way to use Cassandra when both consistency and availability are required.
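The reason RF=3 is the usual choice comes down to quorum arithmetic: QUORUM means a strict majority of the replicas, floor(RF/2) + 1. The short Python sketch below (the function names are illustrative, not part of any Cassandra client) shows that with RF=2 a quorum is both replicas, so losing either node blocks QUORUM operations, while with RF=3 one replica can be down and QUORUM reads and writes still succeed.

```python
def quorum(rf: int) -> int:
    """Replicas that must respond at CL.QUORUM: a strict majority of RF."""
    return rf // 2 + 1


def tolerated_failures(rf: int, required: int) -> int:
    """How many replicas of a key may be down while `required` can still respond."""
    return max(rf - required, 0)


for rf in (2, 3):
    q = quorum(rf)
    print(f"RF={rf}: QUORUM={q}, replicas that may be down={tolerated_failures(rf, q)}")
# RF=2: QUORUM=2, replicas that may be down=0
# RF=3: QUORUM=2, replicas that may be down=1
```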

 

Dan

 

From: Michal Augustýn [mailto:augustyn.michal@gmail.com]
Sent: February-12-11 17:37
To: user@cassandra.apache.org
Subject: Re: per-connection "read-after-my-write" consistency

 

Hi,

 

I'm using .NET and I wrote my own client library (over Thrift), so I'm absolutely sure that both operations are performed over the same connection.

I can handle the current issue in the application, but I'm sure I won't be able to handle every future situation in the application.

 

So the suggestion is to use at least 3 nodes with RF=3 and CL.QUORUM for both writes and reads where high consistency is required, right?

 

Thanks!

2011/2/12 Dan Hendry <dan.hendry.junk@gmail.com>

Are you using a higher-level client (Hector/Pelops/pycassa/etc.) or the raw Thrift API? Higher-level clients often pool connections, so two subsequent operations (such as a write followed by a read) may be performed over connections to different nodes.
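To see why a pooled client can break a per-connection assumption, here is a minimal round-robin pool in Python. `RoundRobinPool` is hypothetical and does not reflect how Hector, Pelops or pycassa actually choose hosts; it only illustrates that two consecutive calls can land on different coordinator nodes.

```python
from itertools import cycle


class RoundRobinPool:
    """Hypothetical pool that hands out connections to nodes in turn."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def execute(self, operation):
        node = next(self._nodes)  # each call may go to a different node
        return node, operation


pool = RoundRobinPool(["A", "B"])
write_node, _ = pool.execute("insert row")
read_node, _ = pool.execute("get row")
print(write_node, read_node)  # A B
```

With such a pool the read is coordinated by a different node than the write, so even a perfectly ordered client gives no read-after-write guarantee at CL.ONE.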

 

If you are sure you are using the same connection (the raw Thrift API), there is a possible race condition. To the best of my understanding, here is how a write at CL.ONE happens in your case:

- You make a request to node A, which initiates a write to nodes A and B.

- The server reports success when the write to node A OR B is complete (can somebody else confirm?)
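The write path described above can be modelled in a few lines: the coordinator forwards the write to every replica and reports success once as many replicas as the consistency level requires have acknowledged it, so the reply time is the k-th fastest acknowledgement. `ack_time` is a hypothetical helper, not Cassandra code, and the latencies are made-up numbers.

```python
def ack_time(replica_latencies_ms, required):
    """Return when the coordinator can report success: the moment the
    `required`-th fastest replica acknowledges the write."""
    latencies = sorted(replica_latencies_ms)
    if not 1 <= required <= len(latencies):
        raise ValueError("required must be between 1 and the number of replicas")
    return latencies[required - 1]


# Hypothetical per-replica acknowledgement times (ms) for RF=2.
usual = {"A": 0.4, "B": 1.1}  # local replica A answers first
rare = {"A": 1.5, "B": 0.9}   # B happens to answer first

print(ack_time(usual.values(), required=1))  # CL.ONE: 0.4, A has the row
print(ack_time(rare.values(), required=1))   # CL.ONE: 0.9, only B has the row so far
print(ack_time(rare.values(), required=2))   # CL.ALL: 1.5, both replicas have it
```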

 

Typically the write to A will complete more quickly, since A is the node you are connected to and initiating the write on node B adds network overhead. I suppose a 1-in-1000 chance of B completing first is possible, particularly if all nodes and the client are on the same network (or the same machine) with very low latencies.
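The race itself can be sketched as a timeline measured from the moment coordinator A receives the write. It assumes, as described above, that the follow-up CL.ONE read is served by A. `read_is_stale`, `stale_rate` and every latency figure are hypothetical, so the simulated rate only shows that a stale read is possible when B acknowledges first, not how often it happens on a real cluster.

```python
import random


def read_is_stale(apply_a, ack_b, client_rtt):
    """True if a CL.ONE read served by replica A reaches A before A has applied the write.

    apply_a:    ms until coordinator A has applied the write locally
    ack_b:      ms until A receives replica B's acknowledgement
    client_rtt: ms from the ack leaving A until the client's read reaches A
    """
    ack_sent = min(apply_a, ack_b)  # CL.ONE: the first acknowledgement is enough
    read_arrives = ack_sent + client_rtt
    return read_arrives < apply_a


def stale_rate(trials=100_000, seed=1):
    """Monte Carlo estimate under made-up exponential latency assumptions."""
    rng = random.Random(seed)
    stale = 0
    for _ in range(trials):
        apply_a = rng.expovariate(1 / 0.3)      # local apply, mean 0.3 ms (assumed)
        ack_b = 0.2 + rng.expovariate(1 / 0.3)  # network hop plus B's apply (assumed)
        stale += read_is_stale(apply_a, ack_b, client_rtt=0.2)
    return stale / trials


print(read_is_stale(apply_a=0.3, ack_b=1.0, client_rtt=0.2))  # False: A applied first
print(read_is_stale(apply_a=2.0, ack_b=0.5, client_rtt=0.2))  # True: B acked, A still busy
print(stale_rate())
```

Whenever A applies the write before B's acknowledgement arrives, the read can never be stale, which matches the observation that the failure is rare.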

 

Cassandra allows you to explicitly specify the trade-off between consistency and availability. When you read and write at ONE with RF=2, consistency is not guaranteed, but high availability is (you can lose a node and continue to operate). If you require strong consistency with RF=2, you will have to either read or write at consistency level ALL. My suggestion is to either design your application to tolerate inconsistency (if possible) or move to RF=3 with QUORUM reads and QUORUM writes.
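The options above follow from the usual overlap rule: if a write is acknowledged by W replicas and a read consults R replicas, the two sets must share at least one node whenever R + W > RF, and the read then returns the newest value by timestamp. A small Python check (the `REQUIRED` table and `overlaps` are illustrative names) reproduces the cases discussed: ONE/ONE with RF=2 fails the rule, ALL on either side with RF=2 passes, and QUORUM/QUORUM with RF=3 passes while still tolerating one replica down.

```python
# Replicas that must respond at each consistency level, as a function of RF.
REQUIRED = {
    "ONE": lambda rf: 1,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}


def overlaps(rf, write_cl, read_cl):
    """True if every write replica set shares a node with every read replica set (R + W > RF)."""
    w = REQUIRED[write_cl](rf)
    r = REQUIRED[read_cl](rf)
    return r + w > rf


for rf, write_cl, read_cl in [
    (2, "ONE", "ONE"),         # 1 + 1 > 2 is False: stale reads possible
    (2, "ALL", "ONE"),         # 2 + 1 > 2 is True
    (2, "ONE", "ALL"),         # 1 + 2 > 2 is True
    (3, "QUORUM", "QUORUM"),   # 2 + 2 > 3 is True
]:
    print(f"RF={rf} write={write_cl} read={read_cl}: overlap={overlaps(rf, write_cl, read_cl)}")
```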

 

Dan

 

From: Michal Augustýn [mailto:augustyn.michal@gmail.com]
Sent: February-12-11 4:13
To: user@cassandra.apache.org
Subject: per-connection "read-after-my-write" consistency

Hi,

I'm running 2 nodes with RF=2 (not optimal, I know), Cassandra 0.7.1.

During one connection, I write (CL.ONE) a row and subsequently read (CL.ONE) the same row (via Thrift).

I assumed that if I write a row to one node, I can immediately read that row from the same node.

This seems to be true in most cases, but roughly 1 in 1000 attempts doesn't work as expected: I get no row :(

Where is the problem? Should I use a different CL for reads and/or writes? I would just like to achieve "per-connection read-after-my-write consistency".

Thank you very much!


Augi
