From: Anthony John
To: user@cassandra.apache.org
Cc: A J
Date: Fri, 18 Feb 2011 11:28:16 -0600
Subject: Re: R and N

OK, let me state the facts first (as I know them):

- I do not know the inner workings, so interpret my response with that
  caveat. That said, at an architectural level one should be able to
  keep the detailed implementation at bay.
- Quorum is floor(N/2) + 1, which equals (N+1)/2 for odd N, where N is
  the Replication Factor (RF).
- Consistency is guaranteed if R(ead) + W(rite) > RF. Quorum reads and
  writes give you this, but it can be achieved via other combinations,
  depending on whether read or write performance is desired.

Now, getting to your questions:

1. If a read at Q is nondeterministic, it would likely have to read the
   other (RF-Q) nodes to achieve quorum on a deterministic value. At
   that point, syncing all of them with writes should not be that
   expensive. At what point precisely the read is returned, I do not
   know; you will have to look at the code. IMO, at this level it
   should not matter.
2. It should be at the granularity of the data divergence.
3. Read repair or nodetool repair, whichever comes first.
4. All replicas are peers; there is no primary. There might be a
   connected node, but it has no special role or privileges.
5. It tries to read at Q and returns on a deterministic read. If not,
   see (1).
6. The writer supplies the timestamp value. It can be any value that
   makes sense within the scope of the data/application.

HTH,

-JA

On Fri, Feb 18, 2011 at 10:28 AM, A J <s5alye@gmail.com> wrote:
> A couple more related questions:
>
> 5. For reads, does Cassandra first read N nodes or just the R nodes it
> selects? I am thinking unless it reads all the N nodes, how will it
> know which node has the latest write.
>
> 6. Who decides the timestamp that gets inserted into the timestamp
> field of every column? I would guess the coordinator node picks up its
> system's timestamp. If that is true, the clocks on all the nodes
> should be synchronized, right? Otherwise conflict resolution cannot
> be done correctly.
> For a distributed system, this is not always possible. How do folks
> get around this issue?
>
> Thanks.
>
> On Fri, Feb 18, 2011 at 10:23 AM, A J <s5alye@gmail.com> wrote:
> > Questions about R and N (and W):
> > 1. If I set R to Quorum and Cassandra identifies a need for read
> > repair before returning, would the read repair happen on R nodes (I
> > mean the subset of R that needs repair) or N nodes before the data is
> > delivered to the client?
> > 2. Also, does the repair happen at the level of row (key) or at the
> > level of column?
> >
> > 3. During a write, if W is met but N-W is not met for some reason,
> > would Cassandra try to repair the N-W nodes in the background as and
> > when it can? Or are the N-W only repaired when a read is issued?
> >
> > 4. What is the significance of the 'primary' replica for writes from
> > a usage point of view? Writes to primary and non-primary replicas all
> > happen simultaneously. Ensuring W is decided irrespective of it being
> > primary or not. Ensuring R is decided by any of the R nodes out of N.
> > I know the tokens are divided per the primary replica. But other than
> > that, for read and write operations, does the primary replica play
> > any special role?
> >
> > Thanks.
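
To make the arithmetic in the reply concrete, here is a minimal sketch of the quorum size, the R + W > RF overlap condition, and a timestamp-based pick among replica replies. It is not Cassandra's code: the function names (quorum, overlaps, resolve) and the (timestamp, value) reply shape are assumptions made up for illustration, and tie-breaking between equal timestamps is deliberately left out.

```python
def quorum(rf: int) -> int:
    """Smallest majority of rf replicas: floor(rf / 2) + 1."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return rf // 2 + 1


def overlaps(r: int, w: int, rf: int) -> bool:
    """True when any r replicas read must share a node with any w replicas written."""
    return r + w > rf


def resolve(replies):
    """Return the (timestamp, value) reply with the highest writer-supplied timestamp.

    Hypothetical helper: 'replies' is a list of (timestamp, value) tuples,
    one per replica that answered the read.
    """
    return max(replies, key=lambda reply: reply[0])


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print("quorum for RF=3:", q)
    print("quorum read + quorum write overlap:", overlaps(q, q, rf))
    print("R=1, W=1 overlap:", overlaps(1, 1, rf))
    print("winning reply:", resolve([(1298050000, "old"), (1298050096, "new")]))
```

With RF = 3, quorum is 2, so a quorum read always touches at least one replica that took the latest quorum write; with R = 1 and W = 1 that overlap is not guaranteed. Since the writer supplies the timestamps, the pick in resolve is only as good as the writers' clocks or whatever ordering value the application chooses.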