Date: Thu, 1 Sep 2011 15:56:50 -0500 (CDT)
From: "Konstantin Naryshkin"
To: user@cassandra.apache.org
Subject: Re: Replicate On Write behavior
Message-ID: <1921d25c-e6bc-49a3-8794-dc1fdeec5cd9@klap>

Yeah, I believe that Yang has a typo in his post. A CF is not read in one go, a row is. As for the scalability of having all the columns read at once, I do not believe it was ever meant to scale that way. All the columns in a row are stored together, on the same set of machines. This means that if you have very large rows, you can end up with an unbalanced cluster. It also makes reading several columns out of a row more efficient: they are all on the same machine (no need to gather results from several machines), and they should read quickly since they are all together on disk.

----- Original Message -----
From: "Ian Danforth"
To: user@cassandra.apache.org
Sent: Thursday, September 1, 2011 4:35:33 PM
Subject: Re: Replicate On Write behavior

I'm not sure I understand the scalability of this approach. A given column family can be HUGE, with millions of rows and columns. In my cluster I have a single column family that accounts for 90GB of load on each node. Not only that, but the column family is distributed over the entire ring.

Clearly I'm misunderstanding something.

Ian

On Thu, Sep 1, 2011 at 1:17 PM, Yang wrote:
> when Cassandra reads, the entire CF is always read together, only at the
> hand-over to client does the pruning happen
>
> On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne
> wrote:
>>
>> I'm curious... digging through the source, it looks like replicate on
>> write triggers a read of the entire row, and not just the
>> columns/supercolumns that are affected by the counter update.  Is this the
>> case?
>> It would certainly explain why my inserts/sec decay over time and why
>> the average insert latency increases over time.  The strange thing is that
>> I'm not seeing disk read IO increase over that same period, but that might
>> be due to the OS buffer cache...
>>
>> On another note, on a 5-node cluster, I'm only seeing 3 nodes with
>> ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that
>> normal?  I'm using RandomPartitioner...
>>
>> Address         DC          Rack        Status State   Load            Owns    Token
>>                                                                                136112946768375385385349842972707284580
>> 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
>> 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
>> 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
>> 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
>> 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
>>
>> The nodes with ReplicateOnWrites are the 3 in the middle.  The first node
>> and last node both have a count of 0.  This is a clean cluster, and I've
>> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
>> hours.  The last time this test ran, it went all the way down to 500
>> inserts/sec before I killed it.
>
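
To illustrate the point at the top of this reply, that all columns of a row live on the same set of machines, here is a minimal Python sketch of RandomPartitioner-style placement using the tokens from the ring output quoted above. It is a simplification, not Cassandra's actual code: the names RING, token_for and replicas_for are made up for this example, the replication factor of 3 is an assumption (the thread never states it), and the "walk clockwise to the next nodes" rule ignores datacenter- and rack-aware placement.

```python
import hashlib
from bisect import bisect_left

# (token, node) pairs taken from the nodetool ring output quoted above.
RING = [
    (0, "10.0.0.57"),
    (34028236692093846346337460743176821145, "10.0.0.56"),
    (68056473384187692692674921486353642290, "10.0.0.55"),
    (102084710076281539039012382229530463435, "10.0.0.54"),
    (136112946768375385385349842972707284580, "10.0.0.72"),
]


def token_for(row_key):
    """Hash a row key (bytes) to a token in [0, 2**127].

    The MD5 digest is read as a signed big-endian integer and its
    absolute value is taken.
    """
    digest = hashlib.md5(row_key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def replicas_for(row_key, rf=3):
    """Return the rf nodes holding the row (rf=3 is an assumed value).

    The first replica is the node with the smallest token >= the row's
    token, wrapping around to the start of the ring; the remaining
    replicas are the next nodes clockwise.
    """
    tokens = [token for token, _ in RING]
    start = bisect_left(tokens, token_for(row_key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]


if __name__ == "__main__":
    for key in (b"counter-row-1", b"counter-row-2"):
        print(key, token_for(key), replicas_for(key))
```

Placement depends only on the row key, never on the column name, so every column of a row maps to the same replicas. That is why a multi-column read of one row touches a single set of machines, and also why one very large row cannot be spread across the ring.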