From: Yang <teddyyyy123@gmail.com>
To: user@cassandra.apache.org
Date: Thu, 1 Sep 2011 14:05:36 -0700
Subject: Re: Replicate On Write behavior

Sorry, I meant CF * row -- i.e. one row's slice of a column family, not the whole column family.

If you look in the code, db.cf is basically just a set of columns.
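(Roughly, sketched in Python rather than the actual Java, with made-up names just to illustrate what I mean -- this is a mental model, not the real db.cf class or read path:)

# Rough mental model only -- made-up names, not the real db.cf Java class.
# One row's data under a single column family is essentially a map of
# column name -> (value, timestamp); "CF * row" means this unit.
cf_for_one_row = {
    "col_a": ("value_a", 1314911136),
    "col_b": ("value_b", 1314911136),
    "col_c": ("value_c", 1314911136),
}

# The pruning point: the row's columns come back together, and the
# requested slice is filtered out before hand-over to the client.
requested = {"col_a", "col_c"}
pruned = {name: col for name, col in cf_for_one_row.items() if name in requested}
print(pruned)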
On Sep 1, 2011 1:36 PM, "Ian Danforth" <idanforth@numenta.com> wrote:
> I'm not sure I understand the scalability of this approach. A given
> column family can be HUGE, with millions of rows and columns. In my
> cluster I have a single column family that accounts for 90 GB of load
> on each node. Not only that, but the column family is distributed over the
> entire ring.
>
> Clearly I'm misunderstanding something.
>
> Ian
>
> On Thu, Sep 1, 2011 at 1:17 PM, Yang <teddyyyy123@gmail.com> wrote:
>> When Cassandra reads, the entire CF is always read together; only at the
>> hand-over to the client does the pruning happen.
>>
>> On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne <dhawth@gmx.3crowd.com> wrote:
>>>
>>> I'm curious... digging through the source, it looks like replicate on
>>> write triggers a read of the entire row, and not just the
>>> columns/supercolumns that are affected by the counter update. Is this the
>>> case? It would certainly explain why my inserts/sec decay over time and why
>>> the average insert latency increases over time. The strange thing is that
>>> I'm not seeing disk read IO increase over that same period, but that might
>>> be due to the OS buffer cache...
>>>
>>> On another note, on a 5-node cluster I'm only seeing 3 nodes with
>>> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that
>>> normal? I'm using RandomPartitioner...
>>>
>>> Address      DC           Rack   Status  State   Load       Owns     Token
>>>                                                                      136112946768375385385349842972707284580
>>> 10.0.0.57    datacenter1  rack1  Up      Normal  2.26 GB    20.00%   0
>>> 10.0.0.56    datacenter1  rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
>>> 10.0.0.55    datacenter1  rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
>>> 10.0.0.54    datacenter1  rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
>>> 10.0.0.72    datacenter1  rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580
>>>
>>> The nodes with ReplicateOnWrites are the 3 in the middle. The first node
>>> and last node both have a count of 0. This is a clean cluster, and I've
>>> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
>>> hours. The last time this test ran, it went all the way down to 500
>>> inserts/sec before I killed it.
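(Side note on the ring output above: those five tokens are just the evenly spaced RandomPartitioner tokens for 5 nodes. A quick standalone script to check -- not part of any Cassandra tooling:)

# RandomPartitioner tokens live in [0, 2**127); even spacing for N nodes
# is i * (2**127 // N), which reproduces the 5 tokens shown above.
num_nodes = 5
ring_size = 2 ** 127
for i in range(num_nodes):
    print(i * (ring_size // num_nodes))

They match exactly, so each node really does own an even 20% of the token range.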