From: Yang <teddyyyy123@gmail.com>
To: user@cassandra.apache.org
Date: Thu, 1 Sep 2011 14:05:36 -0700
Subject: Re: Replicate On Write behavior

Sorry, I meant CF * row -- i.e. one row's slice of a column family, not the whole column family.

If you look in the code, db.cf is basically just a set of columns.
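(Roughly, sketched in Python rather than the actual Java, with made-up names just to illustrate what I mean -- this is a mental model, not the real db.cf class or read path:)

# Rough mental model only -- made-up names, not the real db.cf Java class.
# One row's data under a single column family is essentially a map of
# column name -> (value, timestamp); "CF * row" means this unit.
cf_for_one_row = {
    "col_a": ("value_a", 1314911136),
    "col_b": ("value_b", 1314911136),
    "col_c": ("value_c", 1314911136),
}

# The pruning point: the row's columns come back together, and the
# requested slice is filtered out before hand-over to the client.
requested = {"col_a", "col_c"}
pruned = {name: col for name, col in cf_for_one_row.items() if name in requested}
print(pruned)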
On Sep 1, 2011 1:36 PM, "Ian Danforth" <idanforth@numenta.com> wrote:
> I'm not sure I understand the scalability of this approach. A given
> column family can be HUGE, with millions of rows and columns. In my
> cluster I have a single column family that accounts for 90 GB of load
> on each node. Not only that, but the column family is distributed over the
> entire ring.
>
> Clearly I'm misunderstanding something.
>
> Ian
>
> On Thu, Sep 1, 2011 at 1:17 PM, Yang <teddyyyy123@gmail.com> wrote:
>> When Cassandra reads, the entire CF is always read together; only at the
>> hand-over to the client does the pruning happen.
>>
>> On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne <dhawth@gmx.3crowd.com> wrote:
>>>
>>> I'm curious... digging through the source, it looks like replicate on
>>> write triggers a read of the entire row, and not just the
>>> columns/supercolumns that are affected by the counter update. Is this the
>>> case? It would certainly explain why my inserts/sec decay over time and why
>>> the average insert latency increases over time. The strange thing is that
>>> I'm not seeing disk read IO increase over that same period, but that might
>>> be due to the OS buffer cache...
>>>
>>> On another note, on a 5-node cluster I'm only seeing 3 nodes with
>>> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that
>>> normal? I'm using RandomPartitioner...
>>>
>>> Address      DC           Rack   Status  State   Load       Owns     Token
>>>                                                                      136112946768375385385349842972707284580
>>> 10.0.0.57    datacenter1  rack1  Up      Normal  2.26 GB    20.00%   0
>>> 10.0.0.56    datacenter1  rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
>>> 10.0.0.55    datacenter1  rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
>>> 10.0.0.54    datacenter1  rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
>>> 10.0.0.72    datacenter1  rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580
>>>
>>> The nodes with ReplicateOnWrites are the 3 in the middle. The first node
>>> and last node both have a count of 0. This is a clean cluster, and I've
>>> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
>>> hours. The last time this test ran, it went all the way down to 500
>>> inserts/sec before I killed it.
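(Side note on the ring output above: those five tokens are just the evenly spaced RandomPartitioner tokens for 5 nodes. A quick standalone script to check -- not part of any Cassandra tooling:)

# RandomPartitioner tokens live in [0, 2**127); even spacing for N nodes
# is i * (2**127 // N), which reproduces the 5 tokens shown above.
num_nodes = 5
ring_size = 2 ** 127
for i in range(num_nodes):
    print(i * (ring_size // num_nodes))

They match exactly, so each node really does own an even 20% of the token range.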