Date: Thu, 1 Sep 2011 15:56:50 -0500 (CDT)
From: "Konstantin  Naryshkin" <konstantinn@a-bb.net>
To: user@cassandra.apache.org
Subject: Re: Replicate On Write behavior

Yeah, I believe that Yang has a typo in his post. A CF is not read in one go;
a row is. As for the scalability of reading all the columns at once, I do not
believe it was ever meant to scale. All the columns in a row are stored
together, on the same set of machines. This means that if you have very large
rows, you can end up with an unbalanced cluster, but it also makes reading
several columns out of a row more efficient: they all live on the same machine
(no need to gather results from several machines) and should read quickly
since they are stored together on disk.
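
To make the "same set of machines" point concrete, here is a rough Python
sketch of how a row key could map to replicas on the ring quoted further down.
It is an illustration under assumptions, not Cassandra's actual code: the
token function approximates RandomPartitioner (MD5 of the row key, taken as a
non-negative integer), the replicas are assumed to be the owning node plus the
next nodes clockwise, and the replication factor of 3 is a guess, since it is
not stated in this thread. The names token_for, replicas_for_token and
replicas_for are made up for the example.

```python
import hashlib
from bisect import bisect_left

# Tokens and addresses copied from the nodetool ring output quoted below.
RING = sorted([
    (0, "10.0.0.57"),
    (34028236692093846346337460743176821145, "10.0.0.56"),
    (68056473384187692692674921486353642290, "10.0.0.55"),
    (102084710076281539039012382229530463435, "10.0.0.54"),
    (136112946768375385385349842972707284580, "10.0.0.72"),
])
TOKENS = [token for token, _ in RING]


def token_for(row_key):
    """Approximate a RandomPartitioner token: MD5 digest of the row key,
    read as a signed big-endian integer, absolute value."""
    digest = hashlib.md5(row_key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def replicas_for_token(token, rf=3):
    """Return the node owning `token` plus the next rf - 1 nodes clockwise.

    A node owns the range (previous token, its own token], so the owner is
    the first node whose token is >= the key token, wrapping past the end.
    rf=3 is an assumed replication factor.
    """
    start = bisect_left(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]


def replicas_for(row_key, rf=3):
    """Placement depends only on the row key, never on column names,
    so every column of a row lands on the same replicas."""
    return replicas_for_token(token_for(row_key), rf)


if __name__ == "__main__":
    for key in (b"row-a", b"row-b", b"row-c"):
        print(key, token_for(key), replicas_for(key))
```

Note that the column names never enter the calculation, which is why a very
large row cannot be spread over more machines than its replica set.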

----- Original Message -----
From: "Ian Danforth" <idanforth@numenta.com>
To: user@cassandra.apache.org
Sent: Thursday, September 1, 2011 4:35:33 PM
Subject: Re: Replicate On Write behavior

I'm not sure I understand the scalability of this approach. A given
column family can be HUGE with millions of rows and columns. In my
cluster I have a single column family that accounts for 90GB of load
on each node. Not only that, but the column family is distributed over the
entire ring.

Clearly I'm misunderstanding something.

Ian

On Thu, Sep 1, 2011 at 1:17 PM, Yang <teddyyyy123@gmail.com> wrote:
> When Cassandra reads, the entire CF is always read together; the pruning
> happens only at the hand-over to the client.
>
> On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne <dhawth@gmx.3crowd.com>
> wrote:
>>
>> I'm curious... digging through the source, it looks like replicate on
>> write triggers a read of the entire row, and not just the
>> columns/supercolumns that are affected by the counter update. Is this the
>> case? It would certainly explain why my inserts/sec decay over time and why
>> the average insert latency increases over time. The strange thing is that
>> I'm not seeing disk read IO increase over that same period, but that might
>> be due to the OS buffer cache...
>>
>> On another note, on a 5-node cluster, I'm only seeing 3 nodes with
>> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that
>> normal? I'm using RandomPartitioner...
>>
>> Address    DC           Rack   Status  State   Load       Owns    Token
>>                                                                   136112946768375385385349842972707284580
>> 10.0.0.57  datacenter1  rack1  Up      Normal  2.26 GB    20.00%  0
>> 10.0.0.56  datacenter1  rack1  Up      Normal  2.47 GB    20.00%  34028236692093846346337460743176821145
>> 10.0.0.55  datacenter1  rack1  Up      Normal  2.52 GB    20.00%  68056473384187692692674921486353642290
>> 10.0.0.54  datacenter1  rack1  Up      Normal  950.97 MB  20.00%  102084710076281539039012382229530463435
>> 10.0.0.72  datacenter1  rack1  Up      Normal  383.25 MB  20.00%  136112946768375385385349842972707284580
>>
>> The nodes with ReplicateOnWrites are the 3 in the middle. The first node
>> and last node both have a count of 0. This is a clean cluster, and I've
>> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
>> hours. The last time this test ran, it went all the way down to 500
>> inserts/sec before I killed it.
>