From: James Cipar <jcipar@cmu.edu>
To: user@cassandra.apache.org
Subject: Re: Consistency model
Date: Sun, 17 Apr 2011 10:09:32 -0400
Message-Id: <86974EFE-F65B-47DB-86E7-0BDEB24E65FB@cmu.edu>
I'm pretty new to Cassandra, but I've also written a client in C++ using the thrift API directly.  From what I've seen, wrapping writes in a retry loop is pretty much mandatory because if you are pushing a lot of data around, you're basically guaranteed to have TimedOutExceptions.  I suppose what I'm getting at is: if you don't have consistency in the case of a TimedOutException, you don't have consistency for any high-throughput application.  Is there a solution to this that I am missing?

On Apr 17, 2011, at 9:42 AM, William Oberman wrote:

> At first I was concerned and was going to +1 on a fix, but I think I was confused on one detail and I'd like to confirm it.
> - An unsuccessful write implies readers can see either the old or new value
> ?
>
> The trick is using a library, it sounds like there is a period of time a write is unsuccessful but you don't know about it (as the retry is internal).  But, (assuming writes are idempotent) QUORUM is actually consistent from successful writes to successful reads... right?
>
> On Sun, Apr 17, 2011 at 1:53 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
> Tyler is correct, because Cassandra doesn't wait until repair writes
> are acked before the answer is returned. This is something we can fix.
>
> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.bridges@gmail.com> wrote:
> > Tyler, your answer seems to contradict this email by Jonathan Ellis
> > [1].  In it Jonathan says,
> >
> > "The important guarantee this gives you is that once one quorum read
> > sees the new value, all others will too.  You can't see the newest
> > version, then see an older version on a subsequent write [sic, I
> > assume he meant read], which is the characteristic of non-strong
> > consistency"
> >
> > Jonathan also says,
> >
> > "{X, Y} and {X, Z} are equivalent: one node with the write, and one
> > without. The read will recognize that X's version needs to be sent to
> > Z, and the write will be complete.  This read and all subsequent ones
> > will see the write.  (Z [sic, I assume he meant Y] will be replicated
> > to asynchronously via read repair.)"
> >
> > To me, the statement "this read and all subsequent ones will see the
> > write" implies that the new value must be committed to Y or Z before
> > the read can return.  If not, the statement must be false.
> >
> > Sean
> >
> > [1]: http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3CAANLkTimEGp8H87mGs_BxZKNCk-A59whXF-Xx58HcAWZm@mail.gmail.com%3E
> >
> > On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <tyler@datastax.com> wrote:
> >> Here's what's probably happening:
> >>
> >> I'm assuming RF=3 and QUORUM writes/reads here.  I'll call the replicas A,
> >> B, and C.
> >>
> >> 1.
Writer process writes sequence number 1 and everything works fine.  A,
> >> B, and C all have sequence number 1.
> >> 2.  Writer process writes sequence number 2.  Replica A writes successfully,
> >> B and C fail to respond in time, and a TimedOutException is returned.
> >> pycassa waits to retry the operation.
> >> 3.  Reader process reads, gets a response from A and B.  When the row from A
> >> and B is merged, sequence number 2 is the newest and is returned.  A read
> >> repair is pushed to B and C, but they don't yet update their data.
> >> 4.  Reader process reads again, gets a response from B and C (before they've
> >> repaired).  These both report sequence number 1, so that's returned to the
> >> client.  This is where you get a decreasing sequence number.
> >> 5.  pycassa eventually retries the write; B and C eventually repair their
> >> data.  Either way, both B and C shortly have sequence number 2.
> >>
> >> I've left out some of the details of read repair, and this scenario could
> >> happen in several slightly different ways, but it should give you an idea of
> >> what's happening.
> >>
> >> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jcipar@cmu.edu> wrote:
> >>>
> >>> Here it is.  There is some setup code and global variable definitions that
> >>> I left out of the previous code, but they are pretty similar to the setup
> >>> code here.
> >>>     import pycassa
> >>>     import random
> >>>     import time
> >>>
> >>>     consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
> >>>     duration = 600
> >>>     sleeptime = 0.0
> >>>     hostlist = 'worker-hostlist'
> >>>
> >>>     def read_servers(fn):
> >>>         f = open(fn)
> >>>         servers = []
> >>>         for line in f:
> >>>             servers.append(line.strip())
> >>>         f.close()
> >>>         return servers
> >>>
> >>>     servers = read_servers(hostlist)
> >>>     start_time = time.time()
> >>>     seqnum = -1
> >>>     timestamp = 0
> >>>     while time.time() < start_time + duration:
> >>>         target_server = random.sample(servers, 1)[0]
> >>>         target_server = '%s:9160' % target_server
> >>>         try:
> >>>             pool = pycassa.connect('Keyspace1', [target_server])
> >>>             cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>             row = cf.get('foo', read_consistency_level=consistency_level)
> >>>             pool.dispose()
> >>>         except Exception:
> >>>             time.sleep(sleeptime)
> >>>             continue
> >>>         sq = int(row['seqnum'])
> >>>         ts = float(row['timestamp'])
> >>>         if sq < seqnum:
> >>>             print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
> >>>         seqnum = sq
> >>>         timestamp = ts
> >>>         if sleeptime > 0.0:
> >>>             time.sleep(sleeptime)
> >>>
> >>>
> >>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
> >>>
> >>> James,
> >>>
> >>> Would you mind sharing your reader process code as well?
> >>>
> >>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jcipar@cmu.edu> wrote:
> >>>>
> >>>> I've been experimenting with the consistency model of Cassandra, and I
> >>>> found something that seems a bit unexpected.  In my experiment, I have 2
> >>>> processes, a reader and a writer, each accessing a Cassandra cluster with a
> >>>> replication factor greater than 1.  In addition, sometimes I generate
> >>>> background traffic to simulate a busy cluster by uploading a large data file
> >>>> to another table.
> >>>>
> >>>> The writer executes a loop where it writes a single row that contains
> >>>> just a sequentially increasing sequence number and a timestamp.  In python
> >>>> this looks something like:
> >>>>
> >>>>     while time.time() < start_time + duration:
> >>>>         target_server = random.sample(servers, 1)[0]
> >>>>         target_server = '%s:9160' % target_server
> >>>>
> >>>>         row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
> >>>>         seqnum += 1
> >>>>         # print 'uploading to server %s, %s' % (target_server, row)
> >>>>
> >>>>         pool = pycassa.connect('Keyspace1', [target_server])
> >>>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>>         cf.insert('foo', row, write_consistency_level=consistency_level)
> >>>>         pool.dispose()
> >>>>
> >>>>         if sleeptime > 0.0:
> >>>>             time.sleep(sleeptime)
> >>>>
> >>>> The reader simply executes a loop reading this row and reporting whenever
> >>>> a sequence number is *less* than the previous sequence number.  As expected,
> >>>> with consistency_level=ConsistencyLevel.ONE there are many inconsistencies,
> >>>> especially with a high replication factor.
> >>>>
> >>>> What is unexpected is that I still detect inconsistencies when it is set
> >>>> at ConsistencyLevel.QUORUM.  This is unexpected because the documentation
> >>>> seems to imply that QUORUM will give consistent results.  With background
> >>>> traffic the average difference in timestamps was 0.6s, and the maximum was
> >>>> >3.5s.  This means that a client sees a version of the row, and can
> >>>> subsequently see another version of the row that is 3.5s older than the
> >>>> previous.
> >>>>
> >>>> What I imagine is happening is this, but I'd like someone who knows what
> >>>> they're talking about to tell me if it's actually the case:
> >>>>
> >>>> I think Cassandra is not using an atomic commit protocol to commit to the
> >>>> quorum of servers chosen when the write is made.
This means that at some
> >>>> point in the middle of the write, some subset of the quorum have seen the
> >>>> write, while others have not.  At this time, there is a quorum of servers
> >>>> that have not seen the update, so depending on which quorum the client reads
> >>>> from, it may or may not see the update.
> >>>>
> >>>> Of course, I understand that the client is not *choosing* a bad quorum to
> >>>> read from, it is just the first `q` servers to respond, but in this case it
> >>>> is effectively random and sometimes a bad quorum is "chosen".
> >>>>
> >>>> Does anyone have any other insight into what is going on here?
> >>>
> >>> --
> >>> Tyler Hobbs
> >>> Software Engineer, DataStax
> >>> Maintainer of the pycassa Cassandra Python client library
> >>
> >> --
> >> Tyler Hobbs
> >> Software Engineer, DataStax
> >> Maintainer of the pycassa Cassandra Python client library
> >
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue, First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) oberman@civicscience.com
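Tyler's five-step scenario above can be sketched as a toy simulation (plain Python, no Cassandra or pycassa involved; the replica names, the dict-of-seqnums model, and the merge-by-newest rule are illustrative stand-ins for what the coordinator actually does):

```python
# Toy model of the scenario: RF=3, QUORUM reads, one row's seqnum per replica.
replicas = {'A': 1, 'B': 1, 'C': 1}   # step 1: seqnum 1 committed everywhere

def quorum_read(nodes):
    """Merge responses from a quorum of replicas; the newest seqnum wins."""
    return max(replicas[n] for n in nodes)

# Step 2: a write of seqnum 2 reaches only A before timing out.
replicas['A'] = 2

# Step 3: a read answered by {A, B} merges to the new value.
first = quorum_read(['A', 'B'])

# Step 4: before read repair lands, a read answered by {B, C} sees only
# the old value -- the sequence number observed by the client goes backwards.
second = quorum_read(['B', 'C'])

assert first == 2 and second == 1   # non-monotonic reads despite QUORUM

# Step 5: the retried write / read repair eventually completes, after
# which every quorum overlaps a replica holding seqnum 2.
replicas['B'] = replicas['C'] = 2
assert quorum_read(['B', 'C']) == 2
```

This makes the overlap argument concrete: {A, B} and {B, C} do intersect, but the intersection (B) only guarantees monotonic reads once the repair write is actually durable on B, which is exactly the window the thread is debating.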