From: James Cipar <jcipar@cmu.edu>
To: user@cassandra.apache.org
Subject: Re: Consistency model
Date: Sun, 17 Apr 2011 10:09:32 -0400
Message-Id: <86974EFE-F65B-47DB-86E7-0BDEB24E65FB@cmu.edu>
I'm pretty new to Cassandra, but I've also written a client in C++ using the thrift API directly.  From what I've seen, wrapping writes in a retry loop is pretty much mandatory because if you are pushing a lot of data around, you're basically guaranteed to have TimedOutExceptions.  I suppose what I'm getting at is: if you don't have consistency in the case of a TimedOutException, you don't have consistency for any high-throughput application.  Is there a solution to this that I am missing?

On Apr 17, 2011, at 9:42 AM, William Oberman wrote:

> At first I was concerned and was going to +1 on a fix, but I think I was confused on one detail and I'd like to confirm it.
> - An unsuccessful write implies readers can see either the old or new value
> ?
>
> The trick is using a library, it sounds like there is a period of time a write is unsuccessful but you don't know about it (as the retry is internal).  But, (assuming writes are idempotent) QUORUM is actually consistent from successful writes to successful reads... right?
>
> On Sun, Apr 17, 2011 at 1:53 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
> Tyler is correct, because Cassandra doesn't wait until repair writes
> are acked before the answer is returned. This is something we can fix.
>
> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.bridges@gmail.com> wrote:
> > Tyler, your answer seems to contradict this email by Jonathan Ellis
> > [1].  In it Jonathan says,
> >
> > "The important guarantee this gives you is that once one quorum read
> > sees the new value, all others will too.  You can't see the newest
> > version, then see an older version on a subsequent write [sic, I
> > assume he meant read], which is the characteristic of non-strong
> > consistency"
> >
> > Jonathan also says,
> >
> > "{X, Y} and {X, Z} are equivalent: one node with the write, and one
> > without. The read will recognize that X's version needs to be sent to
> > Z, and the write will be complete.  This read and all subsequent ones
> > will see the write.  (Z [sic, I assume he meant Y] will be replicated
> > to asynchronously via read repair.)"
> >
> > To me, the statement "this read and all subsequent ones will see the
> > write" implies that the new value must be committed to Y or Z before
> > the read can return.  If not, the statement must be false.
> >
> > Sean
> >
> > [1]: http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3CAANLkTimEGp8H87mGs_BxZKNCk-A59whXF-Xx58HcAWZm@mail.gmail.com%3E
> >
> > On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <tyler@datastax.com> wrote:
> >> Here's what's probably happening:
> >>
> >> I'm assuming RF=3 and QUORUM writes/reads here.  I'll call the replicas A,
> >> B, and C.
> >>
> >> 1.
Writer process writes sequence number 1 and everything works fine.  A,
> >> B, and C all have sequence number 1.
> >> 2.  Writer process writes sequence number 2.  Replica A writes successfully,
> >> B and C fail to respond in time, and a TimedOutException is returned.
> >> pycassa waits to retry the operation.
> >> 3.  Reader process reads, gets a response from A and B.  When the row from A
> >> and B is merged, sequence number 2 is the newest and is returned.  A read
> >> repair is pushed to B and C, but they don't yet update their data.
> >> 4.  Reader process reads again, gets a response from B and C (before they've
> >> repaired).  These both report sequence number 1, so that's returned to the
> >> client.  This is where you get a decreasing sequence number.
> >> 5.  pycassa eventually retries the write; B and C eventually repair their
> >> data.  Either way, both B and C shortly have sequence number 2.
> >>
> >> I've left out some of the details of read repair, and this scenario could
> >> happen in several slightly different ways, but it should give you an idea of
> >> what's happening.
> >>
> >> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jcipar@cmu.edu> wrote:
> >>>
> >>> Here it is.  There is some setup code and global variable definitions that
> >>> I left out of the previous code, but they are pretty similar to the setup
> >>> code here.
> >>>     import pycassa
> >>>     import random
> >>>     import time
> >>>
> >>>     consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
> >>>     duration = 600
> >>>     sleeptime = 0.0
> >>>     hostlist = 'worker-hostlist'
> >>>
> >>>     def read_servers(fn):
> >>>         f = open(fn)
> >>>         servers = []
> >>>         for line in f:
> >>>             servers.append(line.strip())
> >>>         f.close()
> >>>         return servers
> >>>
> >>>     servers = read_servers(hostlist)
> >>>     start_time = time.time()
> >>>     seqnum = -1
> >>>     timestamp = 0
> >>>     while time.time() < start_time + duration:
> >>>         target_server = random.sample(servers, 1)[0]
> >>>         target_server = '%s:9160' % target_server
> >>>         try:
> >>>             pool = pycassa.connect('Keyspace1', [target_server])
> >>>             cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>             row = cf.get('foo', read_consistency_level=consistency_level)
> >>>             pool.dispose()
> >>>         except Exception:
> >>>             time.sleep(sleeptime)
> >>>             continue
> >>>         sq = int(row['seqnum'])
> >>>         ts = float(row['timestamp'])
> >>>         if sq < seqnum:
> >>>             print 'Row changed: %i %f -> %i %f' % (seqnum, timestamp, sq, ts)
> >>>         seqnum = sq
> >>>         timestamp = ts
> >>>         if sleeptime > 0.0:
> >>>             time.sleep(sleeptime)
> >>>
> >>>
> >>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
> >>>
> >>> James,
> >>>
> >>> Would you mind sharing your reader process code as well?
> >>>
> >>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jcipar@cmu.edu> wrote:
> >>>>
> >>>> I've been experimenting with the consistency model of Cassandra, and I
> >>>> found something that seems a bit unexpected.  In my experiment, I have 2
> >>>> processes, a reader and a writer, each accessing a Cassandra cluster with a
> >>>> replication factor greater than 1.  In addition, sometimes I generate
> >>>> background traffic to simulate a busy cluster by uploading a large data file
> >>>> to another table.
> >>>>
> >>>> The writer executes a loop where it writes a single row that contains
> >>>> just a sequentially increasing sequence number and a timestamp.  In python
> >>>> this looks something like:
> >>>>
> >>>>     while time.time() < start_time + duration:
> >>>>         target_server = random.sample(servers, 1)[0]
> >>>>         target_server = '%s:9160' % target_server
> >>>>
> >>>>         row = {'seqnum': str(seqnum), 'timestamp': str(time.time())}
> >>>>         seqnum += 1
> >>>>         # print 'uploading to server %s, %s' % (target_server, row)
> >>>>
> >>>>         pool = pycassa.connect('Keyspace1', [target_server])
> >>>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>>         cf.insert('foo', row, write_consistency_level=consistency_level)
> >>>>         pool.dispose()
> >>>>
> >>>>         if sleeptime > 0.0:
> >>>>             time.sleep(sleeptime)
> >>>>
> >>>> The reader simply executes a loop reading this row and reporting whenever
> >>>> a sequence number is *less* than the previous sequence number.  As expected,
> >>>> with consistency_level=ConsistencyLevel.ONE there are many inconsistencies,
> >>>> especially with a high replication factor.
> >>>>
> >>>> What is unexpected is that I still detect inconsistencies when it is set
> >>>> at ConsistencyLevel.QUORUM.  This is unexpected because the documentation
> >>>> seems to imply that QUORUM will give consistent results.  With background
> >>>> traffic the average difference in timestamps was 0.6s, and the maximum was
> >>>> >3.5s.  This means that a client sees a version of the row, and can
> >>>> subsequently see another version of the row that is 3.5s older than the
> >>>> previous.
> >>>>
> >>>> What I imagine is happening is this, but I'd like someone who knows what
> >>>> they're talking about to tell me if it's actually the case:
> >>>>
> >>>> I think Cassandra is not using an atomic commit protocol to commit to the
> >>>> quorum of servers chosen when the write is made.
This means that at some
> >>>> point in the middle of the write, some subset of the quorum have seen the
> >>>> write, while others have not.  At this time, there is a quorum of servers
> >>>> that have not seen the update, so depending on which quorum the client reads
> >>>> from, it may or may not see the update.
> >>>>
> >>>> Of course, I understand that the client is not *choosing* a bad quorum to
> >>>> read from, it is just the first `q` servers to respond, but in this case it
> >>>> is effectively random and sometimes a bad quorum is "chosen".
> >>>>
> >>>> Does anyone have any other insight into what is going on here?
> >>>
> >>> --
> >>> Tyler Hobbs
> >>> Software Engineer, DataStax
> >>> Maintainer of the pycassa Cassandra Python client library
> >>
> >> --
> >> Tyler Hobbs
> >> Software Engineer, DataStax
> >> Maintainer of the pycassa Cassandra Python client library
> >
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue, First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) oberman@civicscience.com
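Tyler's five-step scenario above can be sketched as a toy simulation (plain Python, no Cassandra or pycassa involved; the replica names, the dict-of-seqnums model, and the merge-by-newest rule are illustrative stand-ins for what the coordinator actually does):

```python
# Toy model of the scenario: RF=3, QUORUM reads, one row's seqnum per replica.
replicas = {'A': 1, 'B': 1, 'C': 1}   # step 1: seqnum 1 committed everywhere

def quorum_read(nodes):
    """Merge responses from a quorum of replicas; the newest seqnum wins."""
    return max(replicas[n] for n in nodes)

# Step 2: a write of seqnum 2 reaches only A before timing out.
replicas['A'] = 2

# Step 3: a read answered by {A, B} merges to the new value.
first = quorum_read(['A', 'B'])

# Step 4: before read repair lands, a read answered by {B, C} sees only
# the old value -- the sequence number observed by the client goes backwards.
second = quorum_read(['B', 'C'])

assert first == 2 and second == 1   # non-monotonic reads despite QUORUM

# Step 5: the retried write / read repair eventually completes, after
# which every quorum overlaps a replica holding seqnum 2.
replicas['B'] = replicas['C'] = 2
assert quorum_read(['B', 'C']) == 2
```

This makes the overlap argument concrete: {A, B} and {B, C} do intersect, but the intersection (B) only guarantees monotonic reads once the repair write is actually durable on B, which is exactly the window the thread is debating.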