cassandra-user mailing list archives

From William Oberman <ober...@civicscience.com>
Subject Re: Consistency model
Date Sun, 17 Apr 2011 14:48:44 GMT
I'm pretty new to all of this, and I'm in the process of building my mental
model of Cassandra, but I'm still feeling better about this thread. The way
I figure it:
1. I'm trying to mutate the state of a key's column from A to B from a
thread somewhere (quorum)
2. I'm trying to read the state of a key from a thread somewhere else
(quorum)

If #1 succeeds I'm guaranteed to see B. If #1 fails (with an exception) I'll
see either A or B. I think I was concerned about that, and wanted to see A
in #2 until success in #1.  But, I wanted to get to state B, and if #1
retries until guaranteed success, do I care if I set B earlier than I
expected?  I'm thinking no.

I guess in terms of distributed algorithms/reasoning about systems, I'm
feeling ok with this level of guarantee (again, given the failed write tells
the client code of the undefined state).
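
A minimal sketch of the reasoning above, assuming RF=3, last-write-wins by
timestamp, and no read repair. `Replica`, `quorum_read`, and `possible_reads`
are made-up names for illustration, not Cassandra or pycassa APIs. It lists
every value a quorum read could return after a failed write (only one replica
has B) and after a successful one (a quorum has B).

```python
from itertools import combinations


class Replica:
    """One replica holding a single column as a (timestamp, value) pair."""

    def __init__(self, timestamp, value):
        self.timestamp = timestamp
        self.value = value

    def apply(self, timestamp, value):
        # Last-write-wins: keep the version with the higher timestamp.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value


def quorum_read(replicas):
    """Return the newest value among the replicas that answered."""
    newest = max(replicas, key=lambda r: r.timestamp)
    return newest.value


def possible_reads(cluster, quorum_size):
    """Values a quorum read could return, over every possible responding set."""
    return {quorum_read(subset) for subset in combinations(cluster, quorum_size)}


# RF=3, all replicas start in state A.
cluster = [Replica(1, "A") for _ in range(3)]
quorum = len(cluster) // 2 + 1  # 2 of 3

# Failed write: only the first replica applied B before the timeout.
cluster[0].apply(2, "B")
after_failed_write = possible_reads(cluster, quorum)

# Successful write: a quorum of replicas has applied B.
cluster[1].apply(2, "B")
after_successful_write = possible_reads(cluster, quorum)

print(sorted(after_failed_write))      # ['A', 'B']
print(sorted(after_successful_write))  # ['B']
```

In this toy model the failed write leaves readers able to see A or B depending
on which two replicas answer, and the successful write leaves only B visible.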

On Apr 17, 2011, at 10:10 AM, James Cipar <jcipar@cmu.edu> wrote:

I'm pretty new to Cassandra, but I've also written a client in C++ using the
thrift API directly.  From what I've seen, wrapping writes in a retry loop
is pretty much mandatory because if you are pushing a lot of data around,
you're basically guaranteed to have TimedOutExceptions.  I suppose what I'm
getting at is: if you don't have consistency in the case of a
TimedOutException, you don't have consistency for any high-throughput
application.  Is there a solution to this that I am missing?
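
For what it's worth, here is a rough sketch of the kind of retry loop being
described. `WriteTimeout` is a hypothetical stand-in for the client's
TimedOutException, and `write_fn` is whatever function actually performs the
write. Both are assumptions for illustration. The one design choice worth
noting is that the timestamp is picked once, so every retry resends the same
mutation and the write stays idempotent. Whether your client lets you pass an
explicit timestamp is something to check.

```python
import time


class WriteTimeout(Exception):
    """Hypothetical stand-in for the client's TimedOutException."""


def write_with_retry(write_fn, value, max_attempts=5, base_delay=0.05):
    """Call write_fn(value, timestamp) until it succeeds or attempts run out.

    The timestamp is chosen once, before the first attempt, so every retry
    resends exactly the same mutation. Returns the number of attempts used.
    """
    timestamp = int(time.time() * 1e6)  # microseconds, fixed across retries
    for attempt in range(1, max_attempts + 1):
        try:
            write_fn(value, timestamp)
            return attempt
        except WriteTimeout:
            if attempt == max_attempts:
                raise  # give up and let the caller see the failure
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Usage with a fake write that times out twice before succeeding.
calls = []


def flaky_write(value, timestamp):
    calls.append((value, timestamp))
    if len(calls) < 3:
        raise WriteTimeout()


attempts = write_with_retry(flaky_write, "B", base_delay=0.001)
print(attempts)  # 3
```

Until the loop returns, the caller should treat the column as being in the
undefined old-or-new state.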


On Apr 17, 2011, at 9:42 AM, William Oberman wrote:

At first I was concerned and was going to +1  on a fix, but I think I was
confused on one detail and I'd like to confirm it.
- An unsuccessful write implies readers can see either the old or new value?

The catch when using a library is that, from the sound of it, there is a
period of time when a write has failed but you don't know about it (because
the retry is internal).  But (assuming writes are idempotent) QUORUM is
actually consistent from successful writes to successful reads... right?
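
The "successful write to successful read" guarantee comes down to quorum
overlap: if W + R > RF, every set of replicas that acked a write shares at
least one replica with every set that answers a read. A small brute-force
check, with illustrative function names that are not part of any Cassandra
API:

```python
from itertools import combinations


def quorum_size(replication_factor):
    """Majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def quorums_always_overlap(replication_factor, write_size, read_size):
    """True if every write set shares a replica with every read set."""
    replicas = range(replication_factor)
    return all(
        set(w) & set(r)
        for w in combinations(replicas, write_size)
        for r in combinations(replicas, read_size)
    )


rf = 3
q = quorum_size(rf)
print(q, quorums_always_overlap(rf, q, q))  # 2 True
print(quorums_always_overlap(rf, 1, 1))     # False (ONE/ONE can miss the write)
```

This only covers writes that were acked by a full quorum. A write that timed
out may have reached fewer than W replicas, so the overlap argument does not
apply to it.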

On Sun, Apr 17, 2011 at 1:53 AM, Jonathan Ellis <jbellis@gmail.com> wrote:

> Tyler is correct, because Cassandra doesn't wait until repair writes
> are acked before the answer is returned. This is something we can fix.
>
> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.bridges@gmail.com>
> wrote:
> > Tyler, your answer seems to contradict this email by Jonathan Ellis
> > [1].  In it Jonathan says,
> >
> > "The important guarantee this gives you is that once one quorum read
> > sees the new value, all others will too.   You can't see the newest
> > version, then see an older version on a subsequent write [sic, I
> > assume he meant read], which is the characteristic of non-strong
> > consistency"
> >
> > Jonathan also says,
> >
> > "{X, Y} and {X, Z} are equivalent: one node with the write, and one
> > without. The read will recognize that X's version needs to be sent to
> > Z, and the write will be complete.  This read and all subsequent ones
> > will see the write.  (Z [sic, I assume he meant Y] will be replicated
> > to asynchronously via read repair.)"
> >
> > To me, the statement "this read and all subsequent ones will see the
> > write" implies that the new value must be committed to Y or Z before
> > the read can return.  If not, the statement must be false.
> >
> > Sean
> >
> >
> > [1]: http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3CAANLkTimEGp8H87mGs_BxZKNCk-A59whXF-Xx58HcAWZm@mail.gmail.com%3E
> >
> > Sean
> >
> > On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <tyler@datastax.com> wrote:
> >> Here's what's probably happening:
> >>
> >> I'm assuming RF=3 and QUORUM writes/reads here.  I'll call the replicas A,
> >> B, and C.
> >>
> >> 1.  Writer process writes sequence number 1 and everything works fine.  A,
> >> B, and C all have sequence number 1.
> >> 2.  Writer process writes sequence number 2.  Replica A writes successfully,
> >> B and C fail to respond in time, and a TimedOutException is returned.
> >> pycassa waits to retry the operation.
> >> 3.  Reader process reads, gets a response from A and B.  When the row from A
> >> and B is merged, sequence number 2 is the newest and is returned.  A read
> >> repair is pushed to B and C, but they don't yet update their data.
> >> 4.  Reader process reads again, gets a response from B and C (before they've
> >> repaired).  These both report sequence number 1, so that's returned to the
> >> client.  This is where you get a decreasing sequence number.
> >> 5.  pycassa eventually retries the write; B and C eventually repair their
> >> data.  Either way, both B and C shortly have sequence number 2.
> >>
> >> I've left out some of the details of read repair, and this scenario could
> >> happen in several slightly different ways, but it should give you an idea of
> >> what's happening.
> >>
> >> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jcipar@cmu.edu> wrote:
> >>>
> >>> Here it is.  There is some setup code and global variable definitions that
> >>> I left out of the previous code, but they are pretty similar to the setup
> >>> code here.
> >>>     import pycassa
> >>>     import random
> >>>     import time
> >>>     consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
> >>>     duration = 600
> >>>     sleeptime = 0.0
> >>>     hostlist = 'worker-hostlist'
> >>>     def read_servers(fn):
> >>>         f = open(fn)
> >>>         servers = []
> >>>         for line in f:
> >>>             servers.append(line.strip())
> >>>         f.close()
> >>>         return servers
> >>>     servers = read_servers(hostlist)
> >>>     start_time = time.time()
> >>>     seqnum = -1
> >>>     timestamp = 0
> >>>     while time.time() < start_time + duration:
> >>>         target_server = random.sample(servers, 1)[0]
> >>>         target_server = '%s:9160'%target_server
> >>>         try:
> >>>             pool = pycassa.connect('Keyspace1', [target_server])
> >>>             cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>             row = cf.get('foo', read_consistency_level=consistency_level)
> >>>             pool.dispose()
> >>>         except:
> >>>             time.sleep(sleeptime)
> >>>             continue
> >>>         sq = int(row['seqnum'])
> >>>         ts = float(row['timestamp'])
> >>>         if sq < seqnum:
> >>>             print 'Row changed: %i %f -> %i %f'%(seqnum, timestamp, sq, ts)
> >>>         seqnum = sq
> >>>         timestamp = ts
> >>>         if sleeptime > 0.0:
> >>>             time.sleep(sleeptime)
> >>>
> >>>
> >>>
> >>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
> >>>
> >>> James,
> >>>
> >>> Would you mind sharing your reader process code as well?
> >>>
> >>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jcipar@cmu.edu> wrote:
> >>>>
> >>>> I've been experimenting with the consistency model of Cassandra, and I
> >>>> found something that seems a bit unexpected.  In my experiment, I have 2
> >>>> processes, a reader and a writer, each accessing a Cassandra cluster with a
> >>>> replication factor greater than 1.  In addition, sometimes I generate
> >>>> background traffic to simulate a busy cluster by uploading a large data file
> >>>> to another table.
> >>>>
> >>>> The writer executes a loop where it writes a single row that contains
> >>>> just a sequentially increasing sequence number and a timestamp.  In Python
> >>>> this looks something like:
> >>>>
> >>>>    while time.time() < start_time + duration:
> >>>>        target_server = random.sample(servers, 1)[0]
> >>>>        target_server = '%s:9160'%target_server
> >>>>
> >>>>        row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
> >>>>        seqnum += 1
> >>>>        # print 'uploading to server %s, %s'%(target_server, row)
> >>>>
> >>>>        pool = pycassa.connect('Keyspace1', [target_server])
> >>>>        cf = pycassa.ColumnFamily(pool, 'Standard1')
> >>>>        cf.insert('foo', row, write_consistency_level=consistency_level)
> >>>>        pool.dispose()
> >>>>
> >>>>        if sleeptime > 0.0:
> >>>>            time.sleep(sleeptime)
> >>>>
> >>>>
> >>>> The reader simply executes a loop reading this row and reporting whenever
> >>>> a sequence number is *less* than the previous sequence number.  As expected,
> >>>> with consistency_level=ConsistencyLevel.ONE there are many inconsistencies,
> >>>> especially with a high replication factor.
> >>>>
> >>>> What is unexpected is that I still detect inconsistencies when it is set
> >>>> at ConsistencyLevel.QUORUM.  This is unexpected because the documentation
> >>>> seems to imply that QUORUM will give consistent results.  With background
> >>>> traffic the average difference in timestamps was 0.6s, and the maximum
> >>>> was >3.5s.  This means that a client sees a version of the row, and can
> >>>> subsequently see another version of the row that is 3.5s older than the
> >>>> previous.
> >>>>
> >>>> What I imagine is happening is this, but I'd like someone who knows what
> >>>> they're talking about to tell me if it's actually the case:
> >>>>
> >>>> I think Cassandra is not using an atomic commit protocol to commit to the
> >>>> quorum of servers chosen when the write is made.  This means that at some
> >>>> point in the middle of the write, some subset of the quorum have seen the
> >>>> write, while others have not.  At this time, there is a quorum of servers
> >>>> that have not seen the update, so depending on which quorum the client reads
> >>>> from, it may or may not see the update.
> >>>>
> >>>> Of course, I understand that the client is not *choosing* a bad quorum to
> >>>> read from, it is just the first `q` servers to respond, but in this case it
> >>>> is effectively random and sometimes a bad quorum is "chosen".
> >>>>
> >>>> Does anyone have any other insight into what is going on here?
> >>>
> >>>
> >>> --
> >>> Tyler Hobbs
> >>> Software Engineer, DataStax
> >>> Maintainer of the pycassa Cassandra Python client library
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Tyler Hobbs
> >> Software Engineer, DataStax
> >> Maintainer of the pycassa Cassandra Python client library
> >>
> >>
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>



-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com
