On Sun, Apr 17, 2011 at 6:42 AM, William Oberman
<oberman@civicscience.com> wrote:
> At first I was concerned and was going to +1 on a fix, but I think I was
> confused on one detail and I'd like to confirm it.
> -An unsuccessful write implies readers can see either the old or new value
Yes. Fixing CASSANDRA-2494 will simply mean that once the new value
is seen in a quorum read, all future quorum reads will see it.
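
To make that concrete, here is a rough toy model of the difference. It is only a sketch, not Cassandra's actual read path: the replicas are a plain dict, and quorum_read, run_scenario and block_on_repair are names made up for this illustration. It assumes RF=3 and a timed-out write that reached only replica A.

```python
def quorum_read(replicas, contacted, block_on_repair):
    """Return the newest sequence number among the contacted replicas.

    replicas: dict mapping replica name -> stored sequence number
    contacted: names of the replicas that answered this read
    block_on_repair: hypothetical flag; if True, the contacted replicas are
        repaired before the read returns, otherwise the repair is treated
        as still in flight (not yet applied)
    """
    newest = max(replicas[name] for name in contacted)
    if block_on_repair:
        for name in contacted:
            replicas[name] = newest
    return newest


def run_scenario(block_on_repair):
    # State after the timed-out write: only A holds sequence number 2.
    replicas = {"A": 2, "B": 1, "C": 1}
    first = quorum_read(replicas, ("A", "B"), block_on_repair)
    second = quorum_read(replicas, ("B", "C"), block_on_repair)
    return first, second


print(run_scenario(block_on_repair=False))  # (2, 1): the value goes backwards
print(run_scenario(block_on_repair=True))   # (2, 2): once seen, always seen
```

In the blocking case B already holds the new value when the first read returns, so every later quorum contains A or B and sees it.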
> The trick is using a library, it sounds like there is a period of time a
> write is unsuccessful but you don't know about it (as the retry is
> internal). But, (assuming writes are idempotent) QUORUM is actually
> consistent from successful writes to successful reads... right?
Yes, a successful quorum write implies that future quorum reads will
see the write.
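
The reason is just quorum overlap, which can be checked by enumeration. A minimal sketch, assuming RF=3 and majority quorums; the helper names quorum_size and read_quorums_missing_write are invented here and are not part of pycassa or Cassandra:

```python
from itertools import combinations


def quorum_size(replication_factor):
    # Majority of the replicas, e.g. 2 when RF=3.
    return replication_factor // 2 + 1


def read_quorums_missing_write(replicas, written_to):
    """List every read quorum that contains no replica holding the write."""
    q = quorum_size(len(replicas))
    holders = set(written_to)
    return [r for r in combinations(replicas, q) if not holders & set(r)]


replicas = ["A", "B", "C"]
# Successful quorum write (A and B acked): no read quorum can miss it.
print(read_quorums_missing_write(replicas, ["A", "B"]))  # []
# Timed-out write that reached only A: one read quorum still misses it.
print(read_quorums_missing_write(replicas, ["A"]))       # [('B', 'C')]
```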
Sean
> On Sun, Apr 17, 2011 at 1:53 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>
>> Tyler is correct, because Cassandra doesn't wait until repair writes
>> are acked before the answer is returned. This is something we can fix.
>>
>> On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.bridges@gmail.com>
>> wrote:
>> > Tyler, your answer seems to contradict this email by Jonathan Ellis
>> > [1]. In it Jonathan says,
>> >
>> > "The important guarantee this gives you is that once one quorum read
>> > sees the new value, all others will too. You can't see the newest
>> > version, then see an older version on a subsequent write [sic, I
>> > assume he meant read], which is the characteristic of non-strong
>> > consistency"
>> >
>> > Jonathan also says,
>> >
>> > "{X, Y} and {X, Z} are equivalent: one node with the write, and one
>> > without. The read will recognize that X's version needs to be sent to
>> > Z, and the write will be complete. This read and all subsequent ones
>> > will see the write. (Z [sic, I assume he meant Y] will be replicated
>> > to asynchronously via read repair.)"
>> >
>> > To me, the statement "this read and all subsequent ones will see the
>> > write" implies that the new value must be committed to Y or Z before
>> > the read can return. If not, the statement must be false.
>> >
>> > Sean
>> >
>> >
>> > [1] :
>> > http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3CAANLkTimEGp8H87mGs_BxZKNCk-A59whXF-Xx58HcAWZm@mail.gmail.com%3E
>> >
>> > On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <tyler@datastax.com> wrote:
>> >> Here's what's probably happening:
>> >>
>> >> I'm assuming RF=3 and QUORUM writes/reads here. I'll call the replicas
>> >> A, B, and C.
>> >>
>> >> 1. Writer process writes sequence number 1 and everything works fine.
>> >>    A, B, and C all have sequence number 1.
>> >> 2. Writer process writes sequence number 2. Replica A writes
>> >>    successfully, B and C fail to respond in time, and a
>> >>    TimedOutException is returned. pycassa waits to retry the operation.
>> >> 3. Reader process reads, gets a response from A and B. When the row
>> >>    from A and B is merged, sequence number 2 is the newest and is
>> >>    returned. A read repair is pushed to B and C, but they don't yet
>> >>    update their data.
>> >> 4. Reader process reads again, gets a response from B and C (before
>> >>    they've repaired). These both report sequence number 1, so that's
>> >>    returned to the client. This is where you get a decreasing sequence
>> >>    number.
>> >> 5. pycassa eventually retries the write; B and C eventually repair
>> >>    their data. Either way, both B and C shortly have sequence number 2.
>> >>
>> >> I've left out some of the details of read repair, and this scenario
>> >> could happen in several slightly different ways, but it should give
>> >> you an idea of what's happening.
>> >>
>> >> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jcipar@cmu.edu> wrote:
>> >>>
>> >>> Here it is. There is some setup code and global variable definitions
>> >>> that
>> >>> I left out of the previous code, but they are pretty similar to the
>> >>> setup
>> >>> code here.
>> >>> import pycassa
>> >>> import random
>> >>> import time
>> >>> consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
>> >>> duration = 600
>> >>> sleeptime = 0.0
>> >>> hostlist = 'worker-hostlist'
>> >>> def read_servers(fn):
>> >>>     f = open(fn)
>> >>>     servers = []
>> >>>     for line in f:
>> >>>         servers.append(line.strip())
>> >>>     f.close()
>> >>>     return servers
>> >>> servers = read_servers(hostlist)
>> >>> start_time = time.time()
>> >>> seqnum = -1
>> >>> timestamp = 0
>> >>> while time.time() < start_time + duration:
>> >>>     target_server = random.sample(servers, 1)[0]
>> >>>     target_server = '%s:9160'%target_server
>> >>>     try:
>> >>>         pool = pycassa.connect('Keyspace1', [target_server])
>> >>>         cf = pycassa.ColumnFamily(pool, 'Standard1')
>> >>>         row = cf.get('foo',
>> >>>                      read_consistency_level=consistency_level)
>> >>>         pool.dispose()
>> >>>     except:
>> >>>         time.sleep(sleeptime)
>> >>>         continue
>> >>>     sq = int(row['seqnum'])
>> >>>     ts = float(row['timestamp'])
>> >>>     if sq < seqnum:
>> >>>         print 'Row changed: %i %f -> %i %f'%(seqnum, timestamp, sq, ts)
>> >>>     seqnum = sq
>> >>>     timestamp = ts
>> >>>     if sleeptime > 0.0:
>> >>>         time.sleep(sleeptime)
>> >>>
>> >>>
>> >>>
>> >>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
>> >>>
>> >>> James,
>> >>>
>> >>> Would you mind sharing your reader process code as well?
>> >>>
>> >>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jcipar@cmu.edu> wrote:
>> >>>>
>> >>>> I've been experimenting with the consistency model of Cassandra, and I
>> >>>> found something that seems a bit unexpected. In my experiment, I have 2
>> >>>> processes, a reader and a writer, each accessing a Cassandra cluster
>> >>>> with a
>> >>>> replication factor greater than 1. In addition, sometimes I generate
>> >>>> background traffic to simulate a busy cluster by uploading a large
>> >>>> data file
>> >>>> to another table.
>> >>>>
>> >>>> The writer executes a loop where it writes a single row that contains
>> >>>> just a sequentially increasing sequence number and a timestamp. In
>> >>>> python this looks something like:
>> >>>>
>> >>>> while time.time() < start_time + duration:
>> >>>>     target_server = random.sample(servers, 1)[0]
>> >>>>     target_server = '%s:9160'%target_server
>> >>>>
>> >>>>     row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
>> >>>>     seqnum += 1
>> >>>>     # print 'uploading to server %s, %s'%(target_server, row)
>> >>>>
>> >>>>     pool = pycassa.connect('Keyspace1', [target_server])
>> >>>>     cf = pycassa.ColumnFamily(pool, 'Standard1')
>> >>>>     cf.insert('foo', row,
>> >>>>               write_consistency_level=consistency_level)
>> >>>>     pool.dispose()
>> >>>>
>> >>>>     if sleeptime > 0.0:
>> >>>>         time.sleep(sleeptime)
>> >>>>
>> >>>>
>> >>>> The reader simply executes a loop reading this row and reporting
>> >>>> whenever
>> >>>> a sequence number is *less* than the previous sequence number. As
>> >>>> expected,
>> >>>> with consistency_level=ConsistencyLevel.ONE there are many
>> >>>> inconsistencies,
>> >>>> especially with a high replication factor.
>> >>>>
>> >>>> What is unexpected is that I still detect inconsistencies when it is
>> >>>> set at ConsistencyLevel.QUORUM. This is unexpected because the
>> >>>> documentation seems to imply that QUORUM will give consistent results.
>> >>>> With background traffic the average difference in timestamps was 0.6s,
>> >>>> and the maximum was >3.5s. This means that a client sees a version of
>> >>>> the row, and can subsequently see another version of the row that is
>> >>>> 3.5s older than the previous.
>> >>>>
>> >>>> What I imagine is happening is this, but I'd like someone who knows
>> >>>> what they're talking about to tell me if it's actually the case:
>> >>>>
>> >>>> I think Cassandra is not using an atomic commit protocol to commit to
>> >>>> the quorum of servers chosen when the write is made. This means that at
>> >>>> some point in the middle of the write, some subset of the quorum have
>> >>>> seen the write, while others have not. At this time, there is a quorum
>> >>>> of servers that have not seen the update, so depending on which quorum
>> >>>> the client reads from, it may or may not see the update.
>> >>>>
>> >>>> Of course, I understand that the client is not *choosing* a bad
>> >>>> quorum to
>> >>>> read from, it is just the first `q` servers to respond, but in this
>> >>>> case it
>> >>>> is effectively random and sometimes a bad quorum is "chosen".
>> >>>>
>> >>>> Does anyone have any other insight into what is going on here?
>> >>>
>> >>>
>> >>> --
>> >>> Tyler Hobbs
>> >>> Software Engineer, DataStax
>> >>> Maintainer of the pycassa Cassandra Python client library
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Tyler Hobbs
>> >> Software Engineer, DataStax
>> >> Maintainer of the pycassa Cassandra Python client library
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) oberman@civicscience.com
>