cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (CASSANDRA-1072) Increment counters
Date Mon, 23 Aug 2010 14:38:27 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900659#action_12900659
] 

Jonathan Ellis edited comment on CASSANDRA-1072 at 8/23/10 10:37 AM:
---------------------------------------------------------------------

I actually think the main problem with the 1072 approach is on the write side, not the read.
 Writes are fragile.  Here is what I mean by that:

Because 1072 "shards" the increments across multiple machines, it can tolerate _temporary_
failures.  This is good.  But because the shards are no longer replicas in the normal Cassandra
sense -- each is responsible for a different set of increments, rather than all maintaining
the same data -- there is no way to create a write CL greater than ONE, and thus, no defense
against _permanent_ failures of single machines.  That is, if a single machine dies at the
right time, you will lose data, and unlike normal cassandra you can't prevent that by requesting
that writes not be acked until a higher CL is achieved.  (You can try to band-aid the problem
with Sylvain's repair-on-write, but only reduces the failure window, it does not eliminate
it.)

A related source of fragility is that operations here are not idempotent from the client's
perspective.  This is why repair-on-write can't be used to simulate higher CLs -- if a client
issues an increment and it comes back TimedOut (or a socket exception to the coordinator),
it has no safe way to retry: if the operation went on to succeed later, but the client responded
to the TimedOut by issuing another increment, we have added new data erroneously; but if we
do not re-issue the operation, and the original operation failed permanently, we have lost
data.  Thus, even temporary node failures can cause data corruption.  (I say corruption, because
I mean to distinguish it from the sort of inconsistency that RR and AES can repair.  Once
introduced into the system, this corruption is not distinguishable from real increments and
thus un-repairable.)

      was (Author: jbellis):
    I actually think the main problem with the 1072 approach is on the write side, not the
read.  Writes are fragile.  Here is what I mean by that:

Because 1072 "shards" the increments across multiple machines, it can tolerate _temporary_
failures.  This is good.  But because the shards are no longer replicas in the normal Cassandra
sense -- each is responsible for a different set of increments, rather than all maintaining
the same data -- there is no way to create a write CL greater than ONE, and thus, no defense
against _permanent_ failures of single machines.  That is, if a single machine dies at the
right time, you will lose data, and unlike normal cassandra you can't prevent that by requesting
that writes not be acked until a higher CL is achieved.  (You can try to band-aid the problem
with Sylvain's repair-on-write, but only reduces the failure window, it does not eliminate
it.)

A related source of fragility is that operations here are not idempotent from the client's
perspective.  This is why repair-on-write can't be used to simulate higher CLs -- if a client
issues an increment and it comes back TimedOut (or a socket exception to the coordinator),
it has no safe way to retry; if the operation did succeed, and the client issues another delete,
we have added new data erroneously; if the operation did not succeed, and we do not retry,
we have lost data.  So here is a case where even temporary node failures can cause data corruption.
 (I say corruption, because I mean to distinguish it from the sort of inconsistency that RR
and AES can repair.  Once introduced into the system, this corruption is not distinguishable
from real increments and thus un-repairable.)
  
> Increment counters
> ------------------
>
>                 Key: CASSANDRA-1072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072-2.patch, CASSANDRA-1072-2.patch, CASSANDRA-1072.patch,
CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
>
>
> Break out the increment counters out of CASSANDRA-580. Classes are shared between the
two features but without the plain version vector code the changeset becomes smaller and more
manageable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message