cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-2495) Add a proper retry mechanism for counters in case of failed request
Date Fri, 24 Jun 2011 15:33:47 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2495:
----------------------------------------

    Attachment: marker_idea.txt

Here are my initial thoughts on that:

First, I think that whatever we come up with will have intrinsic limitations.  What we want
is to be able to retry a failed increment, with the guarantee that the retry will only be
applied if the initial increment was not. By failed increment, we mean here one of:
  * the client got a TimeoutException back
  * the coordinator died and the client got some broken pipe error
  * a bug made the coordinator return a TApplicationException("some unexpected shit happened")

When that happens, several things may have occurred. One possible scenario is that the first
replica (let's call it A) received the increment and persisted it to disk, but then failed
before replicating it. In that case, until A is brought back up, we *cannot* decide whether
a retry should be discarded or actually applied, because we cannot know whether A died just
before persisting the increment or just after.
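
To make the ambiguity concrete, here is a minimal sketch in Python. It is a toy model, not
Cassandra code: the Replica class, the Timeout exception and the die_* flags are all
hypothetical names made up for illustration. The client sees the same error in both runs,
yet a blind retry is correct in one and over-counts in the other.

```python
class Timeout(Exception):
    """Stands in for the TimeoutException the client gets back (toy model)."""


class Replica:
    """Hypothetical single replica holding one counter value."""

    def __init__(self):
        self.count = 0

    def increment(self, delta, die_before_persist=False, die_after_persist=False):
        if die_before_persist:
            raise Timeout()   # failed before the increment was persisted
        self.count += delta   # increment is persisted
        if die_after_persist:
            raise Timeout()   # failed after persisting, before acknowledging


def increment_with_blind_retry(replica, delta, **failure):
    """Retry once on Timeout, without knowing what the first attempt did."""
    try:
        replica.increment(delta, **failure)
    except Timeout:
        replica.increment(delta)


# Case 1: the replica died just before persisting, so the retry is needed.
before = Replica()
increment_with_blind_retry(before, 1, die_before_persist=True)

# Case 2: the replica died just after persisting, so the retry over-counts.
after = Replica()
increment_with_blind_retry(after, 1, die_after_persist=True)

print(before.count, after.count)
```

From the client's side the two cases are indistinguishable, which is why the decision has
to wait until A can be consulted again.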

Which leads me to think that whatever idea we have for this, it will likely have one of the
two following drawbacks:
  # either retry will be limited to CL.ALL (fairly useless in my opinion)
  # or we accept the retry at any CL, but have a way to eventually detect when both the initial
increment and its retry have been applied, and a way to repair when that happens. That
quite probably implies that we will have over-counts, but with the guarantee that they will be
eventually repaired.
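
As a rough illustration of the second option, here is a Python sketch of one generic way
to make a double application detectable: tag each increment with a client-chosen id and
reuse that id on retry. This is an assumption for illustration only. It is not the scheme
described in marker_idea.txt or implemented in CASSANDRA-1546, and the Replica class,
naive_read and repair are hypothetical names. It also assumes a simplified model where
each replica holds its own partial count and a read sums them.

```python
import uuid


class Replica:
    """Hypothetical replica that remembers each applied increment by id."""

    def __init__(self):
        self.applied = {}  # op_id -> delta

    def apply(self, op_id, delta):
        # Re-applying the same id on the same replica changes nothing.
        self.applied[op_id] = delta


def naive_read(replicas):
    """Sum every replica's partial count, with no duplicate detection."""
    return sum(sum(r.applied.values()) for r in replicas)


def repair(replicas):
    """Keep the first copy of each op_id seen and drop the others."""
    seen = set()
    for r in replicas:
        for op_id in list(r.applied):
            if op_id in seen:
                del r.applied[op_id]
            else:
                seen.add(op_id)


a, b = Replica(), Replica()
op = uuid.uuid4()

a.apply(op, 5)  # A persists the increment, then dies; the client times out
b.apply(op, 5)  # the retry, carrying the same id, is accepted by B

over_count = naive_read([a, b])  # once A is back, both copies are counted
repair([a, b])
repaired = naive_read([a, b])    # the duplicate has been detected and dropped
```

The over-count exists between A coming back and the repair running, which matches the
"eventually repaired" guarantee above. Keeping per-operation ids around has storage and
cleanup costs that this sketch ignores.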

Of course, there can be better solutions that I don't see.


Anyway, I had tried to implement the second idea (the eventual repair) back in the day on
CASSANDRA-1546. In particular, I'm attaching to this issue the txt file (marker_idea) from
there that was supposed to explain how this should work. The code in CASSANDRA-1546 is also
supposed to implement this idea, so more details on the specifics can be found there if
the text file is not clear enough. Unfortunately, when I thought about porting this idea to
the current code, I realized that it had corner cases it wasn't handling well: in some
situations the complete death of a node was problematic, and I haven't found a good solution
so far. So the whole idea may or may not be a good starting point.


> Add a proper retry mechanism for counters in case of failed request
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-2495
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2495
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8 beta 1
>            Reporter: Sylvain Lebresne
>         Attachments: marker_idea.txt
>
>
> Contrary to standard inserts, counter increments are not idempotent. As such, replaying
a counter mutation when a TimeoutException occurs could lead to an over-count. This alone
limits the use cases for which counters are a viable solution, so we should try to come up
with a mechanism that allows the replay of a failed counter mutation without the risk of over-count.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
