cassandra-user mailing list archives

From Kant Kodali <k...@peernova.com>
Subject Re: How does cassandra achieve Linearizability?
Date Fri, 10 Feb 2017 18:25:56 GMT
Thanks Ariel! Yes, I knew there are many variations and optimizations of
Paxos. I just wanted to see if there were any plans to improve the existing
Paxos implementation, and it is great to see the work is in progress! I
am going to follow that ticket and read up on the references it points to.


On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <ariel@weisberg.ws> wrote:

> Hi,
>
> Cassandra's implementation of Paxos doesn't implement many optimizations
> that would drastically improve throughput and latency. You need consensus,
> but it doesn't have to be exorbitantly expensive and fall over under any
> kind of contention.
>
> For instance, you could implement EPaxos
> (https://issues.apache.org/jira/browse/CASSANDRA-6246),
> batch multiple operations into the same Paxos round, have an affinity for a
> specific proposer for a specific partition, implement asynchronous commit,
> use a more efficient implementation of the Paxos log, and maybe other
> things.
>
> Ariel
>
>
> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>
> Hi Kant,
>
> If you read the published papers about Paxos, you will most probably
> recognize that there is no way to "do it better". This is a conceptual
> thing due to the nature of distributed systems and the CAP theorem.
> If you want A+P in the triangle, then C is very expensive. Cassandra is
> made mostly for A+P, with tunable C. ACID databases are a completely
> different thing, as they are mostly either not partition tolerant, not
> highly available, or not scalable (in a distributed manner, setting aside
> "monolithic super servers").
>
> There is no free lunch ...
>
>
> 2017-02-10 11:09 GMT+01:00 Kant Kodali <kant@peernova.com>:
>
> "That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra."
>
> Yes, LWTs are expensive. Are there any plans to make them better?
>
> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <kant@peernova.com> wrote:
>
> Hi Jon,
>
> Thanks a lot for your response. I am well aware that LWW != LWT, but I
> was asking about LWW with respect to LWTs, which I believe you
> answered. So thanks very much!
>
>
> kant
>
>
> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad <jonathan.haddad@gmail.com>
> wrote:
>
> LWT != Last Write Wins.  They are totally different.
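A minimal, hypothetical in-memory sketch of why the two differ (not Cassandra code; `Register`, `write` and `compare_and_set` are illustrative names, and nothing distributed is modeled). Two clients each read a counter and write back the value plus one. With blind writes, which is all that last write wins arbitrates, one increment is silently lost; with a compare-and-set, the kind of conditional update an LWT performs, the stale client's write is rejected and it has to re-read and retry.

```python
from dataclasses import dataclass


@dataclass
class Register:
    """Hypothetical single-value register; stands in for one CQL cell."""
    value: int = 0

    def write(self, new_value: int) -> None:
        # Blind write: whichever write lands last simply overwrites.
        self.value = new_value

    def compare_and_set(self, expected: int, new_value: int) -> bool:
        # Conditional write: apply only if the value is still what the
        # client read; report whether it was applied.
        if self.value != expected:
            return False
        self.value = new_value
        return True


def blind_increments() -> int:
    reg = Register()
    a = reg.value          # client A reads 0
    b = reg.value          # client B reads 0 before A writes
    reg.write(a + 1)       # A writes 1
    reg.write(b + 1)       # B also writes 1: A's increment is lost
    return reg.value


def cas_increments() -> int:
    reg = Register()
    a = reg.value                       # client A reads 0
    b = reg.value                       # client B reads 0
    reg.compare_and_set(a, a + 1)       # A is applied: 0 -> 1
    while not reg.compare_and_set(b, b + 1):
        b = reg.value                   # B's read was stale: re-read, retry
    return reg.value


print(blind_increments())  # 1
print(cas_increments())    # 2
```

In CQL terms the conditional branch corresponds to an `UPDATE ... IF col = ?` statement, whose result reports whether it was applied; how the replicas agree on that answer is the Paxos part, which this sketch leaves out.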
>
> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
> meaning you are able to perform operations atomically and in isolation.
> That’s the safety blanket everyone wants but is extremely expensive,
> especially in Cassandra.  The lightweight part, btw, may be a little
> optimistic, especially if a key is under contention.  With regard to the
> “last write” part you’re asking about: with LWT, Cassandra provides the
> timestamp and manages it as part of the ballot, and it is always
> increasing.  See org.apache.cassandra.service.ClientState#getTimestampForPaxos.
> From the code:
>
>  * Returns a timestamp suitable for paxos given the timestamp of the last
>  * known commit (or in progress update).
>  * Paxos ensures that the timestamp it uses for commits respects the serial
>  * order of those commits. It does so by having each replica reject any
>  * proposal whose timestamp is not strictly greater than the last proposal
>  * it accepted. So in practice, which timestamp we use for a given proposal
>  * doesn't affect correctness but it does affect the chance of making
>  * progress (if we pick a timestamp lower than what has been proposed
>  * before, our new proposal will just get rejected).
>
> Effectively, Paxos removes the ability to use custom timestamps and
> addresses clock variance by rejecting ballots whose timestamps are not
> greater than what was last seen.  You can learn more by reading through
> the other comments and code in that file.
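The rule in that javadoc can be sketched as follows. This is a simplified, single-phase model under stated assumptions: integer microsecond timestamps, one replica, and a proposer that already knows the replica's last accepted timestamp (the javadoc's "last known commit (or in progress update)"). `Replica`, `accept` and `timestamp_for_paxos` are hypothetical names; this is not the actual logic of `ClientState#getTimestampForPaxos`.

```python
class Replica:
    """Hypothetical acceptor that only tracks the last accepted timestamp."""

    def __init__(self) -> None:
        self.last_accepted = 0  # microseconds

    def accept(self, ts: int) -> bool:
        # Reject any proposal whose timestamp is not strictly greater than
        # the last proposal this replica accepted.
        if ts <= self.last_accepted:
            return False
        self.last_accepted = ts
        return True


def timestamp_for_paxos(now_micros: int, last_known: int) -> int:
    # Prefer the local clock, but never pick a value at or below the last
    # known timestamp, since such a proposal would just be rejected.
    return max(now_micros, last_known + 1)


replica = Replica()
print(replica.accept(1_000))   # True: first proposal
print(replica.accept(900))     # False: a lagging clock's timestamp is rejected
ts = timestamp_for_paxos(900, replica.last_accepted)
print(ts, replica.accept(ts))  # 1001 True
```

The lagging proposer is not wrong, only stuck: as the javadoc says, the choice of timestamp affects the chance of making progress, not correctness.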
>
> Last write wins is a free-for-all that guarantees you *nothing* except
> that the timestamp is used as a tiebreaker.  Here we acknowledge things
> like the speed of light as being a real problem that isn’t going away
> anytime soon.  This problem is sometimes addressed with event sourcing
> rather than mutating in place.
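To see why last write wins says nothing about real-time order, here is a hypothetical sketch of an LWW merge with one skewed writer clock. `Cell` and `lww_merge` are illustrative names and the numbers are made up; on a tie this sketch just keeps its first argument, and Cassandra's own tie-break behavior is not modeled.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    value: str
    timestamp: int  # assigned by the writer from its own local clock


def lww_merge(a: Cell, b: Cell) -> Cell:
    # Highest timestamp wins; the real-time order of the writes is
    # invisible to the merge. On a tie this sketch keeps `a`.
    return a if a.timestamp >= b.timestamp else b


# Writer A's clock runs 50 units fast; writer B's clock is accurate.
first = Cell("written-first", timestamp=100 + 50)    # really written at t=100
second = Cell("written-second", timestamp=120 + 0)   # really written at t=120

print(lww_merge(first, second).value)  # written-first
print(lww_merge(second, first).value)  # written-first
```

The merge gives the same answer in either order, so replicas converge, but in this example they converge on the write that happened earlier in real time. Event sourcing avoids the question by appending both events rather than letting one overwrite the other.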
>
> Hope this helps.
>
>
> Jon
>
>
>
>
> On Feb 9, 2017, at 5:21 PM, Kant Kodali <kant@peernova.com> wrote:
>
> @Justin I read this article:
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
> It clearly says linearizable consistency can be achieved with LWTs. So
> should I assume that linearizability, in the context of that article, is
> possible with LWTs plus clock synchronization through ntpd, because LWTs
> also follow last write wins? Is that right? Another question: do most
> production clusters set up ntpd? If so, how long does it take to sync?
> Any idea?
>
> @Michael Shuler Are you referring to something like TrueTime as in
> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
> Actually, I had never heard of setting up GPS modules or how that can
> help. Let me research that, but good point.
>
> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler <michael@pbandjelly.org>
> wrote:
>
> If you require the best precision you can get, setting up a pair of
> stratum 1 ntpd masters with GPS modules in each data center location
> is not terribly complex. You get low latency and jitter on servers you
> manage. 140 ms is a long way away network-wise, and I would suggest that
> was a poor choice of upstream source (probably stratum 2 or 3).
>
> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
> need as close as you can get, you'll probably need to do it yourself.
>
> (I run several stratum 2 ntpd servers for pool.ntp.org)
>
> --
> Kind regards,
> Michael
>
> On 02/09/2017 06:47 PM, Kant Kodali wrote:
> > Hi Justin,
> >
> > There were a bunch of issues with clock synchronization when we
> > used ntpd. Also, the time it took to sync the clocks was approx. 140 ms
> > (don't quote me on that, though, since it was reported by our devops :)
> >
> > We have multiple clients (for example, a bunch of microservices
> > reading from Cassandra), so I am not sure how one can achieve
> > linearizability by setting timestamps on the clients, since there is no
> > total ordering across multiple clients.
> >
> > Thanks!
> >
> >
> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <justin@instaclustr.com>
> > wrote:
> >
> >     Hi Kant,
> >
> >     Clock synchronization is important: you should ensure that ntpd is
> >     properly configured on all nodes. If your particular use case is
> >     especially sensitive to out-of-order mutations, it is possible to set
> >     timestamps on the client side using the drivers:
> >     https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
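A minimal sketch of what a client-side generator can guarantee, assuming microsecond timestamps and an injectable clock so the behavior can be shown deterministically. `MonotonicTimestamps` and `next_timestamp` are hypothetical names, not a driver API; the idea is similar in spirit to the monotonic generators described in the driver documentation linked above.

```python
import threading
import time
from typing import Callable


def _wall_clock_micros() -> int:
    # Microseconds since the Unix epoch, from the (possibly skewed) local clock.
    return time.time_ns() // 1000


class MonotonicTimestamps:
    """Hypothetical per-client generator of strictly increasing timestamps."""

    def __init__(self, clock: Callable[[], int] = _wall_clock_micros) -> None:
        self._clock = clock
        self._last = 0
        self._lock = threading.Lock()  # several threads may share one instance

    def next_timestamp(self) -> int:
        with self._lock:
            now = self._clock()
            # If the clock stalled or was stepped backwards (e.g. by ntpd),
            # move 1 microsecond past the previous value instead of reusing it.
            self._last = max(now, self._last + 1)
            return self._last


# Simulated clock readings: a repeat, then a backwards step, then recovery.
readings = iter([1_000, 1_000, 990, 1_005])
gen = MonotonicTimestamps(clock=lambda: next(readings))
print([gen.next_timestamp() for _ in range(4)])  # [1000, 1001, 1002, 1005]
```

The guarantee holds per generator only. Two clients whose clocks disagree can still produce timestamps that do not match real-time order, which is exactly the concern about multiple clients raised earlier in the thread.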
> >
> >     We use our own NTP cluster to reduce clock drift as much as
> >     possible, but public NTP servers are good enough for most
> >     uses. https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
> >
> >     Cheers,
> >     Justin
> >
> >     On Thu, 9 Feb 2017 at 16:09 Kant Kodali <kant@peernova.com> wrote:
> >
> >         How does Cassandra achieve linearizability with “last write
> >         wins” (a conflict resolution method based on time-of-day clocks)?
> >
> >         Relying on synchronized clocks is almost certainly
> >         non-linearizable, because clock timestamps cannot be guaranteed
> >         to be consistent with actual event ordering due to clock skew.
> >         Isn't that right?
> >
> >         Thanks!
> >
> >     --
> >
> >     Justin Cameron
> >
> >     Senior Software Engineer | Instaclustr
>
>
> --
> Benjamin Roth
> Prokurist (authorized signatory)
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>
>
>
