cassandra-user mailing list archives

From Ariel Weisberg <ar...@weisberg.ws>
Subject Re: How does cassandra achieve Linearizability?
Date Thu, 16 Feb 2017 22:12:55 GMT
Hi,



That would work and would help a lot with the dueling proposer issue.



A lot of the leader election stuff is designed to reduce the number of
round trips and not just address the dueling proposer issue. Those
schemes will have downtime when the leader fails, because the leader is
there for correctness. Just adding an affinity for a specific proposer
is probably a free lunch.


I don't think you can group keys, because the Paxos proposals are per
partition, which is why we get linear scale-out for Paxos. I don't
believe it's linearizable across multiple partitions. You can use the
clustering key and deterministically pick one of the live replicas for
that clustering key: sort the list of replicas by IP, hash the
clustering key, and use the hash as an index into the list of replicas.
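
In code, the idea is roughly this (an untested sketch; the class and
method names are made up, nothing like this exists in C* today):

    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class ProposerAffinity
    {
        // Pick a "preferred proposer" for a key deterministically: sort the
        // live replicas so every coordinator sees the same order, hash the
        // key, and use the hash as an index into the sorted list.
        public static InetAddress preferredProposer(List<InetAddress> liveReplicas, String key)
        {
            List<InetAddress> sorted = new ArrayList<>(liveReplicas);
            // Sorting by the textual address isn't numeric IP order, but any
            // deterministic order that all coordinators agree on will do.
            sorted.sort(Comparator.comparing(InetAddress::getHostAddress));
            // floorMod keeps the index non-negative even for negative hashes.
            int index = Math.floorMod(hash(key), sorted.size());
            return sorted.get(index);
        }

        private static int hash(String key)
        {
            // Any stable hash works as long as every coordinator computes the
            // same value; a real version would hash the serialized key bytes.
            int h = 17;
            for (byte b : key.getBytes(StandardCharsets.UTF_8))
                h = 31 * h + b;
            return h;
        }
    }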


Batching is of limited usefulness because we only use Paxos for CAS, I
think? So in a batch, by definition, all but one will fail the CAS. This
is something where a distinguished coordinator could help by failing
the rest of the contending requests more cheaply than it currently does.
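
For example, if several clients race the same conditional update, only
one of them can observe a successful CAS; the rest just need to be told
"no" as cheaply as possible. A rough illustration using the DataStax
Java driver (3.x style; the my_ks keyspace and locks table here are
made up):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class CasContention
    {
        public static void main(String[] args)
        {
            // Hypothetical schema: CREATE TABLE my_ks.locks
            // (name text PRIMARY KEY, owner text)
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_ks"))
            {
                // Many clients can run this concurrently; Paxos serializes the
                // proposals, so exactly one of the racing updates is applied
                // and the rest come back with wasApplied() == false.
                ResultSet rs = session.execute(
                    "UPDATE locks SET owner = 'client-a' WHERE name = 'job-1' IF owner = null");
                System.out.println("applied? " + rs.wasApplied());
            }
        }
    }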


Ariel

On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:

> 

> 

> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg
> <ariel@weisberg.ws> wrote:
>>

>> Hi,

>> 

>> Classic Paxos doesn't have a leader. There are variants on the
>> original Lamport approach that will elect a leader (or some other
>> variation like Mencius) to improve throughput, latency, and
>> performance under contention. Cassandra implements the approach from
>> the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb) with no
>> additional optimizations that I am aware of. There is no
>> distinguished proposer (leader).
>> 

>> That paper does go on to discuss electing a distinguished proposer,
>> but that was never done for C*. I believe it's not considered a good
>> fit for C* philosophically.
>> 

>> Ariel

>> 

>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:

>>> @Ariel Weisberg EPaxos looks very interesting as it looks like it
>>> doesn't need any designated leader for C*, but I am assuming the
>>> Paxos that is implemented today for LWTs requires leader election,
>>> and if so, don't we need an odd number of nodes or racks or
>>> DCs to satisfy the N = 2F + 1 constraint to tolerate F failures? I
>>> understand it is not needed when not using LWTs since Cassandra is
>>> a masterless system.
>>> 

>>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <kant@peernova.com>
>>> wrote:
>>>> Thanks Ariel! Yes, I knew there are so many variations and
>>>> optimizations of Paxos. I just wanted to see if there are any plans
>>>> to improve the existing Paxos implementation, and it is great to see
>>>> the work is in progress! I am going to follow that ticket and
>>>> read up on the references pointed to in it.
>>>> 

>>>> 

>>>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <ariel@weisberg.ws>
>>>> wrote:
>>>>>

>>>>> Hi,

>>>>> 

>>>>> Cassandra's implementation of Paxos doesn't implement many
>>>>> optimizations that would drastically improve throughput and
>>>>> latency. You need consensus, but it doesn't have to be
>>>>> exorbitantly expensive and fall over under any kind of contention.
>>>>> 

>>>>> For instance you could implement EPaxos
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6246[1], batch
>>>>> multiple operations into the same Paxos round, have an affinity
>>>>> for a specific proposer for a specific partition, implement
>>>>> asynchronous commit, use a more efficient implementation of the
>>>>> Paxos log, and maybe other things.
>>>>> 

>>>>> 

>>>>> Ariel

>>>>> 

>>>>> 

>>>>> 

>>>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:

>>>>>> Hi Kant,

>>>>>> 

>>>>>> If you read the published papers about Paxos, you will most
>>>>>> probably recognize that there is no way to "do it better". This
>>>>>> is a conceptual thing due to the nature of distributed systems
>>>>>> + the CAP theorem.
>>>>>> If you want A+P in the triangle, then C is very expensive. C* is
>>>>>> made for A+P mostly, with tunable C. In ACID databases this is a
>>>>>> completely different thing, as they are mostly either not
>>>>>> partition tolerant, not highly available, or not scalable (in a
>>>>>> distributed manner, not speaking of "monolithic super servers").
>>>>>> 

>>>>>> There is no free lunch ...

>>>>>> 

>>>>>> 

>>>>>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <kant@peernova.com>:

>>>>>>> "That’s the safety blanket everyone wants but is extremely
>>>>>>> expensive, especially in Cassandra."
>>>>>>> 

>>>>>>> yes LWT's are expensive. Are there any plans to make this
>>>>>>> better?
>>>>>>> 

>>>>>>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali
>>>>>>> <kant@peernova.com> wrote:
>>>>>>>> Hi Jon,

>>>>>>>> 

>>>>>>>> Thanks a lot for your response. I am well aware that LWW !=
>>>>>>>> LWT, but I was talking more in terms of LWW with respect to
>>>>>>>> LWTs, which I believe you answered. So thanks much!
>>>>>>>> 

>>>>>>>> 

>>>>>>>> kant

>>>>>>>> 

>>>>>>>> 

>>>>>>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad
>>>>>>>> <jonathan.haddad@gmail.com> wrote:
>>>>>>>>> LWT != Last Write Wins.  They are totally different.

>>>>>>>>> 

>>>>>>>>> LWTs give you (assuming you also read at SERIAL) “atomic
>>>>>>>>> consistency”, meaning you are able to perform operations
>>>>>>>>> atomically and in isolation.  That’s the safety blanket
>>>>>>>>> everyone wants but is extremely expensive, especially in
>>>>>>>>> Cassandra.  The lightweight part, btw, may be a little
>>>>>>>>> optimistic, especially if a key is under contention.  With
>>>>>>>>> regard to the “last write” part you’re asking about - w/ LWT
>>>>>>>>> Cassandra provides the timestamp and manages it as part of the
>>>>>>>>> ballot, and it always is increasing.  See
>>>>>>>>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>>>>>>>>> From the code:
>>>>>>>>> 

>>>>>>>>>  * Returns a timestamp suitable for paxos given the timestamp
>>>>>>>>>    of the last known commit (or in progress update).
>>>>>>>>>  * Paxos ensures that the timestamp it uses for commits
>>>>>>>>>    respects the serial order of those commits. It does so
>>>>>>>>>  * by having each replica reject any proposal whose timestamp
>>>>>>>>>    is not strictly greater than the last proposal it
>>>>>>>>>  * accepted. So in practice, which timestamp we use for a
>>>>>>>>>    given proposal doesn't affect correctness but it does
>>>>>>>>>  * affect the chance of making progress (if we pick a
>>>>>>>>>    timestamp lower than what has been proposed before, our
>>>>>>>>>  * new proposal will just get rejected).

>>>>>>>>> 

>>>>>>>>> Effectively paxos removes the ability to use custom timestamps
>>>>>>>>> and addresses clock variance by rejecting ballots with
>>>>>>>>> timestamps less than what was last seen.  You can learn more
>>>>>>>>> by reading through the other comments and code in that file.
>>>>>>>>> 

>>>>>>>>> Last write wins is a free for all that guarantees you
>>>>>>>>> *nothing* except the timestamp is used as a tiebreaker.  Here
>>>>>>>>> we acknowledge things like the speed of light as being a real
>>>>>>>>> problem that isn’t going away anytime soon.  This problem is
>>>>>>>>> sometimes addressed with event sourcing rather than mutating
>>>>>>>>> in place.
>>>>>>>>> 

>>>>>>>>> Hope this helps.

>>>>>>>>> 

>>>>>>>>> 

>>>>>>>>> Jon

>>>>>>>>> 

>>>>>>>>> 

>>>>>>>>> 

>>>>>>>>> 

>>>>>>>>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <kant@peernova.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 

>>>>>>>>>> @Justin I read this article:
>>>>>>>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
>>>>>>>>>> It clearly says linearizable consistency can be achieved
>>>>>>>>>> with LWTs. So should I assume the linearizability in the
>>>>>>>>>> context of the above article is possible with LWTs and
>>>>>>>>>> synchronization of clocks through ntpd? Because LWTs also
>>>>>>>>>> follow Last Write Wins, isn't it? Another question: do
>>>>>>>>>> most production clusters set up ntpd? If so, what is
>>>>>>>>>> the time it takes to sync? Any idea?
>>>>>>>>>> 

>>>>>>>>>> @Michael Shuler Are you referring to something like
>>>>>>>>>> TrueTime, as in
>>>>>>>>>> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
>>>>>>>>>> Actually I had never heard of setting up GPS modules and how
>>>>>>>>>> that can be helpful. Let me research that, but good point.
>>>>>>>>>> 

>>>>>>>>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler
>>>>>>>>>> <michael@pbandjelly.org> wrote:
>>>>>>>>>>> If you require the best precision you can get, setting up a
>>>>>>>>>>> pair of stratum 1 ntpd masters with GPS modules in each data
>>>>>>>>>>> center location is not terribly complex. Low latency and
>>>>>>>>>>> jitter on servers you manage. 140ms is a long way away
>>>>>>>>>>> network-wise, and I would suggest that was a poor choice of
>>>>>>>>>>> upstream (probably stratum 2 or 3) source.
>>>>>>>>>>> 

>>>>>>>>>>> As Jonathan mentioned, there's no guarantee from Cassandra,
>>>>>>>>>>> but if you need as close as you can get, you'll probably
>>>>>>>>>>> need to do it yourself.
>>>>>>>>>>> 

>>>>>>>>>>> (I run several stratum 2 ntpd servers for pool.ntp.org[2])

>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>  Kind regards, Michael
>>>>>>>>>>>
>>>>>>>>>>>  On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>>>>>>>>>>  > Hi Justin,
>>>>>>>>>>>  >
>>>>>>>>>>>  > There are a bunch of issues w.r.t. synchronization of
>>>>>>>>>>>  > clocks when we used ntpd. Also, the time it took to sync
>>>>>>>>>>>  > the clocks was approx 140ms (don't quote me on it though,
>>>>>>>>>>>  > because it is reported by our devops :)
>>>>>>>>>>>  >
>>>>>>>>>>>  > We have multiple clients (for example a bunch of micro
>>>>>>>>>>>  > services are reading from Cassandra). I am not sure how
>>>>>>>>>>>  > one can achieve linearizability by setting timestamps on
>>>>>>>>>>>  > the clients, since there is no total ordering across
>>>>>>>>>>>  > multiple clients.
>>>>>>>>>>>  >
>>>>>>>>>>>  > Thanks!
>>>>>>>>>>>  >
>>>>>>>>>>>  >
>>>>>>>>>>>  > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron
>>>>>>>>>>>  > <justin@instaclustr.com> wrote:
>>>>>>>>>>>  >
>>>>>>>>>>>  >     Hi Kant,
>>>>>>>>>>>  >
>>>>>>>>>>>  >     Clock synchronization is important - you should
>>>>>>>>>>>  >     ensure that ntpd is properly configured on all nodes.
>>>>>>>>>>>  >     If your particular use case is especially sensitive
>>>>>>>>>>>  >     to out-of-order mutations, it is possible to set
>>>>>>>>>>>  >     timestamps on the client side using the drivers.
>>>>>>>>>>>  >     https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>>>>>>>>>  >
>>>>>>>>>>>  >     We use our own NTP cluster to reduce clock drift as
>>>>>>>>>>>  >     much as possible, but public NTP servers are good
>>>>>>>>>>>  >     enough for most uses.
>>>>>>>>>>>  >     https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>>>>>>>>>  >
>>>>>>>>>>>  >     Cheers, Justin
>>>>>>>>>>>  >
>>>>>>>>>>>  >     On Thu, 9 Feb 2017 at 16:09 Kant Kodali
>>>>>>>>>>>  >     <kant@peernova.com> wrote:
>>>>>>>>>>>  >
>>>>>>>>>>>  >         How does Cassandra achieve linearizability with
>>>>>>>>>>>  >         “Last write wins” (conflict resolution methods
>>>>>>>>>>>  >         based on time-of-day clocks)?
>>>>>>>>>>>  >
>>>>>>>>>>>  >         Relying on synchronized clocks is almost
>>>>>>>>>>>  >         certainly non-linearizable, because clock
>>>>>>>>>>>  >         timestamps cannot be guaranteed to be consistent
>>>>>>>>>>>  >         with actual event ordering due to clock skew,
>>>>>>>>>>>  >         isn't it?
>>>>>>>>>>>  >
>>>>>>>>>>>  >         Thanks!
>>>>>>>>>>>  >
>>>>>>>>>>>  >     --
>>>>>>>>>>>  >
>>>>>>>>>>>  >     Justin Cameron
>>>>>>>>>>>  >
>>>>>>>>>>>  >     Senior Software Engineer | Instaclustr
>>>>>>>>>>>  >
>>>>>>>>>>> 

>>>>>>>>>> 

>>>>>>>>> 

>>>>>>>> 

>>>>>>> 

>>>>>> 

>>>>>> 

>>>>>> 

>>>>>> 

>>>>>> -- 

>>>>>> Benjamin Roth

>>>>>> Prokurist

>>>>>> 

>>>>>> Jaumo GmbH · www.jaumo.com

>>>>>> Wehrstraße 46 · 73035 Göppingen · Germany

>>>>>> Phone +49 7161 304880-6[3] · Fax +49 7161 304880-1[4]

>>>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

>>>>>> 

>>>>>> 

>>>>> 

>>>>> 

>>>>> 

>> 

> One thing that always bothered me: intelligent clients and the dynamic
> snitch are designed to route requests to the same node to take
> advantage of cache pinning etc. You would think that under these
> conditions one could naturally elect a "leader" for a "group" of keys
> that could persist for a few hundred milliseconds and batch up the
> round trips for a number of operations. Maybe that is what the
> distinguished coordinator is in some regards.



Links:

  1. https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22
  2. http://pool.ntp.org/
  3. tel:+49%207161%203048806
  4. tel:+49%207161%203048801
