From: Eric Stevens
Date: Thu, 03 Nov 2016 16:36:08 +0000
Subject: Re: Handle Leap Seconds with Cassandra
To: "anujw_2003@yahoo.co.in", "user@cassandra.apache.org"

You're able to set the timestamp of the write in the client application. If you have a table that is especially sensitive to out-of-order writes and you want to handle the repeated second correctly, you could do the slewing at your client application layer and be explicit with the timestamp for those statements.
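A rough sketch of what that could look like with the DataStax Java driver (the keyspace, table, and column names here are made up for illustration; the timestamp is CQL's usual microseconds since the epoch):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class ExplicitWriteTimestamp {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // Choose the write timestamp in the application instead of letting
            // the coordinator pick one. Any slewing/adjustment around the leap
            // second would be applied here, before the statement is sent.
            long writeTimestampMicros = System.currentTimeMillis() * 1000L;

            session.execute(String.format(
                "UPDATE demo_ks.user_state USING TIMESTAMP %d " +
                "SET status = 'active' WHERE user_id = 42",
                writeTimestampMicros));

            cluster.close();
        }
    }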
On Wed, Nov 2, 2016 at 9:08 PM Ben Bromhead wrote:

> Based on what I've said previously, pretty much every way of avoiding your
> leap-second ordering issue is going to be a "hack", and there will be some
> amount of hope involved.
>
> If the updates occur more than 300ms apart and you are confident your
> nodes have clocks that are within 150ms of each other, then I'd close my
> eyes and hope they all apply the leap second at the same time, within that
> 150ms.
>
> If they are less than 300ms apart (I'm guessing you meant less than
> 300ms), then I would look to figure out what the smallest gap is between
> those two updates and make sure your nodes' clocks are close enough that
> the leap second will occur on all nodes within that gap.
>
> If that's not good enough, you could just halt those scenarios for 2
> seconds over the leap second and then resume them once you've confirmed
> all clocks have skipped.
>
>
> On Wed, 2 Nov 2016 at 18:13 Anuj Wadehra wrote:
>
> Thanks Ben for taking the time to write such a detailed reply!
>
> We don't need strict ordering for all operations, but we are looking at
> scenarios where 2 quick updates to the same column of the same row are
> possible. By quick updates, I mean >300 ms. Configuring NTP properly (as
> mentioned in some blogs in your link) should give fair relative accuracy
> between the Cassandra nodes. But the leap second takes the clock back by
> an ENTIRE second (huge), and the probability of an old write overwriting
> the new one increases drastically. So we want to be proactive about it.
>
> I agree that you should avoid such scenarios with design (if possible).
>
> Good to know that you guys have set up your own NTP servers as per the
> recommendation. Curious: do you also do some monitoring around NTP?
>
>
> Thanks
> Anuj
>
> On Fri, 28 Oct, 2016 at 12:25 AM, Ben Bromhead wrote:
>
> If you need guaranteed strict ordering in a distributed system, I would
> not use Cassandra; Cassandra does not provide this out of the box. I would
> look to a system that uses Lamport or vector clocks. Based on your
> description of how your system runs at the moment (and how close your
> updates are together), you have either already experienced out-of-order
> updates or there is a real possibility you will in the future.
>
> Sorry to be so dire, but if you do require causal consistency / strict
> ordering, you are not getting it at the moment. Distributed systems theory
> is really tricky, even for people that are "experts" on distributed
> systems over unreliable networks (I would certainly not put myself in that
> category). People have made a very good name for themselves by showing
> that the vast majority of distributed databases have had bugs when it
> comes to their various consistency models and the claims these databases
> make.
>
> So make sure you really do need guaranteed causal consistency / strict
> ordering, see whether you can design around it (e.g. using conflict-free
> replicated data types), or choose a system that is designed to provide it.
>
> Having said that... here are some hacky things you could do in Cassandra
> to try and get this behaviour, which I in no way endorse doing :)
>
> - Cassandra counters do leverage a logical clock per shard, and you could
> hack something together with counters and lightweight transactions, but
> you would want to do your homework on counter accuracy before diving into
> it, as I don't know if the implementation is safe in the context of your
> question. Also, this would probably require a significant rework of your
> application plus a significant performance hit. I would invite a counter
> guru to jump in here...
>
> - You can leverage the fact that timestamps are monotonic if you isolate
> writes to a single node for a single shard... but you then lose
> Cassandra's availability guarantees, e.g. a keyspace with an RF of 1 and a
> CL of ONE will get monotonic timestamps (if generated on the server side).
>
> - Continuing down the path of isolating writes to a single node for a
> given shard, you could also isolate writes to the primary replica using
> your client driver during the leap second (make it a minute either side of
> the leap), but again you lose out on availability and you are probably
> already experiencing out-of-order writes given how close your writes and
> updates are.
>
> A note on NTP: NTP is generally fine if you use it to keep the clocks
> synced between the Cassandra nodes. If you are interested in how we have
> implemented NTP at Instaclustr, see our blog post on it:
> https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>
> Ben
>
> On Thu, 27 Oct 2016 at 10:18 Anuj Wadehra wrote:
>
> Hi Ben,
>
> Thanks for your reply. We don't use timestamps in the primary key. We rely
> on server-side timestamps generated by the coordinator, so no functions on
> the client side would help.
>
> Yes, drift can create problems too. But even if you ensure that nodes are
> perfectly synced with NTP, you will surely mess up the order of updates
> during the leap second (interleaving). Some applications update the same
> column of the same row quickly (within a second), and reversing the order
> would corrupt the data.
>
> I am interested in learning how people relying on a strict order of
> updates handle the leap-second scenario when the clock goes back one
> second (the same second is repeated). What kind of tricks do people use to
> ensure that server-side timestamps are monotonic?
>
> As per my understanding, NTP slew mode may not be suitable for Cassandra,
> as it may cause unpredictable drift amongst the Cassandra nodes. Ideas?
>
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
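One client-side trick for the monotonicity question above is a timestamp generator that refuses to go backwards. It only helps if write timestamps are generated in the application rather than by the coordinator, and only within a single client process; a minimal sketch:

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal sketch of a per-process monotonic timestamp source. It hands out
    // microseconds-since-epoch values that never decrease, so a wall clock that
    // steps backwards (e.g. a repeated leap second) cannot reorder this
    // client's writes. It does nothing about skew between different clients or
    // nodes.
    public class MonotonicMicros {
        private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

        public long next() {
            long nowMicros = System.currentTimeMillis() * 1000L;
            // If the clock went backwards, advance at least one microsecond
            // past the previous value instead of reusing or reversing time.
            return last.updateAndGet(prev -> Math.max(prev + 1, nowMicros));
        }
    }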
> On Thu, 20 Oct, 2016 at 11:25 PM, Ben Bromhead wrote:
>
> http://www.datastax.com/dev/blog/preparing-for-the-leap-second gives a
> pretty good overview.
>
> If you are using a timestamp as part of your primary key, this is the
> situation where you could end up overwriting data. I would suggest using
> timeuuid instead, which will ensure that you get different primary keys
> even for data inserted at the exact same timestamp.
>
> The blog post also suggests using certain monotonic timestamp classes in
> Java; however, these will not help you if you have multiple clients that
> may overwrite data.
>
> As for the interleaving or out-of-order problem, this is hard to address
> in Cassandra without resorting to external coordination or LWTs. If you
> are relying on a wall clock to guarantee order in a distributed system,
> you will get yourself into trouble even without leap seconds (clock drift,
> NTP inaccuracy, etc.).
>
> On Thu, 20 Oct 2016 at 10:30 Anuj Wadehra wrote:
>
> Hi,
>
> I would like to know how you guys handle leap seconds with Cassandra.
>
> I am not bothered about the livelock issue, as we are using appropriate
> versions of Linux and Java. I am more interested in finding an optimum
> answer for the following question:
>
> How do you handle wrong ordering of multiple writes (on the same row and
> column) during the leap second? You may overwrite the new value with the
> old one (disaster).
>
> And downtime is not an option :)
>
> I can see that CASSANDRA-9131 is still open.
>
> FYI, we are on 2.0.14.
>
>
> Thanks
> Anuj
>
> --
> Ben Bromhead
> CTO | Instaclustr
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
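For reference, a table keyed by timeuuid as Ben suggests above might look like the following sketch (the demo_ks keyspace and sensor_events table are hypothetical, and CQL's now() function generates the timeuuid at the coordinator):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class TimeuuidKeyExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // Clustering on a timeuuid instead of a plain timestamp means two
            // events written in the same millisecond (or in a repeated leap
            // second) still get distinct primary keys, so nothing is overwritten.
            session.execute(
                "CREATE TABLE IF NOT EXISTS demo_ks.sensor_events (" +
                "  sensor_id int," +
                "  event_time timeuuid," +
                "  value text," +
                "  PRIMARY KEY (sensor_id, event_time)" +
                ") WITH CLUSTERING ORDER BY (event_time DESC)");

            // now() produces a fresh timeuuid for this insert.
            session.execute(
                "INSERT INTO demo_ks.sensor_events (sensor_id, event_time, value) " +
                "VALUES (42, now(), 'reading-1')");

            cluster.close();
        }
    }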