From: Sebastian Estevez
Date: Tue, 20 Oct 2015 12:34:43 -0400
Subject: Re: "invalid global counter shard detected" warning on 2.1.3 and 2.1.10
To: user@cassandra.apache.org

Hi Branton,

  • How much should we be freaking out?
The impact of this is possible counter inaccuracy (overcounting or undercounting). If you are expecting counters to be exactly accurate, you are already in trouble, because they are not. Counter increments are not idempotent, and they run in a distributed system (you've probably read Aleksey's post by now).
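
To make the idempotency point concrete, here is a minimal sketch of how a timeout can lead to overcounting or undercounting. It is a toy in-memory model, not Cassandra's implementation: `ToyCounterStore`, the two client helpers, and the failure flags are all hypothetical names introduced for illustration.

```python
class ToyCounterStore:
    """In-memory stand-in for one counter column (not Cassandra's implementation)."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self, delta: int, fail_before_apply: bool = False,
                  fail_after_apply: bool = False) -> None:
        if fail_before_apply:
            raise TimeoutError("timed out before the write was applied")
        self.value += delta
        if fail_after_apply:
            raise TimeoutError("write applied, but the acknowledgement was lost")


def increment_and_retry(store: ToyCounterStore, delta: int, **first_attempt_failure) -> None:
    """Retry once on timeout; the client cannot tell whether the first try landed."""
    try:
        store.increment(delta, **first_attempt_failure)
    except TimeoutError:
        store.increment(delta)  # may apply the same logical increment twice


def increment_without_retry(store: ToyCounterStore, delta: int, **first_attempt_failure) -> None:
    """Give up on timeout; the increment is lost if it was never applied."""
    try:
        store.increment(delta, **first_attempt_failure)
    except TimeoutError:
        pass


# One logical "+1" that was applied, timed out, and was then retried.
overcounted = ToyCounterStore()
increment_and_retry(overcounted, 1, fail_after_apply=True)

# One logical "+1" that timed out before being applied and was not retried.
undercounted = ToyCounterStore()
increment_without_retry(undercounted, 1, fail_before_apply=True)

print(overcounted.value, undercounted.value)  # prints "2 0"
```

Either client policy is wrong in one of the two failure cases, which is why an increment (unlike an idempotent "set to X" write) cannot be made exactly accurate by retrying alone.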
  • Why is this recurring? If I understand what's happening, this is a self-healing process. So, why would it keep happening? Are we possibly using counters incorrectly?
Even after running sstableupgrade, your counter cells will not be upgraded until they have all been incremented. You may still be seeing the warning on pre-2.1 counter cells that have not been incremented yet.
  • What does it even mean that there were multiple shards for the same counter? How does that situation even occur?
We used to maintain "counter shards" at the sstable level in pre-2.1 counters. This means that on compaction or reads we would essentially add the shards together when getting the value or merging the cells. This caused a series of problems, including the warning you are still seeing. TL;DR: in post-2.1 counters we now store the final value of the counter (not the increment/shard) at the commitlog level and beyond, so this is no longer an issue. Again, read Aleksey's post.
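
The shard merge described above, together with the "differ only in count; will pick highest" rule from the warning text, can be sketched roughly as follows. This is an illustrative model only: it assumes each shard is the (id, clock, count) triple printed in the log line, that the higher clock wins for the same id, and that the counter value is the sum over ids. `Shard`, `merge_shards`, `total`, and the second shard id are hypothetical names, not Cassandra's API.

```python
import logging
from typing import Dict, Iterable, NamedTuple

log = logging.getLogger("toy_counter_context")


class Shard(NamedTuple):
    counter_id: str  # which node's shard this is (assumed meaning of the UUID)
    clock: int       # logical version of the shard
    count: int       # that shard's contribution to the counter


def merge_shards(left: Iterable[Shard], right: Iterable[Shard]) -> Dict[str, Shard]:
    """Merge two shard lists, keeping one shard per counter id."""
    merged: Dict[str, Shard] = {}
    for shard in [*left, *right]:
        current = merged.get(shard.counter_id)
        if current is None or shard.clock > current.clock:
            merged[shard.counter_id] = shard
        elif shard.clock == current.clock and shard.count != current.count:
            # Same id and clock but different counts: the inconsistent case
            # the warning reports. Keep the highest count.
            log.warning("shards %s and %s differ only in count; picking highest",
                        current, shard)
            merged[shard.counter_id] = max(current, shard, key=lambda s: s.count)
    return merged


def total(merged: Dict[str, Shard]) -> int:
    """The counter's value is the sum of the surviving shards' counts."""
    return sum(shard.count for shard in merged.values())


NODE_A = "4aa69016-4cf8-4585-8f23-e59af050d174"  # id taken from the log line
NODE_B = "hypothetical-second-node-id"           # made up for the example

sstable_1 = [Shard(NODE_A, 1, 67158)]
sstable_2 = [Shard(NODE_A, 1, 21486), Shard(NODE_B, 3, 10)]

merged = merge_shards(sstable_1, sstable_2)
print(total(merged))  # prints 67168
```

In this model the two NODE_A shards cannot both be right, so whichever one is kept, the resulting value may be off. That is the inaccuracy the warning is pointing at.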

Many users started fresh tables after upgrading to 2.1, updated only the new tables, and added application logic to decide which table to read from. Something like monthly tables works well if you're doing time-series counters, and would ensure that you stop seeing the warnings on the new/active tables and get the benefits of 2.1 counters quickly.
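
The "application logic to decide which table to read from" might look something like the sketch below. Everything specific here is an assumption for illustration: the table names (`page_views_legacy`, `page_views_YYYY_MM`), the cutover date, and the helper names are hypothetical, and the cutover is assumed to fall on the first day of a month.

```python
from datetime import date
from typing import List

LEGACY_TABLE = "page_views_legacy"  # hypothetical pre-2.1 counter table
CUTOVER = date(2015, 11, 1)         # hypothetical first day of the new monthly tables


def counter_table_for(day: date) -> str:
    """Pick the table holding counters for the given day."""
    if day < CUTOVER:
        return LEGACY_TABLE
    return f"page_views_{day.year:04d}_{day.month:02d}"


def tables_for_range(start: date, end: date) -> List[str]:
    """List every table a read over [start, end] has to touch, in order."""
    tables: List[str] = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        name = counter_table_for(date(year, month, 1))
        if not tables or tables[-1] != name:  # the legacy table covers many months
            tables.append(name)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return tables


print(counter_table_for(date(2015, 12, 5)))
# prints page_views_2015_12
print(tables_for_range(date(2015, 9, 20), date(2015, 12, 5)))
# prints ['page_views_legacy', 'page_views_2015_11', 'page_views_2015_12']
```

Writes would go only to the table returned for the current date, so the legacy table stops receiving increments after the cutover.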



All the best,

Sebastián Estévez

Solutions Architect | 954 905 8615 | sebastian.estevez@datastax.com



On Tue, Oct 20, 2015 at 12:21 PM, Branton Davis <branton.davis@spanning.com> wrote:
Howdy Cassandra folks.

Crickets here and it's sort of unsettling that we're alone with this issue. Is it appropriate to create a JIRA issue for this or is there maybe another way to deal with it?

Thanks!

On Sun, Oct 18, 2015 at 1:55 PM, Branton Davis <branton.davis@spanning.com> wrote:
Hey all.

We've been seeing this warning on one of our clusters:

2015-10-18 14:28:52,898 WARN  [ValidationExecutor:14] org.apache.cassandra.db.context.CounterContext invalid global counter shard detected; (4aa69016-4cf8-4585-8f23-e59af050d174, 1, 67158) and (4aa69016-4cf8-4585-8f23-e59af050d174, 1, 21486) differ only in count; will pick highest to self-heal on compaction

From what I've read and heard in the IRC channel, this warning could be related to not running upgradesstables after upgrading from 2.0.x to 2.1.x. I don't think we ran that then, but we've been at 2.1 since last November. Looking back, the warnings start appearing around June, when no maintenance had been performed on the cluster. At that time, we had been on 2.1.3 for a couple of months. We've been on 2.1.10 for the last week (the upgrade was when we noticed this warning for the first time).

From a suggestion in IRC, I went ahead and ran upgradesstables on all the nodes. Our weekly repair also ran this morning. But the warnings still show up throughout the day.

So, we have many questions:
  • How much should we be freaking out?
  • Why is this recurring? If I understand what's happening, this is a self-healing process. So, why would it keep happening? Are we possibly using counters incorrectly?
  • What does it even mean that there were multiple shards for the same counter? How does that situation even occur?
We're pretty lost here, so any help would be greatly appreciated.

Thanks!

