From: Jonathan Koppenhofer
Date: Wed, 5 Jun 2019 22:30:50 -0400
Subject: Re: AbstractLocalAwareExecutorService Exception During Upgrade
To: user@cassandra.apache.org
Not sure why repair is running, but we are also seeing the same merkle tree issue in a mixed-version cluster in which we intentionally started a repair against 2 upgraded DCs. We are currently researching and can post back if we find the cause, but would also appreciate any suggestions. We have also run a local repair in an upgraded DC in this same mixed-version cluster without issue.

We are going 2.1.x to 3.0.x... and yes, we know you are not supposed to run repairs in mixed-version clusters, so don't do it :) This is a special circumstance where other things have gone wrong.

Thanks

On Wed, Jun 5, 2019, 5:23 PM shalom sagges wrote:

> If anyone has any idea on what might cause this issue, it'd be great.
>
> I don't understand what could trigger this exception. But what I really
> can't understand is why repairs started to run suddenly :-\
> There's no cron job running, no active repair process, no Validation
> compactions, Reaper is turned off... I see repair running only in the logs.
>
> Thanks!
>
> On Wed, Jun 5, 2019 at 2:32 PM shalom sagges wrote:
>
>> Hi All,
>>
>> I'm having a bad situation where, after upgrading 2 nodes (binaries only)
>> from 2.1.21 to 3.11.4, I'm getting a lot of warnings as follows:
>>
>> AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread
>> Thread[ReadStage-5,5,main]: {}
>> java.lang.ArrayIndexOutOfBoundsException: null
>>
>> I also see errors on repairs, but no repair is running at all. I verified
>> this with the ps -ef command and nodetool compactionstats. The error I see is:
>> Failed creating a merkle tree for [repair
>> #a95498f0-8783-11e9-b065-81cdbc6bee08 on system_auth/users, []], /1.2.3.4
>> (see log for details)
>>
>> I saw repair errors on data tables as well.
>> nodetool status shows all nodes are UN, and nodetool describecluster shows
>> two schema versions, as expected during an upgrade.
>>
>> After the warnings appeared, clients started to get timed-out read/write
>> queries.
>> Restarting the 2 nodes solved the clients' connection issues, but the
>> warnings are still being generated in the logs.
>>
>> Did anyone encounter such an issue and know what it means?
>>
>> Thanks!
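[Editor's note] Since the repairs show up only in the logs, one way to see what repair activity the node believes is happening is to pull the repair session IDs out of system.log. This is a sketch under the assumption that the log format matches 3.x; the sample log lines below are illustrative, modeled on the error quoted above, not taken from the poster's cluster:

```shell
# Write a small sample log (hypothetical lines in 3.x system.log style)
cat > /tmp/system.log.sample <<'EOF'
INFO  [AntiEntropyStage:1] 2019-06-05 14:31:02 RepairSession.java - [repair #a95498f0-8783-11e9-b065-81cdbc6bee08] new session: will sync /1.2.3.4 for system_auth.[users]
ERROR [AntiEntropyStage:1] 2019-06-05 14:31:05 RepairSession.java - Failed creating a merkle tree for [repair #a95498f0-8783-11e9-b065-81cdbc6bee08 on system_auth/users, []], /1.2.3.4 (see log for details)
EOF

# Extract the distinct repair session IDs mentioned in the log;
# correlating these IDs across nodes shows which node initiated each session
grep -o 'repair #[0-9a-f-]*' /tmp/system.log.sample | sort -u
# → repair #a95498f0-8783-11e9-b065-81cdbc6bee08
```

Running the same extraction against /var/log/cassandra/system.log on each node, and then searching for those IDs on the other nodes, should reveal which host is originating the sessions (e.g. a forgotten scheduler or a client issuing repair via JMX).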