From: Erick Ramirez
Date: Thu, 27 Feb 2020 13:54:59 +1100
Subject: Re: Hints replays very slow in one DC
To: user@cassandra.apache.org

> Nodes are going down due to Out of Memory and we are using 31GB heap size
> in DC1, however in DC2 (Which serves the traffic) has 16GB heap.
> Why we had to increase heap in DC1 is because, DC1 nodes were going down
> due Out of Memory issue but DC2 nodes never went down.

It doesn't sound right that the primary DC is DC2 but DC1 is the one under load. You might not be aware of it, but the symptom suggests DC1 is getting hit with lots of traffic. If you run netstat (or whatever utility/tool of your choice), you should see the established connections to the cluster. That should give you clues as to where the traffic is coming from.

> We also noticed below kind of messages in system.log
> FailureDetector.java:288 - Not marking nodes down due to local pause of
> 9532654114 > 5000000000

That's another smoking gun that the nodes are buried in GC. A 9.5-second pause is significant. The slow hinted handoffs are really the least of your problems right now. If nodes weren't going down, there wouldn't be hints to hand off in the first place.

Cheers!
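For anyone reading the FailureDetector line quoted in this message: the two numbers are in nanoseconds, which is where the 9.5-second figure comes from (9532654114 ns against a 5000000000 ns threshold). Below is a minimal Python sketch for pulling those values out of system.log lines. The regular expression is an assumption based only on the single line quoted in this thread, and the function name `parse_local_pause` is made up for illustration.

```python
import re

# Pattern inferred from the one log line quoted in this thread; it is an
# assumption, not taken from the Cassandra source.
PAUSE_RE = re.compile(r"local pause of (\d+)\s*>\s*(\d+)")


def parse_local_pause(line):
    """Return (pause_seconds, threshold_seconds), or None if the line does not match."""
    m = PAUSE_RE.search(line)
    if m is None:
        return None
    pause_ns, threshold_ns = (int(g) for g in m.groups())
    # Convert nanoseconds to seconds.
    return pause_ns / 1e9, threshold_ns / 1e9


line = ("FailureDetector.java:288 - Not marking nodes down due to "
        "local pause of 9532654114 > 5000000000")
pause_s, threshold_s = parse_local_pause(line)
print(f"pause {pause_s:.2f}s, threshold {threshold_s:.0f}s")
```

Running something like this over system.log on each DC1 node should show how often the pauses happen and how long they last, which is more useful than a single sample.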