Date: Wed, 2 Jun 2021 11:04:00 +0000 (UTC)
From: "Rodion Smolnikov (Jira)"
To: issues@ignite.apache.org
Reply-To: dev@ignite.apache.org
Mailing-List: contact issues-help@ignite.apache.org; run by ezmlm
Subject: [jira] [Updated] (IGNITE-14474) Improve error message in case rebalance fails

[
https://issues.apache.org/jira/browse/IGNITE-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rodion Smolnikov updated IGNITE-14474:
--------------------------------------
    Description:
Currently we can get a message like this when rebalance fails with an exception (examples from Ignite 2.5; in newer versions the log messages were changed, but the problem is still relevant):
{code:java}
2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing from node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] Cancelled rebalancing [grp=ignite-sys-cache, supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], time=88 ms]
2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] Rebalancing from node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
{code}
In the case above, a marshalling exception leads to a rebalance failure that will never be resolved, i.e. the cluster enters an erroneous state.

We should report issues like this as ERROR. The message should explain that the rebalance has failed, data for the cache was not fully copied to the node, the backup factor is not recovered, and the cluster may not work correctly.

After fix:

The new message will look like this:
{code:java}
2021-06-02 13:52:33,762[ERROR][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing routine has failed, some partitions could be unavailable for reading [grp=cache, rebalanceId=1, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], supplier=bf744bda-ba3d-4f48-8172-26d642000000, unavailablePartitions=[1-256, 768-1024]]
{code}
The new message adds rebalanceId and unavailablePartitions.

  was:
Currently we can get a message like this when rebalance fails with an exception (examples from Ignite 2.5; in newer versions the log messages were changed, but the problem is still relevant):
{code:java}
2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing from node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] Cancelled rebalancing [grp=ignite-sys-cache, supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], time=88 ms]
2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] Rebalancing from node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
{code}
In the case above, a marshalling exception leads to a rebalance failure that will never be resolved, i.e. the cluster enters an erroneous state.

We should report issues like this as ERROR.
The message should explain that the rebalance has failed, data for the cache was not fully copied to the node, the backup factor is not recovered, and the cluster may not work correctly.

After fix:

The new message will look like this:
{code:java}
[2021-06-02 13:52:33,762][ERROR][rebalance-#110%rebalancing.GridCacheRebalancingUnmarshallingFailedSelfTest1%][root] Rebalancing routine has failed, some partitions could be unavailable for reading [grp=cache, rebalanceId=1, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], supplier=bf744bda-ba3d-4f48-8172-26d642000000, unavailablePartitions=[1-256, 768-1024]]
{code}
The new message adds rebalanceId and unavailablePartitions.


> Improve error message in case rebalance fails
> ---------------------------------------------
>
>                 Key: IGNITE-14474
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14474
>             Project: Ignite
>          Issue Type: Improvement
>   Affects Versions: 2.5
>            Reporter: Denis Chudov
>            Assignee: Rodion Smolnikov
>            Priority: Major
>             Fix For: 2.9.2
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently we can get a message like this when rebalance fails with an exception (examples from Ignite 2.5; in newer versions the log messages were changed, but the problem is still relevant):
> {code:java}
> 2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing from node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0].
Supply message couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
> 2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] Cancelled rebalancing [grp=ignite-sys-cache, supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], time=88 ms]
> 2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] Rebalancing from node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
> {code}
> In the case above, a marshalling exception leads to a rebalance failure that will never be resolved, i.e. the cluster enters an erroneous state.
> We should report issues like this as ERROR. The message should explain that the rebalance has failed, data for the cache was not fully copied to the node, the backup factor is not recovered, and the cluster may not work correctly.
>
> After fix:
> The new message will look like this:
> {code:java}
> 2021-06-02 13:52:33,762[ERROR][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing routine has failed, some partitions could be unavailable for reading [grp=cache, rebalanceId=1, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], supplier=bf744bda-ba3d-4f48-8172-26d642000000, unavailablePartitions=[1-256, 768-1024]]
> {code}
> The new message adds rebalanceId and unavailablePartitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)