ignite-issues mailing list archives

From "Rodion Smolnikov (Jira)" <j...@apache.org>
Subject [jira] [Updated] (IGNITE-14474) Improve error message in case rebalance fails
Date Wed, 02 Jun 2021 11:04:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rodion Smolnikov updated IGNITE-14474:
--------------------------------------
    Description: 
Currently we can get a message like this when rebalance fails with an exception (examples
from Ignite 2.5; in newer versions the log messages were changed, but the problem still
exists):
{code:java}
2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing from
node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1],
supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message couldn't be unmarshalled:
class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] Cancelled rebalancing
[grp=ignite-sys-cache, supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion
[topVer=1932, minorTopVer=1], time=88 ms]
2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] Rebalancing from
node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1],
supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message couldn't be unmarshalled:
class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
{code}
In the case above, a marshalling exception leads to a rebalance failure that will never be
resolved, i.e. the cluster enters an erroneous state.

We should report issues like this as ERROR. The message should explain that the rebalance
has failed, that data for the cache was not fully copied to the node, that the backup factor
has not been restored, and that the cluster may not work correctly.

 

After fix:

The new message will look like this:
{code:java}
2021-06-02 13:52:33,762[ERROR][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing routine
has failed, some partitions could be unavailable for reading [grp=cache, rebalanceId=1, topVer=AffinityTopologyVersion
[topVer=2, minorTopVer=0], supplier=bf744bda-ba3d-4f48-8172-26d642000000, unavailablePartitions=[1-256,
768-1024]]
{code}

Added rebalanceId and unavailablePartitions to the message.
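
For illustration only, here is a minimal sketch of how a list of partition ids could be collapsed into the compact range form shown in unavailablePartitions above. The class RebalanceFailureMessage and its methods compactRanges and format are hypothetical names invented for this example; they are not Ignite APIs and do not claim to reflect the actual patch. The sketch also leaves out topVer for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper (not part of Ignite): builds a rebalance-failure message. */
final class RebalanceFailureMessage {
    /** Collapses partition ids into sorted ranges, e.g. ids 3, 1, 2, 7 become "[1-3, 7]". */
    static String compactRanges(Iterable<Integer> parts) {
        // Sort and de-duplicate the ids first.
        TreeSet<Integer> sorted = new TreeSet<>();
        for (Integer p : parts)
            sorted.add(p);

        List<String> ranges = new ArrayList<>();
        Integer start = null;
        Integer prev = null;

        for (int p : sorted) {
            if (start == null)
                start = p;
            else if (p != prev + 1) {
                // A gap closes the current range and opens a new one.
                ranges.add(range(start, prev));
                start = p;
            }
            prev = p;
        }

        if (start != null)
            ranges.add(range(start, prev));

        return "[" + String.join(", ", ranges) + "]";
    }

    /** Formats a single id as "5" and a span as "5-9". */
    private static String range(int from, int to) {
        return from == to ? String.valueOf(from) : from + "-" + to;
    }

    /** Assembles the message text; the field set is an assumption modelled on the log line above. */
    static String format(String grp, long rebalanceId, String supplier, Iterable<Integer> unavailable) {
        return "Rebalancing routine has failed, some partitions could be unavailable for reading [grp=" + grp
            + ", rebalanceId=" + rebalanceId
            + ", supplier=" + supplier
            + ", unavailablePartitions=" + compactRanges(unavailable) + "]";
    }
}
```

Collapsing ids into ranges keeps the message readable when hundreds of partitions are affected, which is presumably why the new log line prints ranges rather than every id.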

  was:
Currently we can get a message like this when rebalance fails with an exception (examples
from Ignite 2.5; in newer versions the log messages were changed, but the problem still
exists):
{code:java}
2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing from
node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1],
supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message couldn't be unmarshalled:
class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] Cancelled rebalancing
[grp=ignite-sys-cache, supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion
[topVer=1932, minorTopVer=1], time=88 ms]
2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] Rebalancing from
node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1],
supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message couldn't be unmarshalled:
class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
{code}
In the case above, a marshalling exception leads to a rebalance failure that will never be
resolved, i.e. the cluster enters an erroneous state.

We should report issues like this as ERROR. The message should explain that the rebalance
has failed, that data for the cache was not fully copied to the node, that the backup factor
has not been restored, and that the cluster may not work correctly.

 

After fix:

The new message will look like this:
{code:java}
[2021-06-02 13:52:33,762][ERROR][rebalance-#110%rebalancing.GridCacheRebalancingUnmarshallingFailedSelfTest1%][root]
Rebalancing routine has failed, some partitions could be unavailable for reading [grp=cache,
rebalanceId=1, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], supplier=bf744bda-ba3d-4f48-8172-26d642000000,
unavailablePartitions=[1-256, 768-1024]]
{code}

Added rebalanceId and unavailablePartitions to the message.


> Improve error message in case rebalance fails
> ---------------------------------------------
>
>                 Key: IGNITE-14474
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14474
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.5
>            Reporter: Denis Chudov
>            Assignee: Rodion Smolnikov
>            Priority: Major
>             Fix For: 2.9.2
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently we can get a message like this when rebalance fails with an exception (examples
from Ignite 2.5; in newer versions the log messages were changed, but the problem still
exists):
> {code:java}
> 2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing
from node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1],
supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message couldn't be unmarshalled:
class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
> 2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] Cancelled
rebalancing [grp=ignite-sys-cache, supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion
[topVer=1932, minorTopVer=1], time=88 ms]
> 2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] Rebalancing
from node cancelled [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1],
supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message couldn't be unmarshalled:
class o.a.i.IgniteCheckedException: Failed to unmarshal object with optimized marshaller
> {code}
> In the case above, a marshalling exception leads to a rebalance failure that will never
be resolved, i.e. the cluster enters an erroneous state.
> We should report issues like this as ERROR. The message should explain that the rebalance
has failed, that data for the cache was not fully copied to the node, that the backup factor
has not been restored, and that the cluster may not work correctly.
>  
> After fix:
> The new message will look like this:
> {code:java}
> 2021-06-02 13:52:33,762[ERROR][utility-#79%xxx%][GridDhtPartitionDemander] Rebalancing
routine has failed, some partitions could be unavailable for reading [grp=cache, rebalanceId=1,
topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], supplier=bf744bda-ba3d-4f48-8172-26d642000000,
unavailablePartitions=[1-256, 768-1024]]
> {code}
> Added rebalanceId and unavailablePartitions to the message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
