ignite-dev mailing list archives

From Denis Mekhanikov <dmekhani...@gmail.com>
Subject Re: High priority TCP discovery messages
Date Wed, 30 Jan 2019 13:07:36 GMT
Yakov,

> You can put a hard limit and process an enqueued MetricsUpdate message
> if the last one of the kind was processed more than metricsUpdFreq millisecs ago.
Makes sense. I'll try implementing it.
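The rule above can be sketched roughly like this (a minimal illustration, not Ignite code; the class and method names are made up for the example):

```java
// Sketch of the discussed rule: duplicate MetricsUpdate messages in the
// worker queue are collapsed, but one is still let through once
// metricsUpdateFreq ms have passed since the last processed one, so
// metrics never go completely stale.
class MetricsThrottle {
    private final long metricsUpdateFreq; // interval in ms, assumed config value
    private long lastProcessed;

    MetricsThrottle(long metricsUpdateFreq) {
        this.metricsUpdateFreq = metricsUpdateFreq;
        this.lastProcessed = -metricsUpdateFreq; // let the first message through
    }

    /** Decides whether an enqueued MetricsUpdate message should be
     *  processed now or skipped as a duplicate. */
    boolean shouldProcess(long nowMs) {
        if (nowMs - lastProcessed >= metricsUpdateFreq) {
            lastProcessed = nowMs;
            return true;
        }
        return false; // a fresh one was handled recently: skip
    }
}
```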

> I would suggest we allow queue overflow for 1 min, but if situation does
> not go to normal then node should fire a special event and then kill itself.
Let's start with a warning in the log and see how such overflows correlate
with network/GC problems.
I'd like to make sure we don't kill innocent nodes.
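A hypothetical watchdog for this could look as follows (names and thresholds are illustrative; the 1-minute grace period is the one Yakov suggests):

```java
// Sketch: log a warning as soon as the discovery worker queue exceeds a
// hard limit, and report the node as failed only if the overflow
// persists past a grace period instead of killing it immediately.
class QueueOverflowWatchdog {
    private final int queueLimit;
    private final long gracePeriodMs;
    private long overflowSince = -1; // -1: queue currently under the limit

    QueueOverflowWatchdog(int queueLimit, long gracePeriodMs) {
        this.queueLimit = queueLimit;
        this.gracePeriodMs = gracePeriodMs;
    }

    /** Called periodically with the current queue size.
     *  Returns true when the node should be considered failed. */
    boolean onQueueSize(int size, long nowMs) {
        if (size <= queueLimit) {
            overflowSince = -1; // recovered: reset the grace timer
            return false;
        }
        if (overflowSince == -1) {
            overflowSince = nowMs; // first overflow observation: warn only
            System.err.println("Discovery queue overflow, size=" + size);
            return false;
        }
        return nowMs - overflowSince >= gracePeriodMs;
    }
}
```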

Anton,

> Maybe, a better choice is to have a special "discovery-like" channel (with a
> ring or an analog) for metrics-like messages
I don't think that creating another data channel is reasonable. It would
require additional network connections and a more complex configuration.
But splitting pings and metrics into different types of messages, as it was
before, and moving metrics distribution to the communication layer
makes sense to me. Some kind of gossip protocol could be used for it.
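The gossip idea could look roughly like this (a toy sketch only, not Ignite's actual communication layer): each node keeps a freshness timestamp per node, and on receiving a peer's view keeps the fresher entry, so metrics spread without going around the ring.

```java
import java.util.HashMap;
import java.util.Map;

// Toy gossip-style merge: on each exchange a node merges a peer's view
// of per-node metrics freshness, keeping the newer timestamp per node.
class GossipView {
    final Map<String, Long> freshness = new HashMap<>();

    void merge(Map<String, Long> peerView) {
        for (Map.Entry<String, Long> e : peerView.entrySet())
            freshness.merge(e.getKey(), e.getValue(), Math::max);
    }
}
```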

> Anyway, why are we fighting duplicates inside the queue instead of
> preventing creation of a new message while the previous one is not yet
> processed on the cluster?

A situation when multiple metrics update messages exist in the cluster is
normal.
The node availability check is based on the fact that a node receives fresh
metrics once per metricsUpdateFreq ms.
If you make the coordinator wait for a previous metrics update message to be
delivered before issuing a new one,
then this frequency will depend on the number of nodes in the cluster,
since the time of one round trip differs between topologies.
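As a back-of-the-envelope illustration with assumed numbers (the latency figure is made up for the example): in a ring of n nodes with an average per-hop latency of hopMs, a full round trip of a discovery message takes roughly n * hopMs, so at 500 nodes and 5 ms per hop it is already 2500 ms, longer than a typical metrics update interval of a couple of seconds.

```java
// Assumed model: a discovery message visits every node on the ring
// once per round trip, so round-trip time grows linearly with n.
class RingMath {
    static long roundTripMs(int nodes, long hopMs) {
        return nodes * hopMs;
    }
}
```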

Alex,

I haven't checked it yet. Theoretically, nodes will fail a bit more often
when their discovery worker queues are flooded with messages.
This change definitely requires extensive testing.

I think you can make metrics update messages have regular priority
separately from fixing the issue that I described.

Denis

Tue, 29 Jan 2019 at 20:44, Alexey Goncharuk <alexey.goncharuk@gmail.com>:

> Folks,
>
> Did we already check that omitting heartbeat priority does not break
> discovery? I am currently working on another issue with discovery, and
> skipping heartbeat priority would help a lot in my case.
>
> --AG
>
> Fri, 11 Jan 2019 at 23:21, Yakov Zhdanov <yzhdanov@apache.org>:
>
> > > How big may the message worker's queue grow until it becomes a problem?
> >
> > Denis, you never know. Imagine a node may be flooded with messages because
> > of increased timeouts and network problems. I remember some cases with
> > hundreds of messages in the queue on large topologies. Please, no O(n)
> > approaches =)
> >
> > > So, we may never come to a point when an actual
> > > TcpDiscoveryMetricsUpdateMessage is processed.
> >
> > Good catch! You can put a hard limit and process an enqueued MetricsUpdate
> > message if the last one of the kind was processed more than metricsUpdFreq
> > millisecs ago.
> >
> > Denis, also note: the initial problem is message queue growth. When we
> > choose to skip messages, it means that the node cannot process certain
> > messages and is most probably experiencing problems. We need to think of
> > killing such nodes. I would suggest we allow queue overflow for 1 min, but
> > if the situation does not go back to normal then the node should fire a
> > special event and then kill itself. Thoughts?
> >
> > --Yakov
> >
>
