ignite-user mailing list archives

From Kristian Rosenvold <krosenv...@apache.org>
Subject Re: Adding a third node to REPLICATED cluster fails to get correct number of elements
Date Fri, 17 Jun 2016 16:06:01 GMT
Sigh, this has all the hallmarks of a thread safety issue or race condition.

I had a perfect testcase that reproduced the problem 100% of the time,
but only when running on distinct nodes (it never occurs on the same
box), with 2 distinct caches, and on Ignite 1.5; I just expanded the
testcase I posted initially. Typically I'd be missing the last 10-20
elements in the cache. I was about 2 seconds from reporting an issue
when I switched to yesterday's 1.7-SNAPSHOT build and the failure went
away. Unfortunately 1.7-SNAPSHOT exhibits the same behaviour with my
production data; it just broke my testcase :( Presumably I just need to
tweak the cache sizes or element counts to hit some kind of non-sweet
spot, and then it will probably fail on my machine too.
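
A minimal sketch of that kind of check (the cache name, ELEMENT_COUNT
and the integer keys are illustrative stand-ins, not the actual
testcase):

    // Sketch: after the new node joins, verify every expected key is
    // present; in the failing runs it is typically the last 10-20 keys
    // that are missing. (imports assumed: org.apache.ignite.*)
    IgniteCache<Integer, String> cache = ignite.cache("testCache"); // illustrative
    int missing = 0;
    for (int i = 0; i < ELEMENT_COUNT; i++) {
        if (!cache.containsKey(i)) {
            missing++;
            System.err.println("missing key: " + i);
        }
    }
    System.out.println(missing + " of " + ELEMENT_COUNT + " keys missing");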

The testcase always worked on a single box, which led me to think
about socket-related issues. But it also required 2 caches to fail,
which led me to think about race conditions, such as the rebalance
terminating once the first node finishes.
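
If the rebalance really is cut short, one way to probe that theory is
to block until rebalancing reports completion before counting; a
sketch, assuming IgniteCache#rebalance() (whose returned future
completes when rebalancing finishes) and illustrative cache names:

    // Sketch: on the newly joined node, wait for both caches to finish
    // rebalancing before asserting element counts, to rule out counting
    // mid-rebalance. (imports assumed: org.apache.ignite.cache.CachePeekMode)
    ignite.cache("cacheA").rebalance().get();
    ignite.cache("cacheB").rebalance().get();
    long sizeA = ignite.cache("cacheA").localSize(CachePeekMode.ALL);
    long sizeB = ignite.cache("cacheB").localSize(CachePeekMode.ALL);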

I'm no stranger to reading bug reports like this myself, and I must
admit this seems pretty tough to diagnose.

Kristian


2016-06-17 14:57 GMT+02:00 Denis Magda <dmagda@gridgain.com>:
> Hi Kristian,
>
> Your test looks absolutely correct to me. However, I didn't manage to
> reproduce this issue on my side either.
>
> Alex G., do you have any ideas on what the reason might be? Can you
> recommend that Kristian enable DEBUG/TRACE log levels for particular
> modules? Perhaps advanced logging will let us pinpoint the issue that
> occurs in Kristian's environment.
>
> —
> Denis
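
For reference, raising the log level for the cache internals might
look like this (a sketch, assuming Ignite runs with the log4j-based
logger; the package name is a plausible target for rebalance
diagnostics, not a confirmed recommendation):

    // Sketch: bump the cache/rebalance internals to DEBUG via log4j 1.x.
    // Assumes Ignite is configured with Log4JLogger; the package is a
    // guess at the relevant module, not an official pointer.
    org.apache.log4j.Logger
        .getLogger("org.apache.ignite.internal.processors.cache")
        .setLevel(org.apache.log4j.Level.DEBUG);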
>
> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <krosenvold@apache.org>
> wrote:
>
> For Ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since
> REPLICATED caches seem to be broken on 1.6 and beyond, I am testing
> this on 1.5:
>
> I can reliably start two nodes and get consistent, correct results;
> let's say each node has 1.5 million elements in a given cache.
>
> Once I start a third or fourth node in the same cluster, it
> consistently gets a random incorrect number of elements in the same
> cache, typically 1.1 million or so.
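
One quick way to make the discrepancy visible per node is to compare
the cluster-wide size with the local size, which should agree on every
node of a REPLICATED cache (a sketch; the cache name is illustrative):

    // Sketch: on a REPLICATED cache each node holds a full copy, so the
    // cluster-wide primary count and the local entry count should match.
    // (imports assumed: org.apache.ignite.cache.CachePeekMode)
    IgniteCache<Object, Object> cache = ignite.cache("myCache"); // illustrative
    long clusterSize = cache.size(CachePeekMode.PRIMARY); // whole cluster
    long localSize = cache.localSize(CachePeekMode.ALL);  // this node only
    System.out.println("cluster=" + clusterSize + ", local=" + localSize);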
>
> I tried to create a testcase to reproduce this on my local machine
> (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95),
> but this fails to reproduce the problem.
>
> I have two nodes in two different datacenters, so there will
> invariably be some differences in latencies/response times between
> the existing two nodes and the newly started node.
>
> This sounds like some kind of timing-related bug, any tips? Is there
> any way I can skew the timing in the testcase?
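
One crude way to skew the timing in a local testcase is to keep writes
flowing while the new node joins, so the rebalance races against
ongoing updates (a sketch of the idea, not an Ignite latency-injection
facility; thirdNodeConfig() is a hypothetical helper):

    // Sketch: interleave the third node's join with ongoing puts to widen
    // the timing window. (imports assumed: java.util.concurrent.*,
    // org.apache.ignite.*; runs in a method declared to throw Exception)
    ExecutorService exec = Executors.newSingleThreadExecutor();
    exec.submit(() -> {
        for (int i = 0; i < 1_000_000; i++)
            cache.put(i, "v" + i);      // load keeps running...
    });
    Thread.sleep(500);                  // ...while some data lands first
    Ignite third = Ignition.start(thirdNodeConfig()); // hypothetical helper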
>
>
> Kristian
>
>
