From: Alexey Goncharuk
To: user@ignite.apache.org
Date: Fri, 17 Jun 2016 11:00:58 -0700
Subject: Re: Adding a third node to REPLICATED cluster fails to get correct number of elements

Kristian,

Are you sure you are using the latest 1.7-SNAPSHOT for your production data? Did you build the binaries yourself? Can you confirm the commit # of the binaries you are using? The issue you are reporting seems to be the same as IGNITE-3305, and since the fix was committed only a couple of days ago, it might not have made it into the nightly snapshot yet.
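
In case it helps, here is a minimal sketch (it just starts a throwaway node
with the default configuration, so adapt it to your setup) that prints the
exact build a node is running, including the revision hash, so it can be
matched against the commit that contains the IGNITE-3305 fix:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class PrintIgniteBuild {
    public static void main(String[] args) {
        // Start a throwaway node with the default configuration (or attach to an
        // already running one); the same information appears in the startup banner.
        try (Ignite ignite = Ignition.start()) {
            // The product version includes the version number, build time and
            // revision hash of the binaries the node was built from.
            System.out.println("Ignite build: " + ignite.version());
        }
    }
}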

2016-06-17 9:06 GMT-07:00 Kristian Rosenvold <krosenvold@apache.org>:
Sigh, this has all the hallmarks of a thread safety issue or race condition.

I had a perfect testcase that replicated the problem 100% of the time,
but only when running on distinct nodes (never occurs on same box)
with 2 distinct caches and with Ignite 1.5; I just expanded the
testcase I posted initially. Typically I'd be missing the last 10-20
elements in the cache. I was about 2 seconds from reporting an issue
and then I switched to yesterday's 1.7-SNAPSHOT version and it went
away. Unfortunately 1.7-SNAPSHOT exhibits the same behaviour with my
production data; the upgrade just stopped my testcase from reproducing it :( Presumably I just need to
tweak the cache sizes or element counts to hit some kind of non-sweet
spot, and then it probably fails on my machine.

The testcase always worked on a single box, which led me to think
about socket-related issues. But it also required 2 caches to fail,
which led me to think about race conditions like the rebalance
terminating once the first node finishes.
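
A minimal sketch of one way to probe that hypothesis on the last node to
join (the cache name, REPLICATED configuration and peek modes below are
illustrative, not the actual testcase): compare the cluster-wide entry
count with the local one before and after explicitly waiting for
rebalancing to finish.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CachePeekMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class ReplicatedCountCheck {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // "testCache" is a placeholder; the configuration must match the one
            // used by the nodes that are already running.
            CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("testCache");
            cfg.setCacheMode(CacheMode.REPLICATED);

            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);

            // Cluster-wide logical count vs. entries physically present on this node.
            System.out.println("cluster size = " + cache.size(CachePeekMode.PRIMARY));
            System.out.println("local size   = " + cache.localSize(CachePeekMode.ALL));

            // Explicitly wait for rebalancing of this cache to finish, then recount.
            // If the local count only catches up after this, the missing elements
            // point at a rebalance race rather than lost data.
            cache.rebalance().get();

            System.out.println("local size   = " + cache.localSize(CachePeekMode.ALL));
        }
    }
}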

I'm no stranger to reading bug reports like this myself, and I must
admit this seems pretty tough to diagnose.

Kristian


2016-06-17 14:57 GMT+02:00 Denis Magda <dmagda@gridgain.com>:
> Hi Kristian,
>
> Your test looks absolutely correct to me. However, I didn't manage to
> reproduce this issue on my side either.
>
> Alex G., do you have any ideas on what could be the reason for that? Can you
> recommend Kristian enable DEBUG/TRACE log levels for particular
> modules? Probably advanced logging will let us pinpoint the issue that
> happens in Kristian's environment.
>
> —
> Denis
>
> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <krosenvold@apache.org>
> wrote:
>
> For Ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since
> REPLICATED caches seem to be broken on 1.6 and beyond, I am testing
> this on 1.5:
>
> I can reliably start two nodes and get consistent, correct results;
> let's say each node has 1.5 million elements in a given cache.
>
> Once I start a third or fourth node in the same cluster, it
> consistently gets a random incorrect number of elements in the same
> cache, typically 1.1 million or so.
>
> I tried to create a testcase to reproduce this on my local machine
> (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95),
> but this fails to reproduce the problem.
>
> I have two nodes in 2 different datacenters, so there will invariably
> be some differences in latencies/response times between the existing 2
> nodes and the newly started node.
>
> This sounds like some kind of timing-related bug, any tips? Is there
> any way I can skew the timing in the testcase?
>
>
> Kristian
>
>
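
Coming back to Denis's suggestion about DEBUG/TRACE logging for particular
modules, here is a minimal sketch of raising the level programmatically with
log4j before the node starts. The package name is an assumption about where
the rebalance/preloader classes live, and it only takes effect if the node
actually logs through log4j (ignite-log4j module on the classpath); an
equivalent category entry in the log4j XML configuration works the same way.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class StartWithRebalanceDebug {
    public static void main(String[] args) {
        // Raise the level for the cache preloader/rebalance package before the
        // node starts; widen it (e.g. to org.apache.ignite.internal.processors.cache)
        // if the output turns out to be too narrow.
        Logger.getLogger(
            "org.apache.ignite.internal.processors.cache.distributed.dht.preloader")
            .setLevel(Level.DEBUG);

        try (Ignite ignite = Ignition.start()) {
            // ... run the reproduction against this node and collect the logs.
        }
    }
}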
