From: Alexey Goncharuk
To: user@ignite.apache.org
Date: Fri, 17 Jun 2016 11:00:58 -0700
Subject: Re: Adding a third node to REPLICATED cluster fails to get correct number of elements

Kristian,

Are you sure you are using the latest 1.7-SNAPSHOT for your production data? Did you build the binaries yourself? Can you confirm the commit # of the binaries you are using? The issue you are reporting seems to be the same as IGNITE-3305, and since the fix was committed only a couple of days ago, it might not have made it into the nightly snapshot yet.
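
In case it helps, here is a minimal sketch (it just starts a throwaway node
with the default configuration, so adapt it to your setup) that prints the
exact build a node is running, including the revision hash, so it can be
matched against the commit that contains the IGNITE-3305 fix:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class PrintIgniteBuild {
    public static void main(String[] args) {
        // Start a throwaway node with the default configuration (or attach to an
        // already running one); the same information appears in the startup banner.
        try (Ignite ignite = Ignition.start()) {
            // The product version includes the version number, build time and
            // revision hash of the binaries the node was built from.
            System.out.println("Ignite build: " + ignite.version());
        }
    }
}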

2016-06-17 9:06 GMT-07:00 Kristian Rosenvold <krosenvold@apache.org>:
Sigh, this has all the hallmarks of a thread safety issue or race condition.

I had a perfect testcase that replicated the problem 100% of the time,
but only when running on distinct nodes (never occurs on same box)
with 2 distinct caches and with Ignite 1.5; I just expanded the
testcase I posted initially. Typically I'd be missing the last 10-20
elements in the cache. I was about 2 seconds from reporting an issue
and then I switched to yesterday's 1.7-SNAPSHOT version and it went
away. Unfortunately 1.7-SNAPSHOT exhibits the same behaviour with my
production data; the upgrade just stopped my testcase from reproducing it :( Presumably I just need to
tweak the cache sizes or element counts to hit some kind of non-sweet
spot, and then it probably fails on my machine.

The testcase always worked on a single box, which led me to think
about socket-related issues. But it also required 2 caches to fail,
which led me to think about race conditions like the rebalance
terminating once the first node finishes.
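
A minimal sketch of one way to probe that hypothesis on the last node to
join (the cache name, REPLICATED configuration and peek modes below are
illustrative, not the actual testcase): compare the cluster-wide entry
count with the local one before and after explicitly waiting for
rebalancing to finish.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CachePeekMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class ReplicatedCountCheck {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // "testCache" is a placeholder; the configuration must match the one
            // used by the nodes that are already running.
            CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("testCache");
            cfg.setCacheMode(CacheMode.REPLICATED);

            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);

            // Cluster-wide logical count vs. entries physically present on this node.
            System.out.println("cluster size = " + cache.size(CachePeekMode.PRIMARY));
            System.out.println("local size   = " + cache.localSize(CachePeekMode.ALL));

            // Explicitly wait for rebalancing of this cache to finish, then recount.
            // If the local count only catches up after this, the missing elements
            // point at a rebalance race rather than lost data.
            cache.rebalance().get();

            System.out.println("local size   = " + cache.localSize(CachePeekMode.ALL));
        }
    }
}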

I'm no stranger to reading bug reports like this myself, and I must
admit this seems pretty tough to diagnose.

Kristian


2016-06-17 14:57 GMT+02:00 Denis Magda <dmagda@gridgain.com>:
> Hi Kristian,
>
> Your test looks absolutely correct to me. However, I didn't manage to
> reproduce this issue on my side either.
>
> Alex G., do you have any ideas on what could be the reason for that? Can you
> recommend Kristian enable DEBUG/TRACE log levels for particular
> modules? Probably advanced logging will let us pinpoint the issue that
> happens in Kristian's environment.
>
> —
> Denis
>
> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <krosenvold@apache.org>
> wrote:
>
> For Ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since
> REPLICATED caches seem to be broken on 1.6 and beyond, I am testing
> this on 1.5:
>
> I can reliably start two nodes and get consistent, correct results;
> let's say each node has 1.5 million elements in a given cache.
>
> Once I start a third or fourth node in the same cluster, it
> consistently gets a random incorrect number of elements in the same
> cache, typically 1.1 million or so.
>
> I tried to create a testcase to reproduce this on my local machine
> (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95),
> but this fails to reproduce the problem.
>
> I have two nodes in 2 different datacenters, so there will invariably
> be some differences in latencies/response times between the existing 2
> nodes and the newly started node.
>
> This sounds like some kind of timing-related bug, any tips? Is there
> any way I can skew the timing in the testcase?
>
>
> Kristian
>
>
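
Coming back to Denis's suggestion about DEBUG/TRACE logging for particular
modules, here is a minimal sketch of raising the level programmatically with
log4j before the node starts. The package name is an assumption about where
the rebalance/preloader classes live, and it only takes effect if the node
actually logs through log4j (ignite-log4j module on the classpath); an
equivalent category entry in the log4j XML configuration works the same way.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class StartWithRebalanceDebug {
    public static void main(String[] args) {
        // Raise the level for the cache preloader/rebalance package before the
        // node starts; widen it (e.g. to org.apache.ignite.internal.processors.cache)
        // if the output turns out to be too narrow.
        Logger.getLogger(
            "org.apache.ignite.internal.processors.cache.distributed.dht.preloader")
            .setLevel(Level.DEBUG);

        try (Ignite ignite = Ignition.start()) {
            // ... run the reproduction against this node and collect the logs.
        }
    }
}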
