From: Philippe Laflamme
To: dev@tephra.incubator.apache.org
Date: Tue, 25 Dec 2018 14:51:28 -0500
Subject: Re: Investigating inconsistencies
After modifying the test to verify the balance every second[1], I was able
to reproduce the issue. The logs are available here[2]:

* tephra-service-hbase-master.log: log of the Tephra coordinator on the
  "master" node
* tephra-service-hbase-secondary.log: log of the Tephra coordinator on the
  "secondary" node
* test-balancebooks.log: log of the test (running on the "slave" node)

In the master node log, we can see the same issue previously described[3],
which corresponds roughly to when the client sees the inconsistency.

Perhaps to confirm this, we could simply introduce a failure in the snapshot
state thread so that it fails to write the snapshot? But it seems like,
regardless of this failure, the WAL should be replayed to produce the
correct state? Perhaps I'm missing something about how the snapshot and the
WAL interact?

Best,
Philippe

[1] https://github.com/plaflamme/incubator-tephra/commit/6caeff5d76f193cdf2e0ad861059bc19cf894705
[2] https://gist.github.com/plaflamme/238a6539da9da1ac3a2e313e05ee82eb
[3] https://gist.github.com/plaflamme/238a6539da9da1ac3a2e313e05ee82eb#file-tephra-service-hbase-master-log-L1360-L1878

On Tue, Dec 25, 2018 at 11:41 AM Philippe Laflamme wrote:

> I just noticed the BalanceBooks example, which is basically the same test
> I just described. I'll use this to replicate the issue.
>
> Philippe
>
> On Tue, Dec 25, 2018 at 10:36 AM Philippe Laflamme wrote:
>
>> Hi,
>>
>> I'm evaluating Tephra and have encountered an issue. I'm looking for
>> insights to determine what the next steps could be to figure out whether
>> this is a configuration issue, a bug in our tooling, a Tephra bug, or
>> something else.
>>
>> Here's the test I'm running:
>>
>> * 3 Vagrant VMs on the same host
>> * Tephra 0.15.0-incubating compiled against CDH 5.11.0 (all tests
>>   succeeded)
>> * HDFS configured in HA
>> * Tephra running in HA with 2 instances
>> * The workload is as follows (bank simulation):
>>   * 4 HBase keys where the value is an int (bank accounts)
>>   * 4 threads each doing 2 GETs and 2 PUTs to a random pair of keys
>>     (simulating a money transfer)
>>   * 1 thread doing 4 GETs every second and summing the values to check
>>     that the total is always the same (no money is lost or created)
>>
>> Under normal conditions, the checking thread should always see the same
>> total amount of money in the bank. I ran this test for 8 hours and no
>> inconsistency was ever reported.
>>
>> So I added an additional step, which is to randomly restart the Tephra
>> processes. Under these conditions, the checking thread will eventually
>> see an inconsistent state (money created or lost). It's pretty hard to
>> reproduce consistently, but it always eventually pops up when I run the
>> test for long enough.
>>
>> So now my question is how to figure out where the problem lies. One
>> thing I've noticed is that the Tephra leader sometimes fails to write
>> its snapshot to HDFS during shutdown. I'm not sure this is sufficient
>> to explain the problem (perhaps someone here can confirm?). The
>> exception looks like this[1]. There seems to be a race during shutdown
>> where the thread is interrupted before it has finished its work.
>>
>> Unfortunately, I can't share our tooling code nor the test itself, since
>> they rely on some internal code. So I'm wondering if someone can provide
>> guidance about what I can do to further investigate this problem. I
>> could rewrite the test against the Tephra APIs directly, but given that
>> the test works fine under normal conditions, I'm thinking this is more
>> likely a bug in Tephra itself.
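The invariant the checking thread verifies can be sketched in miniature. This is an in-memory stand-in, not the actual test: a plain dict replaces the HBase rows, and a lock plays the role of a Tephra transaction making the 2 GETs + 2 PUTs appear atomic to the checker; all names and amounts are illustrative.

```python
import random
import threading

NUM_ACCOUNTS = 4
INITIAL = 100
accounts = [INITIAL] * NUM_ACCOUNTS   # stand-in for the 4 HBase int values
lock = threading.Lock()               # stand-in for a transaction boundary

def transfer():
    # Move a random amount between a random pair of accounts.
    src, dst = random.sample(range(NUM_ACCOUNTS), 2)
    with lock:
        amount = random.randint(0, accounts[src])  # the "2 GETs"
        accounts[src] -= amount                    # the "2 PUTs"
        accounts[dst] += amount

def check_invariant():
    # The checker's view of all 4 values must sum to the initial total:
    # no money created, none lost.
    with lock:
        total = sum(accounts)
    assert total == NUM_ACCOUNTS * INITIAL, f"inconsistent total: {total}"

for _ in range(10_000):
    transfer()
    check_invariant()
```

If the transfer's reads and writes are not atomic with respect to the checker (drop the lock to see it), the assertion eventually fires, which is exactly the symptom observed under random Tephra restarts.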
>>
>> Cheers,
>> Philippe Laflamme
>>
>> [1] https://gist.github.com/plaflamme/25a47dce6edd920653a33e9fc612428a
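To make the snapshot/WAL question at the top of the thread concrete, here is a generic sketch of snapshot-plus-WAL recovery. This is a simplified illustration of the general technique, not Tephra's actual implementation: the `recover` function, the sequence numbers, and the key/value WAL entry format are all hypothetical.

```python
def recover(snapshot, wal):
    """Rebuild state from a (possibly stale) snapshot plus the WAL.

    snapshot: (state_dict, last_seq) -- state as of sequence number last_seq
    wal: list of (seq, key, value) entries, in sequence order
    """
    state, last_seq = snapshot
    state = dict(state)
    for seq, key, value in wal:
        if seq > last_seq:          # replay only entries newer than the snapshot
            state[key] = value
    return state

# Suppose the node crashes before writing a fresh snapshot on shutdown.
# The stale snapshot plus the WAL entries written since should still
# reconstruct the correct final state:
old_snapshot = ({"a": 1}, 1)
wal = [(1, "a", 1), (2, "b", 2), (3, "a", 5)]
assert recover(old_snapshot, wal) == {"a": 5, "b": 2}
```

Under this model, a failed snapshot write on shutdown is harmless as long as the WAL is durable and fully replayed on restart; an inconsistency would point to WAL entries being lost, truncated, or skipped during replay.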