tephra-dev mailing list archives

From Philippe Laflamme <phili...@hopper.com>
Subject Re: Investigating inconsistencies
Date Tue, 25 Dec 2018 19:51:28 GMT
After modifying the test to verify the balance every second[1], I was able
to reproduce the issue. The logs are available here[2]:

* tephra-service-hbase-master.log : log of the tephra coordinator on the
"master" node.
* tephra-service-hbase-secondary.log : log of the tephra coordinator on the
"secondary" node.
* test-balancebooks.log : log of the test (running on the "slave" node)

In the master node log, we can see the same issue previously described[3]
which corresponds roughly to when the client sees the inconsistency.

To confirm this, perhaps we could simply inject a failure into the snapshot
thread so that it fails to write the snapshot? But it seems that, regardless
of this failure, the WAL should be replayed to produce the correct state.
Perhaps I'm missing something about how the snapshot and the WAL interact?
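
To make the question concrete, here is a minimal sketch of the recovery model I have in mind. It is a generic snapshot-plus-WAL model, not Tephra's actual code: `Snapshot`, `apply_edit` and `recover` are hypothetical names, and the state is just a dict. Under this model, losing the most recent snapshot should not matter as long as every edit after the older snapshot is still in the WAL.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical snapshot: the state as of WAL sequence number `seq`."""
    seq: int
    state: dict = field(default_factory=dict)


def apply_edit(state, edit):
    # An edit is modeled as a (key, value) overwrite. This is an assumption
    # made for illustration only.
    key, value = edit
    state[key] = value


def recover(snapshot, wal):
    """Rebuild state from a snapshot plus every WAL edit logged after it."""
    state = dict(snapshot.state)
    for seq, edit in wal:
        if seq > snapshot.seq:
            apply_edit(state, edit)
    return state


# WAL entries are (sequence number, edit) pairs.
wal = [(1, ("a", 10)), (2, ("b", 20)), (3, ("a", 5)), (4, ("b", 25))]

# The newest snapshot (seq=4) is the one that failed to be written, so
# recovery has to fall back to the older one (seq=2).
new_snapshot = Snapshot(seq=4, state={"a": 5, "b": 25})
old_snapshot = Snapshot(seq=2, state={"a": 10, "b": 20})

state_from_new = recover(new_snapshot, wal)
state_from_old = recover(old_snapshot, wal)

# If the tail of the WAL is missing, the fallback no longer converges.
state_from_truncated_wal = recover(old_snapshot, wal[:3])
```

In this model both recoveries agree only when the WAL is complete after the older snapshot, so a failed snapshot write alone would not explain the inconsistency; a missing or pruned WAL segment would.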

[2] https://gist.github.com/plaflamme/238a6539da9da1ac3a2e313e05ee82eb

On Tue, Dec 25, 2018 at 11:41 AM Philippe Laflamme <philippe@hopper.com>
wrote:

> I just noticed the BalanceBooks example which is basically the same test I
> just described. I'll use this to replicate the issue.
> Philippe
> On Tue, Dec 25, 2018 at 10:36 AM Philippe Laflamme <philippe@hopper.com>
> wrote:
>> Hi,
>> I'm evaluating Tephra and have encountered an issue. I'm looking for
>> insights into what the next steps could be to determine whether this is a
>> configuration issue, a bug in our tooling, a Tephra bug or something else.
>> Here's the test I'm running:
>> * 3 Vagrant VMs on the same host
>> * Tephra 0.15.0-incubating compiled against CDH 5.11.0 (all tests
>> succeeded)
>> * HDFS is configured in HA
>> * Tephra running in HA with 2 instances
>> * The workload is as follows (bank simulation):
>>   * 4 HBase keys where the value is an int (bank accounts)
>>   * 4 threads doing 2 GETs and 2 PUTs to a random pair of keys
>> (simulating a money transfer)
>>   * 1 thread that, every second, does 4 GETs and sums the values to check
>> that the total stays the same (no money is lost or created)
>> Under normal conditions, the checking thread should always see the same
>> total amount of money in the bank. I ran this test for 8 hours and no
>> inconsistency was ever reported.
>> So I added an additional test, which is to randomly restart the Tephra
>> processes. Under these conditions, the checking thread will eventually see
>> an inconsistent state (money created or lost). It's pretty hard to recreate
>> consistently, but it always eventually pops up whenever I run the test for
>> long enough.
>> So now my question is how to figure out where the problem lies. One thing
>> I've noticed is that sometimes the Tephra leader fails to write its
>> snapshot to HDFS during shutdown. I'm not sure this is sufficient to
>> explain the problem (perhaps someone here can confirm?) The exception looks
>> like this[1]. There seems to be a race during shutdown where the thread is
>> interrupted before it's finished doing its work.
>> Unfortunately, I can't share our tooling code nor the test itself since
>> they rely on some internal code. So I'm wondering if someone can provide
>> guidance about what I can do to further help investigate this problem. I
>> could rewrite the test against Tephra APIs directly, but given that the
>> test works fine under normal conditions, I think this is more likely a
>> bug in Tephra itself.
>> Cheers,
>> Philippe Laflamme
>> [1] https://gist.github.com/plaflamme/25a47dce6edd920653a33e9fc612428a
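
For readers without access to the test, the bank-simulation workload quoted above can be sketched roughly as follows. This is an illustrative stand-in only: an in-memory dict replaces the HBase table, a lock replaces transactional isolation, and all names (`accounts`, `transfer`, `check_total`, `run_simulation`) are hypothetical. It does not use Tephra or HBase APIs, and the checker polls far more often than once per second to keep the run short.

```python
import random
import threading
import time

NUM_ACCOUNTS = 4
INITIAL_BALANCE = 100
EXPECTED_TOTAL = NUM_ACCOUNTS * INITIAL_BALANCE

# In-memory stand-in for the 4 HBase keys holding integer balances.
accounts = {f"acct-{i}": INITIAL_BALANCE for i in range(NUM_ACCOUNTS)}
# Stand-in for transactional isolation (an assumption of this sketch).
txn_lock = threading.Lock()


def transfer(rng):
    """Move a random amount between a random pair: 2 GETs and 2 PUTs."""
    src, dst = rng.sample(sorted(accounts), 2)
    amount = rng.randint(1, 10)
    with txn_lock:
        src_balance = accounts[src]             # GET
        dst_balance = accounts[dst]             # GET
        accounts[src] = src_balance - amount    # PUT
        accounts[dst] = dst_balance + amount    # PUT


def check_total():
    """Read all 4 balances in one 'transaction' and sum them."""
    with txn_lock:
        return sum(accounts[key] for key in sorted(accounts))


def run_transfers(seed, iterations):
    rng = random.Random(seed)
    for _ in range(iterations):
        transfer(rng)


def run_checker(stop, observed):
    while not stop.is_set():
        observed.append(check_total())
        time.sleep(0.001)
    observed.append(check_total())  # one final check after transfers finish


def run_simulation(iterations=1000):
    stop = threading.Event()
    observed = []
    checker = threading.Thread(target=run_checker, args=(stop, observed))
    workers = [
        threading.Thread(target=run_transfers, args=(seed, iterations))
        for seed in range(4)
    ]
    checker.start()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    stop.set()
    checker.join()
    return observed


observed_totals = run_simulation()
```

With correct isolation every observed total equals `EXPECTED_TOTAL`; any other value is the "money created or lost" symptom described in the quoted message.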
