tephra-dev mailing list archives

From Philippe Laflamme <phili...@hopper.com>
Subject Re: Investigating inconsistencies
Date Tue, 25 Dec 2018 16:41:34 GMT
I just noticed the BalanceBooks example which is basically the same test I
just described. I'll use this to replicate the issue.

Philippe

On Tue, Dec 25, 2018 at 10:36 AM Philippe Laflamme <philippe@hopper.com>
wrote:

> Hi,
>
> I'm evaluating Tephra and have encountered an issue. I'm looking for
> insights into what the next steps could be to determine whether this is a
> configuration issue, a bug in our tooling, a Tephra bug, or something else.
>
> Here's the test I'm running:
> * 3 Vagrant VMs on the same host
> * Tephra 0.15.0-incubating compiled against CDH 5.11.0 (all tests
> succeeded)
> * HDFS is configured in HA
> * Tephra running in HA with 2 instances
> * The workload is as follows (bank simulation):
>   * 4 HBase keys where the value is an int (bank accounts)
>   * 4 threads doing 2 GETs and 2 PUTs to a random pair of keys (simulating
> a money transfer)
>   * 1 thread continually, every second, doing 4 GETs and summing to check
> the total is always consistently the same (no money is lost nor created)
>
> Under normal conditions, the checking thread should always see the same
> total amount of money in the bank. I ran this test for 8 hours and no
> inconsistency was ever reported.
>
> So I added an additional test, which is to randomly restart the Tephra
> processes. Under these conditions, the checking thread will eventually see
> an inconsistent state (money created or lost). It's pretty hard to recreate
> consistently, but it always eventually pops up whenever I run the test for
> long enough.
>
> So now my question is how to figure out where the problem lies. One thing
> I've noticed is that sometimes the Tephra leader fails to write its
> snapshot to HDFS during shutdown. I'm not sure this is sufficient to
> explain the problem (perhaps someone here can confirm?). The exception
> looks like this [1]. There seems to be a race during shutdown where the
> thread is interrupted before it has finished doing its work.
>
> Unfortunately, I can't share our tooling code nor the test itself since
> they rely on some internal code. So I'm wondering if someone can provide
> guidance about what I can do to further help investigate this problem. I
> could rewrite the test against Tephra APIs directly, but given that the
> test works fine under normal conditions, I think this is more likely a
> bug in Tephra itself.
>
> Cheers,
> Philippe Laflamme
> [1] https://gist.github.com/plaflamme/25a47dce6edd920653a33e9fc612428a
>
>
>
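
For readers who want to reproduce the shape of the workload described in the
quoted message, here is a minimal in-memory sketch of the bank invariant
check. It does not use Tephra or HBase: the `Bank` class, `run_simulation`,
and all parameters are hypothetical names chosen for illustration, and a
plain lock stands in for transactional isolation. The checker also polls
every millisecond rather than every second, only so the sketch finishes
quickly.

```python
import random
import threading


class Bank:
    """Hypothetical in-memory stand-in for the transactional store (not Tephra)."""

    def __init__(self, accounts, initial_balance):
        self._balances = {key: initial_balance for key in range(accounts)}
        # Assumption: a single lock plays the role of transaction isolation.
        self._lock = threading.Lock()

    def transfer(self, src, dst, amount):
        # Two reads and two writes applied as one atomic unit (a money transfer).
        with self._lock:
            src_balance = self._balances[src]
            dst_balance = self._balances[dst]
            self._balances[src] = src_balance - amount
            self._balances[dst] = dst_balance + amount

    def total(self):
        # Read every account in one atomic unit and sum the balances.
        with self._lock:
            return sum(self._balances.values())


def run_simulation(accounts=4, writers=4, transfers_per_writer=1000,
                   initial_balance=100, seed=0):
    bank = Bank(accounts, initial_balance)
    expected = accounts * initial_balance
    observed = []
    done = threading.Event()

    def writer(worker_id):
        rng = random.Random(seed + worker_id)
        for _ in range(transfers_per_writer):
            # Pick a random pair of distinct accounts and move money between them.
            src, dst = rng.sample(range(accounts), 2)
            bank.transfer(src, dst, rng.randint(1, 10))

    def checker():
        # Repeatedly record the total until the writers are finished.
        while not done.is_set():
            observed.append(bank.total())
            done.wait(0.001)
        observed.append(bank.total())

    writer_threads = [threading.Thread(target=writer, args=(i,))
                      for i in range(writers)]
    checker_thread = threading.Thread(target=checker)
    checker_thread.start()
    for thread in writer_threads:
        thread.start()
    for thread in writer_threads:
        thread.join()
    done.set()
    checker_thread.join()

    # Any observed total that differs from the expected one is a violation.
    violations = [total for total in observed if total != expected]
    return expected, observed, violations
```

With correct isolation, `violations` stays empty no matter how the threads
interleave. In the real test, the interesting case is whether that still
holds while the transaction service is being restarted.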
