hbase-issues mailing list archives

From "Konstantin Ryakhovskiy (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-14422) Fix TestFastFailWithoutTestUtil
Date Tue, 28 Jun 2016 07:28:57 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352383#comment-15352383
] 

Konstantin Ryakhovskiy edited comment on HBASE-14422 at 6/28/16 7:28 AM:
-------------------------------------------------------------------------

I checked out master, reverted commit e4bf77e2de54ab6ea17b95dc116af9abf24a332d, modified one
line to allow the code to compile.

Thread 1 (T-1) is in the retry-mode
Thread 2 (T-2) is in the fast-fail mode.

When T-2 is in fast-fail mode, it increments the counter "done", so at some point T-1
should stop calling latch.await():
if (done.get() <= 1) 
  latches2[priviRetryCounter.get()].await();

T-2 increments the counter only when it is in fast-fail mode:
boolean pffe = false;
if (!isPriviThreadLocal.get().get()) 
  pffe = !((FastFailInterceptorContext)context).isRetryDespiteFastFailMode();
...
if (!isPriviThreadLocal.get().get()) {
  if (pffe) done.incrementAndGet();

The problem is in PreemptiveFastFailInterceptor#inFastFailMode():
return (fInfo != null && 
  EnvironmentEdgeManager.currentTime() >
  (fInfo.timeOfFirstFailureMilliSec + this.fastFailThresholdMilliSec));

With some unlikely timing, T-2 is in retry mode instead of fast-fail mode and the counter
"done" is not incremented: context.isRetryDespiteFastFailMode() returns true for T-2, which
should never happen.
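
To make the timing dependence concrete, here is a minimal standalone sketch of the comparison
quoted above. It is not the HBase code: the class name FastFailTimingSketch and the explicit
nowMs parameter are assumptions for illustration, standing in for
EnvironmentEdgeManager.currentTime() and the fInfo fields. It only shows that a thread which
evaluates the check at or before first-failure time plus threshold is still treated as being
in retry mode.

```java
// Hypothetical stand-in for the check in PreemptiveFastFailInterceptor#inFastFailMode().
// The clock value is passed in explicitly so the outcome does not depend on real time.
class FastFailTimingSketch {

  // Mirrors the quoted expression: a failure must be recorded AND the current time
  // must be strictly later than (time of first failure + threshold).
  static boolean inFastFailMode(Long timeOfFirstFailureMs, long nowMs, long thresholdMs) {
    return timeOfFirstFailureMs != null
        && nowMs > timeOfFirstFailureMs + thresholdMs;
  }

  public static void main(String[] args) {
    long firstFailure = 1_000L; // assumed timestamps in milliseconds
    long threshold = 100L;      // assumed value of the fast-fail threshold

    // Checked before the threshold has elapsed: not in fast-fail mode yet.
    System.out.println(inFastFailMode(firstFailure, 1_050L, threshold));
    // Checked exactly at the boundary: still not in fast-fail mode (strict '>').
    System.out.println(inFastFailMode(firstFailure, 1_100L, threshold));
    // Checked one millisecond later: now in fast-fail mode.
    System.out.println(inFastFailMode(firstFailure, 1_101L, threshold));
    // No failure recorded at all: never in fast-fail mode.
    System.out.println(inFastFailMode(null, 2_000L, threshold));
  }
}
```

Under these assumptions, whether T-2 lands in fast-fail mode depends only on when it reaches
the check relative to the threshold, which is consistent with the hang being intermittent.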

Can I just remove the check before incrementing the "done" counter,
if (pffe) ... ?
Increasing PAUSE_TIME might not help: it would lower the probability of the heisenbug but
would not eliminate it.
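
The effect of the guard asked about above can be sketched with a deliberately simplified,
single-threaded model. Everything here is an assumption for illustration: the class
DoneCounterSketch, the method wouldBlock, the two simulated T-2 calls, and the single latch
that is never counted down are not taken from the test, which uses an array of latches and
real threads. The sketch only models the two quoted conditions, "if (pffe) done.incrementAndGet()"
and "if (done.get() <= 1) ... await()". It does not establish that removing the guard is a
correct fix for the real test.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the "done" counter handshake; not the actual test code.
class DoneCounterSketch {

  // Returns true if the modeled T-1 ends up waiting on a latch nobody releases.
  // pffe: whether T-2 was considered to be in fast-fail mode.
  // guardIncrement: true keeps the "if (pffe)" guard, false increments unconditionally.
  static boolean wouldBlock(boolean pffe, boolean guardIncrement) {
    AtomicInteger done = new AtomicInteger();
    CountDownLatch latch = new CountDownLatch(1); // assumed: never counted down here

    // T-2's side, modeled as two intercepted calls.
    for (int i = 0; i < 2; i++) {
      if (!guardIncrement || pffe) {
        done.incrementAndGet();
      }
    }

    // T-1's side: it only waits while the counter is still low.
    if (done.get() <= 1) {
      try {
        // A timed await stands in for the indefinite hang so the sketch terminates.
        return !latch.await(50, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(wouldBlock(true, true));   // fast-fail mode, guard kept
    System.out.println(wouldBlock(false, true));  // retry mode, guard kept
    System.out.println(wouldBlock(false, false)); // retry mode, guard removed
  }
}
```

In this model only the second case waits out the timeout, which matches the described symptom:
with the guard in place, a T-2 that is unexpectedly in retry mode never raises "done".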


was (Author: ryakhovskiy.k):
I checked out master, reverted commit e4bf77e2de54ab6ea17b95dc116af9abf24a332d, modified one
line to allow the code to compile.

Thread 1 (T-1) is in the retry-mode
Thread 2 (T-2) is in the fast-fail mode.

When T-2 is in fast-fail mode, it increments the counter "done", so at some point T-1
should stop calling latch.await():
if (done.get() <= 1) 
  latches2[priviRetryCounter.get()].await();

T-2 increments the counter only when it is in fast-fail mode:
boolean pffe = false;
if (!isPriviThreadLocal.get().get()) 
  pffe = !((FastFailInterceptorContext)context).isRetryDespiteFastFailMode();
...
if (!isPriviThreadLocal.get().get()) {
  if (pffe) done.incrementAndGet();

The problem is in PreemptiveFastFailInterceptor#inFastFailMode():
return (fInfo != null && 
  EnvironmentEdgeManager.currentTime() >
  (fInfo.timeOfFirstFailureMilliSec + this.fastFailThresholdMilliSec));

With some unlikely timing, T-2 is in retry mode instead of fast-fail mode and the counter
"done" is not incremented: context.isRetryDespiteFastFailMode() returns true for T-2, which
should never happen.

Can I just remove the check before incrementing the "done" counter,
if (pffe) ... ?
Decreasing fastFailThresholdMilliSec might not help: it would lower the probability of the
heisenbug but would not eliminate it.

> Fix TestFastFailWithoutTestUtil
> -------------------------------
>
>                 Key: HBASE-14422
>                 URL: https://issues.apache.org/jira/browse/HBASE-14422
>             Project: HBase
>          Issue Type: Task
>          Components: test
>            Reporter: stack
>            Priority: Minor
>              Labels: beginner
>
> TestFastFailWithoutTestUtil has a unit test, testInterceptorIntercept50Times. Usually it
> passes, but on occasion the latching between thread 1 and thread 2 goes awry and the test
> hangs. It depends on the hardware, but it seems to happen about one in four runs here on an
> internal rig.
> HBASE-14421 changed the wait-on-latch to time out, do a thread dump, and let the test
> keep going.
> This issue is about digging in to figure out why the latches hang, and then fixing it so
> the test doesn't need the latch timeout. Hopefully the thread dump helps.
> This one could be hard to fix since it is not easy to reproduce. Marking it beginner anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
