Date: Thu, 22 Feb 2018 22:38:00 +0000 (UTC)
From: "Gabor Bota (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

     [ https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Bota updated HDFS-13179:
------------------------------

    Description: 

The error is a TimeoutException: the test waits to ensure that the file has been replicated to DISK storage, but the replication to DISK does not finish within the 30 s timeout in ensureFileReplicasOnStorageType(). The file is still on RAM_DISK, so there is no data loss. Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially fixes the flakiness.
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
} catch (TimeoutException t) {
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
{code}

Some thoughts:
* Successful and failed runs look the same up to the point where the datanode restarts. The restart line in the log is: LazyPersistTestCase - Restarting the DataNode
* There is a line which only occurs in the failed test: *addStoredBlock: Redundant addStoredBlock request received for blk_1073741825_1001 on node 127.0.0.1:49455 size 5242880*
* This redundant request at BlockManager#addStoredBlock could be the main reason for the test failure. Something wrong with the gen stamp? Corrupt replicas?

=============================
Current fail ratio based on my runs of TestLazyPersistReplicaRecovery:
1000 runs, 34 failures (3.4% failure rate)

Failure rate analysis:
TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%

33 failures caused by:
{noformat}
java.util.concurrent.TimeoutException: Timed out waiting for condition.
Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 on 39589"
{noformat}

1 failure caused by:
{noformat}
java.net.BindException: Problem binding to [localhost:56729] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
Caused by: java.net.BindException: Address already in use
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
{noformat}

=============================
Example stacktrace:
{noformat}
Timed out waiting for condition.
Thread diagnostics:
Timestamp: 2017-11-01 10:36:49,499

"Thread-1" prio=5 tid=13 runnable
java.lang.Thread.State: RUNNABLE
        at java.lang.Thread.dumpThreads(Native Method)
        at java.lang.Thread.getAllStackTraces(Thread.java:1610)
        at org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
        at org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
        at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        ...
{noformat}
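For context, the TimeoutException in the stacktrace above comes out of the polling that LazyPersistTestCase#ensureFileReplicasOnStorageType does through GenericTestUtils.waitFor. Below is a minimal, self-contained sketch of that mechanism and of how the proposed try/catch workaround behaves against it. It is an illustration only, not the actual Hadoop test code: the simplified waitFor, the string storage types, and the currentStorageType supplier are assumptions made for the example.

{code:java}
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class StorageTypeWaitSketch {

  /** Polls {@code check} until it returns true or {@code waitForMillis} elapses. */
  static void waitFor(Supplier<Boolean> check, int checkEveryMillis, int waitForMillis)
      throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + waitForMillis;
    while (!check.get()) {
      if (System.currentTimeMillis() > deadline) {
        throw new TimeoutException("Timed out waiting for condition.");
      }
      Thread.sleep(checkEveryMillis);
    }
  }

  /**
   * Stand-in for LazyPersistTestCase#ensureFileReplicasOnStorageType: keeps polling
   * until the replica reports the expected storage type. In the flaky runs the
   * replica never moves from RAM_DISK to DISK within the 30 s window, so this is
   * where the TimeoutException is thrown.
   */
  static void ensureFileReplicasOnStorageType(String path, String expectedStorageType,
      Supplier<String> currentStorageType) throws TimeoutException, InterruptedException {
    waitFor(() -> expectedStorageType.equals(currentStorageType.get()),
        100 /* check every 100 ms */, 30_000 /* 30 s overall timeout */);
  }

  public static void main(String[] args) throws Exception {
    // Simulated replica that stays on RAM_DISK, as in the failing runs.
    Supplier<String> replicaStorage = () -> "RAM_DISK";
    try {
      // Polls for the full 30 s before giving up, mirroring the flaky test.
      ensureFileReplicasOnStorageType("/path1", "DEFAULT", replicaStorage);
    } catch (TimeoutException t) {
      // The proposed workaround: fall back to checking RAM_DISK, since the data
      // is still there and nothing was lost.
      ensureFileReplicasOnStorageType("/path1", "RAM_DISK", replicaStorage);
      System.out.println("Replica still on RAM_DISK; no data loss, only a slow lazy-persist.");
    }
  }
}
{code}

In the flaky runs the replica simply has not been lazily persisted to DISK yet when the 30 s window closes, which is why falling back to a RAM_DISK check is enough to show that no data was lost.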
> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13179
>                 URL: https://issues.apache.org/jira/browse/HDFS-13179
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Gabor Bota
>            Priority: Critical
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org