hadoop-hdfs-issues mailing list archives

From "liaoyuxiangqin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently
Date Tue, 27 Mar 2018 02:51:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413600#comment-16413600 ]

liaoyuxiangqin edited comment on HDFS-13179 at 3/27/18 2:50 AM:
----------------------------------------------------------------

[~gabor.bota] [~bharatviswa] [~elgoiri] [~xiaochen], I think the failure happens because the
correct replica recovered from lazypersist is replaced by an incorrect replica read from the
cache file when the datanode restarts. A comparison of the successful and the failed test
logs follows:
{quote}success:
 15:05:43,550 INFO BlockPoolSlice - Successfully read replica from cache file : /tmp/run_tha_testU50PcZ/target/test/data/dfs/data/data1/current/BP-1756767659-172.17.0.1-1517497537101/current/replicas
 15:05:43,568 INFO FsDatasetImpl - Recovered 1 replicas from /tmp/run_tha_testU50PcZ/target/test/data/dfs/data/data2/current/BP-1756767659-172.17.0.1-1517497537101/current/lazypersist
{quote}
{quote}failed:
 15:07:17,309 INFO FsDatasetImpl - Recovered 1 replicas from /tmp/run_tha_testvb5u64/target/test/data/dfs/data/data2/current/BP-381858266-172.17.0.1-1517497631100/current/lazypersist
 15:07:17,310 INFO FsDatasetImpl - Time to add replicas to map for block pool BP-381858266-172.17.0.1-1517497631100
on volume /tmp/run_tha_testvb5u64/target/test/data/dfs/data/data2: 2ms
 15:07:17,311 INFO BlockPoolSlice - Successfully read replica from cache file : /tmp/run_tha_testvb5u64/target/test/data/dfs/data/data1/current/BP-381858266-172.17.0.1-1517497631100/current/replicas
{quote}
From the above test logs we can see that in the successful run the replica is read from the
cache file before the replica is recovered from lazypersist, whereas in the failed run the
order is reversed. The cause is that getAllVolumesMap() populates the block pool's replica
map asynchronously, dispatching one thread per volume when the datanode restarts; the threads'
processing speed can differ from run to run, so there is no guarantee which volume finishes
first.
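Below is a minimal, hypothetical sketch (not the actual FsVolumeList/BlockPoolSlice code) of
why the per-volume scanner threads can publish the two replicas into the shared map in either
order:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VolumeScanRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    // Shared replica map keyed by block id; both volumes report the same block,
    // one entry read from the cache file and one recovered from lazypersist.
    Map<Long, String> replicaMap = new ConcurrentHashMap<>();

    List<Thread> volumeScanners = new ArrayList<>();
    // data1: reads the stale replica back from the persisted cache file.
    volumeScanners.add(new Thread(() ->
        replicaMap.put(1L, "replica read from cache file (data1)")));
    // data2: recovers the correct replica from the lazypersist directory.
    volumeScanners.add(new Thread(() ->
        replicaMap.put(1L, "replica recovered from lazypersist (data2)")));

    for (Thread t : volumeScanners) {
      t.start();
    }
    for (Thread t : volumeScanners) {
      t.join();
    }
    // Whichever thread writes last wins; there is no ordering guarantee,
    // which is why the test only fails intermittently.
    System.out.println("Final entry for blk_1: " + replicaMap.get(1L));
  }
}
{code}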

Finally, I have a question: what is the purpose of the cache file? I think transient storage
does not need to write replicas to the cache file when the volume is shut down.
 Thanks!



> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13179
>                 URL: https://issues.apache.org/jira/browse/HDFS-13179
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 3.0.0
>            Reporter: Gabor Bota
>            Priority: Critical
>         Attachments: test runs.zip
>
>
> The error is caused by a TimeoutException: the test waits to ensure that the file is
replicated to DISK storage, but the replication to DISK does not finish within the 30s
timeout in ensureFileReplicasOnStorageType(). The file is still on RAM_DISK, so there
is no data loss.
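> A simplified, self-contained sketch of the polling pattern behind that timeout (the names
are illustrative and this is not the actual LazyPersistTestCase/GenericTestUtils code) looks
like this:
> {code:java}
> import java.util.concurrent.TimeoutException;
> import java.util.function.Supplier;
>
> public class WaitForSketch {
>   /** Poll the check until it returns true or the timeout elapses. */
>   static void waitFor(Supplier<Boolean> check, long checkEveryMillis,
>       long waitForMillis) throws TimeoutException, InterruptedException {
>     long deadline = System.currentTimeMillis() + waitForMillis;
>     while (!check.get()) {
>       if (System.currentTimeMillis() > deadline) {
>         // This is the TimeoutException the test reports when the replica
>         // has not moved to DISK within the configured window.
>         throw new TimeoutException("Timed out waiting for condition.");
>       }
>       Thread.sleep(checkEveryMillis);
>     }
>   }
>
>   public static void main(String[] args) throws Exception {
>     try {
>       // Hypothetical check standing in for "file replicas are on DISK".
>       waitFor(() -> false, 100, 1_000);
>     } catch (TimeoutException e) {
>       System.out.println("Got: " + e.getMessage());
>     }
>   }
> }
> {code}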
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially fixes the
flakiness. 
> {code:java}
>     try {
>       ensureFileReplicasOnStorageType(path1, DEFAULT);
>     } catch (TimeoutException t) {
>       LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on RAM_DISK");
>       ensureFileReplicasOnStorageType(path1, RAM_DISK);
>     }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similarly up to the point when the datanode restarts. The
restart line in the log is the following: LazyPersistTestCase - Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: Redundant addStoredBlock
request received for blk_1073741825_1001 on node 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main reason for
the test failure. Something wrong with the gen stamp? Corrupt replicas?
> =============================
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. Thread diagnostics:
Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] java.net.BindException:
Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
Caused by: java.net.BindException: Address already in use at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =============================
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
