hadoop-hdfs-issues mailing list archives

From "Manoj Govindassamy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-10819) BlockManager fails to store a good block for a datanode storage after it reported a corrupt block -- block replication stuck
Date Wed, 31 Aug 2016 02:01:22 GMT

     [ https://issues.apache.org/jira/browse/HDFS-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Manoj Govindassamy updated HDFS-10819:
--------------------------------------
    Description: 
TestDataNodeHotSwapVolumes occasionally fails in the unit test testRemoveVolumeBeingWrittenForDatanode.
A data write pipeline can run into problems such as timeouts or an unreachable datanode; in this
test case the failure is deliberately induced by removing one of a datanode's volumes while a block
write is in progress. Digging further into the logs shows that when the problem occurs in the write
pipeline, error recovery does not happen as expected, so block replication never catches up.

Though this problem has the same signature as HDFS-10780, the logs suggest the code paths taken
are entirely different, so the root cause could be different as well.


  was:
TestDataNodeHotSwapVolumes occasionally fails in the unit test testRemoveVolumeBeingWrittenForDatanode.
A data write pipeline can run into problems such as timeouts or an unreachable datanode; in this
test case the failure is deliberately induced by removing one of a datanode's volumes while a block
write is in progress. Digging further into the logs shows that when the problem occurs in the write
pipeline, error recovery does not happen as expected, so block replication never catches up.

{noformat}
Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 44.495 sec <<< FAILURE! - in org.apache.hadoop.hdfs.serv
testRemoveVolumeBeingWritten(org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes)  Time elapsed: 44.354 se
java.util.concurrent.TimeoutException: Timed out waiting for /test to reach 3 replicas

Results :

Tests in error:
  TestDataNodeHotSwapVolumes.testRemoveVolumeBeingWritten:637->testRemoveVolumeBeingWrittenForDatanode:714 » Timeout

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
{noformat}

The following exceptions are not expected in this test run:
{noformat}
2016-08-10 12:30:11,269 [DataXceiver for client DFSClient_NONMAPREDUCE_-640082112_10 at /127.0.0.1:58805 [Receiving block BP-1852988604-172.16.3.66-1470857409044:blk_1073741825_1001]] DEBUG datanode.DataNode (DataXceiver.java:run(320)) - 127.0.0.1:58789:Number of active connections is: 2
java.lang.IllegalMonitorStateException
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:280)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.removeVolumes(FsDatasetImpl.java:517)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:832)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.removeVolumes(DataNode.java:798)
{noformat}
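The IllegalMonitorStateException above is the JDK's standard complaint when Object.wait() is invoked by a thread that does not hold the monitor of the object it waits on, which suggests FsVolumeList.waitVolumeRemoved is waiting outside the required synchronized block. A minimal stand-alone sketch of the rule (MonitorDemo is illustrative, not Hadoop code):

```java
// Illustrative only: shows why Object.wait() throws
// IllegalMonitorStateException when the caller does not hold the monitor.
public class MonitorDemo {
    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();

        // Wrong: wait() without owning lock's monitor throws
        // IllegalMonitorStateException, the failure mode in the trace above.
        try {
            lock.wait(100);
        } catch (IllegalMonitorStateException e) {
            System.out.println("wait() without monitor: " + e.getClass().getSimpleName());
        }

        // Correct: acquire the monitor first, then wait (a timeout is used
        // here only so the demo terminates on its own).
        synchronized (lock) {
            lock.wait(100);
        }
        System.out.println("wait() inside synchronized block returned normally");
    }
}
```

The same rule applies to Object.notify() and Object.notifyAll(), so a fix for the removal path has to hold the FsVolumeList monitor around the whole wait/notify handshake.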

{noformat}
2016-08-10 12:30:11,287 [DataNode: [[[DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/, [DISK]file:/Users/manoj/work/ups-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data2/]] heartbeating to localhost/127.0.0.1:58788] ERROR datanode.DataNode (BPServiceActor.java:run(768)) - Exception in BPOfferService for Block pool BP-1852988604-172.16.3.66-1470857409044 (Datanode Uuid 711d58ad-919d-4350-af1e-99fa0b061244) service to localhost/127.0.0.1:58788
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1841)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:336)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:624)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:766)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
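The NullPointerException in getBlockReports is consistent with the block-report path dereferencing a per-storage entry that the concurrent volume-removal path has already dropped. A hedged sketch of that shape, with a defensive null check (BlockReportSketch and its names are hypothetical, not the actual FsDatasetImpl fields):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a report builder that looks up a per-storage entry
// which a concurrent volume removal may have already taken away.
public class BlockReportSketch {
    static String describe(Map<String, int[]> blocksPerStorage, String storageId) {
        // Without this null check, a removed storage would produce an NPE,
        // analogous to FsDatasetImpl.getBlockReports in the log above.
        int[] blocks = blocksPerStorage.get(storageId);
        if (blocks == null) {
            return storageId + ": storage removed, skipping";
        }
        return storageId + ": " + blocks.length + " blocks";
    }

    public static void main(String[] args) {
        Map<String, int[]> blocksPerStorage = new HashMap<>();
        blocksPerStorage.put("DS-1", new int[] {1001, 1002});
        blocksPerStorage.remove("DS-1"); // simulated hot-swap volume removal
        System.out.println(describe(blocksPerStorage, "DS-1"));
        // prints "DS-1: storage removed, skipping"
    }
}
```

Whether the real fix is a null check or stronger synchronization between removeVolumes and the block-report path is exactly what the logs above leave open.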




> BlockManager fails to store a good block for a datanode storage after it reported a corrupt block -- block replication stuck
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10819
>                 URL: https://issues.apache.org/jira/browse/HDFS-10819
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test testRemoveVolumeBeingWrittenForDatanode.
> A data write pipeline can run into problems such as timeouts or an unreachable datanode; in this
> test case the failure is deliberately induced by removing one of a datanode's volumes while a block
> write is in progress. Digging further into the logs shows that when the problem occurs in the write
> pipeline, error recovery does not happen as expected, so block replication never catches up.
> Though this problem has the same signature as HDFS-10780, the logs suggest the code paths taken
> are entirely different, so the root cause could be different as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

