hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11260) Slow writer threads are not stopped
Date Sun, 18 Dec 2016 08:04:58 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-11260:
    Environment: CDH5.8.0

> Slow writer threads are not stopped
> -----------------------------------
>                 Key: HDFS-11260
>                 URL: https://issues.apache.org/jira/browse/HDFS-11260
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.0
>         Environment: CDH5.8.0
>            Reporter: Wei-Chiu Chuang
> If a DataNode receives a transferred block, it tries to stop the writer to the same block. However, this may not work, and we saw the following error message and stacktrace.
> Fundamentally, the assumption in {{ReplicaInPipeline#stopWriter}} is wrong. It assumes the writer thread must be a DataXceiver thread, which can be interrupted and then terminates. However, an IPC handler thread may also be the writer thread (by calling initReplicaRecovery), and such a thread ignores the interrupt and does not terminate.
> {noformat}
> 2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Join on
writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
> org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:135)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2052)
> 2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException
in BlockReceiver constructor. Cause is
> 2016-12-16 19:58:56,168 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: sj1dra082.corp.adobe.com:50010:DataXceiver
error processing WRITE_BLOCK operation  src: / dst: /
> java.io.IOException: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main]
timed out
>         at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:212)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1579)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:195)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:669)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
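> For reference, the stop logic in {{ReplicaInPipeline#stopWriter}} is roughly the following (a paraphrased sketch, not the exact source). It interrupts the recorded writer thread and then joins it with a timeout; that works when the writer is a DataXceiver thread, but an IPC handler parked in {{CallQueueManager.take}} swallows the interrupt and keeps running, so the join times out and the exception above is thrown.
> {noformat}
> // Paraphrased sketch of ReplicaInPipeline#stopWriter, not the exact source.
> // "writer" is the thread recorded as currently writing to this replica.
> void stopWriter(long xceiverStopTimeout) throws IOException {
>   Thread writer = this.writer;
>   if (writer != null && writer != Thread.currentThread() && writer.isAlive()) {
>     writer.interrupt();                // a DataXceiver thread exits once interrupted
>     try {
>       writer.join(xceiverStopTimeout); // dfs.datanode.xceiver.stop.timeout.millis, 60s by default
>     } catch (InterruptedException e) {
>       throw new IOException("Waiting for writer thread is interrupted.");
>     }
>     if (writer.isAlive()) {
>       // An IPC handler thread (e.g. one that entered via initReplicaRecovery) ignores
>       // the interrupt and keeps serving calls, so we end up here.
>       throw new IOException("Join on writer thread " + writer + " timed out");
>     }
>   }
> }
> {noformat}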
> There is also a logic error in {{FsDatasetImpl#createTemporary}}: if the code in the synchronized block takes more than 60 seconds to execute (in theory), the method throws an exception without ever trying to stop the existing slow writer.
> We saw one {{FsDatasetImpl#createTemporary}} call fail after nearly 10 minutes, and it is not yet clear why. My understanding is that the code intends to stop slow writers after 1 minute by default. Some rewrite of this code is probably needed to get the logic right.
> {noformat}
> 2016-12-16 23:12:24,636 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Unable to stop existing writer for block BP-1527842723-
after 568320 miniseconds.
> {noformat}
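> The relevant control flow in {{FsDatasetImpl#createTemporary}} is roughly the following (a simplified sketch, not the exact source; {{createTemporaryReplica}} is a placeholder for the real replica-creation code). Note that the deadline is checked before {{stopWriter}} is ever called, so one slow pass through the synchronized block is enough to fail without attempting to stop the writer.
> {noformat}
> // Simplified sketch of the FsDatasetImpl#createTemporary loop, not the exact source.
> // createTemporaryReplica() is a placeholder for the code that actually creates the replica.
> ReplicaHandler createTemporary(StorageType storageType, ExtendedBlock b) throws IOException {
>   long startTimeMs = Time.monotonicNow();
>   long writerStopTimeoutMs = datanode.getDnConf().getXceiverStopTimeout(); // 60s by default
>   ReplicaInfo lastFoundReplicaInfo = null;
>   do {
>     synchronized (this) {
>       ReplicaInfo current = volumeMap.get(b.getBlockPoolId(), b.getBlockId());
>       if (current == lastFoundReplicaInfo) {
>         // No conflicting writer (or the old one is gone): create the replica and return.
>         return createTemporaryReplica(storageType, b);
>       }
>       lastFoundReplicaInfo = current;
>     }
>     // Logic error: the deadline is checked *before* stopWriter, so if the synchronized
>     // block above alone took longer than writerStopTimeoutMs, we give up without ever
>     // having tried to stop the existing writer.
>     long writerStopMs = Time.monotonicNow() - startTimeMs;
>     if (writerStopMs > writerStopTimeoutMs) {
>       throw new IOException("Unable to stop existing writer for block " + b
>           + " after " + writerStopMs + " miniseconds.");
>     }
>     // Each iteration may then block for up to another writerStopTimeoutMs here.
>     ((ReplicaInPipeline) lastFoundReplicaInfo).stopWriter(writerStopTimeoutMs);
>   } while (true);
> }
> {noformat}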

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
