hadoop-hdfs-issues mailing list archives

From "Frode Halvorsen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7212) Huge number of BLOCKED threads rendering DataNodes useless
Date Fri, 20 Mar 2015 19:29:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371926#comment-14371926 ]

Frode Halvorsen commented on HDFS-7212:

No, this is not the same. When I earlier saw the number of connections exceed
the maximum, I raised the maximum.
My issue is the same as the one in this bug entry.

My datanodes run fine with 70-80 threads, then suddenly one node with a lot of blocks just
stops writing the received blocks, and the threads keep hanging in the receiver. The
threads then accumulate until I have at least 600 blocked threads.
I get one more line like this for each thread:
2015-03-20 20:15:56,102 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-874555352-
src: / dest: /
But it never reports the block as received, as it normally does.
Then I can go in via MissionControl and look at the thread graph, and find it rising massively.
After 650 seconds the namenode declares the datanode dead, but it actually is not. I
can stop/start metrics (via MBeans), and sometimes the datanode just flushes (kills) all blocked
threads and reconnects to the namenode. Many times, however, I have to restart the datanode.
It spends a good half hour on the step where it adds the blocks to the pool, and when it reconnects
to the namenode, they first of all clean up the over-replicated blocks. The namenode, of
course, stops all other processing when the datanode 'arrives', so any process adding files
to the cluster is put 'on hold' by the namenode.
Very often, during the cleanup with one datanode, another one starts the same process: it begins
starting receive threads and piles up a few hundred of them in blocked state.
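
For anyone who wants a number rather than a graph, here is a minimal sketch of how the blocked-thread count could be read from inside a JVM with the standard ThreadMXBean. The class and method names (BlockedThreadCount, blockedByLock) are made up for illustration and are not part of Hadoop. The sketch inspects the JVM it runs in; pointing it at a running datanode would additionally need a remote JMX connection, which is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

public class BlockedThreadCount {

    /** Counts threads in BLOCKED state, grouped by the lock they are waiting to acquire. */
    static Map<String, Integer> blockedByLock(ThreadMXBean mx) {
        Map<String, Integer> counts = new TreeMap<>();
        // false, false: we do not need the lists of locked monitors/synchronizers
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                // getLockName() names the monitor the thread is blocked on
                String lock = String.valueOf(info.getLockName());
                counts.merge(lock, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        blockedByLock(mx).forEach((lock, n) ->
                System.out.println(n + " thread(s) blocked on " + lock));
    }
}
```

If nearly all blocked threads report the same lock name, that points at a single contended monitor rather than a general thread-limit problem.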

My stack trace (on the blocked thread) looks like this:
DataXceiver for client  at / [Receiving block BP-874555352-]
[51396] (BLOCKED)
   org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary line:
   org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary line:
   org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init> line: 179 
   org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock line: 615 
   org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock line: 137 
   org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp line: 74 
   org.apache.hadoop.hdfs.server.datanode.DataXceiver.run line: 235 
   java.lang.Thread.run line: 745 

And just now, my datanode has approximately 700 of those threads.
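
In Java, a thread in the BLOCKED state is waiting to enter a synchronized block or method, so the trace above suggests that createTemporary is guarded by a monitor that some other thread is holding for a long time. The following is only a sketch of that pattern under this assumption. The Dataset class and its methods are hypothetical stand-ins, not the FsDatasetImpl code: one slow holder keeps the monitor, and every new receiver thread piles up behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class MonitorPileUp {

    // Hypothetical stand-in for an object whose methods share one monitor.
    static class Dataset {
        // Simulates a slow operation that holds the monitor until released.
        synchronized void slowOperation(CountDownLatch entered, CountDownLatch release) {
            entered.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        // Stands in for the call each receiver thread makes; needs the same monitor.
        synchronized void createTemporary() { }
    }

    /** Starts the given number of receivers behind a slow holder; returns how many were BLOCKED. */
    static int pileUp(int receivers) throws InterruptedException {
        Dataset ds = new Dataset();
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> ds.slowOperation(entered, release), "slow-holder");
        holder.start();
        entered.await(); // the holder now owns the monitor

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < receivers; i++) {
            Thread t = new Thread(ds::createTemporary, "receiver-" + i);
            t.start();
            threads.add(t);
        }

        // Poll until every receiver is stuck waiting for the monitor.
        int blocked = 0;
        while (blocked < receivers) {
            blocked = 0;
            for (Thread t : threads) {
                if (t.getState() == Thread.State.BLOCKED) {
                    blocked++;
                }
            }
            Thread.sleep(10);
        }

        release.countDown(); // let the holder finish; receivers then drain one by one
        holder.join();
        for (Thread t : threads) {
            t.join();
        }
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("BLOCKED receivers while the monitor was held: " + pileUp(50));
    }
}
```

The receivers only make progress once the holder leaves the monitor, which would match threads accumulating until something releases or kills the holder.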

> Huge number of BLOCKED threads rendering DataNodes useless
> ----------------------------------------------------------
>                 Key: HDFS-7212
>                 URL: https://issues.apache.org/jira/browse/HDFS-7212
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.4.0
>         Environment: PROD
>            Reporter: Istvan Szukacs
> There are 3000 - 8000 threads in each datanode JVM, blocking the entire VM and rendering
> the service unusable, missing heartbeats and stopping data access. The threads look like this:
> {code}
> 3415 (state = BLOCKED)
> - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
> - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Compiled
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=834 (Interpreted frame)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, int) @bci=67, line=867 (Interpreted frame)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) @bci=17, line=1197 (Interpreted frame)
> - java.util.concurrent.locks.ReentrantLock$NonfairSync.lock() @bci=21, line=214 (Compiled
> - java.util.concurrent.locks.ReentrantLock.lock() @bci=4, line=290 (Compiled frame)
> - org.apache.hadoop.net.unix.DomainSocketWatcher.add(org.apache.hadoop.net.unix.DomainSocket, org.apache.hadoop.net.unix.DomainSocketWatcher$Handler) @bci=4, line=286 (Interpreted frame)
> - org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(java.lang.String, org.apache.hadoop.net.unix.DomainSocket) @bci=169, line=283 (Interpreted frame)
> - org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(java.lang.String) @bci=212, line=413 (Interpreted frame)
> - org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(java.io.DataInputStream) @bci=13, line=172 (Interpreted frame)
> - org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(org.apache.hadoop.hdfs.protocol.datatransfer.Op) @bci=149, line=92 (Compiled frame)
> - org.apache.hadoop.hdfs.server.datanode.DataXceiver.run() @bci=510, line=232 (Compiled
> - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
> {code}
> Has anybody seen this before?

This message was sent by Atlassian JIRA