hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3232) Datanodes time out
Date Thu, 10 Apr 2008 18:08:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587736#action_12587736 ]

Raghu Angadi commented on HADOOP-3232:
--------------------------------------


While you are at it, could you attach the log (.log file) for this datanode as well? The log
file shows what activity is going on now. Also note the approximate time the stack trace was taken.

My observation is that you are writing a lot of blocks, and the datanode that looks blocked
is blocked while listing all the blocks on the native filesystem. It does this every hour
when it sends block reports. So far nothing looks suspicious other than heavy write traffic
and slow disks. Check iostat on the machine. What is the hardware like?
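
For illustration, here is a minimal sketch (hypothetical code, not the actual FSDataset implementation) of the kind of recursive walk the first trace below shows inside FSDataset$FSDir.getBlockInfo: one native list() call per directory, touching every block file on the volume, which is why the hourly block report can take a long time on slow or busy disks. The "blk_" filename check is an assumption for the sketch.

{noformat}
// Hypothetical sketch of a recursive block scan, similar in shape to the
// File.listFiles() recursion visible in the stack trace below.  Not Hadoop
// code; the "blk_" filename prefix is an assumption for illustration.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class BlockScanSketch {

    // Recursively collect files that look like block files under a data dir.
    static void collectBlocks(File dir, List<File> out) {
        File[] entries = dir.listFiles();   // one native list() per directory
        if (entries == null) return;        // unreadable directory: skip it
        for (File f : entries) {
            if (f.isDirectory()) {
                collectBlocks(f, out);      // descend into subdirectory trees
            } else if (f.getName().startsWith("blk_")) {
                out.add(f);                 // count this block file
            }
        }
    }

    public static void main(String[] args) {
        List<File> blocks = new ArrayList<>();
        long start = System.currentTimeMillis();
        collectBlocks(new File(args[0]), blocks);
        System.out.println(blocks.size() + " block files listed in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}
{noformat}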

The two main threads from the DataNode are:

# locks 0x782f1348: {noformat}
"DataNode: [/var/storage/1/dfs/data,/var/storage/2/dfs/data,/var/storage/3/dfs/data,/var/storage/4/dfs/data]"
daemon prio=10 tid=0x72409000
 nid=0x44f6 runnable [0x71d8a000..0x71d8aec0]
   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.list(Native Method)
        at java.io.File.list(File.java:973)
        at java.io.File.listFiles(File.java:1051)
        at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:153)
        at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:149)
        at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:149)
        at org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:368)
        at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:434)
        - locked <0x782f1348> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
        at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:781)
        at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:642)
        at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2431)
        at java.lang.Thread.run(Thread.java:619)
{noformat}
# locked 0x77ce9360 and waiting on 0x782f1348: {noformat}
"org.apache.hadoop.dfs.DataNode$DataXceiver@101f287" daemon prio=10 tid=0x71906400 nid=0x5a93
waiting for monitor entry [0x712de000..0x712defc0]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:665)
	- waiting to lock <0x782f1348> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
	- locked <0x77ce9360> (a org.apache.hadoop.dfs.FSDataset)
	at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1995)
	at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
	at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
	at java.lang.Thread.run(Thread.java:619)
{noformat}
# most other threads are waiting on 0x77ce9360
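
To make the lock chain concrete, here is a minimal, self-contained sketch (plain Java with hypothetical names, not Hadoop code) of the pattern in the dump: the block-report thread holds the volume-set monitor (0x782f1348) for the whole scan, a writer holds the dataset monitor (0x77ce9360) and blocks on the volume-set monitor, and everything else queues behind the dataset monitor.

{noformat}
// A minimal sketch (hypothetical names, not Hadoop code) of the lock chain
// in the dump above.  'volumeSet' stands in for the FSVolumeSet monitor
// (0x782f1348) and 'dataset' for the FSDataset monitor (0x77ce9360).
public class LockChainDemo {
    static final Object volumeSet = new Object();
    static final Object dataset = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread blockReport = new Thread(() -> {
            synchronized (volumeSet) {        // like getBlockInfo() in thread 1
                sleepQuietly(5_000);          // long directory scan, lock held
            }
        }, "block-report");

        Thread writer = new Thread(() -> {
            synchronized (dataset) {          // like writeToBlock() in thread 2
                synchronized (volumeSet) {    // BLOCKED until the scan finishes
                    System.out.println("writer finally got the volume set");
                }
            }
        }, "DataXceiver-writer");

        blockReport.start();
        Thread.sleep(100);                    // let the scan grab volumeSet first
        writer.start();
        // Any other thread that now needs 'dataset' queues behind the writer,
        // which is the "most other threads are waiting on 0x77ce9360" pattern.
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }
}
{noformat}

Since offerService is the same thread that sends heartbeats, a scan that runs for minutes would also explain the "last contact" numbers climbing in the namenode webui.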

> Datanodes time out
> ------------------
>
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.16.3
>
>         Attachments: hadoop-hadoop-datanode-new.out, hadoop-hadoop-datanode.out, hadoop-hadoop-namenode-master2.out
>
>
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions we've often seen in the nn webui that one or two datanodes' "last contact" goes from the usual 0-3 seconds to ~200-300 seconds before it drops down to 0 again.
> This causes mild discomfort, but the big problems appear when all nodes do this at once, as happened a few times after the upgrade.
> It was suggested that this could be due to namenode garbage collection, but looking at the gc log output it doesn't seem to be the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

