hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3232) Datanodes time out
Date Thu, 10 Apr 2008 21:46:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587795#action_12587795 ]

Raghu Angadi commented on HADOOP-3232:

What is the exact iostat command you ran? Is it an average over the last 5 seconds or so, or an
overall average?

From the data:
2008-04-10 09:23:08,667 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 497572 blocks got processed in 381086 msecs

A few things about this: 
- You have around 500K blocks, most of them very small (verification time is very short).
This is an order of magnitude more than our datanodes have here at Yahoo.
- The above log message says it took 6.5 min for the block report. Most of this time, I would
think, is spent listing the files in the local directory.
- Such a large number of blocks should cause a similar problem with 0.15 as well. Is there
anything you think is different?
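
One rough way to check the guess about directory listing is to time a plain walk over a datanode's data directory and compare it with the figures in the log line above. The sketch below is only an illustration in Python, not DataNode code: the function names are hypothetical, and the "blk_" file name prefix is an assumption about how block files are named on disk.

```python
import os
import sys
import time


def count_block_files(data_dir, prefix="blk_"):
    """Walk data_dir recursively and count files whose names start with prefix.

    The "blk_" default is an assumption about block file naming.
    """
    count = 0
    for _root, _dirs, files in os.walk(data_dir):
        count += sum(1 for name in files if name.startswith(prefix))
    return count


def timed_scan(data_dir):
    """Return (file count, elapsed milliseconds) for one walk of data_dir."""
    start = time.perf_counter()
    count = count_block_files(data_dir)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return count, elapsed_ms


# Figures taken from the BlockReport log line quoted above.
REPORT_BLOCKS = 497572
REPORT_MSECS = 381086
report_minutes = REPORT_MSECS / 60000.0      # total report time in minutes
ms_per_block = REPORT_MSECS / REPORT_BLOCKS  # average time per block

if __name__ == "__main__" and len(sys.argv) > 1:
    found, took_ms = timed_scan(sys.argv[1])
    print("listed %d files in %.0f ms" % (found, took_ms))
    print("log: %.2f min total, %.3f ms per block" % (report_minutes, ms_per_block))
```

Running it with the data directory path as the only argument gives a listing time that can be set against the 381086 msecs in the log. A warm page cache will make a second run look much faster than the first, so the first run is the more telling one.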

Though the DataNode should be able to handle a larger number of blocks better, I don't think this
is a blocker for the 0.16.3 release (unless iostat shows something else). Do you agree?

Btw, are you planning to have a large number of small blocks? It's going to limit NameNode scalability.

> Datanodes time out
> ------------------
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.16.3
>         Attachments: hadoop-hadoop-datanode-new.log, hadoop-hadoop-datanode-new.out,
> hadoop-hadoop-datanode.out, hadoop-hadoop-namenode-master2.out
>
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions we've often
> seen in the nn webui that one or two datanodes "last contact" goes from the usual 0-3 sec
> to ~200-300 before it drops down to 0 again.
> This causes mild discomfort but the big problems appear when all nodes do this at once,
> as happened a few times after the upgrade.
> It was suggested that this could be due to namenode garbage collection, but looking at
> the gc log output it doesn't seem to be the case.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
