hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3232) Datanodes time out
Date Fri, 09 May 2008 19:30:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595697#action_12595697 ]

Raghu Angadi commented on HADOOP-3232:

It was my mistake to say 'DF' where I meant 'DU'.

Patch looks good. A couple of comments:

- It need not disallow an interval of zero. In that case, you could just not start the thread
and invoke run() as before, since DU is a utility that is used (or could be used) outside DataNode.

- Persistent thread: in the normal case, since the thread works only once in a while, it could
be created only when it needs to run. This would be an improvement; I don't mean it as a hard
requirement for this patch. It is more in line with the previous behaviour, since if getUsed()
is not called, there is no penalty.
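
A minimal sketch of the two suggestions above, for illustration only. It is not the DU class from the patch: the class name LazyUsageCache, the LongSupplier standing in for the expensive `du` scan, and the field names are all hypothetical. An interval of zero falls back to a blocking scan on every call, and otherwise a short-lived thread is created only when getUsed() finds the cached value stale.

```java
import java.util.function.LongSupplier;

/** Hypothetical illustration; not the Hadoop DU implementation. */
public class LazyUsageCache {
    private final LongSupplier scan;   // stands in for the expensive `du` call (assumption)
    private final long intervalMs;     // 0 means: no thread, scan synchronously
    private long cached;
    private long lastRefreshMs;
    private boolean initialized;
    private boolean refreshing;

    public LazyUsageCache(LongSupplier scan, long intervalMs) {
        if (intervalMs < 0) {
            throw new IllegalArgumentException("interval must be >= 0");
        }
        this.scan = scan;
        this.intervalMs = intervalMs;
    }

    public long getUsed() {
        if (intervalMs == 0) {
            // Interval of zero: behave as before, run the scan in the caller's thread.
            return scan.getAsLong();
        }
        synchronized (this) {
            if (!initialized) {
                // First call blocks so the caller gets a real value.
                cached = scan.getAsLong();
                lastRefreshMs = System.currentTimeMillis();
                initialized = true;
            } else if (!refreshing
                    && System.currentTimeMillis() - lastRefreshMs >= intervalMs) {
                // Value is stale: create a thread only now, and only one at a time.
                refreshing = true;
                Thread t = new Thread(this::refreshInBackground, "usage-refresh");
                t.setDaemon(true);
                t.start();
            }
            // Return the cached value without waiting for any refresh in flight.
            return cached;
        }
    }

    private void refreshInBackground() {
        try {
            long value = scan.getAsLong();   // slow part runs outside the lock
            synchronized (this) {
                cached = value;
                lastRefreshMs = System.currentTimeMillis();
            }
        } finally {
            synchronized (this) {
                refreshing = false;
            }
        }
    }
}
```

With this shape, no thread exists and no scan runs unless getUsed() is actually called, which is the "no penalty" property mentioned above.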

Are you using this (or the previous) patch in your environment?

This certainly improves DN stability with a large number of blocks. We still need to keep in
mind that DU has a very noticeable penalty. Say it takes around 10 min (as in your case); that
implies the DN will be extremely I/O starved 15% of the time. This will have a very noticeable
effect on I/O-intensive applications.

> Datanodes time out
> ------------------
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.18.0
>         Attachments: du-nonblocking-v1.patch, du-nonblocking-v2-trunk.patch, hadoop-hadoop-datanode-new.log,
>                      hadoop-hadoop-datanode-new.out, hadoop-hadoop-datanode.out, hadoop-hadoop-namenode-master2.out
>
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions we've often
> seen in the nn webui that one or two datanodes "last contact" goes from the usual 0-3 sec
> to ~200-300 before it drops down to 0 again.
> This causes mild discomfort but the big problems appear when all nodes do this at once,
> as happened a few times after the upgrade.
> It was suggested that this could be due to namenode garbage collection, but looking at
> the gc log output it doesn't seem to be the case.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
