hadoop-common-dev mailing list archives

From "Johan Oskarsson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3232) Datanodes time out
Date Fri, 09 May 2008 16:56:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595669#action_12595669 ]

Johan Oskarsson commented on HADOOP-3232:
-----------------------------------------

Doug: You're right about the Runnable/run() bit. Just as you wrote that, I adapted the patch
as you suggested.

I agree about the interval; I'll change it.
This patch is for DU. DF returns so quickly that it shouldn't cause an issue.

In the DU constructor the command is run once so that we get values straight away. I thought
this would be better, since then we know for sure that there are correct values in there once
the object is created.
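
To illustrate the idea (measure once in the constructor, then refresh in the background so
readers never block), here is a minimal sketch. It is not the code from the attached patches:
the class name CachedUsage, the LongSupplier standing in for the actual du invocation, and the
interval parameter are all assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch, not the HADOOP-3232 patch: caches a slow disk-usage
// measurement and refreshes it on a background thread.
final class CachedUsage {
    private final LongSupplier measure;   // stand-in for running "du" (assumption)
    private final long intervalMillis;    // refresh interval (assumption)
    private final AtomicLong used = new AtomicLong();
    private final Thread refresher;
    private volatile boolean running = true;

    CachedUsage(LongSupplier measure, long intervalMillis) {
        this.measure = measure;
        this.intervalMillis = intervalMillis;
        // Run the measurement once, synchronously, so a valid value
        // is available as soon as the object has been constructed.
        used.set(measure.getAsLong());
        refresher = new Thread(this::refreshLoop, "usage-refresher");
        refresher.setDaemon(true);
        refresher.start();
    }

    // Only this background thread ever waits for the slow measurement.
    private void refreshLoop() {
        while (running) {
            try {
                Thread.sleep(intervalMillis);
                used.set(measure.getAsLong());
            } catch (InterruptedException e) {
                return; // shutdown() was called
            }
        }
    }

    // Returns the most recent cached value without blocking.
    long getUsed() {
        return used.get();
    }

    void shutdown() {
        running = false;
        refresher.interrupt();
    }
}
```

With this shape, a caller that needs the usage figure reads the cached number from getUsed()
and never waits for the measurement itself. The trade-off is that the value can be up to one
interval old.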

I'll try to recreate the situation to produce good evidence, but off the top of my head, DU
is used to decide which volume to write to in writeToBlock in FSDataset, so it causes problems
with writing blocks if it takes too long. We've seen quite a lot of this.
As you say, it doesn't run that often, but often enough to cause us problems.

> Datanodes time out
> ------------------
>
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: du-nonblocking-v1.patch, du-nonblocking-v2-trunk.patch, hadoop-hadoop-datanode-new.log,
>                      hadoop-hadoop-datanode-new.out, hadoop-hadoop-datanode.out, hadoop-hadoop-namenode-master2.out
>
>
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions we've often
> seen in the nn webui that one or two datanodes "last contact" goes from the usual 0-3 sec
> to ~200-300 before it drops down to 0 again.
> This causes mild discomfort but the big problems appear when all nodes do this at once,
> as happened a few times after the upgrade.
> It was suggested that this could be due to namenode garbage collection, but looking at
> the gc log output it doesn't seem to be the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

