hadoop-common-dev mailing list archives

From "Johan Oskarsson (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3232) Datanodes time out
Date Tue, 13 May 2008 13:40:57 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-3232:

    Attachment: du-nonblocking-v4-trunk.patch

Updated patch with the suggestions from Doug and Raghu.
The interval defaults to 10 minutes; if the incoming interval is 0, the previous behavior is used.

FindBugs doesn't like that I start a thread in the constructor, but afaik it's the only way
without adding a start method to the class, and I assume you don't want to change the interface.
This passes all the tests and checkstyle on my local machine. Also added a bunch of javadoc.

Raghu: yes, I am using a previous patch on our cluster; no problems so far.
I'm not sure what you mean by not having a permanent thread. How would we update the value
without blocking in getUsed in that case? You say it's not a hard requirement, so I hope you
can accept the patch anyway.
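The idea the comment describes can be sketched roughly as follows. This is a hypothetical illustration, not the actual HADOOP-3232 patch: the class name `CachedDU`, the `computeUsed()` placeholder, and the constants are all invented for the sketch. It shows the two behaviors discussed above: a background daemon thread refreshes a cached disk-usage value every interval so `getUsed()` never blocks, and an interval of 0 falls back to the old blocking behavior. Starting the thread in the constructor is exactly what FindBugs complains about, as noted in the comment.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of a non-blocking "du" cache (not the real patch).
 * A daemon thread refreshes the cached value periodically, so callers of
 * getUsed() never block on a slow disk scan.
 */
public class CachedDU {
    private final AtomicLong used = new AtomicLong();
    private final long intervalMs;

    public CachedDU(long intervalMs) {
        this.intervalMs = intervalMs;
        used.set(computeUsed());               // prime the cache once up front
        if (intervalMs > 0) {
            // FindBugs flags starting a thread in a constructor, but without
            // adding a start() method to the class (i.e. changing the public
            // interface) this is the pragmatic choice the comment mentions.
            Thread refresher = new Thread(() -> {
                while (true) {
                    try {
                        Thread.sleep(this.intervalMs);
                    } catch (InterruptedException e) {
                        return;                // allow clean shutdown
                    }
                    used.set(computeUsed());
                }
            }, "du-refresh");
            refresher.setDaemon(true);         // don't keep the JVM alive
            refresher.start();
        }
    }

    /**
     * Non-blocking when a refresh interval is set; with interval 0 it falls
     * back to the previous behavior of computing on every call.
     */
    public long getUsed() {
        if (intervalMs <= 0) {
            return computeUsed();              // interval 0: old blocking path
        }
        return used.get();
    }

    /** Stand-in for shelling out to "du"; returns bytes used. */
    protected long computeUsed() {
        return 42L;                            // placeholder value for the sketch
    }
}
```

With an interval of 0, every `getUsed()` call pays the cost of `computeUsed()`; with a positive interval, calls only read the cached `AtomicLong`, which is why the refresh thread has to be permanent.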

> Datanodes time out
> ------------------
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.18.0
>         Attachments: du-nonblocking-v1.patch, du-nonblocking-v2-trunk.patch, du-nonblocking-v4-trunk.patch, hadoop-hadoop-datanode-new.log, hadoop-hadoop-datanode-new.out, hadoop-hadoop-datanode.out,
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions we've often seen in the nn webui that one or two datanodes' "last contact" goes from the usual 0-3 sec to ~200-300 sec before it drops down to 0 again.
> This causes mild discomfort but the big problems appear when all nodes do this at once, as happened a few times after the upgrade.
> It was suggested that this could be due to namenode garbage collection, but looking at the gc log output it doesn't seem to be the case.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
