hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
Date Mon, 20 Feb 2012 18:37:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212024#comment-13212024 ]

Aaron T. Myers commented on HDFS-2966:

Hey Steve, patch looks pretty good. I agree this issue could stand to be improved. I've also
seen spurious failures in this test.

A few comments:

# In the spot where you call waitForGaugeValue for "FilesTotal", you also unnecessarily assert
the value for FilesTotal.
# The name "waitForGaugeValue" seems a little misleading, since it's not a general-purpose
method for gauges, but rather somewhat specific to gauges that are a function of _DN metrics_.
Perhaps consider renaming it to something like "waitForDnMetricValue"?
# Though the patch manages to get rid of the most race-prone sleeps (DN metrics), I don't
think it will necessarily completely solve the issue for very slow VMs, since there are still
several calls to updateMetrics. Can we completely remove the need for updateMetrics in this
test, by waiting for a specific value as you've done here?
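
To illustrate the idea of waiting for a specific value instead of sleeping for a fixed time, here is a minimal sketch. It is not the code from HDFS-2966.patch: the class MetricWaiter, the method waitForMetricValue, and the timeout and poll-interval parameters are all hypothetical names chosen for this example, and it assumes Java 8 or later (LongSupplier) for brevity. The metric is modeled as any function that returns a long.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Hadoop: polls a metric until it reaches
// the expected value, or fails once a deadline has passed.
final class MetricWaiter {

    static long waitForMetricValue(LongSupplier metric, long expected,
                                   long timeoutMs, long pollMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            long actual = metric.getAsLong();
            if (actual == expected) {
                return actual; // the metric caught up, stop waiting
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("metric was " + actual
                        + ", expected " + expected
                        + " after " + timeoutMs + " ms");
            }
            Thread.sleep(pollMs); // short pause between polls
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a gauge that a background thread updates late,
        // as a slow heartbeat would on a loaded machine.
        AtomicLong gauge = new AtomicLong(0);
        Thread updater = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            gauge.set(3);
        });
        updater.start();

        long seen = waitForMetricValue(gauge::get, 3, 5_000, 20);
        System.out.println("gauge reached " + seen);
    }
}
```

The design point is that a generous timeout only costs time when the test is actually going to fail. On a fast machine the loop returns as soon as the value appears, whereas a fixed sleep must be sized for the slowest machine and is paid on every run.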
> TestNameNodeMetrics tests can fail under load
> ---------------------------------------------
>                 Key: HDFS-2966
>                 URL: https://issues.apache.org/jira/browse/HDFS-2966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.24.0
>         Environment: OS/X running intellij IDEA, firefox, winxp in a virtualbox.
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HDFS-2966.patch
> I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of running the
HDFS tests on a desktop without enough memory for all the programs trying to run. Things
got swapped out and the tests failed as the DN heartbeats didn't come in on time.
> The tests both rely on {{waitForDeletion()}} to block the tests until the delete operation
has completed, but all it does is sleep for the same number of seconds as there are datanodes.
This is too brittle: it may work on a lightly loaded system, but not on a system under heavy
load where replication is taking longer than expected.
> Immediate fix: double, triple, the sleep time?
> Better fix: have the thread block until all the DN heartbeats have finished.


