hadoop-common-dev mailing list archives

From "Pete Wyckoff (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3120) Large #of tasks failing at one time can effectively hang the jobtracker
Date Fri, 28 Mar 2008 19:16:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583155#action_12583155 ]

Pete Wyckoff commented on HADOOP-3120:

I am setting it up on a non-Hadoop test server, just to make sure it operates the way I think
it should.  You are right: if we went from 10 microseconds per call to something more like 0.5-1
microsecond, that should probably do it. The 10 microseconds is measured from a call to LOG.info
with a couple of strings and 2 ints to format (standalone program with an idle disk and about
a million calls). The 0.5 is the same call with info logging disabled, plus a guesstimate of
the overhead of the call to AsyncAppender. I can probably have numbers on AsyncAppender performance
maybe early next week.
-- pete
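
A rough sketch of the kind of standalone measurement described above: time about a million info calls with logging enabled, then the same calls with info logging disabled. This is only an illustration under assumptions. It uses java.util.logging with a hypothetical in-memory CountingHandler rather than log4j and a disk appender, and the class and method names are made up, so the absolute numbers will not match the 10 microsecond figure and it says nothing about AsyncAppender itself.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical micro-measurement; not the program used in the comment above.
class LogCostSketch {

    /** In-memory stand-in for an appender: touches each message and counts it. */
    static final class CountingHandler extends Handler {
        final AtomicLong handled = new AtomicLong();
        long sink; // accumulates message lengths so the work is not optimized away

        @Override public void publish(LogRecord record) {
            sink += record.getMessage().length();
            handled.incrementAndGet();
        }
        @Override public void flush() {}
        @Override public void close() {}
    }

    /** Creates a logger that writes only to the given handler at the given level. */
    static Logger newLogger(String name, Level level, Handler handler) {
        Logger log = Logger.getLogger(name);
        log.setUseParentHandlers(false);
        for (Handler old : log.getHandlers()) {
            log.removeHandler(old);
        }
        log.addHandler(handler);
        log.setLevel(level);
        return log;
    }

    /** Average nanoseconds per info call: a couple of strings and 2 ints per message. */
    static double nanosPerCall(Logger log, int calls) {
        String task = "task_0001";       // assumed sample values
        String tracker = "tracker_01";
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            log.info("Task " + task + " failed on " + tracker + " code=" + i + " retry=" + (i % 4));
        }
        return (System.nanoTime() - start) / (double) calls;
    }

    public static void main(String[] args) {
        int calls = 1_000_000;
        Logger enabled = newLogger("sketch.enabled", Level.INFO, new CountingHandler());
        Logger disabled = newLogger("sketch.disabled", Level.WARNING, new CountingHandler());
        nanosPerCall(enabled, calls);  // warm-up pass
        nanosPerCall(disabled, calls); // warm-up pass
        System.out.println("info enabled:  " + nanosPerCall(enabled, calls) + " ns/call");
        System.out.println("info disabled: " + nanosPerCall(disabled, calls) + " ns/call");
    }
}
```

Note that the message string is still built when info logging is disabled, since the concatenation happens before the call; the disabled case mostly measures that formatting plus the level check.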

> Large #of tasks failing at one time can effectively hang the jobtracker 
> ------------------------------------------------------------------------
>                 Key: HADOOP-3120
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3120
>             Project: Hadoop Core
>          Issue Type: Bug
>         Environment: Linux/Hadoop-15.3
>            Reporter: Pete Wyckoff
>            Priority: Minor
> We think that JobTracker.removeMarkedTasks does so much logging when this happens (i.e.,
> logging thousands of failed tasks per cycle) that nothing else can proceed, since it is called
> from a synchronized method. By the next cycle, the next wave of jobs has failed,
> and we again have tens of thousands of failures to log, and so on.
> At least, that is what we observed: a continual printing of those failures
> and nothing else happening. Of course, the original jobs may have ultimately failed,
> but new jobs come in to perpetuate the problem.
> This has happened to us a number of times, and since we commented out the log.info in
> that method we haven't had any problems, although thousands and thousands of task failures
> are hopefully not that common.
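
To illustrate the pattern described in the report, here is a minimal sketch, not the actual JobTracker code. The class FailedTaskLoggingSketch, its methods, and the Consumer-based log are all hypothetical. The first method logs once per task while holding the lock, which is the behaviour suspected of stalling everything else; the second shows one possible alternative, draining the list under the lock and emitting a single summary line after releasing it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a tracker that removes tasks marked as failed.
class FailedTaskLoggingSketch {
    private final List<String> markedTasks = new ArrayList<>();
    private final Consumer<String> log; // stand-in for LOG.info

    FailedTaskLoggingSketch(Consumer<String> log) {
        this.log = log;
    }

    synchronized void markFailed(String taskId) {
        markedTasks.add(taskId);
    }

    /** Pattern from the report: one log call per task while the lock is held. */
    synchronized int removeMarkedLoggingUnderLock() {
        int removed = markedTasks.size();
        for (String id : markedTasks) {
            log.accept("Removed failed task " + id); // every caller waits on this
        }
        markedTasks.clear();
        return removed;
    }

    /** Alternative: copy and clear under the lock, then log one summary outside it. */
    int removeMarkedLoggingOutsideLock() {
        List<String> drained;
        synchronized (this) {
            drained = new ArrayList<>(markedTasks);
            markedTasks.clear();
        }
        if (!drained.isEmpty()) {
            log.accept("Removed " + drained.size() + " failed tasks, first=" + drained.get(0));
        }
        return drained.size();
    }
}
```

With thousands of failed tasks per cycle, the first variant holds the lock for thousands of log calls, while the second holds it only for a list copy. Whether a summary line carries enough detail for debugging is a separate trade-off.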

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
