hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3120) Large #of tasks failing at one time can effectively hang the jobtracker
Date Fri, 28 Mar 2008 19:06:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583147#action_12583147 ]

Doug Cutting commented on HADOOP-3120:

It sounds like we should experiment with AsyncAppender with a large buffer size (10k messages?)
and perhaps with blocking=false (drops messages when buffer is full, logging a count of dropped
messages).  Have you tried this, Pete?
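
For illustration, here is a minimal sketch of the non-blocking behavior described above: a bounded buffer that drops messages when it is full and reports how many were dropped. It uses only the Java standard library and is not log4j's AsyncAppender; the class DroppingAsyncLog, its method names, and the wording of the summary line are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration of a drop-when-full log buffer; not log4j code. */
public class DroppingAsyncLog {
    private final BlockingQueue<String> buffer;
    private final AtomicLong dropped = new AtomicLong();

    public DroppingAsyncLog(int bufferSize) {
        this.buffer = new ArrayBlockingQueue<>(bufferSize);
    }

    /** Never blocks the caller: a message that does not fit is counted and discarded. */
    public boolean log(String message) {
        if (buffer.offer(message)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    /**
     * Moves all buffered messages into the sink, then appends one summary
     * line if anything was dropped since the last flush.
     * Returns the number of lines written to the sink.
     */
    public int flushTo(List<String> sink) {
        int before = sink.size();
        buffer.drainTo(sink);
        long n = dropped.getAndSet(0);
        if (n > 0) {
            sink.add("Discarded " + n + " messages due to full buffer");
        }
        return sink.size() - before;
    }

    public long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        DroppingAsyncLog log = new DroppingAsyncLog(2);
        log.log("task 1 failed");
        log.log("task 2 failed");
        log.log("task 3 failed"); // buffer is full, so this one is dropped
        List<String> out = new ArrayList<>();
        log.flushTo(out);
        out.forEach(System.out::println);
    }
}
```

In this sketch the caller flushes explicitly. A real asynchronous appender would drain the buffer from a separate dispatcher thread, so that a thread holding the JobTracker lock only pays for the enqueue.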

> Large #of tasks failing at one time can effectively hang the jobtracker 
> ------------------------------------------------------------------------
>                 Key: HADOOP-3120
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3120
>             Project: Hadoop Core
>          Issue Type: Bug
>         Environment: Linux/Hadoop-15.3
>            Reporter: Pete Wyckoff
>            Priority: Minor
> We think that JobTracker.removeMarkedTasks does so much logging when this happens (i.e.,
> logging thousands of failed tasks per cycle) that nothing else can make progress, since it
> is called from a synchronized method. By the next cycle, the next wave of jobs has failed,
> so we again have tens of thousands of failures to log, and so on.
> At least, that is what we observed: a continual printing of those failures and nothing
> else happening. The original jobs may have ultimately failed, but new jobs come in and
> perpetuate the problem.
> This has happened to us a number of times, and since we commented out the log.info call in
> that method we haven't had any problems. Thousands and thousands of task failures are
> hopefully not that common, though.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
