hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3972) Locking and exception issues in JobHistory Server.
Date Wed, 11 Apr 2012 21:39:17 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251948#comment-13251948 ]

Robert Joseph Evans commented on MAPREDUCE-3972:
------------------------------------------------

I am updating the patch to address the comments so far.

bq. HistoryFileInfo.loadJob - instead of waiting for the file to be moved to done, it may
be better to block the move if it hasn't already started - and respond to user requests faster.

Yes, I agree that would be good, but it would require routing both the reading of the
Configuration file and the lookup of its location through HistoryFileInfo, so that the
Configuration file could not move out from under other parts of the code.  This would likely
require changes to the Job interface, which I was a bit reluctant to make, and would also
require CompletedJob to store a reference to HistoryFileInfo.  I can do it; I just thought
this was a simpler approach.
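
To make the alternative concrete, here is a minimal sketch, assuming every read of the file's
location goes through one object that also performs the move.  HistoryFileInfoSketch, its
fields, and the parser callback are hypothetical names for illustration, not the actual
Hadoop code or the contents of the patch.

```java
import java.util.function.Function;

/** Hypothetical stand-in for HistoryFileInfo; not the actual Hadoop class. */
class HistoryFileInfoSketch {
    enum State { IN_INTERMEDIATE, IN_DONE }

    private State state = State.IN_INTERMEDIATE;
    private String confPath;          // current location of the configuration file
    private final String donePath;    // where the file lives after the move

    HistoryFileInfoSketch(String intermediatePath, String donePath) {
        this.confPath = intermediatePath;
        this.donePath = donePath;
    }

    /** Parses from the current path while holding the lock, so a move cannot start mid-read. */
    synchronized <T> T loadJob(Function<String, T> parser) {
        return parser.apply(confPath);
    }

    /** Moves the file; waits for any loadJob call that currently holds the lock. */
    synchronized void moveToDone() {
        if (state == State.IN_INTERMEDIATE) {
            // A real implementation would rename the file on the file system here.
            confPath = donePath;
            state = State.IN_DONE;
        }
    }

    synchronized State getState() {
        return state;
    }
}
```

With this shape a reader never waits for the move to finish unless the move is already
running; it simply reads from wherever the file is at that moment.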

bq. JobListCache.add - if the cache size is exceeded and a move failed on the job to be removed
- looks like the list will keep growing ?

Yes, and that is what happened previously too, except it was part of a different data structure
then.  It is worse now because if some of the files cannot be moved, they will eventually
prevent files that were moved from being deleted from the data structure.  I will look into
how to handle this properly, but I am not really sure what to do about files that fail to
move.  If they fail to move, we probably don't have the correct permissions on the files or
the NN (NameNode) is down.  If the NN is down, the data structure will not be growing because
nothing new will be coming in.  If it is because the files have the wrong permissions, then we
may just need to ignore the files.  It is also worse than before because of the first issue
you pointed out: if a file fails to move, the threads waiting on it will likely block until
the timeout is reached.  OK, I'll look at fixing the first issue too.  Good catch, Sid.
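
One possible eviction policy, sketched under assumptions: when the cache is over its limit,
walk the entries oldest first, skip files still waiting to be moved instead of stopping at
them, and drop entries whose move failed.  JobListCacheSketch, the State values, and the
assumption that job IDs sort in submission order are all hypothetical; this is not what the
patch does.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for JobListCache; not the actual Hadoop class. */
class JobListCacheSketch {
    enum State { IN_INTERMEDIATE, IN_DONE, MOVE_FAILED }

    // Sorted by job ID, so iteration visits the oldest job first
    // (assumes IDs sort in submission order).
    private final TreeMap<String, State> cache = new TreeMap<>();
    private final int maxSize;

    JobListCacheSketch(int maxSize) {
        this.maxSize = maxSize;
    }

    /** Adds a job, then evicts as many old, evictable entries as needed. */
    synchronized void add(String jobId, State state) {
        cache.put(jobId, state);
        int excess = cache.size() - maxSize;
        Iterator<Map.Entry<String, State>> it = cache.entrySet().iterator();
        while (excess > 0 && it.hasNext()) {
            Map.Entry<String, State> entry = it.next();
            // Skip files still waiting to be moved rather than stopping at them;
            // moved files and failed moves are both dropped.
            if (entry.getValue() != State.IN_INTERMEDIATE) {
                it.remove();
                excess--;
            }
        }
    }

    /** Updates the state of a job that is already cached. */
    synchronized void setState(String jobId, State state) {
        if (cache.containsKey(jobId)) {
            cache.put(jobId, state);
        }
    }

    synchronized boolean contains(String jobId) {
        return cache.containsKey(jobId);
    }

    synchronized int size() {
        return cache.size();
    }
}
```

The list can still exceed its limit while every entry is waiting to be moved, but a failed
move no longer pins the entries behind it.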
                
> Locking and exception issues in JobHistory Server.
> --------------------------------------------------
>
>                 Key: MAPREDUCE-3972
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3972
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>         Attachments: MR-3972.txt, MR-3972.txt, MR-3972.txt, MR-3972.txt
>
>
> The JobHistory server's locking is inconsistent and wrong in some cases.  This is not
> super critical because the issues would only show up if a job is being cleaned up or moved
> from intermediate done to done, at the same time it is being parsed into a CompletedJob.
> However the locking is slowing down the server in some cases, and is a ticking time bomb that
> needs to be addressed.
> As part of this too we need to be sure that the Cleaner and Intermediate to Done migration
> threads handle exceptions properly.  Now it appears that the exception is logged, and the
> thread just shuts down.  This means that the history server could still be up and running
> for weeks and never remove old jobs.
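
As an illustration of the exception handling described above, here is a minimal sketch of a
wrapper that keeps a periodic task alive after a failure.  SafeRunnable and its failure
counter are hypothetical names, and how the Cleaner and migration threads are actually
scheduled is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper that stops one failed run from ending a periodic task. */
class SafeRunnable implements Runnable {
    private final Runnable delegate;
    private final AtomicInteger failures = new AtomicInteger();

    SafeRunnable(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (RuntimeException e) {
            // Record and log the failure, then return normally so the next run still happens.
            failures.incrementAndGet();
            System.err.println("Periodic task failed, will retry on next run: " + e);
        }
    }

    /** Number of runs that ended in an exception. */
    int failureCount() {
        return failures.get();
    }
}
```

A wrapper like this could be passed to ScheduledExecutorService.scheduleAtFixedRate, which
otherwise suppresses all later executions once a run throws.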

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
