hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4749) Killing multiple attempts of a task taker longer as more attempts are killed
Date Thu, 01 Nov 2012 16:23:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488802#comment-13488802 ]

Robert Joseph Evans commented on MAPREDUCE-4749:
------------------------------------------------

If there are many events for a job that is still localizing, there could be a pause for each
of those events. Yes, that would slow the queue down, in the worst case back to the behavior
prior to MAPREDUCE-4088, when all events waited for the job to finish localizing. But the common
case is much faster than the worst case, not that this is much comfort when you hit the worst
case :). We could mitigate this by dropping the wait time to something smaller, like 100ms,
so that it would take 50 times as many events to slow the queue down by the same amount.
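
To make the arithmetic above concrete, here is a rough sketch of the worst-case delay, assuming
every event in the queue costs one full wait. The 5000ms figure is only inferred from the
"50 times" ratio against 100ms; it is not taken from the patch. The class and method names are
hypothetical.

```java
public class WaitDelaySketch {
    /** Worst-case added delay if each of the events costs one full wait. */
    static long worstCaseDelayMs(long events, long waitMs) {
        return events * waitMs;
    }

    /** Number of events needed to accumulate the given total delay. */
    static long eventsForDelay(long totalDelayMs, long waitMs) {
        return totalDelayMs / waitMs;
    }

    public static void main(String[] args) {
        long inferredCurrentWaitMs = 5000; // assumption: implied by 50 x 100ms
        long proposedWaitMs = 100;         // value suggested in the comment

        long delay = worstCaseDelayMs(10, inferredCurrentWaitMs);
        System.out.println("10 events at the inferred wait: " + delay + " ms");
        System.out.println("events needed at 100ms for the same delay: "
                + eventsForDelay(delay, proposedWaitMs));
    }
}
```

A shorter wait does not remove the worst case; it only raises the number of pending events
needed to reach the same total delay.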

I also agree that the tight loop will only happen when *ALL* of the actions currently in the
queue are tainted. But I don't agree that this should be rare. I think it is quite common to
have a single event in the queue, or to have all of the events in the queue be for a single
job that is localizing, especially if all of the other jobs on this node are done localizing
so their events get processed quickly and removed from the queue. With the wait removed, the
only time the thread would not be running is when the queue is empty. I have not collected any
real-world numbers, so I don't know how often that actually is in practice, or what percentage
of the running time would be spent just checking. If you feel that the extra CPU utilization is
worth it, then go ahead and remove the wait. I am not opposed to it. I just wanted to point out
the consequences of doing so. Also, if you remove the wait, we should look at whether we can
remove the notify calls from the job as well. If no one is ever going to wait, the notify calls
become dead code.
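
As an illustration of the trade-off described above, here is a minimal sketch of a queue that
skips events whose job is still localizing and does a timed wait only when every queued event
is blocked. This is not the TaskTracker code or the attached patch; the class, the String job
ids, and all method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class TaintedQueueSketch {
    private final Object lock = new Object();
    private final Deque<String> queue = new ArrayDeque<>();  // events, keyed by job id
    private final Set<String> localizing = new HashSet<>();  // jobs still localizing
    private final long waitMs;                                // must be > 0

    TaintedQueueSketch(long waitMs) { this.waitMs = waitMs; }

    void add(String jobId) {
        synchronized (lock) { queue.addLast(jobId); lock.notifyAll(); }
    }

    void startLocalizing(String jobId) {
        synchronized (lock) { localizing.add(jobId); }
    }

    // The notify here is only useful while some thread actually waits on the lock.
    void finishLocalizing(String jobId) {
        synchronized (lock) { localizing.remove(jobId); lock.notifyAll(); }
    }

    /** Processes events until the queue is empty; returns how many were processed. */
    int drain() throws InterruptedException {
        int processed = 0;
        synchronized (lock) {
            while (!queue.isEmpty()) {
                boolean progressed = false;
                for (Iterator<String> it = queue.iterator(); it.hasNext();) {
                    if (!localizing.contains(it.next())) {
                        it.remove();       // "process" the event
                        processed++;
                        progressed = true;
                    }
                }
                if (!progressed) {
                    // Every queued event is blocked: pause instead of spinning.
                    lock.wait(waitMs);     // woken early by notifyAll, else by timeout
                }
            }
        }
        return processed;
    }

    /** Small demo: job1 finishes localizing while the queue is being drained. */
    static int demo() {
        TaintedQueueSketch q = new TaintedQueueSketch(50);
        q.startLocalizing("job1");
        q.add("job1"); q.add("job2"); q.add("job1");
        Thread localizer = new Thread(() -> {
            try { Thread.sleep(20); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            q.finishLocalizing("job1");
        });
        localizer.start();
        try {
            int n = q.drain();
            localizer.join();
            return n;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }
}
```

In this sketch, dropping the timed wait would turn the all-blocked case into a busy loop, and
the notifyAll in finishLocalizing would then have no waiter to wake.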

That being said, I agree with you, Vinod, that having separate queues is a better solution
overall, but it is also a much larger change. I am not sure it would provide enough additional
benefit to justify the risk of such a change.
                
> Killing multiple attempts of a task taker longer as more attempts are killed
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4749
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4749
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Arpit Gupta
>            Assignee: Arpit Gupta
>         Attachments: MAPREDUCE-4749.branch-1.patch
>
>
> The following was noticed on a mr job running on hadoop 1.1.0
> 1. Start an mr job with 1 mapper
> 2. Wait for a min
> 3. Kill the first attempt of the mapper and then subsequently kill the other 3 attempts in order to fail the job
> The time taken to kill the task grew exponentially.
> 1st attempt was killed immediately.
> 2nd attempt took a little over a min
> 3rd attempt took approx. 20 mins
> 4th attempt took around 3 hrs.
> The command used to kill the attempt was "hadoop job -fail-task"
> Note that the command returned immediately as soon as the fail attempt was accepted but the time the attempt was actually killed was as stated above.

