hadoop-mapreduce-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-2177) The wait for spill completion should call Condition.awaitNanos(long nanosTimeout)
Date Mon, 08 Nov 2010 01:18:24 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929423#action_12929423 ]

Chris Douglas commented on MAPREDUCE-2177:
------------------------------------------

The collection thread is forced to block because the buffer is full. Returning from collect without serializing
the emitted record would be an error, as would serializing the record over data allocated
to the spill. Changing the call as you suggest would affect correctness, unless you're arguing
that the task should fail if the spill takes more than some set amount of time. If the task
timeout is killing the task, then it's working as designed, and equivalently to the proposed
mechanism.
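
To make the point above concrete, here is a minimal, self-contained sketch of what a timed wait can and cannot do. It is not the MapTask code: the class, the spillInProgress flag (standing in for kvstart != kvend), and the onExpiry callback are hypothetical names used only for illustration, and the lambdas assume Java 8 or later. Even with Condition.awaitNanos, the waiter has to re-check the predicate and keep waiting (or fail the task) when the timeout expires; it cannot proceed while the spill still owns the buffer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class TimedSpillWaitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition spillDone = lock.newCondition();
    // Hypothetical stand-in for the (kvstart != kvend) predicate in the quoted loop.
    private boolean spillInProgress = true;

    /** Blocks until the spill finishes; returns how many timed waits expired first. */
    int waitForSpill(long sliceNanos, Runnable onExpiry) throws InterruptedException {
        int expirations = 0;
        lock.lock();
        try {
            while (spillInProgress) {
                long remaining = spillDone.awaitNanos(sliceNanos);
                if (remaining <= 0 && spillInProgress) {
                    // Timeout expired but the buffer is still owned by the spill:
                    // the only safe choices are to keep waiting or to fail.
                    expirations++;
                    onExpiry.run();
                }
            }
        } finally {
            lock.unlock();
        }
        return expirations;
    }

    /** Called by the spilling thread; signals under the same lock the waiter holds. */
    void finishSpill() {
        lock.lock();
        try {
            spillInProgress = false;
            spillDone.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Simulates a spill that ends only after the waiter has timed out N times. */
    static int demo(int expiriesBeforeSpillEnds) throws InterruptedException {
        TimedSpillWaitSketch sketch = new TimedSpillWaitSketch();
        CountDownLatch slowSpill = new CountDownLatch(expiriesBeforeSpillEnds);
        Thread spiller = new Thread(() -> {
            try {
                slowSpill.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            sketch.finishSpill();
        });
        spiller.start();
        int expirations =
            sketch.waitForSpill(TimeUnit.MILLISECONDS.toNanos(20), slowSpill::countDown);
        spiller.join();
        return expirations;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timed waits that expired before the spill finished: " + demo(3));
    }
}
```

In this sketch the timeout only changes how often the waiter wakes up. The wait still ends only when the spilling thread clears the flag and signals, which is why a timeout by itself does not address a slow spill.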

There are many reasons the spill could take a long time: running with a combiner, using a
non-{{RawComparator}}, spilling to a failing or slow disk, etc. It's possible you're seeing a
race condition that causes the collection thread to miss the signal, but in that case the fix
would be to correct the locking, not to add a timeout to the wait. Can you get a stack trace from a
map task stuck in this state? If the job is rerun over the same data, do the same tasks hang?
Do the timeouts occur on particular machines? Does the task succeed on later attempts on different
machines?
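
For the stack trace requested above, running the JDK's jstack tool against the task's child JVM is one common way to get it. As a rough in-process alternative, the sketch below uses the standard java.lang.management API to print each thread's state, the lock it is waiting on (if any), and its stack. The class name ThreadDumpSketch is hypothetical, and nothing here is part of Hadoop.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    /** Returns a jstack-like text dump of all live threads in this JVM. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false/false: skip locked-monitor and ownable-synchronizer details,
        // which not every JVM supports.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                // The object or Condition this thread is blocked or waiting on.
                out.append(" waiting on ").append(info.getLockName());
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

In a dump from a hung task, a collection thread parked in a Condition wait alongside a spill thread that is idle would point toward a missed signal, while a spill thread busy in sort, combiner, or disk I/O frames would point toward a slow spill.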

> The wait for spill completion should call Condition.awaitNanos(long nanosTimeout)
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2177
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2177
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.20.2
>            Reporter: Ted Yu
>
> We sometimes saw map task timeouts in cdh3b2. Here is the log from one of the map tasks:
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: bufstart = 119534169; bufend = 59763857; bufvoid = 298844160
> 2010-11-04 10:34:23,820 INFO org.apache.hadoop.mapred.MapTask: kvstart = 438913; kvend = 585320; length = 983040
> 2010-11-04 10:34:41,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
> 2010-11-04 10:35:45,352 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
> 2010-11-04 10:35:45,547 INFO org.apache.hadoop.mapred.MapTask: bufstart = 59763857; bufend = 298837899; bufvoid = 298844160
> 2010-11-04 10:35:45,547 INFO org.apache.hadoop.mapred.MapTask: kvstart = 585320; kvend = 731585; length = 983040
> 2010-11-04 10:45:41,289 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
> Note how long the last spill took.
> In MapTask.java, the following code waits for spill to finish:
> while (kvstart != kvend) { reporter.progress(); spillDone.await(); }
> In trunk code, code is similar.
> There is no timeout mechanism for Condition.await(). In case the SpillThread takes long
> before calling spillDone.signal(), we would see timeout.
> Condition.awaitNanos(long nanosTimeout) should be called.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

