hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3604) Reduce stuck at shuffling phase
Date Tue, 24 Jun 2008 07:07:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607499#action_12607499 ]

Arun C Murthy commented on HADOOP-3604:
---------------------------------------

Ok, this has turned very murky indeed... 

Looks like lots of threads in the JVM are hung in a call to Mutex::lock_without_safepoint_check;
when I ran jstack over the hung reducer with

$ jstack -F -m <pid>

I saw:

{noformat}
----------------- 22175 -----------------
0xffffe410  ????????
0x00000002  ????????
0x064e5344  _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57  _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x064fc204  _ZN2os5sleepEP6Threadxb + 0x244
0x063d5d20  JVM_Sleep + 0x320
0xf4d863aa  * java.lang.Thread.sleep(long) bci:0 (Interpreted frame)
0xf4d7ef0d  * org.apache.hadoop.mapred.Task$1.run() bci:31 line:301 (Interpreted frame)
0xf4d7f3e9  * java.lang.Thread.run() bci:11 line:619 (Interpreted frame)
...
...


----------------- 22178 -----------------
0xffffe410  ????????
0x00000002  ????????
0x064e5344  _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57  _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x065b79f4  _ZN10JavaThread44check_safepoint_and_suspend_for_native_transEPS_ + 0x2b4
0x06398e9c  jni_SetByteArrayRegion + 0x7c
0xd034b55b  Java_java_net_SocketInputStream_socketRead0 + 0xfb
0xf4d863aa  * java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int,
int, int) bci:0 (Interpreted frame)
0xf4d7ede9  * java.net.SocketInputStream.read(byte[], int, int) bci:84 line:129 (Interpreted
frame)
0xf4d7ede9  * java.io.BufferedInputStream.fill() bci:175 line:218 (Interpreted frame)
0xf4d7ef0d  * java.io.BufferedInputStream.read1(byte[], int, int) bci:44 line:258 (Interpreted
frame)
0xf4d7ede9  * java.io.BufferedInputStream.read(byte[], int, int) bci:49 line:317 (Interpreted
frame)
0xf4d7ede9  * sun.net.www.http.HttpClient.parseHTTPHeader(sun.net.www.MessageHeader, sun.net.ProgressSource,
sun.net.www.protocol.http.HttpURLConnection) bci:51 line:687 (Interpreted frame)
0xf4d7ece1  * sun.net.www.http.HttpClient.parseHTTP(sun.net.www.MessageHeader, sun.net.ProgressSource,
sun.net.www.protocol.http.HttpURLConnection) bci:30 line:632 (Interpreted frame)
0xf4d7ece1  * sun.net.www.protocol.http.HttpURLConnection.getInputStream() bci:290 line:1000
(Interpreted frame)
0xf4d7eda7  * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(java.net.URLConnection,
int, int) bci:71 line:1230 (Interpreted frame)
0xf4d7eda7  * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation,
org.apache.hadoop.fs.Path) bci:54 line:1068 (Interpreted frame)
...
...


----------------- 22182 -----------------
0xffffe410  ????????
0x00000002  ????????
0x064e5344  _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57  _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x06594250  _ZN13ObjectMonitor5enterEP6Thread + 0x2a0
0x0659253a  _ZN18ObjectSynchronizer10slow_enterE6HandleP9BasicLockP6Thread + 0x6a
0x06365bde  _ZN18InterpreterRuntime12monitorenterEP10JavaThreadP15BasicObjectLock + 0x6e
0xf4d8f263  * org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) bci:13
line:202 (Interpreted frame)
0xf4d7ef0d  * org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority,
java.lang.Object, java.lang.Throwable) bci:14 line:388 (Interpreted frame)
0xf4d7ef0d  * org.apache.log4j.Category.log(java.lang.String, org.apache.log4j.Priority, java.lang.Object,
java.lang.Throwable) bci:34 line:853 (Interpreted frame)
0xf4d7ef0d  * org.apache.commons.logging.impl.Log4JLogger.info(java.lang.Object) bci:21 line:133
(Interpreted frame)
0xf4d7f3e9  * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation,
org.apache.hadoop.fs.Path) bci:311 line:1114 (Interpreted frame)
0xf4d7eda7  * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation)
bci:181 line:976 (Interpreted frame)
...
...


----------------- 22185 -----------------
0xffffe410  ????????
0x00000002  ????????
0x064e5344  _ZN5Mutex28lock_without_safepoint_checkEv + 0x44
0x06557f57  _ZN20SafepointSynchronize5blockEP10JavaThread + 0xd7
0x063639cc  _ZN18InterpreterRuntime8newarrayEP10JavaThread9BasicTypei + 0xfc
0xf4d8eb97  * java.lang.AbstractStringBuilder.<init>(int) bci:6 line:45 (Interpreted
frame)
0xf4d7ef0d  * java.lang.StringBuffer.<init>() bci:3 line:79 (Interpreted frame)
0xf4d7ef0d  * java.text.NumberFormat.format(long) bci:6 line:280 (Interpreted frame)
0xf4d7eda7  * org.apache.hadoop.mapred.JobID.toStringWOPrefix() bci:30 line:107 (Interpreted
frame)
0xf4d7eda7  * org.apache.hadoop.mapred.TaskID.toStringWOPrefix() bci:13 line:135 (Interpreted
frame)
0xf4d7eda7  * org.apache.hadoop.mapred.TaskAttemptID.toStringWOPrefix() bci:13 line:124 (Interpreted
frame)
0xf4d7eda7  * org.apache.hadoop.mapred.TaskAttemptID.toString() bci:20 line:118 (Interpreted
frame)
0xf4d7eda7  * java.lang.String.valueOf(java.lang.Object) bci:10 line:2827 (Interpreted frame)
0xf4d7eda7  * java.lang.StringBuilder.append(java.lang.Object) bci:2 line:115 (Interpreted
frame)
0xf4d7eda7  * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation,
org.apache.hadoop.fs.Path) bci:372 line:1119 (Interpreted frame)
0xf4d7eda7  * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputLocation)
bci:181 line:976 (Interpreted frame)
0xf4d7ee2b  * org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run() bci:93
line:925 (Interpreted frame)
...
...
{noformat}
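A quick way to eyeball how widespread this is (a rough one-liner; the pattern just counts the mangled SafepointSynchronize::block frames seen in the dump above):

{noformat}
# Count stack frames parked in SafepointSynchronize::block across the
# mixed-mode dump; <pid> is the hung reducer's process id as before.
jstack -F -m <pid> | grep -c '_ZN20SafepointSynchronize5blockEP10JavaThread'
{noformat}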


Sigh! This could possibly mean that we are running into a JVM bug - maybe I should upgrade to
the latest stable JVM and check.

Oh, the one other point to note is that the native codecs are compiled on a machine with a 1.6.0
JVM and are run on nodes with 1.6.0_06...
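(For the record, a minimal sanity check along these lines would confirm the version skew; the hostnames here are just placeholders:)

{noformat}
# Hypothetical check: compare the JVM on the box where the native
# codecs were built against the JVM on a worker node.
ssh build-box 'java -version'    # should report 1.6.0
ssh worker-node 'java -version'  # should report 1.6.0_06
{noformat}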

> Reduce stuck at shuffling phase
> -------------------------------
>
>                 Key: HADOOP-3604
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3604
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Runping Qi
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3604_0_20080623.patch
>
>
> I was running gridmix with Hadoop 0.18.
> I set the map output compression to true.
> Most of the jobs completed just fine.
> Three jobs, however, got stuck.
> Each has one reducer stuck at shuffling phase.
> Here is the log:
> 2008-06-20 00:06:01,264 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM
Metrics with processName=SHUFFLE, sessionId=
> 2008-06-20 00:06:01,415 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec
[/bin/cat]
> 2008-06-20 00:06:01,463 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager:
MemoryLimit=134217728, MaxSingleShuffleLimit=33554432
> 2008-06-20 00:06:01,474 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop
library
> 2008-06-20 00:06:01,475 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
> 2008-06-20 00:06:01,476 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,477 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,477 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,478 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,478 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,486 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,486 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,487 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,487 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,488 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,488 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,489 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,493 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,496 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,497 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,497 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> 2008-06-20 00:06:01,500 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200806192318_0450_r_000016_0
Thread started: Thread for merging on-disk files
> 2008-06-20 00:06:01,500 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200806192318_0450_r_000016_0
Thread waiting: Thread for merging on-disk files
> 2008-06-20 00:06:01,502 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200806192318_0450_r_000016_0
Need another 270 map output(s) where 0 is already in progress
> 2008-06-20 00:06:01,503 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200806192318_0450_r_000016_0
Thread started: Thread for merging in memory files
> 2008-06-20 00:06:01,503 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200806192318_0450_r_000016_0:
Got 0 new map-outputs & number of known map outputs is 0
> 2008-06-20 00:06:01,504 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200806192318_0450_r_000016_0
Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-06-20 00:06:06,654 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200806192318_0450_r_000016_0:
Got 269 new map-outputs & number of known map outputs is 269
> 2008-06-20 00:06:06,656 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200806192318_0450_r_000016_0
Scheduled 229 of 269 known outputs (0 slow hosts and 40 dup hosts)
> 2008-06-20 00:06:07,163 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 846183 bytes
(210104 raw bytes) into RAM-FS from attempt_200806192318_0450_m_000089_0
> 2008-06-20 00:06:07,163 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 820890 bytes
(204371 raw bytes) into RAM-FS from attempt_200806192318_0450_m_000083_0
> 2008-06-20 00:06:07,166 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 835672 bytes
(208085 raw bytes) into RAM-FS from attempt_200806192318_0450_m_000122_0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

