Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <12067727.1194020270617.JavaMail.jira@brutus>
Date: Fri, 2 Nov 2007 09:17:50 -0700 (PDT)
From: "Christian Kunz (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate
In-Reply-To: <7271267.1192732431212.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-2076:
-----------------------------------

    Component/s: mapred
       Priority: Blocker  (was: Major)

> extensive map tasks failures because of SocketTimeoutException during statusUpdate
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>        Environment: Oct 17 #718 nightly build with patches 2033 and 2048
>           Reporter: Christian Kunz
>           Priority: Blocker
>
> A job with 3600 tasks on a cluster of 1350 nodes (up to 3 tasks per node) shows extensive map task failures because of connection timeouts at the end of the task (the C++ application using the pipes interface completed successfully).
> More than 600 tasks failed, slowing down the job because of retries. Only a portion of the tasks fail because of the timeout issue, but they spawn further failures because retries and speculatively executed tasks cannot even get a connection and fail after just a few seconds.
> The JobTracker is running with 60 handlers. We allow up to 10 attempts for maps.
> I attach the log of a task failing because of timeout (which includes a thread dump), and the log of one task which could not start.
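For readers skimming the attached logs: the repeated "Communication exception" entries and the final "Last retry, killing ..." line come from the child task's status-reporting thread, which periodically sends a statusUpdate RPC to its tracker and gives up after several consecutive timeouts. Below is a minimal sketch of that pattern, assuming hypothetical names (Umbilical, StatusReporterSketch, MAX_CONSECUTIVE_FAILURES, REPORT_INTERVAL_MS); it illustrates the mechanism only and is not the actual Hadoop Task code.

    import java.net.SocketTimeoutException;

    /**
     * Illustrative sketch only: a status-reporting loop of the kind implied by the
     * stack traces below (Task$1.run -> statusUpdate). All names here are
     * assumptions made for illustration, not the real Hadoop classes.
     */
    public class StatusReporterSketch implements Runnable {

        /** Stand-in for the RPC proxy the child task uses to reach its tracker. */
        interface Umbilical {
            void statusUpdate(String taskId, float progress) throws java.io.IOException;
        }

        private static final int MAX_CONSECUTIVE_FAILURES = 3;   // assumed limit
        private static final long REPORT_INTERVAL_MS = 1000;     // assumed interval

        private final Umbilical umbilical;
        private final String taskId;
        private volatile float progress;

        StatusReporterSketch(Umbilical umbilical, String taskId) {
            this.umbilical = umbilical;
            this.taskId = taskId;
        }

        public void run() {
            int failures = 0;
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // Each call is a synchronous RPC; if the server side is too busy to
                    // answer within the client timeout, it surfaces here as a
                    // SocketTimeoutException ("timed out waiting for rpc response").
                    umbilical.statusUpdate(taskId, progress);
                    failures = 0;                     // a successful ping resets the count
                } catch (SocketTimeoutException e) {
                    failures++;
                    System.err.println("Communication exception: " + e);
                    if (failures >= MAX_CONSECUTIVE_FAILURES) {
                        // Mirrors the "Last retry, killing task_..." line in the log:
                        // after repeated failures to report status, the task gives up.
                        System.err.println("Last retry, killing " + taskId);
                        return;
                    }
                } catch (java.io.IOException e) {
                    System.err.println("Unexpected error reporting status: " + e);
                    return;
                }
                try {
                    Thread.sleep(REPORT_INTERVAL_MS);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

In the first attached log the timeouts arrive roughly a minute apart (17:43:00, 17:44:03, 17:45:06), so three consecutive failures are enough to reach the "Last retry" path.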
> 2007-10-18 15:58:41,743 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2007-10-18 16:12:28,920 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:484)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>         at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>         at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
>         at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:484)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>         at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>         at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
>         at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:484)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>         at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>         at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
>         at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO org.apache.hadoop.mapred.TaskRunner: Process Thread Dump: Communication exception
> 8 active threads
> Thread 13 (Comm thread for task_200710172336_0016_m_000071_0):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 4128
>   Stack:
>     sun.management.ThreadImpl.getThreadInfo0(Native Method)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>     org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
>     org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:162)
>     org.apache.hadoop.mapred.Task$1.run(Task.java:315)
>     java.lang.Thread.run(Thread.java:619)
> Thread 12 (org.apache.hadoop.dfs.DFSClient$LeaseChecker@141b571):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 6403
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
>     java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
>   State: RUNNABLE
>   Blocked count: 21
>   Waited count: 2063
>   Stack:
>     java.net.SocketInputStream.socketRead0(Native Method)
>     java.net.SocketInputStream.read(SocketInputStream.java:129)
>     java.io.FilterInputStream.read(FilterInputStream.java:116)
>     org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.DataInputStream.readInt(DataInputStream.java:370)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable Connection Culler):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 6402
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)
> Thread 4 (Signal Dispatcher):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
> Thread 3 (Finalizer):
>   State: WAITING
>   Blocked count: 398
>   Waited count: 2270
>   Waiting on java.lang.ref.ReferenceQueue$Lock@c278b5
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>     java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> Thread 2 (Reference Handler):
>   State: WAITING
>   Blocked count: 257
>   Waited count: 2269
>   Waiting on java.lang.ref.Reference$Lock@1c66ec7
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> Thread 1 (main):
>   State: RUNNABLE
>   Blocked count: 1
>   Waited count: 10
>   Stack:
>     java.io.FileInputStream.readBytes(Native Method)
>     java.io.FileInputStream.read(FileInputStream.java:199)
>     org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:105)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>     java.io.DataInputStream.read(DataInputStream.java:132)
>     org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
>     org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:200)
>     org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:234)
>     org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>     org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>     org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:157)
>     org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
>     org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java:359)
>     org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.seek(ChecksumFileSystem.java:254)
>     org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
>     org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.java:1793)
>     org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1217)
>     org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1142)
> 2007-10-18 17:45:40,258 WARN org.apache.hadoop.mapred.TaskRunner: Last retry, killing task_200710172336_0016_m_000071_0
>
> Log of task that could not start:
>
> 2007-10-18 17:43:55,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 10 time(s).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.