Return-Path: X-Original-To: apmail-hadoop-general-archive@minotaur.apache.org Delivered-To: apmail-hadoop-general-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 20218E509 for ; Fri, 4 Jan 2013 11:53:08 +0000 (UTC) Received: (qmail 29513 invoked by uid 500); 4 Jan 2013 11:53:06 -0000 Delivered-To: apmail-hadoop-general-archive@hadoop.apache.org Received: (qmail 29288 invoked by uid 500); 4 Jan 2013 11:53:01 -0000 Mailing-List: contact general-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: general@hadoop.apache.org Delivered-To: mailing list general@hadoop.apache.org Received: (qmail 29260 invoked by uid 99); 4 Jan 2013 11:53:00 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 04 Jan 2013 11:53:00 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of royston.sellman@googlemail.com designates 74.125.82.171 as permitted sender) Received: from [74.125.82.171] (HELO mail-we0-f171.google.com) (74.125.82.171) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 04 Jan 2013 11:52:51 +0000 Received: by mail-we0-f171.google.com with SMTP id u3so7903186wey.30 for ; Fri, 04 Jan 2013 03:52:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:from:to:cc:subject:date:message-id:mime-version :content-type:x-mailer:thread-index:content-language; bh=8yp3vmDKwT0bXkGiKruM2VCFg0QnUMNBP60uyPkJaLc=; b=Gi2Z6YytXykfWRVeMx0Dn9tkJxhWXbJMzXIsM5APMYy1/4eL1pbgM7eEh2F99lvdVi KrrjVygRWU3OufcbUH4UdtWFWh53nVfhlO8bfcIXxLwfm1nKPKCjZ7TJ8FZwXOHjiFFi GQn57NtcBPcNrWrvytGAQ/onenPqKzRhBuqY8bVZbcKHp4/DEOPvGKoo67DjeXsFFC+f r/yvyJRcHbIYWtXvlA6JiWkTiFG/avNzHRHhcwIGfeFyDQu/BsHJP7Rg+q+6nOhdTQ1J pYR7jeRVYjWAtHoReen+BNG5s0IHNbJhOdMQiTfS0MY0OQaIMX3A81YDz/Cs76eS+Eqg cF1Q== X-Received: by 10.180.20.109 with SMTP id m13mr81417334wie.16.1357300350386; Fri, 04 Jan 2013 03:52:30 -0800 (PST) Received: from roystonsnb (host86-175-1-239.wlms-broadband.com. [86.175.1.239]) by mx.google.com with ESMTPS id u6sm89926923wif.2.2013.01.04.03.52.27 (version=TLSv1/SSLv3 cipher=OTHER); Fri, 04 Jan 2013 03:52:29 -0800 (PST) From: Royston Sellman To: Cc: "Tom Wilcox" Subject: Lost tasktracker errors Date: Fri, 4 Jan 2013 11:52:27 -0000 Message-ID: <033401cdea71$f8d512d0$ea7f3870$@gmail.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_NextPart_000_0335_01CDEA71.F8F1C290" X-Mailer: Microsoft Outlook 14.0 Thread-Index: Ac3qb4QsQCDmLXOcSi+cOTtxB8U5mw== Content-Language: en-gb X-Virus-Checked: Checked by ClamAV on apache.org ------=_NextPart_000_0335_01CDEA71.F8F1C290 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit I'm running a job over a 380 billion row 20 TB dataset which is computing sum(), max() etc. The job is running fine at around 3 million rows per second for several hours then grinding to a halt as it loses one after another of the tasktrackers. We see a healthy mix of successful map and reduce attempts on the tasktracker... 2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0% 2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041105_0 1.0% 2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041105_0 1.0% 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041105_0 is done. 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041105_0 was 111 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 8 2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 0.9884119% 2013-01-03 23:41:40,432 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_2021872807 exited with exit code 0. Number of tasks it ran: 1 2013-01-03 23:41:40,807 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 0.9884134% 2013-01-03 23:41:43,190 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041101_0 1.0% 2013-01-03 23:41:43,193 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041101_0 1.0% 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041101_0 is done. 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041101_0 was 111 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 9 2013-01-03 23:41:43,303 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0% 2013-01-03 23:41:43,306 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0% 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041109_0 is done. 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041109_0 was 111 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 10 2013-01-03 23:41:43,361 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_36690963 exited with exit code 0. Number of tasks it ran: 1 2013-01-03 23:41:43,428 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 1.0% 2013-01-03 23:41:43,432 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 1.0% 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041106_0 is done. 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041106_0 was 111 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 11 2013-01-03 23:41:43,457 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_-2095784622 exited with exit code 0. Number of tasks it ran: 1 2013-01-03 23:41:43,595 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_1190449426 exited with exit code 0. Number of tasks it ran: 1 2013-01-03 23:41:43,862 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 1.0% 2013-01-03 23:41:43,866 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 1.0% 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041103_0 is done. 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041103_0 was 111 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 12 2013-01-03 23:41:44,021 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_-505301168 exited with exit code 0. Number of tasks it ran: 1 2013-01-03 23:41:45,539 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050235882% reduce > copy (40975 of 271884 at 0.00 MB/s) > Then it seems to get stuck on reduce attempts, before exiting with a SIG_TERM (143) exit code... 2013-01-03 23:50:49,642 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:50:55,678 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:01,717 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:04,755 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:10,796 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:16,831 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:19,870 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:25,911 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:31,953 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:34,987 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:41,023 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:47,063 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:50,102 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:51:56,143 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:02,179 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:05,213 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:11,254 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:17,295 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:20,334 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:26,375 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:32,411 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:35,445 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:41,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:44,526 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:50,567 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:56,608 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:52:59,648 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:05,689 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:11,730 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:14,764 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:20,804 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:26,844 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:29,883 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:35,924 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:41,964 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) > 2013-01-03 23:53:42,616 INFO org.apache.hadoop.mapred.TaskTracker: Recieved ReinitTrackerAction from JobTracker 2013-01-03 23:53:42,881 INFO org.apache.hadoop.util.ProcessTree: Killing process group6089 with signal TERM. Exit code 0 2013-01-03 23:53:42,881 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 12 2013-01-03 23:53:43,348 WARN org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 143 2013-01-03 23:53:43,348 INFO org.apache.hadoop.mapred.DefaultTaskController: Output from DefaultTaskController's launchTask follows: 2013-01-03 23:53:43,348 INFO org.apache.hadoop.mapred.TaskController: 2013-01-03 23:53:43,349 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_r_-2077112828 exited with exit code 143. Number of tasks it ran: 0 2013-01-03 23:54:28,484 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup... java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run (TrackerDistributedCacheManager.java:947) Just around this time (actually a bit before) the JobTracker log reports the following: 2013-01-03 23:52:31,348 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker 'tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005' 2013-01-03 23:52:31,348 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000010_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000039_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000041_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000089_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000092_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005 Etc etc. Can anyone help with this? Royston ------=_NextPart_000_0335_01CDEA71.F8F1C290--