From: "Vinod K V (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-5285) JobTracker hangs for long periods of time
Date: Thu, 19 Feb 2009 04:48:01 -0800 (PST)
Message-ID: <1015649254.1235047681745.JavaMail.jira@brutus>
In-Reply-To: <335697386.1235045761732.JavaMail.jira@brutus>
Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm

     [ https://issues.apache.org/jira/browse/HADOOP-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod K V updated HADOOP-5285:
------------------------------

    Attachment: trace.txt

After recovering from the hang, the JT logged the following. It looks like initTasks() was blocked by a DFS issue:

{code}
2009-02-19 09:52:53,132 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200902190419_0129_r_000178_0' has completed task_200902190419_0129_r_000178 successfully.
2009-02-19 10:03:29,445 WARN org.apache.hadoop.hdfs.DFSClient: Exception while reading from blk_2044238107768440002_840946 of /mapredsystem/hadoopqa/mapredsystem/job_200902190419_0419/job.split from : java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ remote=/]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1222)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1075)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1630)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1680)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:154)
        at org.apache.hadoop.mapred.JobClient$RawSplit.readFields(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient.readSplitFile(JobClient.java:1060)
        at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:402)
        at org.apache.hadoop.mapred.EagerTaskInitializationListener$JobInitThread.run(EagerTaskInitializationListener.java:55)
        at java.lang.Thread.run(Thread.java:619)
2009-02-19 10:03:29,446 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2044238107768440002_840946 from any node: java.io.IOException: No live nodes contain current block
{code}

Also attaching the trace of JT when one such situation happened.
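
To make the contention concrete: initTasks() and obtainTaskCleanupTask() synchronize on the same JobInProgress instance, so a DFS read that stalls inside initTasks() keeps every other synchronized caller waiting on that monitor. The sketch below is a minimal, simplified stand-in for that pattern; it is not the actual Hadoop source, and the helper methods are hypothetical placeholders:

{code}
// Simplified sketch of the locking pattern described in this issue, not the
// real JobInProgress code. Both methods synchronize on the same object, so a
// slow, blocking read inside initTasks() stalls obtainTaskCleanupTask() and
// every other caller that needs the monitor.
public class JobInProgressSketch {

    // Runs on the job-initialization thread and holds the monitor for the
    // entire duration of the (possibly very slow) split-file read.
    public synchronized void initTasks() throws java.io.IOException {
        byte[] rawSplits = readSplitFileFromDfs();  // blocking I/O under the lock
        createTasks(rawSplits);
    }

    // Called on the JobTracker's scheduling/heartbeat path; it cannot proceed
    // until initTasks() releases the monitor, so the JT appears to hang.
    public synchronized Object obtainTaskCleanupTask() {
        return pickCleanupTask();
    }

    // Hypothetical stand-in for the DFSClient read of job.split that timed out
    // repeatedly in the log above; it can block for many minutes when datanodes
    // time out or no live node has the block.
    private byte[] readSplitFileFromDfs() throws java.io.IOException {
        return new byte[0];
    }

    private void createTasks(byte[] rawSplits) { /* elided */ }

    private Object pickCleanupTask() { return null; }
}
{code}

In this shape, the time the JT stays unresponsive is bounded only by how long the DFS read inside initTasks() can block, which is consistent with the 10-15 minute (and longer) hangs described below.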

> JobTracker hangs for long periods of time
> -----------------------------------------
>
>                 Key: HADOOP-5285
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5285
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Vinod K V
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: trace.txt
>
>
> On one of the larger clusters (2000 nodes), the JT hung quite often, sometimes on the order of 10-15 minutes and once for one and a half hours(!). The stack trace shows that JobInProgress.obtainTaskCleanupTask() is waiting for the lock on the JobInProgress object, which JobInProgress.initTasks() holds for a long time while waiting on DFS operations.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.