hadoop-common-dev mailing list archives

From "Vinod K V (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-5285) JobTracker hangs for long periods of time
Date Thu, 19 Feb 2009 12:48:01 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod K V updated HADOOP-5285:
------------------------------

    Attachment: trace.txt

After recovering from the hang, the JT logged the following. It looks like initTasks() was blocked because of a DFS issue:


{code}
2009-02-19 09:52:53,132 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200902190419_0129_r_000178_0' has completed task_200902190419_0129_r_000178 successfully.
2009-02-19 10:03:29,445 WARN org.apache.hadoop.hdfs.DFSClient: Exception while reading from blk_2044238107768440002_840946 of /mapredsystem/hadoopqa/mapredsystem/job_200902190419_0419/job.split from <host:port>: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<host:port> remote=/<host:port>]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1222)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1075)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1630)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1680)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:154)
        at org.apache.hadoop.mapred.JobClient$RawSplit.readFields(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient.readSplitFile(JobClient.java:1060)
        at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:402)
        at org.apache.hadoop.mapred.EagerTaskInitializationListener$JobInitThread.run(EagerTaskInitializationListener.java:55)
        at java.lang.Thread.run(Thread.java:619)

2009-02-19 10:03:29,446 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2044238107768440002_840946 from any node: java.io.IOException: No live nodes contain current block
{code}
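The 60000 millis timeout in the log is a socket read timeout firing while the DataNode stays silent. Hadoop's DFSClient goes through SocketIOWithTimeout (NIO selector based) rather than a plain SO_TIMEOUT, but the failure mode is the same; a minimal standalone sketch (hypothetical class name, 200 ms instead of 60000 ms):

{code}
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // A server that accepts the connection but never writes anything,
        // standing in for an unresponsive DataNode.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort());
             Socket serverSide = server.accept()) {
            client.setSoTimeout(200); // 200 ms instead of Hadoop's 60000 ms
            try {
                client.getInputStream().read(); // blocks until the timeout fires
                System.out.println("read returned");
            } catch (SocketTimeoutException e) {
                System.out.println("timed out: " + e.getMessage());
            }
        }
    }
}
{code}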

Also attaching a stack trace of the JT taken while one such hang was in progress.

> JobTracker hangs for long periods of time
> -----------------------------------------
>
>                 Key: HADOOP-5285
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5285
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Vinod K V
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: trace.txt
>
>
> On one of the larger clusters of 2000 nodes, the JT hung quite often, sometimes for periods
> on the order of 10-15 minutes, and once for one and a half hours(!). The stack trace shows
> that JobInProgress.obtainTaskCleanupTask() is waiting for the lock on the JobInProgress object,
> which JobInProgress.initTasks() holds for a long time while waiting on DFS operations.
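The contention described above can be sketched with a toy class (hypothetical names, not the real JobInProgress): one synchronized method holds the object's intrinsic monitor across a slow blocking call, so every other synchronized method on the same instance stalls for the full duration.

{code}
public class MonitorHoldDemo {
    static class Job {
        // Stand-in for initTasks(): holds 'this' monitor for the whole call.
        synchronized void initTasks(long blockedMillis) throws InterruptedException {
            Thread.sleep(blockedMillis); // stand-in for a DFS read that times out
        }
        // Stand-in for obtainTaskCleanupTask(): needs the same monitor.
        synchronized void obtainTaskCleanupTask() {
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        Thread init = new Thread(() -> {
            try { job.initTasks(500); } catch (InterruptedException ignored) { }
        });
        init.start();
        Thread.sleep(50); // let initTasks() grab the monitor first
        long t0 = System.nanoTime();
        job.obtainTaskCleanupTask(); // blocks until initTasks() returns
        long waitedMillis = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("blocked: " + (waitedMillis >= 400));
        init.join();
    }
}
{code}

In the real JT the "slow call" is a DFS read with a 60-second timeout and retries, which is why the lock can be held for many minutes.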

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

