hadoop-common-user mailing list archives

From "Igor Bolotin" <ig...@collarity.com>
Subject RE: Reduce hangs at times
Date Sat, 24 Feb 2007 19:38:25 GMT
I observed similar behavior this morning. A thread dump on one of the
Jetty servers showed one thread trying to open a file on DFS, while all
the other threads waited for it to finish because of synchronization:

"pool-2-thread-110" prio=1 tid=0x08aa8bc8 nid=0x7570 runnable [0x0015f000..0x00160130]
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        - locked <0xaaa02f30> (a java.io.BufferedInputStream)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at java.io.DataInputStream.readLong(DataInputStream.java:380)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:615)
        - locked <0xaaa02f98> (a org.apache.hadoop.dfs.DFSClient$DFSInputStream)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:702)
        - locked <0xaaa02f98> (a org.apache.hadoop.dfs.DFSClient$DFSInputStream)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:189)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        - locked <0xaaa03040> (a org.apache.hadoop.fs.FSDataInputStream$Buffer)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:60)
        at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:279)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:262)
        at com.collarity.cdata.CDataSegmentReader.openDataStream(CDataSegmentReader.java:132)
        at com.collarity.cdata.CDataSegmentReader.getDataStream(CDataSegmentReader.java:118)
        - locked <0x1bd29208> (a com.collarity.cdata.CDataSegmentReader)
        at com.collarity.cdata.CDataSegmentReader.readDocument(CDataSegmentReader.java:95)
...
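
To illustrate the pattern the dump suggests (one thread holding an object's
monitor while it sits in a blocking read, so every other caller of a
synchronized method on that object queues up behind it), here is a minimal
sketch. It is not the actual CDataSegmentReader or DFSClient code: the class
SegmentReader, its methods, and the latch that stands in for the stalled
socket read are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;

final class LockHeldDuringIo {

    /** Hypothetical stand-in for a reader that opens its stream inside a synchronized method. */
    static final class SegmentReader {
        // Assumption: the latch simulates a remote read that has not returned yet.
        private final CountDownLatch ioDone = new CountDownLatch(1);

        // Holds the SegmentReader monitor for as long as the simulated read stalls.
        synchronized void openDataStream() throws InterruptedException {
            ioDone.await();
        }

        // Any other synchronized method has to wait for the same monitor.
        synchronized String readDocument() {
            return "doc";
        }

        // Deliberately not synchronized, so it can be called while the monitor is held.
        void completeIo() {
            ioDone.countDown();
        }
    }

    /** Returns the state of a second caller observed while the first still holds the lock. */
    static Thread.State observeWaiterState() {
        SegmentReader reader = new SegmentReader();
        Thread opener = new Thread(() -> {
            try {
                reader.openDataStream();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "opener");
        Thread waiter = new Thread(reader::readDocument, "waiter");
        try {
            opener.start();
            // WAITING means the opener is parked on the latch, i.e. inside the monitor.
            awaitState(opener, Thread.State.WAITING);
            waiter.start();
            awaitState(waiter, Thread.State.BLOCKED);
            Thread.State seen = waiter.getState();
            reader.completeIo();
            opener.join();
            waiter.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Polls until the thread reaches the wanted state or five seconds have passed.
    private static void awaitState(Thread t, Thread.State wanted) throws InterruptedException {
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (t.getState() != wanted && System.nanoTime() < deadline) {
            Thread.sleep(5);
        }
    }

    public static void main(String[] args) {
        System.out.println("second caller state: " + observeWaiterState());
    }
}
```

In a real dump the lock holder shows up as runnable inside socketRead0, as
above, rather than waiting on a latch. The usual way out of this pattern is to
keep the blocking open outside the synchronized section, or to put a timeout
on the read.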

Unfortunately, since this happened on a production system, I didn't
have time to investigate further.
Once I stopped MapReduce (the JobTracker and TaskTrackers) and killed all
outstanding tasks, Jetty went back to normal.

The only suspicious line in the NameNode logs around that timeframe was:


2007-02-24 09:25:58,792 INFO [org.apache.hadoop.dfs.FSNamesystem$HeartbeatMonitor@24a4e2e3] [] StateChange          : BLOCK* NameSystem.heartbeatCheck: lost heartbeat from sf3-1:50010

If it happens again, I'll try to investigate further.

Igor

P.S. We are using Hadoop version 0.10.1. I guess the first thing for us to
try would be to upgrade to the latest version.


-----Original Message-----
From: Venkat Seeth [mailto:svejb@yahoo.com] 
Sent: Saturday, February 24, 2007 9:32 AM
To: hadoop-user@lucene.apache.org
Subject: Reduce hangs at times

Hi there,

I sometimes observe that a few of the reduce tasks hang during the copy
phase without failing. As a result, these tasks never complete and are
never rerun on timeout.

reduce > copy (1510 of 1540 at 1.57 MB/s) >

At the same time, I see that Jetty is out of threads in its thread pool.
I don't know whether the two are related.

I also see the following warnings and exceptions for many of the MR operations.

24 Feb 2007 02:52:14,438  WARN - No thread for Socket[addr=/10.163.63.137,port=56019,localport=50060]
- at org.mortbay.util.ThreadPool.run(ThreadPool.java:373)
24 Feb 2007 02:52:24,440  WARN - No thread for Socket[addr=/10.163.63.137,port=56023,localport=50060]
- at org.mortbay.util.ThreadPool.run(ThreadPool.java:373)
24 Feb 2007 02:53:01,582  WARN - getMapOutput(task_0005_m_000432_0,30) failed :
java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at org.mortbay.http.ChunkingOutputStream.bypassWrite(ChunkingOutputStream.java:151)
        at org.mortbay.http.BufferedOutputStream.write(BufferedOutputStream.java:139)
        at org.mortbay.http.HttpOutputStream.write(HttpOutputStream.java:423)
        at org.mortbay.jetty.servlet.ServletOut.write(ServletOut.java:54)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1574)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
 - at org.apache.hadoop.mapred.TaskTracker.doGet(TaskTracker.java:1600)
24 Feb 2007 02:53:01,583  WARN - getMapOutput(task_0005_m_001465_0,55) failed :
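
The "No thread for Socket" warnings presumably mean that every worker in the
pool was busy when a new connection arrived. As a rough analogy only, the
sketch below reproduces that situation with java.util.concurrent rather than
Jetty's own org.mortbay.util.ThreadPool: a bounded pool whose workers are all
stuck rejects further requests. The class and method names are hypothetical,
and the latch stands in for whatever the real workers were blocked on.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolExhaustion {

    /**
     * Submits `requests` tasks that all block to a pool limited to `maxThreads`
     * workers with no queue, and returns how many submissions were rejected.
     */
    static int countRejected(int maxThreads, int requests) {
        // Assumption: the latch simulates handlers stuck on a slow or hung transfer.
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>());
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // No idle worker and no room to grow: the request is turned away.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the stuck workers finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + countRejected(2, 5));
    }
}
```

If the handlers are stuck rather than merely busy, raising the pool size only
postpones the exhaustion, which would be consistent with the hung copy phase
and the thread shortage sharing a cause.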

Has anyone experienced this? Any thoughts are greatly appreciated.

Thanks,
Venkat



 