hadoop-common-dev mailing list archives

From Chris Schneider <Sch...@TransPac.com>
Subject DFS/RPC problems
Date Wed, 15 Nov 2006 18:52:26 GMT
Fellow Hadoop coders,

I'm getting tragic DFS/RPC problems running Nutch 0.8+ atop Hadoop 
0.5.1+ that seem similar to problems others have been reporting 
(e.g., HADOOP-707). My great hope was that the HADOOP-563 patch would 
resolve them, and it may well have improved things somewhat. However, 
I'm still unable to complete the crawl I've been attempting even 
after rolling in this patch. A few notes about my setup that may be 
relevant:

1) I'm running a crawl through a proxy server, and I've modified 
Nutch so that it skips the server delay whenever it gets a hit in the 
proxy cache. Since I've been attempting the same crawl over and over 
again, the first part of it may run so fast that it taxes the DFS and 
RPC in a way that exposes fundamental problems.

2) I'm injecting 26M URLs into a brand new DB and then trying to 
fetch them all at once. I've tuned the fetcher.threads.fetch to 500 
in order to avoid fatal memory swapping on my cluster of 10 slaves (2 
tasktrackers per slave).

3) I've been using hadoop job -kill to stop the fetch job when I 
notice things not going the way I want (not necessarily catastrophic 
problems). I wonder whether this (in combination with the heavy 
loading I've applied to hadoop) may result in my DFS becoming 
"unusable".

4) Once my DFS becomes "unusable", I can stop-all.sh/start-all.sh, 
reboot my systems, etc., but nothing will give me back a cluster I 
can reliably use to do even relatively simple things like hadoop dfs 
-du. The start-all.sh script takes a very long time to execute, but 
hadoop eventually does come up, including its web server. The hadoop 
fsck / command (unless it dies due to an RPC problem) also reports 
that my DFS is healthy, though things are very sick indeed. Even 
simple DFS commands either execute slowly or die with RPC problems 
(SocketTimeoutException). Crawling at this point may also get started 
OK, though the fetch will eventually die.

5) Among the modifications we've made to Nutch is a message queue 
subsystem developed by Andrzej. Since this doesn't use NIO for any 
of its communication, I doubt that it is interfering with hadoop's 
RPC. However, it does a fair amount of writing to the DFS itself, 
and many of the error messages in the logs are associated with 
these requests.

6) To resolve the "unusable" DFS problem, I'm forced to use the local 
FS to delete all of the DFS contents (on both master and slaves), and 
then tell the namenode to format the DFS (see the rough sketch just 
after this list). I also wonder whether this operation may leave 
things in some kind of strange state, though I can't imagine what 
the problem might be.
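
For the record, the reset boils down to roughly the following. This is 
only a sketch: the /tmp/hadoop path is an assumption based on where the 
namenode logs show the mapred system directory living, and the rm step 
actually has to be run on the master and on every slave:

   bin/stop-all.sh
   # wipe the DFS storage directories on the local FS of every node
   # (assumes hadoop.tmp.dir is /tmp/hadoop; adjust for your
   # dfs.name.dir / dfs.data.dir settings)
   rm -rf /tmp/hadoop/dfs
   # reformat the namenode and bring the cluster back up
   bin/hadoop namenode -format
   bin/start-all.sh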

After resorting to step #6 above yesterday, I performed another test 
and conducted a fairly detailed analysis of the log files. I injected 
my 26M URLs into a new crawldb, generated a fetch list (without 
specifying any topN), and started fetching it. After over an hour of 
fetching, I decided I wanted a little more logging on the Nutch side, 
so I killed the fetch job, did a stop-all.sh, deployed a modified 
version of Nutch, did a start-all.sh, and tried my crawl again from 
scratch (i.e., I injected my 26M URLs into a brand new crawldb, 
etc.). I let this run until the fetch job died, then mined the logs 
for the following information.
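
In terms of commands, that test amounts to roughly the following 
sequence. This is just a sketch: the urls directory name is made up, 
and the crawl/segment paths shown are the ones that appear in the 
first run's logs below (the second run used a brand new crawldb):

   # inject the 26M URLs into a brand new crawldb
   bin/nutch inject crawl-20061114114341/crawldb urls
   # generate a fetch list (no topN specified)
   bin/nutch generate crawl-20061114114341/crawldb crawl-20061114114341/segments
   # fetch the resulting segment
   bin/nutch fetch crawl-20061114114341/segments/20061114120615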

Any help or insight you could provide would be greatly appreciated.

Best Regards,

- Chris

-------------------------------------------------------------------------------------------
I didn't find anything unusual in the stdout for start-all.sh.

The namenode may have had trouble removing files immediately after it was first
launched, but this is probably just housekeeping (making sure that it deletes
any files that are sometimes left over from previous runs):

   2006-11-14 11:13:23,276 WARN  dfs.StateChange - DIR* 
FSDirectory.unprotectedDelete: failed to remove 
/tmp/hadoop/mapred/.system.crc because it does not exist
   2006-11-14 11:13:23,280 WARN  dfs.StateChange - DIR* 
FSDirectory.unprotectedDelete: failed to remove 
/tmp/hadoop/mapred/system because it does not exist

Immediately after the injection for the first crawl began
   2006-11-14 11:45:45,141 INFO  crawl.Injector - Injector: starting
there is a significant(?) DFS block issue in the s1 datanode's log:

   2006-11-14 11:45:49,021 WARN  dfs.DataNode - DataXCeiver
   java.io.IOException: Block blk_452376043108156205 has already been 
started (though not completed), and thus cannot be created.
           at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:298)
           at 
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663) 

           at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
           at java.lang.Thread.run(Thread.java:595)

This problem is reflected in the requesting (s7) datanode's log as well, 
where s7's transfer connection to s1 apparently gets reset(?):

   2006-11-14 11:45:49,010 WARN  dfs.DataNode - Failed to transfer 
blk_452376043108156205 to s1-crawlera/10.10.16.101:50010
   java.net.SocketException: Connection reset
           at 
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
           at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
           at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
           at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
           at java.io.DataOutputStream.write(DataOutputStream.java:90)
           at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:912)
           at java.lang.Thread.run(Thread.java:595)

(There are three of these events in s7's datanode log during the injection,
generation and fetching, which seem to be correlated with some of the message
queue subsystem problems mentioned below.)

As the injection for the first crawl completed
   2006-11-14 12:06:13,613 INFO  crawl.Injector - Injector: done
there are a couple of unimportant(?) messages that I can attribute to a brand
new crawldb:

   2006-11-14 12:06:13,603 WARN  dfs.StateChange - DIR* 
FSDirectory.unprotectedDelete: failed to remove 
/user/crawler/crawl-20061114114341/crawldb/.old.crc because it does 
not exist
   2006-11-14 12:06:13,603 WARN  dfs.StateChange - DIR* 
FSDirectory.unprotectedDelete: failed to remove 
/user/crawler/crawl-20061114114341/crawldb/old because it does not 
exist
   2006-11-14 12:06:13,604 WARN  dfs.StateChange - DIR* 
FSDirectory.unprotectedRenameTo: failed to rename 
/user/crawler/crawl-20061114114341/crawldb/current to 
/user/crawler/crawl-20061114114341/crawldb/old because source does 
not exist
   2006-11-14 12:06:13,607 WARN  dfs.StateChange - DIR* 
FSDirectory.unprotectedDelete: failed to remove 
/user/crawler/crawl-20061114114341/crawldb/.old.crc because it does 
not exist
   2006-11-14 12:06:13,608 WARN  dfs.StateChange - DIR* 
FSDirectory.unprotectedDelete: failed to remove 
/user/crawler/crawl-20061114114341/crawldb/old because it does not 
exist

Once the fetcher job got going
   2006-11-14 12:21:30,654 INFO  fetcher.Fetcher - Fetcher: segment: 
/user/crawler/crawl-20061114114341/segments/20061114120615
Andrzej's message queue subsystem apparently had tons of trouble creating
various files, with messages getting spit out every minute or so:

   2006-11-14 12:21:54,827 WARN  dfs.StateChange - DIR* 
NameSystem.startFile: failed to create file 
/mq/alerts/fetcher/.1163535714813.s7-crawlera.crc for 
DFSClient_task_0005_m_000010_0 on client s7-crawlera because 
pendingCreates is non-null.
   ...
   2006-11-14 13:10:50,457 WARN  dfs.StateChange - DIR* 
NameSystem.startFile: failed to create file 
/mq/alerts/fetcher/.1163538650385.s7-crawlera.crc for 
DFSClient_task_0005_m_000019_0 on client s7-crawlera because 
pendingCreates is non-null.

These may be correlated with entries in the datanode logs:

   2006-11-14 13:10:46,387 WARN  dfs.DataNode - DataXCeiver
   java.io.IOException: Block blk_-8970705341808748626 is valid, and 
cannot be written to.
           at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:285)
           at 
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:663)
           at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:507)
           at java.lang.Thread.run(Thread.java:595)

The namenode also noticed a few redundant addStoredBlock requests:

   2006-11-14 13:10:56,103 WARN  dfs.StateChange - BLOCK* 
NameSystem.addStoredBlock: Redundant addStoredBlock request received 
for blk_-4818305081009377296 on s10-crawlera:50010

But I *don't* find anything in the datanode logs mentioning these blocks.
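
(By "mentioning these blocks" I just mean grepping the datanode logs on 
each slave for the block id, something like the sketch below; the log 
file location and naming are an assumption based on the usual 
hadoop-<user>-datanode-<host>.log convention:)

   # run on each slave; log dir/name assumed
   grep blk_-4818305081009377296 logs/hadoop-*-datanode-*.log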

Finally (perhaps in response to me killing this first fetch job?), we see our
first ClosedChannelException in the namenode log:

   2006-11-14 13:40:44,634 WARN  ipc.Server - handler output error
   java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
     at 
org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
     at 
org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)

This is associated with "unknown opcode" messages in most of the datanode logs:

   s3-crawlera.prod.krugle.net: 2006-11-14 13:40:43,678 WARN 
dfs.DataNode - DataXCeiver
   s3-crawlera.prod.krugle.net: java.io.IOException: Unknown opcode 
for incoming data stream
   s3-crawlera.prod.krugle.net:    at 
org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:515)
   s3-crawlera.prod.krugle.net:    at java.lang.Thread.run(Thread.java:595)


--------------------------------------------------------------------------------
Here I did a stop-all.sh, deployed a version of nutch with a bit more logging,
and then did a start-all.sh
--------------------------------------------------------------------------------


The jobtracker log shows that it was apparently unable to connect to the
namenode just after it was launched the second time. The connection error
repeats over and over again, but eventually we get a few messages about the
web server starting up, which may indicate that the jobtracker was finally
able to connect to the namenode(?):

   2006-11-14 13:59:52,321 WARN  mapred.JobTracker - Starting tracker
   java.net.ConnectException: Connection refused
     at java.net.PlainSocketImpl.socketConnect(Native Method)
     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
     at java.net.Socket.connect(Socket.java:507)
     at java.net.Socket.connect(Socket.java:457)
     at java.net.Socket.<init>(Socket.java:365)
     at java.net.Socket.<init>(Socket.java:207)
     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
     at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
     at org.apache.hadoop.ipc.Client.call(Client.java:297)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
     at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
     at 
org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
     at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
     at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
     at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
     at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
   ...
   2006-11-14 14:00:18,352 WARN  mapred.JobTracker - Starting tracker
   java.net.ConnectException: Connection refused
     at java.net.PlainSocketImpl.socketConnect(Native Method)
     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
     at java.net.Socket.connect(Socket.java:507)
     at java.net.Socket.connect(Socket.java:457)
     at java.net.Socket.<init>(Socket.java:365)
     at java.net.Socket.<init>(Socket.java:207)
     at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:113)
     at org.apache.hadoop.ipc.Client.getConnection(Client.java:359)
     at org.apache.hadoop.ipc.Client.call(Client.java:297)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:245)
     at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:103)
     at 
org.apache.hadoop.dfs.DistributedFileSystem.<init>(DistributedFileSystem.java:47)
     at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:101)
     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
     at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:486)
     at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
     at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1200)
   2006-11-14 14:00:19,516 INFO  util.Credential - Checking Resource aliases
   2006-11-14 14:00:19,534 INFO  http.HttpServer - Version Jetty/5.1.4
   2006-11-14 14:00:20,423 INFO  util.Container - Started 
org.mortbay.jetty.servlet.WebApplicationHandler@69267649
   2006-11-14 14:00:20,492 INFO  util.Container - Started 
WebApplicationContext[/,/]
   2006-11-14 14:00:20,493 INFO  util.Container - Started 
HttpContext[/logs,/logs]
   2006-11-14 14:00:20,493 INFO  util.Container - Started 
HttpContext[/static,/static]
   2006-11-14 14:00:20,496 INFO  http.SocketListener - Started 
SocketListener on 0.0.0.0:50030
   2006-11-14 14:00:20,496 INFO  util.Container - Started 
org.mortbay.jetty.Server@b6e39f

Once the second crawl gets started
   2006-11-14 14:02:03,392 INFO  crawl.Injector - Injector: starting
the same issues with the new crawldb, the message queue, and redundant blocks
mentioned above appear again in the namenode log (not shown).

The DFS/RPC problems finally begin to surface in the jobtracker log (and these
bubble up to tool output) several hours into the second crawl:

   2006-11-14 17:07:27,539 WARN  fs.DFSClient - Problem renewing lease 
for DFSClient_-284911018: java.net.SocketTimeoutException: timed out 
waiting for rpc response
     at org.apache.hadoop.ipc.Client.call(Client.java:312)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
     at java.lang.Thread.run(Thread.java:595)

Nothing special appears during this same timeframe in the namenode log (just
the continuing message queue and redundant blocks errors).

Finally, there is an indication in the jobtracker log (embedded in the
continuing DFS/RPC problems) that the fetch job has died and is being garbage
collected:

   2006-11-14 18:46:37,681 WARN  fs.DFSClient - Problem renewing lease 
for DFSClient_-284911018: java.net.SocketTimeoutException: timed out 
waiting for rpc response
     at org.apache.hadoop.ipc.Client.call(Client.java:312)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
     at java.lang.Thread.run(Thread.java:595)

   2006-11-14 18:47:37,339 WARN  mapred.JobInProgress - Error cleaning 
up job_0005: java.net.SocketTimeoutException: timed out waiting for 
rpc response
   2006-11-14 18:47:38,696 WARN  fs.DFSClient - Problem renewing lease 
for DFSClient_-284911018: java.net.SocketTimeoutException: timed out 
waiting for rpc response
     at org.apache.hadoop.ipc.Client.call(Client.java:312)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
     at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
     at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:440)
     at java.lang.Thread.run(Thread.java:595)

About 20 minutes later, we finally see the first of a series of broken pipe
exceptions in the namenode log:

   2006-11-14 19:09:54,440 WARN  ipc.Server - handler output error
   java.io.IOException: Broken pipe
     at sun.nio.ch.FileDispatcher.write0(Native Method)
     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
     at sun.nio.ch.IOUtil.write(IOUtil.java:60)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:302)
     at 
org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
     at 
org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)

About 30 minutes later, the first of a series of closed channel exceptions
appears in the namenode log (along with an indication that the namenode may
have lost connection to all but one of its datanodes):

   2006-11-14 19:39:43,142 WARN  ipc.Server - handler output error
   java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:125)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:294)
     at 
org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:120)
     at 
org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:93)
     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:486)
   2006-11-14 19:39:43,142 WARN  fs.FSNamesystem - Replication 
requested of 3 is larger than cluster size (1). Using cluster size.

The namenode process eventually dies, though the jobtracker survives (as do
all of the datanodes and tasktrackers on the slaves).
-- 
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------
