hadoop-common-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-2341) Datanode active connections never returns to 0
Date Wed, 12 Dec 2007 06:59:43 GMT

[ https://issues.apache.org/jira/browse/HADOOP-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550838 ]

stack edited comment on HADOOP-2341 at 12/11/07 10:58 PM:
----------------------------------------------------------

On how to reproduce, first apply this to MapFile:

{code}
Index: src/java/org/apache/hadoop/io/MapFile.java
===================================================================
--- src/java/org/apache/hadoop/io/MapFile.java  (revision 603416)
+++ src/java/org/apache/hadoop/io/MapFile.java  (working copy)
@@ -563,8 +563,12 @@
     String out = args[1];
 
     Configuration conf = new Configuration();
-    FileSystem fs = FileSystem.getLocal(conf);
+    // FileSystem fs = FileSystem.getLocal(conf);
+    FileSystem fs = FileSystem.get(conf);
     MapFile.Reader reader = new MapFile.Reader(fs, in, conf);
+    LOG.info("READER opened -- go do a netstat -tn");
+    Thread.sleep(60*1000);
+    LOG.info("READER moving on to invoke next");
     MapFile.Writer writer =
       new MapFile.Writer(conf, fs, out, reader.getKeyClass(), reader.getValueClass());
 
@@ -574,8 +578,14 @@
 
     while (reader.next(key, value))               // copy all entries
       writer.append(key, value);
+    LOG.info("READER past append -- check netstat before close");
+    Thread.sleep(60*1000);
 
     writer.close();
+
+    reader.close();
+    LOG.info("READER closed -- waiting around 60seconds so can see netstat");
+    Thread.sleep(60*1000);
   }
 
 }
{code}

It adds a pause after opening a small map file in DFS, another after running through the append
(just before close), and one more after close, so you can run a netstat -tn and watch the
CLOSE_WAITs show up.

Build and then do:

{code}
% netstat -tn|grep CLOSE_WAIT|wc
{code}

Remember the count.
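If other processes on the box also hold CLOSE_WAIT sockets, it can help to narrow the count to connections on the datanode port. A small hedged variant of the command above, assuming the default datanode port 50010 (the port shown in the logs below); adjust the port if dfs.datanode.address is configured differently:

```shell
# Count only CLOSE_WAIT sockets whose remote end is the datanode port.
# Reads netstat-style output on stdin, so the same filter works on live
# output or on output saved to a file.
count_dn_close_waits() {
  grep ':50010 ' | grep -c CLOSE_WAIT || true
}

# live usage:
netstat -tn 2>/dev/null | count_dn_close_waits
```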

Then do the following, where the file is any smallish MapFile up in HDFS:

{code}
./bin/hadoop org.apache.hadoop.io.MapFile /hbase123/hregion_-70236052/info/mapfiles/3026597445031962455 /x
{code}

Put it in the background and run the netstat each time it logs.

On open, you'll see two new CLOSE_WAITs: one for the data file and the other for the index.
They stick around through the append and go away on close of the map file.
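The contract the repro exercises can be sketched in plain JDK terms. TwoFileReader below is a hypothetical stand-in for MapFile.Reader (not the real class), which holds one open stream for the data file and one for the index, and so pins two connections until close() releases both:

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for MapFile.Reader: it owns two underlying streams
// (data file + index file), so it pins two connections -- the two CLOSE_WAITs
// seen above -- until close() is called and BOTH streams are closed.
class TwoFileReader implements Closeable {
    private final Closeable data;   // stands in for the data-file stream
    private final Closeable index;  // stands in for the index-file stream

    TwoFileReader(Closeable data, Closeable index) {
        this.data = data;
        this.index = index;
    }

    @Override
    public void close() throws IOException {
        // Close the index even if closing the data stream throws,
        // so neither socket is left lingering in CLOSE_WAIT.
        try {
            data.close();
        } finally {
            index.close();
        }
    }
}
```

The repro's fix is exactly the reader.close() added before the final sleep; a caller that drops the reader without closing it leaves both sockets stuck.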

      was (Author: stack):
    On how to reproduce, first apply this to MapFile:

{code}
Index: src/java/org/apache/hadoop/io/MapFile.java
===================================================================
--- src/java/org/apache/hadoop/io/MapFile.java  (revision 602986)
+++ src/java/org/apache/hadoop/io/MapFile.java  (working copy)
@@ -563,8 +563,14 @@
     String out = args[1];
 
     Configuration conf = new Configuration();
-    FileSystem fs = FileSystem.getLocal(conf);
+    // FileSystem fs = FileSystem.getLocal(conf);
+    FileSystem fs = FileSystem.get(conf);
     MapFile.Reader reader = new MapFile.Reader(fs, in, conf);
+    
+    LOG.info("READER opened");
+    Thread.sleep(5*180*1000);
+    LOG.info("READER past sleep");
+    
     MapFile.Writer writer =
       new MapFile.Writer(conf, fs, out, reader.getKeyClass(), reader.getValueClass());
 
@@ -577,5 +583,4 @@
 
     writer.close();
   }
-
 }
{code}

Build and then do:

{code}
% netstat -tn|grep CLOSE_WAIT|wc
{code}

Remember the count.

Then do the following, where the file is any smallish MapFile up in HDFS:

{code}
./bin/hadoop org.apache.hadoop.io.MapFile /hbase123/hregion_-70236052/info/mapfiles/3026597445031962455 /x
{code}

Next do the netstat again.

You'll see two new CLOSE_WAITs, one for the data file and the other for the index.
> Datanode active connections never returns to 0
> ----------------------------------------------
>
>                 Key: HADOOP-2341
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2341
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.0
>            Reporter: Paul Saab
>         Attachments: dfsclient.patch, hregionserver-stack.txt, stacks-XX.XX.XX.XXX.txt, stacks-YY.YY.YY.YY.txt
>
>
> On trunk I continue to see the following in my data node logs:
> 2007-12-03 15:46:47,696 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 42
> 2007-12-03 15:46:48,135 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 41
> 2007-12-03 15:46:48,439 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 40
> 2007-12-03 15:46:48,479 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 39
> 2007-12-03 15:46:48,611 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 38
> 2007-12-03 15:46:48,898 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 37
> 2007-12-03 15:46:48,989 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 36
> 2007-12-03 15:46:51,010 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 35
> 2007-12-03 15:46:51,758 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 34
> 2007-12-03 15:46:52,148 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 33
> This number never returns to 0, even after many hours of no new data being manipulated or added into the DFS.
> Looking at netstat -tn, I see a significant amount of data in the send-q that never goes away:
> tcp        0  34240 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:55792   ESTABLISHED
> tcp        0  38968 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:38169   ESTABLISHED
> tcp        0  38456 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:35456   ESTABLISHED
> tcp        0  29640 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:59845   ESTABLISHED
> tcp        0  50168 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:44584   ESTABLISHED
> When sniffing the network I see that the remote side (YY.YY.YY.YY) is returning a window size of 0:
> 16:11:41.760474 IP XX.XX.XX.XXX.50010 > YY.YY.YY.YY.44584: . ack 3339984123 win 46 <nop,nop,timestamp 1786247180 885681789>
> 16:11:41.761597 IP YY.YY.YY.YY.44584 > XX.XX.XX.XXX.50010: . ack 1 win 0 <nop,nop,timestamp 885801786 1775711351>
> When we look at the stack traces on each datanode, there are tons of threads that *never* go away, stuck in the following trace:
> {code}
> Thread 6516 (org.apache.hadoop.dfs.DataNode$DataXceiver@166068b6):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.net.SocketOutputStream.socketWrite0(Native Method)
>     java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>     java.io.DataOutputStream.write(DataOutputStream.java:90)
>     org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1400)
>     org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1433)
>     org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:904)
>     org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
>     java.lang.Thread.run(Thread.java:619)
> {code}
> Unfortunately there's very little in the logs with exceptions that could point to this. I have some exceptions like the following, but nothing that points to problems between XX and YY:
> {code}
> 2007-12-02 11:19:47,889 WARN  dfs.DataNode - Unexpected error trying to delete block blk_4515246476002110310. Block not found in blockMap.
> 2007-12-02 11:19:47,922 WARN  dfs.DataNode - java.io.IOException: Error in deleting blocks.
>         at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:750)
>         at org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:675)
>         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:569)
>         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1720)
>         at java.lang.Thread.run(Thread.java:619)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

