hadoop-common-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject HDFS throughput problem after upgrade to Hadoop 2.2.0
Date Mon, 24 Aug 2015 15:49:48 GMT
Hi, 
Recently we upgraded our production cluster from Hadoop V1.1.0 to V2.2.0. One issue we have found is that HDFS throughput is worse than before.
We see a lot of "Timeout Exception" messages in the Hadoop DataNode logs. Here is the basic information about our cluster:
1) One HDFS NameNode
2) One HDFS 2nd NameNode
3) 42 Data/Task nodes
4) 2 Edge nodes
First, we observed that some HDFS clients (using "hadoop fs -put") get the following message on the console:
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.20.95.157:20888 remote=/10.20.95.157:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
From what I have seen so far, the message always complains about writing to the 3rd DataNode in the pipeline; the write fails and has to be retried.
Most of the time the HDFS write operation succeeds after the retry, but we get lots of "Timeout Exception" occurrences in the DataNode logs.
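For reference, the same write path can also be driven directly through the Java client API; the sketch below is only for illustration (the destination path and data sizes are made up), and it just exercises the same DFSClient write pipeline that "hadoop fs -put" uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutProbe {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath, like the shell does.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        byte[] buf = new byte[1 << 20]; // 1 MB of zeros, just to keep the pipeline busy
        // Hypothetical destination path; any path writable by the test user works.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/timeout-probe.dat"))) {
            for (int i = 0; i < 256; i++) { // roughly 256 MB, enough to span several blocks
                out.write(buf);
            }
        }
        fs.close();
    }
}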
Then we added the following settings in the "hdfs-site.xml":
  <property>
    <name>dfs.client.socket-timeout</name>
    <value>180000</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>960000</value>
  </property>
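If I understand the configuration correctly, both keys are also consulted on the client side by the DFSClient, so a client that does not read the cluster's hdfs-site.xml would still run with the defaults. Just for illustration, a minimal sketch of setting the same values explicitly on a client Configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientTimeoutConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same values as in hdfs-site.xml above, in milliseconds.
        conf.set("dfs.client.socket-timeout", "180000");
        conf.set("dfs.datanode.socket.write.timeout", "960000");
        FileSystem fs = FileSystem.get(conf); // client created with the raised timeouts
        System.out.println("Default FS: " + fs.getUri());
        fs.close();
    }
}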
But we still see lots of Timeout Exceptions, just with the longer timeout value, like the following:

2015-08-16 11:10:36,466 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/data1/hdfs/data/current, /data2/hdfs/data/current, /data3/hdfs/data/current, /data0/hdfs/data/current]'}, localName='p2-bigin144.ad.prodcc.net:50010', storageID='DS-709172270-10.20.95.176-50010-1427848090396', xmitsInProgress=0}:Exception transfering block BP-834217708-10.20.95.130-1438701195738:blk_1074671541_1099532031180 to mirror 10.20.95.162:50010:
java.net.SocketTimeoutException: 185000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.20.95.176:32663 remote=/10.20.95.162:50010]
What I have found out so far:
1) The timeout exceptions happen when connecting to many different nodes (the destination IPs keep changing), so it does not look like one bad DataNode is causing this.
2) "dfs.datanode.handler.count" is already set to 10, the same as before the upgrade, which I think is enough handler threads for the DataNodes (I plan to dump the DataNode JMX counters to double-check; see the sketch after this list).
3) Our daily HDFS usage did not change significantly before/after the upgrade.
4) I am still trying to find out whether anything changed in the network before/after the upgrade, but so far the network team says nothing did.
5) This is in our own data center, not in a public cloud.
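Here is the kind of probe I have in mind for point 2 (a rough sketch only; the host is the DataNode from the log above, and 50075 is the default DataNode HTTP port in 2.x). The /jmx servlet returns the daemon's MBeans as JSON, which should include the DataNode activity and thread counters:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class DataNodeJmxDump {
    public static void main(String[] args) throws Exception {
        // Default dfs.datanode.http.address port is 50075 in Hadoop 2.x.
        URL url = new URL("http://p2-bigin144.ad.prodcc.net:50075/jmx");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON dump of all exposed MBeans
            }
        }
    }
}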
Has anyone faced similar issues before? What should I check to find the root cause of this? Most MR jobs and HDFS operations do succeed eventually, but their performance is impacted by this.
Thanks
Yong