hadoop-hdfs-issues mailing list archives

From "hoelog (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7539) Namenode can't leave safemode because of Datanodes' IPC socket timeout
Date Wed, 17 Dec 2014 22:58:14 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hoelog updated HDFS-7539:
-------------------------
    Description: 
During namenode startup, datanodes appear to wait for the namenode's response over IPC in order to
register their block pools.

Here is the DN's log:
{code} 
2014-12-16 20:28:09,064 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging
ACTIVE Namenode Block pool BP-877672386-10.114.130.143-1412666752827 (Datanode Uuid 2117395f-e034-4b4a-adec-8a28464f4796)
service to NN.x.com/10.x.x143:9000 
{code}
But the namenode is too busy to respond, and the datanodes hit a socket timeout (the default is 1
minute):
{code}
2014-12-16 20:29:09,857 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException
in offerService
java.net.SocketTimeoutException: Call From DN1.x.com/10.x.x.84 to NN.x.com:9000 failed on
socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting
for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.x.x.84:57924
remote=NN.x.com/10.x.x.143:9000]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

{code}
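For reference, the 60000 ms in the stack trace appears to match Hadoop's default IPC ping interval, which the IPC client also seems to use as its socket read timeout. A minimal sketch of the stock defaults, written out as they would appear in core-site.xml (assumed from the 2.5.x defaults, not taken from this cluster's config):
{code}
<!-- Assumed stock defaults, for reference only: client pings enabled,
     ping / read-timeout interval of 60000 ms = the 1 minute seen above. -->
<property>
  <name>ipc.client.ping</name>
  <value>true</value>
</property>
<property>
  <name>ipc.ping.interval</name>
  <value>60000</value>
</property>
{code}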
The same events repeat, and eventually the NN drops most connection attempts from the DNs, so the
NN can't leave safemode.

DN's log:
{code}
2014-12-16 20:32:25,895 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException
in offerService
java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer
{code}
There are no problems with the network, configuration, or servers; I think the NN is simply too busy
to respond to the DNs within a minute.

I configured "ipc.ping.interval" to 15 minutes (900000 ms) in core-site.xml, and that was helpful
for my cluster:
{code}
<property>
  <name>ipc.ping.interval</name>
  <value>900000</value>
</property>
{code}
In my cluster, the namenode took between 1 and 5 minutes to respond to the DNs' requests, so the
15-minute interval leaves plenty of headroom. It would be helpful if there were a more elegant
solution. With the longer interval, registration and the block report completed, as in this DN log:
{code}
2014-12-16 23:28:16,598 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging
ACTIVE Namenode Block pool BP-877672386-10.x.x.143-1412666752827 (Datanode Uuid c4f7beea-b8e9-404f-bc81-6e87e37263d2)
service to NN/10.x.x.143:9000
2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Sent 1 blockreports
2090961 blocks total. Took 1690 msec to generate and 193738 msecs for RPC and NN processing.
 Got back commands org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@20e68e11
2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize
command for block pool BP-877672386-10.x.x.143-1412666752827
2014-12-16 23:31:32,032 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlockMap
2014-12-16 23:31:32,032 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2014-12-16 23:31:32,044 INFO org.apache.hadoop.util.GSet: 0.5% max memory 3.6 GB = 18.2 MB
2014-12-16 23:31:32,045 INFO org.apache.hadoop.util.GSet: capacity      = 2^21 = 2097152 entries
2014-12-16 23:31:32,046 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner:
Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-877672386-10.114.130.143-1412666752827
{code}
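As a possibly more elegant alternative (an assumption on my part, not verified on this cluster): since the trouble starts when all 128 DNs register and send their initial block reports at about the same time, staggering the first reports with dfs.blockreport.initialDelay in hdfs-site.xml might take some of the startup load off the NN. A hypothetical example:
{code}
<!-- Hypothetical example: each DN delays its first block report by a random
     amount of up to 600 seconds, so the NN is not hit by 128 reports at once.
     The value is in seconds; the default is 0 (no delay). -->
<property>
  <name>dfs.blockreport.initialDelay</name>
  <value>600</value>
</property>
{code}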

> Namenode can't leave safemode because of Datanodes' IPC socket timeout
> ----------------------------------------------------------------------
>
>                 Key: HDFS-7539
>                 URL: https://issues.apache.org/jira/browse/HDFS-7539
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.5.1
>         Environment: 1 master, 1 secondary, and 128 slaves; each node has 24 cores and 48 GB memory. The fsimage is 4 GB.
>            Reporter: hoelog
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
