hadoop-mapreduce-user mailing list archives

From Nicolas Liochon <nkey...@gmail.com>
Subject Re: Datanodes shutdown and HBase's regionservers not working
Date Mon, 25 Feb 2013 10:07:27 GMT
I agree.
Then for HDFS, ...
The first thing to check is the network, I would say.
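For example, a quick first pass on each node could look something like this
(just a sketch, assuming the usual Linux tools and that eth0 is the interface
in use; adjust to your setup):

# link errors/drops reported by the kernel
ifconfig eth0 | grep -i -E 'error|drop'
# NIC driver counters
ethtool -S eth0 | grep -i -E 'err|drop'
# TCP retransmits and timeouts
netstat -s | grep -i -E 'retrans|timeout'
# reachability/latency between the nodes
ping -c 10 192.168.1.148
# NIC resets or link flaps in the kernel log
dmesg | grep -i eth0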




On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan <davey.yan@gmail.com> wrote:

> Thanks for reply, Nicolas.
>
> My question: What can lead to shutdown of all of the datanodes?
> I believe that the regionservers will be OK if the HDFS is OK.
>
>
> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <nkeywal@gmail.com>
> wrote:
> > Ok, what's your question?
> > When you say the datanodes went down, was it just the datanode processes,
> > or the whole machines, i.e. both the datanodes and the regionservers?
> >
> > Datanodes send a heartbeat to the NameNode every 3 seconds. However, the
> > NameNode only marks a datanode as dead internally after 10:30 minutes
> > without a heartbeat (even if the GUI already shows 'no answer for x
> > minutes').
> > HBase monitoring is done through ZooKeeper. By default, a regionserver is
> > considered dead after 180s with no answer; before that, it is considered
> > live.
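> > Both of these are configurable if you want faster detection. As a rough
> > sketch (double-check the property names against your versions): in
> > hbase-site.xml
> >         <property>
> >                 <name>zookeeper.session.timeout</name>
> >                 <value>180000</value>
> >         </property>
> > and on the HDFS side the dead-node delay is derived from
> > heartbeat.recheck.interval and dfs.heartbeat.interval in hdfs-site.xml.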
> > When you stop a regionserver, it tries to flush its data to disk (i.e.
> > HDFS, i.e. the datanodes). That's why, if you have no datanodes, or if a
> > high ratio of your datanodes are dead, it can't shut down. The connection
> > refused and socket timeout errors come from the fact that, before the
> > 10:30 minutes are up, HDFS does not declare the nodes dead, so HBase
> > keeps trying to use them (and, obviously, fails). Note that there is now
> > an intermediate state for HDFS datanodes, called "stale": a stale
> > datanode is used only if you have to (i.e. it is the only datanode with
> > a block replica you need). This will be documented in HBase for the 0.96
> > release. But if all your datanodes are down it won't change much.
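> > For reference, on versions that have it (it came in with HDFS-3703, so
> > not hadoop 1.0.3), the stale behaviour is driven by hdfs-site.xml settings
> > roughly like the following; treat this as a sketch and check your docs:
> >         <property>
> >                 <name>dfs.namenode.avoid.read.stale.datanode</name>
> >                 <value>true</value>
> >         </property>
> >         <property>
> >                 <name>dfs.namenode.stale.datanode.interval</name>
> >                 <value>30000</value>
> >         </property>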
> >
> > Cheers,
> >
> > Nicolas
> >
> >
> >
> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <davey.yan@gmail.com> wrote:
> >>
> >> Hey guys,
> >>
> >> We have a cluster with 5 nodes (1 NN and 4 DNs) that has been running for
> >> more than a year, and it has worked fine.
> >> But the datanodes have shut down twice in the last month.
> >>
> >> When the datanodes shut down, all of them became "Dead Nodes" in
> >> the NN web admin UI (http://ip:50070/dfshealth.jsp),
> >> but the HBase regionservers still showed as live in the HBase web
> >> admin (http://ip:60010/master-status); of course, they were zombies.
> >> All of the JVM processes were still running, including
> >> hmaster/namenode/regionserver/datanode.
> >>
> >> When the datanodes shut down, the load on the slaves (from the "top"
> >> command) became very high, more than 10, much higher than during normal
> >> running.
> >> From the "top" command, we saw that the datanode and regionserver
> >> processes were consuming the CPU.
> >>
> >> We could not stop the HBase or Hadoop cluster through the normal
> >> commands (stop-*.sh / *-daemon.sh stop *).
> >> So we stopped the datanodes and regionservers with kill -9 PID, after
> >> which the load on the slaves returned to a normal level, and we started
> >> the cluster again.
> >>
> >>
> >> Log of the NN at the shutdown point (all of the DNs were removed):
> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.152:50010
> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.149:50010
> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.149:50010
> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.150:50010
> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.150:50010
> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
> >> 192.168.1.148:50010
> >> 2013-02-22 11:10:03,339 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.148:50010
> >>
> >>
> >> Logs on the DNs showed many IOException and
> >> SocketTimeoutException errors:
> >> 2013-02-22 11:02:52,354 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(192.168.1.148:50010,
> >> storageID=DS-970284113-117.25.149.160-50010-1328074119937,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.io.IOException: Interrupted receiveBlock
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> 2013-02-22 11:03:44,823 WARN
> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(192.168.1.148:50010,
> >> storageID=DS-970284113-117.25.149.160-50010-1328074119937,
> >> infoPort=50075, ipcPort=50020):Got exception while serving
> >> blk_-1985405101514576650_247001 to /192.168.1.148:
> >> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> >> for channel to be ready for write. ch :
> >> java.nio.channels.SocketChannel[connected local=/192.168.1.148:50010
> >> remote=/192.168.1.148:48654]
> >>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> 2013-02-22 11:09:42,294 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(192.168.1.148:50010,
> >> storageID=DS-970284113-117.25.149.160-50010-1328074119937,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> >> for channel to be ready for write. ch :
> >> java.nio.channels.SocketChannel[connected local=/192.168.1.148:50010
> >> remote=/192.168.1.148:37188]
> >>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> 2013-02-22 11:12:41,892 INFO
> >> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> >> succeeded for blk_-2674357249542194287_43419
> >>
> >>
> >> Here is our env:
> >> hadoop 1.0.3
> >> hbase 0.94.1 (snappy enabled)
> >>
> >> java version "1.6.0_31"
> >> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
> >> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
> >>
> >> # ulimit -a
> >> core file size          (blocks, -c) 0
> >> data seg size           (kbytes, -d) unlimited
> >> scheduling priority             (-e) 20
> >> file size               (blocks, -f) unlimited
> >> pending signals                 (-i) 16382
> >> max locked memory       (kbytes, -l) 64
> >> max memory size         (kbytes, -m) unlimited
> >> open files                      (-n) 32768
> >> pipe size            (512 bytes, -p) 8
> >> POSIX message queues     (bytes, -q) 819200
> >> real-time priority              (-r) 0
> >> stack size              (kbytes, -s) 8192
> >> cpu time               (seconds, -t) unlimited
> >> max user processes              (-u) 32768
> >> virtual memory          (kbytes, -v) unlimited
> >> file locks                      (-x) unlimited
> >>
> >> # uname -a
> >> Linux ubuntu6401 2.6.32-33-server #70-Ubuntu SMP Thu Jul 7 22:28:30
> >> UTC 2011 x86_64 GNU/Linux
> >>
> >>
> >> # free (master)
> >>              total       used       free     shared    buffers     cached
> >> Mem:      24732936    8383708   16349228          0     490584    2580356
> >> -/+ buffers/cache:    5312768   19420168
> >> Swap:     72458232          0   72458232
> >>
> >>
> >> # free (slaves)
> >>              total       used       free     shared    buffers     cached
> >> Mem:      24733000   22824276    1908724          0     862556   15303304
> >> -/+ buffers/cache:    6658416   18074584
> >> Swap:     72458232        264   72457968
> >>
> >>
> >> Some important conf:
> >> core-site.xml
> >>         <property>
> >>                 <name>io.file.buffer.size</name>
> >>                 <value>65536</value>
> >>         </property>
> >>
> >> hdfs-site.xml
> >>         <property>
> >>                 <name>dfs.block.size</name>
> >>                 <value>134217728</value>
> >>         </property>
> >>         <property>
> >>                 <name>dfs.datanode.max.xcievers</name>
> >>                 <value>4096</value>
> >>         </property>
> >>         <property>
> >>                 <name>dfs.support.append</name>
> >>                 <value>true</value>
> >>         </property>
> >>         <property>
> >>                 <name>dfs.replication</name>
> >>                 <value>2</value>
> >>         </property>
> >>
> >>
> >> Hope you can help us.
> >> Thanks in advance.
> >>
> >>
> >>
> >> --
> >> Davey Yan
> >
> >
>
>
>
> --
> Davey Yan
>
