hbase-user mailing list archives

From Jeff Whiting <je...@qualtrics.com>
Subject Re: Should a data node restart cause a region server to go down?
Date Mon, 06 Feb 2012 20:01:42 GMT
I've been able to reproduce this on multiple clusters. I'm basically doing a rolling restart of the data nodes, one every 5-10+ minutes, yet the region servers still just die. "hadoop fsck /" reports the filesystem as healthy, the web interface shows all the data nodes up, and the region server logs look quiet. On this last cluster the region servers went down (so I restarted them), but the master could never reassign the root region and was simply stuck until I restarted the master. It is a little frustrating, since it seems like HBase should be able to recover from this on its own.
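
For what it's worth, between restarts I've just been eyeballing the web UI. Below is a rough sketch of the kind of check I could script between restarts instead (my own throwaway Java, not anything that ships with Hadoop or HBase): poll the namenode until every data node is heartbeating again before moving on to the next restart. The 4-node count and the 30-second heartbeat threshold are just my cluster; getDataNodeStats() returns the same report as "hadoop dfsadmin -report".

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Throwaway sketch: wait until all data nodes have reported a recent heartbeat
// before restarting the next one.  Assumes the cluster config is on the classpath.
public class WaitForDataNodes {
    private static final int EXPECTED_NODES = 4;                  // my cluster size
    private static final long MAX_HEARTBEAT_AGE_MS = 30 * 1000L;  // my threshold

    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        while (true) {
            int fresh = 0;
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {  // same data as dfsadmin -report
                long age = System.currentTimeMillis() - dn.getLastUpdate();
                if (age < MAX_HEARTBEAT_AGE_MS) {
                    fresh++;
                }
            }
            System.out.println(fresh + "/" + EXPECTED_NODES + " data nodes heartbeating");
            if (fresh >= EXPECTED_NODES) {
                break;
            }
            Thread.sleep(10 * 1000L);
        }
        dfs.close();
    }
}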

So here is the region server log. There are large gaps between the recovery and the eventual failure, which may be due to my slow data node restarts.

12/02/06 18:33:35 WARN hdfs.DFSClient: Error Recovery for block blk_5522208212139439103_25405 in pipeline 10.130.163.92:50010, 10.131.67.12:50010, 10.130.145.236:50010: bad datanode 10.130.145.236:50010
**6 minute gap here**
12/02/06 18:39:20 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_5522208212139439103_25412java.io.EOFException
     at java.io.DataInputStream.readFully(DataInputStream.java:180)
     at java.io.DataInputStream.readLong(DataInputStream.java:399)
     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2634)

12/02/06 18:39:20 WARN hdfs.DFSClient: Error Recovery for block blk_5522208212139439103_25412 bad datanode[0] 10.130.163.92:50010
12/02/06 18:39:20 WARN hdfs.DFSClient: Error Recovery for block blk_5522208212139439103_25412 in pipeline 10.130.163.92:50010, 10.131.67.12:50010: bad datanode 10.130.163.92:50010
12/02/06 18:39:20 INFO ipc.Client: Retrying connect to server: /10.131.67.12:50020. Already tried 0 time(s).
**18 minute gap here**
12/02/06 18:57:11 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_5522208212139439103_25413java.io.EOFException
     at java.io.DataInputStream.readFully(DataInputStream.java:180)
     at java.io.DataInputStream.readLong(DataInputStream.java:399)
     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2634)
12/02/06 18:57:11 WARN hdfs.DFSClient: Error Recovery for block blk_5522208212139439103_25413 bad datanode[0] 10.131.67.12:50010
12/02/06 18:57:11 WARN hdfs.DFSClient: Error while syncing
java.io.IOException: All datanodes 10.131.67.12:50010 are bad. Aborting...
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2766)
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
12/02/06 18:57:11 FATAL wal.HLog: Could not append. Requesting close of hlog
java.io.IOException: Reflection
     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
     at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:981)
     at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:958)
Caused by: java.lang.reflect.InvocationTargetException
     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
     ... 2 more
Caused by: java.io.IOException: All datanodes 10.131.67.12:50010 are bad. Aborting...
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2766)
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
12/02/06 18:57:11 ERROR wal.HLog: Error while syncing, requesting close of hlog

Any ideas as to why this is happening? The region servers seem fine until I restart the last data node. I've tested this several times since starting this email, and the problems with the RS only start after the last data node is restarted. Is the client somehow remembering that a data node was bad, even though it has since been restarted and has been serving data successfully for 15+ minutes, so that when I restart the last node it just gives up? fsck says that everything is healthy and fully replicated.
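
To make the question concrete, here is a toy model of what I suspect is going on. This is NOT the real DFSClient code, just my mental picture: the long-lived hlog output stream keeps its write pipeline for the life of the block, drops any datanode that errors out during a restart, never re-adds it, and aborts once the last original member goes away, no matter how healthy the restarted nodes are by then.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only -- not DFSClient.  The stream drops pipeline members on error,
// never re-adds them, and gives up once the pipeline is empty.
public class PipelineModel {
    private final List<String> pipeline;

    public PipelineModel(String... datanodes) {
        this.pipeline = new ArrayList<String>(Arrays.asList(datanodes));
    }

    // Called when a pipeline member errors out (e.g. while it is being restarted).
    public void handleError(String badNode) throws IOException {
        pipeline.remove(badNode);            // excluded for the life of the stream
        if (pipeline.isEmpty()) {
            throw new IOException("All datanodes are bad. Aborting...");
        }
        System.out.println("recovered, pipeline now: " + pipeline);
    }

    public static void main(String[] args) throws IOException {
        PipelineModel hlog = new PipelineModel("dn1:50010", "dn2:50010", "dn3:50010");
        hlog.handleError("dn3:50010");   // first restart: two members left
        hlog.handleError("dn1:50010");   // second restart: one member left
        hlog.handleError("dn2:50010");   // last original member restarted -> abort
    }
}

If that is roughly what the real client does, it would explain why fsck and the web UI look healthy while the RS still dies: the decision is made entirely client-side, inside the open hlog stream, and nothing ever tells it that the restarted nodes are back.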

~Jeff

On 2/6/2012 11:28 AM, Jeff Whiting wrote:
> I was increasing the storage on some of my data nodes and thus had to restart the data node.
> I use cdh3u2 and ran "/etc/init.d/hadoop-0.20-datanode restart" (I don't think this is a cdh
> problem). Unfortunately doing the restart caused region servers to go offline. Is this expected
> behavior? It seems like they should recover from that just fine, without giving up and dying,
> since there were other data nodes available. Here are the logs on the region server from when I
> restarted the data node to when it decided to give up. To give you a little background, I'm
> running a small cluster with 4 region servers and 4 data nodes.
>
> Thanks,
> ~Jeff
>
> 12/02/06 18:06:03 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-4249058562504578427_18197java.io.IOException: Bad response 1 for block blk_-4249058562504578427_18197 from datanode 10.49.129.134:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)
>
> 12/02/06 18:06:03 INFO hdfs.DFSClient: Error Recovery for block blk_-4249058562504578427_18197 waiting for responder to exit.
> 12/02/06 18:06:03 WARN hdfs.DFSClient: Error Recovery for block blk_-4249058562504578427_18197 bad datanode[2] 10.49.129.134:50010
> 12/02/06 18:06:03 WARN hdfs.DFSClient: Error Recovery for block blk_-4249058562504578427_18197 in pipeline 10.59.39.142:50010, 10.234.50.225:50010, 10.49.129.134:50010, 10.49.29.92:50010: bad datanode 10.49.129.134:50010
> 12/02/06 18:06:03 WARN wal.HLog: HDFS pipeline error detected. Found 3 replicas but expecting 4 replicas.  Requesting close of hlog.
> 12/02/06 18:06:03 INFO wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
> 12/02/06 18:06:03 WARN regionserver.ReplicationSourceManager: Replication stopped, won't add new log
> 12/02/06 18:06:03 INFO wal.HLog: Roll /hbase/.logs/ip-10-59-39-142.eu-west-1.compute.internal,60020,1328142685179/ip-10-59-39-142.eu-west-1.compute.internal%3A60020.1328549504988, entries=3644, filesize=12276680. New hlog /hbase/.logs/ip-10-59-39-142.eu-west-1.compute.internal,60020,1328142685179/ip-10-59-39-142.eu-west-1.compute.internal%3A60020.1328551563518
>
> 12/02/06 18:06:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 10.49.129.134:50010
> 12/02/06 18:06:04 INFO hdfs.DFSClient: Abandoning block blk_6156813298944908969_18211
> 12/02/06 18:06:04 INFO hdfs.DFSClient: Excluding datanode 10.49.129.134:50010
> 12/02/06 18:06:04 WARN wal.HLog: HDFS pipeline error detected. Found 3 replicas but expecting 4 replicas.  Requesting close of hlog.
> 12/02/06 18:07:06 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-165678744483388406_18211java.io.IOException: Bad response 1 for block blk_-165678744483388406_18211 from datanode 10.234.50.225:50010
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)
>
> 12/02/06 18:07:06 INFO hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18211 waiting for responder to exit.
> 12/02/06 18:07:06 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18211 bad datanode[2] 10.234.50.225:50010
> 12/02/06 18:07:06 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18211 in pipeline 10.59.39.142:50010, 10.49.29.92:50010, 10.234.50.225:50010: bad datanode 10.234.50.225:50010
> 12/02/06 18:09:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-165678744483388406_18214java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:210)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>     at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>     at java.io.DataInputStream.readFully(DataInputStream.java:178)
>     at java.io.DataInputStream.readLong(DataInputStream.java:399)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2634)
>
> 12/02/06 18:09:21 INFO hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18214 waiting for responder to exit.
> 12/02/06 18:09:21 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18214 bad datanode[0] 10.59.39.142:50010
> 12/02/06 18:09:21 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18214 in pipeline 10.59.39.142:50010, 10.49.29.92:50010: bad datanode 10.59.39.142:50010
> 12/02/06 18:09:55 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-165678744483388406_18221java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:210)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>     at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>     at java.io.DataInputStream.readFully(DataInputStream.java:178)
>     at java.io.DataInputStream.readLong(DataInputStream.java:399)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2634)
>
> 12/02/06 18:09:55 INFO hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18221 waiting for responder to exit.
> 12/02/06 18:09:56 WARN hdfs.DFSClient: Error Recovery for block blk_-165678744483388406_18221 bad datanode[0] 10.49.29.92:50010
> 12/02/06 18:09:56 WARN hdfs.DFSClient: Error while syncing
> java.io.IOException: All datanodes 10.49.29.92:50010 are bad. Aborting...
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2766)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
> 12/02/06 18:09:56 WARN hdfs.DFSClient: Error while syncing
> java.io.IOException: All datanodes 10.49.29.92:50010 are bad. Aborting...
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2766)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
> 12/02/06 18:09:56 FATAL wal.HLog: Could not append. Requesting close of hlog
> java.io.IOException: Reflection
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
>     at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:981)
>     at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:958)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
>     ... 2 more
> Caused by: java.io.IOException: All datanodes 10.49.29.92:50010 are bad. Aborting...
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2766)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
> 12/02/06 18:09:56 ERROR wal.HLog: Error while syncing, requesting close of hlog
> java.io.IOException: Reflection
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
>     at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:981)
>     at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:958)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
>     ... 2 more
> Caused by: java.io.IOException: All datanodes 10.49.29.92:50010 are bad. Aborting...
>

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com

