hbase-user mailing list archives

From y_823...@tsmc.com
Subject Re: Region server goes away
Date Thu, 15 Apr 2010 03:42:58 GMT
The following exception had been showing up in my cluster.
I never saw that error message again after setting
dfs.datanode.socket.write.timeout = 0

hdfs-site.xml

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>




Fleming Chiu(邱宏明)
707-6128
y_823910@tsmc.com
Meat-free Mondays: eat vegetarian to save the planet (Meat Free Monday Taiwan)




                                                                                         
                                                            
                      "Geoff Hendrey"                                                    
                                                            
                      <ghendrey@decarta        To:      <hbase-user@hadoop.apache.org>
                                                               
                      .com>                    cc:      "Paul Mahon" <pmahon@decarta.com>,
"Bill Brune" <bbrune@decarta.com>, "Shaheen Bahauddin"     
                                                <sbahauddin@decarta.com>, "Rohit Nigam"
<rnigam@decarta.com>, (bcc: Y_823910/TSMC)                    
                      2010/04/15 11:27         Subject: Region server goes away          
                                                            
                      AM                                                                 
                                                            
                      Please respond to                                                  
                                                            
                      hbase-user                                                         
                                                            
                                                                                         
                                                            
                                                                                         
                                                            




Hi,

I have posted previously about issues I was having with HDFS when I was
running HBase and HDFS pseudo-clustered on the same box. Now I have
two very capable servers. I've set up HDFS with a datanode on each box.
I've set up the namenode on one box, and ZooKeeper and the HBase master on
the other box. Both boxes are region servers. I am using Hadoop 0.20.2 and
HBase 0.20.3.

I have set dfs.datanode.socket.write.timeout to 0 in hbase-site.xml.
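
(For reference, the property entry itself is the same wherever it lives; a minimal
sketch of the hbase-site.xml form described here, assuming the HBase-side DFS client
reads it from that file, while the datanode reads the copy placed in hdfs-site.xml as
in the reply above:

<!-- hbase-site.xml; the datanode-side copy goes in hdfs-site.xml -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
)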

I am running a mapreduce job with about 200 concurrent reducers, each of
which writes into HBase with a 32,000-row flush buffer. About 40% of the
way through the job, HDFS started reporting that one of the datanodes was
dead (the one *not* on the same machine as the namenode). I stopped HBase,
and magically the datanode came back to life.
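
As a rough illustration of that write pattern, a reducer writing to HBase with the
0.20.x client API and an explicit flush every 32,000 rows might look like the sketch
below. The table name, column family, and row contents are hypothetical; only the
flush interval comes from this post.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWriteSketch {
  // Explicit flush interval, as described in the post.
  private static final int ROWS_PER_FLUSH = 32000;

  public static void writeRows(int totalRows) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "my_table"); // hypothetical table
    table.setAutoFlush(false); // buffer puts client-side instead of one RPC per row

    int buffered = 0;
    for (int i = 0; i < totalRows; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      table.put(put);
      if (++buffered >= ROWS_PER_FLUSH) {
        // Push the buffered rows to the region servers. Note that the client may
        // also flush on its own once its write buffer (writeBufferSize) fills.
        table.flushCommits();
        buffered = 0;
      }
    }
    table.flushCommits(); // flush any remaining rows
  }
}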

Any suggestions on how to increase the robustness?


I see errors like this in the datanode's log:

2010-04-14 12:54:58,692 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.241.6.80:50010,
storageID=DS-642079670-10.241.6.80-50010-1271178858027,
infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel
to be ready for write. ch : java.nio.channels.SocketChannel[connected
local=/10.241.6.80:50010 remote=/10.241.6.80:48320]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:


Here is the output of 'hadoop dfsadmin -report'. The first time it is
invoked, all is well. The second time, one datanode is dead. The third time,
the dead datanode has come back to life:

[hadoop@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 1277248323584 (1.16 TB)
Present Capacity: 1208326105528 (1.1 TB)
DFS Remaining: 1056438108160 (983.88 GB)
DFS Used: 151887997368 (141.46 GB)
DFS Used%: 12.57%
Under replicated blocks: 3479
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 75694104268 (70.5 GB)
Non DFS Used: 35150238004 (32.74 GB)
DFS Remaining: 532889628672(496.29 GB)
DFS Used%: 11.76%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:20:59 PDT 2010


Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 633514352640 (590.01 GB)
DFS Used: 76193893100 (70.96 GB)
Non DFS Used: 33771980052 (31.45 GB)
DFS Remaining: 523548479488(487.59 GB)
DFS Used%: 12.03%
DFS Remaining%: 82.64%
Last contact: Wed Apr 14 11:14:37 PDT 2010


[hadoop@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 643733970944 (599.52 GB)
Present Capacity: 609294929920 (567.45 GB)
DFS Remaining: 532876144640 (496.28 GB)
DFS Used: 76418785280 (71.17 GB)
DFS Used%: 12.54%
Under replicated blocks: 3247
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (2 total, 1 dead)

Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 76418785280 (71.17 GB)
Non DFS Used: 34439041024 (32.07 GB)
DFS Remaining: 532876144640(496.28 GB)
DFS Used%: 11.87%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:28:38 PDT 2010


Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 0 (0 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Wed Apr 14 11:14:37 PDT 2010


[hadoop@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 1277248323584 (1.16 TB)
Present Capacity: 1210726427080 (1.1 TB)
DFS Remaining: 1055440003072 (982.96 GB)
DFS Used: 155286424008 (144.62 GB)
DFS Used%: 12.83%
Under replicated blocks: 3338
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 77775145981 (72.43 GB)
Non DFS Used: 33086850051 (30.81 GB)
DFS Remaining: 532871974912(496.28 GB)
DFS Used%: 12.08%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:29:44 PDT 2010


Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 633514352640 (590.01 GB)
DFS Used: 77511278027 (72.19 GB)
Non DFS Used: 33435046453 (31.14 GB)
DFS Remaining: 522568028160(486.68 GB)
DFS Used%: 12.24%
DFS Remaining%: 82.49%
Last contact: Wed Apr 14 11:29:44 PDT 2010










