hbase-user mailing list archives

From "Geoff Hendrey" <ghend...@decarta.com>
Subject RE: Region server goes away
Date Thu, 15 Apr 2010 04:20:21 GMT
Thanks for your help. See answers below.

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, April 14, 2010 8:45 PM
To: hbase-user@hadoop.apache.org
Cc: Paul Mahon; Bill Brune; Shaheen Bahauddin; Rohit Nigam
Subject: Re: Region server goes away

On Wed, Apr 14, 2010 at 8:27 PM, Geoff Hendrey <ghendrey@decarta.com> wrote:
> Hi,
>
> I have posted previously about issues I was having with HDFS when I was
> running HBase and HDFS pseudo-distributed on the same box. Now I have two
> very capable servers. I've set up HDFS with a datanode on each box. I've set
> up the namenode on one box, and ZooKeeper and the HBase master on the other
> box. Both boxes are region servers. I am using Hadoop 0.20.2 and HBase 0.20.3.

What do you have for replication?  If you have two datanodes, have you set it to two rather
than the default of 3?
Geoff: I didn't change the default, so it was 3. I will change it to 2 moving forward. Actually,
for now I am going to make it 1. For initial test runs I don't see why I need replication
at all.
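
For reference, a minimal sketch of the corresponding hdfs-site.xml entry, assuming replication
of 1 is only for throwaway test runs (with a single copy of each block, losing a datanode means
losing data):

  <!-- hdfs-site.xml: number of copies kept of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>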


>
> I have set dfs.datanode.socket.write.timeout to 0 in hbase-site.xml.
>
This is probably not necessary.


> I am running a mapreduce job with about 200 concurrent reducers, each 
> of which writes into HBase, with 32,000 row flush buffers.


Why don't you try with just a few reducers first and then build it up?
 See if that works?

G: I did, and I have been. This is the point where things fail. Smaller jobs with fewer reducers
succeed, but with larger jobs it's only a matter of time before something fails. These are big
jobs, and we have a sizable cluster in house: even with about 200 concurrent reducers, I have
thousands of pending reducers. I need to be able to fully utilize my MapReduce cluster and get
results quickly (in 8 to 16 hours). If I scale things back too much, I'm in a debug cycle where
it can take 3 to 4 *days* just to find out that I get a failure in the reduce phase.
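
If it helps, one way to cap the reducer count for a smaller test run, assuming the job reads
mapred.reduce.tasks from its configuration (the same thing can be set in code with
JobConf.setNumReduceTasks):

  <!-- job configuration, or -D mapred.reduce.tasks=20 on the command line -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>20</value>
  </property>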

> About 40% of the way through my job, HDFS started reporting one of the
> datanodes as dead (the one *not* on the same machine as the namenode).


Do you think it was dead -- what did a thread dump say? -- or was it just that you couldn't get
into it?  Any errors in the datanode logs complaining about the xceiver count, or perhaps you
need to up the number of handlers?

G: I got "dead" from 'hadoop dfsadmin -report', which replied "Datanodes available: 1 (2 total,
1 *dead*)" (my emphasis around 'dead').

G: Ahh, yes, I do see this in a datanode log: "java.io.IOException: xceiverCount 257 exceeds
the limit of concurrent xcievers 256". I apologize for not catching that myself. I will up
dfs.datanode.max.xcievers to 4096.

G: The number of handlers is set to a ridiculously high value: I had dfs.datanode.handler.count
set to 1000 in hdfs-site.xml. From further reading I understand that is far too high, so I'll
turn it down to 10 or 20 as recommended.
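
For reference, a sketch of the corresponding hdfs-site.xml entries on each datanode, using the
values discussed above (note that the property name really is spelled "xcievers" in Hadoop 0.20):

  <!-- cap on concurrent DataXceiver threads per datanode; the default of 256 is what the log hit -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

  <!-- datanode handler threads; 10-20 as recommended, instead of 1000 -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>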



> I stopped HBase, and magically the datanode came back to life.
>
> Any suggestions on how to increase the robustness?
>
>
> I see errors like this in the datanode's log:
>
> 2010-04-14 12:54:58,692 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.241.6.80:50010, storageID=DS-642079670-10.241.6.80-50010-1271178858027,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel


I believe this is harmless.  It's just the DN timing out the socket -- you set the timeout to
0 in hbase-site.xml rather than in hdfs-site.xml, where it would have an effect.  See HADOOP-3831
for details.

G: Thanks. I've now put it in the right place.
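
For reference, a minimal sketch of that setting in its right place in hdfs-site.xml; a value of
0 disables the datanode write timeout entirely (per the HADOOP-3831 discussion), which may or
may not be what you ultimately want:

  <!-- 0 disables the datanode socket write timeout (the default is 480000 ms, as in the log above) -->
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>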


>  to be ready for write. ch : java.nio.channels.SocketChannel[connected
> local=/10.241.6.80:50010 remote=/10.241.6.80:48320]
>        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:...)
>
>
> Here I show the output of 'hadoop dfsadmin -report'. The first time it is
> invoked, all is well. The second time, one datanode is dead. The third time,
> the dead datanode has come back to life:
>
> [hadoop@dt1 ~]$ hadoop dfsadmin -report
> Configured Capacity: 1277248323584 (1.16 TB)
> Present Capacity: 1208326105528 (1.1 TB)
> DFS Remaining: 1056438108160 (983.88 GB)
> DFS Used: 151887997368 (141.46 GB)
> DFS Used%: 12.57%
> Under replicated blocks: 3479
> Blocks with corrupt replicas: 0
> Missing blocks: 0
>
> -------------------------------------------------
> Datanodes available: 2 (2 total, 0 dead)
>
> Name: 10.241.6.79:50010
> Decommission Status : Normal
> Configured Capacity: 643733970944 (599.52 GB)
> DFS Used: 75694104268 (70.5 GB)
> Non DFS Used: 35150238004 (32.74 GB)
> DFS Remaining: 532889628672 (496.29 GB)
> DFS Used%: 11.76%
> DFS Remaining%: 82.78%
> Last contact: Wed Apr 14 11:20:59 PDT 2010
>
>

Yeah, my guess as per above is that the reporting client couldn't get on to the datanode because
handlers were full or xceivers exceeded.

Let us know how it goes.
St.Ack

G: I definitely will let you know. And thanks in advance for your support!


> Name: 10.241.6.80:50010
> Decommission Status : Normal
> Configured Capacity: 633514352640 (590.01 GB)
> DFS Used: 76193893100 (70.96 GB)
> Non DFS Used: 33771980052 (31.45 GB)
> DFS Remaining: 523548479488 (487.59 GB)
> DFS Used%: 12.03%
> DFS Remaining%: 82.64%
> Last contact: Wed Apr 14 11:14:37 PDT 2010
>
>
> [hadoop@dt1 ~]$ hadoop dfsadmin -report
> Configured Capacity: 643733970944 (599.52 GB)
> Present Capacity: 609294929920 (567.45 GB)
> DFS Remaining: 532876144640 (496.28 GB)
> DFS Used: 76418785280 (71.17 GB)
> DFS Used%: 12.54%
> Under replicated blocks: 3247
> Blocks with corrupt replicas: 0
> Missing blocks: 0
>
> -------------------------------------------------
> Datanodes available: 1 (2 total, 1 dead)
>
> Name: 10.241.6.79:50010
> Decommission Status : Normal
> Configured Capacity: 643733970944 (599.52 GB)
> DFS Used: 76418785280 (71.17 GB)
> Non DFS Used: 34439041024 (32.07 GB)
> DFS Remaining: 532876144640 (496.28 GB)
> DFS Used%: 11.87%
> DFS Remaining%: 82.78%
> Last contact: Wed Apr 14 11:28:38 PDT 2010
>
>
> Name: 10.241.6.80:50010
> Decommission Status : Normal
> Configured Capacity: 0 (0 KB)
> DFS Used: 0 (0 KB)
> Non DFS Used: 0 (0 KB)
> DFS Remaining: 0(0 KB)
> DFS Used%: 100%
> DFS Remaining%: 0%
> Last contact: Wed Apr 14 11:14:37 PDT 2010
>
>
> [hadoop@dt1 ~]$ hadoop dfsadmin -report
> Configured Capacity: 1277248323584 (1.16 TB)
> Present Capacity: 1210726427080 (1.1 TB)
> DFS Remaining: 1055440003072 (982.96 GB)
> DFS Used: 155286424008 (144.62 GB)
> DFS Used%: 12.83%
> Under replicated blocks: 3338
> Blocks with corrupt replicas: 0
> Missing blocks: 0
>
> -------------------------------------------------
> Datanodes available: 2 (2 total, 0 dead)
>
> Name: 10.241.6.79:50010
> Decommission Status : Normal
> Configured Capacity: 643733970944 (599.52 GB)
> DFS Used: 77775145981 (72.43 GB)
> Non DFS Used: 33086850051 (30.81 GB)
> DFS Remaining: 532871974912 (496.28 GB)
> DFS Used%: 12.08%
> DFS Remaining%: 82.78%
> Last contact: Wed Apr 14 11:29:44 PDT 2010
>
>
> Name: 10.241.6.80:50010
> Decommission Status : Normal
> Configured Capacity: 633514352640 (590.01 GB)
> DFS Used: 77511278027 (72.19 GB)
> Non DFS Used: 33435046453 (31.14 GB)
> DFS Remaining: 522568028160 (486.68 GB)
> DFS Used%: 12.24%
> DFS Remaining%: 82.49%
> Last contact: Wed Apr 14 11:29:44 PDT 2010
>
>
>
>
