hbase-user mailing list archives

From "Geoff Hendrey" <ghend...@decarta.com>
Subject RE: Region server goes away
Date Thu, 15 Apr 2010 17:34:34 GMT
No, I didn't make any changes. I doubt it is garbage-collection related.
This error happens immediately upon startup when nothing is accessing
HBase, and the error continues periodically with seeming regularity.
Also, I am running on 64-bit machines, with gobs of heap per Hadoop
process.

-g

-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Thursday, April 15, 2010 10:31 AM
To: hbase-user@hadoop.apache.org
Subject: RE: Region server goes away



Did you make changes to your garbage collection?

Could be that you've swamped your nodes and they're timing out due to GC runs.


> Subject: RE: Region server goes away
> Date: Thu, 15 Apr 2010 10:25:45 -0700
> From: ghendrey@decarta.com
> To: hbase-user@hadoop.apache.org
> 
> After making all the recommended config changes, the only issue I see
> is this, in the zookeeper logs. It happens repeatedly. The HBase shell
> seems to work fine, running it on the same machine as the zookeeper.
> Any ideas? I reviewed a thread on the email list about this topic, but
> it seemed inconclusive:
> 
> 2010-04-15 04:14:36,048 WARN
> org.apache.zookeeper.server.PrepRequestProcessor: Got exception when
> processing sessionid:0x128012c809c0000 type:create cxid:0x4
> zxid:0xfffffffffffffffe txntype:unknown n/a
> org.apache.zookeeper.KeeperException$NodeExistsException:
> KeeperErrorCode = NodeExists
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:245)
>         at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
> 
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Wednesday, April 14, 2010 8:45 PM
> To: hbase-user@hadoop.apache.org
> Cc: Paul Mahon; Bill Brune; Shaheen Bahauddin; Rohit Nigam
> Subject: Re: Region server goes away
> 
> On Wed, Apr 14, 2010 at 8:27 PM, Geoff Hendrey <ghendrey@decarta.com> wrote:
> > Hi,
> >
> > I have posted previously about issues I was having with HDFS when I
> > was running HBase and HDFS on the same box, both pseudo-clustered.
> > Now I have two very capable servers. I've set up HDFS with a datanode
> > on each box. I've set up the namenode on one box, and the zookeeper
> > and HBase master on the other box. Both boxes are region servers. I
> > am using Hadoop 0.20.2 and HBase 0.20.3.
> 
> What do you have for replication?  If two datanodes, have you set it
> to two rather than the default of 3?
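For reference, a minimal hdfs-site.xml sketch for a two-datanode cluster
(the value of 2 is an assumption, not something confirmed in this thread):

    <property>
      <name>dfs.replication</name>
      <value>2</value>
      <!-- with only two datanodes, the default of 3 can never be met,
           which would explain the thousands of under-replicated blocks
           in the dfsadmin reports further down -->
    </property>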
> 
> 
> >
> > I have set dfs.datanode.socket.write.timeout to 0 in hbase-site.xml.
> >
> This is probably not necessary.
> 
> 
> > I am running a mapreduce job with about 200 concurrent reducers, 
> > each of which writes into HBase, with 32,000 row flush buffers.
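As a point of reference, a minimal sketch of how a reducer might batch its
writes with the HBase 0.20 client API. The table name, column family,
qualifier, loop, and the 32,000-row flush interval are illustrative
assumptions, not details taken from this thread:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedWriter {
        public static void main(String[] args) throws IOException {
            HBaseConfiguration conf = new HBaseConfiguration();
            HTable table = new HTable(conf, "mytable");   // hypothetical table name
            table.setAutoFlush(false);                    // buffer puts client-side
            table.setWriteBufferSize(12 * 1024 * 1024);   // buffer size is in bytes, not rows
            long count = 0;
            for (long i = 0; i < 100000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
                table.put(put);
                if (++count % 32000 == 0) {
                    table.flushCommits();  // explicit flush every 32,000 rows
                }
            }
            table.flushCommits();  // flush whatever is left in the buffer
            table.close();
        }
    }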
> 
> 
> Why don't you try with just a few reducers first and then build it up?
>  See if that works?
> 
> 
> > About 40% of the way through my job, HDFS started showing one of
> > the datanodes as dead (the one *not* on the same machine as the
> > namenode).
> 
> 
> Do you think it was dead -- what did a thread dump say? -- or was it
> just that you couldn't get into it?  Any errors in the datanode logs
> complaining about the xceiver count, or perhaps you need to up the
> number of handlers?
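In case it helps, a sketch of the datanode-side settings usually meant here,
in hdfs-site.xml; the numbers are illustrative, not values recommended in
this thread:

    <property>
      <name>dfs.datanode.max.xcievers</name>  <!-- yes, the property name is misspelled upstream -->
      <value>2048</value>
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>10</value>
    </property>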
> 
> 
> 
> > I stopped HBase, and magically the datanode came back to life.
> >
> > Any suggestions on how to increase the robustness?
> >
> >
> > I see errors like this in the datanode's log:
> >
> > 2010-04-14 12:54:58,692 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(10.241.6.80:50010,
> > storageID=DS-642079670-10.241.6.80-50010-1271178858027,
> > infoPort=50075, ipcPort=50020):DataXceiver
> > java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for channel
> 
> 
> I believe this is harmless.  It's just the DN timing out the socket --
> you set the timeout to 0 in hbase-site.xml rather than in
> hdfs-site.xml, where it would have an effect.  See HADOOP-3831 for
> details.
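A sketch of where that setting would go if you do want it, per the note
above (0 disables the timeout; whether you need it at all is another
question):

    <!-- hdfs-site.xml on the datanodes, not hbase-site.xml -->
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>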
> 
> 
> >  to be ready for write. ch : java.nio.channels.SocketChannel[connected
> > local=/10.241.6.80:50010 remote=/10.241.6.80:48320]
> >        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:
> >
> >
> > Here I show the output of 'hadoop dfsadmin -report'. The first time
> > it is invoked, all is well. The second time, one datanode is dead.
> > The third time, the dead datanode has come back to life:
> >
> > [hadoop@dt1 ~]$ hadoop dfsadmin -report
> > Configured Capacity: 1277248323584 (1.16 TB)
> > Present Capacity: 1208326105528 (1.1 TB)
> > DFS Remaining: 1056438108160 (983.88 GB)
> > DFS Used: 151887997368 (141.46 GB)
> > DFS Used%: 12.57%
> > Under replicated blocks: 3479
> > Blocks with corrupt replicas: 0
> > Missing blocks: 0
> >
> > -------------------------------------------------
> > Datanodes available: 2 (2 total, 0 dead)
> >
> > Name: 10.241.6.79:50010
> > Decommission Status : Normal
> > Configured Capacity: 643733970944 (599.52 GB)
> > DFS Used: 75694104268 (70.5 GB)
> > Non DFS Used: 35150238004 (32.74 GB)
> > DFS Remaining: 532889628672 (496.29 GB)
> > DFS Used%: 11.76%
> > DFS Remaining%: 82.78%
> > Last contact: Wed Apr 14 11:20:59 PDT 2010
> >
> >
> 
> Yeah, my guess as per above is that the reporting client couldn't get
> on to the datanode because handlers were full or xceivers exceeded.
> 
> Let us know how it goes.
> St.Ack
> 
> 
> > Name: 10.241.6.80:50010
> > Decommission Status : Normal
> > Configured Capacity: 633514352640 (590.01 GB)
> > DFS Used: 76193893100 (70.96 GB)
> > Non DFS Used: 33771980052 (31.45 GB)
> > DFS Remaining: 523548479488 (487.59 GB)
> > DFS Used%: 12.03%
> > DFS Remaining%: 82.64%
> > Last contact: Wed Apr 14 11:14:37 PDT 2010
> >
> >
> > [hadoop@dt1 ~]$ hadoop dfsadmin -report
> > Configured Capacity: 643733970944 (599.52 GB)
> > Present Capacity: 609294929920 (567.45 GB)
> > DFS Remaining: 532876144640 (496.28 GB)
> > DFS Used: 76418785280 (71.17 GB)
> > DFS Used%: 12.54%
> > Under replicated blocks: 3247
> > Blocks with corrupt replicas: 0
> > Missing blocks: 0
> >
> > -------------------------------------------------
> > Datanodes available: 1 (2 total, 1 dead)
> >
> > Name: 10.241.6.79:50010
> > Decommission Status : Normal
> > Configured Capacity: 643733970944 (599.52 GB)
> > DFS Used: 76418785280 (71.17 GB)
> > Non DFS Used: 34439041024 (32.07 GB)
> > DFS Remaining: 532876144640 (496.28 GB)
> > DFS Used%: 11.87%
> > DFS Remaining%: 82.78%
> > Last contact: Wed Apr 14 11:28:38 PDT 2010
> >
> >
> > Name: 10.241.6.80:50010
> > Decommission Status : Normal
> > Configured Capacity: 0 (0 KB)
> > DFS Used: 0 (0 KB)
> > Non DFS Used: 0 (0 KB)
> > DFS Remaining: 0(0 KB)
> > DFS Used%: 100%
> > DFS Remaining%: 0%
> > Last contact: Wed Apr 14 11:14:37 PDT 2010
> >
> >
> > [hadoop@dt1 ~]$ hadoop dfsadmin -report
> > Configured Capacity: 1277248323584 (1.16 TB)
> > Present Capacity: 1210726427080 (1.1 TB)
> > DFS Remaining: 1055440003072 (982.96 GB)
> > DFS Used: 155286424008 (144.62 GB)
> > DFS Used%: 12.83%
> > Under replicated blocks: 3338
> > Blocks with corrupt replicas: 0
> > Missing blocks: 0
> >
> > -------------------------------------------------
> > Datanodes available: 2 (2 total, 0 dead)
> >
> > Name: 10.241.6.79:50010
> > Decommission Status : Normal
> > Configured Capacity: 643733970944 (599.52 GB)
> > DFS Used: 77775145981 (72.43 GB)
> > Non DFS Used: 33086850051 (30.81 GB)
> > DFS Remaining: 532871974912 (496.28 GB)
> > DFS Used%: 12.08%
> > DFS Remaining%: 82.78%
> > Last contact: Wed Apr 14 11:29:44 PDT 2010
> >
> >
> > Name: 10.241.6.80:50010
> > Decommission Status : Normal
> > Configured Capacity: 633514352640 (590.01 GB)
> > DFS Used: 77511278027 (72.19 GB)
> > Non DFS Used: 33435046453 (31.14 GB)
> > DFS Remaining: 522568028160 (486.68 GB)
> > DFS Used%: 12.24%
> > DFS Remaining%: 82.49%
> > Last contact: Wed Apr 14 11:29:44 PDT 2010
> >
> >
> >
> >
 		 	   		  
