hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: regionserver disconnection
Date Sat, 14 Nov 2009 00:08:18 GMT
Ok.  Lack of file descriptors manifests in all kinds of weird ways.
Hopefully thats it.  If not, lets keep digging.
St.Ack
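[Editor's note: a quick way to see whether a process really is short of descriptors, as discussed above, is to compare its open-descriptor count against its limit. This is a generic Linux sketch (not a command from the thread); substitute the HMaster or HRegionServer pid from `jps` for the placeholder PID.]

```shell
# Count open descriptors for a process via /proc (Linux-specific).
# PID is a placeholder; here we inspect the current shell itself.
PID=$$
open=$(ls /proc/$PID/fd | wc -l)
limit=$(ulimit -n)
echo "pid $PID: $open open descriptors (descriptor limit: $limit)"
```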

On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:

> Stack,
>
> You are right, the master started with ulimit -n 1024. It doesn't take the
> system value.
>
> Regarding the too many open files error, it looks the same as the one J-D
> put up. But I will get the Master started with a higher value first and see
> if this kind of error goes away.
>
> thanks a lot!
> zhenyu
>
> On Fri, Nov 13, 2009 at 6:02 PM, stack <stack@duboce.net> wrote:
>
> > Does it say
> >
> > ulimit -n 32768
> >
> > ...as the second line in your log file on start of the master?
> >
> > Is the stack trace that complains about too many open files same as the
> one
> > in the blog post J-D put up?
> >
> > St.Ack
> >
> >
> > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> >
> > > The ulimit file descriptors was set to fs.file-max = 1578334; also in the
> > > limits.conf the value was set to 32768.
> > > So these are way higher than the open descriptors for the running
> > > processes.
> > >
> > > thanks
> > > zhenyu
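[Editor's note: the limits mentioned above live in different places, and a daemon only gets the limits.conf value if it was started through a PAM login session; one started from an init script or a non-login shell can end up at the 1024 default even when limits.conf says 32768. A sketch of where to look, using the standard Linux paths:]

```shell
# System-wide cap on open files (the fs.file-max sysctl):
cat /proc/sys/fs/file-max

# Per-user limit applied by PAM at login; -s suppresses the error if the
# file is absent, and this may not apply to daemons started from init:
grep -s nofile /etc/security/limits.conf || echo "no nofile entry found"

# Limit of the current shell, inherited by any process it launches:
ulimit -n
```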
> > >
> > >
> > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <saint.ack@gmail.com> wrote:
> > >
> > > > You upped the ulimit file descriptors as per the getting started doc?
> > > >
> > > >
> > > >
> > > > On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <zhongresearch@gmail.com>
> > > wrote:
> > > >
> > > >  Thanks a lot.
> > > >>
> > > >>
> > > >> Bad news is my kernel is still 2.6.26.
> > > >> But it was not a problem before.
> > > >>
> > > >> Very strange.
> > > >>
> > > >> zhenyu
> > > >>
> > > >> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> > > >>
> > > >>> Looks like
> > > >>>
> > > >>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
> > > >>>
> > > >>> J-D
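[Editor's note: the limits described in that post are per-user epoll accounting sysctls that exist only on kernels which have the accounting (introduced around 2.6.27; they are absent on 2.6.26, as comes up below). A sketch for checking whether a given kernel exposes them:]

```shell
# epoll limits are visible as /proc files only on kernels with epoll
# accounting; on kernels without it the files are simply absent.
for f in /proc/sys/fs/epoll/max_user_instances /proc/sys/fs/epoll/max_user_watches; do
  if [ -e "$f" ]; then
    echo "$f = $(cat "$f")"
  else
    echo "$f: not present on this kernel"
  fi
done
```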
> > > >>>
> > > >>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > >>>
> > > >>>> Hi,
> > > >>>>
> > > >>>>
> > > >>>> After I re-organized the cluster, the experiment ran into problems
> > > >>>> faster than before.
> > > >>>>
> > > >>>> Basically the changes are to use machines with fewer resources as
> > > >>>> zookeeper quorums and machines with more resources as regionservers.
> > > >>>>
> > > >>>> From the log, I still see the pause around 1 minute.
> > > >>>> I enabled the GC logging; please see
> > > >>>>
> > > >>>> http://pastebin.com/m1d4ce0f1
> > > >>>>
> > > >>>> for details.
> > > >>>> In general I don't see many pauses in the GC.
> > > >>>>
> > > >>>> What is more interesting, 1 hour after the 1st regionserver
> > > >>>> disconnected, the master log complained about too many open files.
> > > >>>> This didn't happen before.
> > > >>>> I checked the system OS setup as well as the limits.conf. I also
> > > >>>> checked the running processes. The total open files don't reach the
> > > >>>> limit. I am confused a bit.
> > > >>>>
> > > >>>>
> > > >>>> Please see the following master log.
> > > >>>>
> > > >>>> 2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > > >>>> RegionManager.metaScanner scan of 4658 row(s) of meta region {server:
> > > >>>> 192.168.100.128:60021, regionname: .META.,,1, startKey: <>} complete
> > > >>>> 2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > > >>>> All 1 .META. region(s) scanned
> > > >>>> 2009-11-13 20:06:07,677 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > >>>> response for sessionid:0x424eebf1c10004c after 3ms
> > > >>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.io.IOException: Bad connect ack with
> > > >>>> firstBadLink 192.168.100.123:50010
> > > >>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_-2808245019291145247_5478039
> > > >>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.io.EOFException
> > > >>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_1074853606841896259_5478048
> > > >>>> 2009-11-13 20:06:10,334 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > >>>> response for sessionid:0x24eebf1043003c after 1ms
> > > >>>> 2009-11-13 20:06:21,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > >>>> response for sessionid:0x424eebf1c10004c after 0ms
> > > >>>> 2009-11-13 20:06:23,674 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > >>>> response for sessionid:0x24eebf1043003c after 0ms
> > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.io.IOException: Bad connect ack with
> > > >>>> firstBadLink 192.168.100.123:50010
> > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_-6642544517082142289_5478063
> > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.net.SocketException: Too many open files
> > > >>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_2057511041109796090_5478063
> > > >>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.net.SocketException: Too many open files
> > > >>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_8219260302213892894_5478064
> > > >>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.net.SocketException: Too many open files
> > > >>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_1669205542853067709_5478235
> > > >>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.net.SocketException: Too many open files
> > > >>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_9128897691346270351_5478237
> > > >>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.net.SocketException: Too many open files
> > > >>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_1116845144864123018_5478240
> > > >>>> 2009-11-13 20:06:34,372 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > >>>> response for sessionid:0x424eebf1c10004c after 0ms
> > > >>>> 2009-11-13 20:06:37,034 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > >>>> response for sessionid:0x24eebf1043003c after 0ms
> > > >>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> > > >>>> Exception: java.io.IOException: Too many open files
> > > >>>>      at sun.nio.ch.IOUtil.initPipe(Native Method)
> > > >>>>      at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
> > > >>>>      at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
> > > >>>>      at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
> > > >>>>      at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
> > > >>>>      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> > > >>>>      at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > >>>>      at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > >>>>      at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > >>>>      at java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > >>>>      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
> > > >>>>
> > > >>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > >>>> Recovery for block blk_8148813491785406356_5478475 bad datanode[0]
> > > >>>> 192.168.100.123:50010
> > > >>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > >>>> Recovery for block blk_8148813491785406356_5478475 in pipeline
> > > >>>> 192.168.100.123:50010, 192.168.100.134:50010, 192.168.100.122:50010: bad
> > > >>>> datanode 192.168.100.123:50010
> > > >>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > > >>>> createBlockOutputStream java.net.SocketException: Too many open files
> > > >>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > > >>>> block blk_2119727700857186236_5478498
> > > >>>>
> > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > >>>>
> > > >>>>  Will do.
> > > >>>>>
> > > >>>>> thanks
> > > >>>>> zhenyu
> > > >>>>>
> > > >>>>>
> > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <stack@duboce.net> wrote:
> > > >>>>>
> > > >>>>>> Enable GC logging too on this next run (see hbase-env.sh).  Let's
> > > >>>>>> try and get to the bottom of what's going on.
> > > >>>>>> Thanks,
> > > >>>>>> St.Ack
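[Editor's note: enabling GC logging in hbase-env.sh, as suggested above, amounts to adding the standard HotSpot flags to HBASE_OPTS. A sketch; the flag names are the Java 6 era HotSpot ones and the log path is an arbitrary choice:]

```shell
# Append GC logging flags to the JVM options HBase passes to its daemons.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-gc.log"
echo "$HBASE_OPTS"
```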
> > > >>>>>>
> > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>>> I can switch the boxes that run zookeeper with the ones that run
> > > >>>>>>> regionservers.
> > > >>>>>>> I will see the results later.
> > > >>>>>>>
> > > >>>>>>> FYI. The node does have the 10 minutes zookeeper.session.timeout
> > > >>>>>>> value in place.
> > > >>>>>>>
> > > >>>>>>> thanks
> > > >>>>>>> zhenyu
> > > >>>>>>>
> > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <stack@duboce.net> wrote:
> > > >>>>>>>
> > > >>>>>>>> On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> In my cluster, half of the cluster have 2 disks 400GB each per
> > > >>>>>>>>> machine, and half of the cluster have 6 disks per machine.  Maybe
> > > >>>>>>>>> we should run zookeeper on the machines with 2 disks and RS on
> > > >>>>>>>>> machines with 6 disks?
> > > >>>>>>>>
> > > >>>>>>>> That would make most sense; only, in the below it looks like the RS
> > > >>>>>>>> that had issue had 4 disks?
> > > >>>>>>>>
> > > >>>>>>>>> BTW, the 10 minutes zookeeper.session.timeout has been set during
> > > >>>>>>>>> the experiment.
> > > >>>>>>>>
> > > >>>>>>>> And for sure this node had it in place?
> > > >>>>>>>> St.Ack
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>> thanks
> > > >>>>>>>>> zhenyu
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <stack@duboce.net> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Though I experienced 2 regionserver disconnections this morning,
> > > >>>>>>>>>>> it looks better from the regionserver log. (please see the
> > > >>>>>>>>>>> following log)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> http://pastebin.com/m496dbfae
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I checked diskIO for one of the regionservers (192.168.100.116)
> > > >>>>>>>>>>> during the time it disconnected.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Time: 03:04:58 AM
> > > >>>>>>>>>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > > >>>>>>>>>>> sda             105.31      5458.83     19503.64 9043873239 32312473676
> > > >>>>>>>>>>> sda1              2.90         6.64        99.25   10993934  164433464
> > > >>>>>>>>>>> sda2              1.72        23.76        30.25   39365817   50124008
> > > >>>>>>>>>>> sda3              0.30         0.38         3.58     624930    5923000
> > > >>>>>>>>>>> sda4            100.39      5428.06     19370.56 8992888270 32091993204
> > > >>>>>>>>>>>
> > > >>>>>>>>>> Is this high for you?  20k blocks/second would seem to be high
> > > >>>>>>>>>> but it's one disk only and it's not being shared by zk anymore so
> > > >>>>>>>>>> shouldn't matter?
> > > >>>>>>>>>>
> > > >>>>>>>>>>> I also checked the zookeeper quorum server that the regionserver
> > > >>>>>>>>>>> tried to connect to, according to the log. However, I don't see
> > > >>>>>>>>>>> 192.168.100.116 in the client list of the zookeeper quorum server
> > > >>>>>>>>>>> that the regionserver tried to connect to.
> > > >>>>>>>>>>> Would that be a problem?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Is that because the ephemeral node for the regionserver had
> > > >>>>>>>>>> evaporated?  Lost its lease w/ zk by the time you went to look?
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Thu Nov 12 15:04:52 UTC 2009
> > > >>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
> > > >>>>>>>>>>> Clients:
> > > >>>>>>>>>>> /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
> > > >>>>>>>>>>> /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
> > > >>>>>>>>>>> /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
> > > >>>>>>>>>>> /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
> > > >>>>>>>>>>> /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
> > > >>>>>>>>>>> /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
> > > >>>>>>>>>>> /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
> > > >>>>>>>>>>> /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Latency min/avg/max: 0/17/6333
> > > >>>>>>>>>>> Received: 47111
> > > >>>>>>>>>>> Sent: 0
> > > >>>>>>>>>>> Outstanding: 0
> > > >>>>>>>>>>> Zxid: 0x77000083f4
> > > >>>>>>>>>>> Mode: leader
> > > >>>>>>>>>>> Node count: 23
> > > >>>>>>>>>>>
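[Editor's note: output like the above comes from ZooKeeper's `stat` four-letter command (`echo stat | nc <quorum-host> 2181`). The number worth pulling out is the max latency, to compare against the session timeout; a sketch that parses it from a captured line, using the figure from this message:]

```shell
# Extract the max value from a "Latency min/avg/max:" line of a saved
# `stat` dump; the sample line is the one quoted in the message above.
stat_line='Latency min/avg/max: 0/17/6333'
max_ms=$(echo "$stat_line" | awk -F/ '{print $NF}')
echo "$max_ms"  # prints 6333; compare against zookeeper.session.timeout (ms)
```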
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> That 6 second maximum latency is pretty bad but should be well
> > > >>>>>>>>>> within the zk session timeout.
> > > >>>>>>>>>>
> > > >>>>>>>>>> So, problem is likely on the zk client side of the session; i.e.
> > > >>>>>>>>>> in the RS.  You could enable GC logging as J-D suggested to see if
> > > >>>>>>>>>> you have any big pauses, pauses > zk session timeout.
> > > >>>>>>>>>>
> > > >>>>>>>>>> When the RS went down, it didn't look too heavily loaded:
> > > >>>>>>>>>>
> > > >>>>>>>>>>  1. 2009-11-12 15:04:52,830 INFO
> > > >>>>>>>>>>  org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> > > >>>>>>>>>>  request=1.5166667, regions=322, stores=657, storefiles=631,
> > > >>>>>>>>>>  storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079,
> > > >>>>>>>>>>  blockCacheSize=658110960, blockCacheFree=197395984,
> > > >>>>>>>>>>  blockCacheCount=9903, blockCacheHitRatio=99
> > > >>>>>>>>>>
> > > >>>>>>>>>> It's serving a few reads?  The number of store files seems fine.
> > > >>>>>>>>>> Not too much memory used.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Looking at the logs, I see the Lease Still Held exception.  This
> > > >>>>>>>>>> happens when the RS does its regular report to the master but the
> > > >>>>>>>>>> master thinks the RS has since restarted.  It'll think this
> > > >>>>>>>>>> probably because it noticed that the RS's znode in zk had gone
> > > >>>>>>>>>> away and it considered the RS dead.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Looking too at your logs I see this gap in the zk pinging:
> > > >>>>>>>>>>
> > > >>>>>>>>>>  1. 2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn:
> > > >>>>>>>>>>  Got ping response for sessionid:0x224e55436ad0004 after 0ms
> > > >>>>>>>>>>  2. 2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn:
> > > >>>>>>>>>>  Got ping response for sessionid:0x24e55436a0007d after 0ms
> > > >>>>>>>>>>
> > > >>>>>>>>>> Where in the lines above it, it's reporting about every ten
> > > >>>>>>>>>> seconds; here there is a big gap.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Do you have ganglia or something that will let you look more into
> > > >>>>>>>>>> what was happening on this machine around this time?  Is the
> > > >>>>>>>>>> machine OK?  It looks lightly loaded and you have your cluster
> > > >>>>>>>>>> nicely laid out.  Something odd is going on.  What about things
> > > >>>>>>>>>> like the write speed to disk?  In the past strange issues have
> > > >>>>>>>>>> been explained by incorrectly set BIOS which made disks run at
> > > >>>>>>>>>> 1/100th of their proper speed.
> > > >>>>>>>>>>
> > > >>>>>>>>>> St.Ack
> > > >>>>>>>>>>
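[Editor's note: a common way to sanity-check the raw sequential write speed mentioned above is a dd run with a final fsync. A generic sketch; the path and size are arbitrary, and the result reflects whichever filesystem the path lives on:]

```shell
# Write 100 MB with an fsync at the end so the figure reflects the disk,
# not just the page cache; GNU dd reports the throughput on stderr.
testfile=/tmp/ddtest.$$
dd if=/dev/zero of="$testfile" bs=1M count=100 conv=fsync 2>&1 | tail -n1
rm -f "$testfile"
```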
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Best,
> > > >>>>>>>>>>> zhenyu
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Stack
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I really appreciate your comments.
> > > >>>>>>>>>>>> I will use the zookeeper monitoring script on my cluster and
> > > >>>>>>>>>>>> let it run overnight to see the result.
> > > >>>>>>>>>>>> I will follow up the thread when I get anything.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> thanks
> > > >>>>>>>>>>>> zhenyu
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <stack@duboce.net> wrote:
> > > >>>>>>>
> > > >>>>>>>>>>>>> I see these in your log too:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>  1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn:
> > > >>>>>>>>>>>>>  Got ping response for sessionid:0x424dfd908c50009 after 4544ms
> > > >>>>>>>>>>>>>  2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn:
> > > >>>>>>>>>>>>>  Got ping response for sessionid:0x24dfd90c810002 after 4548ms
> > > >>>>>>>>>>>>>  3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn:
> > > >>>>>>>>>>>>>  Got ping response for sessionid:0x424dfd908c50009 after 9030ms
> > > >>>>>>>>>>>>>  4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.
> > >
> >
>
