hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: regionserver disconnection
Date Mon, 16 Nov 2009 21:07:17 GMT
On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:

> I just realized that there was a MapReduce job running during the time the
> regionserver disconnected from the zookeeper.
> That MapReduce job was processing 500GB of data and took about 8 minutes to
> finish. It launched over 2000 map tasks.


There was a tasktracker running on the RegionServer that disconnected?  Is
the map i/o or cpu heavy?  Do you think it could have stolen resources from
the datanode/regionserver?



> I suspect that this introduced
> resource contention between the DataNode and the RegionServer node.
> Also during the time that MapReduce job ran, I saw a few errors indicating
> that
>
> java.io.IOException: Could not obtain block:
> blk_7661247556283580226_4450462 file=/data/xxx.txt.200910141159
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
>        at java.io.DataInputStream.read(DataInputStream.java:83)
>        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>        at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>        at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>        at
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>        at
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>

This is interesting.  Your map task is running into hdfs-127.  Yeah, you
should patch your cluster if you want to get rid of these (hdfs-127 has been
applied to 0.21 hadoop and will be in the next hadoop release on 0.20.x
branch, hadoop 0.20.2).



>
> Possibly it is related to HDFS-127, but I don't see any datanodes going
> down. Does that affect the regionserver? Should we apply the patch?
>
>
Read up on hdfs-127 for an explanation of what's going on.



> Now I have started tracking the virtual memory stats to see if the same
> issue happens tomorrow around the same time period.
>

Excellent.  Thanks for doing this.
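
For reference, a minimal sketch of one way to capture those stats around the
suspect window; the interval and output paths are arbitrary choices, not
anything agreed in this thread:

  # sample memory/swap and per-disk I/O every 10 seconds in the background
  vmstat 10    >> /tmp/vmstat.$(hostname).log &
  iostat -x 10 >> /tmp/iostat.$(hostname).log &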



> And I highly suspect that this particular MapReduce job hurts the HBase
> regionserver.
>
> PS: I also used gcviewer to parse the GC log; I only see pauses of around
> 30 seconds at most.
>


Did you figure out why your ten-minute session timeout is not applying?
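
If it helps, here is a minimal sketch of what could be checked; the file
locations are assumptions about this setup, not paths taken from this thread.
One thing worth looking at is the quorum's tickTime: as far as I know,
ZooKeeper caps the negotiated session timeout at about 20 * tickTime, so a
tickTime of 3000 ms would hold sessions to roughly 60 seconds no matter what
the client asks for.

  # client side: what the regionservers request
  grep -A1 'zookeeper.session.timeout' /path/to/hbase/conf/hbase-site.xml
  # server side: the cap comes from the quorum's tickTime (zoo.cfg for a
  # standalone quorum, or hbase.zookeeper.property.tickTime if HBase manages it)
  grep tickTime /path/to/zookeeper/conf/zoo.cfg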

St.Ack



>
> Thanks a lot.
> zhenyu
>
>
>
> On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <zhongresearch@gmail.com
> >wrote:
>
> > J-D,
> >
> > Thank you very much for your comments.
> > My company blocks the IRC port, so I have trouble connecting to the IRC
> > channel. I have been asking IT to open the IRC port for me; it might
> > take a while.
> >
> > Best,
> > zhenyu
> >
> >
> > On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> The error you are getting is a disconnection from a zookeeper server
> >> and is very generic.
> >>
> >> ZK-86 is still open and the last comment refers to ZK-111 saying
> >> that the bug (in unit tests) was probably fixed in release 3.0.0 last
> >> year.
> >>
> >> To figure out the hang you have, you can try to jstack the process pid and
> >> see exactly what's holding the RS from shutting down.
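
A minimal sketch of that, assuming the JDK's jps/jstack are on the PATH and
it is run as the user that owns the regionserver process:

  RS_PID=$(jps | grep HRegionServer | awk '{print $1}')
  jstack $RS_PID > /tmp/rs.jstack.$(date +%s).txt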
> >>
> >> Would it be possible for you to drop by the IRC channel? This way we
> >> can debug this at a much faster pace.
> >>
> >> Thx!
> >>
> >> J-D
> >>
> >> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <zhongresearch@gmail.com>
> >> wrote:
> >> > I found this.
> >> > http://issues.apache.org/jira/browse/ZOOKEEPER-86
> >> >
> >> > It looks like the same error I had. Is it a zookeeper bug? When will
> >> > HBase move to ZooKeeper version 3.3.0?
> >> >
> >> > thanks
> >> > zhenyu
> >> >
> >> > On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <
> zhongresearch@gmail.com
> >> >wrote:
> >> >
> >> >> Now I really suspect ZooKeeper. From the log I saw errors like
> >> >> "IOException Read Error" while the zookeeper client (regionserver)
> >> >> tried to read, and it got a Disconnected status from zookeeper.
> >> >>
> >> >> I don't see any load on any zookeeper quorum servers. DiskIO is fine.
> >> >>
> >> >> Also, when the regionserver decides to exit due to the Disconnected
> >> >> status from zookeeper, sometimes the regionserver hangs while exiting.
> >> >> We can still see the HRegionServer process even though we don't see it
> >> >> in the Master web interface.
> >> >>
> >> >> I also notice that there are zookeeper.retries and zookeeper.pauses
> >> >> settings in hbase-default.xml; would that matter?
> >> >>
> >> >> thanks
> >> >>
> >> >> You may see the IOException error in the following log when the
> >> >> Regionserver lost connection to the zookeeper.
> >> >>
> >> >> 2009-11-14 15:58:31,993 INFO
> org.apache.hadoop.hbase.regionserver.HLog:
> >> >> Roll
> >> >>
> >>
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921,
> >> >> entries=1062964, calcsize=255014936, filesize=139566093. New hlog
> >> >>
> >>
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
> >> >> 2009-11-14 15:58:39,295 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> >> 2009-11-14 15:58:52,648 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> >> 2009-11-14 15:59:06,007 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> >> 2009-11-14 15:59:19,365 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> >> 2009-11-14 15:59:32,867 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 186ms
> >> >> 2009-11-14 15:59:46,070 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> >> 2009-11-14 15:59:59,423 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> >> 2009-11-14 16:00:12,775 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> >> 2009-11-14 16:00:28,141 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> >> >> response for sessionid:0x424f0ed01ea00a6 after 2010ms
> >> >> 2009-11-14 16:01:10,378 WARN org.apache.zookeeper.ClientCnxn:
> Exception
> >> >> closing session 0x424f0ed01ea00a6 to
> >> sun.nio.ch.SelectionKeyImpl@39fbb2d6
> >> >>
> >> >> java.io.IOException: Read error rc = -1
> java.nio.DirectByteBuffer[pos=0
> >> >> lim=4 cap=4]
> >> >>         at
> >> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> >> >>         at
> >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> >> >> 2009-11-14 16:01:10,478 INFO
> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> >> event,
> >> >> state: Disconnected, type: None, path: null
> >> >> 2009-11-14 16:01:11,333 WARN
> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> >> >>
> >> >> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> >> >>         at
> >> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> >> Method)
> >> >>         at
> >> >>
> >>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >> >>         at
> >> >>
> >>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >> >>         at
> >> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> >> >>         at
> >> >>
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> >> >>         at java.lang.Thread.run(Thread.java:619)
> >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn:
> >> Attempting
> >> >> connection to server superpyxis0001/192.168.100.115:2181
> >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn: Priming
> >> >> connection to java.nio.channels.SocketChannel[connected local=/
> >> >> 192.168.100.122:59575 remote=superpyxis0001/192.168.100.115:2181]
> >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn: Server
> >> >> connection successful
> >> >> 2009-11-14 16:01:11,433 INFO
> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> >> event,
> >> >> state: Expired, type: None, path: null
> >> >> 2009-11-14 16:01:11,433 WARN org.apache.zookeeper.ClientCnxn:
> Exception
> >> >> closing session 0x424f0ed01ea00a6 to
> >> sun.nio.ch.SelectionKeyImpl@6b93d343
> >> >> java.io.IOException: Session Expired
> >> >>         at
> >> >>
> >>
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
> >> >>         at
> >> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
> >> >>
> >> >>         at
> >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> >> >> 2009-11-14 16:01:11,433 ERROR
> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> >> >> expired
> >> >> 2009-11-14 16:01:13,513 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> >> Server
> >> >> handler 3 on 60021, call put([B@73deb204,
> >> >> [Lorg.apache.hadoop.hbase.client.Put;@2179600a) from
> >> 192.168.100.132:40728:
> >> >> error: java.io.IOException: Server not running, aborting
> >> >> java.io.IOException: Server not running, aborting
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <
> zhongresearch@gmail.com
> >> >wrote:
> >> >>
> >> >>> Stack,
> >> >>>
> >> >>> Thanks a lot!
> >> >>>
> >> >>> I found out the reason why HBase doesn't take the system file
> >> >>> descriptor value I set before: I started HBase as root instead of
> >> >>> the normal hadoop user, and my system configures the hadoop user
> >> >>> with a higher file descriptor limit but leaves root at the default
> >> >>> of 1024.
> >> >>>
> >> >>>
> >> >>> Now my system has a clean start. However, it seems that the
> >> >>> zookeeper.session.timeout value doesn't take effect. I still see
> >> >>> pauses of around 60 seconds from the disconnected regionserver. I
> >> >>> really don't know why the regionserver times out after only 60
> >> >>> seconds instead of the 10 minutes I set for
> >> >>> zookeeper.session.timeout.
> >> >>>
> >> >>> Is there any other timeout value coming into play before the actual
> >> >>> session times out?
> >> >>>
> >> >>> zhenyu
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Fri, Nov 13, 2009 at 7:08 PM, stack <stack@duboce.net> wrote:
> >> >>>
> >> >>>> Ok.  Lack of file descriptors manifests in all kinds of weird ways.
> >> >>>> Hopefully thats it.  If not, lets keep digging.
> >> >>>> St.Ack
> >> >>>>
> >> >>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <
> >> zhongresearch@gmail.com
> >> >>>> >wrote:
> >> >>>>
> >> >>>> > Stack,
> >> >>>> >
> >> >>>> > You are right, the master started with ulimit -n 1024. It doesn't
> >> take
> >> >>>> the
> >> >>>> > system value.
> >> >>>> >
> >> >>>> > Regarding the too many open files, it looks like the same as the
> >> >>>> > one J-D put up. But I will get the Master started with a higher
> >> >>>> > value first and see if this kind of error goes away.
> >> >>>> >
> >> >>>> > thanks a lot!
> >> >>>> > zhenyu
> >> >>>> >
> >> >>>> > On Fri, Nov 13, 2009 at 6:02 PM, stack <stack@duboce.net> wrote:
> >> >>>> >
> >> >>>> > > Does it say
> >> >>>> > >
> >> >>>> > > ulimit -n 32768
> >> >>>> > >
> >> >>>> > > ...as the second line in your log file on start of the master?
> >> >>>> > >
> >> >>>> > > Is the stack trace that complains about too many open files
> same
> >> as
> >> >>>> the
> >> >>>> > one
> >> >>>> > > in the blog post J-D put up?
> >> >>>> > >
> >> >>>> > > St.Ack
> >> >>>> > >
> >> >>>> > >
> >> >>>> > > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <
> >> >>>> zhongresearch@gmail.com
> >> >>>> > > >wrote:
> >> >>>> > >
> >> >>>> > > > The ulimit file descriptors were set to fs.file-max = 1578334,
> >> >>>> > > > and in limits.conf the value was set to 32768.
> >> >>>> > > > So these are way higher than the open descriptors for the
> >> >>>> > > > running processes.
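
For what it's worth, a minimal sketch of checking the limit the running
process actually got (rather than what limits.conf promises), assuming a
2.6.24+ kernel that exposes /proc/<pid>/limits:

  MASTER_PID=$(jps | grep HMaster | awk '{print $1}')
  grep 'open files' /proc/$MASTER_PID/limits   # effective limit for this process
  ls /proc/$MASTER_PID/fd | wc -l              # descriptors currently open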
> >> >>>> > > >
> >> >>>> > > > thanks
> >> >>>> > > > zhenyu
> >> >>>> > > >
> >> >>>> > > >
> >> >>>> > > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <saint.ack@gmail.com>
> >> >>>> wrote:
> >> >>>> > > >
> >> >>>> > > > > You upped the ulimit file descriptors as per the getting
> >> started
> >> >>>> doc?
> >> >>>> > > > >
> >> >>>> > > > >
> >> >>>> > > > >
> >> >>>> > > > > On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <
> >> >>>> zhongresearch@gmail.com>
> >> >>>> > > > wrote:
> >> >>>> > > > >
> >> >>>> > > > >  Thanks a lot.
> >> >>>> > > > >>
> >> >>>> > > > >>
> >> >>>> > > > >> Bad news is my kernel is still 2.6.26.
> >> >>>> > > > >> But it was not a problem before.
> >> >>>> > > > >>
> >> >>>> > > > >> Very strange.
> >> >>>> > > > >>
> >> >>>> > > > >> zhenyu
> >> >>>> > > > >>
> >> >>>> > > > >> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <
> >> >>>> > > > jdcryans@apache.org
> >> >>>> > > > >> >wrote:
> >> >>>> > > > >>
> >> >>>> > > > >>  Looks like
> >> >>>> > > > >>>
> >> >>>> > > > >>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >>
> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
> >> >>>> > > > >>>
> >> >>>> > > > >>> J-D
> >> >>>> > > > >>>
> >> >>>> > > > >>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <
> >> >>>> > > zhongresearch@gmail.com
> >> >>>> > > > >
> >> >>>> > > > >>> wrote:
> >> >>>> > > > >>>
> >> >>>> > > > >>>> Hi,
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> After I re-organized the cluster, the experiment ran into
> >> >>>> > > > >>>> problems faster than before.
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> Basically, the changes are to use machines with fewer
> >> >>>> > > > >>>> resources as zookeeper quorums and machines with more
> >> >>>> > > > >>>> resources as regionservers.
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> From the log, I still see the pause around 1 minute.
> >> >>>> > > > >>>> I enabled the GC logging,  please see
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> http://pastebin.com/m1d4ce0f1
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> for details.
> >> >>>> > > > >>>> In general I don't see many pauses in the GC.
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> What is more interesting, 1 hour after the 1st regionserver
> >> >>>> > > > >>>> disconnected, the master log complained about too many open
> >> >>>> > > > >>>> files. This didn't happen before.
> >> >>>> > > > >>>> I checked the system OS setup as well as limits.conf. I also
> >> >>>> > > > >>>> checked the running processes. The total open files don't
> >> >>>> > > > >>>> reach the limit, so I am a bit confused.
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> Please see the following master log.
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> >> >>>> > > > >>>> RegionManager.metaScanner scan of 4658 row(s) of meta
> >> region
> >> >>>> > > {server:
> >> >>>> > > > >>>> 192.168.100.128:60021, regionname: .META.,,1, startKey:
> >> <>}
> >> >>>> > > complete
> >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> >> >>>> > > > >>>>
> >> >>>> > > > >>> All
> >> >>>> > > > >>>
> >> >>>> > > > >>>> 1 .META. region(s) scanned
> >> >>>> > > > >>>> 2009-11-13 20:06:07,677 DEBUG
> >> org.apache.zookeeper.ClientCnxn:
> >> >>>> Got
> >> >>>> > > > ping
> >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 3ms
> >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.io.IOException: Bad connect
> >> ack
> >> >>>> with
> >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_-2808245019291145247_5478039
> >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.io.EOFException
> >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_1074853606841896259_5478048
> >> >>>> > > > >>>> 2009-11-13 20:06:10,334 DEBUG
> >> org.apache.zookeeper.ClientCnxn:
> >> >>>> Got
> >> >>>> > > > ping
> >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 1ms
> >> >>>> > > > >>>> 2009-11-13 20:06:21,018 DEBUG
> >> org.apache.zookeeper.ClientCnxn:
> >> >>>> Got
> >> >>>> > > > ping
> >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms
> >> >>>> > > > >>>> 2009-11-13 20:06:23,674 DEBUG
> >> org.apache.zookeeper.ClientCnxn:
> >> >>>> Got
> >> >>>> > > > ping
> >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 0ms
> >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.io.IOException: Bad connect
> >> ack
> >> >>>> with
> >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_-6642544517082142289_5478063
> >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> many
> >> >>>> open
> >> >>>> > > files
> >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_2057511041109796090_5478063
> >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> many
> >> >>>> open
> >> >>>> > > files
> >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_8219260302213892894_5478064
> >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> many
> >> >>>> open
> >> >>>> > > files
> >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_1669205542853067709_5478235
> >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> many
> >> >>>> open
> >> >>>> > > files
> >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_9128897691346270351_5478237
> >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> many
> >> >>>> open
> >> >>>> > > files
> >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_1116845144864123018_5478240
> >> >>>> > > > >>>> 2009-11-13 20:06:34,372 DEBUG
> >> org.apache.zookeeper.ClientCnxn:
> >> >>>> Got
> >> >>>> > > > ping
> >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms
> >> >>>> > > > >>>> 2009-11-13 20:06:37,034 DEBUG
> >> org.apache.zookeeper.ClientCnxn:
> >> >>>> Got
> >> >>>> > > > ping
> >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 0ms
> >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>>
> >> >>>> > > > >>> DataStreamer
> >> >>>> > > > >>>
> >> >>>> > > > >>>> Exception: java.io.IOException: Too many open files
> >> >>>> > > > >>>>      at sun.nio.ch.IOUtil.initPipe(Native Method)
> >> >>>> > > > >>>>      at
> >> >>>> > > sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
> >> >>>> > > > >>>>      at
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >>
> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>      at
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>      at
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>      at
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>      at
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >>
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>      at
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >>
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>      at
> >> >>>> > > > >>>>
> >> >>>> > > > >>>
> >> >>>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>      at
> >> >>>> java.io.DataOutputStream.write(DataOutputStream.java:90)
> >> >>>> > > > >>>>      at
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > Error
> >> >>>> > > > >>>> Recovery for block blk_8148813491785406356_5478475 bad
> >> >>>> datanode[0]
> >> >>>> > > > >>>> 192.168.100.123:50010
> >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > Error
> >> >>>> > > > >>>> Recovery for block blk_8148813491785406356_5478475 in
> >> pipeline
> >> >>>> > > > >>>> 192.168.100.123:50010, 192.168.100.134:50010,
> >> >>>> > 192.168.100.122:50010
> >> >>>> > > :
> >> >>>> > > > >>>> bad
> >> >>>> > > > >>>> datanode 192.168.100.123:50010
> >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > Exception
> >> >>>> > > > >>>>
> >> >>>> > > > >>> in
> >> >>>> > > > >>>
> >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException: Too
> many
> >> >>>> open
> >> >>>> > > files
> >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> >> org.apache.hadoop.hdfs.DFSClient:
> >> >>>> > > > >>>> Abandoning
> >> >>>> > > > >>>> block blk_2119727700857186236_5478498
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <
> >> >>>> > > > zhongresearch@gmail.com
> >> >>>> > > > >>>> wrote:
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>  Will do.
> >> >>>> > > > >>>>>
> >> >>>> > > > >>>>> thanks
> >> >>>> > > > >>>>> zhenyu
> >> >>>> > > > >>>>>
> >> >>>> > > > >>>>>
> >> >>>> > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <
> stack@duboce.net
> >> >
> >> >>>> wrote:
> >> >>>> > > > >>>>>
> >> >>>> > > > >>>>>> Enable GC logging too on this next run (see hbase-env.sh).
> >> >>>> > > > >>>>>> Let's try and get to the bottom of what's going on.
> >> >>>> > > > >>>>>> Thanks,
> >> >>>> > > > >>>>>> St.Ack
> >> >>>> > > > >>>>>>
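
A minimal sketch of what that could look like in hbase-env.sh; the flags are
standard HotSpot options and the log path is an arbitrary choice:

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-gc-$(hostname).log"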
> >> >>>> > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>> zhongresearch@gmail.com
> >> >>>> > > > >>>
> >> >>>> > > > >>>> wrote:
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>  I can switch the boxes that run zookeeper with the
> ones
> >> >>>> that
> >> >>>> > run
> >> >>>> > > > >>>>>>> regionservers.
> >> >>>> > > > >>>>>>> I will see the results later.
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>> FYI, the node does have the 10-minute
> >> >>>> > > > >>>>>>> zookeeper.session.timeout value in place.
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>> thanks
> >> >>>> > > > >>>>>>> zhenyu
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <
> >> stack@duboce.net>
> >> >>>> > wrote:
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>  On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>> zhongresearch@gmail.com
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> wrote:
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> In my cluster, half of the cluster have 2 disks
> 400GB
> >> >>>> each
> >> >>>> > per
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>> machine,
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> and
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> half of the cluster have 6 disks per machine.
>  Maybe
> >> we
> >> >>>> > should
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>> run
> >> >>>> > > > >>>
> >> >>>> > > > >>>> zookeeper
> >> >>>> > > > >>>>>>>>> on the machines with 2 disks and RS on machines
> with
> >> 6
> >> >>>> disks?
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>> That would make most sense only in the below, it
> >> looks
> >> >>>> like
> >> >>>> > the
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>> RS
> >> >>>> > > > >>>
> >> >>>> > > > >>>> that
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> had
> >> >>>> > > > >>>>>>>> issue had 4 disks?
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>  BTW, the 10 minutes zookeeper.session.timeout has
> >> been
> >> >>>> set
> >> >>>> > > during
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>> the
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> experiment.
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>> And for sure this node had it in place?
> >> >>>> > > > >>>>>>>> St.Ack
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> thanks
> >> >>>> > > > >>>>>>>>> zhenyu
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <
> >> stack@duboce.net
> >> >>>> >
> >> >>>> > > wrote:
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>  On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> zhongresearch@gmail.com
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> wrote:
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>  Though I experienced 2 regionserver disconnection
> >> this
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> morning,
> >> >>>> > > > >>>
> >> >>>> > > > >>>> it
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> looks
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> better from the regionserver log. (please see the
> >> >>>> following
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> log)
> >> >>>> > > > >>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>> http://pastebin.com/m496dbfae
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>> I checked diskIO for one of the
> >> >>>> > regionserver(192.168.100.116)
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> during
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> the
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> time it disconnected.
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>> Time: 03:04:58 AM
> >> >>>> > > > >>>>>>>>>>> Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
> >> >>>> > > > >>>>>>>>>>> sda             105.31      5458.83     19503.64  9043873239 32312473676
> >> >>>> > > > >>>>>>>>>>> sda1              2.90         6.64        99.25    10993934   164433464
> >> >>>> > > > >>>>>>>>>>> sda2              1.72        23.76        30.25    39365817    50124008
> >> >>>> > > > >>>>>>>>>>> sda3              0.30         0.38         3.58      624930     5923000
> >> >>>> > > > >>>>>>>>>>> sda4            100.39      5428.06     19370.56  8992888270 32091993204
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>  Is this high for you?  20k blocks/second would
> >> seem to
> >> >>>> be
> >> >>>> > > high
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> but
> >> >>>> > > > >>>
> >> >>>> > > > >>>> its
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> one
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> disk only and its not being shared by zk anymore
> so
> >> >>>> > shouldn't
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> matter?
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>  I also checked the zookeeper quorum server that
> the
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> regionserver
> >> >>>> > > > >>>
> >> >>>> > > > >>>> tried
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> to
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> connect according to the log. However, I don't see
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> 192.168.100.116
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> in
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> the
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> client list of the zookeeper quorum server that
> the
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> regionserver
> >> >>>> > > > >>>
> >> >>>> > > > >>>> tried
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> to
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> connect.
> >> >>>> > > > >>>>>>>>>>> Would that be a problem?
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>  Is that because the ephemeral node for the
> >> >>>> regionserver
> >> >>>> > had
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> evaporated?
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> Lost
> >> >>>> > > > >>>>>>>>>> its lease w/ zk by the time you went to look?
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>  Thu Nov 12 15:04:52 UTC 2009
> >> >>>> > > > >>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on
> >> 08/27/2009
> >> >>>> 18:48
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> GMT
> >> >>>> > > > >>>
> >> >>>> > > > >>>> Clients:
> >> >>>> > > > >>>>>>>>>>> /192.168.100.127:43045
> >> [1](queued=0,recved=26,sent=0)
> >> >>>> > > > >>>>>>>>>>> /192.168.100.131:39091
> >> [1](queued=0,recved=964,sent=0)
> >> >>>> > > > >>>>>>>>>>> /192.168.100.124:35961
> >> [1](queued=0,recved=3266,sent=0)
> >> >>>> > > > >>>>>>>>>>> /192.168.100.123:47935
> >> [1](queued=0,recved=233,sent=0)
> >> >>>> > > > >>>>>>>>>>> /192.168.100.125:46931
> >> [1](queued=0,recved=2,sent=0)
> >> >>>> > > > >>>>>>>>>>> /192.168.100.118:54924
> >> [1](queued=0,recved=295,sent=0)
> >> >>>> > > > >>>>>>>>>>> /192.168.100.118:41390
> >> [1](queued=0,recved=2290,sent=0)
> >> >>>> > > > >>>>>>>>>>> /192.168.100.136:42243
> >> [1](queued=0,recved=0,sent=0)
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>> Latency min/avg/max: 0/17/6333
> >> >>>> > > > >>>>>>>>>>> Received: 47111
> >> >>>> > > > >>>>>>>>>>> Sent: 0
> >> >>>> > > > >>>>>>>>>>> Outstanding: 0
> >> >>>> > > > >>>>>>>>>>> Zxid: 0x77000083f4
> >> >>>> > > > >>>>>>>>>>> Mode: leader
> >> >>>> > > > >>>>>>>>>>> Node count: 23
> >> >>>> > > > >>>>>>>>>>>
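
For reference, the listing above looks like the output of ZooKeeper's
four-letter "stat" command; a minimal sketch for grabbing it by hand,
assuming nc is installed and the quorum listens on 2181 (the host name is
just one that appears elsewhere in this thread):

  echo stat | nc superpyxis0001 2181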
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> That 6 second maximum latency is pretty bad but
> >> should
> >> >>>> be
> >> >>>> > well
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> within
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> the
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> zk
> >> >>>> > > > >>>>>>>>>> session timeout.
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> So, problem is likely on the zk client side of the
> >> >>>> session;
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> i.e.
> >> >>>> > > > >>>
> >> >>>> > > > >>>> in
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> the
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> RS.
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> You could enable GC logging as J-D suggested to
> see
> >> if
> >> >>>> you
> >> >>>> > > have
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> any
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> big
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> pauses, pauses > zk session timeout.
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> When the RS went down, it didn't look too heavily
> >> >>>> loaded:
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:04:52,830 INFO
> >> >>>> > > > >>>>>>>>>>  org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> >> >>>> > > > >>>>>>>>>>  request=1.5166667, regions=322, stores=657, storefiles=631,
> >> >>>> > > > >>>>>>>>>>  storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079,
> >> >>>> > > > >>>>>>>>>>  blockCacheSize=658110960, blockCacheFree=197395984, blockCacheCount=9903,
> >> >>>> > > > >>>>>>>>>>  blockCacheHitRatio=99
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> Its serving a few reads?  The number of store
> files
> >> >>>> seems
> >> >>>> > > fine.
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> Not
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> too
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> much memory used.
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> Looking at the logs, I see the Lease Still Held
> >> >>>> exception.
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> This
> >> >>>> > > > >>>
> >> >>>> > > > >>>> happens
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> when the RS does its regular report to the master
> but
> >> the
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> master
> >> >>>> > > > >>>
> >> >>>> > > > >>>> thinks
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> the
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> RS has since restarted.  It'll think this probably
> >> >>>> because
> >> >>>> > it
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> noticed
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> that
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> the RS's znode in zk had gone away and it
> considered
> >> the
> >> >>>> RS
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> dead.
> >> >>>> > > > >>>
> >> >>>> > > > >>>>
> >> >>>> > > > >>>>>>>>>> Looking too at your  logs I see this gap in the zk
> >> >>>> pinging:
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:03:39,325 DEBUG
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> Got
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>>  ping response for sessionid:0x224e55436ad0004 after
> >> 0ms
> >> >>>> > > > >>>>>>>>>>  2. 2009-11-12 15:03:43,113 DEBUG
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> Got
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>>  ping response for sessionid:0x24e55436a0007d after
> >> 0ms
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> Where in the lines above it, its reporting about
> >> every
> >> >>>> ten
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> seconds,
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> here
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> there is a big gap.
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> Do you have ganglia or something that will let you
> >> look
> >> >>>> more
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> into
> >> >>>> > > > >>>
> >> >>>> > > > >>>> what
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> was
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> happening on this machine around this time?  Is
> the
> >> >>>> machine
> >> >>>> > > OK?
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> It
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> looks
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> lightly loaded and you have your cluster nicely
> laid
> >> out.
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> Something
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> odd
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> is
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> going on.  What about things like the write speed
> to
> >> >>>> disk?
> >> >>>> >  In
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> the
> >> >>>> > > > >>>
> >> >>>> > > > >>>> past
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> strange issues have been explained by incorrectly
> set
> >> BIOS
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>> which
> >> >>>> > > > >>>
> >> >>>> > > > >>>> made
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> disks
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> run at 1/100th of their proper speed.
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> St.Ack
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>> Best,
> >> >>>> > > > >>>>>>>>>>> zhenyu
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>> zhongresearch@gmail.com
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> wrote:
> >> >>>> > > > >>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>  Stack
> >> >>>> > > > >>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> I am very appreciated for your comments.
> >> >>>> > > > >>>>>>>>>>>> I will use the zookeeper monitoring script on my
> >> >>>> cluster
> >> >>>> > > > >>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>> and
> >> >>>> > > > >>>
> >> >>>> > > > >>>> let
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>> it
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>> run
> >> >>>> > > > >>>>>>>>>
> >> >>>> > > > >>>>>>>>>> overnight to see the result.
> >> >>>> > > > >>>>>>>>>>>> I will follow up the thread when I get anything.
> >> >>>> > > > >>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> thanks
> >> >>>> > > > >>>>>>>>>>>> zhenyu
> >> >>>> > > > >>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <
> >> >>>> stack@duboce.net>
> >> >>>> > > > >>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>> wrote:
> >> >>>> > > > >>>>>>
> >> >>>> > > > >>>>>>>
> >> >>>> > > > >>>>>>>>>>>>  I see these in your log too:
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>>>  1. 2009-11-11 04:27:19,018 DEBUG
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> Got
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>  ping response for sessionid:0x424dfd908c50009
> >> after
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> 4544ms
> >> >>>> > > > >>>
> >> >>>> > > > >>>>  2. 2009-11-11 04:27:19,018 DEBUG
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> Got
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>  ping response for sessionid:0x24dfd90c810002
> after
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> 4548ms
> >> >>>> > > > >>>
> >> >>>> > > > >>>>  3. 2009-11-11 04:27:43,960 DEBUG
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>>> Got
> >> >>>> > > > >>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>  ping response for sessionid:0x424dfd908c50009
> >> after
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> 9030ms
> >> >>>> > > > >>>
> >> >>>> > > > >>>>  4. 2009-11-11 04:27:43,960 DEBUG
> >> >>>> > > > >>>>>>>>>>>>>
> >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.
> >> >>>> > > > >>>>>>>>
> >> >>>> > > > >>>>>>>
> >> >>>> > > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >> >
> >>
> >
> >
>
