hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zhenyu Zhong <zhongresea...@gmail.com>
Subject Re: regionserver disconnection
Date Thu, 19 Nov 2009 16:38:12 GMT
Here is the zookeeper leader log. It had an unexpected exception.  I don't
know why.

http://pastebin.com/m6594ccc

During that time, I don't think the system load is high. Memory usage should
be normal.

Best,
zhenyu



On Thu, Nov 19, 2009 at 11:09 AM, stack <stack@duboce.net> wrote:

> What is in the other zk logs?  Why did the leader go away?  Memory issue or
> something?  Why'd it shutdown?  Thanks for pastebin'ing this stuff.
> St.Ack
>
> On Thu, Nov 19, 2009 at 7:53 AM, Zhenyu Zhong <zhongresearch@gmail.com
> >wrote:
>
> > After digging more on the zookeeper log, I found that there seemed to be
> a
> > socketconnection timeout. Also some warnings indicate that zookeeper
> server
> > is not running. Probably I need to understand the mechanism of zookeeper
> > first.
> >
> > Please find the zookeeper log around that time.
> >
> > http://pastebin.com/m37bb4ad1
> >
> > Many thanks
> > zhenyu
> >
> >
> > On Thu, Nov 19, 2009 at 10:31 AM, Zhenyu Zhong <zhongresearch@gmail.com
> > >wrote:
> >
> > > Stack,
> > >
> > > I am very appreciated. Let me dig into the zookeeper log more.
> > >
> > > FYI, I saw the sleeper complained about the delay for the first time.
> It
> > > looks like there are different reasons for the RS disconnections.
> > >
> > > 2009-11-19 08:08:32,708 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > response for sessionid:0x42508716bfc0001 after 0ms
> > > 2009-11-19 08:08:37,595 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > response for sessionid:0x42508716bfc0082 after 0ms
> > > 2009-11-19 08:08:46,073 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > response for sessionid:0x42508716bfc0001 after 0ms
> > > 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn: Exception
> > > closing session 0x42508716bfc0001 to
> sun.nio.ch.SelectionKeyImpl@4f3a608f
> > > java.io.IOException: TIMED OUT
> > >         at
> > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn: Exception
> > > closing session 0x42508716bfc0082 to
> sun.nio.ch.SelectionKeyImpl@14c8875
> > > java.io.IOException: TIMED OUT
> > >         at
> > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > 2009-11-19 08:09:39,880 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept
> > > 53556ms, ten times longer than scheduled: 3000
> > > 2009-11-19 08:09:39,980 WARN
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > > org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> > >         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > Method)
> > >         at
> > >
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > >         at
> > >
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > >         at
> > java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > >         at
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> > >         at
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> > >         at
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> > >         at
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> > >         at java.lang.Thread.run(Thread.java:619)
> > > 2009-11-19 08:09:39,980 INFO
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> event,
> > > state: Disconnected, type: None, path: null
> > > 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn:
> Attempting
> > > connection to server superpyxis0005/192.168.100.119:2181
> > > 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn: Priming
> > > connection to java.nio.channels.SocketChannel[connected local=/
> > > 192.168.100.122:40659 remote=superpyxis0005/192.168.100.119:2181]
> > >
> > >
> > >
> > > Best,
> > > zhenyu
> > >
> > > On Wed, Nov 18, 2009 at 6:52 PM, stack <stack@duboce.net> wrote:
> > >
> > >> Patrick Hunt, one of the zk lads, answered with the following:
> > >>
> > >>
> > >>
> > >>
> > >> According to the following snippet (I cut to just session
> > >> 0x425035c48210002)
> > >> the client is getting pings for this session id regularly every 6
> > seconds
> > >> or
> > >> so, then after 41,016 there's a 20 second gap (corresponds to a 30
> > second
> > >> timeout being set during session creation) at which point we
> disconnect
> > >> from
> > >> the server.
> > >>
> > >>
> > >> > 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> > response for sessionid:0x425035c48210002 after 0ms
> > >> > 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn:
> > Exception
> > >> > closing session 0x425035c48210002 to
> > >> sun.nio.ch.SelectionKeyImpl@421690ab
> > >> > java.io.IOException: TIMED OUT
> > >>
> > >> the client notifies hbase of the disco, and attempts to reconnect to a
> > >> server
> > >>
> > >>
> > >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > event,
> > >> > state: Disconnected, type: None, path: null
> > >> > 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
> > Attempting
> > >> > connection to server superpyxis0003/192.168.100.117:2181
> > >>
> > >> but the server won't allow the session be be established for some
> reason
> > >> (is
> > >> that the right server:port that's it's trying to connect to?)
> > >>
> > >>
> > >> > 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Server
> > >> > connection successful
> > >> > 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn:
> > Exception
> > >> > closing session 0x425035c48210002 to
> > >> sun.nio.ch.SelectionKeyImpl@5c07076b
> > >> > java.io.IOException: Read error rc = -1
> > java.nio.DirectByteBuffer[pos=0
> > >> > lim=4 cap=4]
> > >>
> > >> The "read error rc = -1" means that the client read from the socket
> and
> > >> got
> > >> EOS. But it's not a session expiration afaict (and should not be at
> this
> > >> point in time).
> > >>
> > >>
> > >> You need to look at the ZK server logs for this period in time (for
> this
> > >> server superpyxis0003:2181, and superpyxis0001:2181, the client tries
> to
> > >> connect to both of these). There should be some indication there why
> the
> > >> server is closing down the connection and not allowing the session to
> be
> > >> re-established.
> > >>
> > >> One thought - by default we limit clients to a max of 10 connections
> > from
> > >> a
> > >> particular host to the server. If you exceed this limit the zk server
> > >> (it's
> > >> per server though, not per cluster, so perhaps this is not it) will
> not
> > >> allow any  additional connections from the client host (you can't
> exceed
> > >> 10
> > >> on a server). Could this be it? i.e. the user is running a number of
> ZK
> > >> clients (say hbase processes) from the same host? Regardless, get the
> > >> server
> > >> log for this time period and it should be more clear.
> > >>
> > >> Patrick
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Nov 18, 2009 at 11:18 AM, stack <stack@duboce.net> wrote:
> > >>
> > >> > And running stat on each of your zookeeper nodes tells you what
> about
> > >> > averages for connect-times?
> > >> >
> > >> > Its odd that there is a ping just 20 seconds beforehand and you have
> > set
> > >> > the zk session timeout at 60 seconds (or ten minutes?).  Let me ask
> > the
> > >> zk
> > >> > lads about it.
> > >> >
> > >> > If you look in the GC log for around this time, do you see anything?
> > >> >
> > >> > St.Ack
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Nov 18, 2009 at 7:32 AM, Zhenyu Zhong <
> > zhongresearch@gmail.com
> > >> >wrote:
> > >> >
> > >> >> After running more experiment, it seems that the disconnection is
> not
> > >> >> related to the heavy load job, because the load average is low, the
> > >> diskio
> > >> >> is normal, memory Heap was not reached, also the virtual memory
> stats
> > >> >> shows
> > >> >> that no swappings.  However, the disconnection still happens.
> > >> >>
> > >> >> It looks like this time it pauses for 20 seconds. No idea why
> > >> regionserver
> > >> >> disconnected.
> > >> >>
> > >> >> Any other suggestions please?
> > >> >>
> > >> >>
> > >> >> 2009-11-18 12:41:00,956 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x125035c495d02bb after 0ms
> > >> >> 2009-11-18 12:41:08,074 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x425035c48210002 after 0ms
> > >> >> 2009-11-18 12:41:14,302 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x125035c495d02bb after 0ms
> > >> >> 2009-11-18 12:41:21,433 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x425035c48210002 after 0ms
> > >> >> 2009-11-18 12:41:27,656 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x125035c495d02bb after 0ms
> > >> >> 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x425035c48210002 after 0ms
> > >> >> 2009-11-18 12:41:41,016 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > ping
> > >> >> response for sessionid:0x125035c495d02bb after 1ms
> > >> >> 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn:
> > Exception
> > >> >> closing session 0x425035c48210002 to
> > >> sun.nio.ch.SelectionKeyImpl@421690ab
> > >> >> java.io.IOException: TIMED OUT
> > >> >>        at
> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > >> >> 2009-11-18 12:42:01,582 INFO
> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > >> event,
> > >> >> state: Disconnected, type: None, path: null
> > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
> > >> Attempting
> > >> >> connection to server superpyxis0003/192.168.100.117:2181
> > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
> Priming
> > >> >> connection to java.nio.channels.SocketChannel[connected local=/
> > >> >> 192.168.100.132:46610 remote=superpyxis0003/192.168.100.117:2181]
> > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
> Server
> > >> >> connection successful
> > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn:
> > Exception
> > >> >> closing session 0x425035c48210002 to
> > >> sun.nio.ch.SelectionKeyImpl@56dc6fac
> > >> >> java.io.IOException: Read error rc = -1
> > java.nio.DirectByteBuffer[pos=0
> > >> >> lim=4 cap=4]
> > >> >>        at
> > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > >> >>        at
> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn:
> > Ignoring
> > >> >> exception during shutdown input
> > >> >> java.net.SocketException: Transport endpoint is not connected
> > >> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > >> >>        at
> > >> >>
> > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
> > >> >>        at
> > >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> > >> >>         at
> > >> >>
> > org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
> > >> >>        at
> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn:
> > Ignoring
> > >> >> exception during shutdown output
> > >> >> java.net.SocketException: Transport endpoint is not connected
> > >> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > >> >>        at
> > >> >>
> > sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
> > >> >>        at
> > >> sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
> > >> >>        at
> > >> >>
> > >>
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
> > >> >>        at
> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn:
> > >> Attempting
> > >> >> connection to server superpyxis0001/192.168.100.115:2181
> > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn:
> Priming
> > >> >> connection to java.nio.channels.SocketChannel[connected local=/
> > >> >> 192.168.100.132:36197 remote=superpyxis0001/192.168.100.115:2181]
> > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn:
> Server
> > >> >> connection successful
> > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn:
> > Exception
> > >> >> closing session 0x425035c48210002 to
> > >> sun.nio.ch.SelectionKeyImpl@5c07076b
> > >> >> java.io.IOException: Read error rc = -1
> > java.nio.DirectByteBuffer[pos=0
> > >> >> lim=4 cap=4]
> > >> >>        at
> > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > >> >>        at
> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn:
> > Ignoring
> > >> >> exception during shutdown input
> > >> >> java.net.SocketException: Transport endpoint is not connected
> > >> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > >> >>        at
> > >> >>
> > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
> > >> >>        at
> > >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> > >> >>         at
> > >> >>
> > org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
> > >> >>        at
> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn:
> > Ignoring
> > >> >> exception during shutdown output
> > >> >> java.net.SocketException: Transport endpoint is not connected
> > >> >>
> > >> >>
> > >> >> On Mon, Nov 16, 2009 at 4:16 PM, Zhenyu Zhong <
> > zhongresearch@gmail.com
> > >> >> >wrote:
> > >> >>
> > >> >> > Here is the diskIO and CPU around the time we had RS
> disconnection
> > on
> > >> >> one
> > >> >> > machine that runs RegionServer. It doesn't seem to be high.
> Similar
> > >> disk
> > >> >> and
> > >> >> > cpu usage have been seen before and the HBase was running fine.
> > >> >> >
> > >> >> >
> > >> >> > So far I haven't found why my 10 minutes session timeout doesn't
> > >> apply.
> > >> >> > Still digging.
> > >> >> >
> > >> >> > Many thanks!
> > >> >> >
> > >> >> > Time: 03:01:08 PM
> > >> >> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >> >> >            3.82    0.00    0.89    3.97    0.00   91.32
> > >> >> >
> > >> >> >
> > >> >> > Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read
> > >> Blk_wrtn
> > >> >> > sda              24.30      1605.39      6679.35 6238623388
> > >> 25956251720
> > >> >> > sda1              0.51         1.94        15.71    7547265
> > >> 61067904
> > >> >> > sda2              0.13         2.11         5.20    8202079
> > >> 20189232
> > >> >> > sda3              0.13         0.25         1.63     987846
> > >>  6323336
> > >> >> > sda4              0.00         0.00         0.00          4
> > >>  0
> > >> >> > sda5              1.78         0.06        24.14     220010
> > >> 93817208
> > >> >> > sda6              0.38         0.42         8.41    1630584
> > >> 32688152
> > >> >> > sda7             21.38      1600.61      6624.26 6220035272
> > >> 25742165888
> > >> >> > sdb               4.52       767.33       565.91 2981868690
> > >> 2199132380
> > >> >> > sdb1              4.52       767.33       565.91 2981866755
> > >> 2199132372
> > >> >> > sdc               4.42       742.95       563.00 2887151482
> > >> 2187823092
> > >> >> > sdc1              4.42       742.95       563.00 2887149547
> > >> 2187823084
> > >> >> > sdd               4.49       750.78       557.25 2917554074
> > >> 2165513500
> > >> >> > sdd1              4.49       750.78       557.25 2917552139
> > >> 2165513492
> > >> >> > sde               4.52       758.51       569.46 2947593394
> > >> 2212964236
> > >> >> > sde1              4.52       758.51       569.46 2947591459
> > >> 2212964228
> > >> >> > sdf               4.51       747.22       571.78 2903740266
> > >> 2221972652
> > >> >> > sdf1              4.51       747.22       571.78 2903738331
> > >> 2221972644
> > >> >> >
> > >> >> > Best,
> > >> >> > zhenyu
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > On Mon, Nov 16, 2009 at 4:07 PM, stack <stack@duboce.net> wrote:
> > >> >> >
> > >> >> >> On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <
> > >> >> zhongresearch@gmail.com
> > >> >> >> >wrote:
> > >> >> >>
> > >> >> >> > I just realized that there was a MapReduce job running during
> > the
> > >> >> time
> > >> >> >> the
> > >> >> >> > regionserver disconnected from the zookeeper.
> > >> >> >> > That MapReduce Job was processing 500GB data and took about 8
> > >> minutes
> > >> >> to
> > >> >> >> > finish. It launched over 2000 map tasks.
> > >> >> >>
> > >> >> >>
> > >> >> >> There was a tasktracker running on the RegionServer that
> > >> disconnected?
> > >> >>  Is
> > >> >> >> the map i/o or cpu heavy?  Do you think it could have stole life
> > >> from
> > >> >> the
> > >> >> >> datanode/regionserver?
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> > I doubt that this introduced
> > >> >> >> > resource contention between DataNode and RegionServer Node.
> > >> >> >> > Also during the time that MapReduce job ran, I saw a few
> errors
> > >> >> >> indicating
> > >> >> >> > that
> > >> >> >> >
> > >> >> >> > java.io.IOException: Could not obtain block:
> > >> >> >> > blk_7661247556283580226_4450462
> file=/data/xxx.txt.200910141159
> > >> >> >> >        at
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
> > >> >> >> >        at
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
> > >> >> >> >        at
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
> > >> >> >> >        at
> java.io.DataInputStream.read(DataInputStream.java:83)
> > >> >> >> >        at
> > >> >> >> org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> > >> >> >> >        at
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> > org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
> > >> >> >> >        at
> > >> >> >> >
> > >> >>
> > >>
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
> > >> >> >> >        at
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> > >> >> >> >        at
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> > >> >> >> >        at
> > >> org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> > >> >> >> >        at
> > >> >> >> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > >> >> >> >        at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > >> >> >> >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > >> >> >> >
> > >> >> >>
> > >> >> >> This is interesting.  Your map task is running into hdfs-127.
> >  Yeah,
> > >> >> you
> > >> >> >> should patch your cluster if you want to get rid of these
> > (hdfs-127
> > >> has
> > >> >> >> been
> > >> >> >> applied to 0.21 hadoop and will be in the next hadoop release on
> > >> 0.20.x
> > >> >> >> branch, hadoop 0.20.2).
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> >
> > >> >> >> > Possibly it is related to HDFS-127, but I don't see any
> > datanodes
> > >> >> went
> > >> >> >> > down. Does that affect Regionserver? Shall we apply the patch?
> > >> >> >> >
> > >> >> >> >
> > >> >> >> Read up on hdfs-127 for explaination of whats going on.
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> > Now I start to keep tracking of the virtual memory stats to
> see
> > if
> > >> >> the
> > >> >> >> same
> > >> >> >> > issue happens tomorrow around the same time period.
> > >> >> >> >
> > >> >> >>
> > >> >> >> Excellent.  Thanks for doing this.
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> > And I highly suspect that this particular MapReduce job hurts
> > >> HBase
> > >> >> >> > Regionserver.
> > >> >> >> >
> > >> >> >> > PS. I also use gcviewer to parse the GC-log, I only see around
> > 30
> > >> >> >> seconds
> > >> >> >> > pauses maximum.
> > >> >> >> >
> > >> >> >>
> > >> >> >>
> > >> >> >> Did you figure why your ten minute session is not applying?
> > >> >> >>
> > >> >> >> St.Ack
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> >
> > >> >> >> > Thanks a lot.
> > >> >> >> > zhenyu
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >
> > >> >> >> > On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <
> > >> >> zhongresearch@gmail.com
> > >> >> >> > >wrote:
> > >> >> >> >
> > >> >> >> > > J-D,
> > >> >> >> > >
> > >> >> >> > > Thank you very much for your comments.
> > >> >> >> > > My company block the IRC port, so I have trouble to connect
> to
> > >> IRC
> > >> >> >> > channel.
> > >> >> >> > > I have been trying to ask the IT to open the IRC port for
> me,
> > it
> > >> >> might
> > >> >> >> > take
> > >> >> >> > > a while.
> > >> >> >> > >
> > >> >> >> > > Best,
> > >> >> >> > > zhenyu
> > >> >> >> > >
> > >> >> >> > >
> > >> >> >> > > On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans <
> > >> >> >> jdcryans@apache.org
> > >> >> >> > >wrote:
> > >> >> >> > >
> > >> >> >> > >> The error you are getting is a disconnection from a
> zookeeper
> > >> >> server
> > >> >> >> > >> and is very generic.
> > >> >> >> > >>
> > >> >> >> > >> ZK-86 is still opened and the last comment refers to ZK-111
> > >> saying
> > >> >> >> > >> that the bug (in unit tests) was probably fixed in release
> > >> 3.0.0
> > >> >> last
> > >> >> >> > >> year.
> > >> >> >> > >>
> > >> >> >> > >> To figure the hang you have, you can try to jstack the
> > process
> > >> pid
> > >> >> >> and
> > >> >> >> > >> see exactly what's holding the RS from shutting down.
> > >> >> >> > >>
> > >> >> >> > >> Would it be possible for you to drop by the IRC channel?
> This
> > >> way
> > >> >> we
> > >> >> >> > >> can debug this a at much faster pace.
> > >> >> >> > >>
> > >> >> >> > >> Thx!
> > >> >> >> > >>
> > >> >> >> > >> J-D
> > >> >> >> > >>
> > >> >> >> > >> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <
> > >> >> >> zhongresearch@gmail.com>
> > >> >> >> > >> wrote:
> > >> >> >> > >> > I found this.
> > >> >> >> > >> > http://issues.apache.org/jira/browse/ZOOKEEPER-86
> > >> >> >> > >> >
> > >> >> >> > >> > It looks like the same error I had. Is it a zookeeper
> bug?
> > >> When
> > >> >> >> will
> > >> >> >> > >> HBase
> > >> >> >> > >> > take the zookeeper version 3.3.0?
> > >> >> >> > >> >
> > >> >> >> > >> > thanks
> > >> >> >> > >> > zhenyu
> > >> >> >> > >> >
> > >> >> >> > >> > On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <
> > >> >> >> > zhongresearch@gmail.com
> > >> >> >> > >> >wrote:
> > >> >> >> > >> >
> > >> >> >> > >> >> Now I really doubt about the zookeeper, from the log I
> saw
> > >> >> errors
> > >> >> >> > like
> > >> >> >> > >> >> IOException Read Error, while the zookeeper client
> > >> >> (regionserver )
> > >> >> >> > >> tried to
> > >> >> >> > >> >> read. But it got disconneted status from zookeeper.
> > >> >> >> > >> >>
> > >> >> >> > >> >> I don't see any load on any zookeeper quorum servers.
> > DiskIO
> > >> is
> > >> >> >> fine.
> > >> >> >> > >> >>
> > >> >> >> > >> >> Also when the regionserver decides to exit due to the
> > >> >> disconnect
> > >> >> >> > status
> > >> >> >> > >> >> from zookeeper, sometimes the regionserver hangs during
> > the
> > >> >> >> exiting.
> > >> >> >> > We
> > >> >> >> > >> can
> > >> >> >> > >> >> still see the HRegionServer process even we don't see it
> > in
> > >> the
> > >> >> >> > Master
> > >> >> >> > >> web
> > >> >> >> > >> >> interface.
> > >> >> >> > >> >>
> > >> >> >> > >> >> I also notice that there is an zookeeper.retries and
> > >> >> >> zookeeper.pauses
> > >> >> >> > >> >> settings in hbase-default.xml, would that matter ?
> > >> >> >> > >> >>
> > >> >> >> > >> >> thanks
> > >> >> >> > >> >>
> > >> >> >> > >> >> You may see the IOException error in the following log
> > when
> > >> the
> > >> >> >> > >> >> Regionserver lost connection to the zookeeper.
> > >> >> >> > >> >>
> > >> >> >> > >> >> 2009-11-14 15:58:31,993 INFO
> > >> >> >> > org.apache.hadoop.hbase.regionserver.HLog:
> > >> >> >> > >> >> Roll
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921,
> > >> >> >> > >> >> entries=1062964, calcsize=255014936, filesize=139566093.
> > New
> > >> >> hlog
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
> > >> >> >> > >> >> 2009-11-14 15:58:39,295 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> >> > >> >> 2009-11-14 15:58:52,648 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> >> > >> >> 2009-11-14 15:59:06,007 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> >> > >> >> 2009-11-14 15:59:19,365 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> >> > >> >> 2009-11-14 15:59:32,867 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 186ms
> > >> >> >> > >> >> 2009-11-14 15:59:46,070 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> >> > >> >> 2009-11-14 15:59:59,423 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> >> > >> >> 2009-11-14 16:00:12,775 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > >> >> >> > >> >> 2009-11-14 16:00:28,141 DEBUG
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> Got
> > >> >> >> > ping
> > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 2010ms
> > >> >> >> > >> >> 2009-11-14 16:01:10,378 WARN
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > Exception
> > >> >> >> > >> >> closing session 0x424f0ed01ea00a6 to
> > >> >> >> > >> sun.nio.ch.SelectionKeyImpl@39fbb2d6
> > >> >> >> > >> >>
> > >> >> >> > >> >> java.io.IOException: Read error rc = -1
> > >> >> >> > java.nio.DirectByteBuffer[pos=0
> > >> >> >> > >> >> lim=4 cap=4]
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >>
> > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >>
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > >> >> >> > >> >> 2009-11-14 16:01:10,478 INFO
> > >> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
> > >> >> ZooKeeper
> > >> >> >> > >> event,
> > >> >> >> > >> >> state: Disconnected, type: None, path: null
> > >> >> >> > >> >> 2009-11-14 16:01:11,333 WARN
> > >> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> > >> Attempt=1
> > >> >> >> > >> >>
> > >> >> >> > >> >> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> > >> >> >> > >> >>         at
> > >> >> >> > >>
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > >> >> >> > >> >> Method)
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > >> >> >> > >> >>         at
> > >> >> >> > >>
> > java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> > >> >> >> > >> >>         at java.lang.Thread.run(Thread.java:619)
> > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> Attempting
> > >> >> >> > >> >> connection to server superpyxis0001/
> 192.168.100.115:2181
> > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> Priming
> > >> >> >> > >> >> connection to java.nio.channels.SocketChannel[connected
> > >> local=/
> > >> >> >> > >> >> 192.168.100.122:59575 remote=superpyxis0001/
> > >> >> 192.168.100.115:2181]
> > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> Server
> > >> >> >> > >> >> connection successful
> > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
> > >> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
> > >> >> ZooKeeper
> > >> >> >> > >> event,
> > >> >> >> > >> >> state: Expired, type: None, path: null
> > >> >> >> > >> >> 2009-11-14 16:01:11,433 WARN
> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > Exception
> > >> >> >> > >> >> closing session 0x424f0ed01ea00a6 to
> > >> >> >> > >> sun.nio.ch.SelectionKeyImpl@6b93d343
> > >> >> >> > >> >> java.io.IOException: Session Expired
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >>
> > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
> > >> >> >> > >> >>
> > >> >> >> > >> >>         at
> > >> >> >> > >> >>
> > >> >> >>
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > >> >> >> > >> >> 2009-11-14 16:01:11,433 ERROR
> > >> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> > >> ZooKeeper
> > >> >> >> session
> > >> >> >> > >> >> expired
> > >> >> >> > >> >> 2009-11-14 16:01:13,513 INFO
> > >> org.apache.hadoop.ipc.HBaseServer:
> > >> >> >> IPC
> > >> >> >> > >> Server
> > >> >> >> > >> >> handler 3 on 60021, call put([B@73deb204,
> > >> >> >> > >> >> [Lorg.apache.hadoop.hbase.client.Put;@2179600a) from
> > >> >> >> > >> 192.168.100.132:40728:
> > >> >> >> > >> >> error: java.io.IOException: Server not running, aborting
> > >> >> >> > >> >> java.io.IOException: Server not running, aborting
> > >> >> >> > >> >>
> > >> >> >> > >> >>
> > >> >> >> > >> >>
> > >> >> >> > >> >>
> > >> >> >> > >> >>
> > >> >> >> > >> >> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <
> > >> >> >> > zhongresearch@gmail.com
> > >> >> >> > >> >wrote:
> > >> >> >> > >> >>
> > >> >> >> > >> >>> Stack,
> > >> >> >> > >> >>>
> > >> >> >> > >> >>> Thanks a lot!
> > >> >> >> > >> >>>
> > >> >> >> > >> >>> I found out that the reason why HBase doesn't take the
> > >> system
> > >> >> >> file
> > >> >> >> > >> >>> descriptor value I set before. I started the HBase
> using
> > >> root
> > >> >> >> > instead
> > >> >> >> > >> of the
> > >> >> >> > >> >>> normal hadoop user, while my system configures hadoop
> > with
> > >> >> higher
> > >> >> >> > file
> > >> >> >> > >> >>> descriptor value but configures root with 1024 default
> > >> value.
> > >> >> >> > >> >>>
> > >> >> >> > >> >>>
> > >> >> >> > >> >>> Now my system has a clean start. However, it seems that
> > the
> > >> >> >> > >> >>> zookeeper.session.timeout value doesn't take into
> effect.
> > I
> > >> >> still
> > >> >> >> > >> found
> > >> >> >> > >> >>> around 60 seconds pauses from the disconnected
> > >> regionserver. I
> > >> >> >> > really
> > >> >> >> > >> don't
> > >> >> >> > >> >>> know why regionserver only times out after 60 seconds
> > >> instead
> > >> >> of
> > >> >> >> 10
> > >> >> >> > >> minutes
> > >> >> >> > >> >>> which I set for zookeeper.session.timeout.
> > >> >> >> > >> >>>
> > >> >> >> > >> >>> Is there any other timeout value coming into play
> before
> > >> the
> > >> >> >> actual
> > >> >> >> > >> >>> session times out?
> > >> >> >> > >> >>>
> > >> >> >> > >> >>> zhenyu
> > >> >> >> > >> >>>
> > >> >> >> > >> >>>
> > >> >> >> > >> >>>
> > >> >> >> > >> >>> On Fri, Nov 13, 2009 at 7:08 PM, stack <
> stack@duboce.net
> > >
> > >> >> wrote:
> > >> >> >> > >> >>>
> > >> >> >> > >> >>>> Ok.  Lack of file descriptors manifests in all kinds
> of
> > >> weird
> > >> >> >> ways.
> > >> >> >> > >> >>>> Hopefully thats it.  If not, lets keep digging.
> > >> >> >> > >> >>>> St.Ack
> > >> >> >> > >> >>>>
> > >> >> >> > >> >>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <
> > >> >> >> > >> zhongresearch@gmail.com
> > >> >> >> > >> >>>> >wrote:
> > >> >> >> > >> >>>>
> > >> >> >> > >> >>>> > Stack,
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>> > You are right, the master started with ulimit -n
> 1024.
> > >> It
> > >> >> >> doesn't
> > >> >> >> > >> take
> > >> >> >> > >> >>>> the
> > >> >> >> > >> >>>> > system value.
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>> > Regarding to the too many open files, it looks like
> > the
> > >> >> same
> > >> >> >> as
> > >> >> >> > the
> > >> >> >> > >> one
> > >> >> >> > >> >>>> J-D
> > >> >> >> > >> >>>> > put up. But I will get the Master start with higher
> > >> value
> > >> >> >> first
> > >> >> >> > and
> > >> >> >> > >> see
> > >> >> >> > >> >>>> if
> > >> >> >> > >> >>>> > this kind of error goes away.
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>> > thanks a lot!
> > >> >> >> > >> >>>> > zhenyu
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>> > On Fri, Nov 13, 2009 at 6:02 PM, stack <
> > >> stack@duboce.net>
> > >> >> >> wrote:
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>> > > Does it say
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> > > ulimit -n 32768
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> > > ...as the second line in your log file on start of
> > the
> > >> >> >> master?
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> > > Is the stack trace that complains about too many
> > open
> > >> >> files
> > >> >> >> > same
> > >> >> >> > >> as
> > >> >> >> > >> >>>> the
> > >> >> >> > >> >>>> > one
> > >> >> >> > >> >>>> > > in the blog post J-D put up?
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> > > St.Ack
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> > > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <
> > >> >> >> > >> >>>> zhongresearch@gmail.com
> > >> >> >> > >> >>>> > > >wrote:
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> > > > The ulimit file descriptors was set to
> fs.file-max
> > >> >> >> =1578334,
> > >> >> >> > >> also
> > >> >> >> > >> >>>> in
> > >> >> >> > >> >>>> > the
> > >> >> >> > >> >>>> > > > limits.conf the value was set to 32768.
> > >> >> >> > >> >>>> > > > So these are way higher than the open
> descriptors
> > >> for
> > >> >> the
> > >> >> >> > >> running
> > >> >> >> > >> >>>> > > > processes.
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > > > thanks
> > >> >> >> > >> >>>> > > > zhenyu
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <
> > >> >> >> saint.ack@gmail.com>
> > >> >> >> > >> >>>> wrote:
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > > > > You upped the ulimit file descriptors as per
> the
> > >> >> getting
> > >> >> >> > >> started
> > >> >> >> > >> >>>> doc?
> > >> >> >> > >> >>>> > > > >
> > >> >> >> > >> >>>> > > > >
> > >> >> >> > >> >>>> > > > >
> > >> >> >> > >> >>>> > > > > On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <
> > >> >> >> > >> >>>> zhongresearch@gmail.com>
> > >> >> >> > >> >>>> > > > wrote:
> > >> >> >> > >> >>>> > > > >
> > >> >> >> > >> >>>> > > > >  Thanks a lot.
> > >> >> >> > >> >>>> > > > >>
> > >> >> >> > >> >>>> > > > >>
> > >> >> >> > >> >>>> > > > >> Bad news is my kernel is still 2.6.26.
> > >> >> >> > >> >>>> > > > >> But it was not a problem before.
> > >> >> >> > >> >>>> > > > >>
> > >> >> >> > >> >>>> > > > >> Very strange.
> > >> >> >> > >> >>>> > > > >>
> > >> >> >> > >> >>>> > > > >> zhenyu
> > >> >> >> > >> >>>> > > > >>
> > >> >> >> > >> >>>> > > > >> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel
> > >> Cryans
> > >> >> <
> > >> >> >> > >> >>>> > > > jdcryans@apache.org
> > >> >> >> > >> >>>> > > > >> >wrote:
> > >> >> >> > >> >>>> > > > >>
> > >> >> >> > >> >>>> > > > >>  Looks like
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>> J-D
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu
> Zhong
> > <
> > >> >> >> > >> >>>> > > zhongresearch@gmail.com
> > >> >> >> > >> >>>> > > > >
> > >> >> >> > >> >>>> > > > >>> wrote:
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> Hi,
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> After I re-organize the cluster, the
> > experiment
> > >> >> ran
> > >> >> >> into
> > >> >> >> > >> >>>> problem
> > >> >> >> > >> >>>> > > > faster
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> than
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> before.
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> Basically the changes are to use machines
> > with
> > >> >> less
> > >> >> >> > >> resources
> > >> >> >> > >> >>>> as
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> zookeeper
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> quorums and machines with more resources as
> > >> >> >> > regionserver.
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> From the log, I still see the pause around
> 1
> > >> >> minute.
> > >> >> >> > >> >>>> > > > >>>> I enabled the GC logging,  please see
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> http://pastebin.com/m1d4ce0f1
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> for details.
> > >> >> >> > >> >>>> > > > >>>> In general I don't see many pauses in the
> GC.
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> What is more interesting, 1 hour after the
> > 1st
> > >> >> >> > >> regionserver
> > >> >> >> > >> >>>> > > > >>>> disconnected,
> > >> >> >> > >> >>>> > > > >>>> the master log complained about too many
> open
> > >> >> files.
> > >> >> >> > This
> > >> >> >> > >> >>>> didn't
> > >> >> >> > >> >>>> > > > happen
> > >> >> >> > >> >>>> > > > >>>> before.
> > >> >> >> > >> >>>> > > > >>>> I checked the system OS setup as well as
> the
> > >> >> >> > limits.conf.
> > >> >> >> > >> I
> > >> >> >> > >> >>>> also
> > >> >> >> > >> >>>> > > > checked
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> the
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> running processes. The total open files
> don't
> > >> >> reach
> > >> >> >> the
> > >> >> >> > >> limit.
> > >> >> >> > >> >>>> I
> > >> >> >> > >> >>>> > am
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> confused
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> a bit.
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> Please see the following master log.
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> > >> >> >> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> > >> >> >> > >> >>>> > > > >>>> RegionManager.metaScanner scan of 4658
> row(s)
> > >> of
> > >> >> meta
> > >> >> >> > >> region
> > >> >> >> > >> >>>> > > {server:
> > >> >> >> > >> >>>> > > > >>>> 192.168.100.128:60021, regionname:
> > .META.,,1,
> > >> >> >> startKey:
> > >> >> >> > >> <>}
> > >> >> >> > >> >>>> > > complete
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> > >> >> >> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> All
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> 1 .META. region(s) scanned
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:07,677 DEBUG
> > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> Got
> > >> >> >> > >> >>>> > > > ping
> > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c
> > after
> > >> 3ms
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> java.io.IOException:
> > >> Bad
> > >> >> >> connect
> > >> >> >> > >> ack
> > >> >> >> > >> >>>> with
> > >> >> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_-2808245019291145247_5478039
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> java.io.EOFException
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_1074853606841896259_5478048
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:10,334 DEBUG
> > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> Got
> > >> >> >> > >> >>>> > > > ping
> > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c
> after
> > >> 1ms
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:21,018 DEBUG
> > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> Got
> > >> >> >> > >> >>>> > > > ping
> > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c
> > after
> > >> 0ms
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:23,674 DEBUG
> > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> Got
> > >> >> >> > >> >>>> > > > ping
> > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c
> after
> > >> 0ms
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> java.io.IOException:
> > >> Bad
> > >> >> >> connect
> > >> >> >> > >> ack
> > >> >> >> > >> >>>> with
> > >> >> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_-6642544517082142289_5478063
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > >> java.net.SocketException:
> > >> >> Too
> > >> >> >> > many
> > >> >> >> > >> >>>> open
> > >> >> >> > >> >>>> > > files
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_2057511041109796090_5478063
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > >> java.net.SocketException:
> > >> >> Too
> > >> >> >> > many
> > >> >> >> > >> >>>> open
> > >> >> >> > >> >>>> > > files
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_8219260302213892894_5478064
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > >> java.net.SocketException:
> > >> >> Too
> > >> >> >> > many
> > >> >> >> > >> >>>> open
> > >> >> >> > >> >>>> > > files
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_1669205542853067709_5478235
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > >> java.net.SocketException:
> > >> >> Too
> > >> >> >> > many
> > >> >> >> > >> >>>> open
> > >> >> >> > >> >>>> > > files
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_9128897691346270351_5478237
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > >> java.net.SocketException:
> > >> >> Too
> > >> >> >> > many
> > >> >> >> > >> >>>> open
> > >> >> >> > >> >>>> > > files
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_1116845144864123018_5478240
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:34,372 DEBUG
> > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> Got
> > >> >> >> > >> >>>> > > > ping
> > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c
> > after
> > >> 0ms
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,034 DEBUG
> > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> Got
> > >> >> >> > >> >>>> > > > ping
> > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c
> after
> > >> 0ms
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> DataStreamer
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> Exception: java.io.IOException: Too many
> open
> > >> >> files
> > >> >> >> > >> >>>> > > > >>>>      at sun.nio.ch.IOUtil.initPipe(Native
> > >> Method)
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > >
> > >> >> >> sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>>
> > >> >> >>
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>>
> java.io.DataOutputStream.write(DataOutputStream.java:90)
> > >> >> >> > >> >>>> > > > >>>>      at
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > Error
> > >> >> >> > >> >>>> > > > >>>> Recovery for block
> > >> blk_8148813491785406356_5478475
> > >> >> >> bad
> > >> >> >> > >> >>>> datanode[0]
> > >> >> >> > >> >>>> > > > >>>> 192.168.100.123:50010
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > Error
> > >> >> >> > >> >>>> > > > >>>> Recovery for block
> > >> blk_8148813491785406356_5478475
> > >> >> in
> > >> >> >> > >> pipeline
> > >> >> >> > >> >>>> > > > >>>> 192.168.100.123:50010,
> 192.168.100.134:50010
> > ,
> > >> >> >> > >> >>>> > 192.168.100.122:50010
> > >> >> >> > >> >>>> > > :
> > >> >> >> > >> >>>> > > > >>>> bad
> > >> >> >> > >> >>>> > > > >>>> datanode 192.168.100.123:50010
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > Exception
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>> in
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > >> java.net.SocketException:
> > >> >> Too
> > >> >> >> > many
> > >> >> >> > >> >>>> open
> > >> >> >> > >> >>>> > > files
> > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > >> >> >> > >> >>>> > > > >>>> Abandoning
> > >> >> >> > >> >>>> > > > >>>> block blk_2119727700857186236_5478498
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu
> Zhong
> > <
> > >> >> >> > >> >>>> > > > zhongresearch@gmail.com
> > >> >> >> > >> >>>> > > > >>>> wrote:
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>  Will do.
> > >> >> >> > >> >>>> > > > >>>>>
> > >> >> >> > >> >>>> > > > >>>>> thanks
> > >> >> >> > >> >>>> > > > >>>>> zhenyu
> > >> >> >> > >> >>>> > > > >>>>>
> > >> >> >> > >> >>>> > > > >>>>>
> > >> >> >> > >> >>>> > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <
> > >> >> >> > stack@duboce.net
> > >> >> >> > >> >
> > >> >> >> > >> >>>> wrote:
> > >> >> >> > >> >>>> > > > >>>>>
> > >> >> >> > >> >>>> > > > >>>>>  Enable GC logging too on this next run
> (see
> > >> >> >> > >> hbase-env.sh).
> > >> >> >> > >> >>>>  Lets
> > >> >> >> > >> >>>> > > try
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>> and
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> get
> > >> >> >> > >> >>>> > > > >>>>>> to the bottom of whats going on.
> > >> >> >> > >> >>>> > > > >>>>>> Thanks,
> > >> >> >> > >> >>>> > > > >>>>>> St.Ack
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu
> > >> Zhong <
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>> zhongresearch@gmail.com
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> wrote:
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>  I can switch the boxes that run
> zookeeper
> > >> with
> > >> >> the
> > >> >> >> > ones
> > >> >> >> > >> >>>> that
> > >> >> >> > >> >>>> > run
> > >> >> >> > >> >>>> > > > >>>>>>> regionservers.
> > >> >> >> > >> >>>> > > > >>>>>>> I will see the results later.
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> FYI. The node does have the 10 minutes
> > >> >> >> > >> >>>> > zookeeper.session.timeout
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>> value
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> in
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> place.
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> thanks
> > >> >> >> > >> >>>> > > > >>>>>>> zhenyu
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <
> > >> >> >> > >> stack@duboce.net>
> > >> >> >> > >> >>>> > wrote:
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>  On Thu, Nov 12, 2009 at 11:50 AM,
> Zhenyu
> > >> Zhong
> > >> >> <
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> zhongresearch@gmail.com
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> wrote:
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> In my cluster, half of the cluster
> have
> > 2
> > >> >> disks
> > >> >> >> > 400GB
> > >> >> >> > >> >>>> each
> > >> >> >> > >> >>>> > per
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> machine,
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> and
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> half of the cluster have 6 disks per
> > >> machine.
> > >> >> >> >  Maybe
> > >> >> >> > >> we
> > >> >> >> > >> >>>> > should
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> run
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> zookeeper
> > >> >> >> > >> >>>> > > > >>>>>>>>> on the machines with 2 disks and RS on
> > >> >> machines
> > >> >> >> > with
> > >> >> >> > >> 6
> > >> >> >> > >> >>>> disks?
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> That would make most sense only in the
> > >> below,
> > >> >> it
> > >> >> >> > >> looks
> > >> >> >> > >> >>>> like
> > >> >> >> > >> >>>> > the
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> RS
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> that
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> had
> > >> >> >> > >> >>>> > > > >>>>>>>> issue had 4 disks?
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>  BTW, the 10 minutes
> > >> zookeeper.session.timeout
> > >> >> >> has
> > >> >> >> > >> been
> > >> >> >> > >> >>>> set
> > >> >> >> > >> >>>> > > during
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> the
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> experiment.
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> And for sure this node had it in place?
> > >> >> >> > >> >>>> > > > >>>>>>>> St.Ack
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> thanks
> > >> >> >> > >> >>>> > > > >>>>>>>>> zhenyu
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack
> <
> > >> >> >> > >> stack@duboce.net
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>> > > wrote:
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>  On Thu, Nov 12, 2009 at 8:40 AM,
> Zhenyu
> > >> >> Zhong <
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> zhongresearch@gmail.com
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> wrote:
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>  Though I experienced 2 regionserver
> > >> >> >> disconnection
> > >> >> >> > >> this
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> morning,
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> it
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> looks
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> better from the regionserver log.
> > (please
> > >> >> see
> > >> >> >> the
> > >> >> >> > >> >>>> following
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> log)
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> http://pastebin.com/m496dbfae
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> I checked diskIO for one of the
> > >> >> >> > >> >>>> > regionserver(192.168.100.116)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> during
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> the
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> time it disconnected.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Time: 03:04:58 AM
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Device:            tps   Blk_read/s
> > >> >> >> Blk_wrtn/s
> > >> >> >> > >> >>>> Blk_read
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> Blk_wrtn
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> sda             105.31      5458.83
> > >> >> 19503.64
> > >> >> >> > >> >>>> 9043873239
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> 32312473676
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> sda1              2.90         6.64
> > >> >>  99.25
> > >> >> >> > >> >>>> 10993934
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> 164433464
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> sda2              1.72        23.76
> > >> >>  30.25
> > >> >> >> > >> >>>> 39365817
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> 50124008
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> sda3              0.30         0.38
> > >> >> 3.58
> > >> >> >> > >> >>>> 624930
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> 5923000
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> sda4            100.39      5428.06
> > >> >> 19370.56
> > >> >> >> > >> >>>> 8992888270
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> 32091993204
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>  Is this high for you?  20k
> > >> blocks/second
> > >> >> >> would
> > >> >> >> > >> seem to
> > >> >> >> > >> >>>> be
> > >> >> >> > >> >>>> > > high
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> but
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> its
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> one
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> disk only and its not being shared by
> > zk
> > >> >> >> anymore
> > >> >> >> > so
> > >> >> >> > >> >>>> > shouldn't
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> matter?
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>  I also checked the zookeeper quorum
> > >> server
> > >> >> >> that
> > >> >> >> > the
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> regionserver
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> tried
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> to
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> connect according to the log.
> However,
> > I
> > >> >> don't
> > >> >> >> see
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> 192.168.100.116
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> in
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> the
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> client list of the zookeeper quorum
> > >> server
> > >> >> that
> > >> >> >> > the
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> regionserver
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> tried
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> to
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> connect.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Would that be a problem?
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>  Is that because the ephemeral node
> > for
> > >> the
> > >> >> >> > >> >>>> regionserver
> > >> >> >> > >> >>>> > had
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> evaporated?
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> Lost
> > >> >> >> > >> >>>> > > > >>>>>>>>>> its lease w/ zk by the time you went
> to
> > >> >> look?
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>  Thu Nov 12 15:04:52 UTC 2009
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Zookeeper version: 3.2.1-808558,
> built
> > >> on
> > >> >> >> > >> 08/27/2009
> > >> >> >> > >> >>>> 18:48
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> GMT
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> Clients:
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.127:43045
> > >> >> >> > >> [1](queued=0,recved=26,sent=0)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.131:39091
> > >> >> >> > >> [1](queued=0,recved=964,sent=0)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.124:35961
> > >> >> >> > >> [1](queued=0,recved=3266,sent=0)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.123:47935
> > >> >> >> > >> [1](queued=0,recved=233,sent=0)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.125:46931
> > >> >> >> > >> [1](queued=0,recved=2,sent=0)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:54924
> > >> >> >> > >> [1](queued=0,recved=295,sent=0)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:41390
> > >> >> >> > >> [1](queued=0,recved=2290,sent=0)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.136:42243
> > >> >> >> > >> [1](queued=0,recved=0,sent=0)
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Latency min/avg/max: 0/17/6333
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Received: 47111
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Sent: 0
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Outstanding: 0
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Zxid: 0x77000083f4
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Mode: leader
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Node count: 23
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> That 6 second maximum latency is
> pretty
> > >> bad
> > >> >> but
> > >> >> >> > >> should
> > >> >> >> > >> >>>> be
> > >> >> >> > >> >>>> > well
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> within
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> the
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> zk
> > >> >> >> > >> >>>> > > > >>>>>>>>>> session timeout.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> So, problem is likely on the zk
> client
> > >> side
> > >> >> of
> > >> >> >> the
> > >> >> >> > >> >>>> session;
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> i.e.
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> in
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> the
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> RS.
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> You could enable GC logging as J-D
> > >> suggested
> > >> >> to
> > >> >> >> > see
> > >> >> >> > >> if
> > >> >> >> > >> >>>> you
> > >> >> >> > >> >>>> > > have
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> any
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> big
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> pauses, pauses > zk session timeout.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> When the RS went down, it didn't look
> > too
> > >> >> >> heavily
> > >> >> >> > >> >>>> loaded:
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:04:52,830 INFO
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> >  org.apache.hadoop.hbase.regionserver.HRegionServer:
> > >> >> >> > >> >>>> Dump of
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> metrics:
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>  request=1.5166667, regions=322,
> > >> stores=657,
> > >> >> >> > >> >>>> storefiles=631,
> > >> >> >> > >> >>>> > > > >>>>>>>>>>  storefileIndexSize=61,
> > >> memstoreSize=1472,
> > >> >> >> > >> >>>> usedHeap=2819,
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> maxHeap=4079,
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>  blockCacheSize=658110960,
> > >> >> >> > blockCacheFree=197395984,
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> blockCacheCount=9903,
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>  blockCacheHitRatio=99
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> Its serving a few reads?  The number
> of
> > >> >> store
> > >> >> >> > files
> > >> >> >> > >> >>>> seems
> > >> >> >> > >> >>>> > > fine.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> Not
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> too
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> much memory used.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> Looking at the logs, I see the Lease
> > >> Still
> > >> >> Held
> > >> >> >> > >> >>>> exception.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> This
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> happens
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> when the RS does its regular report to
> > the
> > >> >> >> master
> > >> >> >> > but
> > >> >> >> > >> the
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> master
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> thinks
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> the
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> RS has since restarted.  It'll think
> > this
> > >> >> >> probably
> > >> >> >> > >> >>>> because
> > >> >> >> > >> >>>> > it
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> noticed
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> that
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> the RS's znode in zk had gone away
> and
> > it
> > >> >> >> > considered
> > >> >> >> > >> the
> > >> >> >> > >> >>>> RS
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> dead.
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> Looking too at your  logs I see this
> > gap
> > >> in
> > >> >> the
> > >> >> >> zk
> > >> >> >> > >> >>>> pinging:
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:03:39,325 DEBUG
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> Got
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>  ping response for
> > >> sessionid:0x224e55436ad0004
> > >> >> >> after
> > >> >> >> > >> 0ms
> > >> >> >> > >> >>>> > > > >>>>>>>>>>  2. 2009-11-12 15:03:43,113 DEBUG
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> Got
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>  ping response for
> > >> sessionid:0x24e55436a0007d
> > >> >> >> after
> > >> >> >> > >> 0ms
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> Where in the lines above it, its
> > >> reporting
> > >> >> >> about
> > >> >> >> > >> every
> > >> >> >> > >> >>>> ten
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> seconds,
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> here
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> there is a big gap.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> Do you have ganglia or something that
> > >> will
> > >> >> let
> > >> >> >> you
> > >> >> >> > >> look
> > >> >> >> > >> >>>> more
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> into
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> what
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> was
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> happening on this machine around this
> > >> time?
> > >> >>  Is
> > >> >> >> > the
> > >> >> >> > >> >>>> machine
> > >> >> >> > >> >>>> > > OK?
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> It
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> looks
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> lightly loaded and you have your
> cluster
> > >> >> nicely
> > >> >> >> > laid
> > >> >> >> > >> out.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> Something
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> odd
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> is
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> going on.  What about things like the
> > >> write
> > >> >> >> speed
> > >> >> >> > to
> > >> >> >> > >> >>>> disk?
> > >> >> >> > >> >>>> >  In
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> the
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> past
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> strange issues have been explained by
> > >> >> incorrectly
> > >> >> >> > set
> > >> >> >> > >> BIOS
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> which
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> made
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> disks
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> run at 1/100th of their proper speed.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> St.Ack
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> Best,
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> zhenyu
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM,
> > Zhenyu
> > >> >> Zhong
> > >> >> >> <
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> zhongresearch@gmail.com
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> wrote:
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>  Stack
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> I am very appreciated for your
> > >> comments.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> I will use the zookeeper monitoring
> > >> script
> > >> >> on
> > >> >> >> my
> > >> >> >> > >> >>>> cluster
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> and
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>> let
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>> it
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>> run
> > >> >> >> > >> >>>> > > > >>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>> overnight to see the result.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> I will follow up the thread when I
> > get
> > >> >> >> anything.
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> thanks
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> zhenyu
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM,
> > stack
> > >> <
> > >> >> >> > >> >>>> stack@duboce.net>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>> wrote:
> > >> >> >> > >> >>>> > > > >>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>  I see these in your log too:
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>  1. 2009-11-11 04:27:19,018 DEBUG
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> Got
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> > >> >> sessionid:0x424dfd908c50009
> > >> >> >> > >> after
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> 4544ms
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>  2. 2009-11-11 04:27:19,018 DEBUG
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> Got
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> > >> >> sessionid:0x24dfd90c810002
> > >> >> >> > after
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> 4548ms
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>  3. 2009-11-11 04:27:43,960 DEBUG
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>> Got
> > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> > >> >> sessionid:0x424dfd908c50009
> > >> >> >> > >> after
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> 9030ms
> > >> >> >> > >> >>>> > > > >>>
> > >> >> >> > >> >>>> > > > >>>>  4. 2009-11-11 04:27:43,960 DEBUG
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.
> > >> >> >> > >> >>>> > > > >>>>>>>>
> > >> >> >> > >> >>>> > > > >>>>>>>
> > >> >> >> > >> >>>> > > >
> > >> >> >> > >> >>>> > >
> > >> >> >> > >> >>>> >
> > >> >> >> > >> >>>>
> > >> >> >> > >> >>>
> > >> >> >> > >> >>>
> > >> >> >> > >> >>
> > >> >> >> > >> >
> > >> >> >> > >>
> > >> >> >> > >
> > >> >> >> > >
> > >> >> >> >
> > >> >> >>
> > >> >> >
> > >> >> >
> > >> >>
> > >> >
> > >> >
> > >>
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message