hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: regionserver disconnection
Date Wed, 18 Nov 2009 19:18:07 GMT
And what does running stat on each of your zookeeper nodes tell you about
average connect times?
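
Something like the following against each quorum member should do it, assuming
the default client port of 2181 and that nc is installed (ZK_HOST is just a
placeholder for each quorum host):

  echo stat | nc ZK_HOST 2181 | grep Latency

The "Latency min/avg/max" line is the one of interest.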

It's odd that there is a ping just 20 seconds beforehand when you have set the
zk session timeout at 60 seconds (or is it ten minutes?).  Let me ask the zk
lads about it.
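
(One thing worth checking in the meantime: the quorum negotiates the session
timeout on the server side and caps it at roughly 20 x tickTime, so unless
tickTime was raised along with zookeeper.session.timeout, a ten-minute ask
comes back as something closer to a minute.  A rough check, assuming HBase
manages your quorum and the usual config location:

  grep -A 1 tickTime conf/hbase-site.xml

or look in zoo.cfg if you run zookeeper yourself.)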

If you look in the GC log around this time, do you see anything?
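
(A quick first pass, assuming GC logging went to the file -Xloggc points at in
hbase-env.sh and that -XX:+PrintGCDetails is on; adjust the path to wherever
your GC log actually lands:

  grep "Full GC" logs/gc-hbase.log
  grep "real=" logs/gc-hbase.log | tail -100

Any single pause longer than the session timeout would be the smoking gun.)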

St.Ack



On Wed, Nov 18, 2009 at 7:32 AM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:

> After running more experiments, it seems that the disconnection is not
> related to the heavy-load job: the load average is low, the disk I/O is
> normal, the memory heap limit was not reached, and the virtual memory stats
> show no swapping.  However, the disconnection still happens.
>
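> (By these checks I mean plain samples taken around the disconnect window,
> something along the lines of
>
>   vmstat 5
>   iostat 5
>
> watching si/so for swapping and %iowait for disk pressure.)
>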
> It looks like this time it paused for about 20 seconds.  No idea why the
> regionserver disconnected.
>
> Any other suggestions please?
>
>
> 2009-11-18 12:41:00,956 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x125035c495d02bb after 0ms
> 2009-11-18 12:41:08,074 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x425035c48210002 after 0ms
> 2009-11-18 12:41:14,302 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x125035c495d02bb after 0ms
> 2009-11-18 12:41:21,433 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x425035c48210002 after 0ms
> 2009-11-18 12:41:27,656 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x125035c495d02bb after 0ms
> 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x425035c48210002 after 0ms
> 2009-11-18 12:41:41,016 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x125035c495d02bb after 1ms
> 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x425035c48210002 to sun.nio.ch.SelectionKeyImpl@421690ab
> java.io.IOException: TIMED OUT
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2009-11-18 12:42:01,582 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server superpyxis0003/192.168.100.117:2181
> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Priming
> connection to java.nio.channels.SocketChannel[connected local=/
> 192.168.100.132:46610 remote=superpyxis0003/192.168.100.117:2181]
> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Server
> connection successful
> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x425035c48210002 to sun.nio.ch.SelectionKeyImpl@56dc6fac
> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0
> lim=4 cap=4]
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Ignoring
> exception during shutdown input
> java.net.SocketException: Transport endpoint is not connected
>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>        at
> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>         at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Ignoring
> exception during shutdown output
> java.net.SocketException: Transport endpoint is not connected
>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>        at
> sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
>        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server superpyxis0001/192.168.100.115:2181
> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Priming
> connection to java.nio.channels.SocketChannel[connected local=/
> 192.168.100.132:36197 remote=superpyxis0001/192.168.100.115:2181]
> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Server
> connection successful
> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x425035c48210002 to sun.nio.ch.SelectionKeyImpl@5c07076b
> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0
> lim=4 cap=4]
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Ignoring
> exception during shutdown input
> java.net.SocketException: Transport endpoint is not connected
>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>        at
> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>         at
> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Ignoring
> exception during shutdown output
> java.net.SocketException: Transport endpoint is not connected
>
>
> On Mon, Nov 16, 2009 at 4:16 PM, Zhenyu Zhong <zhongresearch@gmail.com
> >wrote:
>
> > Here are the disk I/O and CPU stats around the time we had the RS
> > disconnection, on one machine that runs a RegionServer.  They don't seem
> > high; similar disk and CPU usage has been seen before while HBase was
> > running fine.
> >
> >
> > So far I haven't found why my 10-minute session timeout doesn't apply.
> > Still digging.
> >
> > Many thanks!
> >
> > Time: 03:01:08 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >            3.82    0.00    0.89    3.97    0.00   91.32
> >
> >
> > Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> > sda              24.30      1605.39      6679.35 6238623388 25956251720
> > sda1              0.51         1.94        15.71    7547265   61067904
> > sda2              0.13         2.11         5.20    8202079   20189232
> > sda3              0.13         0.25         1.63     987846    6323336
> > sda4              0.00         0.00         0.00          4          0
> > sda5              1.78         0.06        24.14     220010   93817208
> > sda6              0.38         0.42         8.41    1630584   32688152
> > sda7             21.38      1600.61      6624.26 6220035272 25742165888
> > sdb               4.52       767.33       565.91 2981868690 2199132380
> > sdb1              4.52       767.33       565.91 2981866755 2199132372
> > sdc               4.42       742.95       563.00 2887151482 2187823092
> > sdc1              4.42       742.95       563.00 2887149547 2187823084
> > sdd               4.49       750.78       557.25 2917554074 2165513500
> > sdd1              4.49       750.78       557.25 2917552139 2165513492
> > sde               4.52       758.51       569.46 2947593394 2212964236
> > sde1              4.52       758.51       569.46 2947591459 2212964228
> > sdf               4.51       747.22       571.78 2903740266 2221972652
> > sdf1              4.51       747.22       571.78 2903738331 2221972644
> >
> > Best,
> > zhenyu
> >
> >
> >
> > On Mon, Nov 16, 2009 at 4:07 PM, stack <stack@duboce.net> wrote:
> >
> >> On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <zhongresearch@gmail.com
> >> >wrote:
> >>
> >> > I just realized that there was a MapReduce job running at the time the
> >> > regionserver disconnected from the zookeeper.  That MapReduce job was
> >> > processing 500GB of data, took about 8 minutes to finish, and launched
> >> > over 2000 map tasks.
> >>
> >>
> >> There was a tasktracker running on the RegionServer that disconnected?  Is
> >> the map i/o or cpu heavy?  Do you think it could have stolen life from the
> >> datanode/regionserver?
> >>
> >>
> >>
> >> > I suspect that this introduced resource contention between the DataNode
> >> > and the RegionServer node.
> >> > Also during the time that MapReduce job ran, I saw a few errors
> >> indicating
> >> > that
> >> >
> >> > java.io.IOException: Could not obtain block:
> >> > blk_7661247556283580226_4450462 file=/data/xxx.txt.200910141159
> >> >        at
> >> >
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
> >> >        at
> >> >
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
> >> >        at
> >> >
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
> >> >        at java.io.DataInputStream.read(DataInputStream.java:83)
> >> >        at
> >> org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> >> >        at
> >> >
> >>
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
> >> >        at
> >> >
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
> >> >        at
> >> >
> >>
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >> >        at
> >> >
> >>
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >> >        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >> >        at
> >> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >> >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >> >
> >>
> >> This is interesting.  Your map task is running into hdfs-127.  Yeah, you
> >> should patch your cluster if you want to get rid of these (hdfs-127 has
> >> been
> >> applied to 0.21 hadoop and will be in the next hadoop release on 0.20.x
> >> branch, hadoop 0.20.2).
> >>
> >>
> >>
> >> >
> >> > Possibly it is related to HDFS-127, but I don't see any datanodes going
> >> > down.  Does that affect the RegionServer?  Shall we apply the patch?
> >> >
> >> >
> >> Read up on hdfs-127 for an explanation of what's going on.
> >>
> >>
> >>
> >> > Now I have started tracking the virtual memory stats to see if the same
> >> > issue happens tomorrow around the same time period.
> >> >
> >>
> >> Excellent.  Thanks for doing this.
> >>
> >>
> >>
> >> > And I highly suspect that this particular MapReduce job hurts the HBase
> >> > RegionServer.
> >> >
> >> > PS. I also used gcviewer to parse the GC log; the longest pauses I see
> >> > are around 30 seconds.
> >> >
> >>
> >>
> >> Did you figure out why your ten-minute session timeout is not applying?
> >>
> >> St.Ack
> >>
> >>
> >>
> >> >
> >> > Thanks a lot.
> >> > zhenyu
> >> >
> >> >
> >> >
> >> > On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <
> zhongresearch@gmail.com
> >> > >wrote:
> >> >
> >> > > J-D,
> >> > >
> >> > > Thank you very much for your comments.
> >> > > My company blocks the IRC port, so I have trouble connecting to the
> >> > > IRC channel.  I have been asking IT to open the IRC port for me; it
> >> > > might take a while.
> >> > >
> >> > > Best,
> >> > > zhenyu
> >> > >
> >> > >
> >> > > On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans <
> >> jdcryans@apache.org
> >> > >wrote:
> >> > >
> >> > >> The error you are getting is a disconnection from a zookeeper
> server
> >> > >> and is very generic.
> >> > >>
> >> > >> ZK-86 is still open and the last comment refers to ZK-111 saying
> >> > >> that the bug (in unit tests) was probably fixed in release 3.0.0
> last
> >> > >> year.
> >> > >>
> >> > >> To figure out the hang you have, you can try to jstack the process
> >> > >> pid and see exactly what's keeping the RS from shutting down.
> >> > >>
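> >> > >> (Roughly, assuming the RS java process still shows up in jps, something
> >> > >> like
> >> > >>
> >> > >>   jstack $(jps | awk '/HRegionServer/ {print $1}') > rs-stack.txt
> >> > >>
> >> > >> and then look at which threads are stuck in the shutdown path.)
> >> > >>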
> >> > >> Would it be possible for you to drop by the IRC channel?  This way we
> >> > >> can debug this at a much faster pace.
> >> > >>
> >> > >> Thx!
> >> > >>
> >> > >> J-D
> >> > >>
> >> > >> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <
> >> zhongresearch@gmail.com>
> >> > >> wrote:
> >> > >> > I found this.
> >> > >> > http://issues.apache.org/jira/browse/ZOOKEEPER-86
> >> > >> >
> >> > >> > It looks like the same error I had.  Is it a zookeeper bug?  When
> >> > >> > will HBase pick up zookeeper 3.3.0?
> >> > >> >
> >> > >> > thanks
> >> > >> > zhenyu
> >> > >> >
> >> > >> > On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <
> >> > zhongresearch@gmail.com
> >> > >> >wrote:
> >> > >> >
> >> > >> >> Now I really suspect zookeeper: in the log I saw errors like
> >> > >> >> "IOException Read error" while the zookeeper client (the
> >> > >> >> regionserver) tried to read, and then it got a Disconnected status
> >> > >> >> from zookeeper.
> >> > >> >>
> >> > >> >> I don't see any load on any zookeeper quorum servers. DiskIO is
> >> fine.
> >> > >> >>
> >> > >> >> Also, when the regionserver decides to exit due to the Disconnected
> >> > >> >> status from zookeeper, sometimes the regionserver hangs while
> >> > >> >> exiting.  We can still see the HRegionServer process even though we
> >> > >> >> don't see it in the Master web interface.
> >> > >> >>
> >> > >> >> I also notice that there are zookeeper.retries and zookeeper.pauses
> >> > >> >> settings in hbase-default.xml; would those matter?
> >> > >> >>
> >> > >> >> thanks
> >> > >> >>
> >> > >> >> You may see the IOException error in the following log when the
> >> > >> >> Regionserver lost connection to the zookeeper.
> >> > >> >>
> >> > >> >> 2009-11-14 15:58:31,993 INFO
> >> > org.apache.hadoop.hbase.regionserver.HLog:
> >> > >> >> Roll
> >> > >> >>
> >> > >>
> >> >
> >>
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921,
> >> > >> >> entries=1062964, calcsize=255014936, filesize=139566093. New
> hlog
> >> > >> >>
> >> > >>
> >> >
> >>
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
> >> > >> >> 2009-11-14 15:58:39,295 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> > >> >> 2009-11-14 15:58:52,648 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> > >> >> 2009-11-14 15:59:06,007 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> > >> >> 2009-11-14 15:59:19,365 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> > >> >> 2009-11-14 15:59:32,867 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 186ms
> >> > >> >> 2009-11-14 15:59:46,070 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> > >> >> 2009-11-14 15:59:59,423 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> > >> >> 2009-11-14 16:00:12,775 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> >> > >> >> 2009-11-14 16:00:28,141 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> >> > ping
> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 2010ms
> >> > >> >> 2009-11-14 16:01:10,378 WARN org.apache.zookeeper.ClientCnxn:
> >> > Exception
> >> > >> >> closing session 0x424f0ed01ea00a6 to
> >> > >> sun.nio.ch.SelectionKeyImpl@39fbb2d6
> >> > >> >>
> >> > >> >> java.io.IOException: Read error rc = -1
> >> > java.nio.DirectByteBuffer[pos=0
> >> > >> >> lim=4 cap=4]
> >> > >> >>         at
> >> > >> >>
> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> >> > >> >>         at
> >> > >> >>
> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> >> > >> >> 2009-11-14 16:01:10,478 INFO
> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
> ZooKeeper
> >> > >> event,
> >> > >> >> state: Disconnected, type: None, path: null
> >> > >> >> 2009-11-14 16:01:11,333 WARN
> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> >> > >> >>
> >> > >> >> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> >> > >> >>         at
> >> > >> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> > >> >> Method)
> >> > >> >>         at
> >> > >> >>
> >> > >>
> >> >
> >>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> >> > >> >>         at
> >> > >> >>
> >> > >>
> >> >
> >>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >> > >> >>         at
> >> > >> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >> > >> >>         at
> >> > >> >>
> >> > >>
> >> >
> >>
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> >> > >> >>         at
> >> > >> >>
> >> > >>
> >> >
> >>
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> >> > >> >>         at
> >> > >> >>
> >> > >>
> >> >
> >>
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> >> > >> >>         at
> >> > >> >>
> >> > >>
> >> >
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> >> > >> >>         at java.lang.Thread.run(Thread.java:619)
> >> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn:
> >> > >> Attempting
> >> > >> >> connection to server superpyxis0001/192.168.100.115:2181
> >> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn:
> >> Priming
> >> > >> >> connection to java.nio.channels.SocketChannel[connected local=/
> >> > >> >> 192.168.100.122:59575 remote=superpyxis0001/
> 192.168.100.115:2181]
> >> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn:
> >> Server
> >> > >> >> connection successful
> >> > >> >> 2009-11-14 16:01:11,433 INFO
> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
> ZooKeeper
> >> > >> event,
> >> > >> >> state: Expired, type: None, path: null
> >> > >> >> 2009-11-14 16:01:11,433 WARN org.apache.zookeeper.ClientCnxn:
> >> > Exception
> >> > >> >> closing session 0x424f0ed01ea00a6 to
> >> > >> sun.nio.ch.SelectionKeyImpl@6b93d343
> >> > >> >> java.io.IOException: Session Expired
> >> > >> >>         at
> >> > >> >>
> >> > >>
> >> >
> >>
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
> >> > >> >>         at
> >> > >> >>
> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
> >> > >> >>
> >> > >> >>         at
> >> > >> >>
> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> >> > >> >> 2009-11-14 16:01:11,433 ERROR
> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper
> >> session
> >> > >> >> expired
> >> > >> >> 2009-11-14 16:01:13,513 INFO org.apache.hadoop.ipc.HBaseServer:
> >> IPC
> >> > >> Server
> >> > >> >> handler 3 on 60021, call put([B@73deb204,
> >> > >> >> [Lorg.apache.hadoop.hbase.client.Put;@2179600a) from
> >> > >> 192.168.100.132:40728:
> >> > >> >> error: java.io.IOException: Server not running, aborting
> >> > >> >> java.io.IOException: Server not running, aborting
> >> > >> >>
> >> > >> >>
> >> > >> >>
> >> > >> >>
> >> > >> >>
> >> > >> >> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <
> >> > zhongresearch@gmail.com
> >> > >> >wrote:
> >> > >> >>
> >> > >> >>> Stack,
> >> > >> >>>
> >> > >> >>> Thanks a lot!
> >> > >> >>>
> >> > >> >>> I found out the reason why HBase doesn't take the system file
> >> > >> >>> descriptor value I set before: I started HBase as root instead of
> >> > >> >>> the normal hadoop user, and my system configures the hadoop user
> >> > >> >>> with a higher file descriptor limit but leaves root at the default
> >> > >> >>> of 1024.
> >> > >> >>>
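> >> > >> >>> (A quick way to confirm what a running daemon actually got, assuming
> >> > >> >>> /proc is available on this kernel:
> >> > >> >>>
> >> > >> >>>   cat /proc/$(jps | awk '/HMaster/ {print $1}')/limits | grep 'open files'
> >> > >> >>>
> >> > >> >>> and likewise for the HRegionServer pid.)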
> >> > >> >>>
> >> > >> >>> Now my system has a clean start.  However, it seems that the
> >> > >> >>> zookeeper.session.timeout value doesn't take effect.  I still see
> >> > >> >>> pauses of around 60 seconds from the disconnected regionserver, and
> >> > >> >>> I really don't know why the regionserver times out after 60 seconds
> >> > >> >>> instead of the 10 minutes I set for zookeeper.session.timeout.
> >> > >> >>>
> >> > >> >>> Is there any other timeout value coming into play before the
> >> actual
> >> > >> >>> session times out?
> >> > >> >>>
> >> > >> >>> zhenyu
> >> > >> >>>
> >> > >> >>>
> >> > >> >>>
> >> > >> >>> On Fri, Nov 13, 2009 at 7:08 PM, stack <stack@duboce.net>
> wrote:
> >> > >> >>>
> >> > >> >>>> Ok.  Lack of file descriptors manifests in all kinds of weird
> >> > >> >>>> ways.  Hopefully that's it.  If not, let's keep digging.
> >> > >> >>>> St.Ack
> >> > >> >>>>
> >> > >> >>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <
> >> > >> zhongresearch@gmail.com
> >> > >> >>>> >wrote:
> >> > >> >>>>
> >> > >> >>>> > Stack,
> >> > >> >>>> >
> >> > >> >>>> > You are right, the master started with ulimit -n 1024. It
> >> doesn't
> >> > >> take
> >> > >> >>>> the
> >> > >> >>>> > system value.
> >> > >> >>>> >
> >> > >> >>>> > Regarding the too-many-open-files error, it looks like the same
> >> > >> >>>> > as the one J-D put up.  But I will get the Master started with a
> >> > >> >>>> > higher value first and see if this kind of error goes away.
> >> > >> >>>> >
> >> > >> >>>> > thanks a lot!
> >> > >> >>>> > zhenyu
> >> > >> >>>> >
> >> > >> >>>> > On Fri, Nov 13, 2009 at 6:02 PM, stack <stack@duboce.net>
> >> wrote:
> >> > >> >>>> >
> >> > >> >>>> > > Does it say
> >> > >> >>>> > >
> >> > >> >>>> > > ulimit -n 32768
> >> > >> >>>> > >
> >> > >> >>>> > > ...as the second line in your log file on start of the
> >> master?
> >> > >> >>>> > >
> >> > >> >>>> > > Is the stack trace that complains about too many open
> files
> >> > same
> >> > >> as
> >> > >> >>>> the
> >> > >> >>>> > one
> >> > >> >>>> > > in the blog post J-D put up?
> >> > >> >>>> > >
> >> > >> >>>> > > St.Ack
> >> > >> >>>> > >
> >> > >> >>>> > >
> >> > >> >>>> > > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <
> >> > >> >>>> zhongresearch@gmail.com
> >> > >> >>>> > > >wrote:
> >> > >> >>>> > >
> >> > >> >>>> > > > The ulimit file descriptors was set to fs.file-max
> >> =1578334,
> >> > >> also
> >> > >> >>>> in
> >> > >> >>>> > the
> >> > >> >>>> > > > limits.conf the value was set to 32768.
> >> > >> >>>> > > > So these are way higher than the open descriptors for
> the
> >> > >> running
> >> > >> >>>> > > > processes.
> >> > >> >>>> > > >
> >> > >> >>>> > > > thanks
> >> > >> >>>> > > > zhenyu
> >> > >> >>>> > > >
> >> > >> >>>> > > >
> >> > >> >>>> > > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <
> >> saint.ack@gmail.com>
> >> > >> >>>> wrote:
> >> > >> >>>> > > >
> >> > >> >>>> > > > > You upped the ulimit file descriptors as per the
> getting
> >> > >> started
> >> > >> >>>> doc?
> >> > >> >>>> > > > >
> >> > >> >>>> > > > >
> >> > >> >>>> > > > >
> >> > >> >>>> > > > > On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <
> >> > >> >>>> zhongresearch@gmail.com>
> >> > >> >>>> > > > wrote:
> >> > >> >>>> > > > >
> >> > >> >>>> > > > >  Thanks a lot.
> >> > >> >>>> > > > >>
> >> > >> >>>> > > > >>
> >> > >> >>>> > > > >> Bad news is my kernel is still 2.6.26.
> >> > >> >>>> > > > >> But it was not a problem before.
> >> > >> >>>> > > > >>
> >> > >> >>>> > > > >> Very strange.
> >> > >> >>>> > > > >>
> >> > >> >>>> > > > >> zhenyu
> >> > >> >>>> > > > >>
> >> > >> >>>> > > > >> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <
> >> > >> >>>> > > > jdcryans@apache.org
> >> > >> >>>> > > > >> >wrote:
> >> > >> >>>> > > > >>
> >> > >> >>>> > > > >>  Looks like
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >>
> >> >
> >>
> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>> J-D
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <
> >> > >> >>>> > > zhongresearch@gmail.com
> >> > >> >>>> > > > >
> >> > >> >>>> > > > >>> wrote:
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> Hi,
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> After I re-organize the cluster, the experiment ran
> >> into
> >> > >> >>>> problem
> >> > >> >>>> > > > faster
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> than
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> before.
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> Basically the changes are to use machines with less
> >> > >> resources
> >> > >> >>>> as
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> zookeeper
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> quorums and machines with more resources as
> >> > regionserver.
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> From the log, I still see the pause around 1
> minute.
> >> > >> >>>> > > > >>>> I enabled the GC logging,  please see
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> http://pastebin.com/m1d4ce0f1
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> for details.
> >> > >> >>>> > > > >>>> In general I don't see many pauses in the GC.
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> What is more interesting, 1 hour after the 1st
> >> > >> regionserver
> >> > >> >>>> > > > >>>> disconnected,
> >> > >> >>>> > > > >>>> the master log complained about too many open
> files.
> >> > This
> >> > >> >>>> didn't
> >> > >> >>>> > > > happen
> >> > >> >>>> > > > >>>> before.
> >> > >> >>>> > > > >>>> I checked the system OS setup as well as the
> >> > limits.conf.
> >> > >> I
> >> > >> >>>> also
> >> > >> >>>> > > > checked
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> the
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> running processes. The total open files don't reach
> >> the
> >> > >> limit.
> >> > >> >>>> I
> >> > >> >>>> > am
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> confused
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> a bit.
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> Please see the following master log.
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> >> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> >> > >> >>>> > > > >>>> RegionManager.metaScanner scan of 4658 row(s) of
> meta
> >> > >> region
> >> > >> >>>> > > {server:
> >> > >> >>>> > > > >>>> 192.168.100.128:60021, regionname: .META.,,1,
> >> startKey:
> >> > >> <>}
> >> > >> >>>> > > complete
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> >> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> All
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> 1 .META. region(s) scanned
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:07,677 DEBUG
> >> > >> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> Got
> >> > >> >>>> > > > ping
> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 3ms
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.io.IOException: Bad
> >> connect
> >> > >> ack
> >> > >> >>>> with
> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_-2808245019291145247_5478039
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.io.EOFException
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_1074853606841896259_5478048
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:10,334 DEBUG
> >> > >> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> Got
> >> > >> >>>> > > > ping
> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 1ms
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:21,018 DEBUG
> >> > >> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> Got
> >> > >> >>>> > > > ping
> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:23,674 DEBUG
> >> > >> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> Got
> >> > >> >>>> > > > ping
> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 0ms
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.io.IOException: Bad
> >> connect
> >> > >> ack
> >> > >> >>>> with
> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_-6642544517082142289_5478063
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException:
> Too
> >> > many
> >> > >> >>>> open
> >> > >> >>>> > > files
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_2057511041109796090_5478063
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException:
> Too
> >> > many
> >> > >> >>>> open
> >> > >> >>>> > > files
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_8219260302213892894_5478064
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException:
> Too
> >> > many
> >> > >> >>>> open
> >> > >> >>>> > > files
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_1669205542853067709_5478235
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException:
> Too
> >> > many
> >> > >> >>>> open
> >> > >> >>>> > > files
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_9128897691346270351_5478237
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException:
> Too
> >> > many
> >> > >> >>>> open
> >> > >> >>>> > > files
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_1116845144864123018_5478240
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:34,372 DEBUG
> >> > >> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> Got
> >> > >> >>>> > > > ping
> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after 0ms
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,034 DEBUG
> >> > >> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> Got
> >> > >> >>>> > > > ping
> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after 0ms
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> DataStreamer
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> Exception: java.io.IOException: Too many open files
> >> > >> >>>> > > > >>>>      at sun.nio.ch.IOUtil.initPipe(Native Method)
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > >
> >> sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >>
> >> >
> >>
> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >>
> >> >
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >>
> >> >
> >>
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >>
> >> >
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >>
> >> >
> >>
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >>
> >> >
> >>
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>
> >> > >> >>>>
> >> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> java.io.DataOutputStream.write(DataOutputStream.java:90)
> >> > >> >>>> > > > >>>>      at
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >>
> >> >
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > Error
> >> > >> >>>> > > > >>>> Recovery for block blk_8148813491785406356_5478475
> >> bad
> >> > >> >>>> datanode[0]
> >> > >> >>>> > > > >>>> 192.168.100.123:50010
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > Error
> >> > >> >>>> > > > >>>> Recovery for block blk_8148813491785406356_5478475
> in
> >> > >> pipeline
> >> > >> >>>> > > > >>>> 192.168.100.123:50010, 192.168.100.134:50010,
> >> > >> >>>> > 192.168.100.122:50010
> >> > >> >>>> > > :
> >> > >> >>>> > > > >>>> bad
> >> > >> >>>> > > > >>>> datanode 192.168.100.123:50010
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > Exception
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>> in
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> createBlockOutputStream java.net.SocketException:
> Too
> >> > many
> >> > >> >>>> open
> >> > >> >>>> > > files
> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> >> > >> org.apache.hadoop.hdfs.DFSClient:
> >> > >> >>>> > > > >>>> Abandoning
> >> > >> >>>> > > > >>>> block blk_2119727700857186236_5478498
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <
> >> > >> >>>> > > > zhongresearch@gmail.com
> >> > >> >>>> > > > >>>> wrote:
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>  Will do.
> >> > >> >>>> > > > >>>>>
> >> > >> >>>> > > > >>>>> thanks
> >> > >> >>>> > > > >>>>> zhenyu
> >> > >> >>>> > > > >>>>>
> >> > >> >>>> > > > >>>>>
> >> > >> >>>> > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <
> >> > stack@duboce.net
> >> > >> >
> >> > >> >>>> wrote:
> >> > >> >>>> > > > >>>>>
> >> > >> >>>> > > > >>>>>  Enable GC logging too on this next run (see
> >> > >> hbase-env.sh).
> >> > >> >>>>  Lets
> >> > >> >>>> > > try
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>> and
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> get
> >> > >> >>>> > > > >>>>>> to the bottom of whats going on.
> >> > >> >>>> > > > >>>>>> Thanks,
> >> > >> >>>> > > > >>>>>> St.Ack
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>> zhongresearch@gmail.com
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> wrote:
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>  I can switch the boxes that run zookeeper with
> the
> >> > ones
> >> > >> >>>> that
> >> > >> >>>> > run
> >> > >> >>>> > > > >>>>>>> regionservers.
> >> > >> >>>> > > > >>>>>>> I will see the results later.
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>> FYI. The node does have the 10 minutes
> >> > >> >>>> > zookeeper.session.timeout
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>> value
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> in
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> place.
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>> thanks
> >> > >> >>>> > > > >>>>>>> zhenyu
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <
> >> > >> stack@duboce.net>
> >> > >> >>>> > wrote:
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>  On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong
> <
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>> zhongresearch@gmail.com
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> wrote:
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> In my cluster, half of the cluster have 2
> disks
> >> > 400GB
> >> > >> >>>> each
> >> > >> >>>> > per
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>> machine,
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> and
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> half of the cluster have 6 disks per machine.
> >> >  Maybe
> >> > >> we
> >> > >> >>>> > should
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>> run
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> zookeeper
> >> > >> >>>> > > > >>>>>>>>> on the machines with 2 disks and RS on
> machines
> >> > with
> >> > >> 6
> >> > >> >>>> disks?
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> That would make most sense only in the below,
> it
> >> > >> looks
> >> > >> >>>> like
> >> > >> >>>> > the
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>> RS
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> that
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> had
> >> > >> >>>> > > > >>>>>>>> issue had 4 disks?
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>  BTW, the 10 minutes zookeeper.session.timeout
> >> has
> >> > >> been
> >> > >> >>>> set
> >> > >> >>>> > > during
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>> the
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> experiment.
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>> And for sure this node had it in place?
> >> > >> >>>> > > > >>>>>>>> St.Ack
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> thanks
> >> > >> >>>> > > > >>>>>>>>> zhenyu
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <
> >> > >> stack@duboce.net
> >> > >> >>>> >
> >> > >> >>>> > > wrote:
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>  On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong
> <
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> zhongresearch@gmail.com
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> wrote:
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>  Though I experienced 2 regionserver
> >> disconnection
> >> > >> this
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> morning,
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> it
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> looks
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> better from the regionserver log. (please see
> >> the
> >> > >> >>>> following
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> log)
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>> http://pastebin.com/m496dbfae
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>> I checked diskIO for one of the
> >> > >> >>>> > regionserver(192.168.100.116)
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> during
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> the
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> time it disconnected.
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>> Time: 03:04:58 AM
> >> > >> >>>> > > > >>>>>>>>>>> Device:            tps   Blk_read/s
> >> Blk_wrtn/s
> >> > >> >>>> Blk_read
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> Blk_wrtn
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> sda             105.31      5458.83
> 19503.64
> >> > >> >>>> 9043873239
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> 32312473676
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> sda1              2.90         6.64
>  99.25
> >> > >> >>>> 10993934
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> 164433464
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> sda2              1.72        23.76
>  30.25
> >> > >> >>>> 39365817
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> 50124008
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> sda3              0.30         0.38
> 3.58
> >> > >> >>>> 624930
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> 5923000
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> sda4            100.39      5428.06
> 19370.56
> >> > >> >>>> 8992888270
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> 32091993204
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>  Is this high for you?  20k blocks/second
> >> would
> >> > >> seem to
> >> > >> >>>> be
> >> > >> >>>> > > high
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> but
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> its
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> one
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> disk only and its not being shared by zk
> >> anymore
> >> > so
> >> > >> >>>> > shouldn't
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> matter?
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>  I also checked the zookeeper quorum server
> >> that
> >> > the
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> regionserver
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> tried
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> to
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> connect according to the log. However, I
> don't
> >> see
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> 192.168.100.116
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> in
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> the
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> client list of the zookeeper quorum server
> that
> >> > the
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> regionserver
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> tried
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> to
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> connect.
> >> > >> >>>> > > > >>>>>>>>>>> Would that be a problem?
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>  Is that because the ephemeral node for the
> >> > >> >>>> regionserver
> >> > >> >>>> > had
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> evaporated?
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> Lost
> >> > >> >>>> > > > >>>>>>>>>> its lease w/ zk by the time you went to look?
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>  Thu Nov 12 15:04:52 UTC 2009
> >> > >> >>>> > > > >>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on
> >> > >> 08/27/2009
> >> > >> >>>> 18:48
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> GMT
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> Clients:
> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.127:43045
> >> > >> [1](queued=0,recved=26,sent=0)
> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.131:39091
> >> > >> [1](queued=0,recved=964,sent=0)
> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.124:35961
> >> > >> [1](queued=0,recved=3266,sent=0)
> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.123:47935
> >> > >> [1](queued=0,recved=233,sent=0)
> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.125:46931
> >> > >> [1](queued=0,recved=2,sent=0)
> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:54924
> >> > >> [1](queued=0,recved=295,sent=0)
> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:41390
> >> > >> [1](queued=0,recved=2290,sent=0)
> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.136:42243
> >> > >> [1](queued=0,recved=0,sent=0)
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>> Latency min/avg/max: 0/17/6333
> >> > >> >>>> > > > >>>>>>>>>>> Received: 47111
> >> > >> >>>> > > > >>>>>>>>>>> Sent: 0
> >> > >> >>>> > > > >>>>>>>>>>> Outstanding: 0
> >> > >> >>>> > > > >>>>>>>>>>> Zxid: 0x77000083f4
> >> > >> >>>> > > > >>>>>>>>>>> Mode: leader
> >> > >> >>>> > > > >>>>>>>>>>> Node count: 23
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> That 6 second maximum latency is pretty bad
> but
> >> > >> should
> >> > >> >>>> be
> >> > >> >>>> > well
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> within
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> the
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> zk
> >> > >> >>>> > > > >>>>>>>>>> session timeout.
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> So, problem is likely on the zk client side
> of
> >> the
> >> > >> >>>> session;
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> i.e.
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> in
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> the
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> RS.
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> You could enable GC logging as J-D suggested
> to
> >> > see
> >> > >> if
> >> > >> >>>> you
> >> > >> >>>> > > have
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> any
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> big
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> pauses, pauses > zk session timeout.
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> When the RS went down, it didn't look too
> >> heavily
> >> > >> >>>> loaded:
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:04:52,830 INFO
> >> > >> >>>> > > > >>>>>>>>>>
> >> >  org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > >> >>>> Dump of
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> metrics:
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>>  request=1.5166667, regions=322, stores=657,
> >> > >> >>>> storefiles=631,
> >> > >> >>>> > > > >>>>>>>>>>  storefileIndexSize=61, memstoreSize=1472,
> >> > >> >>>> usedHeap=2819,
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> maxHeap=4079,
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>  blockCacheSize=658110960,
> >> > blockCacheFree=197395984,
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> blockCacheCount=9903,
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>  blockCacheHitRatio=99
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> Its serving a few reads?  The number of store
> >> > files
> >> > >> >>>> seems
> >> > >> >>>> > > fine.
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> Not
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> too
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> much memory used.
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> Looking at the logs, I see the Lease Still
> Held
> >> > >> >>>> exception.
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> This
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> happens
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> when the RS does its regular report to the
> >> master
> >> > but
> >> > >> the
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> master
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> thinks
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> the
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> RS has since restarted.  It'll think this
> >> probably
> >> > >> >>>> because
> >> > >> >>>> > it
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> noticed
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> that
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> the RS's znode in zk had gone away and it
> >> > considered
> >> > >> the
> >> > >> >>>> RS
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> dead.
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>
> >> > >> >>>> > > > >>>>>>>>>> Looking too at your  logs I see this gap in
> the
> >> zk
> >> > >> >>>> pinging:
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:03:39,325 DEBUG
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> Got
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>>  ping response for sessionid:0x224e55436ad0004
> >> after
> >> > >> 0ms
> >> > >> >>>> > > > >>>>>>>>>>  2. 2009-11-12 15:03:43,113 DEBUG
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> Got
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>>  ping response for sessionid:0x24e55436a0007d
> >> after
> >> > >> 0ms
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> Where in the lines above it, its reporting
> >> about
> >> > >> every
> >> > >> >>>> ten
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> seconds,
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> here
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> there is a big gap.
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> Do you have ganglia or something that will
> let
> >> you
> >> > >> look
> >> > >> >>>> more
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> into
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> what
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> was
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> happening on this machine around this time?
>  Is
> >> > the
> >> > >> >>>> machine
> >> > >> >>>> > > OK?
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> It
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> looks
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> lightly loaded and you have your cluster
> nicely
> >> > laid
> >> > >> out.
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> Something
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> odd
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> is
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> going on.  What about things like the write
> >> speed
> >> > to
> >> > >> >>>> disk?
> >> > >> >>>> >  In
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> the
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> past
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> strange issues have been explained by
> incorrectly
> >> > set
> >> > >> BIOS
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> which
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> made
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> disks
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> run at 1/100th of their proper speed.
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> St.Ack
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>> Best,
> >> > >> >>>> > > > >>>>>>>>>>> zhenyu
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu
> Zhong
> >> <
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> zhongresearch@gmail.com
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> wrote:
> >> > >> >>>> > > > >>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>  Stack
> >> > >> >>>> > > > >>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> I am very appreciated for your comments.
> >> > >> >>>> > > > >>>>>>>>>>>> I will use the zookeeper monitoring script
> on
> >> my
> >> > >> >>>> cluster
> >> > >> >>>> > > > >>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>> and
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>> let
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>> it
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>> run
> >> > >> >>>> > > > >>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>> overnight to see the result.
> >> > >> >>>> > > > >>>>>>>>>>>> I will follow up the thread when I get
> >> anything.
> >> > >> >>>> > > > >>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> thanks
> >> > >> >>>> > > > >>>>>>>>>>>> zhenyu
> >> > >> >>>> > > > >>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <
> >> > >> >>>> stack@duboce.net>
> >> > >> >>>> > > > >>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>> wrote:
> >> > >> >>>> > > > >>>>>>
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>>  I see these in your log too:
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>>>  1. 2009-11-11 04:27:19,018 DEBUG
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> Got
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> sessionid:0x424dfd908c50009
> >> > >> after
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> 4544ms
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>  2. 2009-11-11 04:27:19,018 DEBUG
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> Got
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> sessionid:0x24dfd90c810002
> >> > after
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> 4548ms
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>  3. 2009-11-11 04:27:43,960 DEBUG
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>>> Got
> >> > >> >>>> > > > >>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> sessionid:0x424dfd908c50009
> >> > >> after
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> 9030ms
> >> > >> >>>> > > > >>>
> >> > >> >>>> > > > >>>>  4. 2009-11-11 04:27:43,960 DEBUG
> >> > >> >>>> > > > >>>>>>>>>>>>>
> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.
> >> > >> >>>> > > > >>>>>>>>
> >> > >> >>>> > > > >>>>>>>
> >> > >> >>>> > > >
> >> > >> >>>> > >
> >> > >> >>>> >
> >> > >> >>>>
> >> > >> >>>
> >> > >> >>>
> >> > >> >>
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
