hbase-user mailing list archives

From Zhenyu Zhong <zhongresea...@gmail.com>
Subject Re: regionserver disconnection
Date Thu, 19 Nov 2009 15:53:52 GMT
After digging more into the zookeeper log, I found that there seemed to be a
socket connection timeout. Some warnings also indicate that the zookeeper
server is not running. Probably I need to understand the mechanism of
zookeeper first.

Please find the zookeeper log around that time.

http://pastebin.com/m37bb4ad1
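(As a side note on the mechanism: on the client side this boils down to a
ZooKeeper handle created with a requested session timeout, plus a Watcher that
receives the Disconnected/Expired state changes showing up in the logs above.
A minimal illustrative sketch follows; this is not the actual HBase code, and
the quorum list and timeout are just placeholders taken from this thread.)

import java.io.IOException;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch of a ZooKeeper client session, similar in spirit to what the
// regionserver does: request a session timeout and react to state changes.
public class SessionSketch implements Watcher {

  public void process(WatchedEvent event) {
    // "Disconnected" means the client lost its socket and will try another
    // quorum server; the session may still be alive on the server side.
    // "Expired" means the server gave up on the session; the client must
    // build a brand new ZooKeeper handle (this is where the RS aborts).
    System.out.println("Got ZooKeeper event, state: " + event.getState()
        + ", type: " + event.getType() + ", path: " + event.getPath());
  }

  public static void main(String[] args) throws Exception {
    // Placeholder quorum list; the requested timeout is only a proposal and
    // the server is free to negotiate it down.
    ZooKeeper zk = new ZooKeeper(
        "superpyxis0001:2181,superpyxis0003:2181,superpyxis0005:2181",
        60000, new SessionSketch());
    // Keep the process alive; the Watcher above logs SyncConnected,
    // Disconnected and Expired events as they arrive.
    Thread.sleep(60000);
    zk.close();
  }
}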

Many thanks
zhenyu


On Thu, Nov 19, 2009 at 10:31 AM, Zhenyu Zhong <zhongresearch@gmail.com>wrote:

> Stack,
>
> I really appreciate it. Let me dig into the zookeeper log more.
>
> FYI, I saw the sleeper complain about the delay for the first time. It
> looks like there are different reasons for the RS disconnections.
>
> 2009-11-19 08:08:32,708 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x42508716bfc0001 after 0ms
> 2009-11-19 08:08:37,595 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x42508716bfc0082 after 0ms
> 2009-11-19 08:08:46,073 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> response for sessionid:0x42508716bfc0001 after 0ms
> 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x42508716bfc0001 to sun.nio.ch.SelectionKeyImpl@4f3a608f
> java.io.IOException: TIMED OUT
>         at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x42508716bfc0082 to sun.nio.ch.SelectionKeyImpl@14c8875
> java.io.IOException: TIMED OUT
>         at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2009-11-19 08:09:39,880 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 53556ms, ten times longer than scheduled: 3000
> 2009-11-19 08:09:39,980 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>         at
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>         at
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>         at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>         at java.lang.Thread.run(Thread.java:619)
> 2009-11-19 08:09:39,980 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn: Attempting
> connection to server superpyxis0005/192.168.100.119:2181
> 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn: Priming
> connection to java.nio.channels.SocketChannel[connected local=/
> 192.168.100.122:40659 remote=superpyxis0005/192.168.100.119:2181]
>
>
>
> Best,
> zhenyu
>
> On Wed, Nov 18, 2009 at 6:52 PM, stack <stack@duboce.net> wrote:
>
>> Patrick Hunt, one of the zk lads, answered with the following:
>>
>>
>>
>>
>> According to the following snippet (I cut to just session
>> 0x425035c48210002)
>> the client is getting pings for this session id regularly every 6 seconds
>> or
>> so, then after 41,016 there's a 20 second gap (corresponds to a 30 second
>> timeout being set during session creation) at which point we disconnect
>> from
>> the server.
>>
>>
>> > 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> > response for sessionid:0x425035c48210002 after 0ms
>> > 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn: Exception
>> > closing session 0x425035c48210002 to
>> sun.nio.ch.SelectionKeyImpl@421690ab
>> > java.io.IOException: TIMED OUT
>>
>> the client notifies hbase of the disconnect, and attempts to reconnect to a
>> server
>>
>>
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> > state: Disconnected, type: None, path: null
>> > 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Attempting
>> > connection to server superpyxis0003/192.168.100.117:2181
>>
>> but the server won't allow the session to be established for some reason
>> (is
>> that the right server:port that it's trying to connect to?)
>>
>>
>> > 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Server
>> > connection successful
>> > 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Exception
>> > closing session 0x425035c48210002 to
>> sun.nio.ch.SelectionKeyImpl@5c07076b
>> > java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0
>> > lim=4 cap=4]
>>
>> The "read error rc = -1" means that the client read from the socket and
>> got
>> EOS. But it's not a session expiration afaict (and should not be at this
>> point in time).
>>
>>
>> You need to look at the ZK server logs for this period in time (for this
>> server superpyxis0003:2181, and superpyxis0001:2181, the client tries to
>> connect to both of these). There should be some indication there why the
>> server is closing down the connection and not allowing the session to be
>> re-established.
>>
>> One thought - by default we limit clients to a max of 10 connections from
>> a
>> particular host to the server. If you exceed this limit the zk server
>> (it's
>> per server though, not per cluster, so perhaps this is not it) will not
>> allow any  additional connections from the client host (you can't exceed
>> 10
>> on a server). Could this be it? i.e. the user is running a number of ZK
>> clients (say hbase processes) from the same host? Regardless, get the
>> server
>> log for this time period and it should be more clear.
>>
>> Patrick
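(A quick way to act on both points above -- the per-host connection counts and
the connect latencies -- is ZooKeeper's four-letter "stat" command, the same
kind of output that is quoted near the bottom of this thread. A rough Java
sketch, assuming the quorum hosts and the 2181 client port that appear in the
logs:)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

// Rough sketch: send ZooKeeper's four-letter "stat" command to each quorum
// server and print the reply. The reply lists the clients connected to that
// server (relevant to the 10-connections-per-host default) as well as
// min/avg/max latency.
public class ZkStat {
  public static void main(String[] args) throws Exception {
    String[] servers = {"superpyxis0001", "superpyxis0003", "superpyxis0005"};
    for (String host : servers) {
      Socket socket = new Socket(host, 2181);
      try {
        OutputStream out = socket.getOutputStream();
        out.write("stat".getBytes());
        out.flush();
        BufferedReader in = new BufferedReader(
            new InputStreamReader(socket.getInputStream()));
        System.out.println("=== " + host + " ===");
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // the server closes the socket when done
        }
      } finally {
        socket.close();
      }
    }
  }
}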
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Nov 18, 2009 at 11:18 AM, stack <stack@duboce.net> wrote:
>>
>> > And running stat on each of your zookeeper nodes tells you what about
>> > averages for connect-times?
>> >
>> > It's odd that there is a ping just 20 seconds beforehand and you have set
>> > the zk session timeout at 60 seconds (or ten minutes?).  Let me ask the
>> zk
>> > lads about it.
>> >
>> > If you look in the GC log for around this time, do you see anything?
>> >
>> > St.Ack
>> >
>> >
>> >
>> > On Wed, Nov 18, 2009 at 7:32 AM, Zhenyu Zhong <zhongresearch@gmail.com
>> >wrote:
>> >
>> >> After running more experiments, it seems that the disconnection is not
>> >> related to the heavy-load job: the load average is low, the disk I/O is
>> >> normal, the memory heap limit was not reached, and the virtual memory
>> >> stats show no swapping.  However, the disconnection still happens.
>> >>
>> >> It looks like this time it paused for 20 seconds. No idea why the
>> >> regionserver disconnected.
>> >>
>> >> Any other suggestions please?
>> >>
>> >>
>> >> 2009-11-18 12:41:00,956 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> >> response for sessionid:0x125035c495d02bb after 0ms
>> >> 2009-11-18 12:41:08,074 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> >> response for sessionid:0x425035c48210002 after 0ms
>> >> 2009-11-18 12:41:14,302 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> >> response for sessionid:0x125035c495d02bb after 0ms
>> >> 2009-11-18 12:41:21,433 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> >> response for sessionid:0x425035c48210002 after 0ms
>> >> 2009-11-18 12:41:27,656 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> >> response for sessionid:0x125035c495d02bb after 0ms
>> >> 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> >> response for sessionid:0x425035c48210002 after 0ms
>> >> 2009-11-18 12:41:41,016 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> >> response for sessionid:0x125035c495d02bb after 1ms
>> >> 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn: Exception
>> >> closing session 0x425035c48210002 to
>> sun.nio.ch.SelectionKeyImpl@421690ab
>> >> java.io.IOException: TIMED OUT
>> >>        at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> >> 2009-11-18 12:42:01,582 INFO
>> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
>> event,
>> >> state: Disconnected, type: None, path: null
>> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
>> Attempting
>> >> connection to server superpyxis0003/192.168.100.117:2181
>> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Priming
>> >> connection to java.nio.channels.SocketChannel[connected local=/
>> >> 192.168.100.132:46610 remote=superpyxis0003/192.168.100.117:2181]
>> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Server
>> >> connection successful
>> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Exception
>> >> closing session 0x425035c48210002 to
>> sun.nio.ch.SelectionKeyImpl@56dc6fac
>> >> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0
>> >> lim=4 cap=4]
>> >>        at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>> >>        at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Ignoring
>> >> exception during shutdown input
>> >> java.net.SocketException: Transport endpoint is not connected
>> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>> >>        at
>> >> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>> >>        at
>> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>> >>         at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
>> >>        at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
>> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Ignoring
>> >> exception during shutdown output
>> >> java.net.SocketException: Transport endpoint is not connected
>> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>> >>        at
>> >> sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
>> >>        at
>> sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
>> >>        at
>> >>
>> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
>> >>        at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
>> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn:
>> Attempting
>> >> connection to server superpyxis0001/192.168.100.115:2181
>> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Priming
>> >> connection to java.nio.channels.SocketChannel[connected local=/
>> >> 192.168.100.132:36197 remote=superpyxis0001/192.168.100.115:2181]
>> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Server
>> >> connection successful
>> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Exception
>> >> closing session 0x425035c48210002 to
>> sun.nio.ch.SelectionKeyImpl@5c07076b
>> >> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0
>> >> lim=4 cap=4]
>> >>        at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>> >>        at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Ignoring
>> >> exception during shutdown input
>> >> java.net.SocketException: Transport endpoint is not connected
>> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
>> >>        at
>> >> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>> >>        at
>> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>> >>         at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
>> >>        at
>> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
>> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Ignoring
>> >> exception during shutdown output
>> >> java.net.SocketException: Transport endpoint is not connected
>> >>
>> >>
>> >> On Mon, Nov 16, 2009 at 4:16 PM, Zhenyu Zhong <zhongresearch@gmail.com
>> >> >wrote:
>> >>
>> >> > Here are the disk I/O and CPU stats, around the time we had the RS
>> >> > disconnection, on one machine that runs a RegionServer. They don't seem
>> >> > high. Similar disk and CPU usage has been seen before while HBase was
>> >> > running fine.
>> >> >
>> >> >
>> >> > So far I haven't found why my 10-minute session timeout doesn't apply.
>> >> > Still digging.
>> >> >
>> >> > Many thanks!
>> >> >
>> >> > Time: 03:01:08 PM
>> >> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >> >            3.82    0.00    0.89    3.97    0.00   91.32
>> >> >
>> >> >
>> >> > Device:            tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
>> >> > sda              24.30      1605.39      6679.35   6238623388  25956251720
>> >> > sda1              0.51         1.94        15.71      7547265     61067904
>> >> > sda2              0.13         2.11         5.20      8202079     20189232
>> >> > sda3              0.13         0.25         1.63       987846      6323336
>> >> > sda4              0.00         0.00         0.00            4            0
>> >> > sda5              1.78         0.06        24.14       220010     93817208
>> >> > sda6              0.38         0.42         8.41      1630584     32688152
>> >> > sda7             21.38      1600.61      6624.26   6220035272  25742165888
>> >> > sdb               4.52       767.33       565.91   2981868690   2199132380
>> >> > sdb1              4.52       767.33       565.91   2981866755   2199132372
>> >> > sdc               4.42       742.95       563.00   2887151482   2187823092
>> >> > sdc1              4.42       742.95       563.00   2887149547   2187823084
>> >> > sdd               4.49       750.78       557.25   2917554074   2165513500
>> >> > sdd1              4.49       750.78       557.25   2917552139   2165513492
>> >> > sde               4.52       758.51       569.46   2947593394   2212964236
>> >> > sde1              4.52       758.51       569.46   2947591459   2212964228
>> >> > sdf               4.51       747.22       571.78   2903740266   2221972652
>> >> > sdf1              4.51       747.22       571.78   2903738331   2221972644
>> >> >
>> >> > Best,
>> >> > zhenyu
>> >> >
>> >> >
>> >> >
>> >> > On Mon, Nov 16, 2009 at 4:07 PM, stack <stack@duboce.net> wrote:
>> >> >
>> >> >> On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <
>> >> zhongresearch@gmail.com
>> >> >> >wrote:
>> >> >>
>> >> >> > I just realized that there was a MapReduce job running during the
>> >> time
>> >> >> the
>> >> >> > regionserver disconnected from the zookeeper.
>> >> >> > That MapReduce Job was processing 500GB data and took about 8
>> minutes
>> >> to
>> >> >> > finish. It launched over 2000 map tasks.
>> >> >>
>> >> >>
>> >> >> There was a tasktracker running on the RegionServer that
>> disconnected?
>> >>  Is
>> >> >> the map i/o or cpu heavy?  Do you think it could have stolen life
>> from
>> >> the
>> >> >> datanode/regionserver?
>> >> >>
>> >> >>
>> >> >>
>> >> >> > I suspect that this introduced
>> >> >> > resource contention between DataNode and RegionServer Node.
>> >> >> > Also during the time that MapReduce job ran, I saw a few errors
>> >> >> indicating
>> >> >> > that
>> >> >> >
>> >> >> > java.io.IOException: Could not obtain block:
>> >> >> > blk_7661247556283580226_4450462 file=/data/xxx.txt.200910141159
>> >> >> >        at
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
>> >> >> >        at
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
>> >> >> >        at
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
>> >> >> >        at java.io.DataInputStream.read(DataInputStream.java:83)
>> >> >> >        at
>> >> >> org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>> >> >> >        at
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>> >> >> >        at
>> >> >> >
>> >>
>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>> >> >> >        at
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>> >> >> >        at
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>> >> >> >        at
>> org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>> >> >> >        at
>> >> >> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>> >> >> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>> >> >> >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> >> >> >
>> >> >>
>> >> >> This is interesting.  Your map task is running into hdfs-127.  Yeah,
>> >> you
>> >> >> should patch your cluster if you want to get rid of these (hdfs-127
>> has
>> >> >> been
>> >> >> applied to 0.21 hadoop and will be in the next hadoop release on
>> 0.20.x
>> >> >> branch, hadoop 0.20.2).
>> >> >>
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > Possibly it is related to HDFS-127, but I don't see that any datanodes
>> >> >> > went down. Does that affect the RegionServer? Shall we apply the patch?
>> >> >> >
>> >> >> >
>> >> >> Read up on hdfs-127 for an explanation of what's going on.
>> >> >>
>> >> >>
>> >> >>
>> >> >> > Now I have started tracking the virtual memory stats to see if
>> >> the
>> >> >> same
>> >> >> > issue happens tomorrow around the same time period.
>> >> >> >
>> >> >>
>> >> >> Excellent.  Thanks for doing this.
>> >> >>
>> >> >>
>> >> >>
>> >> >> > And I highly suspect that this particular MapReduce job hurts
>> HBase
>> >> >> > Regionserver.
>> >> >> >
>> >> >> > PS. I also used gcviewer to parse the GC log; I only see pauses of
>> >> >> > around 30 seconds maximum.
>> >> >> >
>> >> >>
>> >> >>
>> >> >> Did you figure out why your ten-minute session timeout is not applying?
>> >> >>
>> >> >> St.Ack
>> >> >>
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > Thanks a lot.
>> >> >> > zhenyu
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <
>> >> zhongresearch@gmail.com
>> >> >> > >wrote:
>> >> >> >
>> >> >> > > J-D,
>> >> >> > >
>> >> >> > > Thank you very much for your comments.
>> >> >> > > My company blocks the IRC port, so I have trouble connecting to the
>> >> >> > > IRC channel. I have been asking IT to open the IRC port for me; it
>> >> >> > > might take a while.
>> >> >> > >
>> >> >> > > Best,
>> >> >> > > zhenyu
>> >> >> > >
>> >> >> > >
>> >> >> > > On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans <
>> >> >> jdcryans@apache.org
>> >> >> > >wrote:
>> >> >> > >
>> >> >> > >> The error you are getting is a disconnection from a zookeeper
>> >> server
>> >> >> > >> and is very generic.
>> >> >> > >>
>> >> >> > >> ZK-86 is still open and the last comment refers to ZK-111
>> saying
>> >> >> > >> that the bug (in unit tests) was probably fixed in release
>> 3.0.0
>> >> last
>> >> >> > >> year.
>> >> >> > >>
>> >> >> > >> To figure out the hang you have, you can try to jstack the process
>> pid
>> >> >> and
>> >> >> > >> see exactly what's holding the RS from shutting down.
>> >> >> > >>
>> >> >> > >> Would it be possible for you to drop by the IRC channel? This
>> way
>> >> we
>> >> >> > >> can debug this at a much faster pace.
>> >> >> > >>
>> >> >> > >> Thx!
>> >> >> > >>
>> >> >> > >> J-D
>> >> >> > >>
>> >> >> > >> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <
>> >> >> zhongresearch@gmail.com>
>> >> >> > >> wrote:
>> >> >> > >> > I found this.
>> >> >> > >> > http://issues.apache.org/jira/browse/ZOOKEEPER-86
>> >> >> > >> >
>> >> >> > >> > It looks like the same error I had. Is it a zookeeper bug?
>> When
>> >> >> will
>> >> >> > >> HBase
>> >> >> > >> > take the zookeeper version 3.3.0?
>> >> >> > >> >
>> >> >> > >> > thanks
>> >> >> > >> > zhenyu
>> >> >> > >> >
>> >> >> > >> > On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <
>> >> >> > zhongresearch@gmail.com
>> >> >> > >> >wrote:
>> >> >> > >> >
>> >> >> > >> >> Now I really suspect zookeeper. From the log I saw errors like
>> >> >> > >> >> IOException Read Error while the zookeeper client (regionserver)
>> >> >> > >> >> tried to read, but it got a disconnected status from zookeeper.
>> >> >> > >> >>
>> >> >> > >> >> I don't see any load on any zookeeper quorum servers. DiskIO
>> is
>> >> >> fine.
>> >> >> > >> >>
>> >> >> > >> >> Also, when the regionserver decides to exit due to the disconnect
>> >> >> > >> >> status from zookeeper, sometimes it hangs while exiting. We can
>> >> >> > >> >> still see the HRegionServer process even though we don't see it
>> >> >> > >> >> in the Master web interface.
>> >> >> > >> >>
>> >> >> > >> >> I also notice that there are zookeeper.retries and
>> >> >> > >> >> zookeeper.pauses settings in hbase-default.xml; would that matter?
>> >> >> > >> >>
>> >> >> > >> >> thanks
>> >> >> > >> >>
>> >> >> > >> >> You may see the IOException error in the following log when
>> the
>> >> >> > >> >> Regionserver lost connection to the zookeeper.
>> >> >> > >> >>
>> >> >> > >> >> 2009-11-14 15:58:31,993 INFO
>> >> >> > org.apache.hadoop.hbase.regionserver.HLog:
>> >> >> > >> >> Roll
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921,
>> >> >> > >> >> entries=1062964, calcsize=255014936, filesize=139566093. New
>> >> hlog
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
>> >> >> > >> >> 2009-11-14 15:58:39,295 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
>> >> >> > >> >> 2009-11-14 15:58:52,648 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
>> >> >> > >> >> 2009-11-14 15:59:06,007 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
>> >> >> > >> >> 2009-11-14 15:59:19,365 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
>> >> >> > >> >> 2009-11-14 15:59:32,867 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 186ms
>> >> >> > >> >> 2009-11-14 15:59:46,070 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
>> >> >> > >> >> 2009-11-14 15:59:59,423 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
>> >> >> > >> >> 2009-11-14 16:00:12,775 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
>> >> >> > >> >> 2009-11-14 16:00:28,141 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>> >> Got
>> >> >> > ping
>> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 2010ms
>> >> >> > >> >> 2009-11-14 16:01:10,378 WARN
>> org.apache.zookeeper.ClientCnxn:
>> >> >> > Exception
>> >> >> > >> >> closing session 0x424f0ed01ea00a6 to
>> >> >> > >> sun.nio.ch.SelectionKeyImpl@39fbb2d6
>> >> >> > >> >>
>> >> >> > >> >> java.io.IOException: Read error rc = -1
>> >> >> > java.nio.DirectByteBuffer[pos=0
>> >> >> > >> >> lim=4 cap=4]
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> >> >> > >> >> 2009-11-14 16:01:10,478 INFO
>> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
>> >> ZooKeeper
>> >> >> > >> event,
>> >> >> > >> >> state: Disconnected, type: None, path: null
>> >> >> > >> >> 2009-11-14 16:01:11,333 WARN
>> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> Attempt=1
>> >> >> > >> >>
>> >> >> > >> >> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>> >> >> > >> >>         at
>> >> >> > >> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> >> >> > >> >> Method)
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>> >> >> > >> >>         at
>> >> >> > >> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>> >> >> > >> >>         at java.lang.Thread.run(Thread.java:619)
>> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
>> org.apache.zookeeper.ClientCnxn:
>> >> >> > >> Attempting
>> >> >> > >> >> connection to server superpyxis0001/192.168.100.115:2181
>> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
>> org.apache.zookeeper.ClientCnxn:
>> >> >> Priming
>> >> >> > >> >> connection to java.nio.channels.SocketChannel[connected
>> local=/
>> >> >> > >> >> 192.168.100.122:59575 remote=superpyxis0001/
>> >> 192.168.100.115:2181]
>> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
>> org.apache.zookeeper.ClientCnxn:
>> >> >> Server
>> >> >> > >> >> connection successful
>> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
>> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
>> >> ZooKeeper
>> >> >> > >> event,
>> >> >> > >> >> state: Expired, type: None, path: null
>> >> >> > >> >> 2009-11-14 16:01:11,433 WARN
>> org.apache.zookeeper.ClientCnxn:
>> >> >> > Exception
>> >> >> > >> >> closing session 0x424f0ed01ea00a6 to
>> >> >> > >> sun.nio.ch.SelectionKeyImpl@6b93d343
>> >> >> > >> >> java.io.IOException: Session Expired
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>> >> >> > >> >>
>> >> >> > >> >>         at
>> >> >> > >> >>
>> >> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> >> >> > >> >> 2009-11-14 16:01:11,433 ERROR
>> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> ZooKeeper
>> >> >> session
>> >> >> > >> >> expired
>> >> >> > >> >> 2009-11-14 16:01:13,513 INFO
>> org.apache.hadoop.ipc.HBaseServer:
>> >> >> IPC
>> >> >> > >> Server
>> >> >> > >> >> handler 3 on 60021, call put([B@73deb204,
>> >> >> > >> >> [Lorg.apache.hadoop.hbase.client.Put;@2179600a) from
>> >> >> > >> 192.168.100.132:40728:
>> >> >> > >> >> error: java.io.IOException: Server not running, aborting
>> >> >> > >> >> java.io.IOException: Server not running, aborting
>> >> >> > >> >>
>> >> >> > >> >>
>> >> >> > >> >>
>> >> >> > >> >>
>> >> >> > >> >>
>> >> >> > >> >> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <
>> >> >> > zhongresearch@gmail.com
>> >> >> > >> >wrote:
>> >> >> > >> >>
>> >> >> > >> >>> Stack,
>> >> >> > >> >>>
>> >> >> > >> >>> Thanks a lot!
>> >> >> > >> >>>
>> >> >> > >> >>> I found out the reason why HBase doesn't take the system file
>> >> >> > >> >>> descriptor value I set before: I started HBase as root instead
>> >> >> > >> >>> of the normal hadoop user, and my system configures the hadoop
>> >> >> > >> >>> user with a higher file descriptor value but leaves root at the
>> >> >> > >> >>> default of 1024.
>> >> >> > >> >>>
>> >> >> > >> >>>
>> >> >> > >> >>> Now my system has a clean start. However, it seems that the
>> >> >> > >> >>> zookeeper.session.timeout value doesn't take effect. I still
>> >> >> > >> >>> found pauses of around 60 seconds on the disconnected
>> >> >> > >> >>> regionserver, and I really don't know why the regionserver times
>> >> >> > >> >>> out after only 60 seconds instead of the 10 minutes I set for
>> >> >> > >> >>> zookeeper.session.timeout.
>> >> >> > >> >>>
>> >> >> > >> >>> Is there any other timeout value coming into play before
>> the
>> >> >> actual
>> >> >> > >> >>> session times out?
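(One hedged thought on the 60-seconds-instead-of-10-minutes question: the
requested zookeeper.session.timeout is only a proposal. If I remember right,
the ZooKeeper server negotiates the final value and by default caps it at a
multiple of its tickTime, roughly 20 x tickTime, so a 10-minute request can
silently come back much smaller. The negotiated value can be read back from
the client handle once the session is established; a small sketch, with a
placeholder connect string:)

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch: request a 10-minute session and print what was actually granted.
public class NegotiatedTimeout {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    final int requested = 10 * 60 * 1000; // what the config asks for
    ZooKeeper zk = new ZooKeeper("superpyxis0001:2181", requested,
        new Watcher() {
          public void process(WatchedEvent event) {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
              connected.countDown();
            }
          }
        });
    connected.await(); // wait until the session is actually established
    System.out.println("requested=" + requested + "ms, negotiated="
        + zk.getSessionTimeout() + "ms");
    zk.close();
  }
}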
>> >> >> > >> >>>
>> >> >> > >> >>> zhenyu
>> >> >> > >> >>>
>> >> >> > >> >>>
>> >> >> > >> >>>
>> >> >> > >> >>> On Fri, Nov 13, 2009 at 7:08 PM, stack <stack@duboce.net>
>> >> wrote:
>> >> >> > >> >>>
>> >> >> > >> >>>> Ok.  Lack of file descriptors manifests in all kinds of
>> weird
>> >> >> ways.
>> >> >> > >> >>>> Hopefully that's it.  If not, let's keep digging.
>> >> >> > >> >>>> St.Ack
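(For the file descriptor angle, one way to see which limit a JVM actually
inherited, independent of what the shell's ulimit reports, is the Sun-specific
management bean; a best-effort sketch, which has to be launched the same way
the master or regionserver daemon is launched to be meaningful:)

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

// Best-effort sketch: print the file descriptor counts the current JVM is
// actually running with (Sun/Unix JVMs only).
public class FdCheck {
  public static void main(String[] args) {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    if (os instanceof UnixOperatingSystemMXBean) {
      UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
      System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
      System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
    } else {
      System.out.println("not a Unix/Sun JVM, cannot read fd counts");
    }
  }
}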
>> >> >> > >> >>>>
>> >> >> > >> >>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <
>> >> >> > >> zhongresearch@gmail.com
>> >> >> > >> >>>> >wrote:
>> >> >> > >> >>>>
>> >> >> > >> >>>> > Stack,
>> >> >> > >> >>>> >
>> >> >> > >> >>>> > You are right, the master started with ulimit -n 1024.
>> It
>> >> >> doesn't
>> >> >> > >> take
>> >> >> > >> >>>> the
>> >> >> > >> >>>> > system value.
>> >> >> > >> >>>> >
>> >> >> > >> >>>> > Regarding to the too many open files, it looks like the
>> >> same
>> >> >> as
>> >> >> > the
>> >> >> > >> one
>> >> >> > >> >>>> J-D
>> >> >> > >> >>>> > put up. But I will get the Master start with higher
>> value
>> >> >> first
>> >> >> > and
>> >> >> > >> see
>> >> >> > >> >>>> if
>> >> >> > >> >>>> > this kind of error goes away.
>> >> >> > >> >>>> >
>> >> >> > >> >>>> > thanks a lot!
>> >> >> > >> >>>> > zhenyu
>> >> >> > >> >>>> >
>> >> >> > >> >>>> > On Fri, Nov 13, 2009 at 6:02 PM, stack <
>> stack@duboce.net>
>> >> >> wrote:
>> >> >> > >> >>>> >
>> >> >> > >> >>>> > > Does it say
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> > > ulimit -n 32768
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> > > ...as the second line in your log file on start of the
>> >> >> master?
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> > > Is the stack trace that complains about too many open
>> >> files
>> >> >> > same
>> >> >> > >> as
>> >> >> > >> >>>> the
>> >> >> > >> >>>> > one
>> >> >> > >> >>>> > > in the blog post J-D put up?
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> > > St.Ack
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> > > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <
>> >> >> > >> >>>> zhongresearch@gmail.com
>> >> >> > >> >>>> > > >wrote:
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> > > > The ulimit file descriptors was set to fs.file-max
>> >> >> =1578334,
>> >> >> > >> also
>> >> >> > >> >>>> in
>> >> >> > >> >>>> > the
>> >> >> > >> >>>> > > > limits.conf the value was set to 32768.
>> >> >> > >> >>>> > > > So these are way higher than the open descriptors
>> for
>> >> the
>> >> >> > >> running
>> >> >> > >> >>>> > > > processes.
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > > > thanks
>> >> >> > >> >>>> > > > zhenyu
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <
>> >> >> saint.ack@gmail.com>
>> >> >> > >> >>>> wrote:
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > > > > You upped the ulimit file descriptors as per the
>> >> getting
>> >> >> > >> started
>> >> >> > >> >>>> doc?
>> >> >> > >> >>>> > > > >
>> >> >> > >> >>>> > > > >
>> >> >> > >> >>>> > > > >
>> >> >> > >> >>>> > > > > On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <
>> >> >> > >> >>>> zhongresearch@gmail.com>
>> >> >> > >> >>>> > > > wrote:
>> >> >> > >> >>>> > > > >
>> >> >> > >> >>>> > > > >  Thanks a lot.
>> >> >> > >> >>>> > > > >>
>> >> >> > >> >>>> > > > >>
>> >> >> > >> >>>> > > > >> Bad news is my kernel is still 2.6.26.
>> >> >> > >> >>>> > > > >> But it was not a problem before.
>> >> >> > >> >>>> > > > >>
>> >> >> > >> >>>> > > > >> Very strange.
>> >> >> > >> >>>> > > > >>
>> >> >> > >> >>>> > > > >> zhenyu
>> >> >> > >> >>>> > > > >>
>> >> >> > >> >>>> > > > >> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel
>> Cryans
>> >> <
>> >> >> > >> >>>> > > > jdcryans@apache.org
>> >> >> > >> >>>> > > > >> >wrote:
>> >> >> > >> >>>> > > > >>
>> >> >> > >> >>>> > > > >>  Looks like
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> >
>> >> >> > >> >>>>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>> J-D
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <
>> >> >> > >> >>>> > > zhongresearch@gmail.com
>> >> >> > >> >>>> > > > >
>> >> >> > >> >>>> > > > >>> wrote:
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> Hi,
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> After I re-organize the cluster, the experiment
>> >> ran
>> >> >> into
>> >> >> > >> >>>> problem
>> >> >> > >> >>>> > > > faster
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> than
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> before.
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> Basically the changes are to use machines with
>> >> less
>> >> >> > >> resources
>> >> >> > >> >>>> as
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> zookeeper
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> quorums and machines with more resources as
>> >> >> > regionserver.
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> From the log, I still see the pause around 1
>> >> minute.
>> >> >> > >> >>>> > > > >>>> I enabled the GC logging,  please see
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> http://pastebin.com/m1d4ce0f1
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> for details.
>> >> >> > >> >>>> > > > >>>> In general I don't see many pauses in the GC.
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> What is more interesting, 1 hour after the 1st
>> >> >> > >> regionserver
>> >> >> > >> >>>> > > > >>>> disconnected,
>> >> >> > >> >>>> > > > >>>> the master log complained about too many open
>> >> files.
>> >> >> > This
>> >> >> > >> >>>> didn't
>> >> >> > >> >>>> > > > happen
>> >> >> > >> >>>> > > > >>>> before.
>> >> >> > >> >>>> > > > >>>> I checked the system OS setup as well as the
>> >> >> > limits.conf.
>> >> >> > >> I
>> >> >> > >> >>>> also
>> >> >> > >> >>>> > > > checked
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> the
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> running processes. The total open files don't
>> >> reach
>> >> >> the
>> >> >> > >> limit.
>> >> >> > >> >>>> I
>> >> >> > >> >>>> > am
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> confused
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> a bit.
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> Please see the following master log.
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
>> >> >> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
>> >> >> > >> >>>> > > > >>>> RegionManager.metaScanner scan of 4658 row(s)
>> of
>> >> meta
>> >> >> > >> region
>> >> >> > >> >>>> > > {server:
>> >> >> > >> >>>> > > > >>>> 192.168.100.128:60021, regionname: .META.,,1,
>> >> >> startKey:
>> >> >> > >> <>}
>> >> >> > >> >>>> > > complete
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
>> >> >> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> All
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> 1 .META. region(s) scanned
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:07,677 DEBUG
>> >> >> > >> org.apache.zookeeper.ClientCnxn:
>> >> >> > >> >>>> Got
>> >> >> > >> >>>> > > > ping
>> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after
>> 3ms
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream java.io.IOException:
>> Bad
>> >> >> connect
>> >> >> > >> ack
>> >> >> > >> >>>> with
>> >> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_-2808245019291145247_5478039
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream java.io.EOFException
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_1074853606841896259_5478048
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:10,334 DEBUG
>> >> >> > >> org.apache.zookeeper.ClientCnxn:
>> >> >> > >> >>>> Got
>> >> >> > >> >>>> > > > ping
>> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after
>> 1ms
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:21,018 DEBUG
>> >> >> > >> org.apache.zookeeper.ClientCnxn:
>> >> >> > >> >>>> Got
>> >> >> > >> >>>> > > > ping
>> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after
>> 0ms
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:23,674 DEBUG
>> >> >> > >> org.apache.zookeeper.ClientCnxn:
>> >> >> > >> >>>> Got
>> >> >> > >> >>>> > > > ping
>> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after
>> 0ms
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream java.io.IOException:
>> Bad
>> >> >> connect
>> >> >> > >> ack
>> >> >> > >> >>>> with
>> >> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_-6642544517082142289_5478063
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream
>> java.net.SocketException:
>> >> Too
>> >> >> > many
>> >> >> > >> >>>> open
>> >> >> > >> >>>> > > files
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_2057511041109796090_5478063
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream
>> java.net.SocketException:
>> >> Too
>> >> >> > many
>> >> >> > >> >>>> open
>> >> >> > >> >>>> > > files
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_8219260302213892894_5478064
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream
>> java.net.SocketException:
>> >> Too
>> >> >> > many
>> >> >> > >> >>>> open
>> >> >> > >> >>>> > > files
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_1669205542853067709_5478235
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream
>> java.net.SocketException:
>> >> Too
>> >> >> > many
>> >> >> > >> >>>> open
>> >> >> > >> >>>> > > files
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_9128897691346270351_5478237
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream
>> java.net.SocketException:
>> >> Too
>> >> >> > many
>> >> >> > >> >>>> open
>> >> >> > >> >>>> > > files
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_1116845144864123018_5478240
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:34,372 DEBUG
>> >> >> > >> org.apache.zookeeper.ClientCnxn:
>> >> >> > >> >>>> Got
>> >> >> > >> >>>> > > > ping
>> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c after
>> 0ms
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,034 DEBUG
>> >> >> > >> org.apache.zookeeper.ClientCnxn:
>> >> >> > >> >>>> Got
>> >> >> > >> >>>> > > > ping
>> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c after
>> 0ms
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> DataStreamer
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> Exception: java.io.IOException: Too many open
>> >> files
>> >> >> > >> >>>> > > > >>>>      at sun.nio.ch.IOUtil.initPipe(Native
>> Method)
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > >
>> >> >> sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> >
>> >> >> > >> >>>>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> >
>> >> >> > >> >>>>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> >
>> >> >> > >> >>>>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> >
>> >> >> > >> >>>>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> >
>> >> >> > >> >>>>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> >
>> >> >> > >> >>>>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>>
>> >> >> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >> >> > >> >>>> > > > >>>>      at
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > >
>> >> >> > >> >>>> > >
>> >> >> > >> >>>> >
>> >> >> > >> >>>>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > Error
>> >> >> > >> >>>> > > > >>>> Recovery for block
>> blk_8148813491785406356_5478475
>> >> >> bad
>> >> >> > >> >>>> datanode[0]
>> >> >> > >> >>>> > > > >>>> 192.168.100.123:50010
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > Error
>> >> >> > >> >>>> > > > >>>> Recovery for block
>> blk_8148813491785406356_5478475
>> >> in
>> >> >> > >> pipeline
>> >> >> > >> >>>> > > > >>>> 192.168.100.123:50010, 192.168.100.134:50010,
>> >> >> > >> >>>> > 192.168.100.122:50010
>> >> >> > >> >>>> > > :
>> >> >> > >> >>>> > > > >>>> bad
>> >> >> > >> >>>> > > > >>>> datanode 192.168.100.123:50010
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > Exception
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>> in
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> createBlockOutputStream
>> java.net.SocketException:
>> >> Too
>> >> >> > many
>> >> >> > >> >>>> open
>> >> >> > >> >>>> > > files
>> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
>> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
>> >> >> > >> >>>> > > > >>>> Abandoning
>> >> >> > >> >>>> > > > >>>> block blk_2119727700857186236_5478498
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <
>> >> >> > >> >>>> > > > zhongresearch@gmail.com
>> >> >> > >> >>>> > > > >>>> wrote:
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>  Will do.
>> >> >> > >> >>>> > > > >>>>>
>> >> >> > >> >>>> > > > >>>>> thanks
>> >> >> > >> >>>> > > > >>>>> zhenyu
>> >> >> > >> >>>> > > > >>>>>
>> >> >> > >> >>>> > > > >>>>>
>> >> >> > >> >>>> > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <
>> >> >> > stack@duboce.net
>> >> >> > >> >
>> >> >> > >> >>>> wrote:
>> >> >> > >> >>>> > > > >>>>>
>> >> >> > >> >>>> > > > >>>>>  Enable GC logging too on this next run (see
>> >> >> > >> hbase-env.sh).
>> >> >> > >> >>>>  Lets
>> >> >> > >> >>>> > > try
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>> and
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> get
>> >> >> > >> >>>> > > > >>>>>> to the bottom of whats going on.
>> >> >> > >> >>>> > > > >>>>>> Thanks,
>> >> >> > >> >>>> > > > >>>>>> St.Ack
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu
>> Zhong <
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>> zhongresearch@gmail.com
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> wrote:
>> >> >> > >> >>>> > > > >>>>>>>
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>>>  I can switch the boxes that run zookeeper
>> with
>> >> the
>> >> >> > ones
>> >> >> > >> >>>> that
>> >> >> > >> >>>> > run
>> >> >> > >> >>>> > > > >>>>>>> regionservers.
>> >> >> > >> >>>> > > > >>>>>>> I will see the results later.
>> >> >> > >> >>>> > > > >>>>>>>
>> >> >> > >> >>>> > > > >>>>>>> FYI. The node does have the 10 minutes
>> >> >> > >> >>>> > zookeeper.session.timeout
>> >> >> > >> >>>> > > > >>>>>>>
>> >> >> > >> >>>> > > > >>>>>> value
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> in
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>>>> place.
>> >> >> > >> >>>> > > > >>>>>>>
>> >> >> > >> >>>> > > > >>>>>>> thanks
>> >> >> > >> >>>> > > > >>>>>>> zhenyu
>> >> >> > >> >>>> > > > >>>>>>>
>> >> >> > >> >>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <
>> >> >> > >> stack@duboce.net>
>> >> >> > >> >>>> > wrote:
>> >> >> > >> >>>> > > > >>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>  On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu
>> Zhong
>> >> <
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>> zhongresearch@gmail.com
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>>>> wrote:
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> In my cluster, half of the cluster have 2
>> >> disks
>> >> >> > 400GB
>> >> >> > >> >>>> each
>> >> >> > >> >>>> > per
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>> machine,
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>>>> and
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> half of the cluster have 6 disks per
>> machine.
>> >> >> >  Maybe
>> >> >> > >> we
>> >> >> > >> >>>> > should
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>> run
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> zookeeper
>> >> >> > >> >>>> > > > >>>>>>>>> on the machines with 2 disks and RS on
>> >> machines
>> >> >> > with
>> >> >> > >> 6
>> >> >> > >> >>>> disks?
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> That would make most sense only in the
>> below,
>> >> it
>> >> >> > >> looks
>> >> >> > >> >>>> like
>> >> >> > >> >>>> > the
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>> RS
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> that
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>>>> had
>> >> >> > >> >>>> > > > >>>>>>>> issue had 4 disks?
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>  BTW, the 10 minutes
>> zookeeper.session.timeout
>> >> >> has
>> >> >> > >> been
>> >> >> > >> >>>> set
>> >> >> > >> >>>> > > during
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>> the
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>>>> experiment.
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>> And for sure this node had it in place?
>> >> >> > >> >>>> > > > >>>>>>>> St.Ack
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> thanks
>> >> >> > >> >>>> > > > >>>>>>>>> zhenyu
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <
>> >> >> > >> stack@duboce.net
>> >> >> > >> >>>> >
>> >> >> > >> >>>> > > wrote:
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>  On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu
>> >> Zhong <
>> >> >> > >> >>>> > > > >>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> zhongresearch@gmail.com
>> >> >> > >> >>>> > > > >>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>> wrote:
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>>  Though I experienced 2 regionserver
>> >> >> disconnection
>> >> >> > >> this
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> morning,
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>> it
>> >> >> > >> >>>> > > > >>>>>>
>> >> >> > >> >>>> > > > >>>>>>> looks
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> better from the regionserver log. (please
>> >> see
>> >> >> the
>> >> >> > >> >>>> following
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> log)
>> >> >> > >> >>>> > > > >>>
>> >> >> > >> >>>> > > > >>>>
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>>> http://pastebin.com/m496dbfae
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>>> I checked diskIO for one of the
>> >> >> > >> >>>> > regionserver(192.168.100.116)
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> during
>> >> >> > >> >>>> > > > >>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>> the
>> >> >> > >> >>>> > > > >>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> time it disconnected.
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>>> Time: 03:04:58 AM
>> >> >> > >> >>>> > > > >>>>>>>>>>> Device:            tps   Blk_read/s
>> >> >> Blk_wrtn/s
>> >> >> > >> >>>> Blk_read
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> Blk_wrtn
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> sda             105.31      5458.83
>> >> 19503.64
>> >> >> > >> >>>> 9043873239
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> 32312473676
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> sda1              2.90         6.64
>> >>  99.25
>> >> >> > >> >>>> 10993934
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> 164433464
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> sda2              1.72        23.76
>> >>  30.25
>> >> >> > >> >>>> 39365817
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> 50124008
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> sda3              0.30         0.38
>> >> 3.58
>> >> >> > >> >>>> 624930
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> 5923000
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>> sda4            100.39      5428.06
>> >> 19370.56
>> >> >> > >> >>>> 8992888270
>> >> >> > >> >>>> > > > >>>>>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>> 32091993204
>> >> >> > >> >>>> > > > >>>>>>>>
>> >> >> > >> >>>> > > > >>>>>>>>>
>
> Is this high for you?  20k blocks/second would seem to be high, but it's
> one disk only and it's not being shared by zk anymore, so it shouldn't
> matter?
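For reference, a minimal Java sketch (illustrative, not from the thread) of estimating the same per-device block rates by sampling Linux's /proc/diskstats twice; iostat's Blk_read/s and Blk_wrtn/s columns count 512-byte sectors, which is what the sectors-read and sectors-written counters in /proc/diskstats hold. The device name, the 5-second window and the class name are assumptions, and it only works on Linux.

import java.io.BufferedReader;
import java.io.FileReader;

// Illustrative sketch: sample Linux /proc/diskstats twice and estimate
// 512-byte blocks read/written per second for one device, roughly the
// iostat Blk_read/s and Blk_wrtn/s columns.  Device name and the 5s
// window are assumptions; Linux only.
public class DiskRate {
    // Returns {sectors read, sectors written} for the named device.
    private static long[] sectors(String device) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("/proc/diskstats"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.trim().split("\\s+");
                // Tokens (0-based): 0 major, 1 minor, 2 name, 5 sectors read, 9 sectors written.
                if (f.length > 9 && f[2].equals(device)) {
                    return new long[] { Long.parseLong(f[5]), Long.parseLong(f[9]) };
                }
            }
        } finally {
            in.close();
        }
        throw new IllegalArgumentException("device not found: " + device);
    }

    public static void main(String[] args) throws Exception {
        String device = args.length > 0 ? args[0] : "sda";  // assumed device name
        long[] a = sectors(device);
        Thread.sleep(5000);                                  // assumed sample window
        long[] b = sectors(device);
        System.out.println(device + ": " + (b[0] - a[0]) / 5.0 + " blk_read/s, "
                + (b[1] - a[1]) / 5.0 + " blk_wrtn/s");
    }
}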
>
>> I also checked the zookeeper quorum server that the regionserver tried
>> to connect to, according to the log. However, I don't see 192.168.100.116
>> in the client list of that zookeeper quorum server.
>> Would that be a problem?
>
> Is that because the ephemeral node for the regionserver had evaporated?
> Lost its lease w/ zk by the time you went to look?
>
>> Thu Nov 12 15:04:52 UTC 2009
>> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
>> Clients:
>>  /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
>>  /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
>>  /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
>>  /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
>>  /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
>>  /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
>>  /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
>>  /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
>>
>> Latency min/avg/max: 0/17/6333
>> Received: 47111
>> Sent: 0
>> Outstanding: 0
>> Zxid: 0x77000083f4
>> Mode: leader
>> Node count: 23
>
> That 6 second maximum latency is pretty bad but should be well within
> the zk session timeout.
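Output like the block quoted above is what ZooKeeper's "stat" four-letter command returns on the client port. Below is a minimal, illustrative Java sketch (not from the thread) that sends the command and prints the reply; the default host "localhost" and port 2181 are assumptions to be replaced with a real quorum member. Running it against each quorum member is one way to check whether a given regionserver ever shows up in the client list.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

// Illustrative sketch: send ZooKeeper's "stat" four-letter command to one
// quorum member and print the reply.  Host and port default to assumptions.
public class ZkStat {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";           // assumed host
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;   // default client port
        Socket s = new Socket(host, port);
        try {
            OutputStream out = s.getOutputStream();
            out.write("stat".getBytes());
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            s.close();
        }
    }
}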
>
> So the problem is likely on the zk client side of the session, i.e. in
> the RS.  You could enable GC logging as J-D suggested to see if you have
> any big pauses, i.e. pauses longer than the zk session timeout.
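On HotSpot JVMs of that era, GC logging is typically enabled with flags along the lines of -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:<file> added to the regionserver's JVM options. As a complementary illustration (an assumption-laden sketch, not HBase's own code), a tiny pause detector in Java: a daemon thread sleeps for a fixed period and warns when it wakes up far later than scheduled, which usually points at a long GC pause or a machine stall; the 3-second period and 10x threshold are arbitrary.

// Illustrative sketch (not HBase code): a daemon thread sleeps for a fixed
// period and warns whenever it wakes up much later than scheduled, which
// usually means a long GC pause or an OS-level stall.  The 3000ms period
// and the 10x threshold are arbitrary assumptions.
public class PauseDetector implements Runnable {
    private static final long PERIOD_MS = 3000;

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.currentTimeMillis();
            try {
                Thread.sleep(PERIOD_MS);
            } catch (InterruptedException e) {
                return;
            }
            long slept = System.currentTimeMillis() - start;
            if (slept > PERIOD_MS * 10) {
                System.err.println("Slept " + slept + "ms, expected ~" + PERIOD_MS
                        + "ms; likely a GC pause or machine stall");
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new PauseDetector(), "pause-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(Long.MAX_VALUE);  // keep the demo JVM alive
    }
}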
>
> When the RS went down, it didn't look too heavily loaded:
>
>  1. 2009-11-12 15:04:52,830 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=1.5166667, regions=322, stores=657, storefiles=631, storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079, blockCacheSize=658110960, blockCacheFree=197395984, blockCacheCount=9903, blockCacheHitRatio=99
>
> It's serving a few reads?  The number of store files seems fine.  Not
> too much memory used.
>
> Looking at the logs, I see the Lease Still Held exception.  This happens
> when the RS does its regular report to the master but the master thinks
> the RS has since restarted.  It probably thinks this because it noticed
> that the RS's znode in zk had gone away and so considered the RS dead.
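To make the znode mechanics concrete, here is a minimal sketch with the ZooKeeper Java client that watches one ephemeral node: when the session that created it expires, the node is deleted and the watch fires, which is the kind of signal the master acts on. The quorum string, the 60-second session timeout and the path /hbase/rs/example-regionserver are placeholders, not the exact names HBase uses.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative sketch: watch one ephemeral znode with the ZooKeeper Java
// client.  When the session that created the node expires, the node is
// deleted and the watch fires.  Quorum string, session timeout and znode
// path are placeholders, not the exact names HBase uses.
public class RsZnodeWatch {
    public static void main(String[] args) throws Exception {
        final String quorum = "zkhost1:2181,zkhost2:2181,zkhost3:2181";  // assumed quorum
        final String path = "/hbase/rs/example-regionserver";            // placeholder path
        final Object lock = new Object();
        ZooKeeper zk = new ZooKeeper(quorum, 60000, new Watcher() {
            public void process(WatchedEvent event) {
                System.out.println("event: " + event.getType() + " on " + event.getPath());
                synchronized (lock) {
                    lock.notifyAll();
                }
            }
        });
        Stat stat = zk.exists(path, true);  // leaves a watch; fires when the node is deleted
        System.out.println(path + (stat == null ? " is absent" : " exists"));
        synchronized (lock) {
            lock.wait();                    // block until some event (e.g. NodeDeleted) arrives
        }
        zk.close();
    }
}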
>
> Looking too at your logs, I see this gap in the zk pinging:
>
>  1. 2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x224e55436ad0004 after 0ms
>  2. 2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24e55436a0007d after 0ms
>
> In the lines above these, it's reporting about every ten seconds; here
> there is a big gap.
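A quick way to hunt for such gaps is to scan the regionserver log for the "Got ping response" lines and flag any long silence between consecutive ones. A minimal, illustrative Java sketch follows; the log file name and the 30-second threshold are assumptions. Pointing it at the log from around the disconnection window shows exactly where the pings stopped.

import java.io.BufferedReader;
import java.io.FileReader;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: scan a regionserver log for "Got ping response" lines
// and flag any gap between consecutive pings larger than a threshold.  The
// log file name and the 30s threshold are assumptions.
public class PingGapScanner {
    public static void main(String[] args) throws Exception {
        String logPath = args.length > 0 ? args[0] : "regionserver.log";  // assumed log file
        long maxGapMs = 30000;
        Pattern p = Pattern.compile(
                "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}).*Got ping response");
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
        BufferedReader in = new BufferedReader(new FileReader(logPath));
        try {
            Date prev = null;
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = p.matcher(line);
                if (!m.find()) {
                    continue;
                }
                Date ts = fmt.parse(m.group(1));
                if (prev != null && ts.getTime() - prev.getTime() > maxGapMs) {
                    System.out.println("gap of " + (ts.getTime() - prev.getTime())
                            + "ms before: " + line);
                }
                prev = ts;
            }
        } finally {
            in.close();
        }
    }
}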
>
> Do you have ganglia or something that will let you look more into what
> was happening on this machine around this time?  Is the machine OK?  It
> looks lightly loaded and you have your cluster nicely laid out.
> Something odd is going on.  What about things like the write speed to
> disk?  In the past, strange issues have been explained by an incorrectly
> set BIOS which made disks run at 1/100th of their proper speed.
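A crude way to sanity-check raw write speed is to time a large sequential write with a sync at the end. A minimal, illustrative Java sketch follows; the scratch-file location, the 256MB size and the 8KB buffer are assumptions, and the page cache means the result is only a rough indicator compared to a dedicated disk benchmark.

import java.io.File;
import java.io.FileOutputStream;

// Illustrative sketch: time a large sequential write, syncing to disk at the
// end, to catch a disk running far below its expected speed.  The scratch
// file location, 256MB size and 8KB buffer are assumptions, and the page
// cache makes this only a rough indicator.
public class WriteSpeedCheck {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("writecheck", ".dat");  // assumed scratch location
        byte[] buf = new byte[8192];
        long bytes = 256L * 1024 * 1024;
        long start = System.currentTimeMillis();
        FileOutputStream out = new FileOutputStream(f);
        try {
            for (long written = 0; written < bytes; written += buf.length) {
                out.write(buf);
            }
            out.getFD().sync();  // force data to disk before stopping the clock
        } finally {
            out.close();
            f.delete();
        }
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.printf("wrote %d MB in %.1fs (%.1f MB/s)%n",
                bytes / (1024 * 1024), secs, (bytes / (1024.0 * 1024.0)) / secs);
    }
}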
>
> St.Ack
>
>
>> Best,
>> zhenyu
>>
>>
>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
>>
>>> Stack,
>>>
>>> I really appreciate your comments.
>>> I will use the zookeeper monitoring script on my cluster and let it run
>>> overnight to see the result.
>>> I will follow up on this thread when I get anything.
>>>
>>> thanks
>>> zhenyu
>>>
>>>
>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <stack@duboce.net> wrote:
>>>
>>>> I see these in your log too:
>>>>
>>>>  1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 4544ms
>>>>  2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24dfd90c810002 after 4548ms
>>>>  3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 9030ms
>>>>  4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.
