hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From stack <st...@duboce.net>
Subject Re: regionserver disconnection
Date Thu, 19 Nov 2009 16:43:00 GMT
So, the time at which we see

2009-11-18 12:41:56,782 ERROR
org.apache.zookeeper.server.quorum.FollowerHandler: Unexpected exception
causing shutdown while sock still open

... corresponds to the time at which the regionserver loses is session?

St.Ack

On Thu, Nov 19, 2009 at 8:38 AM, Zhenyu Zhong <zhongresearch@gmail.com>wrote:

> Here is the zookeeper leader log. It had an unexpected exception.  I don't
> know why.
>
> http://pastebin.com/m6594ccc
>
> During that time, I don't think the system load is high. Memory usage
> should
> be normal.
>
> Best,
> zhenyu
>
>
>
> On Thu, Nov 19, 2009 at 11:09 AM, stack <stack@duboce.net> wrote:
>
> > What is in the other zk logs?  Why did the leader go away?  Memory issue
> or
> > something?  Why'd it shutdown?  Thanks for pastebin'ing this stuff.
> > St.Ack
> >
> > On Thu, Nov 19, 2009 at 7:53 AM, Zhenyu Zhong <zhongresearch@gmail.com
> > >wrote:
> >
> > > After digging more on the zookeeper log, I found that there seemed to
> be
> > a
> > > socketconnection timeout. Also some warnings indicate that zookeeper
> > server
> > > is not running. Probably I need to understand the mechanism of
> zookeeper
> > > first.
> > >
> > > Please find the zookeeper log around that time.
> > >
> > > http://pastebin.com/m37bb4ad1
> > >
> > > Many thanks
> > > zhenyu
> > >
> > >
> > > On Thu, Nov 19, 2009 at 10:31 AM, Zhenyu Zhong <
> zhongresearch@gmail.com
> > > >wrote:
> > >
> > > > Stack,
> > > >
> > > > I am very appreciated. Let me dig into the zookeeper log more.
> > > >
> > > > FYI, I saw the sleeper complained about the delay for the first time.
> > It
> > > > looks like there are different reasons for the RS disconnections.
> > > >
> > > > 2009-11-19 08:08:32,708 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> > > > response for sessionid:0x42508716bfc0001 after 0ms
> > > > 2009-11-19 08:08:37,595 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> > > > response for sessionid:0x42508716bfc0082 after 0ms
> > > > 2009-11-19 08:08:46,073 DEBUG org.apache.zookeeper.ClientCnxn: Got
> ping
> > > > response for sessionid:0x42508716bfc0001 after 0ms
> > > > 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn:
> Exception
> > > > closing session 0x42508716bfc0001 to
> > sun.nio.ch.SelectionKeyImpl@4f3a608f
> > > > java.io.IOException: TIMED OUT
> > > >         at
> > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > > 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn:
> Exception
> > > > closing session 0x42508716bfc0082 to
> > sun.nio.ch.SelectionKeyImpl@14c8875
> > > > java.io.IOException: TIMED OUT
> > > >         at
> > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > > 2009-11-19 08:09:39,880 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept
> > > > 53556ms, ten times longer than scheduled: 3000
> > > > 2009-11-19 08:09:39,980 WARN
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > > > org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> > > >         at
> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > Method)
> > > >         at
> > > >
> > >
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > >         at
> > > >
> > >
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > >         at
> > > java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > >         at
> > > >
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> > > >         at
> > > >
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> > > >         at
> > > >
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> > > >         at
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> > > >         at java.lang.Thread.run(Thread.java:619)
> > > > 2009-11-19 08:09:39,980 INFO
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > event,
> > > > state: Disconnected, type: None, path: null
> > > > 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn:
> > Attempting
> > > > connection to server superpyxis0005/192.168.100.119:2181
> > > > 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn: Priming
> > > > connection to java.nio.channels.SocketChannel[connected local=/
> > > > 192.168.100.122:40659 remote=superpyxis0005/192.168.100.119:2181]
> > > >
> > > >
> > > >
> > > > Best,
> > > > zhenyu
> > > >
> > > > On Wed, Nov 18, 2009 at 6:52 PM, stack <stack@duboce.net> wrote:
> > > >
> > > >> Patrick Hunt, one of the zk lads, answered with the following:
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> According to the following snippet (I cut to just session
> > > >> 0x425035c48210002)
> > > >> the client is getting pings for this session id regularly every 6
> > > seconds
> > > >> or
> > > >> so, then after 41,016 there's a 20 second gap (corresponds to a 30
> > > second
> > > >> timeout being set during session creation) at which point we
> > disconnect
> > > >> from
> > > >> the server.
> > > >>
> > > >>
> > > >> > 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn: Got
> > > ping
> > > >> > response for sessionid:0x425035c48210002 after 0ms
> > > >> > 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn:
> > > Exception
> > > >> > closing session 0x425035c48210002 to
> > > >> sun.nio.ch.SelectionKeyImpl@421690ab
> > > >> > java.io.IOException: TIMED OUT
> > > >>
> > > >> the client notifies hbase of the disco, and attempts to reconnect to
> a
> > > >> server
> > > >>
> > > >>
> > > >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > > event,
> > > >> > state: Disconnected, type: None, path: null
> > > >> > 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
> > > Attempting
> > > >> > connection to server superpyxis0003/192.168.100.117:2181
> > > >>
> > > >> but the server won't allow the session be be established for some
> > reason
> > > >> (is
> > > >> that the right server:port that's it's trying to connect to?)
> > > >>
> > > >>
> > > >> > 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn:
> Server
> > > >> > connection successful
> > > >> > 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn:
> > > Exception
> > > >> > closing session 0x425035c48210002 to
> > > >> sun.nio.ch.SelectionKeyImpl@5c07076b
> > > >> > java.io.IOException: Read error rc = -1
> > > java.nio.DirectByteBuffer[pos=0
> > > >> > lim=4 cap=4]
> > > >>
> > > >> The "read error rc = -1" means that the client read from the socket
> > and
> > > >> got
> > > >> EOS. But it's not a session expiration afaict (and should not be at
> > this
> > > >> point in time).
> > > >>
> > > >>
> > > >> You need to look at the ZK server logs for this period in time (for
> > this
> > > >> server superpyxis0003:2181, and superpyxis0001:2181, the client
> tries
> > to
> > > >> connect to both of these). There should be some indication there why
> > the
> > > >> server is closing down the connection and not allowing the session
> to
> > be
> > > >> re-established.
> > > >>
> > > >> One thought - by default we limit clients to a max of 10 connections
> > > from
> > > >> a
> > > >> particular host to the server. If you exceed this limit the zk
> server
> > > >> (it's
> > > >> per server though, not per cluster, so perhaps this is not it) will
> > not
> > > >> allow any  additional connections from the client host (you can't
> > exceed
> > > >> 10
> > > >> on a server). Could this be it? i.e. the user is running a number of
> > ZK
> > > >> clients (say hbase processes) from the same host? Regardless, get
> the
> > > >> server
> > > >> log for this time period and it should be more clear.
> > > >>
> > > >> Patrick
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Nov 18, 2009 at 11:18 AM, stack <stack@duboce.net> wrote:
> > > >>
> > > >> > And running stat on each of your zookeeper nodes tells you what
> > about
> > > >> > averages for connect-times?
> > > >> >
> > > >> > Its odd that there is a ping just 20 seconds beforehand and you
> have
> > > set
> > > >> > the zk session timeout at 60 seconds (or ten minutes?).  Let me
> ask
> > > the
> > > >> zk
> > > >> > lads about it.
> > > >> >
> > > >> > If you look in the GC log for around this time, do you see
> anything?
> > > >> >
> > > >> > St.Ack
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Wed, Nov 18, 2009 at 7:32 AM, Zhenyu Zhong <
> > > zhongresearch@gmail.com
> > > >> >wrote:
> > > >> >
> > > >> >> After running more experiment, it seems that the disconnection is
> > not
> > > >> >> related to the heavy load job, because the load average is low,
> the
> > > >> diskio
> > > >> >> is normal, memory Heap was not reached, also the virtual memory
> > stats
> > > >> >> shows
> > > >> >> that no swappings.  However, the disconnection still happens.
> > > >> >>
> > > >> >> It looks like this time it pauses for 20 seconds. No idea why
> > > >> regionserver
> > > >> >> disconnected.
> > > >> >>
> > > >> >> Any other suggestions please?
> > > >> >>
> > > >> >>
> > > >> >> 2009-11-18 12:41:00,956 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> > > ping
> > > >> >> response for sessionid:0x125035c495d02bb after 0ms
> > > >> >> 2009-11-18 12:41:08,074 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> > > ping
> > > >> >> response for sessionid:0x425035c48210002 after 0ms
> > > >> >> 2009-11-18 12:41:14,302 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> > > ping
> > > >> >> response for sessionid:0x125035c495d02bb after 0ms
> > > >> >> 2009-11-18 12:41:21,433 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> > > ping
> > > >> >> response for sessionid:0x425035c48210002 after 0ms
> > > >> >> 2009-11-18 12:41:27,656 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> > > ping
> > > >> >> response for sessionid:0x125035c495d02bb after 0ms
> > > >> >> 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> > > ping
> > > >> >> response for sessionid:0x425035c48210002 after 0ms
> > > >> >> 2009-11-18 12:41:41,016 DEBUG org.apache.zookeeper.ClientCnxn:
> Got
> > > ping
> > > >> >> response for sessionid:0x125035c495d02bb after 1ms
> > > >> >> 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn:
> > > Exception
> > > >> >> closing session 0x425035c48210002 to
> > > >> sun.nio.ch.SelectionKeyImpl@421690ab
> > > >> >> java.io.IOException: TIMED OUT
> > > >> >>        at
> > > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > >> >> 2009-11-18 12:42:01,582 INFO
> > > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> > > >> event,
> > > >> >> state: Disconnected, type: None, path: null
> > > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
> > > >> Attempting
> > > >> >> connection to server superpyxis0003/192.168.100.117:2181
> > > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
> > Priming
> > > >> >> connection to java.nio.channels.SocketChannel[connected local=/
> > > >> >> 192.168.100.132:46610 remote=superpyxis0003/192.168.100.117:2181
> ]
> > > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn:
> > Server
> > > >> >> connection successful
> > > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn:
> > > Exception
> > > >> >> closing session 0x425035c48210002 to
> > > >> sun.nio.ch.SelectionKeyImpl@56dc6fac
> > > >> >> java.io.IOException: Read error rc = -1
> > > java.nio.DirectByteBuffer[pos=0
> > > >> >> lim=4 cap=4]
> > > >> >>        at
> > > >> >>
> > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > > >> >>        at
> > > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn:
> > > Ignoring
> > > >> >> exception during shutdown input
> > > >> >> java.net.SocketException: Transport endpoint is not connected
> > > >> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > > >> >>        at
> > > >> >>
> > > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
> > > >> >>        at
> > > >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> > > >> >>         at
> > > >> >>
> > > org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
> > > >> >>        at
> > > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn:
> > > Ignoring
> > > >> >> exception during shutdown output
> > > >> >> java.net.SocketException: Transport endpoint is not connected
> > > >> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > > >> >>        at
> > > >> >>
> > > sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
> > > >> >>        at
> > > >> sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
> > > >> >>        at
> > > >> >>
> > > >>
> > org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
> > > >> >>        at
> > > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn:
> > > >> Attempting
> > > >> >> connection to server superpyxis0001/192.168.100.115:2181
> > > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn:
> > Priming
> > > >> >> connection to java.nio.channels.SocketChannel[connected local=/
> > > >> >> 192.168.100.132:36197 remote=superpyxis0001/192.168.100.115:2181
> ]
> > > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn:
> > Server
> > > >> >> connection successful
> > > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn:
> > > Exception
> > > >> >> closing session 0x425035c48210002 to
> > > >> sun.nio.ch.SelectionKeyImpl@5c07076b
> > > >> >> java.io.IOException: Read error rc = -1
> > > java.nio.DirectByteBuffer[pos=0
> > > >> >> lim=4 cap=4]
> > > >> >>        at
> > > >> >>
> > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > > >> >>        at
> > > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn:
> > > Ignoring
> > > >> >> exception during shutdown input
> > > >> >> java.net.SocketException: Transport endpoint is not connected
> > > >> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > > >> >>        at
> > > >> >>
> > > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
> > > >> >>        at
> > > >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> > > >> >>         at
> > > >> >>
> > > org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
> > > >> >>        at
> > > >> >>
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn:
> > > Ignoring
> > > >> >> exception during shutdown output
> > > >> >> java.net.SocketException: Transport endpoint is not connected
> > > >> >>
> > > >> >>
> > > >> >> On Mon, Nov 16, 2009 at 4:16 PM, Zhenyu Zhong <
> > > zhongresearch@gmail.com
> > > >> >> >wrote:
> > > >> >>
> > > >> >> > Here is the diskIO and CPU around the time we had RS
> > disconnection
> > > on
> > > >> >> one
> > > >> >> > machine that runs RegionServer. It doesn't seem to be high.
> > Similar
> > > >> disk
> > > >> >> and
> > > >> >> > cpu usage have been seen before and the HBase was running fine.
> > > >> >> >
> > > >> >> >
> > > >> >> > So far I haven't found why my 10 minutes session timeout
> doesn't
> > > >> apply.
> > > >> >> > Still digging.
> > > >> >> >
> > > >> >> > Many thanks!
> > > >> >> >
> > > >> >> > Time: 03:01:08 PM
> > > >> >> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > > >> >> >            3.82    0.00    0.89    3.97    0.00   91.32
> > > >> >> >
> > > >> >> >
> > > >> >> > Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read
> > > >> Blk_wrtn
> > > >> >> > sda              24.30      1605.39      6679.35 6238623388
> > > >> 25956251720
> > > >> >> > sda1              0.51         1.94        15.71    7547265
> > > >> 61067904
> > > >> >> > sda2              0.13         2.11         5.20    8202079
> > > >> 20189232
> > > >> >> > sda3              0.13         0.25         1.63     987846
> > > >>  6323336
> > > >> >> > sda4              0.00         0.00         0.00          4
> > > >>  0
> > > >> >> > sda5              1.78         0.06        24.14     220010
> > > >> 93817208
> > > >> >> > sda6              0.38         0.42         8.41    1630584
> > > >> 32688152
> > > >> >> > sda7             21.38      1600.61      6624.26 6220035272
> > > >> 25742165888
> > > >> >> > sdb               4.52       767.33       565.91 2981868690
> > > >> 2199132380
> > > >> >> > sdb1              4.52       767.33       565.91 2981866755
> > > >> 2199132372
> > > >> >> > sdc               4.42       742.95       563.00 2887151482
> > > >> 2187823092
> > > >> >> > sdc1              4.42       742.95       563.00 2887149547
> > > >> 2187823084
> > > >> >> > sdd               4.49       750.78       557.25 2917554074
> > > >> 2165513500
> > > >> >> > sdd1              4.49       750.78       557.25 2917552139
> > > >> 2165513492
> > > >> >> > sde               4.52       758.51       569.46 2947593394
> > > >> 2212964236
> > > >> >> > sde1              4.52       758.51       569.46 2947591459
> > > >> 2212964228
> > > >> >> > sdf               4.51       747.22       571.78 2903740266
> > > >> 2221972652
> > > >> >> > sdf1              4.51       747.22       571.78 2903738331
> > > >> 2221972644
> > > >> >> >
> > > >> >> > Best,
> > > >> >> > zhenyu
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> > On Mon, Nov 16, 2009 at 4:07 PM, stack <stack@duboce.net>
> wrote:
> > > >> >> >
> > > >> >> >> On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <
> > > >> >> zhongresearch@gmail.com
> > > >> >> >> >wrote:
> > > >> >> >>
> > > >> >> >> > I just realized that there was a MapReduce job running
> during
> > > the
> > > >> >> time
> > > >> >> >> the
> > > >> >> >> > regionserver disconnected from the zookeeper.
> > > >> >> >> > That MapReduce Job was processing 500GB data and took about
> 8
> > > >> minutes
> > > >> >> to
> > > >> >> >> > finish. It launched over 2000 map tasks.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> There was a tasktracker running on the RegionServer that
> > > >> disconnected?
> > > >> >>  Is
> > > >> >> >> the map i/o or cpu heavy?  Do you think it could have stole
> life
> > > >> from
> > > >> >> the
> > > >> >> >> datanode/regionserver?
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> > I doubt that this introduced
> > > >> >> >> > resource contention between DataNode and RegionServer Node.
> > > >> >> >> > Also during the time that MapReduce job ran, I saw a few
> > errors
> > > >> >> >> indicating
> > > >> >> >> > that
> > > >> >> >> >
> > > >> >> >> > java.io.IOException: Could not obtain block:
> > > >> >> >> > blk_7661247556283580226_4450462
> > file=/data/xxx.txt.200910141159
> > > >> >> >> >        at
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
> > > >> >> >> >        at
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
> > > >> >> >> >        at
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
> > > >> >> >> >        at
> > java.io.DataInputStream.read(DataInputStream.java:83)
> > > >> >> >> >        at
> > > >> >> >>
> org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> > > >> >> >> >        at
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
> > > >> >> >> >        at
> > > >> >> >> >
> > > >> >>
> > > >>
> > org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
> > > >> >> >> >        at
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> > > >> >> >> >        at
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> > > >> >> >> >        at
> > > >> org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> > > >> >> >> >        at
> > > >> >> >>
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > > >> >> >> >        at
> > org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > > >> >> >> >        at
> org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > >> >> >> >
> > > >> >> >>
> > > >> >> >> This is interesting.  Your map task is running into hdfs-127.
> > >  Yeah,
> > > >> >> you
> > > >> >> >> should patch your cluster if you want to get rid of these
> > > (hdfs-127
> > > >> has
> > > >> >> >> been
> > > >> >> >> applied to 0.21 hadoop and will be in the next hadoop release
> on
> > > >> 0.20.x
> > > >> >> >> branch, hadoop 0.20.2).
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> >
> > > >> >> >> > Possibly it is related to HDFS-127, but I don't see any
> > > datanodes
> > > >> >> went
> > > >> >> >> > down. Does that affect Regionserver? Shall we apply the
> patch?
> > > >> >> >> >
> > > >> >> >> >
> > > >> >> >> Read up on hdfs-127 for explaination of whats going on.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> > Now I start to keep tracking of the virtual memory stats to
> > see
> > > if
> > > >> >> the
> > > >> >> >> same
> > > >> >> >> > issue happens tomorrow around the same time period.
> > > >> >> >> >
> > > >> >> >>
> > > >> >> >> Excellent.  Thanks for doing this.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> > And I highly suspect that this particular MapReduce job
> hurts
> > > >> HBase
> > > >> >> >> > Regionserver.
> > > >> >> >> >
> > > >> >> >> > PS. I also use gcviewer to parse the GC-log, I only see
> around
> > > 30
> > > >> >> >> seconds
> > > >> >> >> > pauses maximum.
> > > >> >> >> >
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> Did you figure why your ten minute session is not applying?
> > > >> >> >>
> > > >> >> >> St.Ack
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> >
> > > >> >> >> > Thanks a lot.
> > > >> >> >> > zhenyu
> > > >> >> >> >
> > > >> >> >> >
> > > >> >> >> >
> > > >> >> >> > On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <
> > > >> >> zhongresearch@gmail.com
> > > >> >> >> > >wrote:
> > > >> >> >> >
> > > >> >> >> > > J-D,
> > > >> >> >> > >
> > > >> >> >> > > Thank you very much for your comments.
> > > >> >> >> > > My company block the IRC port, so I have trouble to
> connect
> > to
> > > >> IRC
> > > >> >> >> > channel.
> > > >> >> >> > > I have been trying to ask the IT to open the IRC port for
> > me,
> > > it
> > > >> >> might
> > > >> >> >> > take
> > > >> >> >> > > a while.
> > > >> >> >> > >
> > > >> >> >> > > Best,
> > > >> >> >> > > zhenyu
> > > >> >> >> > >
> > > >> >> >> > >
> > > >> >> >> > > On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans <
> > > >> >> >> jdcryans@apache.org
> > > >> >> >> > >wrote:
> > > >> >> >> > >
> > > >> >> >> > >> The error you are getting is a disconnection from a
> > zookeeper
> > > >> >> server
> > > >> >> >> > >> and is very generic.
> > > >> >> >> > >>
> > > >> >> >> > >> ZK-86 is still opened and the last comment refers to
> ZK-111
> > > >> saying
> > > >> >> >> > >> that the bug (in unit tests) was probably fixed in
> release
> > > >> 3.0.0
> > > >> >> last
> > > >> >> >> > >> year.
> > > >> >> >> > >>
> > > >> >> >> > >> To figure the hang you have, you can try to jstack the
> > > process
> > > >> pid
> > > >> >> >> and
> > > >> >> >> > >> see exactly what's holding the RS from shutting down.
> > > >> >> >> > >>
> > > >> >> >> > >> Would it be possible for you to drop by the IRC channel?
> > This
> > > >> way
> > > >> >> we
> > > >> >> >> > >> can debug this a at much faster pace.
> > > >> >> >> > >>
> > > >> >> >> > >> Thx!
> > > >> >> >> > >>
> > > >> >> >> > >> J-D
> > > >> >> >> > >>
> > > >> >> >> > >> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <
> > > >> >> >> zhongresearch@gmail.com>
> > > >> >> >> > >> wrote:
> > > >> >> >> > >> > I found this.
> > > >> >> >> > >> > http://issues.apache.org/jira/browse/ZOOKEEPER-86
> > > >> >> >> > >> >
> > > >> >> >> > >> > It looks like the same error I had. Is it a zookeeper
> > bug?
> > > >> When
> > > >> >> >> will
> > > >> >> >> > >> HBase
> > > >> >> >> > >> > take the zookeeper version 3.3.0?
> > > >> >> >> > >> >
> > > >> >> >> > >> > thanks
> > > >> >> >> > >> > zhenyu
> > > >> >> >> > >> >
> > > >> >> >> > >> > On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <
> > > >> >> >> > zhongresearch@gmail.com
> > > >> >> >> > >> >wrote:
> > > >> >> >> > >> >
> > > >> >> >> > >> >> Now I really doubt about the zookeeper, from the log I
> > saw
> > > >> >> errors
> > > >> >> >> > like
> > > >> >> >> > >> >> IOException Read Error, while the zookeeper client
> > > >> >> (regionserver )
> > > >> >> >> > >> tried to
> > > >> >> >> > >> >> read. But it got disconneted status from zookeeper.
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> I don't see any load on any zookeeper quorum servers.
> > > DiskIO
> > > >> is
> > > >> >> >> fine.
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> Also when the regionserver decides to exit due to the
> > > >> >> disconnect
> > > >> >> >> > status
> > > >> >> >> > >> >> from zookeeper, sometimes the regionserver hangs
> during
> > > the
> > > >> >> >> exiting.
> > > >> >> >> > We
> > > >> >> >> > >> can
> > > >> >> >> > >> >> still see the HRegionServer process even we don't see
> it
> > > in
> > > >> the
> > > >> >> >> > Master
> > > >> >> >> > >> web
> > > >> >> >> > >> >> interface.
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> I also notice that there is an zookeeper.retries and
> > > >> >> >> zookeeper.pauses
> > > >> >> >> > >> >> settings in hbase-default.xml, would that matter ?
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> thanks
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> You may see the IOException error in the following log
> > > when
> > > >> the
> > > >> >> >> > >> >> Regionserver lost connection to the zookeeper.
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> 2009-11-14 15:58:31,993 INFO
> > > >> >> >> > org.apache.hadoop.hbase.regionserver.HLog:
> > > >> >> >> > >> >> Roll
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921,
> > > >> >> >> > >> >> entries=1062964, calcsize=255014936,
> filesize=139566093.
> > > New
> > > >> >> hlog
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
> > > >> >> >> > >> >> 2009-11-14 15:58:39,295 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > >> >> >> > >> >> 2009-11-14 15:58:52,648 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > >> >> >> > >> >> 2009-11-14 15:59:06,007 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > >> >> >> > >> >> 2009-11-14 15:59:19,365 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > >> >> >> > >> >> 2009-11-14 15:59:32,867 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 186ms
> > > >> >> >> > >> >> 2009-11-14 15:59:46,070 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > >> >> >> > >> >> 2009-11-14 15:59:59,423 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > >> >> >> > >> >> 2009-11-14 16:00:12,775 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > >> >> >> > >> >> 2009-11-14 16:00:28,141 DEBUG
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> Got
> > > >> >> >> > ping
> > > >> >> >> > >> >> response for sessionid:0x424f0ed01ea00a6 after 2010ms
> > > >> >> >> > >> >> 2009-11-14 16:01:10,378 WARN
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > Exception
> > > >> >> >> > >> >> closing session 0x424f0ed01ea00a6 to
> > > >> >> >> > >> sun.nio.ch.SelectionKeyImpl@39fbb2d6
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> java.io.IOException: Read error rc = -1
> > > >> >> >> > java.nio.DirectByteBuffer[pos=0
> > > >> >> >> > >> >> lim=4 cap=4]
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >>
> > > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >>
> > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > > >> >> >> > >> >> 2009-11-14 16:01:10,478 INFO
> > > >> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> Got
> > > >> >> ZooKeeper
> > > >> >> >> > >> event,
> > > >> >> >> > >> >> state: Disconnected, type: None, path: null
> > > >> >> >> > >> >> 2009-11-14 16:01:11,333 WARN
> > > >> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > >> Attempt=1
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> > > >> >> >> > >> >>         at
> > > >> >> >> > >>
> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > >> >> >> > >> >> Method)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >>
> > > java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> > > >> >> >> > >> >>         at java.lang.Thread.run(Thread.java:619)
> > > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> Attempting
> > > >> >> >> > >> >> connection to server superpyxis0001/
> > 192.168.100.115:2181
> > > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> Priming
> > > >> >> >> > >> >> connection to
> java.nio.channels.SocketChannel[connected
> > > >> local=/
> > > >> >> >> > >> >> 192.168.100.122:59575 remote=superpyxis0001/
> > > >> >> 192.168.100.115:2181]
> > > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> Server
> > > >> >> >> > >> >> connection successful
> > > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO
> > > >> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> Got
> > > >> >> ZooKeeper
> > > >> >> >> > >> event,
> > > >> >> >> > >> >> state: Expired, type: None, path: null
> > > >> >> >> > >> >> 2009-11-14 16:01:11,433 WARN
> > > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > Exception
> > > >> >> >> > >> >> closing session 0x424f0ed01ea00a6 to
> > > >> >> >> > >> sun.nio.ch.SelectionKeyImpl@6b93d343
> > > >> >> >> > >> >> java.io.IOException: Session Expired
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >>
> > > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
> > > >> >> >> > >> >>
> > > >> >> >> > >> >>         at
> > > >> >> >> > >> >>
> > > >> >> >>
> > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > > >> >> >> > >> >> 2009-11-14 16:01:11,433 ERROR
> > > >> >> >> > >> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > >> ZooKeeper
> > > >> >> >> session
> > > >> >> >> > >> >> expired
> > > >> >> >> > >> >> 2009-11-14 16:01:13,513 INFO
> > > >> org.apache.hadoop.ipc.HBaseServer:
> > > >> >> >> IPC
> > > >> >> >> > >> Server
> > > >> >> >> > >> >> handler 3 on 60021, call put([B@73deb204,
> > > >> >> >> > >> >> [Lorg.apache.hadoop.hbase.client.Put;@2179600a) from
> > > >> >> >> > >> 192.168.100.132:40728:
> > > >> >> >> > >> >> error: java.io.IOException: Server not running,
> aborting
> > > >> >> >> > >> >> java.io.IOException: Server not running, aborting
> > > >> >> >> > >> >>
> > > >> >> >> > >> >>
> > > >> >> >> > >> >>
> > > >> >> >> > >> >>
> > > >> >> >> > >> >>
> > > >> >> >> > >> >> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <
> > > >> >> >> > zhongresearch@gmail.com
> > > >> >> >> > >> >wrote:
> > > >> >> >> > >> >>
> > > >> >> >> > >> >>> Stack,
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>> Thanks a lot!
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>> I found out that the reason why HBase doesn't take
> the
> > > >> system
> > > >> >> >> file
> > > >> >> >> > >> >>> descriptor value I set before. I started the HBase
> > using
> > > >> root
> > > >> >> >> > instead
> > > >> >> >> > >> of the
> > > >> >> >> > >> >>> normal hadoop user, while my system configures hadoop
> > > with
> > > >> >> higher
> > > >> >> >> > file
> > > >> >> >> > >> >>> descriptor value but configures root with 1024
> default
> > > >> value.
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>> Now my system has a clean start. However, it seems
> that
> > > the
> > > >> >> >> > >> >>> zookeeper.session.timeout value doesn't take into
> > effect.
> > > I
> > > >> >> still
> > > >> >> >> > >> found
> > > >> >> >> > >> >>> around 60 seconds pauses from the disconnected
> > > >> regionserver. I
> > > >> >> >> > really
> > > >> >> >> > >> don't
> > > >> >> >> > >> >>> know why regionserver only times out after 60 seconds
> > > >> instead
> > > >> >> of
> > > >> >> >> 10
> > > >> >> >> > >> minutes
> > > >> >> >> > >> >>> which I set for zookeeper.session.timeout.
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>> Is there any other timeout value coming into play
> > before
> > > >> the
> > > >> >> >> actual
> > > >> >> >> > >> >>> session times out?
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>> zhenyu
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>> On Fri, Nov 13, 2009 at 7:08 PM, stack <
> > stack@duboce.net
> > > >
> > > >> >> wrote:
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>>> Ok.  Lack of file descriptors manifests in all kinds
> > of
> > > >> weird
> > > >> >> >> ways.
> > > >> >> >> > >> >>>> Hopefully thats it.  If not, lets keep digging.
> > > >> >> >> > >> >>>> St.Ack
> > > >> >> >> > >> >>>>
> > > >> >> >> > >> >>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong <
> > > >> >> >> > >> zhongresearch@gmail.com
> > > >> >> >> > >> >>>> >wrote:
> > > >> >> >> > >> >>>>
> > > >> >> >> > >> >>>> > Stack,
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>> > You are right, the master started with ulimit -n
> > 1024.
> > > >> It
> > > >> >> >> doesn't
> > > >> >> >> > >> take
> > > >> >> >> > >> >>>> the
> > > >> >> >> > >> >>>> > system value.
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>> > Regarding to the too many open files, it looks
> like
> > > the
> > > >> >> same
> > > >> >> >> as
> > > >> >> >> > the
> > > >> >> >> > >> one
> > > >> >> >> > >> >>>> J-D
> > > >> >> >> > >> >>>> > put up. But I will get the Master start with
> higher
> > > >> value
> > > >> >> >> first
> > > >> >> >> > and
> > > >> >> >> > >> see
> > > >> >> >> > >> >>>> if
> > > >> >> >> > >> >>>> > this kind of error goes away.
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>> > thanks a lot!
> > > >> >> >> > >> >>>> > zhenyu
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>> > On Fri, Nov 13, 2009 at 6:02 PM, stack <
> > > >> stack@duboce.net>
> > > >> >> >> wrote:
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>> > > Does it say
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> > > ulimit -n 32768
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> > > ...as the second line in your log file on start
> of
> > > the
> > > >> >> >> master?
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> > > Is the stack trace that complains about too many
> > > open
> > > >> >> files
> > > >> >> >> > same
> > > >> >> >> > >> as
> > > >> >> >> > >> >>>> the
> > > >> >> >> > >> >>>> > one
> > > >> >> >> > >> >>>> > > in the blog post J-D put up?
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> > > St.Ack
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> > > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu Zhong <
> > > >> >> >> > >> >>>> zhongresearch@gmail.com
> > > >> >> >> > >> >>>> > > >wrote:
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> > > > The ulimit file descriptors was set to
> > fs.file-max
> > > >> >> >> =1578334,
> > > >> >> >> > >> also
> > > >> >> >> > >> >>>> in
> > > >> >> >> > >> >>>> > the
> > > >> >> >> > >> >>>> > > > limits.conf the value was set to 32768.
> > > >> >> >> > >> >>>> > > > So these are way higher than the open
> > descriptors
> > > >> for
> > > >> >> the
> > > >> >> >> > >> running
> > > >> >> >> > >> >>>> > > > processes.
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > > > thanks
> > > >> >> >> > >> >>>> > > > zhenyu
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <
> > > >> >> >> saint.ack@gmail.com>
> > > >> >> >> > >> >>>> wrote:
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > > > > You upped the ulimit file descriptors as per
> > the
> > > >> >> getting
> > > >> >> >> > >> started
> > > >> >> >> > >> >>>> doc?
> > > >> >> >> > >> >>>> > > > >
> > > >> >> >> > >> >>>> > > > >
> > > >> >> >> > >> >>>> > > > >
> > > >> >> >> > >> >>>> > > > > On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <
> > > >> >> >> > >> >>>> zhongresearch@gmail.com>
> > > >> >> >> > >> >>>> > > > wrote:
> > > >> >> >> > >> >>>> > > > >
> > > >> >> >> > >> >>>> > > > >  Thanks a lot.
> > > >> >> >> > >> >>>> > > > >>
> > > >> >> >> > >> >>>> > > > >>
> > > >> >> >> > >> >>>> > > > >> Bad news is my kernel is still 2.6.26.
> > > >> >> >> > >> >>>> > > > >> But it was not a problem before.
> > > >> >> >> > >> >>>> > > > >>
> > > >> >> >> > >> >>>> > > > >> Very strange.
> > > >> >> >> > >> >>>> > > > >>
> > > >> >> >> > >> >>>> > > > >> zhenyu
> > > >> >> >> > >> >>>> > > > >>
> > > >> >> >> > >> >>>> > > > >> On Fri, Nov 13, 2009 at 4:16 PM,
> Jean-Daniel
> > > >> Cryans
> > > >> >> <
> > > >> >> >> > >> >>>> > > > jdcryans@apache.org
> > > >> >> >> > >> >>>> > > > >> >wrote:
> > > >> >> >> > >> >>>> > > > >>
> > > >> >> >> > >> >>>> > > > >>  Looks like
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>> J-D
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu
> > Zhong
> > > <
> > > >> >> >> > >> >>>> > > zhongresearch@gmail.com
> > > >> >> >> > >> >>>> > > > >
> > > >> >> >> > >> >>>> > > > >>> wrote:
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> Hi,
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> After I re-organize the cluster, the
> > > experiment
> > > >> >> ran
> > > >> >> >> into
> > > >> >> >> > >> >>>> problem
> > > >> >> >> > >> >>>> > > > faster
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> than
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> before.
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> Basically the changes are to use machines
> > > with
> > > >> >> less
> > > >> >> >> > >> resources
> > > >> >> >> > >> >>>> as
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> zookeeper
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> quorums and machines with more resources
> as
> > > >> >> >> > regionserver.
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> From the log, I still see the pause
> around
> > 1
> > > >> >> minute.
> > > >> >> >> > >> >>>> > > > >>>> I enabled the GC logging,  please see
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> http://pastebin.com/m1d4ce0f1
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> for details.
> > > >> >> >> > >> >>>> > > > >>>> In general I don't see many pauses in the
> > GC.
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> What is more interesting, 1 hour after
> the
> > > 1st
> > > >> >> >> > >> regionserver
> > > >> >> >> > >> >>>> > > > >>>> disconnected,
> > > >> >> >> > >> >>>> > > > >>>> the master log complained about too many
> > open
> > > >> >> files.
> > > >> >> >> > This
> > > >> >> >> > >> >>>> didn't
> > > >> >> >> > >> >>>> > > > happen
> > > >> >> >> > >> >>>> > > > >>>> before.
> > > >> >> >> > >> >>>> > > > >>>> I checked the system OS setup as well as
> > the
> > > >> >> >> > limits.conf.
> > > >> >> >> > >> I
> > > >> >> >> > >> >>>> also
> > > >> >> >> > >> >>>> > > > checked
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> the
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> running processes. The total open files
> > don't
> > > >> >> reach
> > > >> >> >> the
> > > >> >> >> > >> limit.
> > > >> >> >> > >> >>>> I
> > > >> >> >> > >> >>>> > am
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> confused
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> a bit.
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> Please see the following master log.
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> > > >> >> >> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> > > >> >> >> > >> >>>> > > > >>>> RegionManager.metaScanner scan of 4658
> > row(s)
> > > >> of
> > > >> >> meta
> > > >> >> >> > >> region
> > > >> >> >> > >> >>>> > > {server:
> > > >> >> >> > >> >>>> > > > >>>> 192.168.100.128:60021, regionname:
> > > .META.,,1,
> > > >> >> >> startKey:
> > > >> >> >> > >> <>}
> > > >> >> >> > >> >>>> > > complete
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> > > >> >> >> > >> >>>> > > > org.apache.hadoop.hbase.master.BaseScanner:
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> All
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> 1 .META. region(s) scanned
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:07,677 DEBUG
> > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> Got
> > > >> >> >> > >> >>>> > > > ping
> > > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c
> > > after
> > > >> 3ms
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > java.io.IOException:
> > > >> Bad
> > > >> >> >> connect
> > > >> >> >> > >> ack
> > > >> >> >> > >> >>>> with
> > > >> >> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_-2808245019291145247_5478039
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > java.io.EOFException
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_1074853606841896259_5478048
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:10,334 DEBUG
> > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> Got
> > > >> >> >> > >> >>>> > > > ping
> > > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c
> > after
> > > >> 1ms
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:21,018 DEBUG
> > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> Got
> > > >> >> >> > >> >>>> > > > ping
> > > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c
> > > after
> > > >> 0ms
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:23,674 DEBUG
> > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> Got
> > > >> >> >> > >> >>>> > > > ping
> > > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c
> > after
> > > >> 0ms
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > java.io.IOException:
> > > >> Bad
> > > >> >> >> connect
> > > >> >> >> > >> ack
> > > >> >> >> > >> >>>> with
> > > >> >> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_-6642544517082142289_5478063
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > >> java.net.SocketException:
> > > >> >> Too
> > > >> >> >> > many
> > > >> >> >> > >> >>>> open
> > > >> >> >> > >> >>>> > > files
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_2057511041109796090_5478063
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > >> java.net.SocketException:
> > > >> >> Too
> > > >> >> >> > many
> > > >> >> >> > >> >>>> open
> > > >> >> >> > >> >>>> > > files
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_8219260302213892894_5478064
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > >> java.net.SocketException:
> > > >> >> Too
> > > >> >> >> > many
> > > >> >> >> > >> >>>> open
> > > >> >> >> > >> >>>> > > files
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_1669205542853067709_5478235
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > >> java.net.SocketException:
> > > >> >> Too
> > > >> >> >> > many
> > > >> >> >> > >> >>>> open
> > > >> >> >> > >> >>>> > > files
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_9128897691346270351_5478237
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > >> java.net.SocketException:
> > > >> >> Too
> > > >> >> >> > many
> > > >> >> >> > >> >>>> open
> > > >> >> >> > >> >>>> > > files
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_1116845144864123018_5478240
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:34,372 DEBUG
> > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> Got
> > > >> >> >> > >> >>>> > > > ping
> > > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x424eebf1c10004c
> > > after
> > > >> 0ms
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,034 DEBUG
> > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> Got
> > > >> >> >> > >> >>>> > > > ping
> > > >> >> >> > >> >>>> > > > >>>> response for sessionid:0x24eebf1043003c
> > after
> > > >> 0ms
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> DataStreamer
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> Exception: java.io.IOException: Too many
> > open
> > > >> >> files
> > > >> >> >> > >> >>>> > > > >>>>      at sun.nio.ch.IOUtil.initPipe(Native
> > > >> Method)
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > >
> > > >> >> >> sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>>
> > > >> >> >>
> > java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>>
> > java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > >> >> >> > >> >>>> > > > >>>>      at
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > Error
> > > >> >> >> > >> >>>> > > > >>>> Recovery for block
> > > >> blk_8148813491785406356_5478475
> > > >> >> >> bad
> > > >> >> >> > >> >>>> datanode[0]
> > > >> >> >> > >> >>>> > > > >>>> 192.168.100.123:50010
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > Error
> > > >> >> >> > >> >>>> > > > >>>> Recovery for block
> > > >> blk_8148813491785406356_5478475
> > > >> >> in
> > > >> >> >> > >> pipeline
> > > >> >> >> > >> >>>> > > > >>>> 192.168.100.123:50010,
> > 192.168.100.134:50010
> > > ,
> > > >> >> >> > >> >>>> > 192.168.100.122:50010
> > > >> >> >> > >> >>>> > > :
> > > >> >> >> > >> >>>> > > > >>>> bad
> > > >> >> >> > >> >>>> > > > >>>> datanode 192.168.100.123:50010
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > Exception
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>> in
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > >> java.net.SocketException:
> > > >> >> Too
> > > >> >> >> > many
> > > >> >> >> > >> >>>> open
> > > >> >> >> > >> >>>> > > files
> > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > >> >> >> > >> >>>> > > > >>>> block blk_2119727700857186236_5478498
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu
> > Zhong
> > > <
> > > >> >> >> > >> >>>> > > > zhongresearch@gmail.com
> > > >> >> >> > >> >>>> > > > >>>> wrote:
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>  Will do.
> > > >> >> >> > >> >>>> > > > >>>>>
> > > >> >> >> > >> >>>> > > > >>>>> thanks
> > > >> >> >> > >> >>>> > > > >>>>> zhenyu
> > > >> >> >> > >> >>>> > > > >>>>>
> > > >> >> >> > >> >>>> > > > >>>>>
> > > >> >> >> > >> >>>> > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <
> > > >> >> >> > stack@duboce.net
> > > >> >> >> > >> >
> > > >> >> >> > >> >>>> wrote:
> > > >> >> >> > >> >>>> > > > >>>>>
> > > >> >> >> > >> >>>> > > > >>>>>  Enable GC logging too on this next run
> > (see
> > > >> >> >> > >> hbase-env.sh).
> > > >> >> >> > >> >>>>  Lets
> > > >> >> >> > >> >>>> > > try
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>> and
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> get
> > > >> >> >> > >> >>>> > > > >>>>>> to the bottom of whats going on.
> > > >> >> >> > >> >>>> > > > >>>>>> Thanks,
> > > >> >> >> > >> >>>> > > > >>>>>> St.Ack
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM,
> Zhenyu
> > > >> Zhong <
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>> zhongresearch@gmail.com
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> wrote:
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>  I can switch the boxes that run
> > zookeeper
> > > >> with
> > > >> >> the
> > > >> >> >> > ones
> > > >> >> >> > >> >>>> that
> > > >> >> >> > >> >>>> > run
> > > >> >> >> > >> >>>> > > > >>>>>>> regionservers.
> > > >> >> >> > >> >>>> > > > >>>>>>> I will see the results later.
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> FYI. The node does have the 10 minutes
> > > >> >> >> > >> >>>> > zookeeper.session.timeout
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>> value
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> in
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> place.
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> thanks
> > > >> >> >> > >> >>>> > > > >>>>>>> zhenyu
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack
> <
> > > >> >> >> > >> stack@duboce.net>
> > > >> >> >> > >> >>>> > wrote:
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>  On Thu, Nov 12, 2009 at 11:50 AM,
> > Zhenyu
> > > >> Zhong
> > > >> >> <
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> zhongresearch@gmail.com
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> wrote:
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> In my cluster, half of the cluster
> > have
> > > 2
> > > >> >> disks
> > > >> >> >> > 400GB
> > > >> >> >> > >> >>>> each
> > > >> >> >> > >> >>>> > per
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> machine,
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> and
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> half of the cluster have 6 disks per
> > > >> machine.
> > > >> >> >> >  Maybe
> > > >> >> >> > >> we
> > > >> >> >> > >> >>>> > should
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> run
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> zookeeper
> > > >> >> >> > >> >>>> > > > >>>>>>>>> on the machines with 2 disks and RS
> on
> > > >> >> machines
> > > >> >> >> > with
> > > >> >> >> > >> 6
> > > >> >> >> > >> >>>> disks?
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> That would make most sense only in
> the
> > > >> below,
> > > >> >> it
> > > >> >> >> > >> looks
> > > >> >> >> > >> >>>> like
> > > >> >> >> > >> >>>> > the
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> RS
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> that
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> had
> > > >> >> >> > >> >>>> > > > >>>>>>>> issue had 4 disks?
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>  BTW, the 10 minutes
> > > >> zookeeper.session.timeout
> > > >> >> >> has
> > > >> >> >> > >> been
> > > >> >> >> > >> >>>> set
> > > >> >> >> > >> >>>> > > during
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> the
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> experiment.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> And for sure this node had it in
> place?
> > > >> >> >> > >> >>>> > > > >>>>>>>> St.Ack
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> thanks
> > > >> >> >> > >> >>>> > > > >>>>>>>>> zhenyu
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM,
> stack
> > <
> > > >> >> >> > >> stack@duboce.net
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>> > > wrote:
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>  On Thu, Nov 12, 2009 at 8:40 AM,
> > Zhenyu
> > > >> >> Zhong <
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> zhongresearch@gmail.com
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> wrote:
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>  Though I experienced 2
> regionserver
> > > >> >> >> disconnection
> > > >> >> >> > >> this
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> morning,
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> it
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> looks
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> better from the regionserver log.
> > > (please
> > > >> >> see
> > > >> >> >> the
> > > >> >> >> > >> >>>> following
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> log)
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> http://pastebin.com/m496dbfae
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> I checked diskIO for one of the
> > > >> >> >> > >> >>>> > regionserver(192.168.100.116)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> during
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> the
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> time it disconnected.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Time: 03:04:58 AM
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Device:            tps
> Blk_read/s
> > > >> >> >> Blk_wrtn/s
> > > >> >> >> > >> >>>> Blk_read
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> Blk_wrtn
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> sda             105.31      5458.83
> > > >> >> 19503.64
> > > >> >> >> > >> >>>> 9043873239
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> 32312473676
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> sda1              2.90         6.64
> > > >> >>  99.25
> > > >> >> >> > >> >>>> 10993934
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> 164433464
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> sda2              1.72        23.76
> > > >> >>  30.25
> > > >> >> >> > >> >>>> 39365817
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> 50124008
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> sda3              0.30         0.38
> > > >> >> 3.58
> > > >> >> >> > >> >>>> 624930
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> 5923000
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> sda4            100.39      5428.06
> > > >> >> 19370.56
> > > >> >> >> > >> >>>> 8992888270
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> 32091993204
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>  Is this high for you?  20k
> > > >> blocks/second
> > > >> >> >> would
> > > >> >> >> > >> seem to
> > > >> >> >> > >> >>>> be
> > > >> >> >> > >> >>>> > > high
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> but
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> its
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> one
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> disk only and its not being shared
> by
> > > zk
> > > >> >> >> anymore
> > > >> >> >> > so
> > > >> >> >> > >> >>>> > shouldn't
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> matter?
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>  I also checked the zookeeper
> quorum
> > > >> server
> > > >> >> >> that
> > > >> >> >> > the
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> regionserver
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> tried
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> to
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> connect according to the log.
> > However,
> > > I
> > > >> >> don't
> > > >> >> >> see
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> 192.168.100.116
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> in
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> the
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> client list of the zookeeper quorum
> > > >> server
> > > >> >> that
> > > >> >> >> > the
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> regionserver
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> tried
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> to
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> connect.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Would that be a problem?
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>  Is that because the ephemeral
> node
> > > for
> > > >> the
> > > >> >> >> > >> >>>> regionserver
> > > >> >> >> > >> >>>> > had
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> evaporated?
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> Lost
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> its lease w/ zk by the time you
> went
> > to
> > > >> >> look?
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>  Thu Nov 12 15:04:52 UTC 2009
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Zookeeper version: 3.2.1-808558,
> > built
> > > >> on
> > > >> >> >> > >> 08/27/2009
> > > >> >> >> > >> >>>> 18:48
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> GMT
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> Clients:
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.127:43045
> > > >> >> >> > >> [1](queued=0,recved=26,sent=0)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.131:39091
> > > >> >> >> > >> [1](queued=0,recved=964,sent=0)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.124:35961
> > > >> >> >> > >> [1](queued=0,recved=3266,sent=0)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.123:47935
> > > >> >> >> > >> [1](queued=0,recved=233,sent=0)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.125:46931
> > > >> >> >> > >> [1](queued=0,recved=2,sent=0)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:54924
> > > >> >> >> > >> [1](queued=0,recved=295,sent=0)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:41390
> > > >> >> >> > >> [1](queued=0,recved=2290,sent=0)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.136:42243
> > > >> >> >> > >> [1](queued=0,recved=0,sent=0)
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Latency min/avg/max: 0/17/6333
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Received: 47111
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Sent: 0
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Outstanding: 0
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Zxid: 0x77000083f4
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Mode: leader
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Node count: 23
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> That 6 second maximum latency is
> > pretty
> > > >> bad
> > > >> >> but
> > > >> >> >> > >> should
> > > >> >> >> > >> >>>> be
> > > >> >> >> > >> >>>> > well
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> within
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> the
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> zk
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> session timeout.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> So, problem is likely on the zk
> > client
> > > >> side
> > > >> >> of
> > > >> >> >> the
> > > >> >> >> > >> >>>> session;
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> i.e.
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> in
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> the
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> RS.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> You could enable GC logging as J-D
> > > >> suggested
> > > >> >> to
> > > >> >> >> > see
> > > >> >> >> > >> if
> > > >> >> >> > >> >>>> you
> > > >> >> >> > >> >>>> > > have
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> any
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> big
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> pauses, pauses > zk session timeout.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> When the RS went down, it didn't
> look
> > > too
> > > >> >> >> heavily
> > > >> >> >> > >> >>>> loaded:
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:04:52,830 INFO
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> >  org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > >> >> >> > >> >>>> Dump of
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> metrics:
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>  request=1.5166667, regions=322,
> > > >> stores=657,
> > > >> >> >> > >> >>>> storefiles=631,
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>  storefileIndexSize=61,
> > > >> memstoreSize=1472,
> > > >> >> >> > >> >>>> usedHeap=2819,
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> maxHeap=4079,
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>  blockCacheSize=658110960,
> > > >> >> >> > blockCacheFree=197395984,
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> blockCacheCount=9903,
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>  blockCacheHitRatio=99
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> Its serving a few reads?  The
> number
> > of
> > > >> >> store
> > > >> >> >> > files
> > > >> >> >> > >> >>>> seems
> > > >> >> >> > >> >>>> > > fine.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> Not
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> too
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> much memory used.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> Looking at the logs, I see the
> Lease
> > > >> Still
> > > >> >> Held
> > > >> >> >> > >> >>>> exception.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> This
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> happens
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> when the RS does its regular report
> to
> > > the
> > > >> >> >> master
> > > >> >> >> > but
> > > >> >> >> > >> the
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> master
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> thinks
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> the
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> RS has since restarted.  It'll
> think
> > > this
> > > >> >> >> probably
> > > >> >> >> > >> >>>> because
> > > >> >> >> > >> >>>> > it
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> noticed
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> that
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> the RS's znode in zk had gone away
> > and
> > > it
> > > >> >> >> > considered
> > > >> >> >> > >> the
> > > >> >> >> > >> >>>> RS
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> dead.
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> Looking too at your  logs I see
> this
> > > gap
> > > >> in
> > > >> >> the
> > > >> >> >> zk
> > > >> >> >> > >> >>>> pinging:
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:03:39,325 DEBUG
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> Got
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>  ping response for
> > > >> sessionid:0x224e55436ad0004
> > > >> >> >> after
> > > >> >> >> > >> 0ms
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>  2. 2009-11-12 15:03:43,113 DEBUG
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> Got
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>  ping response for
> > > >> sessionid:0x24e55436a0007d
> > > >> >> >> after
> > > >> >> >> > >> 0ms
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> Where in the lines above it, its
> > > >> reporting
> > > >> >> >> about
> > > >> >> >> > >> every
> > > >> >> >> > >> >>>> ten
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> seconds,
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> here
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> there is a big gap.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> Do you have ganglia or something
> that
> > > >> will
> > > >> >> let
> > > >> >> >> you
> > > >> >> >> > >> look
> > > >> >> >> > >> >>>> more
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> into
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> what
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> was
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> happening on this machine around
> this
> > > >> time?
> > > >> >>  Is
> > > >> >> >> > the
> > > >> >> >> > >> >>>> machine
> > > >> >> >> > >> >>>> > > OK?
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> It
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> looks
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> lightly loaded and you have your
> > cluster
> > > >> >> nicely
> > > >> >> >> > laid
> > > >> >> >> > >> out.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> Something
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> odd
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> is
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> going on.  What about things like
> the
> > > >> write
> > > >> >> >> speed
> > > >> >> >> > to
> > > >> >> >> > >> >>>> disk?
> > > >> >> >> > >> >>>> >  In
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> the
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> past
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> strange issues have been explained by
> > > >> >> incorrectly
> > > >> >> >> > set
> > > >> >> >> > >> BIOS
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> which
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> made
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> disks
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> run at 1/100th of their proper
> speed.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> St.Ack
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Best,
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> zhenyu
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM,
> > > Zhenyu
> > > >> >> Zhong
> > > >> >> >> <
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> zhongresearch@gmail.com
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> wrote:
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>  Stack
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> I am very appreciated for your
> > > >> comments.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> I will use the zookeeper
> monitoring
> > > >> script
> > > >> >> on
> > > >> >> >> my
> > > >> >> >> > >> >>>> cluster
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> and
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>> let
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>> it
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>> run
> > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>> overnight to see the result.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> I will follow up the thread when
> I
> > > get
> > > >> >> >> anything.
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> thanks
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> zhenyu
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM,
> > > stack
> > > >> <
> > > >> >> >> > >> >>>> stack@duboce.net>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>> wrote:
> > > >> >> >> > >> >>>> > > > >>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>  I see these in your log too:
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>  1. 2009-11-11 04:27:19,018
> DEBUG
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> Got
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> > > >> >> sessionid:0x424dfd908c50009
> > > >> >> >> > >> after
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> 4544ms
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>  2. 2009-11-11 04:27:19,018 DEBUG
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> Got
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> > > >> >> sessionid:0x24dfd90c810002
> > > >> >> >> > after
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> 4548ms
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>  3. 2009-11-11 04:27:43,960 DEBUG
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>> Got
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>  ping response for
> > > >> >> sessionid:0x424dfd908c50009
> > > >> >> >> > >> after
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> 9030ms
> > > >> >> >> > >> >>>> > > > >>>
> > > >> >> >> > >> >>>> > > > >>>>  4. 2009-11-11 04:27:43,960 DEBUG
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>>>>>> org.apache.zookeeper.
> > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > >> >> >> > >> >>>> > > > >>>>>>>
> > > >> >> >> > >> >>>> > > >
> > > >> >> >> > >> >>>> > >
> > > >> >> >> > >> >>>> >
> > > >> >> >> > >> >>>>
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>>
> > > >> >> >> > >> >>
> > > >> >> >> > >> >
> > > >> >> >> > >>
> > > >> >> >> > >
> > > >> >> >> > >
> > > >> >> >> >
> > > >> >> >>
> > > >> >> >
> > > >> >> >
> > > >> >>
> > > >> >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message