hbase-user mailing list archives

From Zhenyu Zhong <zhongresea...@gmail.com>
Subject Re: regionserver disconnection
Date Fri, 13 Nov 2009 21:58:36 GMT
Do HBase and Hadoop have to run on kernel 2.6.27 or JDK 1.6? It seems that
only that kernel provides the epoll resource configuration.
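
For reference, on kernel series that expose the epoll accounting added around
2.6.27 (the limit discussed in the blog post linked later in this thread), the
setting can be inspected and raised roughly as below. This is a sketch; the
sysctl name and the value 4096 are assumptions about that kernel series, not
something verified on this cluster:

  # only present on kernels that ship the epoll limits (around 2.6.27/2.6.28)
  cat /proc/sys/fs/epoll/max_user_instances

  # raise it on the running system
  sysctl -w fs.epoll.max_user_instances=4096

  # persist it across reboots
  echo "fs.epoll.max_user_instances = 4096" >> /etc/sysctl.conf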

This is the first time I have seen this error; it appeared after I switched to
machines with fewer resources for ZooKeeper.

Probably I should change that back.


zhenyu


On Fri, Nov 13, 2009 at 4:37 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:

> The file descriptor limits were raised: fs.file-max is set to 1578334, and in
> limits.conf the nofile value is set to 32768.
> So both are well above the number of descriptors the running processes
> actually have open.
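
As a rough illustration (a sketch, not commands taken from this cluster), the
limits mentioned above can be double-checked system-wide, per shell, and per
running process; the hadoop user name and <pid> below are placeholders:

  # system-wide ceiling (fs.file-max)
  cat /proc/sys/fs/file-max

  # per-user limit from /etc/security/limits.conf, e.g. a line like:
  #   hadoop  -  nofile  32768
  ulimit -n

  # limit actually in effect for a running daemon
  grep "open files" /proc/<pid>/limits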
>
> thanks
> zhenyu
>
>
>
> On Fri, Nov 13, 2009 at 4:33 PM, Stack <saint.ack@gmail.com> wrote:
>
>> You upped the ulimit file descriptors as per the getting started doc?
>>
>>
>>
>> On Nov 13, 2009, at 1:26 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>> wrote:
>>
>>> Thanks a lot.
>>>
>>> The bad news is that my kernel is still 2.6.26,
>>> but that was not a problem before.
>>>
>>> Very strange.
>>>
>>> zhenyu
>>>
>>> On Fri, Nov 13, 2009 at 4:16 PM, Jean-Daniel Cryans <jdcryans@apache.org>
>>> wrote:
>>>
>>>> Looks like
>>>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
>>>>
>>>> J-D
>>>>
>>>> On Fri, Nov 13, 2009 at 1:12 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> After I reorganized the cluster, the experiment ran into problems faster
>>>>> than before.
>>>>>
>>>>> Basically, the change was to use the machines with fewer resources as
>>>>> ZooKeeper quorum members and the machines with more resources as
>>>>> regionservers.
>>>>>
>>>>> From the log, I still see the pause of around 1 minute.
>>>>> I enabled GC logging; please see
>>>>>
>>>>> http://pastebin.com/m1d4ce0f1
>>>>>
>>>>> for details.
>>>>> In general I don't see many pauses in the GC.
>>>>>
>>>>> What is more interesting, 1 hour after the first regionserver
>>>>> disconnected, the master log complained about too many open files.
>>>>> This didn't happen before.
>>>>> I checked the OS setup as well as limits.conf, and I also checked the
>>>>> running processes. The total number of open files doesn't reach the
>>>>> limit, so I am a bit confused.
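
For what it is worth, one way to count the descriptors each Java daemon is
actually holding is sketched below; using jps to find the pids is an
assumption about how the daemons were started:

  # open descriptors per java daemon (HMaster, HRegionServer, DataNode, ...)
  for pid in $(jps -q); do
    echo "$pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) fds"
  done

  # or for a single process
  lsof -p <pid> | wc -l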
>>>>>
>>>>>
>>>>> Please see the following master log.
>>>>>
>>>>> 2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 4658 row(s) of meta region {server: 192.168.100.128:60021, regionname: .META.,,1, startKey: <>} complete
>>>>> 2009-11-13 20:06:02,114 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>>> 2009-11-13 20:06:07,677 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 3ms
>>>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
>>>>> 2009-11-13 20:06:08,178 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2808245019291145247_5478039
>>>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
>>>>> 2009-11-13 20:06:09,682 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1074853606841896259_5478048
>>>>> 2009-11-13 20:06:10,334 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 1ms
>>>>> 2009-11-13 20:06:21,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
>>>>> 2009-11-13 20:06:23,674 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.100.123:50010
>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6642544517082142289_5478063
>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>> 2009-11-13 20:06:24,828 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2057511041109796090_5478063
>>>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>> 2009-11-13 20:06:24,928 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_8219260302213892894_5478064
>>>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>> 2009-11-13 20:06:30,855 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1669205542853067709_5478235
>>>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>> 2009-11-13 20:06:30,905 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_9128897691346270351_5478237
>>>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>> 2009-11-13 20:06:30,955 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1116845144864123018_5478240
>>>>> 2009-11-13 20:06:34,372 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424eebf1c10004c after 0ms
>>>>> 2009-11-13 20:06:37,034 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24eebf1043003c after 0ms
>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Too many open files
>>>>>      at sun.nio.ch.IOUtil.initPipe(Native Method)
>>>>>      at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
>>>>>      at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>>>>>      at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
>>>>>      at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
>>>>>      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>>>>      at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>>>>      at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>>>>      at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>>>>      at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>>>      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
>>>>>
>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 bad datanode[0] 192.168.100.123:50010
>>>>> 2009-11-13 20:06:37,235 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8148813491785406356_5478475 in pipeline 192.168.100.123:50010, 192.168.100.134:50010, 192.168.100.122:50010: bad datanode 192.168.100.123:50010
>>>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Too many open files
>>>>> 2009-11-13 20:06:37,436 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2119727700857186236_5478498
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Will do.
>>>>>>
>>>>>> thanks
>>>>>> zhenyu
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <stack@duboce.net> wrote:
>>>>>>
>>>>>>> Enable GC logging too on this next run (see hbase-env.sh). Let's try
>>>>>>> and get to the bottom of what's going on.
>>>>>>> Thanks,
>>>>>>> St.Ack
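
For readers following along, turning on GC logging in conf/hbase-env.sh looks
roughly like this; the exact flags and log path below are assumptions for a
Sun JDK 6 JVM, not this cluster's actual configuration:

  # in conf/hbase-env.sh
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"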
>>>>>>>
>>>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I can switch the boxes that run ZooKeeper with the ones that run
>>>>>>>> regionservers.
>>>>>>>> I will see the results later.
>>>>>>>>
>>>>>>>> FYI, the node does have the 10-minute zookeeper.session.timeout value
>>>>>>>> in place.
>>>>>>>>
>>>>>>>> thanks
>>>>>>>> zhenyu
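
For context, the 10-minute zookeeper.session.timeout mentioned above would be
set in hbase-site.xml roughly as below; the value is in milliseconds, and this
snippet is a sketch rather than the poster's actual config:

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- 10 minutes, expressed in milliseconds -->
    <value>600000</value>
  </property>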
>>>>>>>>
>>>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <stack@duboce.net> wrote:
>>>>>>>>
>>>>>>>>> On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> In my cluster, half of the machines have 2 disks of 400GB each, and
>>>>>>>>>> half have 6 disks per machine. Maybe we should run ZooKeeper on the
>>>>>>>>>> machines with 2 disks and the regionservers on the machines with
>>>>>>>>>> 6 disks?
>>>>>>>>>>
>>>>>>>>> That would make the most sense, only in the log below it looks like
>>>>>>>>> the RS that had the issue had 4 disks?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> BTW, the 10-minute zookeeper.session.timeout has been set during
>>>>>>>>>> the experiment.
>>>>>>>>>>
>>>>>>>>> And for sure this node had it in place?
>>>>>>>>> St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> thanks
>>>>>>>>>> zhenyu
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <stack@duboce.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Though I experienced 2 regionserver disconnections this morning,
>>>>>>>>>>>> it looks better from the regionserver log (please see the
>>>>>>>>>>>> following log):
>>>>>>>>>>>>
>>>>>>>>>>>> http://pastebin.com/m496dbfae
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I checked disk IO for one of the regionservers (192.168.100.116)
>>>>>>>>>>>> during the time it disconnected.
>>>>>>>>>>>>
>>>>>>>>>>>> Time: 03:04:58 AM
>>>>>>>>>>>> Device:    tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
>>>>>>>>>>>> sda     105.31      5458.83     19503.64   9043873239  32312473676
>>>>>>>>>>>> sda1      2.90         6.64        99.25     10993934    164433464
>>>>>>>>>>>> sda2      1.72        23.76        30.25     39365817     50124008
>>>>>>>>>>>> sda3      0.30         0.38         3.58       624930      5923000
>>>>>>>>>>>> sda4    100.39      5428.06     19370.56   8992888270  32091993204
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Is this high for you? 20k blocks/second would seem to be high, but
>>>>>>>>>>> it's one disk only and it's not being shared by zk anymore, so it
>>>>>>>>>>> shouldn't matter?
>>>>>>>
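
The device numbers quoted above look like iostat output; a sketch of
collecting the same view during a future incident (the interval and sample
count below are arbitrary choices):

  # extended per-device statistics, every 5 seconds, 12 samples
  iostat -x 5 12

  # or the simpler block-level view shown above
  iostat -d 5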
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I also checked the ZooKeeper quorum server that the regionserver
>>>>>>>>>>>> tried to connect to, according to the log. However, I don't see
>>>>>>>>>>>> 192.168.100.116 in the client list of that quorum server.
>>>>>>>>>>>> Would that be a problem?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Is that because the ephemeral node for the regionserver had
>>>>>>>>>>> evaporated? Lost its lease w/ zk by the time you went to look?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Thu Nov 12 15:04:52 UTC 2009
>>>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
>>>>>>>>>>>> Clients:
>>>>>>>>>>>> /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
>>>>>>>>>>>> /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
>>>>>>>>>>>> /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
>>>>>>>>>>>> /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
>>>>>>>>>>>> /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
>>>>>>>>>>>> /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
>>>>>>>>>>>> /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
>>>>>>>>>>>> /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
>>>>>>>>>>>>
>>>>>>>>>>>> Latency min/avg/max: 0/17/6333
>>>>>>>>>>>> Received: 47111
>>>>>>>>>>>> Sent: 0
>>>>>>>>>>>> Outstanding: 0
>>>>>>>>>>>> Zxid: 0x77000083f4
>>>>>>>>>>>> Mode: leader
>>>>>>>>>>>> Node count: 23
>>>>>>>>>>>>
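
The block above is the output of ZooKeeper's "stat" four-letter command; it
can be pulled from a quorum member roughly like this, assuming the default
client port 2181 and a placeholder hostname:

  echo stat | nc zk-quorum-host 2181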
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That 6-second maximum latency is pretty bad, but it should be well
>>>>>>>>>>> within the zk session timeout.
>>>>>>>>>>>
>>>>>>>>>>> So the problem is likely on the zk client side of the session,
>>>>>>>>>>> i.e. in the RS.
>>>>>>>>>>>
>>>>>>>>>>> You could enable GC logging as J-D suggested to see if you have any
>>>>>>>>>>> big pauses, i.e. pauses > the zk session timeout.
>>>>>>>>>>>
>>>>>>>>>>> When the RS went down, it didn't look too heavily loaded:
>>>>>>>>>>>
>>>>>>>>>>> 2009-11-12 15:04:52,830 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>>>>>>>>>>> request=1.5166667, regions=322, stores=657, storefiles=631,
>>>>>>>>>>> storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, maxHeap=4079,
>>>>>>>>>>> blockCacheSize=658110960, blockCacheFree=197395984,
>>>>>>>>>>> blockCacheCount=9903, blockCacheHitRatio=99
>>>>>>>>>>>
>>>>>>>>>>> It's serving a few reads? The number of store files seems fine.
>>>>>>>>>>> Not too much memory used.
>>>>>>>>>>>
>>>>>>>>>>> Looking at the logs, I see the Lease Still Held exception. This
>>>>>>>>>>> happens when the RS does its regular report to the master but the
>>>>>>>>>>> master thinks the RS has since restarted. It'll think this probably
>>>>>>>>>>> because it noticed that the RS's znode in zk had gone away and it
>>>>>>>>>>> considered the RS dead.
>>>>>>>>>>>
>>>>>>>>>>> Looking too at your logs, I see this gap in the zk pinging:
>>>>>>>>>>>
>>>>>>>>>>> 2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn: Got
>>>>>>>>>>> ping response for sessionid:0x224e55436ad0004 after 0ms
>>>>>>>>>>> 2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: Got
>>>>>>>>>>> ping response for sessionid:0x24e55436a0007d after 0ms
>>>>>>>>>>>
>>>>>>>>>>> Where in the lines above it, it's reporting about every ten seconds;
>>>>>>>>>>> here there is a big gap.
>>>>>>>>>>>
>>>>>>>>>>> Do you have ganglia or something that will let you look more into
>>>>>>>>>>> what was happening on this machine around this time? Is the machine
>>>>>>>>>>> OK? It looks lightly loaded and you have your cluster nicely laid
>>>>>>>>>>> out. Something odd is going on. What about things like the write
>>>>>>>>>>> speed to disk? In the past, strange issues have been explained by an
>>>>>>>>>>> incorrectly set BIOS which made disks run at 1/100th of their proper
>>>>>>>>>>> speed.
>>>>>>>>>>>
>>>>>>>>>>> St.Ack
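
A quick way to sanity-check raw sequential write speed on a suspect data disk
is sketched below; the target path and size are placeholders, and the test
file should be deleted afterwards:

  # write 1 GB and force it to disk before reporting the rate
  dd if=/dev/zero of=/data1/ddtest bs=1M count=1024 conv=fdatasync
  rm /data1/ddtest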
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Stack,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I really appreciate your comments.
>>>>>>>>>>>>> I will use the ZooKeeper monitoring script on my cluster and let
>>>>>>>>>>>>> it run overnight to see the result.
>>>>>>>>>>>>> I will follow up on the thread when I get anything.
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks
>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 11, 2009 at 3:52 PM, stack <stack@duboce.net> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>>>>>>> I see these in your log too:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got
>>>>>>>>>>>>>> ping response for sessionid:0x424dfd908c50009 after 4544ms
>>>>>>>>>>>>>> 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got
>>>>>>>>>>>>>> ping response for sessionid:0x24dfd90c810002 after 4548ms
>>>>>>>>>>>>>> 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got
>>>>>>>>>>>>>> ping response for sessionid:0x424dfd908c50009 after 9030ms
>>>>>>>>>>>>>> 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.
>
