hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Henry Hung <YTHu...@winbond.com>
Subject What cause region server to timeout other than long gc?
Date Wed, 23 Oct 2013 03:41:26 GMT
Hi All,

Today I have 1 of 9 region servers down because of zookeeper timeout, this is the log:
2013-10-23 07:41:34,139 INFO org.apache.hadoop.hbase.regionserver.Store: Starting compaction
of 4 file(s) in cf of MES_PERF_LOG_TIME,\x00\x04\x00\x00\x01A\x9D\xDD\xD9\x8D\x00\x00\x08\xD0fcap2\x00\x00=\xFC,1381922811424.cc6325e7896e3844d124321c315b02b0.
into tmpdir=hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp,
seqid=2655801231, totalSize=923.8m
2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: Compacting hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/3ae8adddd77f4a669b93eba2068f5778,
keycount=34504344, bloomtype=NONE, size=501.3m, encoding=NONE
2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: Compacting hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/055361f74fb644f1b3eb48f5da7e3ec0,
keycount=18581704, bloomtype=NONE, size=269.5m, encoding=NONE
2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: Compacting hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/a92cf779dca44843ab4c5130cca68d4f,
keycount=9953404, bloomtype=NONE, size=143.9m, encoding=NONE
2013-10-23 07:41:34,139 DEBUG org.apache.hadoop.hbase.regionserver.Compactor: Compacting hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/cf/51b89f361eb64f2f8c894a5c1516c604,
keycount=664008, bloomtype=NONE, size=9.2m, encoding=NONE
2013-10-23 07:41:34,182 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp/16a28f8fc9f049c0a8b014502faf6339
with permission=rwxrwxrwx
2013-10-23 07:41:34,190 DEBUG org.apache.hadoop.hbase.io.hfile.HFileWriterV2: Initialized
with CacheConfig:enabled [cacheDataOnRead=true] [cacheDataOnWrite=false] [cacheIndexesOnWrite=false]
[cacheBloomsOnWrite=false] [cacheEvictOnClose=false] [cacheCompressed=false]
2013-10-23 07:41:34,191 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family
Bloom filter type for hdfs://fchddn1.ctfab.com:9000/hbase/MES_PERF_LOG_TIME/cc6325e7896e3844d124321c315b02b0/.tmp/16a28f8fc9f049c0a8b014502faf6339:
CompoundBloomFilterWriter
2013-10-23 07:43:16,793 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 51848ms instead
of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-10-23 07:43:16,799 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
not heard from server in 60261ms for sessionid 0x1412996eb5d2170, closing socket connection
and attempting reconnect
2013-10-23 07:43:16,800 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
not heard from server in 60184ms for sessionid 0x1412996eb5d217e, closing socket connection
and attempting reconnect
2013-10-23 07:43:16,807 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":50480,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b17c1aa),
rpc version=1, client version=29, methodsFingerPrint=-1368823753","client":"10.16.10.181:40987","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-10-23 07:43:16,807 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":50159,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@71f59d1a),
rpc version=1, client version=29, methodsFingerPrint=-1368823753","client":"10.16.10.181:40023","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-10-23 07:43:16,808 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":50480,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@380e6d06),
rpc version=1, client version=29, methodsFingerPrint=-1368823753","client":"10.16.10.181:40963","starttimems":1382485346321,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-10-23 07:43:18,154 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server fchddn1.ctfab.com/10.16.10.181:2222. Will not attempt to authenticate using SASL (Unable
to locate a login configuration)
2013-10-23 07:43:18,155 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to fchddn1.ctfab.com/10.16.10.181:2222, initiating session
2013-10-23 07:43:18,201 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
This client just lost it's session with ZooKeeper, will automatically reconnect when needed.
2013-10-23 07:43:18,201 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
ZK session expired. This disconnect could have been caused by a network partition or a long-running
GC pause, either way it's recommended that you verify your environment.
2013-10-23 07:43:18,230 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2013-10-23 07:43:18,230 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper
service, session 0x1412996eb5d217e has expired, closing socket connection
2013-10-23 07:43:18,804 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server fchddn1.ctfab.com/10.16.10.181:2222. Will not attempt to authenticate using SASL (Unable
to locate a login configuration)
2013-10-23 07:43:18,805 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to fchddn1.ctfab.com/10.16.10.181:2222, initiating session
2013-10-23 07:43:18,805 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper
service, session 0x1412996eb5d2170 has expired, closing socket connection
2013-10-23 07:43:18,805 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
region server fchddn2.ctfab.com,60020,1382059486345: regionserver:60020-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170
regionserver:60020-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170-0x1412996eb5d2170
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:363)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:282)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)


The message mention that this caused by a long garbage collection, but when I go to see the
gc log, I cannot find any long running gc:
2013-10-23T07:42:16.396+0800: 425852.718: [GC 425852.718: [ParNew: 211937K->5070K(235968K),
0.0106610 secs] 2412876K->2206256K(4801752K), 0.0107230 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:16.760+0800: 425853.082: [GC 425853.082: [ParNew: 214826K->4962K(235968K),
0.0108200 secs] 2416013K->2206428K(4801752K), 0.0109040 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:17.101+0800: 425853.422: [GC 425853.422: [ParNew: 214754K->8823K(235968K),
0.0111210 secs] 2416220K->2210549K(4801752K), 0.0111890 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:17.461+0800: 425853.783: [GC 425853.783: [ParNew: 218615K->10874K(235968K),
0.0117700 secs] 2420341K->2212905K(4801752K), 0.0118380 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:17.827+0800: 425854.149: [GC 425854.149: [ParNew: 220666K->10443K(235968K),
0.0120950 secs] 2422697K->2214840K(4801752K), 0.0121640 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:18.216+0800: 425854.537: [GC 425854.538: [ParNew: 220235K->10855K(235968K),
0.0115060 secs] 2424632K->2215478K(4801752K), 0.0115940 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:18.592+0800: 425854.914: [GC 425854.914: [ParNew: 220711K->2730K(235968K),
0.0116310 secs] 2425333K->2209654K(4801752K), 0.0117080 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:18.962+0800: 425855.284: [GC 425855.284: [ParNew: 212522K->2292K(235968K),
0.0108730 secs] 2419446K->2209540K(4801752K), 0.0109640 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:19.322+0800: 425855.644: [GC 425855.644: [ParNew: 212084K->3115K(235968K),
0.0106050 secs] 2419332K->2210656K(4801752K), 0.0106740 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:19.722+0800: 425856.044: [GC 425856.044: [ParNew: 212907K->2373K(235968K),
0.0107250 secs] 2420448K->2210234K(4801752K), 0.0108060 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:20.258+0800: 425856.579: [GC 425856.579: [ParNew: 212165K->2999K(235968K),
0.0107150 secs] 2420026K->2211194K(4801752K), 0.0107950 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:20.640+0800: 425856.962: [GC 425856.962: [ParNew: 212791K->2687K(235968K),
0.0107090 secs] 2420986K->2211198K(4801752K), 0.0107890 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:21.008+0800: 425857.330: [GC 425857.330: [ParNew: 212479K->10731K(235968K),
0.0124240 secs] 2420990K->2219520K(4801752K), 0.0125040 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:21.390+0800: 425857.711: [GC 425857.711: [ParNew: 220523K->11031K(235968K),
0.0117490 secs] 2429312K->2220131K(4801752K), 0.0118380 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:21.737+0800: 425858.059: [GC 425858.059: [ParNew: 220829K->11034K(235968K),
0.0116510 secs] 2429928K->2220524K(4801752K), 0.0117290 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:22.088+0800: 425858.410: [GC 425858.410: [ParNew: 220762K->8764K(235968K),
0.0182030 secs] 2430252K->2218531K(4801752K), 0.0182760 secs] [Times: user=0.05 sys=0.00,
real=0.02 secs]
2013-10-23T07:42:22.452+0800: 425858.774: [GC 425858.774: [ParNew: 218398K->3125K(235968K),
0.0109820 secs] 2428166K->2215240K(4801752K), 0.0110600 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:22.806+0800: 425859.128: [GC 425859.128: [ParNew: 212917K->3093K(235968K),
0.0129060 secs] 2425032K->2215725K(4801752K), 0.0129740 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:23.191+0800: 425859.513: [GC 425859.513: [ParNew: 212885K->3037K(235968K),
0.0105990 secs] 2425517K->2215997K(4801752K), 0.0106680 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:23.547+0800: 425859.869: [GC 425859.869: [ParNew: 212829K->2592K(235968K),
0.0155460 secs] 2425789K->2215818K(4801752K), 0.0156380 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:23.913+0800: 425860.235: [GC 425860.235: [ParNew: 212384K->6389K(235968K),
0.0185530 secs] 2425610K->2219920K(4801752K), 0.0186300 secs] [Times: user=0.06 sys=0.00,
real=0.02 secs]
2013-10-23T07:42:23.932+0800: 425860.254: [GC 425860.254: [ParNew: 6398K->4591K(235968K),
0.0101310 secs] 2219929K->2218390K(4801752K), 0.0101880 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:24.281+0800: 425860.603: [GC 425860.603: [ParNew: 214383K->10297K(235968K),
0.0113230 secs] 2428182K->2224387K(4801752K), 0.0114040 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:24.607+0800: 425860.929: [GC 425860.929: [ParNew: 220089K->10532K(235968K),
0.0131330 secs] 2434179K->2224894K(4801752K), 0.0132140 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:25.013+0800: 425861.335: [GC 425861.335: [ParNew: 220324K->6430K(235968K),
0.0117640 secs] 2434686K->2223155K(4801752K), 0.0118420 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:25.534+0800: 425861.856: [GC 425861.856: [ParNew: 216222K->13486K(235968K),
0.0117460 secs] 2432947K->2230212K(4801752K), 0.0118280 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:42:25.969+0800: 425862.290: [GC 425862.290: [ParNew: 223236K->7543K(235968K),
0.0119530 secs] 2439961K->2226646K(4801752K), 0.0120310 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:43:16.467+0800: 425912.789: [GC 425912.789: [ParNew: 217335K->10714K(235968K),
0.0131220 secs] 2436438K->2230095K(4801752K), 0.0132030 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:43:17.177+0800: 425913.499: [GC 425913.499: [ParNew: 220506K->3502K(235968K),
0.0114700 secs] 2439887K->2225224K(4801752K), 0.0115550 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:43:17.557+0800: 425913.879: [GC 425913.879: [ParNew: 213294K->2873K(235968K),
0.0110120 secs] 2435016K->2224997K(4801752K), 0.0110940 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:43:17.898+0800: 425914.220: [GC 425914.220: [ParNew: 212665K->2783K(235968K),
0.0182350 secs] 2434789K->2225246K(4801752K), 0.0183390 secs] [Times: user=0.05 sys=0.00,
real=0.02 secs]
2013-10-23T07:43:18.303+0800: 425914.624: [GC 425914.625: [ParNew: 212541K->3134K(235968K),
0.0108970 secs] 2435004K->2225888K(4801752K), 0.0110010 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
2013-10-23T07:43:18.691+0800: 425915.013: [GC 425915.013: [ParNew: 212926K->2851K(235968K),
0.0114990 secs] 2435680K->2226095K(4801752K), 0.0115790 secs] [Times: user=0.04 sys=0.00,
real=0.01 secs]
2013-10-23T07:43:19.331+0800: 425915.653: [GC 425915.653: [ParNew: 212643K->2503K(235968K),
0.0105020 secs] 2435887K->2226096K(4801752K), 0.0105700 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
Heap
par new generation   total 235968K, used 86748K [0x0000000683e00000, 0x0000000693e00000, 0x0000000693e00000)
  eden space 209792K,  40% used [0x0000000683e00000, 0x0000000689045380, 0x0000000690ae0000)
  from space 26176K,   9% used [0x0000000692470000, 0x00000006926e1f70, 0x0000000693e00000)
  to   space 26176K,   0% used [0x0000000690ae0000, 0x0000000690ae0000, 0x0000000692470000)
concurrent mark-sweep generation total 4565784K, used 2223592K [0x0000000693e00000, 0x00000007aa8c6000,
0x00000007fae00000)
concurrent-mark-sweep perm gen total 43908K, used 26321K [0x00000007fae00000, 0x00000007fd8e1000,
0x0000000800000000)

Is there any other reason that can cause the region server to idle more than 60 seconds?

Best regards,
Henry

________________________________
The privileged confidential information contained in this email is intended for use only by
the addressees as indicated by the original sender of this email. If you are not the addressee
indicated in this email or are not responsible for delivery of the email to such a person,
please kindly reply to the sender indicating this fact and delete all copies of it from your
computer and network server immediately. Your cooperation is highly appreciated. It is advised
that any unauthorized use of confidential information of Winbond is strictly prohibited; and
any information in this email irrelevant to the official business of Winbond shall be deemed
as neither given nor endorsed by Winbond.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message