hbase-user mailing list archives

From Bo Zhang <bzh...@marinsoftware.com>
Subject Region inconsistent and RegionServer in Transition
Date Wed, 26 Jul 2017 05:16:42 GMT
Hello hbaseers,
We currently run hbase-1.0.0-cdh5.5.2 on our HBase cluster. However, we ran
into some problems yesterday.

We only stopped and started the HBase cluster once, and afterwards we saw
this initial error:
>Number of regions: 10
>Deployed on: prod-lex-datanode-lv-238.prod.marinsw.net,60020,1500950475647
prod-lex-datanode-lv-245.prod.marinsw.net,60020,1500950476164
prod-lex-datanode-lv-247.prod.marinsw.net,60020,1500950476370
prod-lex-datanode-lv-292.prod.marinsw.net,60020,1500950475711
prod-lex-datanode-lv-294.prod.marinsw.net,60020,1500950475833
prod-lex-datanode-lv-297.prod.marinsw.net,60020,1500950475948
prod-lex-datanode-lv-302.prod.marinsw.net,60020,1500950475835
prod-lex-datanode-lv-303.prod.marinsw.net,60020,1500950477303
>888 inconsistencies detected.
>Status: INCONSISTENT
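
For context, the report above is the output of the hbck consistency check;
for reference, the invocation is roughly the following (-details additionally
lists every individual inconsistency):

    hbase hbck
    hbase hbck -details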

Then we restarted the HBase cluster again and tried "hbck -fix" to repair the
inconsistencies, but we received this error:
>2017-07-25 03:52:06,341 WARN org.apache.hadoop.hbase.master.RegionStates: Failed to open/close 0015506030f086780f6154b4cace7c6a on prod-lex-datanode-lv-295.prod.marinsw.net,60020,1500953992312, set to FAILED_CLOSE
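
For reference, the fix attempt was essentially the following, run from a node
with the HBase client configured (exact form approximate):

    hbase hbck          # report inconsistencies only
    hbase hbck -fix     # try to fix region assignments; older alias of -fixAssignments

hbck also exposes -fixMeta and related options for hbase:meta-level repairs,
but -fix alone is what we ran here.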

Meanwhile, regions were stuck "in transition" (RIT).

At that point we had to stop the cluster and planned to use offlineMetaRepair
to fix it.
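
The offlineMetaRepair we had in mind is the offline meta-rebuild tool that
ships with hbck; it is invoked roughly as follows, and must only be run while
HBase is completely down, since it rebuilds hbase:meta from the region
metadata in HDFS:

    hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair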
However, the region servers would not stop cleanly, so we had to kill the
processes manually.
Here are the logs:

>2017-07-25 05:50:32,233 INFO org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: loading config
>2017-07-25 05:50:32,278 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1500961832240, server=null} to {1588230740 state=OPEN, ts=1500961832278, server=prod-lex-datanode-lv-235.prod.marinsw.net,60020,1500960907808}
>2017-07-25 05:50:32,279 INFO org.apache.hadoop.hbase.master.ServerManager: AssignmentManager hasn't finished failover cleanup; waiting
>2017-07-25 05:50:32,280 INFO org.apache.hadoop.hbase.master.HMaster: hbase:meta assigned=0, rit=false, location=prod-lex-datanode-lv-235.prod.marinsw.net,60020,1500960907808
>2017-07-25 05:50:32,433 INFO org.apache.hadoop.hbase.MetaMigrationConvertingToPB: META already up-to date with PB serialization
>2017-07-25 05:50:32,782 INFO org.apache.hadoop.hbase.master.AssignmentManager: Found regions out on cluster or in RIT; presuming failover
>2017-07-25 05:50:32,834 INFO org.apache.hadoop.hbase.master.AssignmentManager: Joined the cluster in 400ms, failover=true
>2017-07-25 05:50:32,905 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931 before assignment; region count=0
>2017-07-25 05:50:32,908 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931]
>2017-07-25 05:50:32,910 INFO org.apache.hadoop.hbase.master.SplitLogManager: hdfs://prod-lex/hbase/WALs/prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931-splitting is empty dir, no logs to split
>2017-07-25 05:50:32,911 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 0 logs in [hdfs://prod-lex/hbase/WALs/prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931-splitting] for [prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931]
>2017-07-25 05:50:32,917 WARN org.apache.hadoop.hbase.master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://prod-lex/hbase/WALs/prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931-splitting
>2017-07-25 05:50:32,917 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 0 bytes in 0 log files in [hdfs://prod-lex/hbase/WALs/prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931-splitting] in 6ms
>2017-07-25 05:50:32,918 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0 region(s) that prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931 was carrying (and 0 regions(s) that were opening on this server)
>2017-07-25 05:50:32,918 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931


In the end we had to fall back on snapshots, delete the HBase znodes from
ZooKeeper, and do a full restart of the whole cluster (HDFS, ZooKeeper, and
HBase, along with other required services such as Hive and Oozie).
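
The znode cleanup was done roughly like this, via the ZooKeeper CLI bundled
with HBase (the path assumes the default zookeeper.znode.parent of /hbase):

    hbase zkcli
    rmr /hbase

HBase has to be fully stopped before its znodes are removed; it recreates
them on the next startup.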

Although we got the HBase cluster back up, we still don't know what caused
the problems, and we would appreciate some suggestions and explanations so we
can avoid them happening again.

Do you have any idea why restarting the HBase cluster would cause these
inconsistency and region-in-transition problems?

And is there a better (or smarter) way to solve them?

Any suggestions and ideas are welcome.

Thank you so much in advance.

++

Bo ZHANG
