hbase-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Master aborts on start up - URGENT
Date Sun, 28 Jul 2013 01:02:39 GMT
Can you collect the region server log from sjc1-eng-perf-g1-grid03.carrieriq.com?

You can pastebin the portion of the region server log related to
usertable,user6,1374971740436.73e01b52a570febc16833d7cc4f7ca48. after
anonymization.
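
If it helps, here is a rough sketch of one way to pull out and anonymize
those lines before pasting. This is not an HBase tool: the helper name
extract_region_lines, the "HOST" placeholder, and the default domain are
all my own assumptions for illustration.

```python
import re

# Hypothetical helper (not part of HBase): keep only the log lines that
# mention a region's encoded name, and mask hostnames under one domain.
def extract_region_lines(lines, encoded_name, domain="carrieriq.com"):
    # Matches one or more dot-separated labels followed by the domain,
    # e.g. "sjc1-eng-perf-g1-grid03.carrieriq.com".
    host_re = re.compile(
        r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\." + re.escape(domain)
    )
    selected = []
    for line in lines:
        if encoded_name in line:
            # Replace every matching hostname with a fixed placeholder.
            selected.append(host_re.sub("HOST", line.rstrip("\n")))
    return selected
```

Calling it on an open log file with the encoded name
73e01b52a570febc16833d7cc4f7ca48 returns the matching lines with the
hostnames masked. It matches line by line, so stack trace continuation
lines that do not repeat the region name are dropped, and anything else
sensitive (IP addresses, table names) still has to be checked by hand.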

Cheers

On Sat, Jul 27, 2013 at 5:47 PM, Vladimir Rodionov
<vrodionov@carrieriq.com> wrote:

> Nope. This seems to be a very serious issue.
>
> When I tried to recreate 'usertable' I got the same issue again:
>
>
> 2013-07-28 00:35:40,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x54022944d180000 Creating (or updating) unassigned node for
> a386becc8860c810e33bb9c9d81482bc with OFFLINE state
> 2013-07-28 00:35:40,747 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> IPC Server Responder
> 2013-07-28 00:35:40,747 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan
> for usertable,user,1374971740436.cf772fe9e49bc911024b442914a15f67.
> destination server is sjc1-eng-perf-g1-grid04.carrieriq.com
> ,60020,1374969681440
> 2013-07-28 00:35:40,748 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: No previous transition
> plan was found (or we are ignoring an existing plan) for
> usertable,user,1374971740436.cf772fe9e49bc911024b442914a15f67. so generated
> a random one;
> hri=usertable,user,1374971740436.cf772fe9e49bc911024b442914a15f67., src=,
> dest=sjc1-eng-perf-g1-grid19.carrieriq.com,60020,1374969681450; 20
> (online=20, available=19) available servers
> 2013-07-28 00:35:40,748 INFO org.mortbay.log: Stopped
> SelectChannelConnector@0.0.0.0:60010
> 2013-07-28 00:35:40,749 DEBUG
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED
> event for 16938dcb9c3bb52a46ffb7b10fab3c57
> 2013-07-28 00:35:40,749 FATAL org.apache.hadoop.hbase.master.HMaster:
> Master server abort: loaded coprocessors are: []
> 2013-07-28 00:35:40,749 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=usertable,user7,1374971740436.16938dcb9c3bb52a46ffb7b10fab3c57.
> state=CLOSED, ts=1374971740713, server=
> sjc1-eng-perf-g1-grid01.carrieriq.com,60020,1374969681434
> 2013-07-28 00:35:40,749 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x54022944d180000 Creating (or updating) unassigned node for
> 16938dcb9c3bb52a46ffb7b10fab3c57 with OFFLINE state
> 2013-07-28 00:35:40,749 FATAL org.apache.hadoop.hbase.master.HMaster:
> Unexpected state :
> usertable,user6,1374971740436.73e01b52a570febc16833d7cc4f7ca48.
> state=PENDING_OPEN, ts=1374971740749, server=
> sjc1-eng-perf-g1-grid03.carrieriq.com,60020,1374969681445 .. Cannot
> transit it to OFFLINE.
> java.lang.IllegalStateException: Unexpected state :
> usertable,user6,1374971740436.73e01b52a570febc16833d7cc4f7ca48.
> state=PENDING_OPEN, ts=1374971740749, server=
> sjc1-eng-perf-g1-grid03.carrieriq.com,60020,1374969681445 .. Cannot
> transit it to OFFLINE.
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1820)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1659)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
>         at
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2013-07-28 00:35:40,749 INFO org.apache.hadoop.hbase.master.HMaster:
> Aborting
>
>
> Master aborted.
>
> This is what I ran:
>
> create 'usertable', { NAME=>'cf', VERSIONS=> 1, COMPRESSION => 'SNAPPY',
> BLOCKCACHE => true}, { SPLITS => ['user', 'user05',
> 'user1','user15','user2','user25','user3','user35','user4','user45','user5','user55','user6','user65','user7','user75','user8','user85','user9','user95'
> ]}
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: Vladimir Rodionov
> Sent: Saturday, July 27, 2013 5:08 PM
> To: dev@hbase.apache.org
> Subject: RE: Master aborts on start up - URGENT
>
> OK, I managed to fix the issue and minimize the damage.
>
> OfflineMetaRepair failed to fix .META. because there were inconsistencies
> in one of the tables, so the tool refused to repair META. I had to
> physically remove this table from HDFS; then I re-ran the tool and
> successfully repaired META.
>
>
> ________________________________________
> From: Vladimir Rodionov
> Sent: Saturday, July 27, 2013 4:21 PM
> To: dev@hbase.apache.org
> Subject: Master aborts on start up - URGENT
>
> This may be related to :
>
> https://issues.apache.org/jira/browse/HBASE-8912
>
>
> It started when I tried to install and run YCSB. I created 'usertable'
> and then tried to modify it a couple of times (added COMPRESSION). HBase
> (0.94.6) then stopped working (the Master could not finish initialization).
>
> I stopped the cluster and physically removed the /hbase/usertable
> directory as well as all ZK local stores. Restarted. No success.
>
> I manually ran OfflineMetaRepair. Restarted. No success. The FATAL error
> from the Master's log file is below.
>
> For some reason, OfflineMetaRepair did not fix the missing 'usertable'.
>
> Please advise. This is a development cluster with a large volume of data.
>
>
>
> 2013-07-27 23:08:56,504 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region
> TMO_NOV_INDEX-UPLOADS,38,1360181215845.2553b53773e3cb9030c3248768a3b0ca.
> has been deleted.
> 2013-07-27 23:08:56,504 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: The master has opened the
> region
> TMO_NOV_INDEX-UPLOADS,38,1360181215845.2553b53773e3cb9030c3248768a3b0ca.
> that was online on sjc1-eng-perf-g1-grid06.carrieriq.com
> ,60020,1374966494222
> 2013-07-27 23:08:56,504 FATAL org.apache.hadoop.hbase.master.HMaster:
> Unexpected state :
> usertable,,1374962208806.249881162b6ad6d084b30507283f98b8.
> state=PENDING_OPEN, ts=1374966536502, server=
> sjc1-eng-perf-g1-grid14.carrieriq.com,60020,1374966494232 .. Cannot
> transit it to OFFLINE.
> java.lang.IllegalStateException: Unexpected state :
> usertable,,1374962208806.249881162b6ad6d084b30507283f98b8.
> state=PENDING_OPEN, ts=1374966536502, server=
> sjc1-eng-perf-g1-grid14.carrieriq.com,60020,1374966494232 .. Cannot
> transit it to OFFLINE.
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1820)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1659)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
>         at
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
>         at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2013-07-27 23:08:56,504 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region
> TMO_NOV_INDEX-UPLOADS,46,1360181215846.6f2a2eb3924ba5cb6ed22f966e6356e8.
> has been deleted.
> 2013-07-27 23:08:56,505 INFO org.apache.hadoop.hbase.master.HMaster:
> Aborting
>
>
>
> ________________________________________
> From: stack (JIRA) [jira@apache.org]
> Sent: Saturday, July 27, 2013 3:21 PM
> To: dev@hbase.apache.org
> Subject: [jira] [Created] (HBASE-9063)
> TestAssignmentManagerOnCluster.testSSHWhenDisablingTableRegionsInOpeningOrPendingOpenState
> fails
>
> stack created HBASE-9063:
> ----------------------------
>
>              Summary:
> TestAssignmentManagerOnCluster.testSSHWhenDisablingTableRegionsInOpeningOrPendingOpenState
> fails
>                  Key: HBASE-9063
>                  URL: https://issues.apache.org/jira/browse/HBASE-9063
>              Project: HBase
>           Issue Type: Bug
>           Components: test
>             Reporter: stack
>             Assignee: Jimmy Xiang
>
>
>
> https://builds.apache.org/job/hbase-0.95-on-hadoop2/200/testReport/org.apache.hadoop.hbase.master/TestAssignmentManagerOnCluster/testSSHWhenDisablingTableRegionsInOpeningOrPendingOpenState/
>
> {code}java.lang.NullPointerException
>         at
> org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:1314)
>         at
> org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testSSHWhenDisablingTableRegionsInOpeningOrPendingOpenState(TestAssignmentManagerOnCluster.java:482)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>         at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>         at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>         at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>         at
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74){code}
>
> Hope you don't mind my assigning it to you Jimmy.  Thought you might be
> interested.
>
