hbase-dev mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Never ending transtionning regions.
Date Sun, 24 Feb 2013 14:43:30 GMT
Removing user@ from the recipients.

What I did yesterday was:
- Merged a table to have big regions.
- Altered the table so that those regions would be split.
- Ran a major_compact.
- Stopped HBase before all of that had finished.

I tried again yesterday evening but was not able to reproduce.

I will try again today and keep the list posted.

2013/2/23 Kevin O'dell <kevin.odell@cloudera.com>

> +Dev
>
> I think number 1 is that we fix whatever is leaving regions in this state.  I
> think we could put logic into hbck for this.
>
> On Sat, Feb 23, 2013 at 7:36 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi Kevin,
> >
> > I stopped HBase to merge some regions so I already had to deal with the
> > downtime. But with the online merge coming it's very good to know the
> > online way to do it.
> >
> > Now, is there an automated way to do it? In HBCK? Maybe we could check each
> > region for links, check that the files those links point to exist, and if
> > not, remove the links? Or would that be too risky?
> >
> > JM
> >
> >
> >
> >
> >
> > 2013/2/23 Kevin O'dell <kevin.odell@cloudera.com>
> >
> > > JM,
> > >
> > >   Here is what I am seeing:
> > >
> > > 2013-02-23 15:46:14,630 ERROR
> > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of
> > > region=entry,ac.adanac-oidar.www\x1Fhttp\x1F-1\x1F/sports/patinage/2012/04/04/001-artistique-trophee-mondial.shtml\x1Fnull,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> > > starting to roll back the global memstore size.
> > >
> > > If you checked 6dd77bc9ff91e0e6d413f74e670ab435 you should have seen some
> > > pointer files to 2ebfef593a3d715b59b85670909182c9.  Typically, you would
> > > see the storefiles in 6dd77bc9ff91e0e6d413f74e670ab435 and
> > > 2ebfef593a3d715b59b85670909182c9
> > > would have been empty from a bad split.  What I do is to delete the
> > > pointers that don't reference any storefiles.  Then you can clear the
> > > unassigned folder in zkCli.  Finally, run an unassign on the RITs.  This
> > > way there is no down time and you don't have to drop any tables.
> > >
> > >
> > > On Sat, Feb 23, 2013 at 6:14 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > > > Hi Kevin,
> > > >
> > > > Thanks for taking the time to reply.
> > > >
> > > > Here is a bigger extract of the logs. I don't see another path in the
> > > > logs.
> > > >
> > > > http://pastebin.com/uMxGyjKm
> > > >
> > > > I can send you the entire log if you want (42 MB).
> > > >
> > > > What I did is I merged many regions together, then altered the table to
> > > > set the max_filesize and started a major_compaction to get the table
> > > > split.
> > > >
> > > > To fix the issue I had to drop one working table and run -repair
> > > > multiple times. Now it's fixed, but I still have the logs.
> > > >
> > > > I'm redoing all the steps I did. Maybe I will face the issue again. If
> > > > I'm able to reproduce, we might be able to figure out where the issue is...
> > > >
> > > > JM
> > > >
> > > > 2013/2/23 Kevin O'dell <kevin.odell@cloudera.com>
> > > >
> > > > > JM,
> > > > >
> > > > >   How are you doing today?  Right before the "File does not exist"
> > > > > error there should be another path.  Can you let me know if in that
> > > > > path there are pointers from a split to
> > > > > 2ebfef593a3d715b59b85670909182c9?  The directory may already exist.
> > > > > I have seen this a couple of times now and am trying to ferret out a
> > > > > root cause to open a JIRA with.  I suspect we have a split code bug
> > > > > in .92+
> > > > >
> > > > > On Sat, Feb 23, 2013 at 4:10 PM, Jean-Marc Spaggiari <
> > > > > jean-marc@spaggiari.org> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have 2 regions that have been transitioning from server to server
> > > > > > for 15 minutes now.
> > > > > >
> > > > > > I have nothing in the master logs about those 2 regions, but in the
> > > > > > region server logs I have some file-not-found errors:
> > > > > >
> > > > > > 2013-02-23 16:02:07,347 ERROR
> > > > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of
> > > > > > region=entry,theykey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> > > > > > starting to roll back the global memstore size.
> > > > > > java.io.IOException: java.io.IOException: java.io.FileNotFoundException:
> > > > > > File does not exist:
> > > > > > /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> > > > > >     at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:597)
> > > > > >     at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:510)
> > > > > >     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4177)
> > > > > >     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4125)
> > > > > >     at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:328)
> > > > > >     at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:100)
> > > > > >     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
> > > > > >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > > > > >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > > > > >     at java.lang.Thread.run(Thread.java:722)
> > > > > > Caused by: java.io.IOException: java.io.FileNotFoundException:
> > > > > > File does not exist:
> > > > > > /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> > > > > >     at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:433)
> > > > > >     at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:240)
> > > > > >     at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3141)
> > > > > >     at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:572)
> > > > > >     at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:570)
> > > > > >     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> > > > > >     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> > > > > >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > > > > >     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> > > > > >     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> > > > > >     ... 3 more
> > > > > > Caused by: java.io.FileNotFoundException: File does not exist:
> > > > > > /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> > > > > >     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
> > > > > >     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
> > > > > >     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
> > > > > >     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
> > > > > >     at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:108)
> > > > > >     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
> > > > > >     at org.apache.hadoop.hbase.io.hfile.HFile.createReaderWithEncoding(HFile.java:573)
> > > > > >     at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1261)
> > > > > >     at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:70)
> > > > > >     at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:508)
> > > > > >     at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:603)
> > > > > >     at org.apache.hadoop.hbase.regionserver.Store$1.call(Store.java:409)
> > > > > >     at org.apache.hadoop.hbase.regionserver.Store$1.call(Store.java:404)
> > > > > >     ... 8 more
> > > > > > 2013-02-23 16:02:07,370 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > regionserver:60020-0x13d07ec012501fc Attempt to transition the unassigned
> > > > > > node for 6dd77bc9ff91e0e6d413f74e670ab435 from RS_ZK_REGION_OPENING to
> > > > > > RS_ZK_REGION_FAILED_OPEN failed, the node existed but was version 6586
> > > > > > not the expected version 6585
> > > > > >
> > > > > >
> > > > > > If I try hbck -fix, it brings the master down:
> > > > > > 2013-02-23 16:03:01,419 INFO org.apache.hadoop.hbase.master.HMaster: BalanceSwitch=false
> > > > > > 2013-02-23 16:03:03,067 FATAL org.apache.hadoop.hbase.master.HMaster:
> > > > > > Master server abort: loaded coprocessors are: []
> > > > > > 2013-02-23 16:03:03,068 FATAL org.apache.hadoop.hbase.master.HMaster:
> > > > > > Unexpected state :
> > > > > > entry,thekey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.
> > > > > > state=PENDING_OPEN, ts=1361653383067, server=node2,60020,1361653023303 ..
> > > > > > Cannot transit it to OFFLINE.
> > > > > > java.lang.IllegalStateException: Unexpected state :
> > > > > > entry,thekey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.
> > > > > > state=PENDING_OPEN, ts=1361653383067, server=node2,60020,1361653023303 ..
> > > > > > Cannot transit it to OFFLINE.
> > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1813)
> > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1658)
> > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1423)
> > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1398)
> > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1393)
> > > > > >     at org.apache.hadoop.hbase.master.HMaster.assignRegion(HMaster.java:1740)
> > > > > >     at org.apache.hadoop.hbase.master.HMaster.assign(HMaster.java:1731)
> > > > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > >     at java.lang.reflect.Method.invoke(Method.java:601)
> > > > > >     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > > > > >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> > > > > > 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
> > > > > > 2013-02-23 16:03:03,069 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
> > > > > > 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.CatalogJanitor: node3,60000,1361653064588-CatalogJanitor exiting
> > > > > > 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.HMaster$2: node3,60000,1361653064588-BalancerChore exiting
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.hbase.master.cleaner.HFileCleaner: master-node3,60000,1361653064588.archivedHFileCleaner exiting
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.hbase.master.cleaner.LogCleaner: master-node3,60000,1361653064588.oldLogCleaner exiting
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.hbase.master.HMaster: Stopping infoServer
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 1 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 2 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,071 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> > > > > > Responder, call isMasterRunning(), rpc version=1, client version=29,
> > > > > > methodsFingerPrint=891823089 from 192.168.23.7:43381: output error
> > > > > > 2013-02-23 16:03:03,071 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> > > > > > handler 3 on 60000 caught a ClosedChannelException, this means that the
> > > > > > server was processing a request but the client went away. The error
> > > > > > message was: null
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,071 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60010
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60000
> > > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 0 on 60000: exiting
> > > > > > 2013-02-23 16:03:03,287 INFO
> > > > > > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> > > > > > Closed zookeeper sessionid=0x33d07f1130301fe
> > > > > > 2013-02-23 16:03:03,453 INFO org.apache.hadoop.hbase.master.AssignmentManager$TimerUpdater: node3,60000,1361653064588.timerUpdater exiting
> > > > > > 2013-02-23 16:03:03,453 INFO org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor: node3,60000,1361653064588.timeoutMonitor exiting
> > > > > > 2013-02-23 16:03:03,453 INFO org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor: node3,60000,1361653064588.splitLogManagerTimeoutMonitor exiting
> > > > > > 2013-02-23 16:03:03,468 INFO org.apache.hadoop.hbase.master.HMaster: HMaster main thread exiting
> > > > > > 2013-02-23 16:03:03,469 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> > > > > > java.lang.RuntimeException: HMaster Aborted
> > > > > >     at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
> > > > > >     at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
> > > > > >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > > > >     at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
> > > > > >     at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1927)
> > > > > >
> > > > > > I'm running with 0.94.5 +
> > > > > > HBASE-7824 <https://issues.apache.org/jira/browse/HBASE-7824> +
> > > > > > HBASE-7865 <https://issues.apache.org/jira/browse/HBASE-7865>.
> > > > > > I don't think the 2 patches are related to this issue.
> > > > > >
> > > > > > Hadoop fsck reports "The filesystem under path '/' is HEALTHY"
> > > > > > without any issue.
> > > > > >
> > > > > > /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> > > > > > does exist in the FS.
> > > > > >
> > > > > > What I don't understand is why the master is going down. And how
> > > > > > can I fix that?
> > > > > >
> > > > > > I will try to create the missing directory and see the results...
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > JM
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Kevin O'Dell
> > > > > Customer Operations Engineer, Cloudera
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Kevin O'Dell
> > > Customer Operations Engineer, Cloudera
> > >
> >
>
>
>
> --
> Kevin O'Dell
> Customer Operations Engineer, Cloudera
>
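
For reference, the check discussed in this thread (look at each region's
pointer files, verify that the store file each one points to still exists,
and flag the ones that do not) could be sketched roughly as below. This is
only an illustration in Python, not hbck code. It assumes the 0.94-era
layout, where a split reference file is named
<hfile>.<parent encoded region name> and lives under
/hbase/<table>/<region>/<family>/. The function names are made up, and the
input is just a plain list of paths (for example, collected from a recursive
HDFS listing). It only reports candidates and deletes nothing.

```python
import re
from typing import Iterable, List, Optional

# Assumed layout: <root>/<table>/<region>/<family>/<hfile>.<parent region>
# Region names are taken to be 32 hex characters, as in the logs above.
REF_PATH = re.compile(
    r"^(?P<root>.*)/(?P<table>[^/]+)/(?P<region>[0-9a-f]{32})/(?P<family>[^/]+)/"
    r"(?P<hfile>[0-9a-f]+)\.(?P<parent>[0-9a-f]{32})$"
)


def referred_to_path(ref_path: str) -> Optional[str]:
    """Return the parent store file a reference points at, or None if the
    path does not look like a reference file."""
    m = REF_PATH.match(ref_path)
    if m is None:
        return None
    return "{root}/{table}/{parent}/{family}/{hfile}".format(**m.groupdict())


def dangling_references(paths: Iterable[str]) -> List[str]:
    """Return the reference files whose referred-to store file is not
    present in the given listing."""
    existing = set(paths)
    dangling = []
    for path in sorted(existing):
        target = referred_to_path(path)
        if target is not None and target not in existing:
            dangling.append(path)
    return dangling
```

As a usage note: under the naming assumption above, a reference in region
6dd77bc9ff91e0e6d413f74e670ab435 to the store file from the stack trace
would be named
62b0aae45d59408dbcfc513954efabc7.2ebfef593a3d715b59b85670909182c9, and
dangling_references() would report it only when
/hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
is missing from the listing. Whether removing such files is safe is exactly
the open question raised above, so the sketch stops at reporting.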
