hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Flakey table disable/enable [WAS -> Re: Table disabled but all regions still online?]
Date Wed, 18 Nov 2009 19:13:32 GMT
There's a team evaluating HBase in Trend that raised this very issue today. This is the test
as described:
"We execute the following step via Java API: 

      a. 
create many tables (about 1000 tables), each table have 10 columns and 20 rows 
(value length is 60-100 bytes) 
      b. delete some tables 
(about 10 tables) of these existent tables 
      c. create 
some new tables (about 10 tables), each table have 10 columns and 20 rows (value 
length is 60-100 bytes) 
      d. repeat step b and step 
c 
     Execute these step about 6-10 hours, one of these tables will 
not be able to disabled."
The test cluster is an 8 node setup. This is 0.20.2 RC1. 
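
For reference, here is roughly what the delete/recreate portion (steps b-d) looks like against the 0.20 Java API. This is only a sketch of the cycle as described, not their actual code; the table names, column family, and value sizes are placeholders.

    import java.io.IOException;
    import org.apache.hadoop.hbase.*;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch of the delete/recreate churn described above (placeholder names and sizes).
    public class TableChurn {
      public static void main(String[] args) throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        HBaseAdmin admin = new HBaseAdmin(conf);
        while (true) {                                // their test runs this for 6-10 hours
          for (int i = 0; i < 10; i++) {
            String name = "tmp_" + i;                 // they use fresh names each pass; reused here for brevity
            if (admin.tableExists(name)) {
              admin.disableTable(name);               // the call that eventually wedges
              admin.deleteTable(name);
            }
            HTableDescriptor desc = new HTableDescriptor(name);
            desc.addFamily(new HColumnDescriptor("cf"));
            admin.createTable(desc);
            HTable table = new HTable(conf, name);
            for (int row = 0; row < 20; row++) {      // 20 rows x 10 columns of ~60-100 byte values
              Put put = new Put(Bytes.toBytes("row" + row));
              for (int col = 0; col < 10; col++) {
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("c" + col), new byte[80]);
              }
              table.put(put);
            }
          }
        }
      }
    }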

They have a wedged table available for examination. I have not gone on yet to look around
or try anything like close_region, etc. If you want to go on to the cluster and have a look
around, I can arrange that. 

My suggestion was to avoid using temporary tables in HBase the way one might with an RDBMS
-- create one or maybe just a few tables to hold temporary values, use TTLs as appropriate,
and prepend strings to keys, e.g. foo_key_1, bar_key_1, etc., such that it's equivalent
to storing key_1 in temp tables foo and bar. 
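
Concretely, something like the following (a sketch only; the table/family names and the one-day TTL are made up for illustration):

    // Sketch: one shared table with a TTL'd family instead of many short-lived tables.
    HBaseConfiguration conf = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("temp");
    HColumnDescriptor family = new HColumnDescriptor("v");
    family.setTimeToLive(24 * 60 * 60);   // let values expire after a day instead of dropping a table
    desc.addFamily(family);
    admin.createTable(desc);

    // What would have been row "key_1" in throwaway tables "foo" and "bar" becomes:
    HTable temp = new HTable(conf, "temp");
    byte[] value = Bytes.toBytes("some value");
    temp.put(new Put(Bytes.toBytes("foo_key_1")).add(Bytes.toBytes("v"), Bytes.toBytes("c"), value));
    temp.put(new Put(Bytes.toBytes("bar_key_1")).add(Bytes.toBytes("v"), Bytes.toBytes("c"), value));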

I do think making table enable/disable less flaky in 0.20 is worth some effort. I think few
(if any) of us using HBase in production disable or delete tables except for some exceptional
reason, but evaluators try it -- perhaps because they are used to creating and dropping temporary
tables on the RDBMS all the time -- and then become concerned. 

   - Andy




________________________________
From: stack <stack@duboce.net>
To: hbase-user@hadoop.apache.org
Sent: Wed, November 18, 2009 10:41:16 AM
Subject: Flakey table disable/enable [WAS -> Re: Table disabled but all regions still online?]

On Wed, Nov 18, 2009 at 8:10 AM, Jochen Frey <jochen@scoutlabs.com> wrote:
..

>
> However, at the same time all their regions are still online, which I can
> verify by way of the web interface as well as the command line interface
> (> 400 regions).
>
> This has happened at least twice by now. The first time I was able to "fix"
> it by restarting HDFS; the second time restarting didn't fix it.
>
>
In 0.20.x hbase, enable/disable of tables is unreliable as written.  It will
work when tables are small, or when we're in a unit test context where the
configuration makes messaging more lively, but it quickly turns flakey if
your table has any more than a few regions.

Currently, the way it works is that the client messages the master to run a processing of
all regions that make up the table.  The client waits under a timeout,
continually checking whether all regions are offline, but if the table is large, the
client will often time out before the master finishes.  The master runs the process
in a worker thread, in series, updating the .META. table and flagging
RegionServers one at a time that they need to close a region on disable.
Closing a region entails flushing the memstore, so it can take a while.  Running in
the master context is more or less necessary because regions may be in a state of
transition, and the master is where that state is kept, so it knows how to
intercept region transitions when asked to online/offline them.
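
In client terms, the workaround pattern looks roughly like the below (a sketch against the 0.20 client; the retry loop is not something the API does for you, and HTable.isTableEnabled just reads the offline flag out of .META., which is exactly what can disagree with the regionservers here):

    // Sketch: keep re-issuing the disable until .META. reports the table offline.
    // Note isTableEnabled only checks the .META. flag -- it can still disagree with
    // what the regionservers actually have open, which is the bug under discussion.
    static void disableUntilDone(HBaseConfiguration conf, String tableName)
        throws IOException, InterruptedException {
      HBaseAdmin admin = new HBaseAdmin(conf);
      while (HTable.isTableEnabled(conf, tableName)) {
        try {
          admin.disableTable(tableName);   // may give up before the master finishes closing regions
        } catch (IOException e) {
          // client retries exhausted while the master's worker thread was still going
        }
        Thread.sleep(10 * 1000);           // back off, then check and resignal
      }
    }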

The master is being rewritten for hbase 0.21.  This is one area that is
being completely redone.  See
http://wiki.apache.org/hadoop/Hbase/MasterRewrite for the high-level design
sketch and then
https://issues.apache.org/jira/browse/HBASE-1730 "Near-instantaneous
online schema and table state updates" for explicit
discussion of how we're to do table state transitions.

Enable/disable has been flakey for a while (see
https://issues.apache.org/jira/browse/HBASE-1636).  My understanding is that
it will work eventually if you keep trying (maybe this is wrong?), so I've
always put it low on the list of priorities, as something we've
scheduled to fix properly in 0.21.  But you are the second fellow who has
raised enable/disable as a problem during an evaluation, and I'm a little
concerned that flakey enable/disable is earning us a black mark.  If it's
important, I hope folks will flag it as such.  In a 0.20.x context, we could hack
up a script to run the table enable/disable in parallel.  It'd scan .META.,
sort regions by server, write close messages to each regionserver, and rewrite the
table's state in .META.  It could then just wait till all regions report disabled, perhaps
resignalling if necessary.  If you just want to kill the table, such a
script may already exist for you.  See
https://issues.apache.org/jira/browse/HBASE-1872.
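
The .META. half of such a script is straightforward; roughly like the below (a sketch with the usual org.apache.hadoop.hbase.client and java.util imports; "mytable" is a placeholder, and the part that actually writes close messages and rewrites the offline flag is left out):

    // Sketch: list a table's regions from .META. grouped by hosting regionserver.
    // Grouping by server is the easy half; issuing the closes and rewriting .META. is not shown.
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable meta = new HTable(conf, ".META.");
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("info"));
    Map<String, List<String>> regionsByServer = new TreeMap<String, List<String>>();
    ResultScanner scanner = meta.getScanner(scan);
    try {
      for (Result r : scanner) {
        String regionName = Bytes.toString(r.getRow());
        if (!regionName.startsWith("mytable,")) continue;   // region rows are "table,startkey,timestamp"
        byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
        String address = (server == null) ? "(unassigned)" : Bytes.toString(server);
        if (!regionsByServer.containsKey(address)) {
          regionsByServer.put(address, new ArrayList<String>());
        }
        regionsByServer.get(address).add(regionName);
      }
    } finally {
      scanner.close();
    }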

Thanks,
St.Ack



> The first time this happened, we had a lot going on (rolling restart of the
> hbase nodes, hdfs balancer running). The second time I found the following
> exception in the master log (below). Can anyone shed some light on this or
> tell me what additional information would be helpful for debugging?
>
> Thanks so much!
> Jochen
>
>
> 2009-11-17 20:59:12,751 INFO org.apache.hadoop.hbase.master.ServerManager: 8 region servers, 0 dead, average load 50.25
> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>}
> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
> 2009-11-17 20:59:13,620 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>} complete
> 2009-11-17 20:59:13,622 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
> java.net.ConnectException: Connection refused
>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>        at $Proxy6.openScanner(Unknown Source)
>        at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>        at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>        at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>        at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> 2009-11-17 20:59:13,623 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>
>
>
> --
> Jochen Frey . CTO
> Scout Labs
> 415.366.0450
> www.scoutlabs.com
>



      