hbase-user mailing list archives

From Stack <saint....@gmail.com>
Subject Re: Flakey table disable/enable [WAS -> Re: Table disabled but all regions still online?]
Date Thu, 19 Nov 2009 01:28:44 GMT
I am torn.  I sort of want to just fix it right in 0.21, but your TM team
writing such a test would indicate this is an important feature and maybe
we should not wait?



On Nov 18, 2009, at 11:13 AM, Andrew Purtell <apurtell@apache.org>  
wrote:

> There's a team evaluating HBase in Trend that raised this very issue
> today. This is the test as described:
> "We execute the following steps via the Java API:
>      a. create many tables (about 1000 tables), each table has 10 columns
>         and 20 rows (value length is 60-100 bytes)
>      b. delete some tables (about 10 tables) of these existing tables
>      c. create some new tables (about 10 tables), each table has 10 columns
>         and 20 rows (value length is 60-100 bytes)
>      d. repeat step b and step c
> Execute these steps for about 6-10 hours and one of these tables will
> no longer be able to be disabled."
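>
> Roughly, one round of steps b and c in client code would look something
> like the following (a sketch only, assuming the 0.20-era Java client API;
> the table-name prefix, column family name, and the actual value-writing
> are placeholders):
>
>     import java.io.IOException;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.HColumnDescriptor;
>     import org.apache.hadoop.hbase.HTableDescriptor;
>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>
>     public class TableChurnTest {
>       public static void main(String[] args) throws IOException {
>         HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
>         // One round of steps b and c: drop ~10 tables, then recreate them.
>         for (int i = 0; i < 10; i++) {
>           String name = "testtable_" + i;        // placeholder naming scheme
>           if (admin.tableExists(name)) {
>             admin.disableTable(name);            // must disable before delete
>             admin.deleteTable(name);
>           }
>           HTableDescriptor desc = new HTableDescriptor(name);
>           desc.addFamily(new HColumnDescriptor("f"));
>           admin.createTable(desc);
>           // ... write the 10 columns x 20 rows of 60-100 byte values here ...
>         }
>       }
>     }
>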
> The test cluster is an 8 node setup. This is 0.20.2 RC1.
>
> They have a wedged table available for examination. I have not gone  
> on yet and looked around or tried anything like close_region etc. If  
> you want to go on to the cluster and have a look around, I can  
> arrange that.
>
> My suggestion was to avoid using temporary tables in HBase the way one
> might with an RDBMS -- create one or maybe just a few tables for holding
> temporary values, use TTLs as appropriate, and prepend strings to keys,
> for example foo_key_1, bar_key_1, etc., such that it's equivalent to
> storing key_1 in temp tables foo and bar.
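>
> Concretely, that pattern might look like this (a sketch, again assuming
> the 0.20 Java client API; the "tempvalues" table, "temp" family, one-hour
> TTL, and the column/value names are all made up for illustration):
>
>     import java.io.IOException;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.HColumnDescriptor;
>     import org.apache.hadoop.hbase.HTableDescriptor;
>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class TempValues {
>       public static void main(String[] args) throws IOException {
>         HBaseConfiguration conf = new HBaseConfiguration();
>
>         // One long-lived table holds all "temporary" values; the TTL on
>         // the family means old cells expire on their own instead of
>         // needing a table drop.
>         HBaseAdmin admin = new HBaseAdmin(conf);
>         HTableDescriptor desc = new HTableDescriptor("tempvalues");
>         HColumnDescriptor family = new HColumnDescriptor("temp");
>         family.setTimeToLive(60 * 60);            // expire after one hour
>         desc.addFamily(family);
>         admin.createTable(desc);
>
>         // Prefix the row key instead of creating a temp table per job:
>         // "foo_key_1" stands for row "key_1" of logical temp table "foo".
>         HTable table = new HTable(conf, "tempvalues");
>         Put put = new Put(Bytes.toBytes("foo_key_1"));
>         put.add(Bytes.toBytes("temp"), Bytes.toBytes("v"),
>             Bytes.toBytes("some value"));
>         table.put(put);
>       }
>     }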
>
> I do think making table enable/disable less flakey in 0.20 is worth some
> effort. I think few (if any) of us using HBase in production disable or
> delete tables except for some exceptional reason, but evaluators try it --
> perhaps because they are used to creating and dropping temporary tables on
> the RDBMS all the time -- and then become concerned.
>
>   - Andy
>
>
>
>
> ________________________________
> From: stack <stack@duboce.net>
> To: hbase-user@hadoop.apache.org
> Sent: Wed, November 18, 2009 10:41:16 AM
> Subject: Flakey table disable/enable [WAS -> Re: Table disabled but  
> all  regions still online?]
>
> On Wed, Nov 18, 2009 at 8:10 AM, Jochen Frey <jochen@scoutlabs.com>  
> wrote:
> ..
>
>>
>> However, at the same time all these regions are still online, which I can
>> verify by way of the web interface as well as the command line interface
>> (> 400 regions).
>>
>> This has happened at least twice by now. The first time I was able  
>> to "fix"
>> it by restarting HDFS, the second time restarting didn't fix it.
>>
>>
> In 0.20.x hbase, enable/disable of tables is unreliable as written.  It
> will work when tables are small, or when we're in a unit test context
> where the configuration makes messaging more lively, but it quickly turns
> flakey if your table has any more than a few regions.
>
> Currently, the way it works is to message the master to run a processing
> of all the regions that make up a table.  The client waits under a
> timeout, continually checking whether all regions are offline, but if the
> table is large the client will often time out before the master finishes.
> The master runs the process in a worker thread, in series, updating the
> .META. table and flagging RegionServers one at a time that they need to
> close a region on disable.  Closing a region entails flushing the
> memstore, so it can take a while.  Running in the master context is sort
> of necessary because regions may be in a state of transition, and the
> master is where that state is kept, so it knows how to intercept region
> transitions in the case where it is being asked to online/offline them.
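>
> On the client side, the usual way of coping is just to re-issue the
> disable until the table actually reports disabled (more on that below).
> A minimal sketch of such a retry loop (assuming the 0.20 HBaseAdmin API;
> the retry count and sleep interval are arbitrary):
>
>     import java.io.IOException;
>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>
>     public class DisableWithRetries {
>       // Keep asking the master to disable the table until it actually
>       // reports disabled, or until we give up.
>       static void disableWithRetries(HBaseAdmin admin, String tableName)
>           throws IOException, InterruptedException {
>         for (int attempt = 0;
>              attempt < 10 && admin.isTableEnabled(tableName);
>              attempt++) {
>           try {
>             // Messages the master, then polls under a client-side timeout.
>             admin.disableTable(tableName);
>           } catch (IOException e) {
>             // Client gave up waiting; the master may still be closing
>             // regions, so pause and re-check before asking again.
>             Thread.sleep(10 * 1000);
>           }
>         }
>         if (admin.isTableEnabled(tableName)) {
>           throw new IOException(tableName + " still enabled after retries");
>         }
>       }
>     }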
>
> The master is being rewritten for hbase 0.21.  This is one area that is
> being completely redone.  See
> http://wiki.apache.org/hadoop/Hbase/MasterRewrite for the high-level
> design sketch and then https://issues.apache.org/jira/browse/HBASE-1730
> ("Near-instantaneous online schema and table state updates") for explicit
> discussion of how we're going to do table state transitions.
>
> Enable/disable has been flakey for a while (see
> https://issues.apache.org/jira/browse/HBASE-1636).  My understanding is
> that it will work eventually if you keep trying (maybe this is wrong?), so
> I've always thought it low on the list of priorities and something we've
> scheduled to fix properly in 0.21.  But you are the second fellow who has
> raised enable/disable as a problem during an evaluation and I'm a little
> concerned that flakey enable/disable is earning us a black mark.  If it's
> important, I hope folks will flag it so.  In the 0.20.x context, we could
> hack up a script to run the table enable/disable in parallel.  It'd scan
> .META., sort the regions by server, write close messages to each
> regionserver, and rewrite the table's .META. entries.  It could then just
> wait till all regions report disabled, resignalling if necessary.  If you
> just want to kill the table, such a script may already exist for you.  See
> https://issues.apache.org/jira/browse/HBASE-1872.
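>
> To make the shape of such a hack concrete, the .META.-scanning half might
> look like the following (a sketch only, assuming the 0.20 client API; the
> table name is a placeholder, and the part that actually sends the close
> messages and rewrites the .META. rows is left out, because that is where
> the real work would be):
>
>     import java.io.IOException;
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.Map;
>     import java.util.TreeMap;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.HRegionInfo;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Result;
>     import org.apache.hadoop.hbase.client.ResultScanner;
>     import org.apache.hadoop.hbase.client.Scan;
>     import org.apache.hadoop.hbase.util.Bytes;
>     import org.apache.hadoop.hbase.util.Writables;
>
>     public class RegionsByServer {
>       public static void main(String[] args) throws IOException {
>         String target = "mytable";                 // placeholder table name
>         HTable meta = new HTable(new HBaseConfiguration(), ".META.");
>
>         // .META. row keys start with the table name, so scan from there.
>         Scan scan = new Scan(Bytes.toBytes(target + ",,"));
>         scan.addFamily(Bytes.toBytes("info"));
>
>         Map<String, List<HRegionInfo>> byServer =
>             new TreeMap<String, List<HRegionInfo>>();
>         ResultScanner scanner = meta.getScanner(scan);
>         try {
>           for (Result r : scanner) {
>             HRegionInfo info = Writables.getHRegionInfo(
>                 r.getValue(Bytes.toBytes("info"), Bytes.toBytes("regioninfo")));
>             if (!info.getTableDesc().getNameAsString().equals(target)) {
>               break;                               // scanned past our table
>             }
>             byte[] server = r.getValue(Bytes.toBytes("info"),
>                 Bytes.toBytes("server"));
>             String address = server == null ? "unassigned"
>                 : Bytes.toString(server);
>             List<HRegionInfo> regions = byServer.get(address);
>             if (regions == null) {
>               regions = new ArrayList<HRegionInfo>();
>               byServer.put(address, regions);
>             }
>             regions.add(info);
>           }
>         } finally {
>           scanner.close();
>         }
>         // For each server in byServer you would now send close messages for
>         // its regions in parallel and mark the rows offline in .META. --
>         // deliberately omitted here.
>       }
>     }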
>
> Thanks,
> St.Ack
>
>
>
>> The first time this happened, we had a lot going on (rolling  
>> restart of the
>> hbase nodes), hdfs balancer running. The second time I found the  
>> following
>> exception in the master log (below). Can anyone shed some light on  
>> this or
>> tell me what additional information would be helpful for debugging?
>>
>> Thanks so much!
>> Jochen
>>
>>
>> 2009-11-17 20:59:12,751 INFO org.apache.hadoop.hbase.master.ServerManager: 8 region servers, 0 dead, average load 50.25
>> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>}
>> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
>> 2009-11-17 20:59:13,620 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>} complete
>> 2009-11-17 20:59:13,622 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
>> java.net.ConnectException: Connection refused
>>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>       at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>       at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>>       at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>>       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>>       at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>>       at $Proxy6.openScanner(Unknown Source)
>>       at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>>       at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>       at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>       at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>>       at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>> 2009-11-17 20:59:13,623 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>
>>
>> --
>> Jochen Frey . CTO
>> Scout Labs
>> 415.366.0450
>> www.scoutlabs.com
>>
>
>
>
