hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Flakey table disable/enable [WAS -> Re: Table disabled but all regions still online?]
Date Thu, 19 Nov 2009 01:57:23 GMT
I will follow up with them. Could be a functionality test that is not indicative of a core
requirement.

    - Andy

________________________________
From: Stack <saint.ack@gmail.com>
To: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
Sent: Wed, November 18, 2009 5:28:44 PM
Subject: Re: Flakey table disable/enable [WAS -> Re: Table disabled but all regions still online?]

I am torn.  I sort of want to just fix it right in 0.21, but your TM team writing such a
test would indicate this is an important feature, and maybe we should not wait?



On Nov 18, 2009, at 11:13 AM, Andrew Purtell <apurtell@apache.org> wrote:

> There's a team evaluating HBase at Trend that raised this very issue today. This is the
> test as described:
> "We execute the following steps via the Java API:
>      a. create many tables (about 1000 tables), each table has 10 columns and 20 rows
>         (value length is 60-100 bytes)
>      b. delete some tables (about 10 tables) of these existent tables
>      c. create some new tables (about 10 tables), each table has 10 columns and 20 rows
>         (value length is 60-100 bytes)
>      d. repeat step b and step c
>     Execute these steps for about 6-10 hours; one of these tables will not be able to be
> disabled."
> The test cluster is an 8 node setup. This is 0.20.2 RC1.
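> 
> For concreteness, the shape of that loop against the 0.20 Java client is roughly as
> below. This is my sketch, not the evaluators' actual code; the table/family names and
> putting the 10 columns as qualifiers in a single family are assumptions on my part.
> 
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.HColumnDescriptor;
>     import org.apache.hadoop.hbase.HTableDescriptor;
>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.util.Bytes;
> 
>     public class TableChurn {
>       public static void main(String[] args) throws Exception {
>         HBaseConfiguration conf = new HBaseConfiguration();
>         HBaseAdmin admin = new HBaseAdmin(conf);
>         // Step a: create many small tables and load them.
>         for (int i = 0; i < 1000; i++) {
>           String name = "t_" + i;
>           HTableDescriptor desc = new HTableDescriptor(name);
>           desc.addFamily(new HColumnDescriptor("f"));
>           admin.createTable(desc);
>           HTable table = new HTable(conf, name);
>           for (int row = 0; row < 20; row++) {
>             Put put = new Put(Bytes.toBytes("row_" + row));
>             for (int col = 0; col < 10; col++) {
>               // 60-100 byte values in the reported test; fixed 80 here.
>               put.add(Bytes.toBytes("f"), Bytes.toBytes("c" + col), new byte[80]);
>             }
>             table.put(put);
>           }
>         }
>         // Steps b-d: repeatedly disable+delete ~10 tables and recreate
>         // them, for 6-10 hours; eventually one disableTable() wedges.
>         for (int i = 0; i < 10; i++) {
>           admin.disableTable("t_" + i);
>           admin.deleteTable("t_" + i);
>         }
>       }
>     }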
> 
> They have a wedged table available for examination. I have not gone on yet to look
> around or try anything like close_region. If you want to go on to the cluster and have
> a look around, I can arrange that.
> 
> My suggestion was to avoid using temporary tables in HBase the way one might with an
> RDBMS -- create one or maybe just a few tables for containing temporary values, use TTLs
> as appropriate, and prepend strings to keys, for example foo_key_1, bar_key_1, etc., such
> that it's equivalent to storing key_1 in temp tables foo and bar.
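> 
> Concretely, something like the sketch below. The "temp" table name, "f" family, and
> one-day TTL are just for illustration, and it reuses the conf/admin setup from the
> sketch above.
> 
>     // One shared table for all temporary values.  A TTL on the family
>     // lets old values expire instead of being dropped via table
>     // disable/delete.
>     HColumnDescriptor family = new HColumnDescriptor("f");
>     family.setTimeToLive(24 * 60 * 60);  // in seconds; expire after a day
>     HTableDescriptor desc = new HTableDescriptor("temp");
>     desc.addFamily(family);
>     admin.createTable(desc);
> 
>     // Instead of writing key_1 to temp tables foo and bar, prefix keys:
>     HTable temp = new HTable(conf, "temp");
>     Put put = new Put(Bytes.toBytes("foo_key_1"));
>     put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("some value"));
>     temp.put(put);
>     put = new Put(Bytes.toBytes("bar_key_1"));
>     put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("another value"));
>     temp.put(put);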
> 
> I do think making table enable/disable less flaky in 0.20 is worth some effort. I think
> few (if any) of us using HBase in production disable or delete tables except for some
> exceptional reason, but evaluators try it -- perhaps because they are used to creating
> and dropping temporary tables on the RDBMS all the time -- and then become concerned.
> 
>   - Andy
> 
> 
> 
> 
> ________________________________
> From: stack <stack@duboce.net>
> To: hbase-user@hadoop.apache.org
> Sent: Wed, November 18, 2009 10:41:16 AM
> Subject: Flakey table disable/enable [WAS -> Re: Table disabled but all regions still online?]
> 
> On Wed, Nov 18, 2009 at 8:10 AM, Jochen Frey <jochen@scoutlabs.com> wrote:
> ..
> 
>> 
>> However, at the same time all the regions are still online, which I can
>> verify by way of the web interface as well as the command line interface
>> (> 400 regions).
>> 
>> This has happened at least twice by now. The first time I was able to "fix"
>> it by restarting HDFS, the second time restarting didn't fix it.
>> 
>> 
> In 0.20.x hbase, enable/disable of tables is unreliable as written.  It will
> work when tables are small, or when we're in a unit test context where
> configuration makes messaging more lively, but it quickly turns flakey if
> your table has any more than a few regions.
> 
> Currently, the way it works is to message the master to run a processing of
> all regions that make up a table.  The client waits under a timeout,
> continually checking that all regions are offline, but if the table is
> large, the client will often time out before the master finishes.  The
> master runs the process in a worker thread, updating the .META. table in
> series and flagging RegionServers one at a time that they need to close a
> region on disable.  Closing a region entails flushing the memstore, so it
> can take a while.  Running in the master context is sort of necessary
> because regions may be in a state of transition, and the master is where
> that state is kept, so it knows how to intercept region transitions when
> asked to online/offline a table.
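> 
> (To make the timeout race concrete, here is a rough client-side fragment; the
> explicit poll with isTableEnabled and the 5 second sleep are my own workaround
> sketch, not how the 0.20 client is meant to be driven:)
> 
>     // Disable a large table, tolerating the client-side timeout: the
>     // master worker may still be closing regions after disableTable()
>     // gives up and throws.  Needs java.io.IOException imported.
>     String name = "bigtable";  // hypothetical table
>     try {
>       admin.disableTable(name);
>     } catch (IOException e) {
>       // Client retries exhausted; the master may still be working.
>     }
>     while (admin.isTableEnabled(name)) {
>       Thread.sleep(5000);  // poll until the master reports it disabled
>     }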
> 
> The master is being rewritten for hbase 0.21.  This is one area that is
> being completely redone.  See
> http://wiki.apache.org/hadoop/Hbase/MasterRewrite for the high-level design
> sketch and then https://issues.apache.org/jira/browse/HBASE-1730
> "Near-instantaneous online schema and table state updates" for explicit
> discussion of how we're to do table state transitions.
> 
> Enable/disable has been flakey for a while (see
> https://issues.apache.org/jira/browse/HBASE-1636).  My understanding is that
> it will work eventually if you keep trying (maybe this is wrong?), so I've
> always thought it low on the list of priorities and something we've
> scheduled to fix properly in 0.21.  But you are the second fellow who has
> raised enable/disable as a problem during an evaluation, and I'm a little
> concerned that flakey enable/disable is earning us a black mark.  If it's
> important, I hope folks will flag it so.  In the 0.20.x context, we could
> hack up a script to run the table enable/disable in parallel.  It'd scan
> .META., sort by servers, write close messages to each regionserver, and
> rewrite the table's .META. entries.  It could then just wait till all report
> disabled, perhaps resignalling if necessary.  If you just want to kill the
> table, such a script may already exist for you.  See
> https://issues.apache.org/jira/browse/HBASE-1872.
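> 
> (As a starting point for such a script, the fragment below scans .META. and
> buckets a table's regions by the server carrying them.  Untested sketch
> against the 0.20 client; the close-message and .META.-rewrite steps are
> elided since they need regionserver and master internals.  "mytable" is a
> placeholder, and it needs java.util plus
> org.apache.hadoop.hbase.util.Writables, org.apache.hadoop.hbase.HRegionInfo,
> and the client Scan/Result classes imported.)
> 
>     // Group the regions of "mytable" by server, per the .META. "info"
>     // family ("info:regioninfo" holds the serialized HRegionInfo,
>     // "info:server" the address of the carrying regionserver).
>     HTable meta = new HTable(conf, ".META.");
>     Scan scan = new Scan();
>     scan.addFamily(Bytes.toBytes("info"));
>     Map<String, List<HRegionInfo>> byServer =
>         new TreeMap<String, List<HRegionInfo>>();
>     for (Result r : meta.getScanner(scan)) {
>       byte[] bytes = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("regioninfo"));
>       if (bytes == null) continue;
>       HRegionInfo info = Writables.getHRegionInfo(bytes);
>       if (!Bytes.toString(info.getTableDesc().getName()).equals("mytable")) continue;
>       byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
>       String address = (server == null) ? "unassigned" : Bytes.toString(server);
>       if (!byServer.containsKey(address)) {
>         byServer.put(address, new ArrayList<HRegionInfo>());
>       }
>       byServer.get(address).add(info);
>     }
>     // Next: in parallel per server, send region close messages, mark
>     // the regions offline in .META., and wait for all to report closed.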
> 
> Thanks,
> St.Ack
> 
> 
> 
>> The first time this happened, we had a lot going on (rolling restart of the
>> hbase nodes, hdfs balancer running). The second time, I found the following
>> exception in the master log (below). Can anyone shed some light on this or
>> tell me what additional information would be helpful for debugging?
>> 
>> Thanks so much!
>> Jochen
>> 
>> 
>> 2009-11-17 20:59:12,751 INFO org.apache.hadoop.hbase.master.ServerManager: 8 region servers, 0 dead, average load 50.25
>> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>}
>> 2009-11-17 20:59:13,611 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
>> 2009-11-17 20:59:13,620 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.10.0.177:60020, regionname: -ROOT-,,0, startKey: <>} complete
>> 2009-11-17 20:59:13,622 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: 10.10.0.189:60020, regionname: .META.,,1, startKey: <>}
>> java.net.ConnectException: Connection refused
>>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>       at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>       at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>>       at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>>       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>>       at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>>       at $Proxy6.openScanner(Unknown Source)
>>       at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>>       at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>       at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>       at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>>       at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>> 2009-11-17 20:59:13,623 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>> 
>> 
>> --
>> Jochen Frey . CTO
>> Scout Labs
>> 415.366.0450
>> www.scoutlabs.com
>> 
> 
> 
> 