hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Disabling a table taking very long time
Date Tue, 01 Mar 2011 01:17:50 GMT
I didn't get most of the information I was looking for... but I do
have something to work with. I asked whether the metaScanner was
making progress or not, and from that log it looks like it is:

> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
> Rowscanned=261, rowsOffline=3
> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
> regions to be disabled from myTable

Only 3 rows are offline, whereas we expect all the regions to be
offline. I expect you are seeing those lines a lot? If so, one thing
I'd like to see is the result of this command:

scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}

It's going to be big. Then grep the result for the string SPLIT, and
please post back here the lines that match.
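
A minimal sketch of one way to capture that non-interactively (assuming
${HBASE_HOME} points at your install and the table really is named
myTable; the output file name is arbitrary):

  echo "scan '.META.', {STARTROW => 'myTable,,', LIMIT => 261}" | \
    ${HBASE_HOME}/bin/hbase shell > meta_scan.txt
  grep SPLIT meta_scan.txt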

J-D

On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
> I issued two more disable commands and looked at the .META. RS. The
> last log entry looks like this:
>
> 11/03/01 00:29:14 INFO regionserver.HLog: Roll
> /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog
> .dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143.
> New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>
> I issued the command at 00:34:59. The client debug output showed
> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
> Then all of a sudden it output the following:
>
> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for
> row <> in tableName .META.: location server 10.146.14.38:60020,
> location region name .META.,,1
> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode
> /hbase/root-region-server got 10.146.14.38:60020
> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found
> ROOT at 10.146.14.38:60020
> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached
> location for .META.,,1 is 10.146.14.38:60020
> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers:
> Rowscanned=261, rowsOffline=3
> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all
> regions to be disabled from myTable
> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all
> regions to be disabled from myTable
>
> Here's the full thread dump of the .META. RS from 00:34:59 to 00:47:31:
> http://pastebin.com/Ywwx8afQ
>
> I hope I compiled some useful information here. I am pretty lost at
> this point, so I would really appreciate your help!
>
> Thanks!
>
>
>
> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> Ok, so no debug output from the disable command? BTW, you should send
>> your GC logs somewhere else, like the .out file or a completely separate file.
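>>
>> A minimal sketch of one way to do that, assuming you edit
>> conf/hbase-env.sh on the master node and restart (the log path here is
>> just an example):
>>
>>   export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:${HBASE_HOME}/logs/gc-hbase.log"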
>>
>> J-D
>>
>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>> I see, so I should jstack the RS hosting the .META. region. I'll do that.
>>>
>>> The master log pretty much looks like this: should I grep for
>>> something specific?
>>>
>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner
>>> scan of 1 row(s) of meta region complete
>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew: 19136K->2112K(19136K), 0.0033350 secs] 378606K->361596K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew: 19068K->2112K(19136K), 0.0029690 secs] 378552K->361630K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew: 19136K->2112K(19136K), 0.0037690 secs] 378654K->361647K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew: 19136K->1810K(19136K), 0.0032600 secs] 378671K->361346K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew: 18834K->2112K(19136K), 0.0029870 secs] 378370K->361661K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew: 19136K->1747K(19136K), 0.0037090 secs] 378685K->361298K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead,
>>> average load 760.2222222222222
>>>
>>>
>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>> Are you able to tell if that call in the metaScanner is hanging or if
>>>> it's making multiple calls to the .META. region?
>>>>
>>>> If the former, then jstack the region server that hosts .META. and see
>>>> where it's blocked.
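>>>>
>>>> A rough sketch of what that could look like on the .META. host
>>>> (assuming jps/jstack come from the same JDK that runs HBase):
>>>>
>>>>   jps | grep HRegionServer     # find the region server pid
>>>>   jstack <pid> > rs-meta.jstack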
>>>>
>>>> If the latter, then it probably means your .META. region is slow.
>>>> Again, what's going on on the RS that hosts .META.?
>>>>
>>>> Finally, what's the master's log like during that time?
>>>>
>>>> J-D
>>>>
>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>>> J-D,
>>>>>
>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>> rather small tables and they got stuck for a long time again, so I
>>>>> took a jstack of the master. From what I can tell, all disableTable
>>>>> calls are blocked by a meta scanner thread (sample below). At the
>>>>> moment there were no other requests to the server at all. How should I
>>>>> investigate this further? If it helps, here's some background: I have
>>>>> several datasets, each of them in a separate table; our data
>>>>> pipeline produces a new version of each dataset every day and only the
>>>>> latest version should be used. This is how the data is loaded: for
>>>>> each dataset, 1) run an MR job that outputs HFiles, 2) call loadTable.rb
>>>>> to create a new table, 3) disable and drop the previous version. As a
>>>>> result, some calls to load table and drop table would overlap. Please
>>>>> let me know if something stands out to you as a potential culprit.
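>>>>>
>>>>> For what it's worth, step 3 is just the usual shell commands, roughly
>>>>> like this (the table name here is made up):
>>>>>
>>>>>   disable 'myDataset_v20110227'
>>>>>   drop 'myDataset_v20110227'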
>>>>> Thanks!
>>>>>
>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
>>>>>
>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>  State: BLOCKED
>>>>>  Blocked count: 3801
>>>>>  Waited count: 72719
>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>  Stack:
>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>
>>>>>
>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>  State: WAITING
>>>>>  Blocked count: 1526058
>>>>>  Waited count: 1488998
>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>  Stack:
>>>>>    java.lang.Object.wait(Native Method)
>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>    $Proxy1.get(Unknown Source)
>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans
>>>>> <jdcryans@apache.org> wrote:
>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>
>>>>>> I would approach this problem like I generally do with HBase issues:
>>>>>> first, check the master log for any weirdness regarding my problem (in
>>>>>> this case, grep for the table name).
>>>>>>
>>>>>> Then I would look at the region server log(s) of the nodes that were
>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>> disable the regions (starting to close, flush, region completely
>>>>>> closed).
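>>>>>>
>>>>>> A rough sketch of how to pull those out (assuming the default log
>>>>>> location and naming, and a table called myTable; adjust for your setup):
>>>>>>
>>>>>>   grep myTable ${HBASE_HOME}/logs/hbase-*-regionserver-*.log | egrep -i 'clos|flush'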
>>>>>>
>>>>>> If you are able to do it while it's taking a very long time to
>>>>>> disable, try to jstack the process that seems to be hanging.
>>>>>>
>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>> almost prevents disabling a table (or re-enabling it) if any region
>>>>>> recently split and the parent wasn't cleaned yet from .META.; that
>>>>>> is fixed in 0.90.1.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>>>>> I think you are right, maybe in the long run I need to re-architect my
>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>> disabling function: I ran a disable command on a very small table
>>>>>>> (probably dozens of MBs in size) with no clients using the cluster at
>>>>>>> all, and that took about 1 hour to complete! The weird thing is on the
>>>>>>> web UI only the region server carrying the .META. table has non-zero
>>>>>>> requests; all other RS have 0 requests the entire time. I would think
>>>>>>> they should get some requests to flush the memstore at least. I *am*
>>>>>>> using the same RS nodes for some map reduce jobs at the time, and top
>>>>>>> shows the memory usage is almost full on the RS hosting .META. Would
>>>>>>> you have some idea of what I should investigate?
>>>>>>> Thanks so much.
>>>>>>
>>>>>
>>>>
>>>
>>
>
