hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-495) No server address listed in .META.
Date Fri, 07 Mar 2008 02:04:58 GMT

    [ https://issues.apache.org/jira/browse/HBASE-495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576014#action_12576014
] 

stack commented on HBASE-495:
-----------------------------

Here is the story I have so far.

The RegionServer gets hung on DFS ('Call queue overflow discarding oldest call batchUpdate').
Michael B notices it and shuts down the regionserver (44.221).
The server is restarted.
It tries to check in with the master, but the master says its old lease still exists.
The HRS has no pause facility, so in a tight loop it writes 400k lines in 15 seconds about the master saying the lease exists whenever it tries to check in (HBASE-496).
Eventually the old HRS lease expires.
The master gives the new HRS a region.
The HRS tries to deploy the region and skips 2M lines' worth of edits (HBASE-472).
The region eventually opens.
The master gives the HRS more regions to open.
Meantime the region with all the skipped edits tries to do a compaction and runs into a DFS issue, NotReplicatedYetException, and the compaction is aborted.
Other newly opened regions try to compact and fail in DFS too.  Here is what the failures look like:
{code}
2008-03-06 01:13:59,299 WARN org.apache.hadoop.fs.DFSClient: NotReplicatedYetException sleeping /hbase/aa0-005-2.u.powerset.com/enwiki_080103/compaction.dir/123835725/page/mapfiles/7474986258048984189/data retries left 2
2008-03-06 01:14:00,902 INFO org.apache.hadoop.fs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.LeaseExpiredException: No lease on /hbase/aa0-005-2.u.powerset.com/enwiki_080103/compaction.dir/123835725/page/mapfiles/7474986258048984189/data
        at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1157)
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1095)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:310)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:910)

        at org.apache.hadoop.ipc.Client.call(Client.java:512)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
        at org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2065)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1958)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:1479)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1593)

2008-03-06 01:14:01,029 WARN org.apache.hadoop.fs.DFSClient: NotReplicatedYetException sleeping /hbase/aa0-005-2.u.powerset.com/enwiki_080103/compaction.dir/123835725/page/mapfiles/7474986258048984189/data retries left 1
2008-03-06 01:14:04,231 WARN org.apache.hadoop.fs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.LeaseExpiredException: No lease on /hbase/aa0-005-2.u.powerset.com/enwiki_080103/compaction.dir/123835725/page/mapfiles/7474986258048984189/data
        at org.apache.hadoop.dfs.FSNamesystem.checkLease(FSNamesystem.java:1157)
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1095)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:310)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:910)

2008-03-06 01:14:04,231 WARN org.apache.hadoop.fs.DFSClient: Error Recovery for block blk_1794752555243844791 bad datanode[0]
2008-03-06 01:14:04,232 ERROR org.apache.hadoop.hbase.HRegionServer: Compaction failed for region enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204766010394
java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1824)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
{code}

The master sends more regions to open.  Now the open messages repeat for the same region... here is an illustration:

{code}
...
2008-03-06 01:16:41,090 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_080103,cD-17MphmZfwXnZVdtKy1k==,1199852162634
2008-03-06 01:28:32,670 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2008-03-06 01:29:29,718 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2008-03-06 01:29:35,722 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2008-03-06 01:29:35,722 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204766010394
2008-03-06 01:29:41,728 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2008-03-06 01:29:47,734 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2008-03-06 01:29:53,740 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2008-03-06 01:29:59,746 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
2008-03-06 01:30:05,752 INFO org.apache.hadoop.hbase.HRegionServer: MSG_REGION_OPEN : enwiki_071018,75WX3Q0b857NBV8HfO7PC-==,1197675176778
...
{code}

The regionserver should shut itself down if it is failing to open a region because of DFS issues -- that is, if it can recognize the failures as such.
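
Something like the following sketch is what I have in mind.  (The method names {{openRegion}}, {{instantiateRegion}}, {{isFileSystemProblem}}, and {{stop}} are illustrative only, not the actual HRegionServer API.)

{code}
// Sketch only: if an open fails because DFS itself is in trouble, stop the
// regionserver rather than looping on MSG_REGION_OPEN forever.
void openRegion(final HRegionInfo regionInfo) {
  try {
    HRegion region = instantiateRegion(regionInfo);   // replay edits, open stores
    onlineRegions.put(region.getRegionName(), region);
  } catch (IOException e) {
    if (isFileSystemProblem(e)) {
      LOG.fatal("DFS trouble opening " + regionInfo.getRegionName(), e);
      stop();   // shut this HRS down; the master can reassign its regions
      return;
    }
    LOG.error("Failed open of " + regionInfo.getRegionName(), e);
  }
}

// Crude recognition of DFS trouble from the exception text.
boolean isFileSystemProblem(final IOException e) {
  final String msg = e.getMessage();
  return msg != null &&
      (msg.contains("NotReplicatedYetException") ||
       msg.contains("LeaseExpiredException") ||
       msg.contains("Could not get block locations"));
}
{code}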

Meantime, over on the master, it's stuck in the shutdown loop:

{code}
...
2008-03-06 01:28:00,016 DEBUG org.apache.hadoop.hbase.HMaster: process server shutdown scanning root region on XX.XX.XX.92 finished HMaster
2008-03-06 01:28:00,016 DEBUG org.apache.hadoop.hbase.HMaster: numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2008-03-06 01:28:00,016 DEBUG org.apache.hadoop.hbase.HMaster: process server shutdown scanning .META.,,1 on XX.XX.XX.96:60020 HMaster
2008-03-06 01:28:00,021 DEBUG org.apache.hadoop.hbase.HMaster: shutdown scanner looking at enwiki_071018,,1199837878882
2008-03-06 01:28:00,021 DEBUG org.apache.hadoop.hbase.HMaster: Server name XX.XX.XX.226:60020 is not same as XX.XX.XX.221:60020: Passing
...
{code}

The above goes on for 10M lines over about 30 minutes.  The problem is this bit of code in regionServerStartup:

{code}
    HServerInfo storedInfo = serversToServerInfo.remove(s);
    if (storedInfo != null && !closed.get()) {
      // The startup message was from a known server with the same name.
      // Timeout the old one right away.
      HServerAddress root = rootRegionLocation.get();
      if (root != null && root.equals(storedInfo.getServerAddress())) {
        unassignRootRegion();
      } 
      delayedToDoQueue.put(new ProcessServerShutdown(storedInfo));
    } 
{code}

Don't put a new ProcessServerShutdown if the server already has a shutdown queued.
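
One way to do it, as a sketch against the snippet above (the {{deadServers}} set is hypothetical; it is not an existing HMaster field):

{code}
    HServerInfo storedInfo = serversToServerInfo.remove(s);
    if (storedInfo != null && !closed.get()) {
      // The startup message was from a known server with the same name.
      // Timeout the old one right away.
      HServerAddress root = rootRegionLocation.get();
      if (root != null && root.equals(storedInfo.getServerAddress())) {
        unassignRootRegion();
      }
      // Only queue a ProcessServerShutdown if one is not already queued for
      // this server.  'deadServers' is a hypothetical Set<String>; an entry
      // would be removed once its ProcessServerShutdown completes.
      if (deadServers.add(storedInfo.getServerAddress().toString())) {
        delayedToDoQueue.put(new ProcessServerShutdown(storedInfo));
      }
    }
{code}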

OK.  Two fixes needed for this issue (at least): regionservers should shut themselves down on DFS problems, and the master should not queue a shutdown for a server if one is already queued.

> No server address listed in .META.
> ----------------------------------
>
>                 Key: HBASE-495
>                 URL: https://issues.apache.org/jira/browse/HBASE-495
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.16.0
>            Reporter: stack
>             Fix For: 0.1.0, 0.2.0
>
>
> Michael Bieniosek manufactured the following in a 0.16.0 install:
> {code}
> 08/03/06 17:52:02 DEBUG hbase.HTable: Advancing internal scanner to startKey g80Fi5WZHlzLqGzErrAd7V==
> 08/03/06 17:52:02 DEBUG hbase.HConnectionManager$TableServers: reloading table servers because: No server address listed in .META. for region enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204768636421
> 08/03/06 17:52:12 DEBUG hbase.HConnectionManager$TableServers: reloading table servers because: No server address listed in .META. for region enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204768636421
> 08/03/06 17:52:22 DEBUG hbase.HConnectionManager$TableServers: reloading table servers because: No server address listed in .META. for region enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204768636421
> org.apache.hadoop.hbase.NoServerForRegionException: No server address listed in .META. for region enwiki_080103,g80Fi5WZHlzLqGzErrAd7V==,1204768636421
>         at org.apache.hadoop.hbase.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:449)
>         at org.apache.hadoop.hbase.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:346)
>         at org.apache.hadoop.hbase.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:309)
>         at org.apache.hadoop.hbase.HTable.getRegionLocation(HTable.java:103)
>         at org.apache.hadoop.hbase.HTable$ClientScanner.nextScanner(HTable.java:854)
>         at org.apache.hadoop.hbase.HTable$ClientScanner.next(HTable.java:915)
>         at org.apache.hadoop.hbase.hql.SelectCommand.scanPrint(SelectCommand.java:233)
>         at org.apache.hadoop.hbase.hql.SelectCommand.execute(SelectCommand.java:100)
>         at org.apache.hadoop.hbase.hql.HQLClient.executeQuery(HQLClient.java:50)
>         at org.apache.hadoop.hbase.Shell.main(Shell.java:114)
> {code}
> When I look in the .META., I see that the above region range has multiple mentions: one
> offlined, two that have startcodes and servers associated, and about 5 others that are just
> HRIs.  The table is broken.  At a minimum we need the merge-of-overlapping-regions tool to
> fix it.  Digging more....

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

