hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2866) Region permanently offlined
Date Fri, 23 Jul 2010 06:04:50 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891483#action_12891483 ]

stack commented on HBASE-2866:
------------------------------

@Kannan Thanks. Looking at master and at the code, my thought is that the fixup code didn't run
because that region is stuck in transition. Here is where we'd skip out, starting at about
#562 in BaseScanner:

{code}
    synchronized (this.master.getRegionManager()) {
      /* We don't assign regions that are offline, in transition or were on
       * a dead server. Regions that were on a dead server will get reassigned
       * by ProcessServerShutdown
       */
      if (info.isOffline() ||
          this.master.getRegionManager().regionIsInTransition(info.getRegionNameAsString()) ||
          // St.Ack ^^^^^^^^^^ My guess is we are in here ^^^^^^^^^
          (serverName != null && this.master.getServerManager().isDead(serverName))) {
        return;
      }
{code}
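
To make the guard above concrete, here is a minimal Java sketch of the same three-way condition. The class and method names (AssignmentGuardSketch, skipsAssignment, markInTransition and so on) are hypothetical stand-ins, not HBase code. The sketch only illustrates that a region left in the in-transition set is skipped on every pass, which would match the guess above.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the scanner guard quoted above; not HBase code. */
public class AssignmentGuardSketch {
    // Region names this (hypothetical) master believes are in transition.
    private final Set<String> regionsInTransition = new HashSet<>();
    // Server names this (hypothetical) master believes are dead.
    private final Set<String> deadServers = new HashSet<>();

    public void markInTransition(String regionName) { regionsInTransition.add(regionName); }
    public void clearTransition(String regionName) { regionsInTransition.remove(regionName); }
    public void markDead(String serverName) { deadServers.add(serverName); }

    /** Same shape as the quoted condition: true means the scanner returns early. */
    public boolean skipsAssignment(boolean offline, String regionName, String serverName) {
        return offline
            || regionsInTransition.contains(regionName)
            || (serverName != null && deadServers.contains(serverName));
    }

    public static void main(String[] args) {
        AssignmentGuardSketch guard = new AssignmentGuardSketch();
        String region = "test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.";

        // The region is not offline and has no server, but it is marked in transition.
        guard.markInTransition(region);
        System.out.println(guard.skipsAssignment(false, region, null)); // true: skipped

        // Only once the in-transition entry is cleared does the guard let it through.
        guard.clearTransition(region);
        System.out.println(guard.skipsAssignment(false, region, null)); // false: eligible
    }
}
```

If the in-transition entry is never cleared, the guard keeps returning early and the region is never picked up for reassignment.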

I think 'status' in the shell will show regions in transition, e.g.:

{code}
hbase(main):003:0> status 'detailed'
version 0.89.0-SNAPSHOT
0 regionsInTransition
1 live servers
    192.168.1.157:49248 1279864501042
        requests=0, regions=3, usedHeap=32, maxHeap=994
        .META.,,1
            stores=2, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
        x,,1279864569260.65c4857477eb31bff0fafae4797a90d8.
            stores=1, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
        -ROOT-,,0
            stores=1, storefiles=1, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
0 dead servers
{code}

@Karthik Give me a clue as to what you are thinking, and I'll have a go at fixing this one
if you don't have the time, boss.

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. from test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook.com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first one is the parent region (and this problem region is its splitB). For the next one, note that there are no "info:server" and "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 \x00\x00\x01)\xF9BL`@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e., why the RS ran into the version mismatch and aborted opening the region). He'll add details to the JIRA. But what we aren't clear about at this stage is why the base scanner didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not served by any region server  but is listed in META to be on server null
> ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not served by any region server  but is listed in META to be on server null
> {code}
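
The BadVersion failure in the quoted region server log comes from a versioned write: ZooKeeper's setData takes an expected version and rejects the write if the znode's current version differs. The sketch below imitates that check with a hypothetical in-memory class (VersionedNodeSketch is not the ZooKeeper client API, and it returns false where ZooKeeper throws BadVersionException). It shows one way such a mismatch can arise, namely a second writer updating the node first. Whether that is what happened here is not established in this thread.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory stand-in for a versioned znode; not the ZooKeeper client API. */
public class VersionedNodeSketch {
    /** Immutable snapshot of the node's data and version. */
    public static final class State {
        public final String data;
        public final int version;
        State(String data, int version) {
            this.data = data;
            this.version = version;
        }
    }

    private final AtomicReference<State> state;

    public VersionedNodeSketch(String initialData) {
        // A freshly created node starts at version 0.
        this.state = new AtomicReference<>(new State(initialData, 0));
    }

    public State read() {
        return state.get();
    }

    /** Writes only if expectedVersion matches; each successful write bumps the version. */
    public boolean setData(String data, int expectedVersion) {
        State current = state.get();
        if (current.version != expectedVersion) {
            return false; // the real client would throw BadVersionException here
        }
        return state.compareAndSet(current, new State(data, current.version + 1));
    }

    public static void main(String[] args) {
        VersionedNodeSketch node = new VersionedNodeSketch("M2ZK_REGION_OFFLINE");

        // Assumed scenario: a first writer updates the node, moving it to version 1.
        boolean first = node.setData("M2ZK_REGION_OFFLINE", 0);

        // A second writer still expects version 0, so its write is rejected.
        boolean second = node.setData("RS2ZK_REGION_OPENING", 0);

        System.out.println(first + " " + second + " version=" + node.read().version);
    }
}
```

A writer that hits the mismatch has to re-read the node to learn the current version before retrying, or give up, as the region server does in the log above.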

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

