hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed
Date Thu, 26 May 2011 17:02:10 GMT
What's going on in the master? When a region times out in OPENING this
is what the TimeoutMonitor does:

                  // Attempt to transition node into OFFLINE
                  try {
                    data = new RegionTransitionData(
                      EventType.M_ZK_REGION_OFFLINE, regionInfo.getRegionName(),
                      master.getServerName());
                    if (ZKUtil.setData(watcher, node, data.getBytes(),
                        stat.getVersion())) {
                      // Node is now OFFLINE, let's trigger another assignment
                      ZKUtil.getDataAndWatch(watcher, node); // re-set the watch
                      LOG.info("Successfully transitioned region=" +
                          regionInfo.getRegionNameAsString() + " into OFFLINE" +
                          " and forcing a new assignment");
                      assigns.put(regionState.getRegion(), Boolean.TRUE);
                    }

So the node should be moved into OFFLINE, looks like it's not happening?

J-D

On Thu, May 26, 2011 at 12:15 AM, bijieshan <bijieshan@huawei.com> wrote:
> It caused the region couldn't been open anymore, for it has fallen into an loop of opening
operations, but failed for each time. The Balancer would skip for the region still remain
in RIT. So the regions looked un-balance between the regionservers.
>
> I describe the problem step by step as following:
>
> 1.HMaster send Msg to openregion on RS1.
> 2.RS1 received the Msg, and start to open the region. Before the opening, update the
state of ZK node from offline to opening.
> 3.IOException happened while openRegion, so the opening failed.
> 4.The ZK node state was still opening.
> 5.HMaster TimeoutMonitor found the region-opening timeout, so send the opening Msg again.
Maybe it send to RS2
> 6.RS2 execute the opening, while update the ZK node state, it got an unexpected state.
So failed again.
> 7.Loop the steps from 5 to 6.
>
> And from the code:
>
> OpenRegionHandler#process
>      if (!transitionZookeeperOfflineToOpening(encodedName)) {
>        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>          encodedName);
>        return;
>      }
>      /************************************************************************/
>      /*********IOException happened, region is null***************************/
>      /************************************************************************/
>      region = openRegion();
>      /************************************************************************/
>      /*********(region == null) is true, so return directly*******************/
>      /************************************************************************/
>      if (region == null) return;
>      boolean failed = true;
>      if (tickleOpening("post_region_open")) {
>        if (updateMeta(region)) failed = false;
>      }
>
> OpenRegionHandler#openRegion
>    HRegion region = null;
>    try {
>      /************************************************************************/
>      /*********IOException happened here.. ***********************************/
>      /************************************************************************/
>        region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
>        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
>        new CancelableProgressable() {
>          public boolean progress() {
>            return tickleOpening("open_region_progress");
>          }
>        });
>    } catch (IOException e) {
>      LOG.error("Failed open of region=" +
>        this.regionInfo.getRegionNameAsString(), e);
>    }
>
>    return region;
>
> Here's the logs:
> 2011-05-20 15:49:48,122 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Failed open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
> java.io.IOException: Exception occured while connecting to the server
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
>        at $Proxy6.getFileInfo(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
>        at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
>        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> 2011-05-20 16:21:27,731 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c
Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
RS_ZK_REGION_OPENING
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c
Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING
set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,737 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c
Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
RS_ZK_REGION_OPENING
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c
Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING
set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,747 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c
Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
RS_ZK_REGION_OPENING
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c
Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING
set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:51:27,748 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>

Mime
View raw message