hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed
Date Thu, 26 May 2011 11:15:54 GMT
Hi, Jieshan:
You pasted logs from two region servers, right?

>> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.
regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer
exists, encodedName=d7555a12586e6c788ca55017224b5a51

Was d7555a12586e6c788ca55017224b5a51 the same region as the one in the first
log snippet, where the IOException occurred?

In the IOException catch block of OpenRegionHandler#openRegion, I see:
      // We failed open.  Let our znode expire in regions-in-transition and
      // Master will assign elsewhere.  Presumes nothing to close.

Did the znode for d7555a12586e6c788ca55017224b5a51 never expire?

Thanks

On Thu, May 26, 2011 at 12:15 AM, bijieshan <bijieshan@huawei.com> wrote:

> As a result the region could no longer be opened: it fell into a loop of
> open attempts, each of which failed. The balancer skips its run while the
> region remains in RIT (regions in transition), so the regions looked
> unbalanced across the region servers.
>
> The problem, step by step:
>
> 1. HMaster sends a message to RS1 to open the region.
> 2. RS1 receives the message and starts to open the region. Before opening,
>    it updates the state of the ZK node from OFFLINE to OPENING.
> 3. An IOException occurs in openRegion, so the open fails.
> 4. The ZK node state is still OPENING.
> 5. The HMaster TimeoutMonitor detects that the region open has timed out and
>    sends the open message again, possibly to RS2.
> 6. RS2 executes the open, but when it tries to update the ZK node state it
>    finds an unexpected state, so it fails again.
> 7. Steps 5 and 6 repeat.
>
> And from the code:
>
> OpenRegionHandler#process
>      if (!transitionZookeeperOfflineToOpening(encodedName)) {
>        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>          encodedName);
>        return;
>      }
>
>      // IOException happened, so region is null.
>      region = openRegion();
>
>      // (region == null) is true, so return directly.
>      if (region == null) return;
>      boolean failed = true;
>      if (tickleOpening("post_region_open")) {
>        if (updateMeta(region)) failed = false;
>      }
>
> OpenRegionHandler#openRegion
>    HRegion region = null;
>    try {
>      // IOException happened here.
>      region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
>        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
>        new CancelableProgressable() {
>          public boolean progress() {
>            return tickleOpening("open_region_progress");
>          }
>        });
>    } catch (IOException e) {
>      LOG.error("Failed open of region=" +
>        this.regionInfo.getRegionNameAsString(), e);
>    }
>
>    return region;
>
> Here are the logs:
> 2011-05-20 15:49:48,122 ERROR
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open
> of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
> java.io.IOException: Exception occured while connecting to the server
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
>        at $Proxy6.getFileInfo(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
>        at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
>        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> 2011-05-20 16:21:27,731 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:21:27,732 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for
> region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:21:27,732 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,737 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:30:27,738 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for
> region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,738 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,747 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:48:27,748 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for
> region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,748 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:51:27,748 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>
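
The sequence in steps 1 to 7 of the quoted report can be replayed with a small
in-memory model. This is only a sketch, not HBase code: the names RitLoopModel,
Node, Opener, process and failedAttempts are hypothetical, the "znode" is a
plain Java object, and the final OPENING to OPENED step is simplified. It
assumes, as the report describes, that nothing resets the node to OFFLINE
after a failed open.

```java
import java.io.IOException;

/** Toy model of the unassigned znode and the open-region handler; not HBase code. */
class RitLoopModel {
    enum State { OFFLINE, OPENING, OPENED }

    /** Stand-in for the unassigned znode: a state plus the server that last set it. */
    static final class Node {
        State state = State.OFFLINE;
        String owner = "master";

        /** Moves to 'next' only if the current state equals 'expected'. */
        boolean transition(State expected, State next, String server) {
            if (state != expected) {
                return false; // the quoted handler logs "Region was hijacked?" here
            }
            state = next;
            owner = server;
            return true;
        }
    }

    /** Hypothetical hook that decides whether the open fails on a given server. */
    interface Opener {
        void open(String server) throws IOException;
    }

    /**
     * Follows the control flow quoted from OpenRegionHandler#process:
     * transition OFFLINE to OPENING, try the open, return early on failure.
     * Returns true only if the node ends up OPENED.
     */
    static boolean process(Node node, String server, Opener opener) {
        if (!node.transition(State.OFFLINE, State.OPENING, server)) {
            return false;
        }
        try {
            opener.open(server);
        } catch (IOException e) {
            return false; // node is left in OPENING, as in step 4
        }
        return node.transition(State.OPENING, State.OPENED, server);
    }

    /** Replays the report: rs1 fails with an IOException, then rs2 is asked 'retries' times. */
    static int failedAttempts(Node node, int retries) {
        int failures = 0;
        // Steps 1-4: the open on rs1 throws, so the node stays in OPENING.
        if (!process(node, "rs1", s -> { throw new IOException("simulated"); })) {
            failures++;
        }
        // Steps 5-7: each timeout-driven retry on rs2 finds OPENING, not OFFLINE.
        for (int i = 0; i < retries; i++) {
            if (!process(node, "rs2", s -> { })) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Node node = new Node();
        System.out.println("failed attempts: " + failedAttempts(node, 3));
        System.out.println("node state: " + node.state + ", set by " + node.owner);
    }
}
```

In this model every retry on rs2 fails at the first transition, and the node
stays in OPENING with rs1 recorded as the server that set it, which matches
the repeated "the node existed but was in the state RS_ZK_REGION_OPENING"
warnings in the logs above.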
