hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16270) When using region replicas, get thousands of occurrences of UnexpectedStateException: Current snapshot id is -1
Date Tue, 02 Aug 2016 01:54:20 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403209#comment-15403209

Enis Soztutar commented on HBASE-16270:

Thanks [~rayokota] for digging deeper. 

bq. You can see at this link where the call to HRegion.internalFlushcache is outside the synchronized
block for HRegion.writestate (however, my analysis may be wrong as this was based on just
a few minutes of digging).
We are still guarding that code with writeState before and after the call. Moreover, region
replicas have writeState.writesEnabled=false set, so normal HRegion.flushCache() and HRegion.close()
etc calls should not be exercising that code paths. Only through the replayWALFlush() should
be exercising the flush prepare + flush commit sequence which is already protected. 

Is there any way that we can reproduce the issue, and get the full logs for the region? That
would really help in the analysis. 

> When using region replicas, get thousands of occurrences of UnexpectedStateException:
Current snapshot id is -1
> ---------------------------------------------------------------------------------------------------------------
>                 Key: HBASE-16270
>                 URL: https://issues.apache.org/jira/browse/HBASE-16270
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.1.2
>            Reporter: Robert Yokota
>         Attachments: HBASE-16270-branch-1.2.patch, HBASE-16270-master.patch
> We have an HBase (1.1.2) production cluster with 58 region servers and a staging cluster
with 6 region servers.
> For both clusters, we enabled region replicas with the following settings:
> hbase.regionserver.storefile.refresh.period = 0
> hbase.region.replica.replication.enabled = true
> hbase.region.replica.replication.memstore.enabled = true
> hbase.master.hfilecleaner.ttl = 3600000
> hbase.master.loadbalancer.class = org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer
> hbase.meta.replica.count = 3
> hbase.regionserver.meta.storefile.refresh.period = 30000
> hbase.region.replica.wait.for.primary.flush = true
> hbase.region.replica.storefile.refresh.memstore.multiplier = 4
> We then altered our HBase tables to have REGION_REPLICATION => 2
> Both clusters got into a state where a few region servers were spewing the following
error below in the HBase logs.  In one instance the error occurred over 70K times.  At this
time, these region servers would see 10x write traffic (the hadoop.HBase.RegionServer.Server.writeRequestCount
metric) and in some instances would crash.
> At the same time, we would see secondary regions move and then go into the "reads are
disabled" state for extended periods.  
> It appears there is a race condition where the DefaultMemStore::clearSnapshot method
might be called more than once in succession. The first call would set snapshotId to -1, but
the second call would throw an exception.  It seems the second call should just return if
the snapshotId is already -1, rather than throwing an exception.
> Thu Jul 21 08:38:50 UTC 2016, RpcRetryingCaller{globalStartTime=1469090201543, pause=100,
retries=35}, org.apache.hadoop.hbase.regionserver.UnexpectedStateException: org.apache.hadoop.hbase.regionserver.UnexpectedStateException:
Current snapshot id is -1,passed 1469085004304
>         at org.apache.hadoop.hbase.regionserver.DefaultMemStore.clearSnapshot(DefaultMemStore.java:187)
>         at org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1054)
>         at org.apache.hadoop.hbase.regionserver.HStore.access$500(HStore.java:128)
>         at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2270)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:4487)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:4388)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:4191)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:776)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:1655)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22255)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)

This message was sent by Atlassian JIRA

View raw message