hbase-issues mailing list archives

From "Jerry He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
Date Wed, 28 Sep 2016 05:14:20 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528433#comment-15528433 ]

Jerry He commented on HBASE-16721:
----------------------------------

I had a similar problem on a customer production cluster recently.

The WALs for one of the region servers (server 11) kept accumulating, and these log messages
showed up repeatedly:

{code}
2016-09-03 14:37:15,989 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Too many hlogs: logs=817, maxlogs=32; forcing flush of 2 regions(s): 1b86c057f80721d4fde43a303f63ebde, 32d36d4864259dc9d984326bf27dcc5e
2016-09-03 14:37:15,990 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 1b86c057f80721d4fde43a303f63ebde, region=null, requester=null
2016-09-03 14:37:15,990 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 32d36d4864259dc9d984326bf27dcc5e, region=null, requester=null
{code}

It turned out that the two regions were open and hosted on other region servers, not on this
region server. After manually moving the complained-about regions from those region servers to
server 11, server 11 was able to finish the flushes, and its WAL file count came down right
after that.
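
For reference, the "region=null, requester=null" part of the warning is consistent with that: when the log roller asks for a flush of a region the WAL still tracks, it can only look the region up among the regions currently online on its own server. Below is a minimal, hypothetical sketch of that lookup (illustrative names only, not the actual LogRoller/FSHLog code):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for a region server's online-region registry and the
// log roller's flush scheduling; names are illustrative, not real HBase APIs.
public class LogRollFlushSketch {

  /** Regions currently opened on *this* region server, keyed by encoded name. */
  private final Map<String, Object> onlineRegions = new ConcurrentHashMap<>();

  /** Try to schedule a flush of a region the WAL says still has old unflushed edits. */
  void scheduleFlush(String encodedRegionName) {
    Object region = onlineRegions.get(encodedRegionName); // null if hosted elsewhere
    if (region != null) {
      // A real server would hand the region to its flush requester here.
      System.out.println("Scheduled flush of " + encodedRegionName);
    } else {
      // Same shape as the WARN above: the region was moved, split, or never opened
      // here, so there is nothing local to flush, yet the WAL still tracks it.
      System.out.println("Failed to schedule flush of " + encodedRegionName
          + ", region=" + region);
    }
  }

  public static void main(String[] args) {
    LogRollFlushSketch server11 = new LogRollFlushSketch();
    server11.onlineRegions.put("1b86c057f80721d4fde43a303f63ebde", new Object());
    server11.scheduleFlush("1b86c057f80721d4fde43a303f63ebde"); // hosted here: flush scheduled
    server11.scheduleFlush("32d36d4864259dc9d984326bf27dcc5e"); // hosted elsewhere: warning path
  }
}
{code}

Moving the region onto server 11 makes that lookup succeed, which matches the behavior observed above.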

I didn't have a chance to look into what the root cause was. Some of the region servers had
crashed before that.

> Concurrency issue in WAL unflushed seqId tracking
> -------------------------------------------------
>
>                 Key: HBASE-16721
>                 URL: https://issues.apache.org/jira/browse/HBASE-16721
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 1.2.4
>
>
> I'm inspecting an interesting case where, in a production cluster, some regionservers end up accumulating hundreds of WAL files, even with force flushes going on due to max logs. This happened multiple times on this cluster, but not on other clusters. The cluster has the periodic memstore flusher disabled; however, that still does not explain why the force flush of regions due to the max-logs limit is not working. I think the periodic memstore flusher just masks the underlying problem, which is why we do not see this in other clusters.
> The problem starts like this: 
> {code}
> 2016-09-21 17:49:18,272 INFO  [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN  [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> then, it continues until the RS is restarted: 
> {code}
> 2016-09-23 17:43:49,356 INFO  [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN  [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} was already split some time ago, and it was able to flush its data and split without any problems. However, the FSHLog still thinks that there is some unflushed data for this region.
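
To make the suspected failure mode concrete: the WAL keeps per-region bookkeeping of the lowest sequence id that has not yet been flushed, and the log roller uses it to pick regions to force-flush when there are too many WALs. The sketch below is a minimal, hypothetical illustration (not the actual FSHLog / sequence-id accounting code) of how a racing update that re-registers an unflushed sequence id after the flush/split has completed can leave a stale entry behind:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-region "lowest unflushed sequence id" bookkeeping.
// Names are illustrative; the real FSHLog / sequence id accounting code differs.
public class UnflushedSeqIdSketch {

  /** Encoded region name -> lowest sequence id not yet flushed for that region. */
  private final Map<String, Long> lowestUnflushedSeqIds = new ConcurrentHashMap<>();

  /** Called on append: remember the first unflushed sequence id for the region. */
  void onAppend(String encodedRegionName, long seqId) {
    lowestUnflushedSeqIds.putIfAbsent(encodedRegionName, seqId);
  }

  /** Called when a flush (or close/split) completes: supposed to clear the entry. */
  void completeCacheFlush(String encodedRegionName) {
    lowestUnflushedSeqIds.remove(encodedRegionName);
  }

  /** On "too many wals", nominate a region whose oldest unflushed edit sits in an old WAL. */
  String findRegionToForceFlush(long oldestWalSeqId) {
    for (Map.Entry<String, Long> e : lowestUnflushedSeqIds.entrySet()) {
      if (e.getValue() <= oldestWalSeqId) {
        return e.getKey();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    UnflushedSeqIdSketch wal = new UnflushedSeqIdSketch();
    String region = "d4cf39dc40ea79f5da4d0cf66d03cb1f";
    wal.onAppend(region, 100L);
    // Flush + split complete and clear the entry ...
    wal.completeCacheFlush(region);
    // ... but a racing writer re-registers an unflushed seqId afterwards, so the
    // entry is stale: the region is gone, yet the WAL still thinks it owes a flush.
    wal.onAppend(region, 101L);
    // Every subsequent log roll keeps nominating the departed region:
    System.out.println("forcing flush of: " + wal.findRegionToForceFlush(1_000L));
  }
}
{code}

Once such an entry is stale, the scheduled flush can never run on that server, the entry is never cleared, and the WAL count keeps growing until the regionserver is restarted, or, as in the case above, until the region happens to be hosted on that server again.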



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
