hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10673) [replication] Recovering a large number of queues can cause OOME
Date Mon, 30 Jun 2014 16:32:24 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047818#comment-14047818 ]

Lars Hofhansl commented on HBASE-10673:

Just came across this. Seems serious.

> [replication] Recovering a large number of queues can cause OOME
> ----------------------------------------------------------------
>                 Key: HBASE-10673
>                 URL: https://issues.apache.org/jira/browse/HBASE-10673
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.94.5
>            Reporter: Ian Friedman
> Recently we had an accidental shutdown of approx 400 RegionServers in our cluster. Our
cluster runs master-master replication to our secondary datacenter. Once we got the nodes
sorted out and added back to the cluster, we noticed RSes continuing to die from OOMExceptions.
Investigating the heap dumps from the dead RSes I found them full of WALEdits, and a lot of
ReplicationSource threads generated by the NodeFailoverWorker replicating a lot of queues
for the dead RSes.
> After doing some digging into the code and with a lot of help from [~jdcryans] we found
that when a lot of RSes die simultaneously, the node failover process on just a few
regionservers can pick up the majority of queues. This causes a lot of HLogs to be
opened and read simultaneously, thus overflowing the heap.
> As per [~jdcryans]:
> {quote}
> Well there's a race to grab the queues to recover, and the surviving region servers are
just scanning the znode with all the RS to try to determine which ones are dead (and when
it finds a dead one it forks off a NodeFailoverWorker) so that if one RS got the notification
that one RS died but in reality a bunch of them died, it could win a lot of those races.
> My guess is after that one died, its queues were redistributed, and maybe you had another
hero RS, but as more and more died the queues eventually got manageable since a single RS
cannot win all the races. Something like this:
> - 400 die, one RS manages to grab 350
> - That RS dies, one other RS manages to grab 300 of the 350 (the other queues going to
other region servers)
> - That second RS dies, hero #3 grabs 250
> etc.
> ...
> We don't have the concept of a limit of recovered queues, less so the concept of respecting
it. Aggregating the queues into one is one solution, but it would perform poorly as shown with
your use case where one region server would be responsible to replicate the data for 300 others.
It would scale better if the queues were more evenly distributed (and it was the original
intent, except that in practice the dynamics are different).
> {quote}
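
The race described in the quote can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase code: `claim_queues`, `hero_bias`, and `max_per_server` are hypothetical names made up for this example, and the per-server cap is exactly the "limit of recovered queues" that the discussion says does not exist. The model assumes one survivor (the "hero") wins most races because it noticed the failures first, and shows how a cap would spread the queues out instead.

```python
import random
from collections import Counter


def claim_queues(dead_queues, survivors, hero_bias=0.0, max_per_server=None, seed=0):
    """Toy model of the failover race: each orphaned queue goes to one survivor.

    hero_bias      -- assumed probability that the first survivor wins a given
                      race; otherwise a uniformly random survivor wins.
    max_per_server -- hypothetical cap on recovered queues per server. It is
                      only honoured while the survivors' combined capacity
                      (cap * number of survivors) covers all orphaned queues.
    """
    rng = random.Random(seed)
    claimed = Counter()
    hero = survivors[0]
    for _ in range(dead_queues):
        winner = hero if rng.random() < hero_bias else rng.choice(survivors)
        if max_per_server is not None and claimed[winner] >= max_per_server:
            # The race winner is already at its cap, so the queue is handed
            # to the least-loaded survivor instead.
            winner = min(survivors, key=lambda s: claimed[s])
        claimed[winner] += 1
    return claimed


if __name__ == "__main__":
    survivors = [f"rs{i}" for i in range(100)]
    uncapped = claim_queues(400, survivors, hero_bias=0.85)
    capped = claim_queues(400, survivors, hero_bias=0.85, max_per_server=8)
    print("largest share without a cap:", max(uncapped.values()))
    print("largest share with a cap of 8:", max(capped.values()))
```

Without the cap, one server ends up holding most of the 400 queues, which matches the "hero RS" pattern in the quote. With the cap, no server holds more than 8, at the cost of needing some way to hand a queue to another server, which this sketch simply assumes is possible.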

This message was sent by Atlassian JIRA
