hbase-issues mailing list archives

From "Ian Friedman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-10673) [replication] Recovering a large number of queues can cause OOME
Date Tue, 04 Mar 2014 21:51:59 GMT

     [ https://issues.apache.org/jira/browse/HBASE-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Friedman updated HBASE-10673:
---------------------------------

    Description: 
Recently we had an accidental shutdown of approximately 400 RegionServers in our cluster. Our cluster
runs master-master replication to our secondary datacenter. Once we got the nodes sorted out
and added back to the cluster, we noticed RSes continuing to die from OutOfMemoryErrors. Investigating
the heap dumps from the dead RSes, I found them full of WALEdits, along with a large number of
ReplicationSource threads spawned by the NodeFailoverWorker to replicate the queues left behind by the dead RSes.

After some digging into the code, and with a lot of help from [~jdcryans], we found that when
a lot of RSes die simultaneously, it is possible for the node failover process on just a few
surviving regionservers to pick up the majority of the queues. This causes a large number of
HLogs to be opened and read at the same time, overflowing the heap.
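
To make the failure mode concrete, here is a rough sketch (not the actual HBase code; the class
and method names below are invented) of why heap usage scales with the number of recovered queues
a single server claims:

{code:java}
// Hedged sketch, not the actual HBase code: RecoveredQueueWorker and
// readNextBatch are made up for illustration. The point is that each
// recovered queue gets its own worker thread with its own open HLog reader
// and its own in-memory batch of WALEdits, so heap usage grows roughly
// linearly with the number of queues a single server claims.
import java.util.ArrayList;
import java.util.List;

public class RecoveredQueueWorker implements Runnable {
  private final List<String> hlogPaths;                         // HLogs left behind by the dead RS
  private final List<byte[]> bufferedEdits = new ArrayList<>(); // WALEdits waiting to be shipped

  RecoveredQueueWorker(List<String> hlogPaths) {
    this.hlogPaths = hlogPaths;
  }

  @Override
  public void run() {
    for (String hlog : hlogPaths) {
      // Open this queue's HLog and buffer a batch of edits; with N recovered
      // queues there are N concurrent readers doing this at the same time.
      bufferedEdits.add(readNextBatch(hlog));
    }
  }

  private byte[] readNextBatch(String hlog) {
    return new byte[0];                                         // placeholder for the real reader
  }

  // Roughly what NodeFailoverWorker does for every queue it wins:
  // one thread (and one reader) per recovered queue.
  static void startWorkers(List<List<String>> claimedQueues) {
    for (List<String> queue : claimedQueues) {
      new Thread(new RecoveredQueueWorker(queue)).start();
    }
  }
}
{code}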

as per [~jdcryans]:
{quote}
Well, there's a race to grab the queues to recover, and the surviving region servers are just
scanning the znode that lists all the RSes to try to determine which ones are dead (and when
one finds a dead one, it forks off a NodeFailoverWorker), so if one RS got the notification that
one RS died but in reality a bunch of them died, it could win a lot of those races.

My guess is after that one died, its queues were redistributed, and maybe you had another
hero RS, but as more and more died the queues eventually got manageable since a single RS
cannot win all the races. Something like this:

- 400 die, one RS manages to grab 350
- That RS dies, one other RS manages to grab 300 of the 350 (the other queues going to other
region servers)
- That second RS dies, hero #3 grabs 250
etc.

...

We don't have the concept of a limit on recovered queues, much less the concept of respecting
one. Aggregating the queues into one is a possible solution, but it would perform poorly, as shown
by your use case, where one region server would be responsible for replicating the data of 300 others.
It would scale better if the queues were more evenly distributed (and that was the original
intent, except that in practice the dynamics are different).
{quote}
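
The race [~jdcryans] describes, and the missing "limit of recovered queues", can be sketched
roughly as below. This is not the real replication code: the znode layout is simplified and the
cap (maxRecovered) is a hypothetical knob that 0.94 does not have; only the ZooKeeper calls
themselves are real API.

{code:java}
import java.util.List;
import java.util.Set;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class QueueClaimer {
  private final ZooKeeper zk;
  private final String queuesZnode;   // parent znode with one child per region server (simplified)
  private final int maxRecovered;     // hypothetical cap, not a real 0.94 setting
  private int claimed = 0;

  QueueClaimer(ZooKeeper zk, String queuesZnode, int maxRecovered) {
    this.zk = zk;
    this.queuesZnode = queuesZnode;
    this.maxRecovered = maxRecovered;
  }

  /** Scan the queue znodes, find dead servers, and race to claim their queues. */
  void claimDeadServerQueues(Set<String> liveServers)
      throws KeeperException, InterruptedException {
    List<String> owners = zk.getChildren(queuesZnode, false);
    for (String owner : owners) {
      if (liveServers.contains(owner)) {
        continue;                      // owner is still alive, nothing to recover
      }
      if (claimed >= maxRecovered) {
        return;                        // cap reached: leave the rest to other survivors
      }
      try {
        // The race: whichever survivor creates the lock znode first wins this
        // dead server's queues; every other survivor gets NodeExistsException.
        zk.create(queuesZnode + "/" + owner + "/lock", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        claimed++;
        // ... copy the queues under our own znode and start one worker per queue
      } catch (KeeperException.NodeExistsException e) {
        // Lost the race for this server; another RS is recovering its queues.
      }
    }
  }
}
{code}

With a cap like this, a survivor that notices 400 dead servers would only take on a handful of
them and leave the rest for other survivors to claim, instead of one "hero" RS winning 350 races
and opening 350 queues' worth of HLogs at once.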

  was: (previous description omitted; identical to the above except for a typo, "seconday", corrected to "secondary")

> [replication] Recovering a large number of queues can cause OOME
> ----------------------------------------------------------------
>
>                 Key: HBASE-10673
>                 URL: https://issues.apache.org/jira/browse/HBASE-10673
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.94.5
>            Reporter: Ian Friedman
>
> (issue description as above)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
