hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-10312) Flooding the cluster with administrative actions leads to collapse
Date Fri, 11 Apr 2014 23:14:17 GMT

     [ https://issues.apache.org/jira/browse/HBASE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-10312:

    Attachment: HBASE-10312.java

While running {{TestHRegion}}, I saw that it fails about 50% of the time on {{testgetHDFSBlocksDistribution}}
because the mini cluster shuts down while it's still being initialized. Digging in, I found that
{{testWritesWhileGetting}} flushes like mad and completely swamps {{TaskMonitor}}, so
much so that {{purgeExpiredTasks}} can block for seconds on sublisting. This blocking prevents
the region servers from starting their RPC servers fast enough, and in the meantime the master
gives up on trying to assign meta (WTF!) and then just sits there doing nothing until the
{{HMaster}} creation times out. That is why the cluster shuts down while trying to
boot up.

The patch I'm attaching makes {{TestHRegion}} work 100% of the time by using a {{CircularFifoBuffer}}
([~stack]'s idea). I'm positive that it also fixes your issue, [~apurtell].
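
For readers without the attachment at hand, here is a minimal sketch of the idea behind a circular FIFO buffer: a fixed-capacity queue that drops its oldest entry whenever a new one arrives at capacity, so no periodic sublisting is needed to trim it. This is not the code from the patch. It uses {{java.util.ArrayDeque}} as a stand-in for {{CircularFifoBuffer}}, and the class name {{BoundedTaskLog}} and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a bounded task list; not the class used in the patch. */
final class BoundedTaskLog<T> {
    private final int capacity;
    private final ArrayDeque<T> tasks;

    BoundedTaskLog(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.tasks = new ArrayDeque<>(capacity);
    }

    /** Adds a task; if the buffer is already full, the oldest task is evicted first. */
    synchronized void add(T task) {
        if (tasks.size() == capacity) {
            tasks.pollFirst(); // constant-time eviction, no sublist views involved
        }
        tasks.addLast(task);
    }

    /** Returns a copy of the retained tasks, oldest first. */
    synchronized List<T> snapshot() {
        return new ArrayList<>(tasks);
    }

    synchronized int size() {
        return tasks.size();
    }
}
```

With a capacity of 3, adding five tasks leaves only the last three. The cost of each {{add}} stays constant no matter how fast tasks arrive, which is the property that matters when flushes flood the monitor.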

> Flooding the cluster with administrative actions leads to collapse
> ------------------------------------------------------------------
>                 Key: HBASE-10312
>                 URL: https://issues.apache.org/jira/browse/HBASE-10312
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>             Fix For: 0.99.0
>         Attachments: HBASE-10312.java
> Steps to reproduce:
> 1. Start a cluster.
> 2. Start an ingest process.
> 3. In the HBase shell, do this:
> {noformat}
> while true do
>    flush 'table'
> end
> {noformat}
> We should reject abuse via administrative requests like this.
> What happens on the cluster is the requests back up, leading to lots of these:
> {noformat}
> 2014-01-10 18:55:55,293 WARN  [Priority.RpcServer.handler=2,port=8120] monitoring.TaskMonitor: Too many actions in action monitor! Purging some.
> {noformat}
> At this point we could lower a gate on further requests for actions until the backlog clears.
> Continuing, all of the regionservers will eventually die with a StackOverflowError of unknown origin because, stack overflow:
> {noformat}
> 2014-01-10 19:02:02,783 ERROR [Priority.RpcServer.handler=3,port=8120] ipc.RpcServer: Unexpected throwable object java.lang.StackOverflowError
>         at java.util.ArrayList$SubList.add(ArrayList.java:965)
> [...]
> {noformat}

This message was sent by Atlassian JIRA
