hbase-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
Date Mon, 25 Aug 2014 00:24:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108647#comment-14108647 ]

Hadoop QA commented on HBASE-11813:
-----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12664068/catch_all_exceptions.txt
  against trunk revision .
  ATTACHMENT ID: 12664068

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10556//console

This message is automatically generated.

> CellScanner#advance may infinitely recurse
> ------------------------------------------
>
>                 Key: HBASE-11813
>                 URL: https://issues.apache.org/jira/browse/HBASE-11813
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.99.0, 2.0.0, 0.98.6
>
>         Attachments: 11813.098.txt, 11813.098.txt, 11813.master.txt, catch_all_exceptions.txt
>
>
> On user@hbase, johannes.schaback@visual-meta.com reported:
> {quote}
> we face a serious issue with our HBase production cluster for two days now. Every couple
> minutes, a random RegionServer gets stuck and does not process any requests. In addition this
> causes the other RegionServers to freeze within a minute which brings down the entire cluster.
> Stopping the affected RegionServer unblocks the cluster and everything comes back to normal.
> {quote}
> Subsequent troubleshooting reveals that RPC is getting stuck because we are losing RPC
> handlers. In the .out files we have this:
> {noformat}
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> [...]
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
> java.lang.StackOverflowError
> {noformat}
> That is the anonymous CellScanner instance we create from CellUtil#createCellScanner:
> {code}
>     return new CellScanner() {
>       private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
>       private CellScanner cellScanner = null;
>       @Override
>       public Cell current() {
>         return this.cellScanner != null? this.cellScanner.current(): null;
>       }
>       @Override
>       public boolean advance() throws IOException {
>         if (this.cellScanner == null) {
>           if (!this.iterator.hasNext()) return false;
>           this.cellScanner = this.iterator.next().cellScanner();
>         }
>         if (this.cellScanner.advance()) return true;
>         this.cellScanner = null;
> --->        return advance();
>       }
>     };
> {code}
> That final return statement is the immediate problem.
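Each recursive call in the quoted advance() consumes one element of the iterator, so the stack depth grows with the number of consecutive scannables that yield no cell. A minimal sketch of a loop-based alternative follows. It is not necessarily the committed patch. The Cell, CellScanner and CellScannable interfaces here are simplified stand-ins for the HBase ones, and of() and countCells() are hypothetical helpers added only for the demonstration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class IterativeCellScanner {
  // Simplified stand-ins for the HBase interfaces (assumptions of this sketch).
  interface Cell {}
  interface CellScanner {
    Cell current();
    boolean advance() throws IOException;
  }
  interface CellScannable {
    CellScanner cellScanner();
  }

  static CellScanner createCellScanner(final List<? extends CellScannable> cellScannerables) {
    return new CellScanner() {
      private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
      private CellScanner cellScanner = null;

      @Override
      public Cell current() {
        return this.cellScanner != null ? this.cellScanner.current() : null;
      }

      @Override
      public boolean advance() throws IOException {
        // Loop instead of recursing: the stack depth stays constant no matter
        // how many consecutive scannables turn out to be empty.
        while (true) {
          if (this.cellScanner == null) {
            if (!this.iterator.hasNext()) return false;
            this.cellScanner = this.iterator.next().cellScanner();
          }
          if (this.cellScanner.advance()) return true;
          this.cellScanner = null;
        }
      }
    };
  }

  // Hypothetical helper: a scannable over a fixed array of cells (possibly none).
  static CellScannable of(final Cell... cells) {
    return () -> new CellScanner() {
      private int index = -1;
      @Override
      public Cell current() {
        return index >= 0 && index < cells.length ? cells[index] : null;
      }
      @Override
      public boolean advance() {
        return ++index < cells.length;
      }
    };
  }

  // Hypothetical helper: drain the combined scanner and count the cells seen.
  static int countCells(List<? extends CellScannable> scannables) {
    CellScanner scanner = createCellScanner(scannables);
    int count = 0;
    try {
      while (scanner.advance()) count++;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return count;
  }

  public static void main(String[] args) {
    // One million empty scannables followed by a single one-cell scannable.
    List<CellScannable> scannables = new ArrayList<>(Collections.nCopies(1_000_000, of()));
    scannables.add(of(new Cell() {}));
    System.out.println(countCells(scannables)); // prints 1
  }
}
```

The loop body is the same three steps as the original method; only the tail call is replaced by another pass through the loop.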
> We should also fix this so the RegionServer aborts if it loses a handler to an Error.
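One way to notice a handler thread dying from an Error is a per-thread uncaught-exception handler that asks the server to abort. The sketch below uses only standard java.lang.Thread facilities. The Abortable interface, startHandler() and captureHandlerFailure() are hypothetical names for illustration and do not describe how the RegionServer wires its RPC handlers.

```java
import java.util.concurrent.atomic.AtomicReference;

public class AbortOnHandlerError {
  // Hypothetical stand-in for whatever the server exposes to shut itself down.
  interface Abortable {
    void abort(String why, Throwable cause);
  }

  // Start a handler thread; if it dies from an Error, ask the server to abort
  // instead of silently running with one handler fewer.
  static Thread startHandler(String name, Runnable work, Abortable server) {
    Thread t = new Thread(work, name);
    t.setUncaughtExceptionHandler((thread, e) -> {
      if (e instanceof Error) {
        server.abort("Handler " + thread.getName() + " died", e);
      }
    });
    t.start();
    return t;
  }

  // Deliberately unbounded recursion, used only to provoke a StackOverflowError.
  static int recurse(int n) {
    return recurse(n + 1) + 1;
  }

  // Hypothetical helper for the demonstration: run the work on a handler
  // thread and return the Throwable passed to abort(), or null if none.
  static Throwable captureHandlerFailure(Runnable work) {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    Thread t = startHandler("handler-0", work, (why, cause) -> seen.set(cause));
    try {
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return seen.get();
  }

  public static void main(String[] args) {
    Throwable cause = captureHandlerFailure(() -> recurse(0));
    System.out.println(cause instanceof StackOverflowError); // prints true
  }
}
```

The uncaught-exception handler runs on the dying thread after its stack has unwound, so the abort callback has stack space to work with even when the cause is a StackOverflowError.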




--
This message was sent by Atlassian JIRA
(v6.2#6252)
