Date: Sat, 2 Mar 2013 23:03:12 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-7982) TestReplicationQueueFailover* runs for a minute, spews 3/4million lines complaining 'Filesystem closed', has an NPE, and still passes?

    [ https://issues.apache.org/jira/browse/HBASE-7982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591555#comment-13591555 ]

stack commented on HBASE-7982:
------------------------------

This is a blocker because it looks like an issue in replication, and it could be one of the reasons the build fails copying test results.

> TestReplicationQueueFailover* runs for a minute, spews 3/4million lines complaining 'Filesystem closed', has an NPE, and still passes?
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7982
>                 URL: https://issues.apache.org/jira/browse/HBASE-7982
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Blocker
>
> I was trying to look at why Hudson occasionally OOMEs trying to make a report, as on 0.95 build #4 (https://builds.apache.org/job/hbase-0.95/4/console):
> {code}
> ERROR: Failed to archive test reports
> hudson.util.IOException2: remote file operation failed: /home/jenkins/jenkins-slave/workspace/hbase-0.95 at hudson.remoting.Channel@151a4e3e:ubuntu3
>     at hudson.FilePath.act(FilePath.java:861)
>     at hudson.FilePath.act(FilePath.java:838)
>     at hudson.tasks.junit.JUnitParser.parse(JUnitParser.java:87)
>     ...
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
>     at java.nio.CharBuffer.allocate(CharBuffer.java:329)
>     at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
>     at java.nio.charset.Charset.decode(Charset.java:791)
>     at hudson.tasks.junit.SuiteResult.<init>(SuiteResult.java:215)
>     ...
> {code}
> We are trying to allocate a big buffer and failing.
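>
> (To make the failure mode concrete: per the trace, SuiteResult hands the whole -output.txt to Charset.decode(), which allocates a single CharBuffer sized to the input. Below is a minimal sketch of that pattern next to a constant-memory alternative; it is illustrative only, not Hudson's actual code, and the file name is made up.)
> {code}
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.nio.charset.Charset;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class BigDecodeSketch {
>   public static void main(String[] args) throws IOException {
>     Path report = Paths.get("surefire-reports/SomeTest-output.txt"); // hypothetical
>     Charset utf8 = Charset.forName("UTF-8");
>
>     // What the OOME trace shows: all bytes in memory, then decode()
>     // allocates a char[] sized to the byte count. For a ~700MB log
>     // that is ~700M chars (~1.4GB of heap), on top of the byte[].
>     byte[] raw = Files.readAllBytes(report);
>     CharSequence decoded = utf8.decode(ByteBuffer.wrap(raw));
>     System.out.println("decoded " + decoded.length() + " chars in one shot");
>
>     // Constant-memory alternative: stream and decode incrementally.
>     long lines = 0;
>     try (BufferedReader in = Files.newBufferedReader(report, utf8)) {
>       while (in.readLine() != null) {
>         lines++;
>       }
>     }
>     System.out.println("streamed " + lines + " lines");
>   }
> }
> {code}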
> Looking at the reports being generated, we have quite a few that are > 10MB in size:
> {code}
> durruti:0.95 stack$ find hbase-* -type f -size +10000k -exec ls -la {} \;
> -rw-r--r--@ 1 stack staff 11126492 Feb 27 06:14 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.backup.TestHFileArchiving-output.txt
> -rw-r--r--@ 1 stack staff 13296009 Feb 27 05:47 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestFromClientSide3-output.txt
> -rw-r--r--@ 1 stack staff 10541898 Feb 27 05:47 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestMultiParallel-output.txt
> -rw-r--r--@ 1 stack staff 25344601 Feb 27 05:51 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient-output.txt
> -rw-r--r--@ 1 stack staff 17966969 Feb 27 06:12 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction-output.txt
> -rw-r--r--@ 1 stack staff 17699068 Feb 27 06:09 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit-output.txt
> -rw-r--r--@ 1 stack staff 17701832 Feb 27 06:07 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed-output.txt
> -rw-r--r--@ 1 stack staff 717853709 Feb 27 06:17 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailover-output.txt
> -rw-r--r--@ 1 stack staff 563616793 Feb 27 06:17 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed-output.txt
> {code}
> ... with TestReplicationQueueFailover* an order of magnitude bigger than the others.
> Looking in the tests, I see both spewing between 800 and 900 thousand lines in about a minute. Here is their fixation:
> {code}
> 8908998 2013-02-27 06:17:48,176 ERROR [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] wal.FSHLog$LogSyncer(1012): Error while syncing, requesting close of hlog.
> 8908999 java.io.IOException: Filesystem closed
> 8909000     at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
> 8909001     at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
> 8909002     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
> 8909003     at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 8909004     at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
> 8909005     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
> 8909006     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
> 8909007     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
> 8909008     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
> 8909009     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
> 8909010     at java.lang.Thread.run(Thread.java:722)
> 8909011 2013-02-27 06:17:48,176 FATAL [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] wal.FSHLog(1140): Could not sync. Requesting close of hlog
> 8909012 java.io.IOException: Filesystem closed
> 8909013     at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
> 8909014     at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
> 8909015     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
> 8909016     at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 8909017     at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
> 8909018     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
> 8909019     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
> 8909020     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
> 8909021     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
> 8909022     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
> 8909023     at java.lang.Thread.run(Thread.java:722)
> ...
> {code}
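>
> (That 'Filesystem closed' storm is the signature of the shared, cached FileSystem instance being closed out from under a thread that keeps syncing in a loop. Here is a minimal sketch of the interaction using plain Hadoop filesystem APIs rather than the actual FSHLog/LogSyncer classes; the path is made up, and it assumes fs.defaultFS points at HDFS, since the 'Filesystem closed' message comes from DFSClient.)
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class SharedFsCloseSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // FileSystem.get() hands back a cached instance that is shared by
>     // every caller with the same URI, conf, and user.
>     FileSystem fs = FileSystem.get(conf);
>     final FSDataOutputStream out = fs.create(new Path("/tmp/fake-wal"));
>
>     Thread syncer = new Thread(new Runnable() {
>       public void run() {
>         // The real log syncer loops until asked to stop; bounded here
>         // so the sketch terminates.
>         for (int i = 0; i < 1000; i++) {
>           try {
>             out.sync(); // DFSClient.checkOpen() throws once fs is closed
>           } catch (Exception e) {
>             // Every pass now logs "java.io.IOException: Filesystem closed"
>             // and retries -- log-and-retry in a tight loop is how a test
>             // spews hundreds of thousands of lines in a minute.
>             e.printStackTrace();
>           }
>         }
>       }
>     });
>     syncer.start();
>
>     Thread.sleep(100);
>     fs.close(); // e.g. teardown closing the cached instance for everyone
>   }
> }
> {code}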
> These tests are 'succeeding'?
> I also see in both:
> {code}
> 3891 java.lang.NullPointerException
> 3892     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.seek(SequenceFileLogReader.java:261)
> 3893     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.seek(ReplicationHLogReaderManager.java:103)
> 3894     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:415)
> 3895     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:333)
> {code}
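>
> (On the 'succeeding' question: JUnit fails a test only for exceptions raised on the test thread itself or for failed assertions; an exception thrown on a background thread, like the log syncer or the ReplicationSource above, just gets printed to the log. A self-contained JUnit 4 sketch with hypothetical names:)
> {code}
> import org.junit.Test;
>
> public class BackgroundSpewStillPasses {
>   @Test
>   public void passesDespiteBackgroundFailure() throws Exception {
>     Thread spewer = new Thread(new Runnable() {
>       public void run() {
>         // Thrown on a non-test thread: printed to the console (and so
>         // into the surefire -output.txt), but never reported to JUnit.
>         throw new RuntimeException("Filesystem closed");
>       }
>     });
>     spewer.start();
>     spewer.join();
>     // Nothing on the test thread observed the failure, so this passes.
>   }
> }
> {code}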