Date: Tue, 5 Mar 2013 02:53:13 +0000 (UTC)
From: "Devaraj Das (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-7982) TestReplicationQueueFailover* runs for a minute, spews 3/4 million lines complaining 'Filesystem closed', has an NPE, and still passes?

    [ https://issues.apache.org/jira/browse/HBASE-7982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592985#comment-13592985 ]

Devaraj Das commented on HBASE-7982:
------------------------------------

[~jeffreyz] and I were looking at the logs, and it seems an RS didn't abort successfully. It got far enough into the abort to close the filesystem, but it then kept attempting filesystem operations and failing, over and over. Jeffrey is looking into what is preventing the RS from completing the abort.

{noformat}
2013-02-26 09:56:22,351 ERROR [RegionServer:1;juno.apache.org,53545,1361872469173.logSyncer] wal.FSHLog$LogSyncer(1012): Error while syncing, requesting close of hlog
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
        at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
        at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
        at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
        at java.lang.Thread.run(Thread.java:722)
{noformat}
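To make the failure mode concrete, here is a minimal sketch of the pattern we are seeing (illustrative only, not the actual FSHLog.LogSyncer code; Writer, stopped, and the stderr logging are stand-ins). Once the DFSClient is closed, every sync() throws "Filesystem closed", so a retry loop with no exit condition prints one full stack trace per iteration, forever:

{code}
import java.io.IOException;

// Illustrative sketch only -- not the actual FSHLog.LogSyncer. 'Writer' and
// 'stopped' are stand-ins for the real WAL writer and server lifecycle flag.
class LogSyncerSketch implements Runnable {
  interface Writer { void sync() throws IOException; }

  private final Writer writer;
  private volatile boolean stopped = false;

  LogSyncerSketch(Writer writer) { this.writer = writer; }

  @Override
  public void run() {
    while (!stopped) {
      try {
        writer.sync();          // throws forever once the filesystem is closed
      } catch (IOException e) {
        e.printStackTrace();    // this is where the millions of lines come from
        stopped = true;         // the missing piece: give up on unrecoverable failure
      }
    }
  }
}
{code}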
> TestReplicationQueueFailover* runs for a minute, spews 3/4 million lines complaining 'Filesystem closed', has an NPE, and still passes?
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7982
>                 URL: https://issues.apache.org/jira/browse/HBASE-7982
>             Project: HBase
>          Issue Type: Bug
>          Components: build
>            Reporter: stack
>            Assignee: Jeffrey Zhong
>            Priority: Blocker
>         Attachments: hbase-7982-NPE.patch
>
>
> I was trying to work out why Hudson occasionally OOMEs while building the test report, e.g. on 0.95 build #4 https://builds.apache.org/job/hbase-0.95/4/console:
> {code}
> ERROR: Failed to archive test reports
> hudson.util.IOException2: remote file operation failed: /home/jenkins/jenkins-slave/workspace/hbase-0.95 at hudson.remoting.Channel@151a4e3e:ubuntu3
>         at hudson.FilePath.act(FilePath.java:861)
>         at hudson.FilePath.act(FilePath.java:838)
>         at hudson.tasks.junit.JUnitParser.parse(JUnitParser.java:87)
> ...
> Caused by: java.lang.OutOfMemoryError: Java heap space
>         at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
>         at java.nio.CharBuffer.allocate(CharBuffer.java:329)
>         at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
>         at java.nio.charset.Charset.decode(Charset.java:791)
>         at hudson.tasks.junit.SuiteResult.<init>(SuiteResult.java:215)
> ...
> {code}
> We are trying to allocate a big buffer and failing. Looking at the reports being generated, we have quite a few that are > 10MB in size:
> {code}
> durruti:0.95 stack$ find hbase-* -type f -size +10000k -exec ls -la {} \;
> -rw-r--r--@ 1 stack staff  11126492 Feb 27 06:14 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.backup.TestHFileArchiving-output.txt
> -rw-r--r--@ 1 stack staff  13296009 Feb 27 05:47 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestFromClientSide3-output.txt
> -rw-r--r--@ 1 stack staff  10541898 Feb 27 05:47 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestMultiParallel-output.txt
> -rw-r--r--@ 1 stack staff  25344601 Feb 27 05:51 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient-output.txt
> -rw-r--r--@ 1 stack staff  17966969 Feb 27 06:12 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction-output.txt
> -rw-r--r--@ 1 stack staff  17699068 Feb 27 06:09 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit-output.txt
> -rw-r--r--@ 1 stack staff  17701832 Feb 27 06:07 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed-output.txt
> -rw-r--r--@ 1 stack staff 717853709 Feb 27 06:17 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailover-output.txt
> -rw-r--r--@ 1 stack staff 563616793 Feb 27 06:17 hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed-output.txt
> {code}
> ... with the two TestReplicationQueueFailover* outputs (~700MB and ~560MB) an order of magnitude bigger than the others.
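> For context on why files this size kill the report step, here is a minimal sketch of the decode-the-whole-file pattern the OOME stack trace above points at (my own illustration, not Hudson's code): Charset.decode materializes a char buffer roughly as long as the input, so a ~700MB -output.txt needs on the order of 1.4GB of heap for the decoded chars alone.
> {code}
> import java.nio.ByteBuffer;
> import java.nio.CharBuffer;
> import java.nio.charset.Charset;
> import java.nio.file.Files;
> import java.nio.file.Paths;
>
> // Illustration only, not Hudson's code. Decoding a whole report file in one
> // shot allocates a HeapCharBuffer proportional to the file size -- the same
> // allocation that fails in hudson.tasks.junit.SuiteResult above.
> public class DecodeWholeFile {
>   public static void main(String[] args) throws Exception {
>     byte[] bytes = Files.readAllBytes(Paths.get(args[0]));   // file-sized byte[]
>     // Charset.decode() -> CharsetDecoder.decode() -> CharBuffer.allocate(),
>     // i.e. one char[] about as long as the file:
>     CharBuffer chars = Charset.forName("UTF-8").decode(ByteBuffer.wrap(bytes));
>     System.out.println("decoded " + chars.remaining() + " chars");
>   }
> }
> {code}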
> Looking in the tests, I see both spewing between 800 and 900 thousand lines in about a minute. Here is the error they are fixated on:
> {code}
> 8908998 2013-02-27 06:17:48,176 ERROR [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] wal.FSHLog$LogSyncer(1012): Error while syncing, requesting close of hlog
> 8908999 java.io.IOException: Filesystem closed
> 8909000         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
> 8909001         at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
> 8909002         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
> 8909003         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 8909004         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
> 8909005         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
> 8909006         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
> 8909007         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
> 8909008         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
> 8909009         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
> 8909010         at java.lang.Thread.run(Thread.java:722)
> 8909011 2013-02-27 06:17:48,176 FATAL [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] wal.FSHLog(1140): Could not sync. Requesting close of hlog
> 8909012 java.io.IOException: Filesystem closed
> 8909013         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
> 8909014         at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
> 8909015         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
> 8909016         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 8909017         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
> 8909018         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
> 8909019         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
> 8909020         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
> 8909021         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
> 8909022         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
> 8909023         at java.lang.Thread.run(Thread.java:722)
> ...
> {code}
> These tests are 'succeeding'?
> I also see in both:
> {code}
> 3891 java.lang.NullPointerException
> 3892         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.seek(SequenceFileLogReader.java:261)
> 3893         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.seek(ReplicationHLogReaderManager.java:103)
> 3894         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:415)
> 3895         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:333)
> {code}
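> As for the NPE, the trace suggests seek() is reached while the wrapped reader is null (the log was closed or never opened). A guard of roughly this shape, a guess at the idea rather than the actual hbase-7982-NPE.patch, with 'InnerReader' standing in for the wrapped SequenceFile.Reader, would at least turn it into a diagnosable IOException:
> {code}
> import java.io.IOException;
>
> // Guesswork sketch, not the attached hbase-7982-NPE.patch. 'InnerReader'
> // stands in for the wrapped SequenceFile.Reader, which a concurrent close
> // can null out from under the replication source.
> class SeekGuardSketch {
>   interface InnerReader { void seek(long pos) throws IOException; }
>
>   private volatile InnerReader reader;   // may be nulled by close()
>
>   public void seek(long pos) throws IOException {
>     InnerReader r = this.reader;
>     if (r == null) {
>       throw new IOException("Reader is null, was the log closed under us?");
>     }
>     r.seek(pos);
>   }
> }
> {code}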