Date: Wed, 10 Oct 2018 14:09:00 +0000 (UTC)
From: "Steve Loughran (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Comment Edited] (HADOOP-15819) S3A integration test failures: FileSystem is closed! - without parallel test run

    [ https://issues.apache.org/jira/browse/HADOOP-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644940#comment-16644940 ]

Steve Loughran edited comment on HADOOP-15819 at 10/10/18 2:08 PM:
-------------------------------------------------------------------

bq. The FS cache really feels inherently broken in the parallel tests case, which is why I initially liked the idea of disabling caching for the tests.

Parallel tests run in their own JVMs; the main issue there is that you need to be confident that tests aren't writing to the same local/remote paths. At the same time, I've seen old test instances get recycled, which makes me think that the parallel runner feeds work out to instantiated test runners as they complete individual test cases. So reuse does happen, it just happens a lot more in sequential runs. Tests which want special configs of the FS can't handle recycled classes, hence the need for new filesystems & close after, but I don't see why other tests should be closing much.

* HADOOP-13131 added the close, along with {{S3ATestUtils.createTestFilesystem()}}, which does create filesystems that need to be closed.
* But I don't see that creation happening much, especially given {{S3AContract}}'s FS is just from a get().
* Many tests call {{S3ATestUtils.disableFilesystemCaching(conf)}} before FileSystem.get(), which guarantees unique instances.

Which makes me think: yes, closing the FS in teardown is overkill except in the special case of "creates a new filesystem", either explicitly or implicitly. As [~mackrorysd] says: surprising this hasn't surfaced before. But to fix it means that it should be done properly.

* If FS instances were always closed and new ones created (i.e. no caching), test cases would run way, way slower.
* Those tests which do need their own FS instance can close it in teardown, and set it up themselves.
* And those tests which absolutely must have FileSystem.get() return their specific filesystem must (a) enable caching and (b) remove their FS from the cache in teardown (e.g. FileSystem.closeAll()).

This is probably going to force a review of all the tests; maybe have some method in AbstractS3ATestBase:

{code}
protected boolean uniqueFilesystemInstance() {
  return false;
}
{code}

then
# if true, in createConfiguration() call {{disableFilesystemCaching}}
# if true, in teardown: close the FS.

Next:
* Go through all uses of {{disableFilesystemCaching}}, and in those tests have {{uniqueFilesystemInstance}} return true.
* Look at uses of {{S3ATestUtils.createTestFilesystem()}} & make sure they are closing it after.

This is going to span all the tests. Joy.

> S3A integration test failures: FileSystem is closed!
- without parallel test run
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-15819
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15819
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.1.1
>            Reporter: Gabor Bota
>            Assignee: Gabor Bota
>            Priority: Critical
>         Attachments: S3ACloseEnforcedFileSystem.java, closed_fs_closers_example_5klines.log.zip
>
>
> Running the integration tests for hadoop-aws ({{mvn -Dscale verify}}) against Amazon AWS S3 (eu-west-1, us-west-1, with no s3guard) we see a lot of these failures:
> {noformat}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.408 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.staging.integration.ITDirectoryCommitMRJob
> [ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITDirectoryCommitMRJob)  Time elapsed: 0.027 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 4.345 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJob
> [ERROR] testStagingDirectory(org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJob)  Time elapsed: 0.021 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJob)  Time elapsed: 0.022 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.489 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJobBadDest
> [ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJobBadDest)  Time elapsed: 0.023 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.695 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.magic.ITMagicCommitMRJob
> [ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.magic.ITMagicCommitMRJob)  Time elapsed: 0.039 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.015 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.ITestS3ACommitterFactory
> [ERROR] testEverything(org.apache.hadoop.fs.s3a.commit.ITestS3ACommitterFactory)  Time elapsed: 0.014 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> {noformat}
>
> The big issue is that the tests are running in a serial manner - no test is running on top of another - so we should not see tests failing like this. The issue could be in how we handle org.apache.hadoop.fs.FileSystem#CACHE: the tests share the same S3AFileSystem, so if test A uses a FileSystem and closes it in teardown, then test B will get the same FileSystem object from the cache and try to use it, but it is closed.
>
> We see this a lot in our downstream testing too. It's not possible to tell whether a failed regression test result is an implementation issue in the runtime code or a test implementation problem.
>
> I've checked when and what closes the S3AFileSystem with a slightly modified version of S3AFileSystem which logs the closers of the fs if an error occurs. I'll attach this modified java file for reference. See the next example of the result when it's running:
> {noformat}
> 2018-10-04 00:52:25,596 [Thread-4201] ERROR s3a.S3ACloseEnforcedFileSystem (S3ACloseEnforcedFileSystem.java:checkIfClosed(74)) - Use after close():
> java.lang.RuntimeException: Using closed FS!
>   at org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.checkIfClosed(S3ACloseEnforcedFileSystem.java:73)
>   at org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.mkdirs(S3ACloseEnforcedFileSystem.java:474)
>   at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.mkdirs(AbstractFSContractTestBase.java:338)
>   at org.apache.hadoop.fs.contract.AbstractFSContractTestBase.setup(AbstractFSContractTestBase.java:193)
>   at org.apache.hadoop.fs.s3a.ITestS3AClosedFS.setup(ITestS3AClosedFS.java:40)
>   at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem (S3ACloseEnforcedFileSystem.java:logClosers(61)) - FS was closed 6 times by:
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem (S3ACloseEnforcedFileSystem.java:logClosers(64)) - ----------------- close() #1 ---------------
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem (S3ACloseEnforcedFileSystem.java:logClosers(65)) - java.lang.RuntimeException: close() called here
>   at org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.close(S3ACloseEnforcedFileSystem.java:83)
>   at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
>   at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
>   at org.apache.hadoop.fs.s3a.AbstractS3ATestBase.teardown(AbstractS3ATestBase.java:55)
>   at org.apache.hadoop.fs.s3a.commit.AbstractCommitITest.teardown(AbstractCommitITest.java:206)
>   at org.apache.hadoop.fs.s3a.auth.ITestAssumedRoleCommitOperations.teardown(ITestAssumedRoleCommitOperations.java:90)
>   at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem (S3ACloseEnforcedFileSystem.java:logClosers(64)) - ----------------- close() #2 ---------------
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem (S3ACloseEnforcedFileSystem.java:logClosers(65)) - java.lang.RuntimeException: close() called here
>   at org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.close(S3ACloseEnforcedFileSystem.java:83)
>   at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
>   at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
>   at org.apache.hadoop.fs.s3a.AbstractS3ATestBase.teardown(AbstractS3ATestBase.java:55)
>   at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> 2018-10-04 00:52:25,597 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem (S3ACloseEnforcedFileSystem.java:logClosers(64)) - ----------------- close() #3 ---------------
> 2018-10-04 00:52:25,598 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem (S3ACloseEnforcedFileSystem.java:logClosers(65)) - java.lang.RuntimeException: close() called here
>   at org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.close(S3ACloseEnforcedFileSystem.java:83)
>   at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
>   at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
>   at org.apache.hadoop.fs.s3a.AbstractS3ATestBase.teardown(AbstractS3ATestBase.java:55)
>   at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> (...)
> {noformat}
>
> From the logs it seems that the closers are the teardown methods of the abstract test classes, so the same FileSystem object gets reused between tests.
>
> To run a test with the modified fs, the following should be in the config:
> {noformat}
> <property>
>   <name>fs.s3a.impl</name>
>   <value>org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem</value>
>   <description>The implementation class of the S3A Filesystem</description>
> </property>
> {noformat}
>
> I filed this jira to solve this issue, and to start a conversation if needed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
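[Editor's appendix] The cache-poisoning sequence described in the issue - test A closes the cached FileSystem in teardown, test B then receives the same closed instance from FileSystem.get() - can be sketched with a tiny self-contained model. This is hypothetical illustration code, not Hadoop's actual FileSystem.CACHE implementation; the class names MockFs and FsCache are invented, and disableCache stands in for {{S3ATestUtils.disableFilesystemCaching(conf)}}:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a FileSystem: only tracks whether it was closed.
class MockFs {
    private boolean closed = false;
    void close() { closed = true; }
    boolean isClosed() { return closed; }
}

// Minimal stand-in for FileSystem.get() backed by a static cache.
class FsCache {
    private static final Map<String, MockFs> CACHE = new HashMap<>();

    // With caching disabled, every call returns a unique instance;
    // otherwise the instance is shared across all callers for that URI.
    static MockFs get(String uri, boolean disableCache) {
        if (disableCache) {
            return new MockFs();
        }
        return CACHE.computeIfAbsent(uri, u -> new MockFs());
    }
}

public class CachePoisonDemo {
    public static void main(String[] args) {
        // Test A gets the shared cached FS and closes it in teardown.
        MockFs a = FsCache.get("s3a://bucket", false);
        a.close();

        // Test B gets the *same* poisoned instance back from the cache:
        // any use now fails with "FileSystem is closed!".
        MockFs b = FsCache.get("s3a://bucket", false);
        System.out.println("shared instance closed: " + b.isClosed());

        // With caching disabled, each test owns a fresh instance,
        // so one test's close() cannot break another test.
        MockFs c = FsCache.get("s3a://bucket", true);
        System.out.println("unique instance closed: " + c.isClosed());
    }
}
```

This is exactly why the comment above distinguishes tests that need a unique FS (disable caching, close in teardown) from tests that rely on the shared cached instance (which must not be closed behind other tests' backs).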