From issues-return-95484-archive-asf-public=cust-asf.ponee.io@ignite.apache.org Thu Apr 18 16:05:05 2019
Return-Path:
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id BA37018061A for ; Thu, 18 Apr 2019 18:05:04 +0200 (CEST)
Received: (qmail 21649 invoked by uid 500); 18 Apr 2019 16:05:01 -0000
Mailing-List: contact issues-help@ignite.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@ignite.apache.org
Delivered-To: mailing list issues@ignite.apache.org
Received: (qmail 21629 invoked by uid 99); 18 Apr 2019 16:05:01 -0000
Received: from mailrelay1-us-west.apache.org (HELO mailrelay1-us-west.apache.org) (209.188.14.139) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 18 Apr 2019 16:05:01 +0000
Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 5A124E2899 for ; Thu, 18 Apr 2019 16:05:00 +0000 (UTC)
Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 1147F25807 for ; Thu, 18 Apr 2019 16:05:00 +0000 (UTC)
Date: Thu, 18 Apr 2019 16:05:00 +0000 (UTC)
From: "Alexander Belyak (JIRA)"
To: issues@ignite.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Created] (IGNITE-11783) Open file limit for deb distribution
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

Alexander Belyak created IGNITE-11783:
-----------------------------------------

             Summary: Open file limit for deb distribution
                 Key: IGNITE-11783
                 URL: https://issues.apache.org/jira/browse/IGNITE-11783
             Project: Ignite
          Issue Type: Bug
          Components: persistence
    Affects Versions: 2.7
         Environment:
ubuntu-16.04
            Reporter: Alexander Belyak

Steps to reproduce:
1) Install Ignite from the deb package on Ubuntu 16.04
2) Start a node with persistence enabled
3) Create 5 caches (or one cache with 4000+ partitions)

Error text:
{noformat}
[18:29:44,369][INFO][exchange-worker-#43][GridCacheDatabaseSharedManager] Restoring partition state for local groups [cntPartStateWal=0, lastCheckpointId=bd24ff23-da6f-46e5-bafd-b643db3870d4]
[18:29:51,864][SEVERE][exchange-worker-#43][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to initialize partition file: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin]]
class org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to initialize partition file: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:444)
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.ensure(FilePageStore.java:650)
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.ensure(FilePageStoreManager.java:712)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2472)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLastUpdates(GridCacheDatabaseSharedManager.java:2419)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreState(GridCacheDatabaseSharedManager.java:1628)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1302)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1453)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:806)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2667)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2539)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.FileSystemException: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newAsynchronousFileChannel(UnixFileSystemProvider.java:196)
        at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:248)
        at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:301)
        at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.<init>(AsyncFileIO.java:57)
        at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory.create(AsyncFileIOFactory.java:53)
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:416)
        ... 12 more
{noformat}

This happens because the systemd service description (/etc/systemd/system/apache-ignite@.service) does not contain
{noformat}
LimitNOFILE=500000
(possibly together with)
LimitNPROC=500000
{noformat}
see: https://fredrikaverpil.github.io/2016/04/27/systemd-and-resource-limits/

Possibly, the installation script should also add:
* "fs.file-max = 2097152" to "/etc/sysctl.conf"
* into /etc/security/limits.conf:
{noformat}
* hard nofile 500000
* soft nofile 500000
root hard nofile 500000
root soft nofile 500000
{noformat}
see: https://easyengine.io/tutorials/linux/increase-open-files-limit

It would also be great if the Ignite startup process checked the file limits and printed a link to the documentation page if:
1) persistence is enabled
2) the limit is below some value (<=4096)
3) the limit is below the total number of partitions on the current node

One more thing: if Ignite gets a "Too many open files" exception in the middle of rebalancing, the situation is terrible, because the whole cluster just stops working. This can happen if each node has almost reached its limit and:
* someone creates an additional cache
* the topology changes (a node is removed) and each remaining node gets more local partitions.
Can we remember the limit on startup and check it each time we are about to create a local partition?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
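
The startup check and the per-partition check proposed above could look roughly like the following. This is only a sketch under stated assumptions, not existing Ignite code: the class name OpenFileLimitCheck, its methods, and the reserve parameter are hypothetical, and the only real API used is the JDK's com.sun.management.UnixOperatingSystemMXBean (getMaxFileDescriptorCount / getOpenFileDescriptorCount), which is available only on Unix-like JVMs. The 4096 threshold is the value suggested in the report.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Ignite. */
final class OpenFileLimitCheck {
    /** Threshold suggested in the report: limits at or below this are suspicious. */
    static final long MIN_RECOMMENDED_LIMIT = 4096;

    /**
     * Startup check. Returns the warnings to print; empty if persistence is off
     * or the limit is unknown (maxFds < 0).
     */
    static List<String> evaluate(boolean persistenceEnabled, long maxFds, long localPartitions) {
        List<String> warnings = new ArrayList<>();
        if (!persistenceEnabled || maxFds < 0)
            return warnings;
        if (maxFds <= MIN_RECOMMENDED_LIMIT)
            warnings.add("Open file limit " + maxFds + " is <= " + MIN_RECOMMENDED_LIMIT);
        if (maxFds < localPartitions)
            warnings.add("Open file limit " + maxFds + " is below the number of local partitions "
                + localPartitions);
        return warnings;
    }

    /**
     * Per-partition check: is there room for one more partition file while
     * keeping 'reserve' descriptors free for sockets, WAL segments, etc.?
     */
    static boolean canOpenPartition(long maxFds, long openFds, long reserve) {
        return maxFds < 0 || openFds + 1 + reserve <= maxFds;
    }

    /** Process limit on open files, or -1 if the JVM does not expose it. */
    static long maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean)
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        return -1;
    }

    /** Currently open descriptors, or -1 if the JVM does not expose it. */
    static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean)
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        return -1;
    }

    public static void main(String[] args) {
        long max = maxFileDescriptors();
        // 5 caches x 1024 partitions is an illustrative figure, not taken from a real node.
        for (String w : evaluate(true, max, 5 * 1024))
            System.out.println("WARNING: " + w);
        System.out.println("max=" + max + ", open=" + openFileDescriptors());
    }
}
```

Keeping the decision logic in pure methods (evaluate, canOpenPartition) and the MXBean lookups separate would make the thresholds easy to unit-test without changing real system limits.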