ignite-issues mailing list archives

From "Alexander Belyak (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-11783) Open file limit for deb distribution
Date Thu, 18 Apr 2019 16:05:00 GMT
Alexander Belyak created IGNITE-11783:
-----------------------------------------

             Summary: Open file limit for deb distribution
                 Key: IGNITE-11783
                 URL: https://issues.apache.org/jira/browse/IGNITE-11783
             Project: Ignite
          Issue Type: Bug
          Components: persistence
    Affects Versions: 2.7
         Environment: ubuntu-16.04
            Reporter: Alexander Belyak


Steps to reproduce:
1) Install Ignite from the deb package on Ubuntu 16.04
2) Start a node with persistence enabled
3) Create 5 caches (or one cache with 4000+ partitions)
Error text:
{noformat}
[18:29:44,369][INFO][exchange-worker-#43][GridCacheDatabaseSharedManager] Restoring partition state for local groups [cntPartStateWal=0, lastCheckpointId=bd24ff23-da6f-46e5-bafd-b643db3870d4]
[18:29:51,864][SEVERE][exchange-worker-#43][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to initialize partition file: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin]]
class org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to initialize partition file: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:444)
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.ensure(FilePageStore.java:650)
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.ensure(FilePageStoreManager.java:712)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2472)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLastUpdates(GridCacheDatabaseSharedManager.java:2419)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreState(GridCacheDatabaseSharedManager.java:1628)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1302)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1453)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:806)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2667)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2539)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.FileSystemException: /usr/share/apache-ignite/work/db/node00-f49af718-48da-4186-b664-62aca736bdc9/cache-SQL_PUBLIC_VERTEX_TBL/part-913.bin: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newAsynchronousFileChannel(UnixFileSystemProvider.java:196)
        at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:248)
        at java.nio.channels.AsynchronousFileChannel.open(AsynchronousFileChannel.java:301)
        at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIO.<init>(AsyncFileIO.java:57)
        at org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory.create(AsyncFileIOFactory.java:53)
        at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.init(FilePageStore.java:416)
        ... 12 more
{noformat}

This happens because the systemd service description (/etc/systemd/system/apache-ignite@.service)
does not contain
{noformat}
LimitNOFILE=500000
(possible with) LimitNPROC=500000
{noformat}
see: https://fredrikaverpil.github.io/2016/04/27/systemd-and-resource-limits/
Possibly, the installation script should also add:
*  "fs.file-max = 2097152" to "/etc/sysctl.conf"
*  the following lines to /etc/security/limits.conf:
{noformat}
*         hard    nofile      500000
*         soft    nofile      500000
root      hard    nofile      500000
root      soft    nofile      500000
{noformat}
see: https://easyengine.io/tutorials/linux/increase-open-files-limit
It would also be great if the Ignite start process checked the file limits and printed a link
to the documentation page if:
1) persistence is enabled
2) the limit is below some value (<=4096)
3) the limit is below the total number of partitions on the current node
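
The startup check proposed above could look roughly like the following sketch. It is not existing Ignite code: the class FileLimitCheck, its methods, and the warning text are hypothetical, and it assumes the three conditions combine as "persistence enabled AND (limit <= 4096 OR limit < local partition count)". The limit is read through com.sun.management.UnixOperatingSystemMXBean, which is only available on Unix-like JVMs, so the sketch returns -1 and skips the check elsewhere.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical startup check; names and message text are illustrative only. */
final class FileLimitCheck {
    /** Threshold taken from the issue text (<=4096). */
    static final long MIN_LIMIT = 4096;

    /** Returns the JVM's max file descriptor count, or -1 if it cannot be determined. */
    static long maxOpenFiles() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        if (os instanceof com.sun.management.UnixOperatingSystemMXBean)
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();

        return -1; // Not a Unix-like JVM: limit unknown.
    }

    /** Returns a warning message, or null when no warning is needed. */
    static String checkLimit(boolean persistenceEnabled, long limit, long localPartitions) {
        if (!persistenceEnabled || limit < 0)
            return null;

        if (limit <= MIN_LIMIT || limit < localPartitions)
            return "Open file limit (" + limit + ") may be too low for " + localPartitions
                + " local partitions, see the documentation on OS limits.";

        return null;
    }

    public static void main(String[] args) {
        // 5000 is an arbitrary example partition count, not a value read from Ignite.
        String warn = checkLimit(true, maxOpenFiles(), 5000);

        System.out.println(warn == null ? "Open file limit looks sufficient or is unknown." : warn);
    }
}
```

Keeping checkLimit free of any JVM lookup makes the decision easy to test with fixed numbers; only maxOpenFiles touches the platform.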
One more thing: if Ignite gets a "Too many open files" exception in the middle of rebalancing,
the situation is terrible, because the whole cluster just stops working. This can happen if each
node is already close to its limit and:
* someone creates an additional cache
* the topology changes (a node is removed) and each remaining node gets more local partitions.
Can we remember the limit on startup and check it each time we are about to create a local
partition?
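
A minimal sketch of that idea, under stated assumptions: the class DescriptorBudget, its methods, and the "reserve" parameter (headroom for sockets, WAL segments and other non-partition files) are hypothetical and not part of Ignite. The limit is captured once at startup, and the current open descriptor count is compared against it before a new partition file is created.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical guard that remembers the descriptor limit captured at startup. */
final class DescriptorBudget {
    /** Limit remembered at startup, -1 if unknown. */
    private final long limit;

    /** Descriptors kept free for non-partition files (assumed value chosen by the caller). */
    private final long reserve;

    DescriptorBudget(long limit, long reserve) {
        this.limit = limit;
        this.reserve = reserve;
    }

    /** Captures the limit from the running JVM (Unix-like platforms only). */
    static DescriptorBudget fromJvm(long reserve) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        long max = os instanceof com.sun.management.UnixOperatingSystemMXBean
            ? ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount()
            : -1;

        return new DescriptorBudget(max, reserve);
    }

    /** Returns the number of descriptors currently open, or -1 if unknown. */
    static long openDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        return os instanceof com.sun.management.UnixOperatingSystemMXBean
            ? ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount()
            : -1;
    }

    /** True if 'needed' more files fit under the remembered limit; always true if the limit is unknown. */
    boolean canOpen(long currentlyOpen, long needed) {
        return limit < 0 || currentlyOpen + needed + reserve <= limit;
    }

    public static void main(String[] args) {
        DescriptorBudget budget = fromJvm(64); // 64 is an arbitrary example reserve.

        System.out.println("Can create one more partition file: " + budget.canOpen(openDescriptors(), 1));
    }
}
```

A caller would refuse or postpone partition creation when canOpen returns false, rather than failing later with a FileSystemException in the middle of rebalancing.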



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
