ignite-user mailing list archives

From Ray <ray...@cisco.com>
Subject "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze
Date Wed, 25 Jul 2018 10:03:51 GMT
I have a three-node Ignite 2.6 cluster set up with the following config.

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="segmentationPolicy" value="RESTART_JVM"/>
        <property name="peerClassLoadingEnabled" value="true"/>
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="storagePath" value="/data/ignite/persistence"/>
                <property name="walPath" value="/wal"/>
                <property name="walArchivePath" value="/wal/archive"/>
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="default_Region"/>
                        <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                        <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
                        <property name="persistenceEnabled" value="true"/>
                        <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                    </bean>
                </property>
                <property name="walMode" value="BACKGROUND"/>
                <property name="walFlushFrequency" value="5000"/>
                <property name="checkpointFrequency" value="600000"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="49500"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:49500</value>
                                <value>node2:49500</value>
                                <value>node3:49500</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans> 
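
For reference, the three SpEL size expressions in the data region config work out to the byte counts below. This is only a small sketch of the arithmetic; the helper name gib_to_bytes is made up for illustration and is not part of Ignite.

```shell
#!/bin/sh
# Bytes in one GiB: the config multiplies by 1024 three times.
GIB=$((1024 * 1024 * 1024))

# gib_to_bytes is a hypothetical helper: converts a GiB count to bytes.
gib_to_bytes() {
    echo $(( $1 * GIB ))
}

echo "initialSize              = $(gib_to_bytes 100) bytes"
echo "maxSize                  = $(gib_to_bytes 460) bytes"
echo "checkpointPageBufferSize = $(gib_to_bytes 8) bytes"
```

So the data region may grow to 460 GiB off-heap with an 8 GiB checkpoint page buffer, separate from the JVM heap set on the command line.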

I started the Ignite service on all three nodes with this command.

./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC \
  -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch \
  -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps \
  -J-XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime \
  -XX:+PrintGCApplicationConcurrentTime \
  -J-Xloggc:/spare/ignite/log/ignitegc-$(date +%Y_%m_%d-%H_%M).log \
  config/persistent-config.xml

When I use the Spark DataFrame API to ingest data into this cluster, the
cluster freezes after some time and no new data can be ingested into Ignite.
Both the client (Spark executor) and the server show the "Unable to await
partitions release latch within timeout: ServerLatch" exception. It starts at
line 51834 of the full log and looks like this:

[2018-07-25T09:45:42,177][WARN ][exchange-worker-#162][GridDhtPartitionsExchangeFuture]
Unable to await partitions release latch within timeout: ServerLatch [permits=2,
pendingAcks=[429edc2b-eb14-414f-a978-9bfe35443c8c,
6783732c-9a13-466f-800a-ad4c8d9be3bf], super=CompletableLatch
[id=exchange, topVer=AffinityTopologyVersion [topVer=239, minorTopVer=0]]]

Here's the full log from the server node that shows the exception.
07-25.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/07-25.zip>  
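
To see which nodes the latch is still waiting on across the whole log, the pendingAcks IDs can be pulled out of these warnings. The following is a rough sketch that assumes each warning sits on a single line in the actual log file (the excerpt above is wrapped by the mail client); the function name pending_acks and the file name ignite.log are placeholders of mine.

```shell
#!/bin/sh
# pending_acks (hypothetical helper): read an Ignite log on stdin and print
# each distinct node ID found in the pendingAcks=[...] list of the
# "Unable to await partitions release latch" warnings.
pending_acks() {
    grep 'Unable to await partitions release latch' |
        # Keep only the comma-separated content between pendingAcks=[ and ].
        sed -n 's/.*pendingAcks=\[\([^]]*\)].*/\1/p' |
        # One ID per line, spaces stripped, empty lines dropped, deduplicated.
        tr ',' '\n' |
        tr -d ' ' |
        sed '/^$/d' |
        sort -u
}
```

It would be used as `pending_acks < ignite.log`, and the printed IDs can then be matched against the node IDs each server logs at startup.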





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
