ignite-user mailing list archives

From Pavel Kovalenko <jokse...@gmail.com>
Subject Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze
Date Wed, 25 Jul 2018 13:42:03 GMT
Hello Ray,

According to your attached log, it seems that you have some network
problems. Could you please also share the logs from the nodes with temporary IDs
[429edc2b-eb14-414f-a978-9bfe35443c8c, 6783732c-9a13-466f-800a-ad4c8d9be3bf]?
The root cause should be on those nodes.
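
To find which nodes are being waited on without reading the whole log, the warning can be scanned for its pendingAcks list. The following is only a minimal sketch in Python and is not part of Ignite: the function name pending_ack_ids and both regular expressions are assumptions based on the format of the warning quoted below, and the sample line is that warning joined onto one line.

```python
import re

# Sample warning, taken from the log excerpt quoted in this thread.
LOG_LINE = (
    "[2018-07-25T09:45:42,177][WARN ][exchange-worker-#162]"
    "[GridDhtPartitionsExchangeFuture] Unable to await partitions release "
    "latch within timeout: ServerLatch [permits=2, "
    "pendingAcks=[429edc2b-eb14-414f-a978-9bfe35443c8c, "
    "6783732c-9a13-466f-800a-ad4c8d9be3bf], super=CompletableLatch "
    "[id=exchange, topVer=AffinityTopologyVersion [topVer=239, minorTopVer=0]]]"
)

# Assumed format: pendingAcks=[<uuid>, <uuid>, ...]
PENDING_ACKS = re.compile(r"pendingAcks=\[([^\]]*)\]")
UUID = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def pending_ack_ids(text):
    """Return the distinct node IDs found in pendingAcks=[...] lists,
    in order of first appearance."""
    seen = []
    for match in PENDING_ACKS.finditer(text):
        for node_id in UUID.findall(match.group(1)):
            if node_id not in seen:
                seen.append(node_id)
    return seen


if __name__ == "__main__":
    for node_id in pending_ack_ids(LOG_LINE):
        print(node_id)
```

Running the same function over the full text of a server log (read with open(path).read()) would list every node ID that appeared in a pendingAcks list, which are the nodes whose logs are worth collecting.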

2018-07-25 13:03 GMT+03:00 Ray <rayliu@cisco.com>:

> I have a three-node Ignite 2.6 cluster set up with the following config.
>
>     <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
>         <property name="segmentationPolicy" value="RESTART_JVM"/>
>         <property name="peerClassLoadingEnabled" value="true"/>
>         <property name="failureDetectionTimeout" value="60000"/>
>         <property name="dataStorageConfiguration">
>             <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>             <property name="storagePath" value="/data/ignite/persistence"/>
>
>             <property name="walPath" value="/wal"/>
>             <property name="walArchivePath" value="/wal/archive"/>
>             <property name="defaultDataRegionConfiguration">
>                 <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>                     <property name="name" value="default_Region"/>
>                     <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
>                     <property name="maxSize" value="#{460L * 1024 * 1024 * 1024}"/>
>                     <property name="persistenceEnabled" value="true"/>
>                     <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
>                 </bean>
>             </property>
>             <property name="walMode" value="BACKGROUND"/>
>             <property name="walFlushFrequency" value="5000"/>
>             <property name="checkpointFrequency" value="600000"/>
>             </bean>
>         </property>
>         <property name="discoverySpi">
>                 <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
>                     <property name="localPort" value="49500"/>
>                     <property name="ipFinder">
>                         <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
>
>                             <property name="addresses">
>                                 <list>
>                                 <value>node1:49500</value>
>                                 <value>node2:49500</value>
>                                 <value>node3:49500</value>
>                                 </list>
>                             </property>
>                         </bean>
>                     </property>
>                 </bean>
>             </property>
>             <property name="gridLogger">
>             <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
>                 <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
>             </bean>
>         </property>
>     </bean>
> </beans>
>
> And I used this command to start Ignite service on three nodes.
>
> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
> -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
> -J-XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCApplicationConcurrentTime
> -J-Xloggc:/spare/ignite/log/ignitegc-$(date +%Y_%m_%d-%H_%M).log
> config/persistent-config.xml
>
> When I use the Spark DataFrame API to ingest data into this cluster, the
> cluster freezes after some time and no new data can be ingested into
> Ignite.
> Both the client (Spark executor) and the server show the "Unable to await
> partitions release latch within timeout: ServerLatch" exception, starting
> from line 51834 in the full log, like this:
>
> [2018-07-25T09:45:42,177][WARN
> ][exchange-worker-#162][GridDhtPartitionsExchangeFuture] Unable to await
> partitions release latch within timeout: ServerLatch [permits=2,
> pendingAcks=[429edc2b-eb14-414f-a978-9bfe35443c8c,
> 6783732c-9a13-466f-800a-ad4c8d9be3bf], super=CompletableLatch
> [id=exchange, topVer=AffinityTopologyVersion [topVer=239, minorTopVer=0]]]
>
> Here's the full log from the server node that has the exception.
> 07-25.zip
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/07-25.zip>
