geode-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GEODE-654) Interaction of GII and LIFO eviction can cause high CPU and recovery issues
Date Sat, 12 Dec 2015 00:44:46 GMT

    [ https://issues.apache.org/jira/browse/GEODE-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053859#comment-15053859
] 

ASF subversion and git services commented on GEODE-654:
-------------------------------------------------------

Commit f7670e1688a95a026a0376724bfe2277cb16eb94 in incubator-geode's branch refs/heads/develop
from Vince Ford
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=f7670e1 ]

GEODE-654: Changes LIFO queue to remove touched entry

Removes any entry that has been touched while processing
the LIFO queue. Prevents issue when eviction gets stuck
during startup and GII and an entry can not be removed.


> Interaction of GII and LIFO eviction can cause high CPU and recovery issues
> ---------------------------------------------------------------------------
>
>                 Key: GEODE-654
>                 URL: https://issues.apache.org/jira/browse/GEODE-654
>             Project: Geode
>          Issue Type: Bug
>          Components: core
>            Reporter: Vincent Ford
>            Assignee: Vincent Ford
>
> This was seen in a customer engagement with GemFire but also effects Geode and the usage
of  LIFO eviction algorithm.  The GemFire support team led by David Wisler identified the
following pattern. In the specific case the GII was blocked from completing processing and
also blocked a put into the processing queue of a Gateway ( not part of Geode but would effect
any usage of LIFO eviction ).
> (** 1 **)
> "Pooled High Priority Message Processor 4" daemon prio=10 tid=0x00002ba27c008800 nid=0x3fc5
waiting for monitor entry [0x00002ba1894f2000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at com.gemstone.gemfire.internal.cache.InitialImageOperation$RequestImageMessage.chunkEntries(InitialImageOperation.java:1555)
> 	- waiting to lock <0x00000005b81413d0> (a com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> 1554	synchronized(mapEntry) { // bug #46042 must sync to make sure the tag goes with
the value
> 1555		VersionSource<?> id = stamp.getMemberID();
> 1556		if (id == null) { id = myId; }
> 1557			foundIds.add(id);
> 1558			// if the recipient passed a version vector, use it to filter out
> 1559			// entries the recipient already has
> (** 2 **)
> "pool-26-thread-4" prio=10 tid=0x00002ba390042800 nid=0x4073 waiting for monitor entry
[0x00002ba18b4d0000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:205)
> 	- waiting to lock <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:440)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:416)
> 	at com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:964)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.notifyGatewayHubs(LocalRegion.java:6295)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5858)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3203)
> 	- locked <0x00000005b81413d0> (a com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> Below it is clear that we are not getting out of the eviction loop, for some reason the
method "evictEntry" is returning 0 as the amount of evicted bytes and that's why following
 thread keeps swapping between BLOCKED and RUNNABLE and making no progress.
> [2015-09-08 12:39:09]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f waiting for monitor entry
[0x00002ba18b34b000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.evictEntry(AbstractLRURegionMap.java:302)
> 	- locked <0x00000005b9b7c8c8> (a com.gemstone.gemfire.internal.cache.VersionedThinDiskLRURegionEntry)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:522)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
> 	at com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
> 	at com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
> 	- locked <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> [2015-09-08 12:40:14]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f runnable [0x00002ba18b34b000]
>    java.lang.Thread.State: RUNNABLE
> 	at com.gemstone.gemfire.internal.cache.lru.NewLIFOClockHand.getLRUEntry(NewLIFOClockHand.java:40)
> 	- locked <0x00000005b8c55c90> (a com.gemstone.gemfire.internal.cache.lru.NewLRUClockHand$HeadLock)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:520)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
> 	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
> 	at com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
> 	at com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
> 	at com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
> 	at com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
> 	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
> 	- locked <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> From the AbstractLRURegionMap...
> 527		while (_getCCHelper().mustEvict(stats, _getOwner(), bytesToEvict)) {
> 528			LRUEntry removalEntry = (LRUEntry)_getLruList().getLRUEntry();
> 529			if (removalEntry != null) {
> 530				if (evictEntry(removalEntry, stats) != 0) {
> 531					if (debug) {
> 532						debugLogging("evicted entry key(2)=" + removalEntry.getKey()
> 533							+ " total entry size is now: " + getTotalEntrySize()
> 534							+ " bytesToEvict :" + bytesToEvict);
> 535			//                if (NewLRUClockHand.debug && logWriter != null) {
> 536			//                  _getLruList().dumpList(logWriter);
> 537			//                }
> 538					}
> 539					stats.incEvictions();
> 540					if (_isOwnerALocalRegion()) {
> 541						if (_getOwner() instanceof BucketRegion) {
> 542							((BucketRegion)_getOwner()).incEvictions(1);
> 543						}
> 544					}
> 545					if (debug)
> 546						debugLogging("evictions=" + stats.getEvictions());
> 547						_getCCHelper().afterEviction();
> 548
> 549				}
> 550			}
> 551
> 552			else {
> 553				if (debug && getTotalEntrySize() != 0) {
> 554					debugLogging("leaving evict loop early");
> 555				}
> 556
> 557				break;
> 558			}
> 559		}
> Summary---
> 1. Serial gateway sender queues use LIFO_MEMORY as their eviction algorithm and others
may choose this option as well
> 2. Due to the above, when we mustEvict, we call getLRUEntry implementation of NewLIFEClockHand

> 3. Unfortunately, this LIFO based implementation doesn't seem to work in all cases during
gII. This implementation always returns simply the last entry in the queue without directly
removing it. It is possible to have the last element on the list (returned by getLRUEntry
for LIFO_MEMORY) be in a state that causes us to not evict any bytes, but then just continue
looping forever expecting state to change.
> 4. This appears to be a corner case race condition and will be difficult to test or reproduce
given the timings and data distribution required to encounter the issue. For example to encounter
the distributed system created by Geode must be in GII between members, have entry in queue
that can not be evicted ( part of transaction, certain internal tokens...), queue is in state
requiring eviction. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message