geode-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vincent Ford (JIRA)" <j...@apache.org>
Subject [jira] [Created] (GEODE-654) Interaction of GII and LIFO eviction can cause high CPU and recovery issues
Date Thu, 10 Dec 2015 23:07:11 GMT
Vincent Ford created GEODE-654:
----------------------------------

             Summary: Interaction of GII and LIFO eviction can cause high CPU and recovery
issues
                 Key: GEODE-654
                 URL: https://issues.apache.org/jira/browse/GEODE-654
             Project: Geode
          Issue Type: Bug
          Components: core
            Reporter: Vincent Ford


This was seen in a customer engagement with GemFire but also effects Geode and the usage of
 LIFO eviction algorithm.  The GemFire support team led by David Wisler identified the following
pattern. In the specific case the GII was blocked from completing processing and also blocked
a put into the processing queue of a Gateway ( not part of Geode but would effect any usage
of LIFO eviction ).

(** 1 **)
"Pooled High Priority Message Processor 4" daemon prio=10 tid=0x00002ba27c008800 nid=0x3fc5
waiting for monitor entry [0x00002ba1894f2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.gemstone.gemfire.internal.cache.InitialImageOperation$RequestImageMessage.chunkEntries(InitialImageOperation.java:1555)
	- waiting to lock <0x00000005b81413d0> (a com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)

1554	synchronized(mapEntry) { // bug #46042 must sync to make sure the tag goes with the value
1555		VersionSource<?> id = stamp.getMemberID();
1556		if (id == null) { id = myId; }
1557			foundIds.add(id);
1558			// if the recipient passed a version vector, use it to filter out
1559			// entries the recipient already has

(** 2 **)
"pool-26-thread-4" prio=10 tid=0x00002ba390042800 nid=0x4073 waiting for monitor entry [0x00002ba18b4d0000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:205)
	- waiting to lock <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:440)
	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:416)
	at com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:964)
	at com.gemstone.gemfire.internal.cache.LocalRegion.notifyGatewayHubs(LocalRegion.java:6295)
	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5858)
	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3203)
	- locked <0x00000005b81413d0> (a com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)

Below it is clear that we are not getting out of the eviction loop, for some reason the method
"evictEntry" is returning 0 as the amount of evicted bytes and that's why following  thread
keeps swapping between BLOCKED and RUNNABLE and making no progress.

[2015-09-08 12:39:09]
"pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f waiting for monitor entry [0x00002ba18b34b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.evictEntry(AbstractLRURegionMap.java:302)
	- locked <0x00000005b9b7c8c8> (a com.gemstone.gemfire.internal.cache.VersionedThinDiskLRURegionEntry)
	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:522)
	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
	at com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
	at com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
	at com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
	at com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
	at com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
	at com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
	- locked <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)

[2015-09-08 12:40:14]
"pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f runnable [0x00002ba18b34b000]
   java.lang.Thread.State: RUNNABLE
	at com.gemstone.gemfire.internal.cache.lru.NewLIFOClockHand.getLRUEntry(NewLIFOClockHand.java:40)
	- locked <0x00000005b8c55c90> (a com.gemstone.gemfire.internal.cache.lru.NewLRUClockHand$HeadLock)
	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:520)
	at com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
	at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
	at com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
	at com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
	at com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
	at com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
	at com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
	at com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
	at com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
	at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
	- locked <0x00000005b7624058> (a com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)

>From the AbstractLRURegionMap...
527		while (_getCCHelper().mustEvict(stats, _getOwner(), bytesToEvict)) {
528			LRUEntry removalEntry = (LRUEntry)_getLruList().getLRUEntry();
529			if (removalEntry != null) {
530				if (evictEntry(removalEntry, stats) != 0) {
531					if (debug) {
532						debugLogging("evicted entry key(2)=" + removalEntry.getKey()
533							+ " total entry size is now: " + getTotalEntrySize()
534							+ " bytesToEvict :" + bytesToEvict);
535			//                if (NewLRUClockHand.debug && logWriter != null) {
536			//                  _getLruList().dumpList(logWriter);
537			//                }
538					}
539					stats.incEvictions();
540					if (_isOwnerALocalRegion()) {
541						if (_getOwner() instanceof BucketRegion) {
542							((BucketRegion)_getOwner()).incEvictions(1);
543						}
544					}
545					if (debug)
546						debugLogging("evictions=" + stats.getEvictions());
547						_getCCHelper().afterEviction();
548
549				}
550			}
551
552			else {
553				if (debug && getTotalEntrySize() != 0) {
554					debugLogging("leaving evict loop early");
555				}
556
557				break;
558			}
559		}

Summary---
1. Serial gateway sender queues use LIFO_MEMORY as their eviction algorithm and others may
choose this option as well
2. Due to the above, when we mustEvict, we call getLRUEntry implementation of NewLIFEClockHand

3. Unfortunately, this LIFO based implementation doesn't seem to work in all cases during
gII. This implementation always returns simply the last entry in the queue without directly
removing it. It is possible to have the last element on the list (returned by getLRUEntry
for LIFO_MEMORY) be in a state that causes us to not evict any bytes, but then just continue
looping forever expecting state to change.
4. This appears to be a corner case race condition and will be difficult to test or reproduce
given the timings and data distribution required to encounter the issue. For example to encounter
the distributed system created by Geode must be in GII between members, have entry in queue
that can not be evicted ( part of transaction, certain internal tokens...), queue is in state
requiring eviction. 







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message