geode-issues mailing list archives

From "Mangesh Deshmukh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (GEODE-3709) Geode Version: 1.1.1 In one of the project we a...
Date Mon, 23 Oct 2017 16:35:01 GMT

    [ https://issues.apache.org/jira/browse/GEODE-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215335#comment-16215335 ]

Mangesh Deshmukh edited comment on GEODE-3709 at 10/23/17 4:34 PM:
-------------------------------------------------------------------

[^20171020.zip]

I accidentally deleted my comments while trying to attach the file to this issue, so I will repeat them here.

Attaching the logs, stats, and tcpdump for all the nodes involved in the issue. I could not add the server-side logs because of the size limit (they are about 128 MB after compression); let me know if they are needed and I will upload them.

Here is the setup:
- dca-prd-rstc11 - This is a Geode client that creates entries in the CircularOffers and CircularStoreMap regions. We can see that the entries are immediately distributed across all Geode servers, as expected.
- dca-prd-rstc12 - This is another Geode client that has registered interest in the above regions for all keys with receive_values=false. Therefore, whenever entries are created in these regions, this client only receives "invalidate" updates (see the sketch after this list).
- dca-prd-gdcs11 - This is the Geode server that maintains the subscription queue for dca-prd-rstc12 and thus notifies it of any updates whenever new entries are created in these regions.
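
For reference, here is a minimal sketch of how a client of this kind could register interest with receive_values=false. The locator host/port and the CACHING_PROXY shortcut are illustrative assumptions, not taken from our actual configuration.

    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.client.ClientCache;
    import org.apache.geode.cache.client.ClientCacheFactory;
    import org.apache.geode.cache.client.ClientRegionShortcut;

    public class InterestOnlyClient {
      public static void main(String[] args) {
        // "locator-host"/10334 is a placeholder for the real locator endpoint.
        ClientCache cache = new ClientCacheFactory()
            .addPoolLocator("locator-host", 10334)
            .setPoolSubscriptionEnabled(true) // required to receive server-to-client events
            .create();

        Region<String, Object> offers = cache
            .<String, Object>createClientRegionFactory(ClientRegionShortcut.CACHING_PROXY)
            .create("CircularOffers");

        // All keys, non-durable, receiveValues = false: the client is only told that a key
        // changed (an "invalidate"); it does not receive the new value itself.
        offers.registerInterest("ALL_KEYS", false, false);
      }
    }

Even with receiveValues=false the server still queues an event per key change for this client; it just delivers invalidates instead of the values.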

tcpdump Analysis:
- We can see that dca-prd-rstc12 starts getting updates at frame 39, timestamp 10:59:54.279359.
- These updates are fine until frame 284750 (with intermittent retransmissions and TCP zero windows).
- At frame 284752, dca-prd-rstc12 sends an ACK for the previous message.
- The next message from dca-prd-gdcs11, at frame 284756, arrives *delayed by about 200 ms*.
- Thereafter, the pattern in the two steps above keeps repeating.
- We can confirm that it is indeed the Geode server that is delaying the response by comparing the tcpdump captures from both sides.

I tried looking at the server logs and code to see if there is any condition or configuration that triggers this delayed response, but I could not pin it down to anything specific. I would appreciate your help in investigating this issue further.

This particular issue has become a breaking point for us because clients end up holding different copies of the data and serving different responses for the same request. Incidentally, we have other setups that use GemFire 7.x and see the same issue there. I understand that this is a forum for Geode only, but I thought this could be valuable information for troubleshooting the problem.



was (Author: mangeshd):
[^20171020.zip]

> Geode Version: 1.1.1    In one of the project we a...
> -----------------------------------------------------
>
>                 Key: GEODE-3709
>                 URL: https://issues.apache.org/jira/browse/GEODE-3709
>             Project: Geode
>          Issue Type: Improvement
>          Components: client queues
>            Reporter: Gregory Chase
>         Attachments: 20171006-logs-stats-tds.zip, 20171020.zip, CacheClientProxyStats_sentBytes.gif,
DistributionStats_receivedBytes_CacheClientProxyStats_sentBytes.gif, gf-rest-stats-12-05.gfs,
myStatisticsArchiveFile-04-01.gfs
>
>
> Geode Version: 1.1.1
> In one of our projects we are using Geode. Here is a summary of how we use it.
> - Geode servers have multiple regions. 
> - Clients subscribe to the data from these regions.
> - Clients register interest in all the entries, so they get updates about every entry from creation through modification to deletion.
> - One of the regions usually has 5-10 million entries with a TTL of 24 hours. Most entries are added one after another within about an hour, so when the TTL kicks in they are often destroyed within an hour as well (a sketch of such a TTL configuration follows this list).
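> A minimal sketch of a region configured with such a TTL follows; the region name, PARTITION shortcut, and standalone cache creation are illustrative assumptions, not our actual configuration.
>
>     import org.apache.geode.cache.Cache;
>     import org.apache.geode.cache.CacheFactory;
>     import org.apache.geode.cache.ExpirationAction;
>     import org.apache.geode.cache.ExpirationAttributes;
>     import org.apache.geode.cache.Region;
>     import org.apache.geode.cache.RegionShortcut;
>
>     public class TtlRegionSketch {
>       public static void main(String[] args) {
>         Cache cache = new CacheFactory().create();
>
>         // 24-hour entry TTL; expired entries are destroyed, and each destroy is
>         // propagated to every client that registered interest in the region.
>         Region<String, Object> region = cache
>             .<String, Object>createRegionFactory(RegionShortcut.PARTITION)
>             .setStatisticsEnabled(true) // expiration requires statistics to be enabled
>             .setEntryTimeToLive(new ExpirationAttributes(24 * 60 * 60, ExpirationAction.DESTROY))
>             .create("BigTtlRegion"); // hypothetical region name
>       }
>     }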
> Problem:
> Every now and then we observe the following message:
> 	Client queue for _gfe_non_durable_client_with_id_x.x.x.x(14229:loner):42754:e4266fc4_2_queue client is full.
> This seems to happen when the TTL kicks in on the region with 5-10 million entries. Entries start getting evicted (deleted), and the resulting updates (destroys) must now be sent to the clients. We see that the updates do flow for a while, but then they suddenly stop and the queue size starts growing. This has become a major issue for the smooth functioning of our production setup. Any help will be much appreciated.
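> For reference, the queue referred to in that message is the server-side client subscription queue, whose capacity and overflow behavior are configured on the CacheServer. A minimal sketch (the port, capacity value, and disk store name are illustrative assumptions, not our settings):
>
>     import java.io.IOException;
>
>     import org.apache.geode.cache.Cache;
>     import org.apache.geode.cache.CacheFactory;
>     import org.apache.geode.cache.server.CacheServer;
>     import org.apache.geode.cache.server.ClientSubscriptionConfig;
>
>     public class SubscriptionQueueConfigSketch {
>       public static void main(String[] args) throws IOException {
>         Cache cache = new CacheFactory().create();
>
>         CacheServer server = cache.addCacheServer();
>         server.setPort(40404); // illustrative port
>
>         // Cap the per-client subscription queue in memory and overflow older
>         // events to disk instead of holding everything on the heap.
>         ClientSubscriptionConfig subscription = server.getClientSubscriptionConfig();
>         subscription.setCapacity(100000);               // events kept in memory
>         subscription.setEvictionPolicy("entry");        // evict by entry count
>         subscription.setDiskStoreName("queueOverflow"); // assumes this disk store exists
>
>         server.start();
>       }
>     }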
> I did some groundwork by downloading and looking at the code. I see references to two issues, #37581 and #51400, but I am unable to view the actual tickets (they require login credentials). Hopefully this helps someone looking at the issue.
> Here is the pertinent code:
>     @Override
>     @edu.umd.cs.findbugs.annotations.SuppressWarnings("TLW_TWO_LOCK_WAIT")
>     void checkQueueSizeConstraint() throws InterruptedException {
>       if (this.haContainer instanceof HAContainerMap && isPrimary()) { // Fix for bug 39413
>         if (Thread.interrupted())
>           throw new InterruptedException();
>         synchronized (this.putGuard) {
>           if (putPermits <= 0) {
>             synchronized (this.permitMon) {
>               if (reconcilePutPermits() <= 0) {
>                 if (region.getSystem().getConfig().getRemoveUnresponsiveClient()) {
>                   isClientSlowReciever = true;
>                 } else {
>                   try {
>                     long logFrequency = CacheClientNotifier.DEFAULT_LOG_FREQUENCY;
>                     CacheClientNotifier ccn = CacheClientNotifier.getInstance();
>                     if (ccn != null) { // check needed for junit tests
>                       logFrequency = ccn.getLogFrequency();
>                     }
>                     if ((this.maxQueueSizeHitCount % logFrequency) == 0) {
>                       logger.warn(LocalizedMessage.create(
>                           LocalizedStrings.HARegionQueue_CLIENT_QUEUE_FOR_0_IS_FULL,
>                           new Object[] {region.getName()}));
>                       this.maxQueueSizeHitCount = 0;
>                     }
>                     ++this.maxQueueSizeHitCount;
>                     this.region.checkReadiness(); // fix for bug 37581
>                     // TODO: wait called while holding two locks
>                     this.permitMon.wait(CacheClientNotifier.eventEnqueueWaitTime);
>                     this.region.checkReadiness(); // fix for bug 37581
>                     // Fix for #51400. Allow the queue to grow beyond its
>                     // capacity/maxQueueSize, if it is taking a long time to
>                     // drain the queue, either due to a slower client or the
>                     // deadlock scenario mentioned in the ticket.
>                     reconcilePutPermits();
>                     if ((this.maxQueueSizeHitCount % logFrequency) == 1) {
>                       logger.info(LocalizedMessage
>                           .create(LocalizedStrings.HARegionQueue_RESUMING_WITH_PROCESSING_PUTS));
>                     }
>                   } catch (InterruptedException ex) {
>                     // TODO: The line below is meaningless. Comment it out later
>                     this.permitMon.notifyAll();
>                     throw ex;
>                   }
>                 }
>               }
>             } // synchronized (this.permitMon)
>           } // if (putPermits <= 0)
>           --putPermits;
>         } // synchronized (this.putGuard)
>       }
>     }
> *Reporter*: Mangesh Deshmukh
> *E-mail*: [mailto:mdeshmukh@quotient.com]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
