geode-issues mailing list archives

From "Mangesh Deshmukh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (GEODE-3709) Geode Version: 1.1.1 In one of the project we a...
Date Mon, 23 Oct 2017 16:00:01 GMT

    [ https://issues.apache.org/jira/browse/GEODE-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215335#comment-16215335 ]

Mangesh Deshmukh edited comment on GEODE-3709 at 10/23/17 3:59 PM:
-------------------------------------------------------------------

[^20171020.zip]


was (Author: mangeshd):

Here are the detailed logs, stats, and tcpdump for the problem. I could not include the
server-side log because it puts us over the file-size limit for attachments on this ticket.
Let me know how you would like me to upload it.
Here is the setup:
dca-prd-rstc11 - This client creates entries in the CircularOffers and CircularStoreMap
regions. The entries are seen replicating across all servers immediately, as expected,
without any delay.
dca-prd-rstc12 - This client has registered interest in ALL_KEYS for the above 2 regions
with receive_values=false (see the sketch below). Therefore, whenever entries are created, all
you see is an "invalidate" notification from the Geode server to this client.
dca-prd-gdcs11 - This is the Geode server that carries the subscription for dca-prd-rstc12 and
is therefore responsible for notifying dca-prd-rstc12 of any changes occurring on these regions.
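
Roughly, the subscription on dca-prd-rstc12 is set up as in the sketch below (the locator
endpoint, durability flag, and region handling are illustrative, not our exact pool configuration):

    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.client.ClientCache;
    import org.apache.geode.cache.client.ClientCacheFactory;
    import org.apache.geode.cache.client.ClientRegionShortcut;

    public class SubscriberClient {
      public static void main(String[] args) {
        // Subscriptions are enabled on the pool so the server maintains a queue for this client.
        ClientCache cache = new ClientCacheFactory()
            .addPoolLocator("dca-prd-gdcs11", 10334) // illustrative locator endpoint
            .setPoolSubscriptionEnabled(true)
            .create();

        Region<String, Object> offers = cache
            .<String, Object>createClientRegionFactory(ClientRegionShortcut.PROXY)
            .create("CircularOffers");

        // ALL_KEYS with receiveValues=false: the server delivers invalidate notifications
        // rather than full values (isDurable=false here).
        offers.registerInterest("ALL_KEYS", false, false);
      }
    }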

tcpdump analysis:
- We can see that at frame 39 (timestamp 10:59:54.279359) on dca-prd-rstc12, the client starts
receiving notifications.
- Everything looks normal until frame 284750, apart from intermittent retransmissions and
TCP zero windows.
- dca-prd-rstc12 sends an ACK at frame 284752.
- The next response from dca-prd-gdcs11, however, is received only after 200ms.
- The pattern described in the last 2 steps continues thereafter.
- Comparing the tcpdumps from both sides confirms that it is indeed the Geode server that is
slow to respond.

I tried looking at the server log and the code to see if there is any condition/config that
would cause the Geode server to add this constant delay, but I was unable to determine the
cause. It would be great if you could investigate further and let us know your findings.
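
For context, this is the kind of server-side knob I was reviewing; a sketch of the client
subscription queue configuration on the cache server (the port, capacity, eviction policy, and
disk store name below are illustrative, not our production values), although none of these
obviously explains the constant delay:

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.server.CacheServer;
    import org.apache.geode.cache.server.ClientSubscriptionConfig;

    public class SubscriptionQueueConfig {
      public static void main(String[] args) throws Exception {
        Cache cache = new CacheFactory().create();

        CacheServer server = cache.addCacheServer();
        server.setPort(40404); // illustrative port

        // The per-client subscription (HA) queue is bounded by this capacity; when it fills,
        // puts into the queue block in checkQueueSizeConstraint() until permits are reclaimed
        // or the queue is allowed to grow (see the code quoted below).
        ClientSubscriptionConfig subscription = server.getClientSubscriptionConfig();
        subscription.setCapacity(80000);                        // illustrative; entries when eviction-policy is "entry"
        subscription.setEvictionPolicy("entry");                // overflow by entry count
        subscription.setDiskStoreName("subscriptionOverflow");  // illustrative disk store for overflow

        server.start();
      }
    }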

This could be a potential breaking point for us because the clients are essentially not in
sync with the data, and we end up serving inconsistent data for the same request. BTW, we have
GemFire 7.x as well and see a similar issue in that setup. We really like Geode/GemFire as a
product, and it would be a shame to let go of it because of this issue.

> Geode Version: 1.1.1    In one of the project we a...
> -----------------------------------------------------
>
>                 Key: GEODE-3709
>                 URL: https://issues.apache.org/jira/browse/GEODE-3709
>             Project: Geode
>          Issue Type: Improvement
>          Components: client queues
>            Reporter: Gregory Chase
>         Attachments: 20171006-logs-stats-tds.zip, 20171020.zip, CacheClientProxyStats_sentBytes.gif,
>                      DistributionStats_receivedBytes_CacheClientProxyStats_sentBytes.gif,
>                      gf-rest-stats-12-05.gfs, myStatisticsArchiveFile-04-01.gfs
>
>
> Geode Version: 1.1.1
> In one of our projects we are using Geode. Here is a summary of how we use it.
> - Geode servers have multiple regions. 
> - Clients subscribe to the data from these regions.
> - Clients register interest in all the entries, so they get updates about every entry, from
> creation through modification to deletion.
> - One of the regions usually has 5-10 million entries with a TTL of 24 hours. Most entries
> are added within an hour's span, one after another, so when the TTL kicks in they are often
> destroyed within an hour as well (see the sketch after this list).
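>
> A rough Java sketch of that region's server-side setup (the REPLICATE shortcut, region name,
> and programmatic style are illustrative; our actual configuration differs in the details):
>
>     import java.util.concurrent.TimeUnit;
>     import org.apache.geode.cache.Cache;
>     import org.apache.geode.cache.CacheFactory;
>     import org.apache.geode.cache.ExpirationAction;
>     import org.apache.geode.cache.ExpirationAttributes;
>     import org.apache.geode.cache.Region;
>     import org.apache.geode.cache.RegionShortcut;
>     public class TtlRegionSetup {
>       public static void main(String[] args) {
>         Cache cache = new CacheFactory().create();
>         // 24-hour entry TTL; each expired entry is destroyed, and every destroy must be
>         // delivered to all clients that registered interest in the region.
>         Region<String, Object> offers = cache
>             .<String, Object>createRegionFactory(RegionShortcut.REPLICATE)
>             .setEntryTimeToLive(new ExpirationAttributes(
>                 (int) TimeUnit.HOURS.toSeconds(24), ExpirationAction.DESTROY))
>             .create("CircularOffers");
>       }
>     }
>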
> Problem:
> Every now and then we observe the following message:
>     Client queue for _gfe_non_durable_client_with_id_x.x.x.x(14229:loner):42754:e4266fc4_2_queue client is full.
> This seems to happen when the TTL kicks in on the region with 5-10 million entries. Entries
> start getting evicted (deleted); the updates (destroys) now must be sent to the clients. We
> see that the updates do happen for a while, but suddenly they stop and the queue size starts
> growing. This is becoming a major issue for the smooth functioning of our production setup.
> Any help will be much appreciated.
> I did some groundwork by downloading and looking at the code. I see references to 2 issues,
> #37581 and #51400, but I am unable to view the actual JIRA tickets (they require login
> credentials). Hopefully this helps someone looking at the issue.
> Here is the pertinent code:
>    @Override
>     @edu.umd.cs.findbugs.annotations.SuppressWarnings("TLW_TWO_LOCK_WAIT")
>     void checkQueueSizeConstraint() throws InterruptedException {
>       if (this.haContainer instanceof HAContainerMap && isPrimary()) { // Fix for bug 39413
>         if (Thread.interrupted())
>           throw new InterruptedException();
>         synchronized (this.putGuard) {
>           if (putPermits <= 0) {
>             synchronized (this.permitMon) {
>               if (reconcilePutPermits() <= 0) {
>                 if (region.getSystem().getConfig().getRemoveUnresponsiveClient()) {
>                   isClientSlowReciever = true;
>                 } else {
>                   try {
>                     long logFrequency = CacheClientNotifier.DEFAULT_LOG_FREQUENCY;
>                     CacheClientNotifier ccn = CacheClientNotifier.getInstance();
>                     if (ccn != null) { // check needed for junit tests
>                       logFrequency = ccn.getLogFrequency();
>                     }
>                     if ((this.maxQueueSizeHitCount % logFrequency) == 0) {
>                       logger.warn(LocalizedMessage.create(
>                           LocalizedStrings.HARegionQueue_CLIENT_QUEUE_FOR_0_IS_FULL,
>                           new Object[] {region.getName()}));
>                       this.maxQueueSizeHitCount = 0;
>                     }
>                     ++this.maxQueueSizeHitCount;
>                     this.region.checkReadiness(); // fix for bug 37581
>                     // TODO: wait called while holding two locks
>                     this.permitMon.wait(CacheClientNotifier.eventEnqueueWaitTime);
>                     this.region.checkReadiness(); // fix for bug 37581
>                     // Fix for #51400. Allow the queue to grow beyond its
>                     // capacity/maxQueueSize, if it is taking a long time to
>                     // drain the queue, either due to a slower client or the
>                     // deadlock scenario mentioned in the ticket.
>                     reconcilePutPermits();
>                     if ((this.maxQueueSizeHitCount % logFrequency) == 1) {
>                       logger.info(LocalizedMessage
>                           .create(LocalizedStrings.HARegionQueue_RESUMING_WITH_PROCESSING_PUTS));
>                     }
>                   } catch (InterruptedException ex) {
>                     // TODO: The line below is meaningless. Comment it out later
>                     this.permitMon.notifyAll();
>                     throw ex;
>                   }
>                 }
>               }
>             } // synchronized (this.permitMon)
>           } // if (putPermits <= 0)
>           --putPermits;
>         } // synchronized (this.putGuard)
>       }
>     }
> *Reporter*: Mangesh Deshmukh
> *E-mail*: [mailto:mdeshmukh@quotient.com]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
