Mailing-List: contact issues-help@geode.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@geode.apache.org
Date: Thu, 26 Oct 2017 02:17:00 +0000 (UTC)
From: "Mangesh Deshmukh (JIRA)" <jira@apache.org>
To: issues@geode.apache.org
Message-ID: <JIRA.13105538.1506537986000.63974.1508984220150@Atlassian.JIRA>
In-Reply-To: <JIRA.13105538.1506537986000@Atlassian.JIRA>
References: <JIRA.13105538.1506537986000@Atlassian.JIRA> <JIRA.13105538.1506537986123@jira-lw-us.apache.org>
Subject: [jira] [Commented] (GEODE-3709) Geode Version: 1.1.1    In one of
 the project we a...
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 26 Oct 2017 02:17:07 -0000


    [ https://issues.apache.org/jira/browse/GEODE-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219856#comment-16219856 ] 

Mangesh Deshmukh commented on GEODE-3709:
-----------------------------------------

I concur that there isn't sufficient information (from the logs and tcpdump) to tell where the problem really lies. I had also noticed the window size degradation, and it seems to be a common theme in this issue (I saw it across multiple runs).

To troubleshoot further, however, we could do a custom build with some additional log statements. Please let me know where the appropriate place would be to add them.

Is this the right place?
{code:java}

// Message.java
  void flushBuffer() throws IOException {
    final ByteBuffer cb = getCommBuffer();
    if (this.socketChannel != null) {
      cb.flip();
      do {
        this.socketChannel.write(cb); // <-- proposed location for the log statement
      } while (cb.remaining() > 0);
    } else {
      this.outputStream.write(cb.array(), 0, cb.position());
    }
    if (this.messageStats != null) {
      this.messageStats.incSentBytes(cb.position());
    }
    cb.clear();
  }

{code}

This should give us the exact time at which the message is written, from the application's point of view.
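A minimal sketch of what such instrumentation could look like, written as a standalone class so it compiles without Geode. `FlushTimingSketch`, `FlushStats` and `flush()` are hypothetical names, and `java.util.logging` stands in for whatever logger the custom build would actually use; the loop mirrors the `socketChannel` branch of `flushBuffer()` above. For a blocking channel the interesting number is how long a single `write()` call blocks; for a non-blocking one, repeated zero-byte writes would point at a full socket send buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.util.logging.Logger;

/**
 * Hypothetical standalone sketch, NOT Geode code: mirrors the socketChannel
 * branch of Message.flushBuffer() and adds per-write() timing and logging.
 */
public class FlushTimingSketch {
  private static final Logger LOG = Logger.getLogger(FlushTimingSketch.class.getName());

  /** Counters for one flush (hypothetical type, for illustration only). */
  public static final class FlushStats {
    public int writeCalls;
    public int zeroByteWrites;
    public long totalBytes;
    public long maxWriteNanos;
  }

  /** Flips cb, writes it fully to the channel, times each write(), then clears cb. */
  public static FlushStats flush(WritableByteChannel channel, ByteBuffer cb) throws IOException {
    FlushStats stats = new FlushStats();
    cb.flip();
    long flushStart = System.nanoTime();
    do {
      long t0 = System.nanoTime();
      int n = channel.write(cb);
      long elapsed = System.nanoTime() - t0;
      stats.writeCalls++;
      stats.totalBytes += n;
      stats.maxWriteNanos = Math.max(stats.maxWriteNanos, elapsed);
      if (n == 0) {
        // On a non-blocking SocketChannel, 0 means no room in the socket send buffer.
        stats.zeroByteWrites++;
      }
      // Per-call detail at FINE so it costs little unless that level is enabled.
      LOG.fine(() -> String.format("write() #%d: %d bytes in %d us, %d bytes remaining",
          stats.writeCalls, n, elapsed / 1_000, cb.remaining()));
    } while (cb.remaining() > 0);
    long totalMicros = (System.nanoTime() - flushStart) / 1_000;
    // Each log record carries a wall-clock timestamp that can be lined up with tcpdump.
    LOG.info(String.format("flushed %d bytes in %d write() call(s), %d us total, slowest %d us",
        stats.totalBytes, stats.writeCalls, totalMicros, stats.maxWriteNanos / 1_000));
    cb.clear();
    return stats;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    ByteBuffer cb = ByteBuffer.allocate(16 * 1024);
    cb.put(new byte[10_000]); // stand-in for a serialized message
    FlushStats s = flush(Channels.newChannel(sink), cb);
    System.out.println(s.totalBytes + " bytes in " + s.writeCalls + " write() call(s)");
  }
}
```

One summary line per flush may itself be too chatty when millions of destroys are being pushed to a client, so in practice it might be worth logging only flushes whose total time exceeds some threshold.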

> Geode Version: 1.1.1    In one of the project we a...
> -----------------------------------------------------
>
>                 Key: GEODE-3709
>                 URL: https://issues.apache.org/jira/browse/GEODE-3709
>             Project: Geode
>          Issue Type: Improvement
>          Components: client queues
>            Reporter: Gregory Chase
>         Attachments: 20171006-logs-stats-tds.zip, 20171020.zip, CacheClientProxyStats_sentBytes.gif, DistributionStats_receivedBytes_CacheClientProxyStats_sentBytes.gif, gf-rest-stats-12-05.gfs, myStatisticsArchiveFile-04-01.gfs
>
>
> Geode Version: 1.1.1
> We use Geode in one of our projects. Here is a summary of how we use it.
> - Geode servers have multiple regions. 
> - Clients subscribe to the data from these regions.
> - Clients register interest in all the entries, so they receive updates about every entry, from creation through modification to deletion.
> - One of the regions usually has 5-10 million entries with a TTL of 24 hours. Most entries are added one after another within an hour's span, so when the TTL kicks in, they are often destroyed within an hour.
> Problem:
> Every now and then we observe the following message:
> 	Client queue for _gfe_non_durable_client_with_id_x.x.x.x(14229:loner):42754:e4266fc4_2_queue client is full.
> This seems to happen when the TTL kicks in on the region with 5-10 million entries. Entries start getting evicted (deleted), and the resulting updates (destroys) must then be sent to clients. The updates are delivered for a while, but then they suddenly stop and the queue size starts growing. This is becoming a major issue for the smooth functioning of our production setup. Any help will be much appreciated.
> I did some groundwork by downloading and reading the code. I see references to two issues, #37581 and #51400, but I am unable to view the actual tickets (they require login credentials). Hopefully this helps someone looking into the issue.
> Here is the pertinent code:
>    @Override
>     @edu.umd.cs.findbugs.annotations.SuppressWarnings("TLW_TWO_LOCK_WAIT")
>     void checkQueueSizeConstraint() throws InterruptedException {
>       if (this.haContainer instanceof HAContainerMap && isPrimary()) { // Fix for bug 39413
>         if (Thread.interrupted())
>           throw new InterruptedException();
>         synchronized (this.putGuard) {
>           if (putPermits <= 0) {
>             synchronized (this.permitMon) {
>               if (reconcilePutPermits() <= 0) {
>                 if (region.getSystem().getConfig().getRemoveUnresponsiveClient()) {
>                   isClientSlowReciever = true;
>                 } else {
>                   try {
>                     long logFrequency = CacheClientNotifier.DEFAULT_LOG_FREQUENCY;
>                     CacheClientNotifier ccn = CacheClientNotifier.getInstance();
>                     if (ccn != null) { // check needed for junit tests
>                       logFrequency = ccn.getLogFrequency();
>                     }
>                     if ((this.maxQueueSizeHitCount % logFrequency) == 0) {
>                       logger.warn(LocalizedMessage.create(
>                           LocalizedStrings.HARegionQueue_CLIENT_QUEUE_FOR_0_IS_FULL,
>                           new Object[] {region.getName()}));
>                       this.maxQueueSizeHitCount = 0;
>                     }
>                     ++this.maxQueueSizeHitCount;
>                     this.region.checkReadiness(); // fix for bug 37581
>                     // TODO: wait called while holding two locks
>                     this.permitMon.wait(CacheClientNotifier.eventEnqueueWaitTime);
>                     this.region.checkReadiness(); // fix for bug 37581
>                     // Fix for #51400. Allow the queue to grow beyond its
>                     // capacity/maxQueueSize, if it is taking a long time to
>                     // drain the queue, either due to a slower client or the
>                     // deadlock scenario mentioned in the ticket.
>                     reconcilePutPermits();
>                     if ((this.maxQueueSizeHitCount % logFrequency) == 1) {
>                       logger.info(LocalizedMessage
>                           .create(LocalizedStrings.HARegionQueue_RESUMING_WITH_PROCESSING_PUTS));
>                     }
>                   } catch (InterruptedException ex) {
>                     // TODO: The line below is meaningless. Comment it out later
>                     this.permitMon.notifyAll();
>                     throw ex;
>                   }
>                 }
>               }
>             } // synchronized (this.permitMon)
>           } // if (putPermits <= 0)
>           --putPermits;
>         } // synchronized (this.putGuard)
>       }
>     }
> *Reporter*: Mangesh Deshmukh
> *E-mail*: [mailto:mdeshmukh@quotient.com]
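To reason about why the queue keeps growing after the "is full" warning, here is a deliberately simplified, single-lock model of the permit logic in the quoted `checkQueueSizeConstraint()`. It is a hypothetical sketch, not Geode code: `PermitQueueModel`, `put()` and `take()` are illustrative names, and `reconcilePutPermits()`, the `putGuard` lock, separate take permits and the `removeUnresponsiveClient` branch are all omitted (here a `take()` simply returns one permit). It shows the effect described in the #51400 comment: once permits run out, each put waits up to the enqueue wait time and then proceeds anyway, so a client that stops draining leaves the queue growing past its maximum size while producers are throttled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical, simplified model of the put-permit logic in
 * HARegionQueue.checkQueueSizeConstraint(); NOT Geode code.
 */
public class PermitQueueModel<E> {
  private final Object permitMon = new Object();
  private final Deque<E> queue = new ArrayDeque<>();
  private final long enqueueWaitMillis; // stands in for CacheClientNotifier.eventEnqueueWaitTime
  private int putPermits;               // initialised to the maximum queue size
  private int maxQueueSizeHitCount;     // puts that found no permit available

  public PermitQueueModel(int maxQueueSize, long enqueueWaitMillis) {
    this.putPermits = maxQueueSize;
    this.enqueueWaitMillis = enqueueWaitMillis;
  }

  /** Producer side: one timed wait for a permit, then enqueue regardless. */
  public void put(E event) throws InterruptedException {
    synchronized (permitMon) {
      if (putPermits <= 0) {
        // Geode logs "Client queue for ... is full" here (once every logFrequency hits).
        maxQueueSizeHitCount++;
        // Single timed wait, as in the quoted code; wait() releases permitMon only.
        permitMon.wait(enqueueWaitMillis);
        // Proceed even if no permit was freed, so the queue can exceed its maximum.
      }
      putPermits--;
      queue.addLast(event);
    }
  }

  /** Consumer side (dispatcher draining events to the client): frees one permit. */
  public E take() {
    synchronized (permitMon) {
      E event = queue.pollFirst();
      if (event != null) {
        putPermits++;
        permitMon.notifyAll();
      }
      return event;
    }
  }

  public int size() {
    synchronized (permitMon) {
      return queue.size();
    }
  }

  public int hitCount() {
    synchronized (permitMon) {
      return maxQueueSizeHitCount;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Capacity 2, 50 ms wait, no consumer: a client that has stopped draining.
    PermitQueueModel<String> q = new PermitQueueModel<>(2, 50);
    for (int i = 0; i < 5; i++) {
      q.put("destroy-" + i); // puts 3 to 5 each stall for about 50 ms first
    }
    System.out.println("size=" + q.size() + " fullHits=" + q.hitCount()); // prints "size=5 fullHits=3"
  }
}
```

In the quoted code the wait happens while `putGuard` is still held (hence the "wait called while holding two locks" TODO), so other threads enqueueing to the same client queue are serialized behind the waiting one; this model leaves that lock out.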


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)