Return-Path: X-Original-To: apmail-hbase-user-archive@www.apache.org Delivered-To: apmail-hbase-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 70C35DD87 for ; Wed, 12 Sep 2012 18:07:02 +0000 (UTC) Received: (qmail 29994 invoked by uid 500); 12 Sep 2012 18:07:00 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 29953 invoked by uid 500); 12 Sep 2012 18:07:00 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 29945 invoked by uid 99); 12 Sep 2012 18:07:00 -0000 Received: from minotaur.apache.org (HELO minotaur.apache.org) (140.211.11.9) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Sep 2012 18:07:00 +0000 Received: from localhost (HELO mail-iy0-f169.google.com) (127.0.0.1) (smtp-auth username apurtell, mechanism plain) by minotaur.apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Sep 2012 18:07:00 +0000 Received: by iagk10 with SMTP id k10so1963890iag.14 for ; Wed, 12 Sep 2012 11:06:59 -0700 (PDT) MIME-Version: 1.0 Received: by 10.50.195.234 with SMTP id ih10mr21774123igc.21.1347473219372; Wed, 12 Sep 2012 11:06:59 -0700 (PDT) Received: by 10.64.68.230 with HTTP; Wed, 12 Sep 2012 11:06:59 -0700 (PDT) In-Reply-To: References: Date: Wed, 12 Sep 2012 11:06:59 -0700 Message-ID: Subject: Re: Tracking down coprocessor pauses From: Andrew Purtell To: user@hbase.apache.org Content-Type: text/plain; charset=ISO-8859-1 Inline On Wed, Sep 12, 2012 at 10:40 AM, Tom Brown wrote: > I have captured some logs from what is happening during one of these pauses. > > http://pastebin.com/K162Einz > > Can someone help me figure out what's actually going on from these logs? > > --- My interpretation of the logs --- > > As you can see at the start of the logs, my coprocessor for updating > the data is executing rapidly until 10:17:06. > > At that time the coprocessor for querying is invoked. This query > should take only moments to return, but doesn't return until 10:44:52. Here it would be helpful to get a stacktrace from the regionserver where the CP is executing, to see where the RPC threads servicing the CP invocations are hung up. > > At 10:18:53 there appear to be some compaction related messages > (though they didn't appear to be the cause, happening over a minute > after the server stops functioning). > > It appears to run compaction until 10:42:25. The next two minutes > contain just LRU eviction messages. > > At 10:44:52, the query from earlier appears to complete, after having > summarized only 863 rows. A few other queued requests are attempted, > but fail with exceptions (ClosedChannelException). > > Eventually the exceptions are being thrown from "openScanner", which > really doesn't sound good to me. ChannelClosedExceptions appear to be from RPC service threads, now unstuck, processing queued up CP invocations but the client has given up, so they can't write back results and error out. Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)