Subject: Performance oddity between AWS instance sizes
From: Josh Williams
To: user@hbase.apache.org
Date: Wed, 17 Sep 2014 18:21:57 -0400
Organization: End Point Corporation

Hi, everyone. Here's a strange one, at least to me.

I'm doing some performance profiling, and as a rudimentary test I've been using YCSB to drive HBase (originally 0.98.3, recently updated to 0.98.6). The problem happens on a few different instance sizes, but this is probably the closest comparison:

On m3.2xlarge instances, it works as expected.

On c3.2xlarge instances, HBase barely responds at all during workloads that involve read activity, falling silent for ~62 second intervals, with the YCSB throughput output resembling:

  0 sec: 0 operations;
  2 sec: 918 operations; 459 current ops/sec; [UPDATE AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
  4 sec: 918 operations; 0 current ops/sec;
  6 sec: 918 operations; 0 current ops/sec;
  62 sec: 918 operations; 0 current ops/sec;
  64 sec: 5302 operations; 2192 current ops/sec; [UPDATE AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
  66 sec: 5302 operations; 0 current ops/sec;
  68 sec: 5302 operations; 0 current ops/sec;
  (And so on...)

While that happens there's almost no activity on either side: the CPUs and disks are idle, with no iowait at all.

There isn't much that jumps out at me when digging through the Hadoop and HBase logs, except that those 62-second intervals are often (but not always) associated with ClosedChannelExceptions in the regionserver logs. But I believe that's just HBase finding that a TCP connection it wants to reply on had been closed.

As far as I've seen, this happens every time on this or any of the larger c3 class of instances, surprisingly. The m3 instance class sizes all seem to work fine. These are built with a custom AMI that has HBase and everything else installed, and are run via a script, so the instance type should be the only difference between them.

Has anyone seen anything like this? Any pointers as to what I could look at to help diagnose this odd problem? Could there be something I'm overlooking in the logs?

Thanks!

-- Josh
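
To put numbers on the stalls rather than eyeballing the status output, here is a minimal Python sketch. It assumes the status lines keep the "N sec: M operations;" shape shown above; the function name find_stalls, the SAMPLE list, and the 10-second threshold are illustrative choices, not part of YCSB. A stall is taken to be any span between two samples where the cumulative operation count advanced, with no progress in between.

```python
import re

# Matches the leading part of a YCSB status line, e.g.
# "64 sec: 5302 operations; 2192 current ops/sec; [UPDATE ...]"
STATUS_RE = re.compile(r"^\s*(\d+) sec: (\d+) operations;")

def find_stalls(lines, min_stall_sec=10):
    """Return (start_sec, end_sec) spans in which the cumulative
    operation count did not advance for at least min_stall_sec."""
    stalls = []
    last_progress_sec = None  # time of the last sample that showed progress
    last_ops = None           # cumulative operation count at that sample
    for line in lines:
        m = STATUS_RE.match(line)
        if not m:
            continue  # skip anything that is not a status line
        sec, ops = int(m.group(1)), int(m.group(2))
        if last_ops is None or ops > last_ops:
            # Progress resumed: record the gap if it was long enough.
            if last_progress_sec is not None and sec - last_progress_sec >= min_stall_sec:
                stalls.append((last_progress_sec, sec))
            last_progress_sec, last_ops = sec, ops
    return stalls

# Status lines copied from the output quoted in this message.
SAMPLE = [
    "0 sec: 0 operations;",
    "2 sec: 918 operations; 459 current ops/sec; [UPDATE AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]",
    "4 sec: 918 operations; 0 current ops/sec;",
    "6 sec: 918 operations; 0 current ops/sec;",
    "62 sec: 918 operations; 0 current ops/sec;",
    "64 sec: 5302 operations; 2192 current ops/sec; [UPDATE AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]",
    "66 sec: 5302 operations; 0 current ops/sec;",
    "68 sec: 5302 operations; 0 current ops/sec;",
]

if __name__ == "__main__":
    for start, end in find_stalls(SAMPLE):
        print(f"stall from {start}s to {end}s ({end - start}s)")
```

A stall that is still in progress when the input ends is not reported, since its length is unknown. Lining up the reported spans against regionserver log timestamps may help show whether the ClosedChannelExceptions fall at the start or the end of each silent period.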