Message-ID: <1410992517.14622.34.camel@endpoint.com>
Subject: Performance oddity between AWS instance sizes
From: Josh Williams <jwilliams@endpoint.com>
To: user@hbase.apache.org
Date: Wed, 17 Sep 2014 18:21:57 -0400
Organization: End Point Corporation

Hi, everyone.  Here's a strange one, at least to me.

I'm doing some performance profiling, and as a rudimentary test I've
been using YCSB to drive HBase (originally 0.98.3, recently updated to
0.98.6).  The problem happens on a few different instance sizes, but
this is probably the closest comparison:

On m3.2xlarge instances, everything works as expected.
On c3.2xlarge instances, HBase barely responds at all during workloads
that involve read activity, going silent for intervals of about 62
seconds, with the YCSB throughput output resembling:

 0 sec: 0 operations;
 2 sec: 918 operations; 459 current ops/sec; [UPDATE AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
 4 sec: 918 operations; 0 current ops/sec;
 6 sec: 918 operations; 0 current ops/sec;
<snip>
 62 sec: 918 operations; 0 current ops/sec;
 64 sec: 5302 operations; 2192 current ops/sec; [UPDATE AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
 66 sec: 5302 operations; 0 current ops/sec;
 68 sec: 5302 operations; 0 current ops/sec;
(And so on...)
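
For pulling the stall windows out of a longer run, here is a minimal
sketch in Python.  It assumes the status lines keep the
"N sec: M operations;" shape shown above; find_stalls and min_stall_sec
are names made up for illustration and are not part of YCSB.

```python
import re

# Matches YCSB status lines such as " 64 sec: 5302 operations; ..."
STATUS_RE = re.compile(r"^\s*(\d+) sec: (\d+) operations;")


def find_stalls(lines, min_stall_sec=10):
    """Return (start_sec, end_sec) spans during which the cumulative
    operation count did not advance for at least min_stall_sec seconds.

    A stall still in progress when the input ends is not reported,
    because there is no later sample showing progress resuming.
    """
    stalls = []
    last_progress_sec = None  # time of the last sample whose count changed
    last_ops = None           # cumulative operation count at that sample
    for line in lines:
        m = STATUS_RE.match(line)
        if not m:
            continue  # skip "<snip>", blank lines, and other non-status output
        sec, ops = int(m.group(1)), int(m.group(2))
        if last_ops is None or ops != last_ops:
            if last_ops is not None and sec - last_progress_sec >= min_stall_sec:
                stalls.append((last_progress_sec, sec))
            last_progress_sec, last_ops = sec, ops
    return stalls
```

Fed the excerpt above, this should report a single window running from
the 2-second sample to the 64-second sample.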

While that happens there's almost no activity on either side: the CPUs
and disks are idle, with no iowait at all.

There isn't much that jumps out at me when digging through the Hadoop
and HBase logs, except that those 62-second intervals are often (but
not always) associated with ClosedChannelExceptions in the regionserver
logs.  But I believe that's just HBase finding that a TCP connection it
wants to reply on has already been closed.

Surprisingly, as far as I've seen this happens every time on this or
any of the larger c3 instance sizes.  The m3 instance sizes all seem to
work fine.  These are built from a custom AMI that has HBase and
everything else installed, and are launched via a script, so the
instance type should be the only difference between them.

Anyone seen anything like this?  Any pointers as to what I could look at
to help diagnose this odd problem?  Could there be something I'm
overlooking in the logs?

Thanks!

-- Josh