hbase-user mailing list archives

From Josh Williams <jwilli...@endpoint.com>
Subject Re: Performance oddity between AWS instance sizes
Date Fri, 19 Sep 2014 03:43:21 GMT
Hi Andrew,

I'll definitely bump up the heap on subsequent tests -- thanks for the
tip.  It was increased to 8 GB, but that didn't make any difference for
the older YCSB.

Using your YCSB branch with the updated HBase client definitely makes a
difference, showing consistent throughput for a while.  After a bit of
time (so far under about 5 minutes in the few runs I've done) it hits a
NullPointerException[1] ... but it definitely seems to point more at a
problem in the older YCSB.

[1] https://gist.github.com/joshwilliams/0570a3095ad6417ca74f

Thanks for your help,

-- Josh


On Thu, 2014-09-18 at 15:02 -0700, Andrew Purtell wrote:
> 1 GB of heap is nowhere near enough if you're trying to test something
> real (or approximate it with YCSB).  Try 4 or 8 GB, anything up to 31
> GB, depending on the use case; at 32 GB and up you give up compressed
> oops and may run into GC issues.
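> 
> For example (just a sketch; the exact flags depend on your JVM), in
> hbase-env.sh:
> 
>   export HBASE_OPTS="$HBASE_OPTS -Xmx31g -XX:+UseCompressedOops"
> 
> Recent JVMs keep compressed oops on by default below 32 GB anyway, so
> the main thing is keeping -Xmx under that line.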
> 
> Also, I recently redid the HBase YCSB client in a modern way for >=
> 0.98.  See https://github.com/apurtell/YCSB/tree/new_hbase_client .  It
> performs, IMHO, in a fashion more useful for what YCSB is intended to
> do than the previous client, but might need some tuning (I haven't
> tried it on a cluster of significant size).  One difference you should
> see: it won't back up for 30-60 seconds after a bunch of threads flush
> fat 12+ MB write buffers.
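> 
> (If memory serves, the old client turns off autoflush and sets a 12 MB
> write buffer per HTable, roughly setAutoFlush(false) plus
> setWriteBufferSize(12 * 1024 * 1024), so a bunch of worker threads end
> up flushing fat buffers all at once; don't hold me to the exact
> numbers, though.)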
> 
> On Thu, Sep 18, 2014 at 2:31 PM, Josh Williams <jwilliams@endpoint.com> wrote:
> > Ted,
> >
> > Stack trace, that's definitely a good idea.  Here's one jstack snapshot
> > from the region server while there's no apparent activity going on:
> > https://gist.github.com/joshwilliams/4950c1d92382ea7f3160
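> >
> > (Captured with something like: jstack -l <regionserver pid>,
> > substituting the actual region server process id.)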
> >
> > If it's helpful, this is the YCSB side of the equation right around the
> > same time:
> > https://gist.github.com/joshwilliams/6fa3623088af9d1446a3
> >
> >
> > And Gary,
> >
> > As far as the memory configuration goes, that's a good question.  It
> > looks like HBASE_HEAPSIZE isn't set, which I now see defaults to 1 GB.
> > There isn't any swap configured, and 12G of the memory on the instance
> > is going to file cache, so there's definitely room to spare.
> >
> > Maybe it'd help if I gave it more room by setting HBASE_HEAPSIZE.
> > Couldn't hurt to try that now...
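> >
> > Something like this in conf/hbase-env.sh, if I have the units right
> > (the value looks to be in MB):
> >
> >   export HBASE_HEAPSIZE=8192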
> >
> > What's strange is that on m3.xlarge, which also has 15G of RAM but
> > fewer CPU cores, it runs fine.
> >
> > Thanks to you both for the insight!
> >
> > -- Josh
> >
> >
> >
> > On Thu, 2014-09-18 at 11:42 -0700, Gary Helmling wrote:
> >> What do you have HBASE_HEAPSIZE set to in hbase-env.sh?  Is it
> >> possible that you're overcommitting memory and the instance is
> >> swapping?  Just a shot in the dark, but I see that the m3.2xlarge
> >> instance has 30G of memory vs. 15G for c3.2xlarge.
> >>
> >> On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >> > bq. there's almost no activity on either side
> >> >
> >> > During this period, can you capture a stack trace for the region
> >> > server and pastebin it?
> >> >
> >> > Cheers
> >> >
> >> > On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <jwilliams@endpoint.com>
> >> > wrote:
> >> >
> >> >> Hi, everyone.  Here's a strange one, at least to me.
> >> >>
> >> >> I'm doing some performance profiling, and as a rudimentary test I've
> >> >> been using YCSB to drive HBase (originally 0.98.3, recently updated
> >> >> to 0.98.6).  The problem happens on a few different instance sizes, but
> >> >> this is probably the closest comparison...
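> >> >>
> >> >> (For reference, the invocation is essentially: bin/ycsb load hbase -P
> >> >> workloads/workloada -p columnfamily=family, and then the same with
> >> >> "run"; the workload and column family name here stand in for what
> >> >> I'm actually using.)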
> >> >>
> >> >> On m3.2xlarge instances, it works as expected.
> >> >> On c3.2xlarge instances, HBase barely responds at all during workloads
> >> >> that involve read activity, falling silent for ~62 second intervals,
> >> >> with the YCSB throughput output resembling:
> >> >>
> >> >>  0 sec: 0 operations;
> >> >>  2 sec: 918 operations; 459 current ops/sec; [UPDATE
> >> >> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
> >> >>  4 sec: 918 operations; 0 current ops/sec;
> >> >>  6 sec: 918 operations; 0 current ops/sec;
> >> >> <snip>
> >> >>  62 sec: 918 operations; 0 current ops/sec;
> >> >>  64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
> >> >> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
> >> >>  66 sec: 5302 operations; 0 current ops/sec;
> >> >>  68 sec: 5302 operations; 0 current ops/sec;
> >> >> (And so on...)
> >> >>
> >> >> While that happens there's almost no activity on either side; the
> >> >> CPUs and disks are idle, with no iowait at all.
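> >> >>
> >> >> (That's going by watching something like vmstat 1 and iostat -x 1
> >> >> on both hosts while the test runs, for what it's worth.)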
> >> >>
> >> >> There isn't much that jumps out at me when digging through the Hadoop
> >> >> and HBase logs, except that those 62-second intervals are often (but
> >> >> not always) associated with ClosedChannelExceptions in the region
> >> >> server logs.  But I believe that's just HBase finding that a TCP
> >> >> connection it wants to rely on had been closed.
> >> >>
> >> >> As far as I've seen, this happens every time on this or any of the
> >> >> larger c3-class instances, surprisingly.  The m3 instance class sizes
> >> >> all seem to work fine.  These are built from a custom AMI with HBase
> >> >> and everything else installed, and run via a script, so the instance
> >> >> type should be the only difference between them.
> >> >>
> >> >> Anyone seen anything like this?  Any pointers as to what I could look
> >> >> at to help diagnose this odd problem?  Could there be something I'm
> >> >> overlooking in the logs?
> >> >>
> >> >> Thanks!
> >> >>
> >> >> -- Josh