hbase-user mailing list archives

From: Mikael Sitruk <mikael.sit...@gmail.com>
Subject: Re: 0.92 and Read/writes not scaling
Date: Wed, 21 Mar 2012 05:29:51 GMT
Juhani,
Can you look at the storefiles and tell us how they behave during the test?
What is the size of the data you insert/update?
Mikael
On Mar 20, 2012 8:10 PM, "Juhani Connolly" <juhanic@gmail.com> wrote:

> Hi Matt,
>
> This is something we haven't tested much; we were always running with
> about 32 regions, which gave enough coverage for an even spread over
> all machines.
> I will run our tests with enough regions per server to cover all cores
> and get back to the mailing list.
>
> On Tue, Mar 20, 2012 at 1:55 AM, Matt Corgan <mcorgan@hotpads.com> wrote:
> > I'd be curious to see what happens if you split the table into 1 region
> > per CPU core, so 24 cores * 11 servers = 264 regions.  Each region has 1
> > memstore which is a ConcurrentSkipListMap, and you're currently hitting
> > each CSLM with 8 cores, which might be too contentious.  Normally in
> > production you would want multiple memstores per CPU core.
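A minimal sketch of what such a pre-split could look like with the 0.92-era Java client. The table name, column family and key range below are placeholders, not values from the thread; the thread's own approach of splitting and balancing from the shell is an equivalent route.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Placeholder table and column family names.
        HTableDescriptor desc = new HTableDescriptor("usertable");
        desc.addFamily(new HColumnDescriptor("family"));

        // One region per CPU core across the cluster: 24 cores * 11 servers.
        int numRegions = 24 * 11;

        // Evenly spaced split points between a start and an end key; this
        // assumes row keys are roughly uniform over this (placeholder) range.
        byte[] startKey = Bytes.toBytes("user0000000000");
        byte[] endKey = Bytes.toBytes("user9999999999");
        admin.createTable(desc, startKey, endKey, numRegions);
      }
    }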
> >
> >
> > On Mon, Mar 19, 2012 at 5:31 AM, Juhani Connolly <juhanic@gmail.com>
> wrote:
> >
> >> Actually we did try running off two machines, both running our own
> >> tests in parallel. Unfortunately the result was a split that gave the
> >> same total throughput. We also did the same thing with iperf running
> >> from each machine to another machine, which indicated roughly 800Mbit/s
> >> of additional throughput between each pair of machines.
> >> However, we didn't try these tests very thoroughly, so I will revisit
> >> them as soon as I get back to the office, thanks.
> >>
> >> On Mon, Mar 19, 2012 at 9:21 PM, Christian Schäfer
> >> <syrious3000@yahoo.de> wrote:
> >> > Referring to my experience, I expect the client to be the bottleneck,
> >> > too.
> >> >
> >> > So try to increase the count of client machines (not client threads),
> >> > each with its own unshared network interface.
> >> >
> >> > In my case I could double write throughput by doubling the client
> >> > machine count, with a much smaller system than yours (5 machines,
> >> > 4 GB of RAM each).
> >> >
> >> > Good Luck
> >> > Chris
> >> >
> >> >
> >> >
> >> > ________________________________
> >> > From: Juhani Connolly <juhanic@gmail.com>
> >> > To: user@hbase.apache.org
> >> > Sent: 13:02 Monday, 19 March 2012
> >> > Subject: Re: 0.92 and Read/writes not scaling
> >> >
> >> > I was concerned that might be the case too, which is why we ran the
> >> > YCSB tests in addition to our application-specific and general
> >> > performance tests. Checking profiles of the execution just showed the
> >> > vast majority of time spent waiting for responses. These were all run
> >> > with 400 threads (though we tried more/less just in case).
> >> > 2012/03/19 20:57 "Mingjian Deng" <koven2049@gmail.com>:
> >> >
> >> >> @Juhani:
> >> >> How many clients did you test with? Maybe the bottleneck was the client?
> >> >>
> >> >> 2012/3/19 Ramkrishna.S.Vasudevan <ramkrishna.vasudevan@huawei.com>
> >> >>
> >> >> > Hi Juhani
> >> >> >
> >> >> > Can you tell us more about how the regions are balanced?
> >> >> > Are you overloading only a specific region server?
> >> >> >
> >> >> > Regards
> >> >> > Ram
> >> >> >
> >> >> > > -----Original Message-----
> >> >> > > From: Juhani Connolly [mailto:juhanic@gmail.com]
> >> >> > > Sent: Monday, March 19, 2012 4:11 PM
> >> >> > > To: user@hbase.apache.org
> >> >> > > Subject: 0.92 and Read/writes not scaling
> >> >> > >
> >> >> > > Hi,
> >> >> > >
> >> >> > > We're running into a brick wall where our throughput numbers will
> >> >> > > not scale as we increase server counts, both using custom in-house
> >> >> > > tests and YCSB.
> >> >> > >
> >> >> > > We're using HBase 0.92 on Hadoop 0.20.2 (we also experienced the
> >> >> > > same issues using 0.90 before switching our testing to this
> >> >> > > version).
> >> >> > >
> >> >> > > Our cluster consists of:
> >> >> > > - Namenode and hmaster on separate servers, 24 core, 64gb
> >> >> > > - up to 11 datanode/regionservers. 24 core, 64gb, 4 * 1tb disks
> >> >> > >   (hope to get this changed)
> >> >> > >
> >> >> > > We have adjusted our GC settings and MSLAB:
> >> >> > >
> >> >> > >   <property>
> >> >> > >     <name>hbase.hregion.memstore.mslab.enabled</name>
> >> >> > >     <value>true</value>
> >> >> > >   </property>
> >> >> > >
> >> >> > >   <property>
> >> >> > >     <name>hbase.hregion.memstore.mslab.chunksize</name>
> >> >> > >     <value>2097152</value>
> >> >> > >   </property>
> >> >> > >
> >> >> > >   <property>
> >> >> > >     <name>hbase.hregion.memstore.mslab.max.allocation</name>
> >> >> > >     <value>1024768</value>
> >> >> > >   </property>
> >> >> > >
> >> >> > > hdfs xceivers is set to 8192
> >> >> > >
> >> >> > > We've experimented with a variety of handler counts for namenode,
> >> >> > > datanodes and regionservers with no changes in throughput.
> >> >> > >
> >> >> > > For testing with ycsb, we do the following each time (with nothing
> >> >> > > else using the cluster):
> >> >> > > - truncate test table
> >> >> > > - add a small amount of data, then split the table into 32 regions
> >> >> > >   and call balancer from the shell.
> >> >> > > - load 10m rows
> >> >> > > - do a 1:2:7 insert:update:read test with 10 million rows (64k/sec)
> >> >> > > - do a 5:5 insert:update test with 10 million rows (23k/sec)
> >> >> > > - do a pure read test with 10 million rows (75k/sec)
> >> >> > >
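Since YCSB hides the client loop, the 1:2:7 insert:update:read step above corresponds roughly to the hypothetical sketch below against the 0.92 client API. The table name, column family, key format, payload size and per-thread operation count are assumptions, not values from the thread.

    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MixedWorkload implements Runnable {
      private static final byte[] FAMILY = Bytes.toBytes("family");    // placeholder
      private static final byte[] QUALIFIER = Bytes.toBytes("field0"); // placeholder
      private static final int ROWS = 10000000;                        // 10m loaded rows

      public void run() {
        try {
          Configuration conf = HBaseConfiguration.create();
          // HTable is not thread-safe in 0.92, so use one instance per thread.
          HTable table = new HTable(conf, "usertable");
          Random rnd = new Random();
          byte[] value = new byte[1000];                  // payload size is a guess

          for (int i = 0; i < 100000; i++) {
            int op = rnd.nextInt(10);                     // 1:2:7 insert:update:read mix
            if (op < 1) {                                 // ~10% inserts (new keys)
              Put p = new Put(Bytes.toBytes("user" + (ROWS + rnd.nextInt(ROWS))));
              p.add(FAMILY, QUALIFIER, value);
              table.put(p);
            } else if (op < 3) {                          // ~20% updates (existing keys)
              Put p = new Put(Bytes.toBytes("user" + rnd.nextInt(ROWS)));
              p.add(FAMILY, QUALIFIER, value);
              table.put(p);
            } else {                                      // ~70% reads
              table.get(new Get(Bytes.toBytes("user" + rnd.nextInt(ROWS))));
            }
          }
          table.close();
        } catch (Exception e) {
          e.printStackTrace();
        }
      }

      public static void main(String[] args) throws Exception {
        Thread[] workers = new Thread[400];               // 400 client threads, as in the thread
        for (int i = 0; i < workers.length; i++) {
          workers[i] = new Thread(new MixedWorkload());
          workers[i].start();
        }
        for (Thread t : workers) {
          t.join();
        }
      }
    }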
> >> >> > > We have observed ganglia, iostat -d -x, iptraf, top, dstat and a
> >> >> > > variety of other diagnostic tools; network/io/cpu/memory bottlenecks
> >> >> > > seem highly unlikely, as none of them is ever seriously taxed.
> >> >> > > This leads me to assume this is some kind of locking issue?
> >> >> > > Delaying WAL flushes gives a small throughput bump, but it doesn't
> >> >> > > scale.
> >> >> > >
> >> >> > > There also don't seem to be many figures around to compare ours to.
> >> >> > > We can get our throughput numbers higher with tricks like not
> >> >> > > writing the WAL, delaying flushes, or batching requests, but nothing
> >> >> > > seems to scale with additional slaves.
> >> >> > > Could anyone provide guidance as to what may be preventing
> >> >> > > throughput figures from scaling as we increase our slave count?
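For reference, the "tricks" mentioned above map onto the 0.92-era client roughly as in this sketch; table and family names, buffer and payload sizes are placeholders. Deferred log flush would instead be enabled per table through the table descriptor's deferred-log-flush setting, and skipping the WAL trades durability for throughput.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteTricks {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");   // placeholder table name

        // Batch requests on the client: buffer puts locally and send them
        // in bulk instead of issuing one RPC per put.
        table.setAutoFlush(false);
        table.setWriteBufferSize(2 * 1024 * 1024);      // 2 MB buffer (arbitrary)

        byte[] family = Bytes.toBytes("family");        // placeholder family
        byte[] qualifier = Bytes.toBytes("field0");     // placeholder qualifier
        byte[] value = new byte[1000];                  // payload size is a guess

        for (int i = 0; i < 100000; i++) {
          Put p = new Put(Bytes.toBytes("user" + i));
          p.add(family, qualifier, value);
          p.setWriteToWAL(false);                       // skip the WAL (data-loss risk on RS failure)
          table.put(p);                                 // lands in the local write buffer
        }
        table.flushCommits();                           // push any remaining buffered puts
        table.close();
      }
    }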
> >> >> >
> >> >> >
> >> >>
> >>
>
