hbase-user mailing list archives

From Juhani Connolly <juhani_conno...@cyberagent.co.jp>
Subject Re: 0.92 and Read/writes not scaling
Date Tue, 27 Mar 2012 03:18:50 GMT
Hi Todd,

Here are our thread dumps from one of our slave nodes while running a load.
The particular load was set up to grab a table from a table pool, stop it 
from autoflushing, put in 1000 entries of 128-256 bytes each (the keys 
being a random spread throughout the entire keyspace), and then flush 
manually. The average latency is an atrocious 58 seconds, though of 
course it is nothing like that if we use single puts or small batches...
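
For reference, the inner loop of the load generator looks roughly like the
sketch below. It is a simplified version: the table name, column family,
and key/value generation are placeholders, and we actually get the table
from an HTablePool rather than constructing an HTable directly.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteLoadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");       // placeholder table name
        table.setAutoFlush(false);                          // buffer puts client-side
        Random rnd = new Random();
        List<Put> batch = new ArrayList<Put>(1000);
        for (int i = 0; i < 1000; i++) {
          byte[] key = new byte[16];
          rnd.nextBytes(key);                               // random spread over the keyspace
          byte[] value = new byte[128 + rnd.nextInt(129)];  // 128-256 byte values
          rnd.nextBytes(value);
          Put put = new Put(key);
          put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), value);  // placeholder family/qualifier
          batch.add(put);
        }
        table.put(batch);       // stays in the client write buffer (autoflush is off)
        table.flushCommits();   // manual flush of the buffered puts
        table.close();
      }
    }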
I've also included our configs... They had more in them, but we stripped 
them down a lot to try to get at the problem source, with no luck (we took 
them down to the bare minimum as well, but that didn't change things, so 
we restored some of the settings).

Thanks,
  Juhani

On 03/27/2012 10:43 AM, Todd Lipcon wrote:
> Hi Juhani,
>
> I wouldn't have expected CDH4b1 (0.23) to be slower than 0.20 for
> writes. They should be around the same speed, or even a little faster
> in some cases. That said, I haven't personally run any benchmarks in
> several months on this setup. I know our performance/QA team has done
> some, so I asked them to take a look. Hopefully we should have some
> results soon.
>
> If you can take 10-20 jstacks of the RegionServer and the DN on that
> same machine while performing your write workload, that would be
> helpful. It's possible we had a regression during some recent
> development right before the 4b1 release. If you're feeling
> adventurous, you can also try upgrading to CDH4b2 snapshot builds,
> which do have a couple of performance improvements/bugfixes that may
> help. Drop by #cloudera on IRC and one of us can point you in the
> right direction if you're willing to try (though of course the nightly
> builds are somewhat volatile and haven't had any QA)
>
> -Todd
>
> On Mon, Mar 26, 2012 at 10:08 AM, Juhani Connolly <juhanic@gmail.com> wrote:
>> On Tue, Mar 27, 2012 at 1:42 AM, Stack <stack@duboce.net> wrote:
>>> On Mon, Mar 26, 2012 at 6:58 AM, Matt Corgan <mcorgan@hotpads.com> wrote:
>>>> When you increased regions on your previous test, did it start maxing out
>>>> CPU?  What improvement did you see?
>>>>
>>> To echo Matt: what is your cluster doing?  What changes do you see
>>> when you, say, increase the size of your batching, or, as Matt asks,
>>> what is the difference when you went from fewer to more regions?
>>>
>> None of our hardware is even near its limit. Ganglia rarely has a
>> single machine over 25% load, and we have verified that IO, network, CPU
>> and memory all have plenty of breathing space with other tools (top,
>> iostat, dstat and others mentioned in the hstack article).
>>
>>>> Have you tried increasing the memstore flush size to something like 512MB?
>>>> Maybe you're blocked on flushes.  40,000 (4,000/server) is pretty slow for
>>>> a disabled WAL, I think, especially with a batch size of 10.  If you
>>>> increase the write batch size to 1000, how much does your write throughput
>>>> increase?
>>>>
>>> The above sounds like something to try -- upping flush sizes.
>>>
>>> Are you spending all your time compacting?  For kicks, try
>>> disabling compactions when doing your write tests.  Does it make a
>>> difference?  What does ganglia show as hot?  Are you network-bound,
>>> IO-bound, CPU-bound?
>>>
>>> Thanks,
>>> St.Ack
>> The compaction and flush times according to ganglia are pretty short
>> and insignificant. I've also been watching the RPCs and past events
>> from the HTML control panel, which don't seem to be indicative of a
>> problem. However, I will try changing the flushes and using bigger
>> batches; it might turn up something interesting. Thanks.
>
>
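
P.S. For concreteness, the flush-size and batch-size changes discussed above
would look something like the sketch below. 512MB is the value Matt suggested;
the table name and write buffer size are just placeholders, not values we have
settled on.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TuningSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // 1. Raise the memstore flush size for the test table (per-table,
        //    instead of changing hbase.hregion.memstore.flush.size globally).
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.disableTable("testtable");                        // placeholder table name
        HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("testtable"));
        desc.setMemStoreFlushSize(512L * 1024 * 1024);          // 512MB, up from the 128MB default
        admin.modifyTable(Bytes.toBytes("testtable"), desc);
        admin.enableTable("testtable");

        // 2. Bigger client-side batches: a larger write buffer and 1000
        //    puts per put(List<Put>) call instead of 10.
        HTable table = new HTable(conf, "testtable");
        table.setAutoFlush(false);
        table.setWriteBufferSize(8L * 1024 * 1024);             // e.g. 8MB instead of the 2MB default
        // ... build and submit batches of 1000 Puts as before, then flushCommits()
        table.close();
      }
    }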

