hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Date Wed, 21 Nov 2012 05:05:05 GMT
On Tue, Nov 20, 2012 at 8:21 AM, Vincent Barat <vincent.barat@gmail.com> wrote:
> We have changed some parameters on our 16(!) region servers: 1GB more -Xmx,
> more RPC handlers (from 10 to 30), longer timeouts, but nothing seems to
> improve the response time:
>

Have you taken a look at the perf chapter, Vincent?
http://hbase.apache.org/book.html#performance

Did you carry forward your old hbase-default.xml or did you remove it?
(0.92 should have its defaults in hbase-X.X.X.jar -- some defaults will
have changed.)


> - Scans with HBase 0.92 are 3x SLOWER than with HBase 0.90.3

Any scan caching going on?
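
For reference, a minimal sketch of what turning on scanner caching might look like in the client's hbase-site.xml. The property is the standard client-side setting; the value 100 is purely an illustrative assumption and not a figure from this thread. Caching can also be set per scan with Scan.setCaching(int).

```xml
<!-- hbase-site.xml on the client side (illustrative value, not a recommendation) -->
<property>
  <name>hbase.client.scanner.caching</name>
  <!-- rows fetched per scanner RPC; 100 is an assumed example value -->
  <value>100</value>
</property>
```

Larger values cut round trips but raise memory use per open scanner on both client and regionserver, so the right number depends on row size.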


> - A lot of simultaneous gets lead to a huge slowdown of batch put & random
> read response time
>

Are the gets returning lots of data? (If you thread dump the server at
this time -- see the top of the regionserver UI -- can you see what we
are hung up on?  Are all handlers occupied?)


> ... despite the fact that our RS CPU load is really low (10%)
>

As has been suggested earlier, perhaps up the handlers?
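
A sketch of the corresponding setting in hbase-site.xml on each regionserver, using the value 30 that Vincent reports trying. It is assumed here that the regionservers are restarted afterwards so the new handler count is picked up.

```xml
<!-- hbase-site.xml on every regionserver -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <!-- default was 10; 30 is the value being tried in this thread -->
  <value>30</value>
</property>
```

A thread dump taken under load would show whether all of these handlers are actually busy, which helps decide if raising the count further makes sense.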


> Note: we have not (yet) activated MSlabs, nor direct read on HDFS.
>

MSlab will help you avoid stop-the-world GCs.  Direct read of HDFS
should speed up random access.
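
A hedged sketch of the two switches in hbase-site.xml. Whether either is already on depends on which defaults file is in effect, and short-circuit ("direct") reads are only available on Hadoop builds that support them. They usually also need datanode-side setup in hdfs-site.xml (for example dfs.block.local-path-access.user), which is assumed here and not shown.

```xml
<!-- hbase-site.xml on the regionservers (sketch; check against your versions) -->
<property>
  <!-- MemStore-Local Allocation Buffers, to reduce heap fragmentation -->
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- let the regionserver read local HDFS blocks directly from disk -->
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
```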

St.Ack

> Any ideas, please? I'm really stuck on this issue.
>
> Best regards,
>
> On 16/11/12 20:55, Vincent Barat wrote:
>>
>> Hi,
>>
>> Until now (and previously with 0.90.3) we were using the default value
>> (10).
>> We are now trying to increase it to 30 to see if that is better.
>>
>> Thanks for your concern
>>
>> On 16/11/12 18:13, Ted Yu wrote:
>>>
>>> Vincent:
>>> What's the value for hbase.regionserver.handler.count ?
>>>
>>> I assume you keep the same value as that from 0.90.3
>>>
>>> Thanks
>>>
>>> On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <vincent.barat@gmail.com> wrote:
>>>
>>>> On 16/11/12 01:56, Stack wrote:
>>>>
>>>>> On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <gperrot@ubikod.com> wrote:
>>>>>
>>>>>> It happens when several tables are being compacted and/or when
>>>>>> several scanners are running.
>>>>>>
>>>>> It happens for a particular region?  Anything you can tell about the
>>>>> server looking in your cluster monitoring?  Is it running hot?  What
>>>>> do the hbase regionserver stats in UI say?  Anything interesting about
>>>>> compaction queues or requests?
>>>>>
>>>> Hi, thanks for your answer Stack. I will take the lead on this thread
>>>> from now on.
>>>>
>>>> It does not happen on any particular region. Actually, things have
>>>> gotten better now that compactions have been performed on all tables
>>>> and have stopped.
>>>>
>>>> Nevertheless, we face a dramatic decrease in performance (especially on
>>>> random gets) across the overall cluster:
>>>>
>>>> Despite the fact that we doubled our number of region servers (from 8
>>>> to 16), and despite the fact that the CPU load on these region servers
>>>> is only about 10% to 30%, performance is really bad: very often a
>>>> slight increase in requests leads to clients blocked on requests and
>>>> very long response times. It looks like contention or a deadlock
>>>> somewhere in the HBase client and C code.
>>>>
>>>>
>>>>
>>>>> If you look at the thread dump all handlers are occupied serving
>>>>> requests?  These timedout requests couldn't get into the server?
>>>>>
>>>> We will investigate that and report back to you.
>>>>
>>>>
>>>>>> Before the timeouts, we observe an increasing CPU load on a single
>>>>>> region server, and if we add region servers and wait for rebalancing,
>>>>>> we always have the same region server causing problems like these:
>>>>>>
>>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc version=1, client version=29, methodsFingerPrint=54742778 from <ip>:45334: output error
>>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>>>>>> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>>>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
>>>>>>
>>>>>> With the same access patterns, we did not have this issue in HBase
>>>>>> 0.90.3.
>>>>>>
>>>>> The above is the other side of the timeout -- the client is gone.
>>>>>
>>>>> Can you explain the rising CPU?
>>>>>
>>>> No, there is no explanation (no heavy access to a given region, for
>>>> example).
>>>> But this specific problem went away when the compactions finished.
>>>>
>>>>
>>>>> Is it iowait on this box because of compactions?  Bad disk?  Always
>>>>> the same regionserver or does the issue move around?
>>>>>
>>>>> Sorry for all the questions.  0.92 should be better than 0.90
>>>>>
>>>> Our experience is currently the exact opposite: for us, 0.92 seems to
>>>> be several times slower than 0.90.3.
>>>>
>>>>> generally (0.94 even better still -- can you go there?).
>>>>>
>>>> We can go to 0.94 but unfortunately we CANNOT GO BACK (the same way we
>>>> cannot go back to 0.90.3, since the format of the ROOT table has
>>>> apparently changed).
>>>> The upgrade works, but the downgrade does not. And we are afraid of
>>>> having even more "new" problems with 0.94 and being forced to roll back
>>>> to 0.90.3 (losing some days of data).
>>>>
>>>> Thanks for your reply; we will continue to investigate.
>>>>
>>>>
>>>>
>>>>> Interesting that these issues show up post upgrade.  I can't think of a
>>>>> reason why the different versions would bring this on...
>>>>>
>>>>> St.Ack
>>>>>
>>>>>
>>
>>
>
