hbase-user mailing list archives

From Vincent Barat <vincent.ba...@gmail.com>
Subject Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Date Wed, 21 Nov 2012 08:23:43 GMT

On 21/11/12 06:05, Stack wrote:
> On Tue, Nov 20, 2012 at 8:21 AM, Vincent Barat <vincent.barat@gmail.com> wrote:
>> We have changed some parameters on our 16(!) region servers: 1 GB more -Xmx,
>> more RPC handlers (from 10 to 30) and a longer timeout, but nothing seems to
>> improve the response time:
>>
> Have you taken a look at the perf chapter, Vincent:
> http://hbase.apache.org/book.html#performance
>
> Did you carry forward your old hbase-default.xml, or did you remove it?
> (0.92 should have defaults in hbase-X.X.X.jar -- some defaults will
> have changed.)
We use the new default settings for HBase, with just a few changes (more 
RPC handlers and a longer timeout, though the latter was a bad idea).
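
For reference, the overrides boil down to something like this in our
hbase-site.xml (the handler count is the value we set; the timeout property
name and value are from memory, so take them as an assumption):

  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>30</value>   <!-- raised from the default of 10 -->
  </property>
  <property>
    <name>hbase.rpc.timeout</name>   <!-- property name from memory -->
    <value>120000</value>            <!-- ms; the longer timeout was the bad idea -->
  </property>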
>> - Scans with HBase 0.92 are x3 SLOWER than with HBase 0.90.3
> Any scan caching going on?
Yes, scan caching is set between 64 and 1024, depending on the need.
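
To be concrete, our scans set caching explicitly, along the lines of this
sketch (0.92-era client API; the table and family names are made up, not our
real schema):

  // ScanCachingExample.java -- minimal sketch of how we configure scan caching.
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ScanCachingExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "events");   // illustrative table name
      Scan scan = new Scan();
      scan.setCaching(512);                        // we use 64..1024 depending on the need
      scan.addFamily(Bytes.toBytes("d"));          // illustrative column family
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // process each row here
        }
      } finally {
        scanner.close();
        table.close();
      }
    }
  }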
>> - A lot of simultaneous gets leads to a huge slowdown of batch put & random
>> read response times
>>
> The gets are returning lots of data? (If you thread dump the server at
> this time -- see at top of the regionserver UI -- can you see what we
> are hung up on?  Are all handlers occupied?).
We will check this...
>> ... despite the fact that our RS CPU load is really low (10%)
>>
> As has been suggested earlier, perhaps up the handlers?
>
>
>> Note: we have not (yet) activated MSlabs, nor direct read on HDFS.
>>
> MSlab will help you avoid stop-the-world GCs.  Direct read of HDFS
> should speed up random access.
OK, I guess we will give it a try, but as a second step.
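
If we do, my understanding is that enabling MSLAB is a single switch in
hbase-site.xml (property name as I understand it for the 0.92 line; please
correct me if it is wrong):

  <property>
    <name>hbase.hregion.memstore.mslab.enabled</name>
    <value>true</value>
  </property>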

Thanks for your help
>
> St.Ack
>
>> Any idea, please? I'm really stuck on that issue.
>>
>> Best regards,
>>
>> On 16/11/12 20:55, Vincent Barat wrote:
>>> Hi,
>>>
>>> Right now (and previously with 0.90.3) we were using the default value
>>> (10). We are trying to increase it to 30 to see if it is better.
>>>
>>> Thanks for your concern
>>>
>>> On 16/11/12 18:13, Ted Yu wrote:
>>>> Vincent:
>>>> What's the value for hbase.regionserver.handler.count?
>>>>
>>>> I assume you kept the same value as in 0.90.3.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <vincent.barat@gmail.com>
>>>> wrote:
>>>>
>>>>> On 16/11/12 01:56, Stack wrote:
>>>>>
>>>>>> On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <gperrot@ubikod.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It happens when several tables are being compacted and/or when there
>>>>>>> are several scanners running.
>>>>>>>
>>>>>> It happens for a particular region?  Anything you can tell about the
>>>>>> server looking in your cluster monitoring?  Is it running hot?  What
>>>>>> do the hbase regionserver stats in the UI say?  Anything interesting
>>>>>> about compaction queues or requests?
>>>>>>
>>>>> Hi, thanks for your answer Stack. I will take the lead on this thread
>>>>> from now on.
>>>>>
>>>>> It does not happen on any particular region. Actually, things are
>>>>> getting better now that compactions have been performed on all tables
>>>>> and have been stopped.
>>>>>
>>>>> Nevertheless, we face a dramatic decrease in performance (especially on
>>>>> random gets) across the overall cluster:
>>>>>
>>>>> Despite the fact that we doubled our number of region servers (from 8
>>>>> to 16), and despite the fact that these region servers' CPU load is
>>>>> only about 10% to 30%, performance is really bad: very often a slight
>>>>> increase in requests leads to clients locked on requests and very long
>>>>> response times. It looks like a contention / deadlock somewhere in the
>>>>> HBase client and C code.
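
To make the access pattern concrete: the random gets above are issued as
batched multi-gets, roughly along the lines of this sketch (0.92 client API;
the table name, keys and batch size are illustrative, not our real workload):

  // BatchGetExample.java -- illustrative sketch of our random-read pattern.
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BatchGetExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "events");   // illustrative table name
      List<Get> gets = new ArrayList<Get>();
      for (int i = 0; i < 1000; i++) {             // many random keys in one batch
        gets.add(new Get(Bytes.toBytes("row-" + i)));
      }
      Result[] results = table.get(gets);          // goes out as a multi() call
      System.out.println("fetched " + results.length + " rows");
      table.close();
    }
  }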
>>>>>
>>>>>
>>>>>
>>>>>> If you look at the thread dump all handlers are occupied serving
>>>>>> requests?  These timedout requests couldn't get into the server?
>>>>>>
>>>>> We will investigate on that and report to you.
>>>>>
>>>>>
>>>>>>> Before the timeouts, we observe an increasing CPU load on a single
>>>>>>> region server, and if we add region servers and wait for rebalancing,
>>>>>>> we always have the same region server causing problems like these:
>>>>>>>
>>>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>>>>>> Server Responder, call
>>>>>>> multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc
>>>>>>> version=1, client version=29, methodsFingerPrint=54742778 from
>>>>>>> <ip>:45334: output error
>>>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>>>>>> Server handler 3 on 60020 caught:
>>>>>>> java.nio.channels.ClosedChannelException
>>>>>>>   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>>>>>>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
>>>>>>>
>>>>>>> With the same access patterns, we did not have this issue in HBase
>>>>>>> 0.90.3.
>>>>>>>
>>>>>> The above is the other side of the timeout -- the client is gone.
>>>>>>
>>>>>> Can you explain the rising CPU?
>>>>>>
>>>>> No, there is no explanation (no heavy access to a given region, for
>>>>> example). But this specific problem went away when we finished the
>>>>> compactions.
>>>>>
>>>>>
>>>>>> Is it iowait on this box because of compactions?  Bad disk?  Always
>>>>>> the same regionserver, or does the issue move around?
>>>>>>
>>>>>> Sorry for all the questions.  0.92 should be better than 0.90
>>>>>> generally (0.94 even better still -- can you go there?).
>>>>>>
>>>>> Our experience is currently the exact opposite: for us, 0.92 seems to
>>>>> be three times slower than 0.90.3.
>>>>>
>>>>> We can go to 0.94 but unfortunately, we CANNOT GO BACK (the same way we
>>>>> cannot go back to 0.90.3, since there is apparently a modification of
>>>>> the format of the ROOT table). The upgrade works, but the downgrade
>>>>> does not. And we are afraid of having even more "new" problems with
>>>>> 0.94 and being forced to roll back to 0.90.3 (with some days of data
>>>>> loss).
>>>>>
>>>>> Thanks for your reply, we will continue to investigate.
>>>>>
>>>>>
>>>>>
>>>>>> Interesting that these issues show up post upgrade.  I can't think of
>>>>>> a reason why the different versions would bring this on...
>>>>>> St.Ack
>>>>>>
>>>>>>
>>>
