hbase-user mailing list archives

From Vincent Barat <vincent.ba...@gmail.com>
Subject Re: Lots of SocketTimeoutException for gets and puts since HBase 0.92.1
Date Fri, 16 Nov 2012 16:14:11 GMT
On 16/11/12 01:56, Stack wrote:
> On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <gperrot@ubikod.com> wrote:
>> It happens when several tables are being compacted and/or when there are
>> several scanners running.
> It happens for a particular region?  Anything you can tell about the
> server looking in your cluster monitoring?  Is it running hot?  What
> do the hbase regionserver stats in UI say?  Anything interesting about
> compaction queues or requests?

Hi, thanks for your answer, Stack. I will take the lead on this 
thread from now on.

It does not happen on any particular region. Actually, things have 
improved now that compactions have been performed on all tables and 
have stopped.

Nevertheless, we face a dramatic decrease in performance 
(especially on random gets) across the whole cluster:

Despite the fact that we doubled our number of region servers (from 
8 to 16), and despite the fact that CPU load on these region servers 
is only about 10% to 30%, performance is really bad: very often a 
slight increase in requests leads to clients blocked on requests and 
very long response times. It looks like a contention / deadlock 
somewhere in the HBase client and C code.

> If you look at the thread dump all handlers are occupied serving
> requests?  These timedout requests couldn't get into the server?
We will investigate that and report back to you.
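
For the handler check, here is a minimal sketch of how a region server 
thread dump could be tallied. It assumes jstack-style output, where each 
thread starts with a quoted name line followed by a 
"java.lang.Thread.State:" line, and it assumes the handler threads are 
named as in the log quoted in this thread ("IPC Server handler 3 on 
60020"). The function name handler_states is hypothetical. As a rough 
reading, handlers in WAITING are usually idle, while RUNNABLE or BLOCKED 
handlers are usually busy with a request, but that should be confirmed 
against the actual stack frames.

```python
import re
from collections import Counter

# Thread header line, e.g.: "IPC Server handler 3 on 60020" daemon prio=10 ...
HANDLER_RE = re.compile(r'^"(IPC Server handler \d+ on \d+)"')
# State line that follows a header in jstack-style output.
STATE_RE = re.compile(r'^\s*java\.lang\.Thread\.State: (\w+)')


def handler_states(dump_text):
    """Count IPC handler threads per Java thread state in a thread dump."""
    counts = Counter()
    in_handler = False
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A new thread section begins; remember whether it is a handler.
            in_handler = bool(HANDLER_RE.match(line))
            continue
        match = STATE_RE.match(line)
        if in_handler and match:
            counts[match.group(1)] += 1
            in_handler = False
    return counts
```

If every handler shows up as RUNNABLE or BLOCKED at the moment clients 
start timing out, that would suggest the requests are queuing behind 
busy handlers rather than being lost on the network.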

>> Before the timeouts, we observe an increasing CPU load on a single region
>> server and if we add region servers and wait for rebalancing, we always
>> have the same region server causing problems like these:
>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc version=1, client version=29, methodsFingerPrint=54742778 from <ip>:45334: output error
>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
>> With the same access patterns, we did not have this issue in HBase 0.90.3.
> The above is other side of the timeout -- the client is gone.
> Can you explain the rising CPU?
No, there is no explanation (no heavy access to a given region, for 
example). But this specific problem went away when we finished 

>    Is it iowait on this box because of
> compactions?  Bad disk?  Always same regionserver or issue moves
> around?
> Sorry for all the questions.  0.92 should be better than 0.90
Our experience is currently the exact opposite: for us, 0.92 seems 
to be times slower than 0.90.3.
> generally (0.94 even better still -- can you go there?).

We can go to 0.94, but unfortunately we CANNOT GO BACK (in the same 
way that we cannot go back to 0.90.3, since the format of the ROOT 
table has apparently changed).
The upgrade works, but the downgrade does not. And we are afraid of 
running into even more "new" problems with 0.94 and being forced to 
roll back to 0.90.3 (with some days of data loss).

Thanks for your reply; we will continue to investigate.

>    Interesting
> that these issues show up post upgrade.  I can't think of a reason why
> the different versions would bring this on...
> St.Ack
