hbase-user mailing list archives

From Esteban Gutierrez <este...@cloudera.com>
Subject Re: Region server not accept connections intermittently
Date Wed, 09 Jul 2014 04:48:41 GMT
Hi Rural,

That's interesting. Since you are passing
hbase.zookeeper.property.maxClientCnxns, does that mean ZK is managed by
HBase? If you experience the issue again, can you try to obtain a jstack
(as the user that started the HBase process, or from the RS UI at
rs:port/dump if it is responsive) as Ted suggested? The output of
"top -H -p <PID>" might be useful too, where <PID> is the pid of the RS.
If you have some metrics monitoring, it would be interesting to see how
callQueueLength and the blocked threads change over time.
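
For example, something along these lines should capture everything (just a
sketch assuming the RS runs as the "hbase" user; replace <PID> and <rs-host>
with your values, and 60030 is only the default RS info port):

    # thread dump of the region server JVM, run as the process owner
    sudo -u hbase jstack -l <PID> > rs-jstack.txt

    # one batch snapshot of per-thread CPU usage inside the RS process
    top -H -p <PID> -b -n 1 > rs-top.txt

    # the same thread dump via the RS web UI, if it is still responsive
    curl http://<rs-host>:60030/dump > rs-dump.txt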

cheers,
esteban.


--
Cloudera, Inc.



On Tue, Jul 8, 2014 at 6:58 PM, Rural Hunter <ruralhunter@gmail.com> wrote:

> No. I used the standard log4j file and there isn't any network problem
> from the client. I checked the web admin UI and the master still regards
> the slave as working; only the request count is very low (about 10, while
> the others are in the several hundreds). I sshed into the slave server and
> can see port 60020 is open with the netstat command, but I was not able to
> telnet to the port even on the server itself; it just timed out. Clients
> on other servers see the same thing. After it recovered automatically, I
> could telnet to port 60020 from both the slave server and the other
> servers.
>
> This is my server configuration: http://pastebin.com/Ks4cCiaE
>
> Client configuration:
>         // hbaseQuorum holds our ZooKeeper quorum string
>         myConf.set("hbase.zookeeper.quorum", hbaseQuorum);
>         // give up after 3 retries, pausing 1000 ms between attempts
>         myConf.set("hbase.client.retries.number", "3");
>         myConf.set("hbase.client.pause", "1000");
>         // cap concurrent client tasks per region server and per region
>         myConf.set("hbase.client.max.perserver.tasks", "10");
>         myConf.set("hbase.client.max.perregion.tasks", "10");
>         // size of the RPC connection pool per server
>         myConf.set("hbase.client.ipc.pool.size", "5");
>         // retry ZK operations only once on recovery
>         myConf.set("zookeeper.recovery.retry", "1");
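>
> In case it helps, the full setup around those lines is roughly this (a
> sketch only: hbaseQuorum holds our ZK quorum string, and the table name
> and row key below are just placeholders):
>
>         import org.apache.hadoop.conf.Configuration;
>         import org.apache.hadoop.hbase.HBaseConfiguration;
>         import org.apache.hadoop.hbase.client.Delete;
>         import org.apache.hadoop.hbase.client.HTable;
>         import org.apache.hadoop.hbase.util.Bytes;
>
>         // base config picks up hbase-site.xml from the classpath
>         Configuration myConf = HBaseConfiguration.create();
>         // ... the myConf.set(...) calls above go here ...
>
>         // the delete call that fails with the timeouts below
>         HTable table = new HTable(myConf, "mytable");  // placeholder table name
>         try {
>             table.delete(new Delete(Bytes.toBytes("some-row")));  // placeholder row key
>         } finally {
>             table.close();
>         }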
>
> The error on the client:
> Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=3, exceptions:
> Mon Jul 07 19:10:35 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518, org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=slave2/192.168.2.88:60020]
> Mon Jul 07 19:10:58 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518, org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=slave2/192.168.2.88:60020]
> Mon Jul 07 19:11:23 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518, org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=slave2/192.168.2.88:60020]
>
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
>     at org.apache.hadoop.hbase.client.HTable.delete(HTable.java:831)
>
> On 2014/7/9 1:02, Esteban Gutierrez wrote:
>
>> Hello Rural,
>>
>> It doesn't seem to be a problem with the region server from what I can
>> tell. The RS is not showing any message in the logs about a long pause
>> (unless you have a non-standard log4j.properties file), and if the RS had
>> been in a very long pause due to GC or any other issue, the master should
>> have considered this region server dead; from the logs it doesn't look
>> like that happened. Have you double-checked from the client side for any
>> connectivity issue to the RS? Can you pastebin the client and the HBase
>> cluster confs?
>>
>> cheers,
>> esteban.
>>
>>
>> --
>> Cloudera, Inc.
>>
>>
>>
>> On Tue, Jul 8, 2014 at 2:14 AM, Rural Hunter <ruralhunter@gmail.com>
>> wrote:
>>
>>> OK, I will try to do that when it happens again. Thanks.
>>>
>>> On 2014/7/8 17:06, Ted Yu wrote:
>>>
>>>> Next time this happens, can you take a jstack of the region server
>>>> and pastebin it?
>>>>
>>>> Thanks
>>>>
>
