hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-1736) If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)
Date Wed, 16 Jul 2014 18:54:05 GMT

     [ https://issues.apache.org/jira/browse/HBASE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-1736.
--------------------------

    Resolution: Invalid

All is different now, 5 years later.

> If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1736
>                 URL: https://issues.apache.org/jira/browse/HBASE-1736
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>
> What I saw was the master shutting itself down because it had lost its ZK lease. Fine. The RS, though, doesn't look like it can deal with this situation. We'll see stuff like this:
> {code}
> ...failed on connection exception: java.net.ConnectException: Connection refused
>     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>     at $Proxy0.regionServerReport(Unknown Source)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
>     ... 4 more
> {code}
> ... all over the regionserver as it tries to send heartbeats to the master over this broken connection.
> On split, we close the parent and add the children to the catalog, but when we then try to tell the master about the split, the call fails. This means the children never get deployed. Meantime, the parent is offline.
> This issue is about going through the regionserver and, anywhere it has a connection to the master, making sure that on fault no damage is done to the table and that the regionserver puts a pause on splitting.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
