hbase-issues mailing list archives

From "Ashu Pachauri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16752) Upgrading from 1.2 to 1.3 can lead to replication failures due to difference in RPC size limit
Date Wed, 12 Oct 2016 08:48:20 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15568079#comment-15568079 ]

Ashu Pachauri commented on HBASE-16752:
---------------------------------------

[~anoop.hbase] As currently implemented, the server gives the client no feedback (there is
no RequestTooBigException); it simply drops the connection. This has two side effects:
1. The client sees only unexplained connection drops, which can be hard to debug for people
not very familiar with the HBase codebase. Even if I do try to return a RequestTooBigException
(a new exception), the client simply discards it, because the server sends a call ID that the
client is not expecting (the server has an incorrect call ID because it does not want to read
the whole request, as it's too large).
2. The client retries the same RPC again and again and keeps failing (until retries are
exhausted, or forever in the case of replication).
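
To illustrate the two side effects above, here is a minimal sketch of the behaviour being
aimed for: the server rejects an oversized request with a descriptive exception, and the
client treats that exception as non-retryable instead of resending the same RPC. This is not
HBase's RpcServer or RpcClient code. The class RpcSizeLimitSketch, the Rpc interface, and the
methods checkRequestSize and callWithRetries are hypothetical names invented for this example;
only the RequestTooBigException name and the 256 MB figure come from this thread.

```java
import java.io.IOException;

// Hypothetical sketch, not HBase code: all names below are invented for
// illustration except RequestTooBigException, which is the exception name
// proposed in this comment.
class RpcSizeLimitSketch {

    // 256 MB per call, the 1.3 default quoted in this issue.
    static final long MAX_REQUEST_SIZE = 256L * 1024 * 1024;

    // Carries the reason for the rejection back to the caller.
    static class RequestTooBigException extends IOException {
        RequestTooBigException(String message) {
            super(message);
        }
    }

    // Stand-in for a single RPC attempt (assumed shape).
    interface Rpc {
        void send() throws IOException;
    }

    // Server side: reject based on the declared length, without reading the body.
    static void checkRequestSize(long declaredLength, long maxRequestSize)
            throws RequestTooBigException {
        if (declaredLength > maxRequestSize) {
            throw new RequestTooBigException("RPC of " + declaredLength
                    + " bytes exceeds the limit of " + maxRequestSize + " bytes");
        }
    }

    // Client side: retry transient I/O failures, but fail fast on an oversized
    // request, because resending the same payload can never succeed.
    // Returns the number of attempts that were needed.
    static int callWithRetries(Rpc rpc, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                rpc.send();
                return attempt;
            } catch (RequestTooBigException e) {
                throw e; // not retryable
            } catch (IOException e) {
                last = e; // e.g. a dropped connection; try again
            }
        }
        if (last == null) {
            throw new IOException("no attempts were made");
        }
        throw last;
    }

    public static void main(String[] args) {
        try {
            callWithRetries(
                    () -> checkRequestSize(300L * 1024 * 1024, MAX_REQUEST_SIZE), 5);
        } catch (IOException e) {
            System.out.println("Call failed: " + e.getMessage());
        }
    }
}
```

With the current behaviour described above, the client would instead see only a dropped
connection, which falls into the retryable branch of this sketch and is resent until the
retries run out.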

The implication for replication is that if the destination peer is upgraded to 1.3 (where
servers enforce this limit), replication can fail because the source can accept large RPCs
while the peer cannot. A temporary fix is for the HBase admin to override this RPC size limit
on the peer. We could also change the default in HBase 1.3 (currently 256 MB per call, 1 GB
total call queue size) to match the max call queue size in HBase 1.2 (1 GB), but that would
defeat the purpose of this config.

That said, I do not plan to fix the replication problem, just to give better feedback to the
client so that the problem can be easily diagnosed, a temporary fix can be applied, and
clients can be modified to respect the RPC size limit.
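
As a rough idea of what "respecting the RPC size limit" could look like on the sending side,
the sketch below greedily groups items into batches whose total size stays under a limit. It
is not the replication source's actual code and is not a fix proposed in this issue.
SizeBoundedBatcher and batchBySize are hypothetical names, and the items are represented only
by their sizes in bytes. An item that is larger than the limit on its own still ends up in a
batch by itself, so this does not help when a single unsplittable item exceeds the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not HBase code: splits a sequence of item sizes into
// batches so that each batch stays within maxBatchBytes where possible.
class SizeBoundedBatcher {

    static List<List<Long>> batchBySize(List<Long> itemSizes, long maxBatchBytes) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : itemSizes) {
            // Close the current batch if adding this item would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // An item bigger than the limit is still emitted, alone in its batch.
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        long limit = 256L * 1024 * 1024; // the per-call limit quoted in this issue
        List<Long> sizes = Arrays.asList(
                100L * 1024 * 1024, 100L * 1024 * 1024, 100L * 1024 * 1024);
        System.out.println(batchBySize(sizes, limit).size() + " batches");
    }
}
```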


> Upgrading from 1.2 to 1.3 can lead to replication failures due to difference in RPC size limit
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-16752
>                 URL: https://issues.apache.org/jira/browse/HBASE-16752
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, rpc
>    Affects Versions: 1.3.0
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>
> In HBase 1.2, we don't limit the size of a single RPC, but in 1.3 we limit it by default to
> 256 MB. This means that during upgrade scenarios (or when the source is on 1.2 and the peer
> is already on 1.3), it's possible to encounter a situation where we try to send an RPC larger
> than 256 MB, because we never unroll a WALEdit while sending replication traffic.
> RpcServer throws the underlying exception locally, but closes the connection without
> returning the underlying error to the client, and the client only sees a "Broken pipe" error.
> I am not sure what the proper fix is here (or if one is needed) to make sure this does
> not happen, but we should return the underlying exception to the RpcClient, because without
> it, it can be difficult to diagnose the problem, especially for someone new to HBase.



