hbase-issues mailing list archives

From "Gary Helmling (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17114) Add an option to set special retry pause when encountering CallQueueTooBigException
Date Fri, 18 Nov 2016 18:04:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15677304#comment-15677304 ]

Gary Helmling commented on HBASE-17114:

bq. Well, if checking the uploaded patch, it's indeed tied to CQTBE only. Introducing a new
property is only for making things more flexible, and of course we could use a hard-coded,
like 5 times than the existing pause, for CQTBE. But I'd say this is a trade-off, waiting
longer for CQTBE could prevent the vicious circle but is also causing a higher latency, and
IMHO user should be able to control such trade-off. If they don't want CQTBE to be special,
they could set hbase.client.pause.special to the same value as hbase.client.pause, which gives
them more options.

I agree with allowing the user to control the behavior here, but this also increases the complexity
and knowledge needed for configuration tuning, which we already have way too much of.  In
general, we should be moving in the direction of making the system dynamically tune itself
according to load instead of forcing all users to grapple with yet another configuration property.
By default, the configuration should be simple and provide the best experience for all users.
For advanced users who really need to treat CQTBE differently, that should be possible by
means of an override, but it should not be forced on everyone.
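
To make the "simple default, optional override" idea concrete, here is a minimal sketch. It is
not taken from the attached patch or from HBase code: the class name, the method, and the fixed
multiplier of 5 are assumptions for illustration. Only the two property names and the 100ms/500ms
relationship come from the issue description.

```java
import java.util.Properties;

/** Hypothetical sketch, not HBase code: choose the retry pause for a failed call. */
final class RetryPauseSketch {
    static final String PAUSE_KEY = "hbase.client.pause";                 // existing property
    static final String SPECIAL_PAUSE_KEY = "hbase.client.pause.special"; // proposed in HBASE-17114
    static final long DEFAULT_PAUSE_MS = 100;       // implied by the issue: 500ms is 5x the default
    static final int ASSUMED_CQTBE_MULTIPLIER = 5;  // assumption: the "5 times" figure from the thread

    /**
     * Returns the pause in milliseconds. Ordinary failures use the base pause.
     * For CQTBE, an explicit override wins; otherwise the pause is derived from
     * the base pause, so most users never have to set the extra property.
     */
    static long pauseMillis(Properties conf, boolean callQueueTooBig) {
        long base = Long.parseLong(
            conf.getProperty(PAUSE_KEY, String.valueOf(DEFAULT_PAUSE_MS)));
        if (!callQueueTooBig) {
            return base;
        }
        String override = conf.getProperty(SPECIAL_PAUSE_KEY);
        return override != null ? Long.parseLong(override) : base * ASSUMED_CQTBE_MULTIPLIER;
    }
}
```

With this shape, a user who does not want CQTBE treated specially sets the override equal to the
base pause, and everyone else gets a derived value with no extra tuning.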

bq. Sorry but I don't see any difference in "should not clear the client meta cache" and "should
not retry so frequently", both trying to resolve some problem and make things better.

These are two completely different things.  I don't see the equivalence.  We don't clear the
meta cache because we don't have an indication that the region has moved, so there is no need
to go back to meta.  The meta cache handling is completely independent of what is appropriate
in terms of retries.

bq. No offense but I'm even thinking of making CQTBE thrown optional, because for some case
dead-wait for the request to be executed in RpcServer until time-out is preferable by user
rather than receiving some exception and retry and fail again, but obviously this is another
topic (Smile).

Blocking the RpcServer Reader threads indefinitely when the queue is full, making the server
completely unresponsive and spilling overflow back into the OS networking buffers, is pretty
poor behavior.  CQTBE is a crude mechanism for back-pressure to the client, but at least it
gets the client a response and allows it to make an informed decision about how to proceed.
In the case where the application implements its own retries, the client may want to simply
fail and kick the exception back up the stack, allowing other layers to retry.  Or the client
could decide to retry for a fixed duration.  But in either case I think CQTBE provides a very
clear improvement in overall server behavior.  Another part of the puzzle is the CoDel scheduler,
which will allow more useful work to get done in overloaded situations.
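
The two client-side choices described above (fail immediately, or retry for a fixed duration)
could look roughly like the sketch below. This is an illustration only, not the HBase client's
retry logic: QueueFullException is a stand-in for CallQueueTooBigException so the example
compiles without HBase on the classpath, and the method name and parameters are hypothetical.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a client reacting to a queue-full rejection. */
final class BackPressureSketch {
    /** Stand-in for CallQueueTooBigException (assumption, so no HBase dependency is needed). */
    static class QueueFullException extends Exception {}

    /**
     * Retries only queue-full rejections, and only while the retry budget lasts.
     * A budget of 0 means "fail fast": the first rejection is rethrown so that
     * an outer layer with its own retry policy can decide what to do.
     */
    static <T> T callWithBoundedRetry(Callable<T> call, long maxRetryMillis, long pauseMillis)
            throws Exception {
        long deadline = System.nanoTime() + maxRetryMillis * 1_000_000L;
        while (true) {
            try {
                return call.call();
            } catch (QueueFullException e) {
                // Give up if sleeping once more would reach or pass the deadline.
                if (System.nanoTime() + pauseMillis * 1_000_000L >= deadline) {
                    throw e;
                }
                Thread.sleep(pauseMillis);
            }
        }
    }
}
```

Either way the client gets an answer from the server and makes the decision itself, which is the
property being argued for here; any other exception type propagates unchanged.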

I'm all for improving the client/server interactions in these scenarios, and what I first
outlined in this issue was one idea for how to do that more effectively.  However, I would
also like us to avoid surprises for our users and regressions in server behavior.

I'm not sure of the exact symptoms you're trying to solve, but if you're seeing issues with
meta being overloaded, then I'd suggest tuning the configuration for the number of priority
handlers and size of the priority queues.  You could also evaluate running with meta hosted
on master, which together with zk-less assignment can make region assignment much more stable.

> Add an option to set special retry pause when encountering CallQueueTooBigException
> -----------------------------------------------------------------------------------
>                 Key: HBASE-17114
>                 URL: https://issues.apache.org/jira/browse/HBASE-17114
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Yu Li
>            Assignee: Yu Li
>         Attachments: HBASE-17114.patch
> As titled, after HBASE-15146 we will throw {{CallQueueTooBigException}} instead of dead-wait.
> This is good for performance for most cases but might cause a side-effect that if too many
> clients connect to the busy RS, that the retry requests may come over and over again and RS
> never got the chance for recovering, and the issue will become especially critical when the
> target region is META.
> So here in this JIRA we propose to supply some special retry pause for CQTBE in name
> of {{hbase.client.pause.special}}, and by default it will be 500ms (5 times of {{hbase.client.pause}}
> default value)
