heron-dev mailing list archives

From Karthik Ramasamy <kramas...@gmail.com>
Subject Re: About stream manager's quitting logic on connection failures
Date Mon, 05 Feb 2018 19:01:56 GMT
Ning - let us get this rolled out soon.


> On Feb 5, 2018, at 10:57 AM, Sanjeev Kulkarni <sanjeevrk@gmail.com> wrote:
> This sounds good to me!
> On Mon, Feb 5, 2018 at 1:08 AM, Ning Wang <wangninggm@gmail.com> wrote:
>> Yeah. That is an option too. In fact it was my first try:
>> https://github.com/twitter/heron/pull/2693 (just an initial attempt, not
>> completed; a count map should be used instead of a single total count)
>> In most cases, I think both solutions should have the same result. A few
>> reasons I changed to a tmaster check:
>> - with tmaster, there is only one source of truth and tmaster is more
>> critical anyway. If the tmaster link is not healthy, stmgrs won't work
>> correctly: topology may have created replacement nodes but the disconnected
>> nodes could keep going by themselves.
>> - it is more straightforward. The logic is the same as the current one. On
>> the other hand, if we use an array for all remote stmgrs, we could have
>> smarter logic (which is good) but it could make stmgrs more complicated and
>> less straightforward (bad). I left the stmgr counters there so if in the
>> future we decide to add this feature, it should be easy to add. There is a
>> gap between "errors from all" and "errors from a few" and this is not a
>> simple/quick question.
>> On Sun, Feb 4, 2018 at 6:48 PM, Sanjeev Kulkarni <sanjeevrk@gmail.com>
>> wrote:
>>> I couldn't add comments to the document, so I am posting my comments to the
>>> mailing list.
>>> One more approach could be to keep the current measurement as it is, but
>>> instead of leaving the quitting decision to the stmgrclient, have
>>> stmgrclientmgr make the decision. Every time a stmgr client detects
>>> connection issues, it informs stmgrclientmgr, which keeps a map of
>>> peerstmgrid to error count. It can then tell whether it is seeing
>>> connection errors from all stmgrs or whether only a few of them are
>>> having issues, and make better decisions.
>>> On Sat, Feb 3, 2018 at 8:11 PM, Ning Wang <wangninggm@gmail.com> wrote:
>>>> Hi, heron devs~
>>>> I think the current stream manager's quitting logic on connection failures
>>>> is problematic. We saw a few internal cases at Twitter where this logic
>>>> could cause extra issues.
>>>> Here is a doc with more details:
>>>> https://docs.google.com/document/d/1WHNc2NEp2gVL9ge2QVKp9t4Hpd4U9sAbzBqCu4-iDUM/edit#
>>>> Comments and feedback are welcome!
>>>> Thanks.
>>>> --ning
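
The per-peer error map suggested earlier in the thread (a map of peerstmgrid to error count, used to tell "errors from all" apart from "errors from a few") could be sketched roughly as follows. This is only an illustration in Python, not Heron's actual stmgr code: the class name PeerErrorTracker, its methods, the error_threshold parameter, and the three result labels are all hypothetical. What to do for each result is deliberately left out, since the thread treats that as an open question.

```python
from collections import defaultdict


class PeerErrorTracker:
    """Hypothetical stand-in for the per-peer bookkeeping discussed in the thread."""

    def __init__(self, peer_ids, error_threshold=3):
        # Set of all remote stmgr ids this stmgr is expected to talk to.
        self.peer_ids = set(peer_ids)
        # Assumed cutoff: a peer counts as failing at this many errors.
        self.error_threshold = error_threshold
        # Map of peer stmgr id -> connection error count.
        self.error_counts = defaultdict(int)

    def record_error(self, peer_id):
        """Called when a client reports a connection problem with peer_id."""
        self.error_counts[peer_id] += 1

    def record_success(self, peer_id):
        """Clear the count once the connection to peer_id works again."""
        self.error_counts.pop(peer_id, None)

    def failing_peers(self):
        """Return the peers whose error count reached the threshold."""
        return {
            peer
            for peer in self.peer_ids
            if self.error_counts.get(peer, 0) >= self.error_threshold
        }

    def classify(self):
        """Distinguish 'errors from all' from 'errors from a few'."""
        failing = self.failing_peers()
        if not failing:
            return "healthy"
        if failing == self.peer_ids:
            return "all_peers_failing"
        return "some_peers_failing"
```

With two peers and a threshold of 2, two errors reported for one peer give "some_peers_failing"; once the other peer also reaches two errors, the result becomes "all_peers_failing".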
