helix-user mailing list archives

From Santiago Perez <san...@santip.com.ar>
Subject Re: Long GC
Date Sat, 04 May 2013 15:32:22 GMT
Hi Ming,

We have also had issues with long GC pauses causing ZK session expiration.
In our case we found a way to detect that the expiration had occurred: we
register a ControllerChangeListener with HelixManager and watch for a change
notification of type INIT (NotificationContext.Type.INIT) arriving after our
service has already started (in other words, when we see it for the second time).

When we detect this situation we call HelixManager.disconnect and then
HelixManager.connect, essentially withdrawing and reconnecting the
participant. This causes all the appropriate transitions to be triggered.
I'm not sure whether this would help your use case, but at least it gives
you a way to intercept this behavior and take the necessary measures to keep
your cluster in shape.
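
A minimal sketch of the detection logic described above, in case it helps.
To keep it self-contained it does not depend on Helix: NotificationType and
Reconnectable are hypothetical stand-ins for NotificationContext.Type and for
the HelixManager connect/disconnect calls, and SecondInitDetector is a made-up
name. In a real participant you would call onControllerChange from your
ControllerChangeListener and pass in the notification's type.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for NotificationContext.Type.
enum NotificationType { INIT, CALLBACK, FINALIZE }

// Hypothetical stand-in for the two HelixManager calls mentioned above.
interface Reconnectable {
    void disconnect();
    void connect() throws Exception;
}

// Treats every INIT after the first one as a sign that the session expired.
final class SecondInitDetector {
    private final Reconnectable manager;
    private final AtomicBoolean startupInitSeen = new AtomicBoolean(false);

    SecondInitDetector(Reconnectable manager) {
        this.manager = manager;
    }

    // Call from the controller-change callback with the notification type.
    // Returns true when a disconnect/connect cycle was performed.
    boolean onControllerChange(NotificationType type) {
        if (type != NotificationType.INIT) {
            return false; // ordinary callbacks are ignored
        }
        if (startupInitSeen.compareAndSet(false, true)) {
            return false; // first INIT: normal startup, nothing to do
        }
        // A later INIT: withdraw and reconnect the participant.
        manager.disconnect();
        try {
            manager.connect();
        } catch (Exception e) {
            throw new IllegalStateException("reconnect failed", e);
        }
        return true;
    }
}
```

The sketch does not cover whether the listener has to be registered again
after the reconnect, or whether the flag should be reset at that point; that
depends on how your Helix version handles listeners across disconnect, so
please check it before relying on this.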

While we're on the topic, I'd love to get a clear understanding of the
expected set of transitions that should occur when a ZK session expires and
a new one is created.

Cheers,
Santiago


On Sat, May 4, 2013 at 11:29 AM, kishore g <g.kishore@gmail.com> wrote:

> Hi Ming,
>
> I need some more details:
> 1. How long was the GC pause, and what is the session timeout in ZK?
>
> The behavior you are seeing is expected. Because the GC pause causes the
> ZooKeeper session to be lost, we call the transitions that take the
> partition back to the OFFLINE state.
>
> What behavior are you looking for when there is a long GC?
>
> a. You don't want to lose mastership? or
> b. It's OK to lose mastership, but you don't want to become master again?
>
> One question about your application: can it recover after a long GC
> pause?
>
> I don't think this is related to HELIX-79; in that case there were
> consecutive GCs, and I think we have a patch for that issue.
>
> Thanks,
> Kishore G
>
>
> On Sat, May 4, 2013 at 6:32 AM, Ming Fang <mingfang@mac.com> wrote:
>
>> We're experiencing a potential showstopper issue with how Helix
>> deals with very long GCs.
>> Our system uses the Master-Slave model.
>> A simple test runs just the Master under extreme load, causing
>> seconds of GC.
>> Under this long GC condition the Master gets transitioned to Slave and
>> then to Offline.
>> After the GC, we get transitioned back to Slave and then to Master.
>>
>> I found this Jira that may be related: HELIX-79
>> <https://issues.apache.org/jira/browse/HELIX-79>.
>> We're scheduled to go live with our system next week.
>> Are there any quick workarounds for this problem?
>>
>>
>>
>
