helix-user mailing list archives

From Ming Fang <mingf...@mac.com>
Subject Re: Long GC
Date Sat, 04 May 2013 15:40:45 GMT
Santiago

Thanks for the info. I will definitely explore your technique.

--ming

On May 4, 2013, at 11:32 AM, Santiago Perez <santip@santip.com.ar> wrote:

> Hi Ming,
> 
> We also have issues when long GC pauses cause ZK session expiration. In our use case we found a way to detect that the expiration had occurred: we register a ControllerChangeListener in HelixManager and catch a change notification of type INIT (NotificationContext.Type.INIT) after our service has already started (in other words, when we see it for the second time).
> 
> When we detect this situation, we call HelixManager.disconnect and then HelixManager.connect, essentially withdrawing and reconnecting the participant. This causes all the appropriate transitions to be triggered. I'm not sure whether this would help your use case, but at least it gives you a way to intercept this behavior and take the necessary measures to keep your cluster in shape.
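
A minimal sketch of the detection logic described above, for illustration only. It assumes the first INIT notification arrives at startup and that any later INIT signals a new ZK session. The `Type` enum, `Manager` interface, `ExpiryDetector`, and `CountingManager` are hypothetical stand-ins so the example compiles without Helix on the classpath; in a real participant the same logic would sit in a ControllerChangeListener callback and call HelixManager.disconnect and HelixManager.connect. Resetting the flag before reconnecting is also an assumption (that the reconnect delivers a fresh INIT).

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SessionExpiryReconnect {

    // Stand-in for NotificationContext.Type; only the values this sketch needs.
    enum Type { INIT, CALLBACK }

    // Stand-in for the two HelixManager calls mentioned in the message.
    interface Manager {
        void connect() throws Exception;
        void disconnect();
    }

    // Holds the logic that would sit in a ControllerChangeListener callback.
    static class ExpiryDetector {
        private final Manager manager;
        private final AtomicBoolean sawInitialInit = new AtomicBoolean(false);

        ExpiryDetector(Manager manager) {
            this.manager = manager;
        }

        // Returns true when a reconnect was triggered.
        boolean onControllerChange(Type type) throws Exception {
            if (type != Type.INIT) {
                return false; // ordinary callback, nothing to do
            }
            if (sawInitialInit.compareAndSet(false, true)) {
                return false; // first INIT: normal startup
            }
            // Second INIT after startup: treat it as a new ZK session.
            // Assumption: the reconnect produces a fresh INIT, so reset the flag first.
            sawInitialInit.set(false);
            manager.disconnect();
            manager.connect();
            return true;
        }
    }

    // Hypothetical test double that only counts calls.
    static class CountingManager implements Manager {
        int connects;
        int disconnects;
        @Override public void connect() { connects++; }
        @Override public void disconnect() { disconnects++; }
    }

    public static void main(String[] args) throws Exception {
        CountingManager manager = new CountingManager();
        ExpiryDetector detector = new ExpiryDetector(manager);
        detector.onControllerChange(Type.INIT);     // startup
        detector.onControllerChange(Type.CALLBACK); // regular change
        boolean reconnected = detector.onControllerChange(Type.INIT); // after expiry
        System.out.println("reconnected=" + reconnected
            + " disconnects=" + manager.disconnects
            + " connects=" + manager.connects);
    }
}
```

In a real deployment it may be safer to hand the disconnect/connect off to a separate thread rather than run it inside the notification callback; whether that is required depends on the Helix version in use.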
> 
> While we're on the topic, I'd love to get a clear understanding of the expected set of transitions that should occur when a ZK session expires and a new one is created.
> 
> Cheers,
> Santiago
> 
> 
> On Sat, May 4, 2013 at 11:29 AM, kishore g <g.kishore@gmail.com> wrote:
> Hi Ming,
> 
> I need some more details:
> 1. How long was the GC pause, and what is the session timeout in ZK?
> 
> The behavior you are seeing is expected. Because of the GC pause, the ZooKeeper session is lost, so we invoke the transitions that take the partition back to the OFFLINE state.
> 
> What behavior are you looking for when there is a long GC?
> 
> a. You don't want to lose mastership? or
> b. It's OK to lose mastership, but you don't want to become master again?
> 
> One question regarding your application: can it recover after a long GC pause?
> 
> I don't think this is related to HELIX-79; in that case there were consecutive GCs, and I think we have a patch for that issue.
> 
> Thanks,
> Kishore G
> 
> 
> On Sat, May 4, 2013 at 6:32 AM, Ming Fang <mingfang@mac.com> wrote:
> We're experiencing a potential showstopper issue with how Helix deals with very long GCs.
> Our system is using the Master Slave model.
> In a simple test, we run just the Master under extreme load, which causes GC pauses lasting seconds.
> Under this long-GC condition, the Master is transitioned to Slave and then to Offline.
> After the GC, we are transitioned back to Slave and then to Master.
> 
> I found a Jira that may be related: HELIX-79.
> We're scheduled to go live with our system next week.
> Are there any quick workarounds for this problem?
> 
> 
> 
> 

