helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Long GC
Date Sat, 04 May 2013 16:25:28 GMT
Hi Ming,

I don't see anything wrong with the design. What you need is the ability to
validate a few things before reconnecting to the cluster. We do invoke a
pre-connect callback before joining the cluster. You can validate for
consistency and refuse to join the cluster. You can also disable the node
if validation fails.
Will this work?
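Helix does expose a pre-connect callback hook on the participant manager. The sketch below is a minimal, self-contained approximation of the idea Kishore describes, not the real Helix API: the `PreConnectCallback` interface is declared locally to mirror Helix's shape, and the state-validity check is a hypothetical supplier you would replace with your own consistency check.

```java
// Sketch of pre-connect validation: before the participant rejoins the
// cluster, run a consistency check and refuse to join if it fails.
// The interface mirrors Helix's PreConnectCallback but is declared here
// so the sketch compiles standalone; the validity supplier is a
// hypothetical stand-in for a real local-state check.
interface PreConnectCallback {
    void onPreConnect();
}

class ConsistencyGuard implements PreConnectCallback {
    private final java.util.function.BooleanSupplier localStateValid;
    private boolean joined = false;

    ConsistencyGuard(java.util.function.BooleanSupplier localStateValid) {
        this.localStateValid = localStateValid;
    }

    @Override
    public void onPreConnect() {
        if (!localStateValid.getAsBoolean()) {
            // Refusing to join: in a real participant this is where you
            // would throw, or disable the instance via the admin API.
            throw new IllegalStateException("local state diverged; refusing to rejoin");
        }
        joined = true;
    }

    boolean hasJoined() { return joined; }
}
```

In a real participant you would register the callback before `connect()`, so the check runs on every (re)connect, including the reconnect after a session expiry.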
On May 4, 2013 9:03 AM, "Ming Fang" <mingfang@mac.com> wrote:

> Kishore
> I'm setting _sessionTimeout to 3 seconds.
> That's an aggressive number, but my application needs to detect failures
> quickly.
> I suppose taking the participant to OFFLINE is acceptable, but I can't have
> it flip back to MASTER.
> I didn't want to bore you with the details before but I think I need to
> explain my system more now.
> We are using Helix to manage a MASTER/SLAVE cluster using AUTO mode.
> AUTO mode enables us to place the MASTER and SLAVE on the correct hosts.
> We name the MASTER Node1 and the SLAVE Node2.
> The system processes a high rate of incoming events, thousands per second.
> Node1 consumes the events, generates internal state, and then replicates
> each event to Node2.
> Node2 consumes the events from Node1 and generates exactly the same
> internal state.
> When Node1 fails, we want Node2 to become the new MASTER and process
> incoming events.
> This means we cannot restart Node1, since Node2's state has moved
> beyond the failed MASTER's.
> We keep the failed Node1 down for the rest of the business day.
> Everything works as expected under ideal conditions.
> The problem we're experiencing with long GCs is that Node1 transitions to
> OFFLINE and then back to MASTER.
> This causes Node1 and Node2 to get out of sync.
> Ideally I'd find a general solution such that whenever Node2 becomes
> MASTER, it modifies the Ideal state so that Node1 can come back as SLAVE.
> This solution would address the Node1 failure issue, and I think it should
> fix the long GC issue too.
> Sorry for the long email.
> --ming
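Ming's ideal-state idea could be sketched roughly as below. This is illustrative only: a plain `Map` of preference lists stands in for Helix's `IdealState`/`HelixAdmin` API, and the `demote` helper is a hypothetical name. It relies on the semi-auto behavior that the head of a partition's preference list is the preferred MASTER, so moving the failed node to the tail lets it rejoin only as SLAVE.

```java
// Sketch: when Node2 takes over as MASTER, rewrite each partition's
// preference list so the failed node sits at the tail, i.e. it can
// come back only as SLAVE. The Map is a stand-in for IdealState.
import java.util.*;

class IdealStateEditor {
    // partition -> ordered preference list (head is the preferred MASTER)
    static Map<String, List<String>> demote(Map<String, List<String>> idealState,
                                            String failedNode) {
        Map<String, List<String>> updated = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : idealState.entrySet()) {
            List<String> prefs = new ArrayList<>(e.getValue());
            if (prefs.remove(failedNode)) {
                prefs.add(failedNode); // failed node moves to the tail -> SLAVE
            }
            updated.put(e.getKey(), prefs);
        }
        return updated;
    }
}
```

A real implementation would read the resource's `IdealState` through the admin API, reorder the lists, and write it back inside the SLAVE-to-MASTER transition handler.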
> On May 4, 2013, at 10:29 AM, kishore g <g.kishore@gmail.com> wrote:
> Hi Ming,
> I need some more details:
> 1. How long was the GC pause, and what is the session timeout in ZK?
> The behavior you are seeing is expected: because of the GC we lose the
> ZooKeeper session, so we invoke the transitions that take the partition
> back to the OFFLINE state.
> What is the behavior you are looking for when there is a GC?
> a. You don't want to lose mastership? or
> b. It's OK to lose mastership, but you don't want to become master again?
> One question regarding your application: is it possible for your
> application to recover after a long GC pause?
> I don't think this is related to HELIX-79; in that case there were
> consecutive GCs, and I think we have a patch for that issue.
> Thanks,
> Kishore G
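The flapping sequence Kishore describes can be modeled as a toy, purely for illustration: it assumes a GC pause longer than the session timeout always expires the ZooKeeper session, and the timings and transition names are illustrative, not Helix defaults.

```java
// Toy model of the failure mode: a GC pause longer than the session
// timeout expires the session, the partition drops to OFFLINE, and on
// reconnect the controller promotes the same node back to MASTER --
// producing the OFFLINE -> SLAVE -> MASTER flapping Ming observes.
import java.util.*;

class SessionExpiryModel {
    static List<String> statesAfterGc(long gcPauseMs, long sessionTimeoutMs) {
        List<String> transitions = new ArrayList<>();
        if (gcPauseMs > sessionTimeoutMs) {
            // session expired: controller revokes mastership
            transitions.add("MASTER->SLAVE");
            transitions.add("SLAVE->OFFLINE");
            // after reconnect, the same node is promoted again
            transitions.add("OFFLINE->SLAVE");
            transitions.add("SLAVE->MASTER");
        }
        return transitions;
    }
}
```

With a 3-second session timeout, any GC pause over 3 seconds triggers the full down-and-back-up cycle, which is why the aggressive timeout makes the problem easy to reproduce under load.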
> On Sat, May 4, 2013 at 6:32 AM, Ming Fang <mingfang@mac.com> wrote:
>> We're experiencing a potential showstopper issue with how Helix deals
>> with very long GCs.
>> Our system is using the Master Slave model.
>> A simple test running just the Master under extreme load causes GC
>> pauses of several seconds.
>> Under long GC conditions the Master gets transitioned to Slave and then
>> to Offline.
>> After the GC, it gets transitioned back to Slave and then to Master.
>> I found a Jira issue that may be related: HELIX-79<https://issues.apache.org/jira/browse/HELIX-79>.
>> We're scheduled to go live with our system next week.
>> Are there any quick workarounds for this problem?
