helix-user mailing list archives

From Ming Fang <mingf...@mac.com>
Subject Re: Long GC
Date Sat, 04 May 2013 20:11:51 GMT
Yes that's perfect.
I'll try implementing that over the weekend.
Would you happen to have an example?
Thanks. 

Sent from my iPad

On May 4, 2013, at 12:25 PM, kishore g <g.kishore@gmail.com> wrote:

> Hi Ming
> 
> I don't see anything wrong with the design. What you need is the ability to validate a few
> things before reconnecting to the cluster. We do invoke a pre-connect callback before joining the
> cluster; you can validate for consistency and refuse to join the cluster. You can also disable
> the node if the validation fails.
> Will this work?
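> Something roughly like this (an untested sketch; the cluster name, ZK address, instance name,
> and the consistency check are placeholders for your own values and logic; everything is in
> org.apache.helix, with ZKHelixAdmin in org.apache.helix.manager.zk):
> 
>   HelixManager manager = HelixManagerFactory.getZKHelixManager(
>       "MY_CLUSTER", "Node1", InstanceType.PARTICIPANT, "zkhost:2181");
> 
>   // Runs every time the participant is about to (re)join the cluster,
>   // including after a ZK session expiry caused by a long GC pause.
>   manager.addPreConnectCallback(new PreConnectCallback() {
>     @Override
>     public void onPreConnect() {
>       if (!localStateIsConsistent()) {  // your application-specific validation (placeholder)
>         // Disable the instance so the controller keeps its partitions OFFLINE,
>         // then abort the reconnect (verify the abort behavior against your Helix version).
>         HelixAdmin admin = new ZKHelixAdmin("zkhost:2181");
>         admin.enableInstance("MY_CLUSTER", "Node1", false);
>         throw new HelixException("Local state is stale; refusing to rejoin");
>       }
>     }
>   });
> 
>   manager.connect();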
> 
> On May 4, 2013 9:03 AM, "Ming Fang" <mingfang@mac.com> wrote:
>> Kishore
>> 
>> I'm setting _sessionTimeout to 3 seconds.
>> That's an aggressive number, but my application needs to detect failures quickly.
>> I suppose taking the participant to OFFLINE is acceptable, but I can't have it flip
>> back to MASTER.
>> 
>> I didn't want to bore you with the details before, but I think I need to explain my
>> system more now.
>> We are using Helix to manage a MASTER/SLAVE cluster in AUTO mode.
>> AUTO mode enables us to place the MASTER and SLAVE on the correct hosts.
>> We name the MASTER Node1 and the SLAVE Node2.
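>> For reference, the placement looks roughly like this (a sketch only; the cluster name,
>> ZK address, and resource/partition names are stand-ins for our real ones):
>> 
>>   HelixAdmin admin = new ZKHelixAdmin("zkhost:2181");
>>   // One partition, MasterSlave state model, AUTO ideal-state mode.
>>   admin.addResource("MY_CLUSTER", "eventStream", 1, "MasterSlave", "AUTO");
>> 
>>   IdealState idealState = admin.getResourceIdealState("MY_CLUSTER", "eventStream");
>>   idealState.setReplicas("2");
>>   // In AUTO mode the preference list pins placement: the first instance is
>>   // preferred as MASTER, the second as SLAVE.
>>   idealState.setPreferenceList("eventStream_0", Arrays.asList("Node1", "Node2"));
>>   admin.setResourceIdealState("MY_CLUSTER", "eventStream", idealState);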
>> 
>> The system processes a high rate of incoming events, thousands per second.
>> Node1 consumes the events, generates internal state, and then replicates each event to
>> Node2.
>> Node2 consumes the events from Node1 and generates exactly the same internal state.
>> 
>> When Node1 fails, we want Node2 to become the new MASTER and process the incoming events.
>> This means we cannot restart Node1, since Node2's state has moved beyond the failed
>> MASTER's.
>> We keep the failed Node1 down for the rest of the business day.
>> Everything works as expected under ideal conditions.
>> 
>> The problem we're experiencing with long GCs is that Node1 transitions to OFFLINE
>> and then back to MASTER.
>> This causes Node1 and Node2 to get out of sync.
>> 
>> Ideally I could find a general solution such that whenever Node2 becomes MASTER, it
>> modifies the IdealState so that Node1 comes back as a SLAVE.
>> That solution would address the Node1 failure issue, and I think it should fix the long
>> GC issue too.
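>> Something like the following is what I have in mind, hooked into Node2's SLAVE->MASTER
>> transition callback (just a sketch, not tried yet; cluster, resource, and instance names
>> are placeholders):
>> 
>>   @Transition(to = "MASTER", from = "SLAVE")
>>   public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
>>     HelixManager manager = context.getManager();
>>     HelixAdmin admin = manager.getClusterManagmentTool();
>>     String cluster = manager.getClusterName();
>>     String resource = message.getResourceName();
>>     String partition = message.getPartitionName();
>> 
>>     // Flip the preference list so a restarted Node1 can only come back behind
>>     // Node2, i.e. as SLAVE.
>>     IdealState idealState = admin.getResourceIdealState(cluster, resource);
>>     idealState.setPreferenceList(partition, Arrays.asList("Node2", "Node1"));
>>     admin.setResourceIdealState(cluster, resource, idealState);
>>   }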
>> Sorry for the long email.
>> 
>> --ming 
>> 
>> 
>> 
>> On May 4, 2013, at 10:29 AM, kishore g <g.kishore@gmail.com> wrote:
>> 
>>> Hi Ming,
>>> 
>>> Need some more details:
>>> 1. How long was the GC, and what is the session timeout in ZK?
>>> 
>>> The behavior you are seeing is expected. What is happening is that, due to the GC pause, the
>>> ZooKeeper session is lost, so we invoke the transitions that take the partition back to the OFFLINE state.
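>>> Concretely, those are the callbacks Helix invokes on your MasterSlave state model,
>>> something like this (a sketch; the class name is a placeholder):
>>> 
>>>   @StateModelInfo(initialState = "OFFLINE", states = { "MASTER", "SLAVE", "OFFLINE" })
>>>   public class MyMasterSlaveModel extends StateModel {
>>>     // On session expiry the partition is walked down: MASTER -> SLAVE -> OFFLINE.
>>>     @Transition(to = "SLAVE", from = "MASTER")
>>>     public void onBecomeSlaveFromMaster(Message msg, NotificationContext ctx) { /* ... */ }
>>> 
>>>     @Transition(to = "OFFLINE", from = "SLAVE")
>>>     public void onBecomeOfflineFromSlave(Message msg, NotificationContext ctx) { /* ... */ }
>>> 
>>>     // After the session is re-established it is walked back up again.
>>>     @Transition(to = "SLAVE", from = "OFFLINE")
>>>     public void onBecomeSlaveFromOffline(Message msg, NotificationContext ctx) { /* ... */ }
>>> 
>>>     @Transition(to = "MASTER", from = "SLAVE")
>>>     public void onBecomeMasterFromSlave(Message msg, NotificationContext ctx) { /* ... */ }
>>>   }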
>>> 
>>> What is the behavior you are looking for when there is a GC pause?
>>> 
>>> a. You don't want to lose mastership? or
>>> b. It's OK to lose mastership, but you don't want to become master again?
>>> 
>>> One question regarding your application: is it possible for your application to
>>> recover after a long GC pause?
>>> 
>>> I don't think this is related to HELIX-79; in that case there were consecutive GCs,
>>> and I think we have a patch for that issue.
>>> 
>>> Thanks,
>>> Kishore G
>>> 
>>> 
>>> On Sat, May 4, 2013 at 6:32 AM, Ming Fang <mingfang@mac.com> wrote:
>>>> We're experiencing a potential showstopper issue with how Helix deals
>>>> with very long GCs.
>>>> Our system is using the Master/Slave model.
>>>> A simple test is to run just the Master under extreme load, causing several seconds
>>>> of GC.
>>>> Under that long GC condition the Master gets transitioned to Slave and then to Offline.
>>>> After the GC, it gets transitioned back to Slave and then to Master.
>>>> 
>>>> I found a Jira that may be related: HELIX-79.
>>>> We're scheduled to go live with our system next week.
>>>> Are there any quick workarounds for this problem?
