helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Potential bug in manual partition placement
Date Fri, 22 Feb 2013 06:12:45 GMT
Hi Ming,

It is easier to understand if you look at the transition order in the first
email you sent.
localhost_12000 transitioning from OFFLINE to SLAVE for MyResource_0
localhost_12002 transitioning from OFFLINE to SLAVE for MyResource_1
localhost_12000 transitioning from OFFLINE to SLAVE for MyResource_1
localhost_12002 transitioning from OFFLINE to SLAVE for MyResource_0

Notice that at this point Helix has not sent any OFFLINE to SLAVE
transition to localhost_12001. This is because you have set the constraint
that the maximum number of nodes that can be in the SLAVE state is 2
(replicas=2) in the state model definition.

For MyResource_1, since localhost_12002 and localhost_12000 are already
slaves, localhost_12001 can never become a slave, since that would violate
the constraint of at most 2 slaves. Since it cannot become a slave, it
cannot become master.

For MyResource_0, you can see that Helix first made localhost_12000 the
master, which freed a slave slot, and hence it could send a message to
localhost_12001 to become a slave.

localhost_12000 transitioning from SLAVE to MASTER for MyResource_0
localhost_12001 transitioning from OFFLINE to SLAVE for MyResource_0

HELIX-50 fixes this by replacing the random selection of nodes with sorting
the messages based on the preference list.
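
The reasoning above can be replayed with a small toy model. This is only a sketch and not Helix code: the function name simulate, its parameters, and the rule "promote the first node in the preference list once it is a slave" are assumptions made for illustration. It caps the number of slaves at max_slaves and delivers OFFLINE to SLAVE messages in a given order.

```python
def simulate(preference_list, message_order, max_slaves):
    """Toy model (not the Helix controller): replay transitions under a slave cap.

    preference_list: nodes in preference order; the first is the intended master.
    message_order:   order in which OFFLINE -> SLAVE messages are attempted.
    max_slaves:      upper bound on nodes in SLAVE state (the replicas setting).
    """
    state = {node: "OFFLINE" for node in preference_list}
    progressed = True
    while progressed:
        progressed = False
        # Try OFFLINE -> SLAVE in message order, but only while under the cap.
        for node in message_order:
            slaves = sum(1 for s in state.values() if s == "SLAVE")
            if state[node] == "OFFLINE" and slaves < max_slaves:
                state[node] = "SLAVE"
                progressed = True
        # Assumed rule: only the first node in the preference list is promoted,
        # and only once it has reached SLAVE.
        preferred = preference_list[0]
        if state[preferred] == "SLAVE" and "MASTER" not in state.values():
            state[preferred] = "MASTER"
            progressed = True
    return state


# MyResource_1 with the message order seen in the log: the preferred master
# localhost_12001 is attempted last and never gets a slave slot.
stuck = simulate(
    ["localhost_12001", "localhost_12000", "localhost_12002"],
    ["localhost_12002", "localhost_12000", "localhost_12001"],
    max_slaves=2,
)

# Same partition with messages sorted by the preference list.
sorted_order = simulate(
    ["localhost_12001", "localhost_12000", "localhost_12002"],
    ["localhost_12001", "localhost_12000", "localhost_12002"],
    max_slaves=2,
)
print(stuck)
print(sorted_order)
```

In this model the first call leaves localhost_12001 OFFLINE with no master, matching the cluster state in the original report, while the second call ends with localhost_12001 as MASTER and the other two nodes as SLAVE.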

Thanks,
Kishore G


On Thu, Feb 21, 2013 at 4:42 PM, Zhen Zhang <nehzgnahz@gmail.com> wrote:

> Hi Ming, thanks for the feedback. With REPLICAS set to 2, the behavior is
> random: the Helix controller will pick any two of the hosts in the
> preference list and do the transitions. In your case it happens to
> work fine. We have updated the jira accordingly and will fix it soon.
> https://issues.apache.org/jira/browse/HELIX-50
>
> Thanks,
> Zhen
>
> On Thu, Feb 21, 2013 at 4:34 PM, Ming Fang <mingfang@mac.com> wrote:
>
>> Thanks for pointing that out.
>> It does work as expected after I set REPLICAS to 3.
>>
>> But the strange thing is even with REPLICAS set to 2 and placement
>> configured as below, everything works.
>>         "MyResource_0" : [ "localhost_12000", "localhost_12001",
>> "localhost_12002" ],
>>         "MyResource_1" : [ "localhost_12000", "localhost_12001",
>> "localhost_12002" ]
>>
>> On Feb 20, 2013, at 2:02 AM, kishore g <g.kishore@gmail.com> wrote:
>>
>>
>> https://github.com/mingfang/apache-helix/blob/master/helix-core/src/main/resources/manual.json
>> has replicas set to 2 but the preference list for each partition is of
>> size 3.
>> If you set the number of REPLICAS to 3, it should work.
>>
>> We do some validation of the idealstate, but we don't validate that the
>> number of replicas is the same as the preference list size for all
>> partitions. Created JIRA https://issues.apache.org/jira/browse/HELIX-50
>>
>>
>> Thanks,
>> Kishore G
>>
>> On Tue, Feb 19, 2013 at 7:08 PM, Ming Fang <mingfang@mac.com> wrote:
>>
>>> I've "repurposed" the Quickstart example in an attempt to implement
>>> manual placement of partitions.
>>> I'm using a JSON file, and the relevant section is below
>>>
>>>         "MyResource_0" : [ "localhost_12000", "localhost_12001",
>>> "localhost_12002" ],
>>>         "MyResource_1" : [ "localhost_12001", "localhost_12000",
>>> "localhost_12002" ]
>>>
>>> The goal is to make _12000 the MASTER for MyResource_0 and _12001 the
>>> MASTER of MyResource_1.
>>> The last instance, _12002 will serve as the last resort backup for both
>>> partitions in the event the other two died.
>>> This is a small example of what I was hoping to implement as part of a
>>> larger system.
>>>
>>> You may run the example here
>>>
>>> https://github.com/mingfang/apache-helix/blob/master/helix-core/src/main/java/org/apache/helix/examples/ManualPlacementTest.java
>>>
>>> using the JSON file here
>>>
>>> https://github.com/mingfang/apache-helix/blob/master/helix-core/src/main/resources/manual.json
>>>
>>> The problem is when I run this, the output looks like this
>>>
>>> STARTING Zookeeper at localhost:2199
>>> Creating cluster: HELIX_QUICKSTART
>>> Adding 3 participants to the cluster
>>>          Added participant: localhost_12000
>>>          Added participant: localhost_12001
>>>          Added participant: localhost_12002
>>> Starting Participants
>>>          Started Participant: localhost_12000
>>>          Started Participant: localhost_12001
>>>          Started Participant: localhost_12002
>>> Starting Helix Controller
>>> localhost_12000 transitioning from OFFLINE to SLAVE for MyResource_0
>>> localhost_12002 transitioning from OFFLINE to SLAVE for MyResource_1
>>> localhost_12000 transitioning from OFFLINE to SLAVE for MyResource_1
>>> localhost_12002 transitioning from OFFLINE to SLAVE for MyResource_0
>>> localhost_12000 transitioning from SLAVE to MASTER for MyResource_0
>>> localhost_12001 transitioning from OFFLINE to SLAVE for MyResource_0
>>> CLUSTER STATE: After starting 3 nodes
>>>                 localhost_12000 localhost_12001 localhost_12002
>>>         MyResource_0    M               S               S
>>>         MyResource_1    S               -               S
>>> ###################################################################
>>>
>>> Notice there is no MASTER for MyResource_1.
>>> I've been trying to debug this for a day now with no success.
>>>
>>> Did I stumble onto an actual bug?
>>
>>
>>
>>
>
