stratos-dev mailing list archives

From: Reka Thirunavukkarasu <r...@wso2.com>
Subject: Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?
Date: Sun, 29 Jun 2014 17:02:08 GMT
Hi

On Sun, Jun 29, 2014 at 9:28 PM, Lakmal Warusawithana <lakmal@wso2.com>
wrote:

> Hi Reka,
>
> We can double-commit these into the 4.0.0 branch and master, and will do a
> 4.0.1 minor release with these fixes. I would also like to suggest some UX
> improvements for the 4.0.1 release. I had some offline discussions with
> several folks and will send some suggestions on UX improvements, with the
> user stories, in a separate thread.
>

+1 for the 4.0.1 release with all the minor fixes and UI improvements. I will
then commit the fixes to the 4.0.0 branch as well.

Thanks,
Reka

>
> thanks
>
>
> On Sun, Jun 29, 2014 at 9:05 PM, Reka Thirunavukkarasu <reka@wso2.com>
> wrote:
>
>> Hi Cris,
>>
>>
>> On Sat, Jun 28, 2014 at 11:54 AM, chris snow <chsnow123@gmail.com> wrote:
>>
>>> Hi Reka, will this fix also need to get applied to 4.0.0?
>>>
>> Yes. As Isuru mentioned, we can apply it as a patch to 4.0.0. The issue only
>> appears when you publish events to BAM from the cloud controller and when
>> you unsubscribe from an instance. I will create a patch from the 4.0.0
>> branch with the fix and update the JIRA with the patch.
>>
>> Thanks,
>> Reka
>>
>>
>>>  On 26 Jun 2014 06:43, "Reka Thirunavukkarasu" <reka@wso2.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> On Wed, Jun 25, 2014 at 11:44 PM, Nirmal Fernando <
>>>> nirmal070125@gmail.com> wrote:
>>>>
>>>>>
>>>>> On Wed, Jun 25, 2014 at 10:51 PM, Imesh Gunaratne <imesh@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Michiel,
>>>>>>
>>>>>> As Reka has pointed out, there is a potential issue in the
>>>>>> CloudControllerServiceImpl class. It seems like the cloud controller is
>>>>>> retrieving its state from the registry in the CloudControllerServiceImpl
>>>>>> constructor, and the constructor is being invoked in two places other
>>>>>> than where it's expected to be:
>>>>>>
>>>>>> [inline screenshot]
>>>>>>
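[For illustration only, a minimal sketch of the pattern Imesh describes above, with stand-in types (Registry and PersistedState below are placeholders, not the actual Stratos classes): because the constructor restores persisted state, any code path that constructs CloudControllerServiceImpl just to reach its data re-runs that restore and can overwrite newer in-memory state.]

class Registry {
    PersistedState retrieveState() {
        // reads the serialized cloud controller state back from the registry
        return new PersistedState();
    }
}

class PersistedState { /* cartridge/member data persisted at runtime */ }

class CloudControllerServiceImpl {
    private final PersistedState state;

    CloudControllerServiceImpl(Registry registry) {
        // state restoration is intended to happen once, at service start-up
        this.state = registry.retrieveState();
    }
}

class UsagePublisher {
    // Problematic call site: constructing the service only to read cartridge
    // data triggers another registry read, clobbering current runtime state.
    void publishToBam(Registry registry) {
        CloudControllerServiceImpl cc = new CloudControllerServiceImpl(registry);
        // ... publish using cc ...
    }
}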
>>>>> This was a bug we identified recently; someone made this commit without
>>>>> properly analyzing the way CC is implemented. :-(
>>>>>
>>>>> AFAIK Reka has already filed a JIRA and is on her way to removing that
>>>>> broken logic.
>>>>>
>>>> I have fixed this issue in master and updated the JIRA (STRATOS-685). I
>>>> have removed the CloudControllerServiceImpl initialization that was used in
>>>> the cloud controller when publishing events to BAM and when terminating an
>>>> instance on behalf of the MemberReadyToShutdownEvent.
>>>>
>>>> The fix was to get the relevant cartridge information from
>>>> FasterLookupDataHolder when publishing events to BAM, instead of getting it
>>>> in the buggy way as before, and to handle the instance termination via the
>>>> Autoscaler on behalf of the MemberReadyToShutdownEvent instead of CC itself
>>>> terminating the member. I think this is a good approach, as the Autoscaler
>>>> is the one that requests starting or terminating the member in all
>>>> scenarios.
>>>>
>>>> Thanks,
>>>> Reka
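[As a rough, non-authoritative sketch of the direction Reka describes above (CartridgeInfoStore, BamPublisher and TerminationRequester are stand-in names, not the actual Stratos APIs): read cartridge details from the shared in-memory holder when publishing to BAM, and let the Autoscaler drive termination when a MemberReadyToShutdownEvent arrives.]

interface CartridgeInfoStore { CartridgeInfo getCartridge(String cartridgeType); }
interface BamPublisher { void publish(String memberId, CartridgeInfo info); }
interface TerminationRequester { void terminateMember(String memberId); }
class CartridgeInfo { /* cartridge metadata kept in memory */ }

class BamUsagePublisher {
    private final CartridgeInfoStore store;   // e.g. the shared data holder
    private final BamPublisher publisher;

    BamUsagePublisher(CartridgeInfoStore store, BamPublisher publisher) {
        this.store = store;
        this.publisher = publisher;
    }

    void publishMemberUsage(String cartridgeType, String memberId) {
        // no service re-initialization; just look up the cached cartridge data
        publisher.publish(memberId, store.getCartridge(cartridgeType));
    }
}

class MemberReadyToShutdownHandler {
    private final TerminationRequester autoscaler;

    MemberReadyToShutdownHandler(TerminationRequester autoscaler) {
        this.autoscaler = autoscaler;
    }

    void onMemberReadyToShutdown(String memberId) {
        // the Autoscaler, not the cloud controller, issues the termination,
        // keeping start/terminate decisions in a single component
        autoscaler.terminateMember(memberId);
    }
}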
>>>>
>>>>>> However, the above logic does not retrieve the topology from the
>>>>>> registry. It is being retrieved by the Topology Manager:
>>>>>>
>>>>>> [inline screenshot]
>>>>>>
>>>>>> Therefore the above issue may have very little effect on the problem you
>>>>>> have noticed. However, I wonder whether we have an issue with the
>>>>>> Autoscaler refreshing its state once restarted.
>>>>>>
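[For context, components typically read the shared topology through the Topology Manager rather than from the registry; a sketch of that access pattern follows. The class and method names are recalled from the Stratos 4.0.x messaging module and may differ slightly.]

import org.apache.stratos.messaging.domain.topology.Topology;
import org.apache.stratos.messaging.message.receiver.topology.TopologyManager;

public class TopologyReader {
    // Counts the members the in-memory topology currently knows about for a
    // cluster; the topology itself is rebuilt from topology events, not from
    // the CloudControllerServiceImpl constructor.
    public int countMembers(String serviceName, String clusterId) {
        try {
            TopologyManager.acquireReadLock();
            Topology topology = TopologyManager.getTopology();
            return topology.getService(serviceName)
                           .getCluster(clusterId)
                           .getMembers().size();
        } finally {
            TopologyManager.releaseReadLock();
        }
    }
}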
>>>>>> Just to narrow down the cause of this issue, would you be able to list
>>>>>> the actions that you carried out from the very beginning, please? Then
>>>>>> we could try to reproduce this problem by going through them.
>>>>>>
>>>>>> Many Thanks
>>>>>> Imesh
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <
>>>>>> mblokzij@cisco.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Basically, I was stopping and starting Stratos and looking at how it
>>>>>>> handled dying cartridges, and found that Stratos only detected
>>>>>>> cartridge deaths while it was running.
>>>>>>>
>>>>>>> *The problem*
>>>>>>> In steady state, I have some cartridges managed by Stratos,
>>>>>>>
>>>>>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>>>>>> | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Active | 1 | cisco-sample-vm.foo.cisco.com |
>>>>>>>
>>>>>>> nova list | grep samp
>>>>>>> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None | Running | core=172.16.2.17, 10.86.205.231 |
>>>>>>>
>>>>>>> All good. Now I stop Stratos and ActiveMQ, 'nova delete' the sample
>>>>>>> cartridge, and then start ActiveMQ and Stratos again.
>>>>>>>
>>>>>>> Now, at first, things look good:
>>>>>>>
>>>>>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>>>>>> | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Inactive | 0 | cisco-sample-vm.foo.cisco.com |
>>>>>>>
>>>>>>> But then,
>>>>>>>
>>>>>>> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh list-subscribed-cartridges | grep samp
>>>>>>> | cisco-sample-vm | cisco-sample-vm | 1 | Single-Tenant | cisco-sample-vm | Active | 1 | cisco-sample-vm.foo.cisco.com |
>>>>>>>
>>>>>>> # nova list | grep samp
>>>>>>> #
>>>>>>>
>>>>>>> How did the cartridge become active without it actually being there?
>>>>>>> As far as I can tell, Stratos never recovers from this.
>>>>>>>
>>>>>>> I found this bug here:
>>>>>>> https://issues.apache.org/jira/browse/STRATOS-234 - is this describing
>>>>>>> the issue I’m seeing? I was a little bit confused by the usage of the
>>>>>>> word “obsolete”.
>>>>>>>
>>>>>>> *Where to go next?*
>>>>>>> Now, I’ve done a little bit of digging, but I don’t yet have a full
>>>>>>> mental model of how everything fits together in Stratos - please could
>>>>>>> someone help me put the pieces together? :)
>>>>>>>
>>>>>>> What I’m seeing is the following:
>>>>>>> - The cluster monitor appears to be active:
>>>>>>>
>>>>>>> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG
>>>>>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor} - Cluster monitor
>>>>>>> is running.. ClusterMonitor [clusterId=cisco-sample-vm.cisco-sample-v,
>>>>>>> serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1
>>>>>>> [partitions]
>>>>>>> [org.apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0],
>>>>>>> autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null,
>>>>>>> description=null], lbReferenceType=null]
>>>>>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
>>>>>>>
>>>>>>> - It looks like the CEP FaultHandlingWindowProcessor usually detects
>>>>>>> inactive members. However, since this member was never active, the
>>>>>>> timeStampMap doesn’t contain an element for this member, so it’s never
>>>>>>> checked.
>>>>>>> - I think the fault handling is triggered by a fault_message, but I
>>>>>>> didn’t manage to figure out where it’s coming from. Does anyone know
>>>>>>> what triggers it? (is it the CEP extension?)
>>>>>>>
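[A simplified sketch of the behaviour Michiel describes in the first bullet above; this is not the actual CEP extension code, and the names below are stand-ins. Faults are raised from a map of last-seen health-event timestamps, so a member that never reported in has no entry and is never flagged.]

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FaultDetectionSketch {
    private static final long TIMEOUT_MS = 60_000;
    private final Map<String, Long> timeStampMap = new ConcurrentHashMap<>();

    // called whenever a health statistics event arrives for a member
    void onHealthEvent(String memberId) {
        timeStampMap.put(memberId, System.currentTimeMillis());
    }

    // periodic check: only members already present in the map are examined,
    // so an instance deleted before it ever reported in is invisible here
    void checkForFaults(FaultPublisher publisher) {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> entry : timeStampMap.entrySet()) {
            if (now - entry.getValue() > TIMEOUT_MS) {
                publisher.publishMemberFault(entry.getKey());
            }
        }
    }

    interface FaultPublisher { void publishMemberFault(String memberId); }
}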
>>>>>>> Anyway..
>>>>>>>
>>>>>>> *Questions*
>>>>>>> - How should Stratos detect after some downtime which cartridges are
>>>>>>> still there and which ones aren’t? (What was the intended design?)
>>>>>>> - Why did the missing cartridge go “active”? Is this a result of
>>>>>>> restoring persistent state? (If I look in the registry I can see stuff
>>>>>>> under subscriptions/active, but I’m not sure if that’s where it comes
>>>>>>> from.)
>>>>>>> - Who should be responsible for detecting the absence of an instance -
>>>>>>> the ClusterMonitor? That seems to be fed incorrect data, since it
>>>>>>> clearly thinks there are enough instances running. Which component has
>>>>>>> the necessary data?
>>>>>>> - It looks like it’s possible to snapshot CEP state
>>>>>>> <http://stackoverflow.com/questions/20348326/wso2-cep-siddhi-how-to-make-time-windows-persistent>
>>>>>>> to make it semi-persistent. However, if I restarted Stratos after 2 min
>>>>>>> of downtime, wouldn’t it try to kill all the nodes, since the last reply
>>>>>>> was more than 60s ago? Also, snapshots would be periodic, so there’s
>>>>>>> still a window in which cartridges might “disappear”.
>>>>>>>
>>>>>>> Thanks a lot and best regards!
>>>>>>>
>>>>>>> Michiel
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Imesh Gunaratne
>>>>>>
>>>>>> Technical Lead, WSO2
>>>>>> Committer & PPMC Member, Apache Stratos
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Nirmal
>>>>>
>>>>> Nirmal Fernando.
>>>>> PPMC Member & Committer of Apache Stratos,
>>>>> Senior Software Engineer, WSO2 Inc.
>>>>>
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Reka Thirunavukkarasu
>>>> Senior Software Engineer,
>>>> WSO2, Inc.:http://wso2.com,
>>>> Mobile: +94776442007
>>>>
>>>>
>>>>
>>
>>
>> --
>> Reka Thirunavukkarasu
>> Senior Software Engineer,
>> WSO2, Inc.:http://wso2.com,
>> Mobile: +94776442007
>>
>>
>>
>
>
> --
> Lakmal Warusawithana
> Vice President, Apache Stratos
> Director - Cloud Architecture; WSO2 Inc.
> Mobile : +94714289692
> Blog : http://lakmalsview.blogspot.com/
>
>


-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.: http://wso2.com
Mobile: +94776442007
