stratos-dev mailing list archives

From "Michiel Blokzijl (mblokzij)" <mblok...@cisco.com>
Subject Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?
Date Fri, 04 Jul 2014 08:53:31 GMT
Hi all,

Apologies for the radio silence since my initial email; I’ve been very busy. :(

Thank you Reka for your detailed explanations, I now have a much better understanding of how
it’s supposed to work!

I’m not actually using the BAM (yet)*, so STRATOS-685 shouldn’t affect me, right? Even
if it doesn’t affect me I think the fix would still be nice to have in the 4.0.0 branch.

> Just to narrow down the cause of this issue, will you be able to list down the actions
that you carried out from the very beginning please? Then we could try to re-produce this
problem by going through them.

I’ve attached an annotated log of the steps I’ve taken to reproduce the issue.

I think there’s still an issue in this area, since I’m hitting it without using the BAM.
I could try Reka’s suggestion of enabling CEP persistence, but I suspect that, since
restarting Stratos takes more than 1 min, the fault handler will think that ALL cartridges
are inactive and kill them all. Does anyone know if this is the right documentation for setting
up CEP snapshotting?
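
To make the concern concrete, here is a minimal sketch of it. This is not Stratos or CEP code: the class, the method names and the member IDs are hypothetical, and the 60 s timeout is only assumed from the figures mentioned in this thread. It compares a naive check against persisted heartbeat times with one that measures silence from the moment the detector came back up.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class RestartAwareFaultCheck {
    // Assumed member timeout (hypothetical value, based on the ~1 min figure in this thread).
    static final long TIMEOUT_MS = 60_000;

    // Naive check: compares each persisted last-heartbeat time directly against "now".
    public static Set<String> naiveFaulty(Map<String, Long> lastSeen, long now) {
        Set<String> faulty = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                faulty.add(e.getKey());
            }
        }
        return faulty;
    }

    // Restart-aware check: silence is measured from the later of the last
    // heartbeat and the time the detector was (re)started.
    public static Set<String> restartAwareFaulty(Map<String, Long> lastSeen,
                                                 long startedAt, long now) {
        Set<String> faulty = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            long reference = Math.max(e.getValue(), startedAt);
            if (now - reference > TIMEOUT_MS) {
                faulty.add(e.getKey());
            }
        }
        return faulty;
    }

    public static void main(String[] args) {
        // Heartbeats persisted just before shutdown (times in ms).
        Map<String, Long> lastSeen = new TreeMap<>();
        lastSeen.put("member-a", 9_000L);
        lastSeen.put("member-b", 10_000L);
        long startedAt = 130_000L; // detector back up after roughly 2 min of downtime

        // Right after restart the naive check flags every member.
        System.out.println(naiveFaulty(lastSeen, 131_000L));                   // [member-a, member-b]
        System.out.println(restartAwareFaulty(lastSeen, startedAt, 131_000L)); // []

        // member-a reports again, member-b stays silent for a full timeout.
        lastSeen.put("member-a", 140_000L);
        System.out.println(restartAwareFaulty(lastSeen, startedAt, 191_000L)); // [member-b]
    }
}
```

With the restart-aware variant, a member that really disappeared during the downtime is still flagged, just one timeout after the restart rather than immediately.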

*: The <BamServerURL> is commented out in <stratos>/repository/conf/carbon.xml.

Best regards,

Michiel



On 29 Jun 2014, at 18:02, Reka Thirunavukkarasu <reka@wso2.com> wrote:

> Hi
> 
> On Sun, Jun 29, 2014 at 9:28 PM, Lakmal Warusawithana <lakmal@wso2.com> wrote:
> Hi Reka,
> 
> We can double commit these into the 4.0.0 branch and master, and will do a 4.0.1 minor release
with these fixes. I would also like to suggest some UX improvements for the 4.0.1 release. I had
some offline discussions with several folks and will send some suggestions on UX improvements,
with the user stories, in a separate thread.
>  
> +1 for the 4.0.1 release with all the minor fixes and UI improvements. We will then commit
the fixes to the 4.0.0 branch as well.
> 
> Thanks,
> Reka
> 
> thanks 
> 
> 
> On Sun, Jun 29, 2014 at 9:05 PM, Reka Thirunavukkarasu <reka@wso2.com> wrote:
> Hi Chris,
> 
> 
> On Sat, Jun 28, 2014 at 11:54 AM, chris snow <chsnow123@gmail.com> wrote:
> Hi Reka, will this fix also need to get applied to 4.0.0?
> 
> Yes. As Isuru mentioned, we can apply it as a patch to 4.0.0. The issue occurs
only when you publish events to BAM from the cloud controller and when you unsubscribe from an
instance. I will create a patch from the 4.0.0 branch with the fix and update the JIRA with the
patch.
> 
> Thanks,
> Reka
> 
> Thanks,
> Reka
>  
> On 26 Jun 2014 06:43, "Reka Thirunavukkarasu" <reka@wso2.com> wrote:
> Hi all,
> 
> On Wed, Jun 25, 2014 at 11:44 PM, Nirmal Fernando <nirmal070125@gmail.com> wrote:
> 
> On Wed, Jun 25, 2014 at 10:51 PM, Imesh Gunaratne <imesh@apache.org> wrote:
> Hi Michiel,
> 
> As Reka has pointed out, there is a potential issue in the CloudControllerServiceImpl class.
It seems the cloud controller retrieves its state from the registry in the CloudControllerServiceImpl
constructor, and that constructor is being invoked in two places other than where it is expected to be:
> 
> <Screen Shot 2014-06-25 at 10.36.07 PM.png>
> 
> 
> <Screen Shot 2014-06-25 at 10.14.01 PM.png>
> 
> 
> This was a bug we identified recently; someone made this commit without properly
analyzing the way CC is implemented. :-(
> 
> AFAIK Reka has already filed a JIRA and is on her way to removing that broken logic.
> I have fixed this issue in master and updated the JIRA (STRATOS-685). I have removed the
CloudControllerServiceImpl initialization that was used in the cloud controller when publishing events
to BAM and when terminating an instance in response to the MemberReadyToShutdownEvent.
>  
> The fix I made gets the relevant cartridge information from FasterLookupDataHolder
when publishing events to BAM, instead of obtaining it in the buggy way used earlier. Instance
termination in response to the MemberReadyToShutdownEvent is now handled via the Autoscaler instead
of CC terminating the member itself. I think this is a good approach, as the Autoscaler is the one
that requests to start or terminate the member in all scenarios.
> 
> Thanks,
> Reka
> 
> However, the above logic does not retrieve the topology from the registry. It is retrieved
by the Topology Manager:
> 
> <Screen Shot 2014-06-25 at 10.45.36 PM.png>
> 
> Therefore the above issue may have very little effect on the problem you have noticed.
However, I wonder whether we have an issue in the Autoscaler in refreshing its state once restarted.
> 
> Just to narrow down the cause of this issue, will you be able to list down the actions
that you carried out from the very beginning please? Then we could try to re-produce this
problem by going through them.
> 
> 
> Many Thanks
> Imesh
> 
> 
> On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <mblokzij@cisco.com> wrote:
> Hi all,
> 
> Basically, I was stopping and starting Stratos and looking at how it handled dying cartridges,
and found that Stratos only detected cartridge deaths while it was running.
> 
> The problem
> In steady state, I have some cartridges managed by Stratos:
> 
> ./stratos.sh list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant | cisco-sample-vm | Active | 1                 | cisco-sample-vm.foo.cisco.com |
> 
> nova list | grep samp
> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None       | Running     | core=172.16.2.17, 10.86.205.231  |
> 
> All good. Now I stop Stratos and ActiveMQ, 'nova delete' the sample cartridge, and
then start ActiveMQ and Stratos again.
> 
> Now, at first, things look good:
> 
> ./stratos.sh list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant | cisco-sample-vm | Inactive | 0                 | cisco-sample-vm.foo.cisco.com |
> 
> But then,
> 
> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant | cisco-sample-vm | Active | 1                 | cisco-sample-vm.foo.cisco.com |
> 
> # nova list | grep samp
> # 
> 
> How did the cartridge become active without it actually being there? As far as I can
tell, Stratos never recovers from this.
> 
> I found this bug here: https://issues.apache.org/jira/browse/STRATOS-234 - is this describing
the issue I’m seeing? I was a little bit confused by the usage of the word “obsolete”.
> 
> Where to go next?
> Now, I’ve done a little bit of digging, but I don’t yet have a full mental model
of how everything fits together in Stratos - please could someone help me put the pieces together?
:)
> 
> What I’m seeing is the following:
> - The cluster monitor appears to be active:
> 
> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor is running.. ClusterMonitor [clusterId=cisco-sample-vm.cisco-sample-v, serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1 [partitions] [org.apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0], autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, description=null], lbReferenceType=null] {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
> 
> - It looks like the CEP FaultHandlingWindowProcessor usually detects inactive members.
However, since this member was never active, the timeStampMap doesn’t contain an element
for this member, so it’s never checked.
> - I think the fault handling is triggered by a fault_message, but I didn’t manage to
figure out where it’s coming from. Does anyone know what triggers it? (is it the CEP extension?)
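
The timeStampMap observation above can be illustrated with a small sketch. It is not the actual FaultHandlingWindowProcessor: the class and method names are hypothetical, only the `timeStampMap` name is borrowed from the observation, and the 60 s timeout is an assumption. It shows that a member which never reports after a restart is never checked, and that seeding the map from the restored topology at startup is one possible way to make such a member time out.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SeededFaultCheck {
    // Assumed timeout; the real value is whatever the fault handler is configured with.
    static final long TIMEOUT_MS = 60_000;

    // Last time each member was heard from. Only members present here are ever checked.
    private final Map<String, Long> timeStampMap = new TreeMap<>();

    // Hypothetical startup step: register every member restored from persisted
    // state, using the startup time as its initial "last seen" value.
    public void seedFromTopology(Set<String> restoredMemberIds, long startedAt) {
        for (String id : restoredMemberIds) {
            timeStampMap.putIfAbsent(id, startedAt);
        }
    }

    // Called whenever a health event arrives from a member.
    public void onHealthEvent(String memberId, long timestamp) {
        timeStampMap.put(memberId, timestamp);
    }

    // Members that have been silent for longer than the timeout.
    public Set<String> faultyMembers(long now) {
        Set<String> faulty = new TreeSet<>();
        for (Map.Entry<String, Long> e : timeStampMap.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                faulty.add(e.getKey());
            }
        }
        return faulty;
    }

    public static void main(String[] args) {
        Set<String> restored = new TreeSet<>();
        restored.add("deleted-vm"); // instance removed while the detector was down
        restored.add("live-vm");

        // Without seeding, the deleted member never enters the map.
        SeededFaultCheck unseeded = new SeededFaultCheck();
        unseeded.onHealthEvent("live-vm", 50_000L);
        System.out.println(unseeded.faultyMembers(70_000L)); // []

        // With seeding, the silent member times out after TIMEOUT_MS.
        SeededFaultCheck seeded = new SeededFaultCheck();
        seeded.seedFromTopology(restored, 0L);
        seeded.onHealthEvent("live-vm", 50_000L);
        System.out.println(seeded.faultyMembers(70_000L));   // [deleted-vm]
    }
}
```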
> 
> Anyway.. 
> 
> Questions
> - How should Stratos detect after some downtime which cartridges are still there and
which ones aren’t? (what was the intended design?)
> - Why did the missing cartridge go “active”? Is this a result from restoring persistent
state? (If I look in the registry I can see stuff under subscriptions/active, but not sure
if that’s where it comes from)
> - Who should be responsible for detecting the absence of an instance - the ClusterMonitor?
That seems to be fed incorrect data, since it clearly thinks there are enough instances running.
Which component has the necessary data?
> - It looks like it’s possible to snapshot CEP state to make it semi-persistent. However,
if I restarted Stratos after 2min downtime, wouldn’t it try to kill all the nodes since
the last reply was more than 60s ago? Also, snapshots would be periodic, so there’s still
a window in which cartridges might “disappear”.
> 
> Thanks a lot and best regards!
> 
> Michiel
> 
> 
> 
> -- 
> Imesh Gunaratne
> 
> Technical Lead, WSO2
> Committer & PPMC Member, Apache Stratos
> 
> 
> 
> -- 
> Best Regards,
> Nirmal
> 
> Nirmal Fernando.
> PPMC Member & Committer of Apache Stratos,
> Senior Software Engineer, WSO2 Inc.
> 
> Blog: http://nirmalfdo.blogspot.com/
> 
> 
> 
> -- 
> Reka Thirunavukkarasu
> Senior Software Engineer,
> WSO2, Inc.:http://wso2.com,
> Mobile: +94776442007
> 
> 
> 
> 
> 
> 
> 
> -- 
> Lakmal Warusawithana
> Vice President, Apache Stratos
> Director - Cloud Architecture; WSO2 Inc.
> Mobile : +94714289692
> Blog : http://lakmalsview.blogspot.com/
> 
> 
> 
> 

