stratos-dev mailing list archives

From "Shaheedur Haque (shahhaqu)" <shahh...@cisco.com>
Subject RE: ***UNCHECKED*** Problem with Stratos 4.1 failover (was RE: Problems with Stratos 4.1 failover and autohealing)
Date Wed, 15 Jul 2015 00:02:14 GMT
Yes, I will collect the logs. It may take a day or so though…

From: Reka Thirunavukkarasu [mailto:reka@wso2.com]
Sent: 14 July 2015 13:54
To: dev
Cc: Shaheedur Haque (shahhaqu)
Subject: Re: ***UNCHECKED*** Problem with Stratos 4.1 failover (was RE: Problems with Stratos
4.1 failover and autohealing)

Hi Shaheed,
As I went through the logs (wso2carbon.log and wso2-cep-trace.log) of octl-02-after, I identified
that the member fault could occur because the cartridge agent did not publish any events for more
than one minute, from 00:53:24 to 00:55:09 ([2], [3]), while the cartridge agent health publishing
interval is 15s ([1], [2]). I observed this sequence ([1], [2], [3]) in wso2-cep-trace.log; it has
happened intermittently. Please see below for the details. The member fault occurred [4] on the
Stratos side at 00:54:34, because the last event from the cartridge agent was at 00:53:24 [2] and
the next event was only received at 00:55:09 [3]. So we will need to investigate on the cartridge
agent side why it intermittently failed to send the health stats even after everything was back to
normal, as I could see the cartridge agent publishing health stats without any issue for nearly
two minutes after octl-02 came up, e.g. [1] and [2].

Is it possible for you to get the cartridge agent logs? I know that it is a little bit hard,
as Stratos will immediately terminate the instance. But if you could tail the logs, I believe
you could capture them from wherever you are tailing.

[1]
00:53:09,347 [-] [DataBridge-Core-pool-2-thread-8]  INFO EVENT_TRACE_LOGGER TenantId=-1234
: Input Event Adaptor : DefaultWSO2EventInputAdaptor, received

Event{
  streamId='cartridge_agent_health_stats:1.0.0',
  timeStamp=0,
  metaData=null,
  correlationData=null,
  payloadData=[cartridge-proxy.cartridge-proxy.cartridge-proxy.domain, cartridge-proxy-1,
RegionOne, cartridge-proxy.cartridge-proxy.cartridge-proxy.domainc1dd3d8b-95ed-439c-96be-daaac2cb15a3,
whole-region, load_average, 0.0],
  arbitraryDataMap={},
}

[2]
00:53:24,345 [-] [DataBridge-Core-pool-2-thread-4]  INFO EVENT_TRACE_LOGGER TenantId=-1234
: Input Event Adaptor : DefaultWSO2EventInputAdaptor, received

Event{
  streamId='cartridge_agent_health_stats:1.0.0',
  timeStamp=0,
  metaData=null,
  correlationData=null,
  payloadData=[cartridge-proxy.cartridge-proxy.cartridge-proxy.domain, cartridge-proxy-1,
RegionOne, cartridge-proxy.cartridge-proxy.cartridge-proxy.domainc1dd3d8b-95ed-439c-96be-daaac2cb15a3,
whole-region, load_average, 0.0],
  arbitraryDataMap={},
}

[3]
00:55:09,349 [-] [DataBridge-Core-pool-2-thread-10]  INFO EVENT_TRACE_LOGGER TenantId=-1234
: Input Event Adaptor : DefaultWSO2EventInputAdaptor, received

Event{
  streamId='cartridge_agent_health_stats:1.0.0',
  timeStamp=0,
  metaData=null,
  correlationData=null,
  payloadData=[cartridge-proxy.cartridge-proxy.cartridge-proxy.domain, cartridge-proxy-1,
RegionOne, cartridge-proxy.cartridge-proxy.cartridge-proxy.domainc1dd3d8b-95ed-439c-96be-daaac2cb15a3,
whole-region, load_average, 0.0],
  arbitraryDataMap={},
}


[4] TID: [0] [STRATOS] [2015-07-14 00:54:34,818]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor}
-  Faulty member detected [member-id] cartridge-proxy.cartridge-proxy.cartridge-proxy.domainc1dd3d8b-95ed-439c-96be-daaac2cb15a3
with [last time-stamp] 1436835204346 [time-out] 60000 milliseconds

1436835204346 ==> 14 Jul 2015 00:53:24 GMT
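
For reference, here is a minimal, standalone Java sketch (not Stratos code; the class name is
made up) that reproduces the arithmetic above: it converts the [last time-stamp] from [4] to UTC
and checks it against the 60000 ms time-out at the moment the fault was raised.

import java.time.Duration;
import java.time.Instant;

public class FaultWindowArithmetic {
    public static void main(String[] args) {
        // Last health event from the cartridge agent, i.e. [last time-stamp] in [4].
        Instant lastHealthEvent = Instant.ofEpochMilli(1436835204346L); // 2015-07-14T00:53:24.346Z
        // Moment the CEP extension logged the member fault, from the same log line [4].
        Instant faultDetectedAt = Instant.parse("2015-07-14T00:54:34.818Z");

        long elapsedMillis = Duration.between(lastHealthEvent, faultDetectedAt).toMillis();
        long timeoutMillis = 60000L; // [time-out] reported in [4]

        System.out.println("Elapsed since last health event: " + elapsedMillis + " ms");
        System.out.println("Exceeds time-out?              : " + (elapsedMillis > timeoutMillis));
        // Prints ~70472 ms, so the member fault at 00:54:34 is consistent with [2] and [4].
    }
}
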
Thanks,
Reka


On Tue, Jul 14, 2015 at 5:21 PM, Reka Thirunavukkarasu <reka@wso2.com>
wrote:
Thanks, Shaheed, for the update on the failover issue. I will go through the logs and share
my observations.
Thanks,
Reka

On Tue, Jul 14, 2015 at 2:39 PM, Shaheedur Haque (shahhaqu) <shahhaqu@cisco.com>
wrote:
This thread breaks out the failover issue previously reported in the thread below.

Scenario


1.      Start 2 orchestration controllers, octl-01 and octl-02. At this stage, Stratos comes
up on octl-01.

2.      Shortly after, launch two single VM applications (cisco-sample and cartridge-proxy)

3.      Some time later, launch three more single VM applications (di-000-001 to di-000-003)

4.      Some time later, kill octl-01

5.      Wait

6.      About 60 seconds later, DNS is switched to point the Cartridge Agents to octl-02, and
Stratos then comes up on octl-02.

Expected result


•        New Stratos instance picks up the old cartridges and carries on

Actual results


•        The new Stratos instance seems to think the old cartridges are dead, and spins up
new VMs.

•        All sorts of exceptions are being thrown.

Supporting info

Two tarballs attached. The “before” set was taken before step 4, the “after” set was
taken after step 6. Each contains tarballs for octl-01 and octl-02. Thus, in the first, the
wso2 logs are at


•        tech-support-octl-01.tar, under /var/log/apache-stratos

and in the second, they are at


•        tech-support-octl-02.tar, under /var/log/apache-stratos



From: Shaheedur Haque (shahhaqu)
Sent: 07 July 2015 12:10
To: dev@stratos.apache.org
Subject: Problem with Stratos 4.1 autohealing (was RE: Problems with Stratos 4.1 failover
and autohealing)

I met with several of the WSO2 team, and we agreed on the following next steps:


1.      Treat the “Stratos failover” issue as separate from the “duplicate instances”
issue, and use a new thread for the former, keeping this thread for the latter.

2.      The lack of synchronisation between the cleanup of the faulty VMs and the startup
of the replacement VMs was noted.

a.      It was mentioned that the period of overlap should be seconds, perhaps up to 60s.

b.      I noted that for many of our scenarios, where the hardware is 1:1 mapped to VMs (i.e.
neither over-subscribed nor under-subscribed), the overlap may be a problem. I will consult
with others and confirm. Imesh ack’d this point.

c.      I will provide the wso2 logs for my test case (summarised below).

d.      I will check with the test team, where the problem was originally detected, whether this
is transient (up to 60s) or long-lived, and gather logs as needed.

From: Imesh Gunaratne [mailto:imesh@apache.org]
Sent: 06 July 2015 17:11
To: dev
Subject: Re: Problems with Stratos 4.1 failover and autohealing

On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <ravihansa@wso2.com>
wrote:

If it detects that cartridge agent has not published any health event within the given time
window (60mins), it will publish the MemberFault event.

This should be 60 seconds.

On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <ravihansa@wso2.com>
wrote:
Hi Shaheed,

Regarding the Stratos fail-over scenario: by looking at the logs it seems that there are multiple
member fault cases detected by CEP. The FaultHandlingWindowProcessor [1] running inside CEP
is implemented to check for health stats published by cartridge agents. It does not check
whether the actual VM or the processes running inside the VM are active or not. If it detects that
cartridge agent has not published any health event within the given time window (60mins),
it will publish the MemberFault event.
Ideally this should not happen regardless of whether the deployment is HA or not, because
FaultHandlingWindowProcessor counts the time-out interval from the point the Stratos server
started. The only reason I can think of that may have caused these faulty member logs is that Stratos
(or the external CEP) did not receive the health stats published by the cartridge agents on time.
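
As a rough illustration of that mechanism, here is a simplified sketch (not the actual
FaultHandlingWindowProcessor implementation; the class and method names below are made up):
it keeps the timestamp of the last health event per member and flags a member as faulty once
no event has arrived within the time-out window.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of a per-member health time-out check; illustrative only.
public class HealthTimeoutSketch {

    private static final long TIMEOUT_MILLIS = 60_000; // matches the [time-out] 60000 ms seen in the logs

    // member-id -> timestamp (ms) of the last health event received for that member
    private final Map<String, Long> lastHealthEvent = new ConcurrentHashMap<>();

    // Called whenever a cartridge_agent_health_stats event arrives for a member.
    public void onHealthEvent(String memberId, long eventTimestampMillis) {
        lastHealthEvent.put(memberId, eventTimestampMillis);
    }

    // Called periodically; returns true if the member should be reported as faulty.
    public boolean isFaulty(String memberId, long nowMillis) {
        Long last = lastHealthEvent.get(memberId);
        // Members with no recorded event yet are not treated specially in this sketch.
        return last != null && (nowMillis - last) > TIMEOUT_MILLIS;
    }
}
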

Are you running Stratos in single-JVM mode? If so, I'd highly recommend that you deploy the
CEP profile separately. Please refer to [2] for the reference deployment architecture of a Stratos
HA setup.

Regarding your second concern (Kill multiple Cartridges): I believe the issue is that the
first message relating to recovery of the CPS came 66s after the faulty member detection, and
recovery of other VMs started before that. I'd like to summarize a few points regarding the Stratos
recovery process which I think will explain that behavior (a rough illustrative sketch follows this list).

 - Stratos detects faulty members only by checking the health stats published by the cartridge
agent, not by checking the existence of the actual VM.
 - Upon detecting such a faulty member, CEP will publish a faulty member event; after receiving
that event, the AS will move the instance to the obsolete list. The AS will iterate over this obsolete
list and try to terminate those instances via the CC. When the termination is complete, CC will
publish a member terminated event, upon receiving which the AS will remove that member from the
obsolete list.
 - If CC fails to terminate the instance, it will keep retrying until the obsolete member timeout
interval has passed. When the timeout expires, the AS will forcefully remove that member from the
obsolete list.
 - Stratos will not try to re-spawn the faulty instance right after detecting it. This recovery
process is triggered by the Cluster monitor object. There is Drools logic which is executed
periodically (the monitor interval is configurable; the default is 90s) and runs the min-check
rule, scale-up rule and scale-down rule to decide whether to spawn new instances. So this
recovery process could take some time depending on your monitor interval.
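
To make the sequencing concrete, here is a condensed, hypothetical sketch of the flow described
in the points above. None of these types or methods exist in Stratos; they only illustrate why the
termination of obsolete members and the rule-driven spawning of replacements run independently of
each other.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the autoscaler-side recovery flow; illustrative only.
public class RecoveryFlowSketch {

    // Members that have been reported faulty and are awaiting termination.
    private final Set<String> obsoleteMembers = ConcurrentHashMap.newKeySet();

    // Reaction to a faulty-member event published by CEP: park the member in the obsolete list.
    public void onMemberFault(String memberId) {
        obsoleteMembers.add(memberId);
    }

    // Retried until the CC confirms termination, or until the obsolete-member timeout expires,
    // at which point the member is forcefully dropped from the list.
    public void terminateObsoleteMembers(CloudController cc, long obsoleteTimeoutMillis) {
        for (String memberId : obsoleteMembers) {
            if (cc.terminate(memberId) || cc.obsoleteAgeMillis(memberId) > obsoleteTimeoutMillis) {
                obsoleteMembers.remove(memberId);
            }
        }
    }

    // Runs on the cluster monitor interval (default ~90s, per the point above): the min-check
    // rule spawns replacements without waiting for the old members to finish terminating,
    // which is why spawn and terminate messages can interleave in the logs.
    public void runMinCheckRule(CloudController cc, int minInstances, int activeInstances) {
        for (int i = activeInstances; i < minInstances; i++) {
            cc.spawnInstance();
        }
    }

    // Hypothetical stand-in for the Cloud Controller client.
    interface CloudController {
        boolean terminate(String memberId);
        long obsoleteAgeMillis(String memberId);
        void spawnInstance();
    }
}
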


So, in summary:
- There is no way to control which cluster monitor gets executed first, so it is possible that
CPS recovery takes place after that of the other VMs.
- The reason for seeing a lot of termination logs is that Stratos continuously tries to terminate
those faulty members until the obsolete member timeout has expired. This can occur if the actual
VM was killed manually or by other means.

However, we need to figure out why you got these faulty member events. Could you send us the
CEP trace log for further analysis? You can also enable debug logs for the FaultHandlingWindowProcessor
class to get more insight into what's going on.

I hope this explains your concerns. We can have a Hangout session to discuss this further.

[1] https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
[2] https://cwiki.apache.org/confluence/display/STRATOS/4.1.0+Configuring+HA+in+Single+JVM+Mode+on+EC2

Thanks.

On Mon, Jul 6, 2015 at 5:00 PM, Shaheedur Haque (shahhaqu) <shahhaqu@cisco.com>
wrote:
Hi,

I am seeing some issues with Stratos 4.1 around failover on the one hand (i.e. stop one
running instance of Stratos, start another as a replacement) and autohealing on the other (i.e. a Cartridge
dies and needs to be replaced by Stratos), which feel as though they might have a common cause,
so I am lumping them together for now. I'll explain the two scenarios as best I can… and then
attach some logs covering one scenario in detail.

Failover of Stratos

The situation here is that I have 3 servers, all running MySQL in active-active mode, and
Stratos running on at most one of them under Pacemaker control. The test case is that I kill
the server (actually a VM) running Stratos, and Pacemaker responds by starting Stratos on
one of the remaining servers. Around the time of the action, there is no ongoing REST activity
in terms of defining new Cartridges, or Applications or anything like that.


•        Expected result: the replacement Stratos instance “acquires” the running Cartridges,
and the system continues without killing/replacing any Cartridges.

•        Actual result: all Cartridges are killed and restarted.

Kill multiple Cartridges

I have a test where several Cartridges are killed at the same time as follows. We have a Cartridge
Proxy Server (CPS) that hosts the (Java) Cartridge agent for VMs which cannot run one natively.
The CPS therefore runs N+1 JCAs, one for itself plus N for the Cartridges it is proxying.
Killing the CPS therefore causes all N+1 JCAs to disappear, and from a Stratos point of view,
all N+1 VMs need to be restarted.


•        Expected result: each Cartridge is replaced by Stratos.

•        Actual result: looking at the Stratos logs, it seems that there is a race between
the termination logic, which cleans up as old cluster members die, and the startup logic,
which spins up replacement VMs for the terminated ones.

Both cases previously seemed to work fine, where we *think* previously means versions of Stratos
4.1 dating back ~3 weeks or so. I attach a series of log extracts for the second scenario;
I suspect this also covers the first, but even if not, at least we can start here.


1.      The original spin-up of the CPS is at 2015-06-11 19:50:20,585. The CPS fault is seen around 2015-06-12
18:54:51; contemporaneously, all the other faults are seen:

TID: [0] [STRATOS] [2015-06-12 18:54:51,788]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor}
-  Faulty member detected [member-id] di-000-007.di-000-007.cisco-qvpc-sf-0.domain88ef85b5-5ec6-4d4e-b7b2-1ab9d3f7b160
with [last time-stamp] 1434135185318 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,803]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor}
-  Faulty member detected [member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
with [last time-stamp] 1434135192556 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,837]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor}
-  Faulty member detected [member-id] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain95a58a3e-e7a9-4071-b42b-3551bad25e6a
with [last time-stamp] 1434135189149 [time-out] 60000 milliseconds
...
etc
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,862]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor}
-  Faulty member detected [member-id] di-000-010.di-000-010.cisco-qvpc-sf-0.domain2ab24c1d-53cb-452e-8aa2-f02eecf2db78
with [last time-stamp] 1434135193054 [time-out] 60000 milliseconds
...
etc
...


2.      The last message relating to the CPS failure is at 18:54:52. The first message relating
to recovery of the CPS is some 66 seconds later:

TID: [0] [STRATOS] [2015-06-12 18:55:58,516]  INFO {org.apache.stratos.autoscaler.rule.RuleLog}
-  [min-check] Partition available, hence trying to spawn an instance to fulfil minimum count!
[cluster] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain


3.      However, recovery of other VMs started anywhere from 54 seconds before this point to
as much as 26 seconds after:

TID: [0] [STRATOS] [2015-06-12 18:55:04,564]  INFO {org.apache.stratos.autoscaler.client.CloudControllerClient}
-  Trying to spawn an instance via cloud controller: [cluster] di-000-008.di-000-008.cisco-qvpc-sf-0.domain
[partition] whole-region [network-partition-id] RegionOne
...
TID: [0] [STRATOS] [2015-06-12 18:56:20,603]  INFO {org.apache.stratos.autoscaler.client.CloudControllerClient}
-  Trying to spawn an instance via cloud controller: [cluster] di-000-007.di-000-007.cisco-qvpc-sf-0.domain
[partition] whole-region [network-partition-id] RegionOne


4.      Between points 2 and 3 (with some fuzz for the races), I see Stratos cleaning up the old VMs.
For example:

TID: [0] [STRATOS] [2015-06-12 18:55:04,580]  INFO {org.apache.stratos.cloud.controller.iaases.JcloudsIaas}
-  Starting to terminate member: [cartridge-type] cisco-qvpc-sf-0 [member-id] di-000-008.di-000-008.cisco-qvpc-sf-0.domaina0f15eed-9599-4f3e-a70a-93ddd02ccf5f


5.      However, around that point, look at this:

TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO {org.apache.stratos.autoscaler.rule.RuleLog}
-  [scale-up] Trying to scale up over max, hence not scaling up cluster itself and
                        notifying to parent for possible group scaling or app bursting.
                        [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1
[max] 1



6.      Is it possible that there is a race between the terminations and restarts? It certainly
seems so, though detailed knowledge of Stratos is needed to confirm it:

TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO {org.apache.stratos.autoscaler.rule.RuleLog}
-  [min-check] Partition available, hence trying to spawn an instance to fulfil minimum count!
[cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO {org.apache.stratos.autoscaler.client.CloudControllerClient}
-  Trying to spawn an instance via cloud controller: [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
[partition] whole-region [network-partition-id] RegionOne
TID: [0] [STRATOS] [2015-06-12 18:55:12,799]  INFO {org.apache.stratos.common.client.CloudControllerServiceClient}
-  Terminating instance via cloud controller: [member] di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
TID: [0] [STRATOS] [2015-06-12 18:55:12,812]  INFO {org.apache.stratos.cloud.controller.messaging.publisher.TopologyEventPublisher}
-  Publishing member created event: [service-name] cisco-qvpc-sf-0 [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
[cluster-instance-id] di-000-005-1 [member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
[instance-id] null [network-partition-id] RegionOne [partition-id] whole-region [lb-cluster-id]
null
TID: [0] [STRATOS] [2015-06-12 18:55:12,817]  INFO {org.apache.stratos.messaging.message.processor.topology.MemberCreatedMessageProcessor}
-  Member created: [service-name] cisco-qvpc-sf-0 [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
[member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
[cluster-instance-id] di-000-005-1
TID: [0] [STRATOS] [2015-06-12 18:55:12,834] ERROR {org.apache.stratos.autoscaler.rule.RuleTasksDelegator}
-  Request in flight threshold is Zero
TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO {org.apache.stratos.autoscaler.rule.RuleLog}
-  [scale-up] Trying to scale up over max, hence not scaling up cluster itself and
                        notifying to parent for possible group scaling or app bursting.
                        [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1
[max] 1

(I tried to file a JIRA on this, but I’m having a bit of trouble, hence posting here to
get the ball rolling).

Thanks, Shaheed





--
Akila Ravihansa Perera
Software Engineer, WSO2

Blog: http://ravihansa3000.blogspot.com



--
Imesh Gunaratne

Senior Technical Lead, WSO2
Committer & PMC Member, Apache Stratos


--
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007




--
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007
