Subject: Re: MemberFault event is lost forever when MB is down
From: Imesh Gunaratne
To: dev@stratos.apache.org
Date: Wed, 30 Jul 2014 09:54:59 -0400

As I understand it, it is not just the MemberFault event that is affected
in this scenario; any event that CEP publishes to the message broker will
encounter the same problem.

On Wed, Jul 30, 2014 at 5:49 AM, Michiel Blokzijl (mblokzij)
<mblokzij@cisco.com> wrote:

> +1.
>
> If Stratos, or any component it relies on, fails and eventually returns
> to service, Stratos should "orchestrate" the cloud back to the desired
> state. If any cartridges went missing and after some time T (post
> failure) Stratos hasn't re-discovered them, they should be respawned.
>
> Best regards,
>
> Michiel
>
> On 30 Jul 2014, at 05:51, Isuru Haththotuwa <isuruh@apache.org> wrote:
>
> On Wed, Jul 30, 2014 at 9:45 AM, Akila Ravihansa Perera
> <ravihansa@wso2.com> wrote:
>
>> Hi Devs,
>>
>> The current Stratos architecture relies heavily on high availability
>> of the message broker. We faced a situation where, while the MB is
>> down, some of the published messages are lost forever and the system
>> state is never recovered.
>>
>> One such example: when a cartridge instance goes down, the CEP
>> component identifies this and publishes a MemberFault event to the
>> MB's summarized-health-stat topic. The problem is that the CEP
>> component builds its own list of cartridge instance members from the
>> health stats published to the MB; it does not consider the topology.
>> Hence, when a cartridge instance goes down, the MemberFault event is
>> fired only once. If the MB happens to be down at that moment, the
>> message is lost forever, leaving an unstable system state in which
>> Stratos thinks a member exists when in reality it does not.
>>
>> We can introduce a simple housekeeping task to check whether every
>> member is alive; ideally this should be the auto-scaler's
>> responsibility. It would allow the system to recover itself from an
>> unstable state. I think this is a critical bug and should be given
>> high priority.
>>
>> Please share your thoughts.
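A minimal sketch of the housekeeping task Akila describes could look like
the following. It is only an illustration under assumed interfaces
(FaultHandler and the callback names are hypothetical stand-ins, not
actual Stratos APIs); a real implementation would read its member list
from the topology and publish a proper MemberFault event to the MB.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically compares the members Stratos knows about against the time
// a health stat was last heard from each of them, and raises a fault for
// any member that has gone silent past a timeout. Hypothetical sketch.
public class MemberLivenessTask implements Runnable {

    // Stand-in for the component that publishes the MemberFault event.
    interface FaultHandler {
        void onMemberFault(String memberId);
    }

    // Time T after which a silent member is considered faulty.
    private static final long TIMEOUT_MS = 60_000;

    // memberId -> last time a health stat was received (epoch millis).
    private final Map<String, Long> lastHeard = new ConcurrentHashMap<>();
    private final FaultHandler faultHandler;

    public MemberLivenessTask(FaultHandler faultHandler) {
        this.faultHandler = faultHandler;
    }

    // Called whenever a health stat arrives for a member.
    public void recordHealthStat(String memberId) {
        lastHeard.put(memberId, System.currentTimeMillis());
    }

    // Called when the topology reports a new member.
    public void memberAdded(String memberId) {
        lastHeard.putIfAbsent(memberId, System.currentTimeMillis());
    }

    // Called when a member is cleanly removed from the topology.
    public void memberRemoved(String memberId) {
        lastHeard.remove(memberId);
    }

    @Override
    public void run() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            if (now - entry.getValue() > TIMEOUT_MS) {
                faultHandler.onMemberFault(entry.getKey());
            }
        }
    }

    public static void main(String[] args) {
        MemberLivenessTask task = new MemberLivenessTask(
                memberId -> System.out.println("MemberFault: " + memberId));
        task.memberAdded("member-1");
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(task, 15, 15, TimeUnit.SECONDS);
    }
}

The important difference from the current CEP behaviour is that the fault
keeps being raised on every cycle until the member reports again or is
removed from the topology, so a single message lost while the MB was down
is no longer fatal.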
> +1. We would need to decide what the best method for this is, though.
> If we consider CEP the central point of decision making, another option
> is to make it listen to the topology and arrive at the correct decision.
> Alternatively, we can use a health check mechanism for the MB which can
> detect that the MB is down and replay any lost messages. This IMO can
> be very useful, since the primary communication mechanism in Stratos is
> the MB.
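A rough sketch of that replay idea, using a local in-memory buffer in
front of the broker client (Broker here is a hypothetical stand-in for
the real MB client, not an existing API; a production version would
persist the queue and bound its size):

import java.util.ArrayDeque;
import java.util.Deque;

// Wraps a broker client so that events which cannot be published are
// kept in a local queue and replayed, in order, once the broker is back.
public class ReplayingPublisher {

    interface Broker {
        void publish(String topic, String message) throws Exception;
    }

    private final Broker broker;
    private final Deque<String[]> pending = new ArrayDeque<>();

    public ReplayingPublisher(Broker broker) {
        this.broker = broker;
    }

    // Publish an event, buffering it locally if the broker is down.
    public synchronized void publish(String topic, String message) {
        pending.addLast(new String[] {topic, message});
        flush();
    }

    // Drain the queue; stop at the first failure to preserve ordering.
    public synchronized void flush() {
        while (!pending.isEmpty()) {
            String[] event = pending.peekFirst();
            try {
                broker.publish(event[0], event[1]);
                pending.removeFirst();
            } catch (Exception brokerDown) {
                // Leave the event queued; the health check calls flush()
                // again once the broker responds.
                return;
            }
        }
    }
}

The periodic MB health check would then simply call flush() whenever it
detects that the broker has become reachable again.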
> One other important thing is to have fail-over/HA for the MB. There can
> be many other occasions where, if the MB is down, the system goes into
> an undefined state due to loss of messages.
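On the fail-over point: if the MB in use is ActiveMQ (an assumption;
Stratos can run against different brokers), the JMS client already ships
a failover transport, for example:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import org.apache.activemq.ActiveMQConnectionFactory;

public class FailoverConnectionDemo {
    public static void main(String[] args) throws Exception {
        // The failover: transport reconnects automatically and switches
        // between the listed brokers; by default sends block while no
        // broker is reachable instead of failing. Host names are
        // placeholders.
        ConnectionFactory factory = new ActiveMQConnectionFactory(
                "failover:(tcp://mb1:61616,tcp://mb2:61616)?randomize=false");
        Connection connection = factory.createConnection();
        connection.start();
    }
}

Client-side failover alone is not enough, though: the listed addresses
must point at a genuinely redundant broker setup, such as an ActiveMQ
master/slave pair.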

>> --
>> Akila Ravihansa Perera
>> Software Engineer
>> WSO2 Inc.
>> http://wso2.com
>>
>> Blog: http://ravihansa3000.blogspot.com
>
> --
> Thanks and Regards,
> Isuru H.
> +94 716 358 048

--
Imesh Gunaratne
Technical Lead, WSO2
Committer & PPMC Member, Apache Stratos