Message-ID: <46651B1F.30708@cheiron.org>
Date: Tue, 05 Jun 2007 10:13:19 +0200
From: Mark Brouwer <mark.brouwer@cheiron.org>
To: river-dev@incubator.apache.org
Subject: Re: SourceAliveRemoteEvent Part II
In-Reply-To: <4664A018.4010903@cheiron.org>
References: <465C83B4.9020008@dcrdev.demon.co.uk> <465D8E2C.7090503@cheiron.org> <46648B68.6030904@Sun.COM> <4664A018.4010903@cheiron.org>

Mark Brouwer wrote:
> Bob Scheifler wrote:
>>
>> part of this being, when you know you've lost an event, being able to
>> call some form of getState that would include an event sequence
>> number in the returned state, so you can figure out whether new
>> events are before or after your getState call. I wonder if solutions
>> to either or both of these would be more useful than SARE.
>
> The problem I see with this is that we have two execution paths with
> no end-to-end synchronization between them (we can't stop notification
> the moment we decide to make the getState call, we can't know for sure
> there are no events in transit, and the latency differs for each path).
> Therefore I'm very unsure of how to make decisions based on the
> outcome of getState().

Just to be sure: I'm not saying there is no value in a getState() method that includes a sequence number in the returned state. When the event rate is not high (and there are probably plenty of such use cases) you might keep false conclusions (resulting from timing/synchronization issues) to a minimum. So for a lookup service this might be sufficient, assuming you also have a proper way to test whether callbacks are possible.
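To make the quoted idea a bit more concrete, here is a rough client-side sketch. StatefulEventSource and ServiceState are names made up purely for illustration, they are not part of any existing or proposed spec; only the standard Jini event API is assumed.

  import java.rmi.Remote;
  import java.rmi.RemoteException;

  import net.jini.core.event.RemoteEvent;
  import net.jini.core.event.RemoteEventListener;
  import net.jini.core.event.UnknownEventException;

  /* Hypothetical: a producer that can hand out a state snapshot carrying
     its notion of the last sequence number for this registration. */
  interface StatefulEventSource extends Remote {
      ServiceState getState() throws RemoteException;
  }

  interface ServiceState {
      long lastSequenceNumber();
      // ... plus whatever domain state the service exposes
  }

  /* A listener (exporting omitted for brevity) that resynchronizes via
     getState() once it concludes it has lost events. */
  class ResyncingListener implements RemoteEventListener {
      private final StatefulEventSource source;
      private volatile long stateSeqNo = Long.MIN_VALUE;

      ResyncingListener(StatefulEventSource source) {
          this.source = source;
      }

      /* Called when the client knows it has lost one or more events. */
      void resync() throws RemoteException {
          ServiceState state = source.getState();
          stateSeqNo = state.lastSequenceNumber();
          // rebuild the local view from 'state' here
      }

      public void notify(RemoteEvent ev)
              throws UnknownEventException, RemoteException {
          if (ev.getSequenceNumber() <= stateSeqNo) {
              return;   // already covered by the state snapshot
          }
          // Newer than the snapshot, so apply it to the local view. This
          // only works if the sequence number in the snapshot and the
          // sequence numbers carried by the events come from the same
          // counter; as argued above, the generation and delivery paths
          // are not synchronized, so the two views may diverge.
      }
  }

Whether the comparison in notify() is meaningful is of course exactly the question of which of the producer's counters the returned sequence number should reflect.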
Note that in the case of the getState() method, which will likely be associated with the event registration received, you also have to perform the necessary proxy preparation for a secure deployment, so all in all I think this too brings additional coding/configuration with it.

As I said, I believe there is room for multiple mechanisms that complement each other, so I just tried to add the following method to a particular EventRegistration subclass that is part of the future JSC spec, gave the semantics some thought, and here I ran into a problem for which I need some help.

  /**
   * Returns the sequence number of the last remote event that has been
   * ... by the event producer for this event registration.
   *
   * @return the sequence number ...
   *
   * @throws RemoteException
   * @throws UnknownLeaseException if the lease associated with the event
   *         registration has expired or has been cancelled
   */
  public abstract long getLastSequenceNumber()
      throws RemoteException, UnknownLeaseException;

But what should the semantics of this method be? Often the generation of events in the event producer is decoupled from the actual delivery, so the event producer might have a different notion of the sequence number for the event registration than the delivery code has. Depending on the load of the service, events can be queued to be delivered at a later moment, and in case of sporadic network failures a delivery that failed indefinitely might be retried [1]. Given these different views on what the event sequence number is, I have a hard time coming up with a specification that is helpful to the client in a generic way. Is it the last sequence number as seen by the event producer, the last sequence number of a successful delivery to the first hop, or the sequence number of the indefinitely failed remote event? Should the API be different altogether, so that we could ask for all three of them?

But assuming this last sequence number is useful to you, what will your next step be if you don't know the exact cause of missing the event? I'm not very optimistic that you can get much further than telling the NOC "go analyze some log files and messages in your console to come up with a reason why this happened", but I might be wrong here. I think I agree with "we don't have a solid theory in general about how to recover from lost events"; it might even be that there is no such theory. Distributed events could be considered evil, as many optimizations are, but often they are a necessary evil.

For that reason I like SARE. The model is simple: you expect that at least every x (milli)seconds you get a notification that your event producer is there. If not (maybe combined with timestamps if the data is time critical), you know something is wrong; what exactly is wrong is for somebody else to sort out, but as a client I often switch to something that hopefully is able to meet my expected QoS level. I'm not saying that we shouldn't pursue getting more data points so a client can make better decisions; all I'm saying is that to me it has been a model that has proven helpful for alerting and fail-over in large, (mainly) event-driven systems.

To reiterate: SARE gives me continuous notifications of the aliveness of my event producer; in case of a strictly increasing sequence number, an indication of whether I missed events (likely events deliberately dropped by the server, which is really serious); and a test (following the *exact* notification route) of whether callbacks will work.
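For what it's worth, a rough sketch of how a client could consume that model, assuming only the standard Jini event API. The alive interval, the gap handling and the fail-over hook are placeholders, and the sketch assumes the producer increments the sequence number by exactly one per event and delivers in order.

  import java.rmi.RemoteException;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.ScheduledFuture;
  import java.util.concurrent.TimeUnit;

  import net.jini.core.event.RemoteEvent;
  import net.jini.core.event.RemoteEventListener;
  import net.jini.core.event.UnknownEventException;

  /* Client-side listener (exporting omitted) that treats every incoming
     event, including the periodic alive event, as a liveness signal. */
  class AliveWatchingListener implements RemoteEventListener {

      private final long aliveIntervalMillis;   // the "every x (milli)seconds"
      private final ScheduledExecutorService timer =
              Executors.newSingleThreadScheduledExecutor();

      private long lastSeqNo = Long.MIN_VALUE;  // last sequence number seen
      private ScheduledFuture<?> watchdog;

      AliveWatchingListener(long aliveIntervalMillis) {
          this.aliveIntervalMillis = aliveIntervalMillis;
          rearmWatchdog();
      }

      public synchronized void notify(RemoteEvent ev)
              throws UnknownEventException, RemoteException {
          long seqNo = ev.getSequenceNumber();
          // With a sequence number that increases by one per event and
          // in-order delivery, a jump of more than one means events were
          // dropped by the producer.
          if (lastSeqNo != Long.MIN_VALUE && seqNo > lastSeqNo + 1) {
              eventsMissed(lastSeqNo, seqNo);
          }
          lastSeqNo = seqNo;
          rearmWatchdog();   // the event doubles as an alive signal
      }

      private synchronized void rearmWatchdog() {
          if (watchdog != null) {
              watchdog.cancel(false);
          }
          watchdog = timer.schedule(new Runnable() {
              public void run() {
                  producerSilent();
              }
          }, aliveIntervalMillis, TimeUnit.MILLISECONDS);
      }

      private void producerSilent() {
          // No event, not even an alive event, within the expected interval:
          // something is wrong along the *exact* notification route, so
          // alert and/or fail over to another service.
      }

      private void eventsMissed(long lastSeen, long current) {
          // Gap in the sequence numbers: events were dropped.
      }
  }

That is about all the client-side state the model requires.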
Therefore it establishes what I would call trust in my event producer; I only have to start worrying when I stop receiving events. Depending on my love affair with that event producer I might need additional means to find out why it stopped, but often visiting the other service next door brings me back into a state of comfort. All of that for the small price of one additional event type, a constraint and some coding in the server (the latter, I can understand, people consider a hurdle).

The more often I repeat myself the more I start to like it, but no doubt the pattern of the fanatic is showing through again ;-)

[1] In the current implementation of Seven I maintain the ordering of events when delivery under the Jini Distributed Event Model fails due to indefinite failures that are not related to the event itself: all events are stalled and the first indefinitely failed event is scheduled for retry with an increasing interval; after a successful delivery the notification of the remaining events continues. Only events for which it is clear that their failure is definite, or really caused by the event object itself, are struck from the queue. So when a client asks for the state, it seems to me I can't say it missed an event, as an indefinitely failed event is still scheduled for delivery, only with some time shift, namely at a point in the future (assuming, e.g., that the network is working by then).

--
Mark