Message-ID: <46651B1F.30708@cheiron.org>
Date: Tue, 05 Jun 2007 10:13:19 +0200
From: Mark Brouwer <mark.brouwer@cheiron.org>
To: river-dev@incubator.apache.org
Subject: Re: SourceAliveRemoteEvent Part II
In-Reply-To: <4664A018.4010903@cheiron.org>
References: <465C83B4.9020008@dcrdev.demon.co.uk> <465D8E2C.7090503@cheiron.org> <46648B68.6030904@Sun.COM> <4664A018.4010903@cheiron.org>

Mark Brouwer wrote:
> Bob Scheifler wrote:
>>
>> part of this being, when you know you've lost an event, being able to
>> call some form of getState that would include an event sequence
>> number in the returned state, so you can figure out whether new
>> events are before or after your getState call. I wonder if solutions
>> to either or both of these would be more useful than SARE.
>
> The problem I see with this is that we have two execution paths with
> no end-to-end synchronization between them (we can't stop notification
> the moment we decide to make the getState call, we can't know for sure
> there are no events in transit, and the latency differs for each path).
> Therefore I'm very unsure of how to make decisions based on the
> outcome of getState().

Just to be sure: I'm not saying there is no value in a getState() method that includes a sequence number in the returned state. When the event rate is not high (and there are probably plenty of such use cases) you might keep false conclusions (resulting from timing/synchronization issues) to a minimum. So for a lookup service this might be sufficient, assuming you also have a proper way to test whether callbacks are possible.
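To make the quoted idea a bit more concrete, here is a rough client-side sketch. StatefulEventSource and ServiceState are names made up purely for illustration, they are not part of any existing or proposed spec; only the standard Jini event API is assumed.

  import java.rmi.Remote;
  import java.rmi.RemoteException;

  import net.jini.core.event.RemoteEvent;
  import net.jini.core.event.RemoteEventListener;
  import net.jini.core.event.UnknownEventException;

  /* Hypothetical: a producer that can hand out a state snapshot carrying
     its notion of the last sequence number for this registration. */
  interface StatefulEventSource extends Remote {
      ServiceState getState() throws RemoteException;
  }

  interface ServiceState {
      long lastSequenceNumber();
      // ... plus whatever domain state the service exposes
  }

  /* A listener (exporting omitted for brevity) that resynchronizes via
     getState() once it concludes it has lost events. */
  class ResyncingListener implements RemoteEventListener {
      private final StatefulEventSource source;
      private volatile long stateSeqNo = Long.MIN_VALUE;

      ResyncingListener(StatefulEventSource source) {
          this.source = source;
      }

      /* Called when the client knows it has lost one or more events. */
      void resync() throws RemoteException {
          ServiceState state = source.getState();
          stateSeqNo = state.lastSequenceNumber();
          // rebuild the local view from 'state' here
      }

      public void notify(RemoteEvent ev)
              throws UnknownEventException, RemoteException {
          if (ev.getSequenceNumber() <= stateSeqNo) {
              return;   // already covered by the state snapshot
          }
          // Newer than the snapshot, so apply it to the local view. This
          // only works if the sequence number in the snapshot and the
          // sequence numbers carried by the events come from the same
          // counter; as argued above, the generation and delivery paths
          // are not synchronized, so the two views may diverge.
      }
  }

Whether the comparison in notify() is meaningful is of course exactly the question of which of the producer's counters the returned sequence number should reflect.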
Note that in the case of the getState() method, which will likely be associated with the event registration received, you also have to perform the necessary proxy preparation for a secure deployment, so all in all I think this too brings additional coding/configuration with it.

As I said, I believe there is room for multiple mechanisms that complement each other, so I just tried to add the following method to a particular EventRegistration subclass that is part of the future JSC spec, gave the semantics some thought, and here I ran into a problem for which I need some help.

  /**
   * Returns the sequence number of the last remote event that has been
   * ... by the event producer for this event registration.
   *
   * @return the sequence number ...
   *
   * @throws RemoteException
   * @throws UnknownLeaseException if the lease associated with the event
   *         registration has expired or has been cancelled
   */
  public abstract long getLastSequenceNumber()
      throws RemoteException, UnknownLeaseException;

But what should the semantics of this method be? Often the generation of events in the event producer is decoupled from the actual delivery, so the event producer might have a different notion of the sequence number for the event registration than the delivery code has. Depending on the load of the service, events can be queued to be delivered at a later moment, and in case of sporadic network failures a delivery that failed indefinitely might be retried [1]. Given these different views on what the event sequence number is, I have a hard time coming up with a specification that is helpful to the client in a generic way. Is it the last sequence number as seen by the event producer, the last sequence number of a successful delivery to the first hop, or the sequence number of the indefinitely failed remote event? Should the API be different altogether, so that we could ask for all three of them?

But assuming this last sequence number is useful to you, what will your next step be if you don't know the exact cause of missing the event? I'm not very optimistic that you can get much further than telling the NOC "go analyze some log files and messages in your console to come up with a reason why this happened", but I might be wrong here. I think I agree with "we don't have a solid theory in general about how to recover from lost events"; it might even be that there is no such theory. Distributed events could be considered evil, as many optimizations are, but often they are a necessary evil.

For that reason I like SARE. The model is simple: you expect that at least every x (milli)seconds you get a notification that your event producer is there. If not (maybe combined with timestamps if the data is time critical), you know something is wrong; what exactly is wrong is for somebody else to sort out, but as a client I often switch to something that hopefully is able to meet my expected QoS level. I'm not saying that we shouldn't pursue getting more data points so a client can make better decisions; all I'm saying is that to me it has been a model that has proven helpful for alerting and fail-over in large, (mainly) event-driven systems.

To reiterate: SARE gives me continuous notifications of the aliveness of my event producer; in case of a strictly increasing sequence number, an indication of whether I missed events (likely events deliberately dropped by the server, which is really serious); and a test (following the *exact* notification route) of whether callbacks will work.
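For what it's worth, a rough sketch of how a client could consume that model, assuming only the standard Jini event API. The alive interval, the gap handling and the fail-over hook are placeholders, and the sketch assumes the producer increments the sequence number by exactly one per event and delivers in order.

  import java.rmi.RemoteException;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.ScheduledFuture;
  import java.util.concurrent.TimeUnit;

  import net.jini.core.event.RemoteEvent;
  import net.jini.core.event.RemoteEventListener;
  import net.jini.core.event.UnknownEventException;

  /* Client-side listener (exporting omitted) that treats every incoming
     event, including the periodic alive event, as a liveness signal. */
  class AliveWatchingListener implements RemoteEventListener {

      private final long aliveIntervalMillis;   // the "every x (milli)seconds"
      private final ScheduledExecutorService timer =
              Executors.newSingleThreadScheduledExecutor();

      private long lastSeqNo = Long.MIN_VALUE;  // last sequence number seen
      private ScheduledFuture<?> watchdog;

      AliveWatchingListener(long aliveIntervalMillis) {
          this.aliveIntervalMillis = aliveIntervalMillis;
          rearmWatchdog();
      }

      public synchronized void notify(RemoteEvent ev)
              throws UnknownEventException, RemoteException {
          long seqNo = ev.getSequenceNumber();
          // With a sequence number that increases by one per event and
          // in-order delivery, a jump of more than one means events were
          // dropped by the producer.
          if (lastSeqNo != Long.MIN_VALUE && seqNo > lastSeqNo + 1) {
              eventsMissed(lastSeqNo, seqNo);
          }
          lastSeqNo = seqNo;
          rearmWatchdog();   // the event doubles as an alive signal
      }

      private synchronized void rearmWatchdog() {
          if (watchdog != null) {
              watchdog.cancel(false);
          }
          watchdog = timer.schedule(new Runnable() {
              public void run() {
                  producerSilent();
              }
          }, aliveIntervalMillis, TimeUnit.MILLISECONDS);
      }

      private void producerSilent() {
          // No event, not even an alive event, within the expected interval:
          // something is wrong along the *exact* notification route, so
          // alert and/or fail over to another service.
      }

      private void eventsMissed(long lastSeen, long current) {
          // Gap in the sequence numbers: events were dropped.
      }
  }

That is about all the client-side state the model requires.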
Therefore it establishes what I would call trust in my event producer; I only have to start worrying when I stop receiving events. Depending on my love affair with that event producer I might need additional means to find out why it stopped, but often visiting the other service next door brings me back into a state of comfort. All of that for the small price of one additional event type, a constraint and some coding in the server (the latter, I can understand, people consider a hurdle).

The more often I repeat myself the more I start to like it, but no doubt the pattern of the fanatic is showing through again ;-)

[1] In the current implementation of Seven I maintain the ordering of events when delivery under the Jini Distributed Event Model fails due to indefinite failures that are not related to the event itself: all events are stalled and the first indefinitely failed event is scheduled for retry with an increasing interval; after a successful delivery the notification of the remaining events continues. Only events for which it is clear that their failure is definite, or really caused by the event object itself, are struck from the queue. So when a client asks for the state, it seems to me I can't say it missed an event, as an indefinitely failed event is still scheduled for delivery, only with some time shift, namely at a point in the future (assuming, e.g., that the network is working by then).

--
Mark