tomcat-dev mailing list archives

From Peter Rossbach ...@objektpark.de>
Subject Re: Rolling 5.5.25?
Date Fri, 17 Aug 2007 20:24:22 GMT
Hi Filip,

yes, the "out of session sync" is the real cluster issue. We must
find a way for a member to see that another member has come back.
We must extend the current membership protocol so that the receiver can
detect that a member has come back after an I/O problem or a restart.
But I have no idea what a good strategy to resync the sessions would be :(
The other thing I want to change: a normal shutdown/restart of a
member should be signaled to the other members.
Currently the admin sometimes has to wait a long time before a normal
restart can be done.

Peter



On 17.08.2007 at 21:39, Filip Hanik - Dev Lists wrote:

> Peter Rossbach wrote:
>> Hi Filip,
>>
>> OK, but the second is a real problem, and the first you fix ;-)
>> Can you fix it by calling checkExpire in the RecoveryThread?
> I don't know about this one, I could call checkExpire, but if the  
> datagram socket is down, then is the expiration real?
> I guess this should be done, to still guarantee correct  
> notifications according to how it works.
>
> In a situation like this, your cluster will be out of sync, since
> once the network card is back up, no state transfer is initiated again.
> what are your thoughts?
> Filip
>
>>
>> Peter
>>
>>
>> On 17.08.2007 at 21:11, Filip Hanik - Dev Lists wrote:
>>
>>> There are a few drawbacks to my current implementation that I  
>>> need to think about, these are
>>>
>>> 1. I also reset the membership map, this should probably not be  
>>> done at all
>>> 2. During a failure, since I invoked stop to reset the thread, I
>>> am no longer sending out "member disappeared" messages, as the
>>> service is not running
>>>
>>> Filip
>>>
>>> Filip Hanik - Dev Lists wrote:
>>>> hi Peter,
>>>> here is the SVN link
>>>> http://svn.apache.org/viewvc?view=rev&revision=567104
>>>>
>>>> basically what I do, in the receiver/sender thread, if an error  
>>>> happens, I increment a counter.
>>>> this counter also gets decremented upon success.
>>>> after X number of consecutive failures, I launch a new thread,  
>>>> called a RecoveryThread
>>>> this thread simply invokes stop->init->start until it succeeds.
>>>>
>>>> The recovery thread is setup as a singleton, ie, only one can  
>>>> run at any point in time.
>>>>
>>>> I think you'll find that the solution in 6, is much simpler, as  
>>>> I don't have to change any code in the existing membership stuff.
>>>> I had to pull out some initialization from the constructor into  
>>>> the init() method, but after that I could use stop/init/start
>>>> without changing the sender or receiver threads.
>>>>
>>>> I also changed the logging a little bit, only logging the error  
>>>> once (after that log at debug ) to avoid filling up the logs.
>>>> the recovery thread will log every 5 seconds.
>>>>
>>>> So to really answer your question after all my bla bla,
>>>> Yes, the only option is to shut down the socket and start a new  
>>>> one. But to get it done right, I rely on the McastServiceImpl to  
>>>> do the right thing during stop() and start(),
>>>> instead of recoding that into a new method
>>>>
>>>> Filip
>>>>
>>>> Peter Rossbach wrote:
>>>>> Hi Filip,
>>>>>
>>>>> can you explain your 6.0.x fix
>>>>> (http://issues.apache.org/bugzilla/show_bug.cgi?id=40042) a little bit, please?
>>>>> I think our only chance to recover membership after a
>>>>> cluster membership send failure is to reopen the socket.
>>>>>
>>>>> Here is my current cluster 5.5 fix:
>>>>>
>>>>> ==
>>>>>     public class SenderThread extends Thread {
>>>>>         long time;
>>>>>         McastServiceImpl service ;
>>>>>         public SenderThread(long time, McastServiceImpl service) {
>>>>>             this.time = time;
>>>>>             this.service = service ;
>>>>>             setName("Cluster-MembershipSender");
>>>>>
>>>>>         }
>>>>>         public void run() {
>>>>>             long retry = 0 ;
>>>>>             while ( doRun ) {
>>>>>                 try {
>>>>>                     send();
>>>>>                     retry = 0;
>>>>>                 } catch ( Exception x ) {
>>>>>                     // FIXME: only increment when the network is really down: NoRouteToHostException or BindException
>>>>>                     retry++ ;
>>>>>                     log.warn("Unable to send mcast message.",x);
>>>>>                 }
>>>>>
>>>>>                 if(retry > 0) {
>>>>>                     if(retry * time < timeToExpiration ) {
>>>>>                         try {
>>>>>                             Thread.sleep(time);
>>>>>                         } catch ( Exception ignore ) {}
>>>>>                        restartHeartbeat(retry);
>>>>>                     } else {
>>>>>                         long recover = retry % 10 ;
>>>>>                         try {
>>>>>                             Thread.sleep((recover+1)*time);
>>>>>                         } catch ( Exception ignore ) {}
>>>>>                         if( recover == 0) {
>>>>>                             restartHeartbeat(retry) ;
>>>>>                         }
>>>>>                     }
>>>>>                 }
>>>>>             }
>>>>>         }
>>>>>
>>>>>         private void restartHeartbeat(long retry) {
>>>>>             try {
>>>>>                 socket.leaveGroup(address);
>>>>>             } catch (IOException ignore) {}
>>>>>             try {
>>>>>                 log.warn("Restarting membership heartbeat after send failure (number of recovery " + retry + ")");
>>>>>                 service.setupSocket();
>>>>>                 socket.joinGroup(address);
>>>>>             } catch (IOException ignore) {}
>>>>>         }
>>>>>
>>>>>     }//class SenderThread
>>>>> ===
>>>>> peter
>>>>>
>>>>>
>>>>>
>>>>> On 17.08.2007 at 19:56, Filip Hanik - Dev Lists wrote:
>>>>>
>>>>>> Rainer Jung wrote:
>>>>>>> Looks like an active weekend then ;)
>>>>>> I'm sorry, I just reread friday. Friday next week is totally  
>>>>>> fine. No one should have to work on a weekend.
>>>>>> also, for the mcast problem, I'm implementing a fix in 6.0 and  
>>>>>> 6.x, you should be able to copy that one
>>>>>>
>>>>>> Filip
>>>>>>
>>>>>>>
>>>>>>> I think that will suffice.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Rainer
>>>>>>>
>>>>>>> Filip Hanik - Dev Lists wrote:
>>>>>>>> sounds good, lets shoot for Tue or Wed next week then
>>>>>>>>
>>>>>>>> Filip
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
>>>>>>> For additional commands, e-mail: dev-help@tomcat.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> No virus found in this incoming message.
>>>>> Checked by AVG Free Edition. Version: 7.5.484 / Virus Database:  
>>>>> 269.12.0/957 - Release Date: 8/16/2007 1:46 PM
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
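
The recovery approach Filip describes in the quoted thread (count failures in the sender/receiver loop, decrement on success, and after X failures launch a single recovery thread that repeats stop, init, start until it succeeds) might look roughly like the sketch below. The `Lifecycle` interface, the class and method names, the threshold, and the sleep interval are all assumptions made for illustration; this is not the code from revision 567104.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class RecoverySketch {

    /** Hypothetical stand-in for the stop/init/start lifecycle of the membership service. */
    interface Lifecycle {
        void stop() throws Exception;
        void init() throws Exception;
        void start() throws Exception;
    }

    private final Lifecycle service;
    private final int maxFailures;   // the "X" from the description; the value is an assumption
    private final long retrySleepMs; // pause between recovery attempts (assumption)
    private final AtomicInteger failures = new AtomicInteger();
    private final AtomicBoolean recovering = new AtomicBoolean(false);

    RecoverySketch(Lifecycle service, int maxFailures, long retrySleepMs) {
        this.service = service;
        this.maxFailures = maxFailures;
        this.retrySleepMs = retrySleepMs;
    }

    /** Called after a successful send or receive: decrement, but never below zero. */
    void onSuccess() {
        failures.updateAndGet(n -> Math.max(0, n - 1));
    }

    /** Called after an error; returns the recovery thread if this call started one, else null. */
    Thread onFailure() {
        if (failures.incrementAndGet() < maxFailures) {
            return null;
        }
        // "Singleton": only one recovery thread may run at any point in time.
        if (!recovering.compareAndSet(false, true)) {
            return null;
        }
        Thread t = new Thread(this::recover, "Membership-Recovery");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Repeats stop -> init -> start until all three succeed. */
    private void recover() {
        try {
            while (true) {
                try {
                    try {
                        service.stop();
                    } catch (Exception ignore) {
                        // the service may already be down; continue with init/start
                    }
                    service.init();
                    service.start();
                    failures.set(0);
                    return;
                } catch (Exception x) {
                    try {
                        Thread.sleep(retrySleepMs);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        } finally {
            recovering.set(false);
        }
    }

    /** Demo: start() fails twice, then succeeds; returns the number of start() attempts. */
    static int demo() {
        AtomicInteger starts = new AtomicInteger();
        Lifecycle flaky = new Lifecycle() {
            public void stop() {}
            public void init() {}
            public void start() throws Exception {
                if (starts.incrementAndGet() < 3) {
                    throw new java.io.IOException("network still down");
                }
            }
        };
        RecoverySketch r = new RecoverySketch(flaky, 3, 10);
        Thread t = null;
        for (int i = 0; i < 5 && t == null; i++) {
            t = r.onFailure();
        }
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return starts.get();
    }
}
```

The compare-and-set guard is one simple way to get the "only one can run" property without a lock. As the thread notes, a stopped service sends no "member disappeared" notifications while recovery is in progress, and this sketch does nothing to address that.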

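The retry timing in Peter's quoted SenderThread can be pulled out into a pure function, which makes the schedule easier to read and to test. This is only a sketch: `RetrySchedule`, `Step`, `plan` and `isNetworkDown` are hypothetical names, and `isNetworkDown` is just one possible reading of the FIXME comment, not something the posted code does.

```java
import java.net.BindException;
import java.net.NoRouteToHostException;

final class RetrySchedule {

    /** What the sender loop should do after an iteration: how long to sleep, and whether to reopen the socket. */
    static final class Step {
        final long sleepMs;
        final boolean restart;

        Step(long sleepMs, boolean restart) {
            this.sleepMs = sleepMs;
            this.restart = restart;
        }
    }

    /**
     * Mirrors the branch structure of the quoted run() method.
     *
     * @param retry            number of consecutive send failures so far
     * @param time             heartbeat interval in milliseconds
     * @param timeToExpiration membership expiration time in milliseconds
     */
    static Step plan(long retry, long time, long timeToExpiration) {
        if (retry <= 0) {
            return new Step(0, false);              // last send succeeded: nothing extra to do
        }
        if (retry * time < timeToExpiration) {
            return new Step(time, true);            // early phase: restart on every failure
        }
        long recover = retry % 10;
        // late phase: sleep grows from 1x to 10x the interval, restart on every 10th failure
        return new Step((recover + 1) * time, recover == 0);
    }

    /** One reading of the FIXME: count only failures that suggest the network is really down. */
    static boolean isNetworkDown(Exception x) {
        return x instanceof NoRouteToHostException || x instanceof BindException;
    }
}
```

With a 500 ms interval and a 3000 ms expiration, failures 1 to 5 each sleep 500 ms and restart the heartbeat; from failure 6 on the sleep grows with `retry % 10`, and the socket is reopened only when that remainder is 0.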
