tomcat-dev mailing list archives

From Filip Hanik - Dev Lists <devli...@hanik.com>
Subject Re: Rolling 5.5.25?
Date Fri, 17 Aug 2007 20:49:39 GMT
Peter Rossbach wrote:
> Hi Filip,
>
> yes, the "out of session sync" is the real cluster issue. We must find 
> a way for a member to see that another member has come back.
> We must extend the current membership protocol to give the receiver a 
> chance to detect that a member has come back after an I/O problem or restart.
> But I have no good idea for a strategy to resync the sessions :(
yes, the problem lies in the original solution: all-to-all clusters are 
not a good idea, and that is why we are fighting this issue.
I'm about to start a primary/secondary replication solution in trunk; I 
just need a good way to select a backup member and to propagate that 
through the JvmRoute.
I think I'm pretty close to a solution. It would also include a 
versioning mechanism, which could help the recovery solution should 
data become out of sync.
> The other thing that I want to change: a normal shutdown/restart of a 
> member should be signaled to the other members.
This has been resolved in Tomcat 6 through Tribes. When 
memberDisappeared is called, you can check whether Member.getCommand 
equals Member.SHUTDOWN_COMMAND (or something like that).
That is the signal for a normal shutdown.
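To make the idea concrete, here is a minimal sketch of that check, with stub types since the exact Tribes constant and interface shape are from memory here too (the payload value below is purely illustrative):

```java
import java.util.Arrays;

public class ShutdownCheck {

    // Stand-in for the shutdown command payload Tribes attaches to a
    // member that leaves on purpose (illustrative value, not the real one)
    static final byte[] SHUTDOWN_PAYLOAD = {83, 72, 85, 84, 68, 79, 87, 78};

    // Stand-in for the relevant slice of the Tribes Member interface
    interface Member {
        byte[] getCommand();
    }

    // Called from memberDisappeared(Member): true means the member
    // announced a normal shutdown; false means it simply vanished
    // (crash, network failure, ...)
    static boolean isCleanShutdown(Member member) {
        return Arrays.equals(member.getCommand(), SHUTDOWN_PAYLOAD);
    }
}
```

With a distinction like this, a MembershipListener can skip failover work for members that left cleanly and only trigger recovery for real disappearances.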
> Currently an admin must wait, sometimes a long time, before a normal 
> restart can be made.
Yes, that is one of the reasons the refactoring was so important: to have a 
communication framework extensible enough to handle cases like this.

Filip
>
> Peter
>
>
>
> On 17.08.2007, at 21:39, Filip Hanik - Dev Lists wrote:
>
>> Peter Rossbach wrote:
>>> Hi Filip,
>>>
>>> OK, but the second is a real problem, and you fixed the first ;-)
>>> Can you fix it so that we call checkExpire in the RecoveryThread?
>> I don't know about this one. I could call checkExpire, but if the 
>> datagram socket is down, is the expiration real?
>> I guess it should be done anyway, to still guarantee correct 
>> notifications according to how it works.
>>
>> In a situation like this, your cluster will be out of sync, since 
>> once the network card is back up, no state transfer is initiated.
>> What are your thoughts?
>> Filip
>>
>>>
>>> Peter
>>>
>>>
>>> On 17.08.2007, at 21:11, Filip Hanik - Dev Lists wrote:
>>>
>>>> There are a few drawbacks to my current implementation that I need 
>>>> to think about:
>>>>
>>>> 1. I also reset the membership map; this should probably not be 
>>>> done at all.
>>>> 2. During a failure, since I invoked stop to reset the thread, I 
>>>> am no longer sending out "member disappeared" messages, as the 
>>>> service is not running.
>>>>
>>>> Filip
>>>>
>>>> Filip Hanik - Dev Lists wrote:
>>>>> hi Peter,
>>>>> here is the SVN link
>>>>> http://svn.apache.org/viewvc?view=rev&revision=567104
>>>>>
>>>>> Basically, in the receiver/sender thread, if an error 
>>>>> happens, I increment a counter.
>>>>> The counter also gets decremented upon success.
>>>>> After X consecutive failures, I launch a new thread, 
>>>>> called a RecoveryThread.
>>>>> This thread simply invokes stop->init->start until it succeeds.
>>>>>
>>>>> The recovery thread is set up as a singleton, i.e. only one can run 
>>>>> at any point in time.
>>>>>
>>>>> I think you'll find that the solution in 6 is much simpler, as I 
>>>>> don't have to change any code in the existing membership stuff.
>>>>> I had to pull some initialization out of the constructor into 
>>>>> the init() method, but after that I could use stop/init/start
>>>>> without changing the sender or receiver threads.
>>>>>
>>>>> I also changed the logging a little: the error is only logged 
>>>>> once (after that it's logged at debug) to avoid filling up the logs.
>>>>> The recovery thread itself logs every 5 seconds.
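The scheme described above, failure counter plus a singleton recovery thread looping stop->init->start, could be sketched roughly like this (all names are illustrative stand-ins, not the actual Tomcat 6 code):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class RecoverySketch {

    // Stand-in for the lifecycle slice of McastServiceImpl
    interface MembershipService {
        void stop() throws Exception;
        void init() throws Exception;
        void start() throws Exception;
    }

    static final int FAILURE_THRESHOLD = 10;   // "X consecutive failures"
    final AtomicInteger failures = new AtomicInteger();
    final AtomicBoolean recovering = new AtomicBoolean(); // singleton guard
    final MembershipService service;

    RecoverySketch(MembershipService service) {
        this.service = service;
    }

    // Success winds the failure counter back down (never below zero)
    void onSendSuccess() {
        failures.updateAndGet(n -> Math.max(0, n - 1));
    }

    // Each error bumps the counter; past the threshold, kick off recovery
    void onSendFailure() {
        if (failures.incrementAndGet() >= FAILURE_THRESHOLD) {
            startRecovery();
        }
    }

    // Only one recovery thread may run at any point in time
    void startRecovery() {
        if (!recovering.compareAndSet(false, true)) {
            return; // a recovery attempt is already in progress
        }
        Thread t = new Thread(() -> {
            boolean done = false;
            while (!done) {
                try {
                    // reset the service: stop -> init -> start until it succeeds
                    service.stop();
                    service.init();
                    service.start();
                    failures.set(0);
                    done = true;
                } catch (Exception x) {
                    try {
                        Thread.sleep(5000); // retry (and log) every 5 seconds
                    } catch (InterruptedException i) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
            recovering.set(false);
        }, "Membership-RecoveryThread");
        t.setDaemon(true);
        t.start();
    }
}
```

The compareAndSet on the `recovering` flag is what makes the recovery thread a singleton: concurrent callers past the threshold all see the flag already set and return without spawning a second thread.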
>>>>>
>>>>> So, to really answer your question after all my bla bla:
>>>>> Yes, the only option is to shut down the socket and start a new 
>>>>> one. But to get it done right, I rely on the McastServiceImpl to 
>>>>> do the right thing during stop() and start(), instead of recoding 
>>>>> that into a new method.
>>>>>
>>>>> Filip
>>>>>
>>>>> Peter Rossbach wrote:
>>>>>> Hi Filip,
>>>>>>
>>>>>> can you explain your 6.0.x fix 
>>>>>> (http://issues.apache.org/bugzilla/show_bug.cgi?id=40042) a 
>>>>>> little bit, please?
>>>>>> I think we only have a chance to recover the membership after a 
>>>>>> cluster membership send failure by reopening the socket.
>>>>>>
>>>>>> Here my current cluster 5.5 fix:
>>>>>>
>>>>>> ==
>>>>>>     public class SenderThread extends Thread {
>>>>>>         long time;
>>>>>>         McastServiceImpl service;
>>>>>>         public SenderThread(long time, McastServiceImpl service) {
>>>>>>             this.time = time;
>>>>>>             this.service = service;
>>>>>>             setName("Cluster-MembershipSender");
>>>>>>         }
>>>>>>         public void run() {
>>>>>>             long retry = 0;
>>>>>>             while ( doRun ) {
>>>>>>                 try {
>>>>>>                     send();
>>>>>>                     retry = 0;
>>>>>>                 } catch ( Exception x ) {
>>>>>>                     // FIXME: Only increment when the network is
>>>>>>                     // really down: NoRouteToHostException or
>>>>>>                     // BindException
>>>>>>                     retry++;
>>>>>>                     log.warn("Unable to send mcast message.", x);
>>>>>>                 }
>>>>>>
>>>>>>                 if (retry > 0) {
>>>>>>                     if (retry * time < timeToExpiration) {
>>>>>>                         try {
>>>>>>                             Thread.sleep(time);
>>>>>>                         } catch ( Exception ignore ) {}
>>>>>>                         restartHeartbeat(retry);
>>>>>>                     } else {
>>>>>>                         long recover = retry % 10;
>>>>>>                         try {
>>>>>>                             Thread.sleep((recover + 1) * time);
>>>>>>                         } catch ( Exception ignore ) {}
>>>>>>                         if (recover == 0) {
>>>>>>                             restartHeartbeat(retry);
>>>>>>                         }
>>>>>>                     }
>>>>>>                 }
>>>>>>             }
>>>>>>         }
>>>>>>
>>>>>>         private void restartHeartbeat(long retry) {
>>>>>>             try {
>>>>>>                 socket.leaveGroup(address);
>>>>>>             } catch (IOException ignore) {}
>>>>>>             try {
>>>>>>                 log.warn("Restarting membership heartbeat after" +
>>>>>>                     " send failure (number of recovery " + retry + ")");
>>>>>>                 service.setupSocket();
>>>>>>                 socket.joinGroup(address);
>>>>>>             } catch (IOException ignore) {}
>>>>>>         }
>>>>>>
>>>>>>     }//class SenderThread
>>>>>> ===
>>>>>> peter
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17.08.2007, at 19:56, Filip Hanik - Dev Lists wrote:
>>>>>>
>>>>>>> Rainer Jung wrote:
>>>>>>>> Looks like an active weekend then ;)
>>>>>>> I'm sorry, I just reread "friday". Friday next week is totally 
>>>>>>> fine. No one should have to work on a weekend.
>>>>>>> Also, for the mcast problem, I'm implementing a fix in 6.0 and 
>>>>>>> 6.x; you should be able to copy that one.
>>>>>>>
>>>>>>> Filip
>>>>>>>
>>>>>>>>
>>>>>>>> I think that will suffice.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Rainer
>>>>>>>>
>>>>>>>> Filip Hanik - Dev Lists wrote:
>>>>>>>>> sounds good, lets shoot for Tue or Wed next week then
>>>>>>>>>
>>>>>>>>> Filip
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------

>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
>>>>>>>> For additional commands, e-mail: dev-help@tomcat.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


