lucene-java-user mailing list archives

From Phil Black-Knight <pblackkni...@globalgiving.org>
Subject Solr Replication during Tomcat shutdown causes shutdown to hang/fail
Date Thu, 02 Oct 2014 13:18:57 GMT
Sorry to provide a link rather than reply to an email from the mailing
list, but I joined late, and can't seem to figure out a sane way to reply
to the original message...

At any rate, I was helping to look into this issue with Solr replication
blocking a clean Tomcat shutdown; see the original text here:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201409.mbox/%3CCAMcoGHYbHH-evO4%2BqOqY%3D4F7m-XXvn3Ce9VVaMOZvcqK5j%2BYBg%40mail.gmail.com%3E

The problem is easily reproducible by starting replication on the slave and
then sending a shutdown command to Tomcat (e.g. catalina.sh stop).

With a debugger attached, it looks like the fsyncService thread is blocking
VM shutdown because it is created as a non-daemon thread.
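
To illustrate the daemon/non-daemon distinction, here is a minimal standalone
sketch (not Solr code; the class and method names are made up for this
example). It assumes only the standard java.util.concurrent behaviour: the
default thread factory creates non-daemon worker threads, which keep the JVM
alive until the executor is shut down, whereas a custom factory can mark them
as daemon threads.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Hypothetical demo class; nothing here is taken from SnapPuller.
class DaemonFlagDemo {

    // Runs a probe task on the executor and reports whether the worker
    // thread that executed it is a daemon thread. Always shuts the
    // executor down afterwards so the demo itself does not hang.
    static boolean workerIsDaemon(ExecutorService executor) {
        Callable<Boolean> probe = () -> Thread.currentThread().isDaemon();
        try {
            return executor.submit(probe).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    // Thread factory that marks every worker as a daemon thread, so an
    // executor built with it cannot block JVM exit.
    static ThreadFactory daemonFactory(String name) {
        return runnable -> {
            Thread t = new Thread(runnable, name);
            t.setDaemon(true);
            return t;
        };
    }

    // Executor built with the default factory (non-daemon workers).
    static boolean defaultWorkerIsDaemon() {
        return workerIsDaemon(Executors.newSingleThreadExecutor());
    }

    // Executor built with the daemon factory above.
    static boolean daemonFactoryWorkerIsDaemon() {
        return workerIsDaemon(
                Executors.newSingleThreadExecutor(daemonFactory("fsync-demo")));
    }

    public static void main(String[] args) {
        System.out.println("default factory worker is daemon: "
                + defaultWorkerIsDaemon());
        System.out.println("daemon factory worker is daemon:  "
                + daemonFactoryWorkerIsDaemon());
    }
}
```

If the executor with the default factory were never shut down, its idle worker
would keep the JVM from exiting, which matches what the debugger shows for the
fsyncService thread.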

Essentially, what seems to be happening is that the fsyncService thread is
running when 'catalina.sh stop' is executed. The shutdown calls
SnapPuller.destroy(), which aborts the current sync. Around line 517 of
SnapPuller there is code that is supposed to clean up the fsyncService
thread, but I don't think it gets executed: the thread that called
SnapPuller.fetchLatestIndex() is configured as a daemon thread, so the JVM
ends up shutting it down before it can clean up the fsyncService...

So... it seems like either

    if (fsyncService != null)
        ExecutorUtil.shutdownNowAndAwaitTermination(fsyncService);

could be added around line 1706 of SnapPuller.java, or

    puller.setDaemon(false);

could be added around line 230 of ReplicationHandler.java. The second option
needs some additional work, though (and I think that work might be needed
regardless): the cleanup code in SnapPuller (around line 517) that shuts
down the fsync thread never gets executed, because
logReplicationTimeAndConfFiles() can throw an IOException, bypassing the
rest of the finally block. So the call to logReplicationTimeAndConfFiles()
around line 512 would need to be wrapped in a try/catch block that catches
the IOException...
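
To make the finally-block problem concrete, here is a small standalone sketch
of the pattern (again not the actual SnapPuller code). logStats() and
shutdownNowAndAwait() are hypothetical stand-ins for
logReplicationTimeAndConfFiles() and the executor shutdown helper; the only
point is that an exception thrown early in a finally block skips the
statements after it unless it is caught in place.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; all names are assumptions for illustration.
class FinallyCleanupSketch {

    // Stand-in for a logging step that may fail with an IOException.
    static void logStats(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated failure while writing stats");
        }
    }

    // Stand-in for a helper that stops an executor and waits briefly
    // for its threads to exit.
    static void shutdownNowAndAwait(ExecutorService pool) {
        pool.shutdownNow();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Unguarded finally: if logStats() throws, the shutdown line is skipped.
    // Returns whether the executor was shut down by the finally block.
    static boolean unguarded(boolean fail) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            try {
                // ... fetch work would happen here ...
            } finally {
                logStats(fail);
                shutdownNowAndAwait(pool);
            }
        } catch (IOException e) {
            // The exception escaped the finally block before the shutdown ran.
        }
        boolean stopped = pool.isShutdown();
        pool.shutdownNow(); // make sure the demo itself never leaks a thread
        return stopped;
    }

    // Guarded finally: the IOException is caught in place, so the shutdown
    // always runs.
    static boolean guarded(boolean fail) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // ... fetch work would happen here ...
        } finally {
            try {
                logStats(fail);
            } catch (IOException e) {
                System.err.println("could not log stats: " + e.getMessage());
            }
            shutdownNowAndAwait(pool);
        }
        return pool.isShutdown();
    }

    public static void main(String[] args) {
        System.out.println("unguarded, logging fails: " + unguarded(true));
        System.out.println("guarded, logging fails:   " + guarded(true));
    }
}
```

In the unguarded variant the executor is left running whenever the logging
step fails, which is the situation described above for the fsync thread.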

I can submit patches if needed... and cross post to the dev mailing list...

-Phil
