Hi,

This post is related to SOLR-1475 - 'Java-based replication doesn't properly reserve its commit point during backups', and index backups in general.

In Solr 1.4 and 1.4.1, the SOLR-1475 patch is certainly there, but I don't believe it truly addresses the problem.

Here's why:

When a 'backup' command is received by the ReplicationHandler, it creates a SnapShooter instance and asynchronously takes a full file snapshot of the current commit point.
That commit point, however, is only reserved for 'commitReserveDuration', which defaults to 10 seconds. Once that time has passed, the reservation is cleared on the next commit (see cleanReserves() in IndexDeletionPolicyWrapper.java).

If you perform a backup and no commits occur during this time, it's fine, because cleanReserves() is not called. If a commit does arrive during the backup, and the backup takes longer than 10 seconds, the whole snapshot operation fails (because delete() doesn't see the commit point in savedCommits - see below).
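
To make the timing issue concrete, here is a minimal sketch of a time-limited reservation. It is not the Solr code: the class ReserveModel and its methods are hypothetical names of my own, and the clock is passed in explicitly so the behaviour is easy to follow. It only assumes what is described above, namely that expired reservations are dropped when a commit arrives. It needs Java 8 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified model of time-limited commit reservations (not Solr's class). */
class ReserveModel {
    // commit version -> absolute expiry time in milliseconds
    private final Map<Long, Long> reserves = new ConcurrentHashMap<>();

    /** Reserve a commit version until nowMs + durationMs. */
    void reserve(long version, long nowMs, long durationMs) {
        reserves.put(version, nowMs + durationMs);
    }

    /** Drop expired reservations; in this model that only happens when a commit arrives. */
    void onCommit(long nowMs) {
        reserves.values().removeIf(expiry -> expiry < nowMs);
    }

    boolean isReserved(long version) {
        return reserves.containsKey(version);
    }

    public static void main(String[] args) {
        ReserveModel model = new ReserveModel();
        model.reserve(1L, 0L, 10_000L);            // backup starts, 10 second reservation

        model.onCommit(5_000L);                     // commit inside the reserve window
        System.out.println(model.isReserved(1L));   // true: reservation still valid

        model.onCommit(30_000L);                    // commit while a long backup is still running
        System.out.println(model.isReserved(1L));   // false: the commit point is no longer protected
    }
}
```

With no commit at all, onCommit() never runs and the reservation stays in the map indefinitely, which matches the "no commits during the backup" case above.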

The non-coding workaround is to explicitly set 'commitReserveDuration' in solrconfig.xml to a value higher than the maximum time it takes to do a full backup. As this parameter looks to be used by backup snapshots/postCommits only, setting it to a high value should be OK (but I could be wrong about this - does anyone familiar with the SnapShooter/DeletionPolicy code know why this might be bad?). I've tested it set to 02:00:00 (2 hours) with no ill effects.
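
For reference, the workaround would look roughly like this in solrconfig.xml. This is a sketch only: the handler name and the replicateAfter line are typical values rather than anything specific to my setup, and 02:00:00 is simply the value I tested, so pick a duration that comfortably exceeds your own longest backup.

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <!-- HH:MM:SS; the default is 00:00:10 -->
    <str name="commitReserveDuration">02:00:00</str>
  </lst>
</requestHandler>
```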

Possible patch to SOLR-1475?
Looking at the code in IndexDeletionPolicyWrapper.java, I believe the problem can be found in saveCommitPoint(). The 'savedCommits' HashMap is referenced and checked, but it's always empty as there is no savedCommits.put().

It looks to be a one-line fix:

IndexDeletionPolicyWrapper.java:103:
  /** Permanently prevent this commit point from being deleted.
   * A counter is used to allow a commit point to be correctly saved and released
   * multiple times. */
  public synchronized void saveCommitPoint(Long indexCommitVersion) {
    AtomicInteger reserveCount = savedCommits.get(indexCommitVersion);
    if (reserveCount == null) reserveCount = new AtomicInteger();
    reserveCount.incrementAndGet();
+   savedCommits.put(indexCommitVersion, reserveCount);
  }
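
To show the effect of the missing put() in isolation, here is a small standalone sketch. Again this is not the Solr class: SavedCommitsModel and its method names are hypothetical, and the release logic is my own assumption about how a save/release counter would behave, based on the comment above the method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified model of the savedCommits counter map (not Solr's class). */
class SavedCommitsModel {
    private final Map<Long, AtomicInteger> savedCommits = new HashMap<>();

    /** Mirrors the method without the fix: the counter is never stored in the map. */
    synchronized void saveWithoutPut(Long version) {
        AtomicInteger reserveCount = savedCommits.get(version);
        if (reserveCount == null) reserveCount = new AtomicInteger();
        reserveCount.incrementAndGet(); // lost as soon as the method returns
    }

    /** Mirrors the method with the proposed one-line fix. */
    synchronized void saveWithPut(Long version) {
        AtomicInteger reserveCount = savedCommits.get(version);
        if (reserveCount == null) reserveCount = new AtomicInteger();
        reserveCount.incrementAndGet();
        savedCommits.put(version, reserveCount);
    }

    /** Assumed release logic: drop one hold, forget the version when no holds remain. */
    synchronized void release(Long version) {
        AtomicInteger reserveCount = savedCommits.get(version);
        if (reserveCount != null && reserveCount.decrementAndGet() <= 0) {
            savedCommits.remove(version);
        }
    }

    /** What a delete() check against savedCommits would see. */
    synchronized boolean isSaved(Long version) {
        return savedCommits.containsKey(version);
    }
}
```

In this model, isSaved() stays false after saveWithoutPut(), so a deletion check would never find the commit point, whereas saveWithPut() keeps it visible until every save has been matched by a release.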

If the experts agree this is a good fix, I guess it should go into the SOLR-1475 issue, but I thought I'd run it past those more knowledgeable about this part of the code base before entering it into JIRA.
Any thoughts or comments are greatly appreciated.

Thanks,
Peter