From: "Ferdinand Chan"
To: dev@jackrabbit.apache.org
Subject: RE: About Issue JCR-546
Date: Wed, 11 Oct 2006 11:54:18 +0800
In-Reply-To: <9f929f1c0610101031l3d4672a8ob59c8a467b484710@mail.gmail.com>
It seems that the problem is quite serious. Does anyone use Jackrabbit in a production environment and has successfully found a way to work around this problem? I am working on a content management system that requires a lot of content I/O, and a lot of versioning will take place.

-----Original Message-----
From: Miro Walker [mailto:miro.walker@gmail.com]
Sent: Wednesday, October 11, 2006 1:31 AM
To: dev@jackrabbit.apache.org
Subject: Re: About Issue JCR-546

> My best advice for now has been to explicitly synchronize on the
> repository instance whenever you are doing versioning operations. Note
> that you can still do normal read and write operations concurrently
> with versioning, so this isn't as bad as it could be. Perhaps we
> should put that synchronization inside the versioning methods until
> the concurrency issues are solved...

The problem here is that "versioning operations" covers quite a lot. For us the real nasty is cloning nodes between workspaces, as we've used a content model that maps releases to workspaces. Publishing a release therefore involves cloning an entire workspace (which takes a few tens of minutes). During this period no other write operations can take place.

Putting synchronisation code inside the versioning methods would mean that the entire application locks up during this period, while having it outside in our own app means that we can be a bit more flexible with how we handle locking (e.g. use locks that time out with an error rather than allowing the application to be completely locked for 30-60 minutes at a time).

There are a few areas of the code that cause this sort of problem - the other big one is indexing. In order to support a home-brewed failover mechanism for active-passive clustering, we need to delete the search indexes on failover (as they are likely to be corrupt in the event of a failover).

On subsequent startup the application needs to reindex each workspace independently when it is first accessed. This takes a few minutes, again locking users out while it happens.

I don't think there is a "quick fix" other than to go in, spend some time fixing the existing scenarios where deadlock can occur, and do some hardcore testing of the concurrency issues.

Miro
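For what it's worth, the app-level locking Miro describes (a lock around versioning operations that times out with an error rather than blocking writers for the whole clone) can be sketched with plain java.util.concurrent. This is only an illustration of the pattern, not Jackrabbit API; the class and method names here are made up:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: guard versioning/clone operations with a single
// timed lock so callers fail fast with "repository busy" instead of
// blocking for the duration of a 30-60 minute workspace clone.
public class VersioningGate {

    private final ReentrantLock lock = new ReentrantLock(true); // fair ordering

    /**
     * Runs the given versioning operation if the lock can be acquired
     * within the timeout; returns false (without running it) otherwise.
     */
    public boolean runExclusively(Runnable versioningOp,
                                  long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!lock.tryLock(timeout, unit)) {
            return false; // caller surfaces an error rather than hanging
        }
        try {
            versioningOp.run();
            return true;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        VersioningGate gate = new VersioningGate();
        boolean ran = gate.runExclusively(
                () -> System.out.println("cloning workspace..."),
                5, TimeUnit.SECONDS);
        System.out.println("ran=" + ran); // prints ran=true
    }
}
```

The point of keeping this in the application rather than inside Jackrabbit's versioning methods is exactly the flexibility Miro mentions: the timeout, fairness, and failure behaviour stay under the application's control.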