jackrabbit-users mailing list archives

From "Stefan Guggisberg" <stefan.guggisb...@gmail.com>
Subject Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")
Date Sat, 10 Mar 2007 11:10:34 GMT
On 3/10/07, Bryan Davis <brdavis@bea.com> wrote:
> Stefan,
>
> There are a couple of issues that, collectively, we need to address in order
> to successfully use Jackrabbit.
>
> Issue #1: Serialization of all repository updates.  See
> https://issues.apache.org/jira/browse/JCR-314, which I think seriously
> understates the significance of the issue.  In any environment where users
> are routinely writing anything at all to the repository (like audit or log
> information), a large file upload (or a small file over a slow link) will
> effectively block all other users until it completes.
>
> Having all other threads hang while a file is being uploaded is simply a
> show stopper for us (unfortunately this issue is marked as minor, reported
> in 0.9, and not currently slated for a particular release).  Trying to solve
> this issue outside of Jackrabbit is impossible, providing only stopgap
> solutions; plus external mitigation strategies (like uploading to a local
> file and then streaming into Jackrabbit as fast as possible) all seem fairly
> complex to make robust, seeing as how some data management would have to be
> handled outside of the repository transaction.  Which leaves us with trying
> to resolve the issue by patching Jackrabbit.
>
> I now understand that the Jackrabbit fixes are multifaceted, and that (at
> least) they involve changes to both the Persistence Manager and Shared Item
> State Manager.  The Persistence Manager changes (which I will talk about
> separately), I think, are easy enough.  The SISM obviously needs to be
> upgraded to have more granular locking semantics (possibly by using
> item-level nested locks, or maybe a partial solution that depends on
> database-level locking in the Persistence Manager).
>
> There are a number of lock manager implementations floating around that
> could potentially be repurposed for use inside the SISM.  I am uncertain of
> the requirements for distribution here, although on the surface it seems
> like a local locking implementation is all that is required since it seems
> like clustering support is handled at a higher level.
>
> It is also tempting to try and push this functionality into the database,
> since it is already doing all of the required locking anyway.  A custom
> transaction manager that delegated to the repository session transaction
> manager (thereby associating JDBC connections with the repository session),
> in conjunction with a stock data source implementation (see below) might do
> the trick.  Of course this would only work with database PMs, but perhaps
> other TMs could still have the existing SISM locking enabled.  This would
> be good enough for us since we only use database PMs, and a better, more
> universal solution could be implemented at a later date.
>
> Has anyone looked into this issue at all or have any advice / thoughts?
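[A minimal sketch of the item-level locking idea Bryan raises above. ItemLockManager is a hypothetical illustrative name, not an existing Jackrabbit class, and item ids are modeled as plain strings for simplicity:]

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// One read-write lock per item id instead of a single repository-wide
// lock: readers of an item proceed concurrently, and a writer blocks
// only sessions touching the same item, so a large file upload no
// longer stalls unrelated writes.
public class ItemLockManager {
    private final ConcurrentMap<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String itemId) {
        return locks.computeIfAbsent(itemId, id -> new ReentrantReadWriteLock());
    }

    // Many sessions may hold the read lock on the same item at once.
    public void acquireRead(String itemId)  { lockFor(itemId).readLock().lock(); }
    public void releaseRead(String itemId)  { lockFor(itemId).readLock().unlock(); }

    // Exclusive per-item write lock.
    public void acquireWrite(String itemId) { lockFor(itemId).writeLock().lock(); }
    public void releaseWrite(String itemId) { lockFor(itemId).writeLock().unlock(); }

    public boolean isWriteLocked(String itemId) {
        return lockFor(itemId).isWriteLocked();
    }
}
```

[A real SISM variant would also need deadlock avoidance (e.g. acquiring the locks of an update's item set in a canonical order), which this sketch omits.]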
>
> Issue #2: Multiple issues in database persistence managers.  I believe the
> database persistence managers have multiple issues (please correct me if I
> get any of this wrong).

bryan,

thanks for sharing your elaborate analysis. however i don't agree with your
analysis and proposals regarding the database persistence manager.
i tried to explain my point in my previous replies. it seems like you either
didn't read them, don't agree or don't care. that's all perfectly fine with me,
but i am not gonna repeat myself.

since you seem to know exactly what's wrong in the current implementation,
feel free to create a jira issue and submit a patch with the proposed changes.

btw: i agree that synchronization is bad if you don't understand it and use it
incorrectly ;)

cheers
stefan

>
> 1. JDBC connection details should not be in the repository.xml.  I should be
> free to change the specifics of a particular database connection without it
> constituting a repository initialization parameter change (which is
> effectively what changing the repository.xml is, since it gets copied and
> subsequently accessed from inside the repository itself).  If a host name or
> driver class or even connection URL changes, I should not have to manually
> edit internal repository configuration files to effect the change.
> 2. Sharing JDBC connections (and objects obtained from them, like prepared
> statements) between multiple threads is not a good practice.  Even though
> many drivers support such activity, it is not specifically required by the
> spec, and many drivers do not support it.  Even for ones that do, there is
> always a significant list of caveats (like changing the transaction
> isolation of a connection impacting all threads, or rollbacks sometimes
> being executed against the wrong thread).  Plus, as far as I can tell, there
> is also no particular good reason to attempt this in this case.  Connection
> pooling is extremely well understood and there are a variety of
> implementations to choose from (including Apache's own in Jakarta Commons).
> 3. Synchronization is bad (generally speaking of course :).  Once the
> multithreaded issues of JDBC are removed (via a connection pool), there are
> no good reasons that I can see to have any synchronization in the database
> persistence managers.  Since any sort of requirement for synchronized
> operation would be coming from a higher layer, it should also be provided at
> a higher layer.  I have always felt that a good rule of thumb in server code
> is to avoid synchronization at all costs, particularly in core server
> functionality (like reading and writing to a repository).  It is extremely
> difficult to fully understand the global implications of synchronized code,
> particularly code that is synchronized at a low level.  Any serialization of
> user requests is extremely serious in a multithreaded server and, in my
> experience, will lead to show-stopping performance and scalability issues in
> nearly all cases.  In addition, serialization of requests at such a low
> level probably means that other synchronized code that is intended to be
> properly multithreaded is not well tested, since the request serialization
> has eliminated (or greatly reduced) the possibility of the code being
> reentered like it normally would be.
>
> The solution to all of these issues, maybe, is to use the standard JDBC
> DataSource interface to encapsulate the details of managing the JDBC
> connections.  If all of the current PM and FS implementations that use JDBC
> were refactored to have a DataSource member and to get and release
> connections inside of each method, then parity with the existing
> implementations could be achieved by providing a default DataSourceLookup
> strategy implementation that simply encapsulated the existing connection
> creation code (ignoring connection release requests).  This would allow us
> (and others) to externally extend the implementation with alternative
> DataSourceLookup strategy implementations, say for accessing a datasource
> from JNDI, or getting it from a Spring application context.  This solution
> also neatly externalizes all of the details of the actual datasource
> configuration from the repository.xml.
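[For illustration, a minimal sketch of the borrow/release pooling pattern Bryan describes. SimplePool is a hypothetical name; a real PM would pool java.sql.Connection objects obtained from the configured DataSource, and production-grade pools such as Jakarta Commons DBCP already exist:]

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Minimal bounded object pool illustrating the borrow/release pattern;
// each load/exists/store call would borrow a connection at entry and
// release it in a finally block, so no two threads ever share one.
public class SimplePool<T> {
    private final BlockingQueue<T> idle;

    public SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Blocks until a pooled object is available, bounding concurrency
    // at the pool size without any method-level synchronization.
    public T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for pool", e);
        }
    }

    public void release(T obj) {
        idle.add(obj);
    }
}
```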
>
> Thanks!
> Bryan.
>
>
>
> On 3/8/07 2:48 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:
>
> > On 3/7/07, Bryan Davis <brdavis@bea.com> wrote:
> >> Well, serializing on the prepared statement is still fairly serialized since
> >> we are really only talking about nodes and properties (two locks instead of
> >> one).  If concurrency is controlled at a higher level then why is
> >> synchronization in the PM necessary?
> >
> > a PM's implementation should be thread-safe because it might be used
> > in another context or e.g. by a tool.
> >
> >>
> >> The code now seems to assume that the connection object is thread-safe (and
> >> the specifics of thread-safety of connection objects and other objects
> >> derived from them are pretty much up to the driver).  This is one of the
> >> reasons why connection pooling is used pretty much universally.
> >>
> >> If the built-in PM's used data sources instead of connections then the
> >> connection settings could be more easily externalized (as these are
> >> typically configurable by the end user). Is there any way to externalize the
> >> JDBC connection settings from repository.xml right now (in 1.2.2) and
> >> configure them at runtime?
> >
> > i strongly disagree. the pm configuration is *not* supposed to be
> > configurable by the end user and certainly not at runtime. do you think
> > that e.g. the tablespace settings (physical datafile paths etc) of an oracle
> > db should be user configurable? i hope not...
> >
> >>
> >> You didn't really answer my question about Jackrabbit and its ability to
> >> fetch and store information through the PM concurrently... What is the
> >> synchronization at the higher level and how does it work?
> >
> > the current synchronization is used to guarantee data consistency (such as
> > referential integrity).
> >
> > have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
> > and you'll get the idea.
> >
> >>
> >> Finally, we are seeing a new issue where if a particular user uploads a
> >> large document all other users start to get exceptions (doing a normal mix
> >> of mostly reads/some writes).  If there is no way to do concurrent writes to
> >> the PM I don't see any way around this problem (and it is pretty serious for
> >> us).
> >
> > there's a related improvement issue:
> > https://issues.apache.org/jira/browse/JCR-314
> >
> > please feel free to comment on this issue or file a new issue if you think
> > that it doesn't cover your use case.
> >
> > cheers
> > stefan
> >
> >>
> >> Bryan.
> >>
> >>
> >> On 3/6/07 4:12 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:
> >>
> >>> On 3/5/07, Bryan Davis <brdavis@bea.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:
> >>>>
> >>>>> hi bryan
> >>>>>
> >>>>> On 3/2/07, Bryan Davis <brdavis@bea.com> wrote:
> >>>>>> What persistence manager are you using?
> >>>>>>
> >>>>>> Our tests indicate that the stock persistence managers are a significant
> >>>>>> bottleneck for both writes and also initial reads to load the transient
> >>>>>> store (on the order of .5 seconds per node when using a remote database
> >>>>>> like MSSQL or Oracle).
> >>>>>
> >>>>> what do you mean by "load the transient store"?
> >>>>>
> >>>>>>
> >>>>>> The stock db persistence managers have all methods marked as
> >>>>>> "synchronized", which blocks on the classdef (which means that even
> >>>>>> different persistence managers for different workspaces will serialize
> >>>>>> all load, exists and store
> >>>>>
> >>>>> assuming you're talking about DatabasePersistenceManager:
> >>>>> the store/destroy methods are 'synchronized' on the instance, not on
> >>>>> the 'classdef'.
> >>>>> see e.g.
> >>>>> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
> >>>>>
> >>>>> the load/exists methods are synchronized on the specific prepared stmt
> >>>>> they're using.
> >>>>>
> >>>>> since every workspace uses its own persistence manager instance i can't
> >>>>> follow your conclusion that all load, exists and store operations would
> >>>>> be globally serialized across all workspaces.
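[A quick, deterministic way to verify Stefan's point that a synchronized instance method locks `this`, not the class. `Pm` is a hypothetical stand-in, not actual Jackrabbit code:]

```java
// A synchronized instance method acquires the monitor of 'this' only;
// two PM instances (one per workspace) therefore never contend on
// these monitors, and the class object stays unlocked throughout.
public class Pm {
    public boolean heldInstance;
    public boolean heldClass;

    public synchronized void store() {
        heldInstance = Thread.holdsLock(this);   // true: monitor on 'this'
        heldClass = Thread.holdsLock(Pm.class);  // false: class not locked
    }
}
```

[Only concurrent calls against the same instance serialize; calls against distinct instances proceed in parallel.]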
> >>>>
> >>>> Hm, this is my bad... It does seem that sync methods are on the instance.
> >>>> Since the db persistence manager has "synchronized" on load, store and
> >>>> exists, though, this would still serialize all of these operations for a
> >>>> particular workspace.
> >>>
> >>> ?? the load methods are *not* synchronized. they contain a section which
> >>> is synchronized on the particular prepared stmt.
> >>>
> >>> <quote from my previous reply>
> >>> wrt synchronization:
> >>> concurrency is controlled outside the persistence manager on a higher level.
> >>> eliminating the method synchronization would imo therefore have *no* impact
> >>> on concurrency/performance.
> >>> </quote>
> >>>
> >>> cheers
> >>> stefan
> >>>
> >>>>
> >>>>>> operations).  Presumably this is because they allocate a JDBC
> >>>>>> connection at startup and use it throughout, and the connection
> >>>>>> object is not multithreaded.
> >>>>>
> >>>>> what leads you to this assumption?
> >>>>
> >>>> Are there other requirements that all of these operations are serialized
> >>>> for a particular PM instance?  This seems like a pretty serious
> >>>> bottleneck (and, in fact, is a pretty serious bottleneck when the
> >>>> database is remote from the repository).
> >>>>
> >>>>>>
> >>>>>> This problem isn't as noticeable when you are using embedded Derby
> >>>>>> and reading/writing to the file system, but when you are doing a
> >>>>>> network operation to a database server, the network latency in
> >>>>>> combination with the serialization of all database operations results
> >>>>>> in a significant performance degradation.
> >>>>>
> >>>>> again: serialization of 'all' database operations?
> >>>>
> >>>> The distinction between all and all-for-a-workspace would really only
> >>>> be relevant during versioning, right?
> >>>>
> >>>>>>
> >>>>>> The new bundle persistence manager (which isn't yet in SVN) improves
> >>>>>> things dramatically since it inlines properties into the node, so
> >>>>>> loading or persisting a node is only one operation (plus the
> >>>>>> additional connection for the LOB) instead of one for the node and
> >>>>>> one for each property.  The bundle persistence manager also uses
> >>>>>> prepared statements and keeps a PM-level cache of nodes (with
> >>>>>> properties) and also non-existent nodes (which permits many exists()
> >>>>>> calls to return without accessing the database).
> >>>>>>
> >>>>>> Changing all db persistence managers to use a datasource and get and
> >>>>>> release connections inside of load, exists and store operations and
> >>>>>> eliminating the method synchronization is a relatively simple change
> >>>>>> that further improves performance for connecting to database servers.
> >>>>>
> >>>>> the use of datasources, connection pools and the like have been
> >>>>> discussed in extenso on the list. see e.g.
> >>>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> >>>>> http://issues.apache.org/jira/browse/JCR-313
> >>>>>
> >>>>> i don't see how getting & releasing connections in every load, exists
> >>>>> and store call would improve performance. could you please elaborate?
> >>>>>
> >>>>> please note that you wouldn't be able to use prepared statements over
> >>>>> multiple load, store etc operations because you'd have to return the
> >>>>> connection at the end of every call. the performance might therefore
> >>>>> be even worse.
> >>>>>
> >>>>> further note that write operations must occur within a single jdbc
> >>>>> transaction, i.e.
> >>>>> you can't get a new connection for every store/destroy operation.
> >>>>>
> >>>>> wrt synchronization:
> >>>>> concurrency is controlled outside the persistence manager on a higher
> >>>>> level. eliminating the method synchronization would imo therefore have
> >>>>> *no* impact on concurrency/performance.
> >>>>
> >>>> So you are saying that it is impossible to concurrently load or store
> >>>> data in Jackrabbit?
> >>>>
> >>>>>> There is a persistence manager with an ASL license called
> >>>>>> "DataSourcePersistenceManager" which seems to be the PM of choice for
> >>>>>> people using Magnolia (which is backed by Jackrabbit).  It also uses
> >>>>>> prepared statements and eliminates the current single-connection
> >>>>>> issues associated with all of the stock db PMs.  It doesn't seem to
> >>>>>> have been submitted back to the Jackrabbit project.  If you Google for
> >>>>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager"
> >>>>>> you should be able to find it.
> >>>>>
> >>>>> thanks for the hint. i am aware of this pm and i had a look at it a
> >>>>> couple of months ago. the major issue was that it didn't implement the
> >>>>> correct/required semantics. it used a new connection for every write
> >>>>> operation which clearly violates the contract that the write operations
> >>>>> should occur within a jdbc transaction bracket. further it creates a
> >>>>> prepared stmt on every load, store etc. which is hardly efficient...
> >>>>
> >>>> Yes, this PM does have this issue.  The bundle PM implements prepared
> >>>> statements in the correct way.
> >>>>
> >>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do not
> >>>>>> need to use the Oracle-specific PMs because the 10g drivers support
> >>>>>> the standard BLOB API (in addition to the Oracle-specific BLOB API
> >>>>>> required by the older 9i drivers).  This is true even if you are
> >>>>>> connecting to an older database server, as the limitation was in the
> >>>>>> driver itself.  Frankly you should never use the 9i drivers as they
> >>>>>> are pretty buggy and the 10g drivers represent a complete rewrite.
> >>>>>> Make sure you use the new driver package because the 10g driver JAR
> >>>>>> also includes the older 9i drivers for backward-compatibility.  The
> >>>>>> new driver is in a new package (can't remember the exact name off the
> >>>>>> top of my head).
> >>>>>
> >>>>> thanks for the information.
> >>>>>
> >>>>> cheers
> >>>>> stefan
> >>>>
> >>>> We are very interested in getting a good understanding of the specifics
> >>>> of how PM's work, as initial reads and writes, according to our
> >>>> profiling, are spending 80-90% of the time inside the PM.
> >>>>
> >>>> Bryan.
> >>>>
> >>>> _______________________________________________________________________
> >>>> Notice:  This email message, together with any attachments, may contain
> >>>> information  of  BEA Systems,  Inc.,  its subsidiaries  and  affiliated
> >>>> entities,  that may be confidential,  proprietary,  copyrighted  and/or
> >>>> legally privileged, and is intended solely for the use of the individual
> >>>> or entity named in this message. If you are not the intended recipient,
> >>>> and have received this message in error, please immediately return this
> >>>> by email and then delete it.
> >>>>
> >>
> >>
>
>
