jackrabbit-users mailing list archives

From Bryan Davis <brda...@bea.com>
Subject Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")
Date Fri, 09 Mar 2007 23:24:13 GMT
Stefan,

There are a couple of issues that, collectively, we need to address in order
to successfully use Jackrabbit.

Issue #1: Serialization of all repository updates.  See
https://issues.apache.org/jira/browse/JCR-314, which I think seriously
understates the significance of the issue.  In any environment where users
are routinely writing anything at all to the repository (like audit or log
information), a large file upload (or a small file over a slow link) will
effectively block all other users until it completes.

Having all other threads hang while a file is being uploaded is simply a
show stopper for us (unfortunately this issue is marked as minor, reported
in 0.9, and not currently slated for a particular release).  Trying to solve
this issue outside of Jackrabbit yields only stopgap solutions; external
mitigation strategies (like uploading to a local file and then streaming into
Jackrabbit as fast as possible) all seem fairly complex to make robust, since
some data management would have to be handled outside of the repository
transaction.  That leaves us with trying to resolve the issue by patching
Jackrabbit.

I now understand that the Jackrabbit fixes are multifaceted, and that (at
least) they involve changes to both the Persistence Manager and Shared Item
State Manager.  The Persistence Manager changes (which I will talk about
separately), I think, are easy enough.  The SISM obviously needs to be
upgraded to have more granular locking semantics (possibly by using
item-level nested locks, or maybe a partial solution that depends on
database-level locking in the Persistence Manager).

There are a number of lock manager implementations floating around that
could potentially be repurposed for use inside the SISM.  I am uncertain of
the requirements for distribution here, although on the surface it seems
like a local locking implementation is all that is required since it seems
like clustering support is handled at a higher level.
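To make the idea concrete, here is a minimal sketch of the kind of item-level locking I have in mind for the SISM. This is hypothetical code of my own, not existing Jackrabbit API: one read-write lock per item id, so readers of different items never block each other and a writer only blocks access to the item it touches.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical per-item lock manager sketch (names are my own invention).
class ItemLockManager {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<String, ReentrantReadWriteLock>();

    // Lazily create one lock per item id, safely under concurrency.
    private ReentrantReadWriteLock lockFor(String itemId) {
        ReentrantReadWriteLock lock = locks.get(itemId);
        if (lock == null) {
            ReentrantReadWriteLock fresh = new ReentrantReadWriteLock();
            ReentrantReadWriteLock prev = locks.putIfAbsent(itemId, fresh);
            lock = (prev != null) ? prev : fresh;
        }
        return lock;
    }

    public void readLock(String itemId)    { lockFor(itemId).readLock().lock(); }
    public void readUnlock(String itemId)  { lockFor(itemId).readLock().unlock(); }
    public void writeLock(String itemId)   { lockFor(itemId).writeLock().lock(); }
    public void writeUnlock(String itemId) { lockFor(itemId).writeLock().unlock(); }

    // Non-blocking variant, handy for deadlock-avoiding lock ordering schemes.
    public boolean tryWriteLock(String itemId) {
        return lockFor(itemId).writeLock().tryLock();
    }
}
```

A real implementation would also need nested/ordered acquisition for multi-item updates (to preserve referential integrity) and some eviction of unused locks, but the point is that a large binary write would then only hold the write lock for the node it is writing.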

It is also tempting to try and push this functionality into the database,
since it is already doing all of the required locking anyway.  A custom
transaction manager that delegated to the repository session transaction
manager (thereby associating JDBC connections with the repository session),
in conjunction with a stock data source implementation (see below) might do
the trick.  Of course this would only work with database PMs, but perhaps
other TMs could still have the existing SISM locking enabled.  This would
be good enough for us since we only use database PMs, and a better, more
universal solution could be implemented at a later date.

Has anyone looked into this issue at all or have any advice / thoughts?

Issue #2: Multiple issues in database persistence managers.  I believe the
database persistence managers have multiple issues (please correct me if I
get any of this wrong).

1. JDBC connection details should not be in the repository.xml.  I should be
free to change the specifics of a particular database connection without it
constituting a repository initialization parameter change (which is
effectively what changing the repository.xml is, since it gets copied and
subsequently accessed from inside the repository itself).  If a host name or
driver class or even connection URL changes, I should not have to manually
edit internal repository configuration files to effect the change.
2. Sharing JDBC connections (and objects obtained from them, like prepared
statements) between multiple threads is not a good practice.  Even though
many drivers support such activity, it is not specifically required by the
spec, and many drivers do not support it.  Even for ones that do, there is
always a significant list of caveats (like changing the transaction
isolation of a connection impacting all threads, or rollbacks sometimes
being executed against the wrong thread).  Plus, as far as I can tell, there
is also no particularly good reason to attempt this in this case.  Connection
pooling is extremely well understood and there are a variety of
implementations to choose from (including Apache's own in Jakarta Commons).
3. Synchronization is bad (generally speaking of course :).  Once the
multithreaded issues of JDBC are removed (via a connection pool), there are
no good reasons that I can see to have any synchronization in the database
persistence managers.  Since any sort of requirement for synchronized
operation would be coming from a higher layer, it should also be provided at
a higher layer.  I have always felt that a good rule of thumb in server code
is to avoid synchronization at all costs, particularly in core server
functionality (like reading and writing to a repository).  It is extremely
difficult to fully understand the global implications of synchronized code,
particularly code that is synchronized at a low level.  Any serialization of
user requests is extremely serious in a multithreaded server and, in my
experience, will lead to show-stopping performance and scalability issues in
nearly all cases.  In addition, serialization of requests at such a low
level probably means that other synchronized code that is intended to be
properly multithreaded is not well tested, since the request
serialization has eliminated (or greatly reduced) the possibility of the
code being reentered like it normally would be.
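As an illustration of how little machinery pooling actually requires, here is a minimal blocking pool sketch (my own, purely illustrative; in practice we would of course use an existing implementation like Jakarta Commons DBCP rather than rolling our own). Threads borrow a pooled resource, use it exclusively, and return it, which removes any need for synchronization around the resource itself:

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal borrow/return pool sketch.  A real pool (e.g. Commons DBCP)
// adds connection validation, eviction, and growth; this only shows the
// contract that gives each thread exclusive use of a resource.
class SimplePool<T> {
    private final BlockingQueue<T> available;

    SimplePool(Collection<T> resources) {
        available = new ArrayBlockingQueue<T>(resources.size(), false, resources);
    }

    // Blocks until some other thread returns a resource.
    public T borrow() {
        try {
            return available.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for a resource", e);
        }
    }

    // Hand the resource back for the next thread.
    public void release(T resource) {
        available.offer(resource);
    }
}
```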

The solution to all of these issues, maybe, is to use the standard JDBC
DataSource interface to encapsulate the details of managing the JDBC
connections.  If all of the current PM and FS implementations that use JDBC
were refactored to have a DataSource member and to get and release
connections inside of each method, then parity with the existing
implementations could be achieved by providing a default DataSourceLookup
strategy implementation that simply encapsulated the existing connection
creation code (ignoring connection release requests).  This would allow us
(and others) to externally extend the implementation with alternative
DataSourceLookup strategy implementations, say for accessing a datasource
from JNDI, or getting it from a Spring application context.  This solution
also neatly externalizes all of the details of the actual datasource
configuration from the repository.xml.
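To make the proposal concrete, a rough sketch of the strategy interface follows. The names here (DataSourceLookup, JndiDataSourceLookup, the "jndiName" property) are my own invention for illustration, not existing Jackrabbit API:

```java
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.sql.DataSource;

// Hypothetical strategy interface: the PM would hold a DataSource obtained
// through one of these, then get and release a connection inside each
// load/exists/store method, e.g.:
//   Connection con = dataSource.getConnection();
//   try { /* load/store */ } finally { con.close(); }
interface DataSourceLookup {
    DataSource lookup(Properties config) throws Exception;
}

// Sketch of a JNDI-based implementation, e.g. for an app-server-managed
// pool; a Spring-context-based lookup would be another implementation.
class JndiDataSourceLookup implements DataSourceLookup {
    public DataSource lookup(Properties config) throws Exception {
        Context ctx = new InitialContext();
        try {
            return (DataSource) ctx.lookup(config.getProperty("jndiName"));
        } finally {
            ctx.close();
        }
    }
}
```

The default implementation would wrap the existing connection-creation code behind the same interface, so current repository.xml configurations keep working unchanged.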

Thanks!
Bryan.



On 3/8/07 2:48 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:

> On 3/7/07, Bryan Davis <brdavis@bea.com> wrote:
>> Well, serializing on the prepared statement is still fairly serialized since
>> we are really only talking about nodes and properties (two locks instead of
>> one).  If concurrency is controlled at a higher level then why is
>> synchronization in the PM necessary?
> 
> a PM's implementation should be thread-safe because it might be used
> in another context or e.g. by a tool.
> 
>> 
>> The code now seems to assume that the connection object is thread-safe (and
>> the specifics of thread-safeness of connection objects and other objects
>> derived from them are pretty much up to the driver).  This is one of the
>> reasons why connection pooling is used pretty much universally.
>> 
>> If the built-in PM's used data sources instead of connections then the
>> connection settings could be more easily externalized (as these are
>> typically configurable by the end user). Is there any way to externalize the
>> JDBC connection settings from repository.xml right now (in 1.2.2) and
>> configure them at runtime?
> 
> i strongly disagree. the pm configuration is *not* supposed to be
> configurable by the end user and certainly not at runtime. do you think
> that e.g. the tablespace settings (physical datafile paths etc) of an oracle
> db should be user configurable? i hope not...
> 
>> 
>> You didn't really answer my question about Jackrabbit and its ability to
>> fetch and store information through the PM concurrently... What is the
>> synchronization at the higher level and how does it work?
> 
> the current synchronization is used to guarantee data consistency (such as
> referential integrity).
> 
> have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
> and you'll get the idea.
> 
>> 
>> Finally, we are seeing a new issue where if a particular user uploads a
>> large document all other users start to get exceptions (doing a normal mix
>> of mostly reads/some writes).  If there is no way to do concurrent writes to
>> the PM I don't see any way around this problem (and it is pretty serious for
>> us).
> 
> there's a related improvement issue:
> https://issues.apache.org/jira/browse/JCR-314
> 
> please feel free to comment on this issue or file a new issue if you think
> that it doesn't cover your use case.
> 
> cheers
> stefan
> 
>> 
>> Bryan.
>> 
>> 
>> On 3/6/07 4:12 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:
>> 
>>> On 3/5/07, Bryan Davis <brdavis@bea.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:
>>>> 
>>>>> hi bryan
>>>>> 
>>>>> On 3/2/07, Bryan Davis <brdavis@bea.com> wrote:
>>>>>> What persistence manager are you using?
>>>>>> 
>>>>>> Our tests indicate that the stock persistence managers are a significant
>>>>>> bottleneck for both writes and also initial reads to load the transient
>>>>>> store (on the order of .5 seconds per node when using a remote database
>>>>>> like
>>>>>> MSSQL or Oracle).
>>>>> 
>>>>> what do you mean by "load the transient store"?
>>>>> 
>>>>>> 
>>>>>> The stock db persistence managers have all methods marked as
>>>>>> "synchronized",
>>>>>> which blocks on the classdef (which means that even different persistence
>>>>>> managers for different workspaces will serialize all load, exists and store
>>>>> 
>>>>> assuming you're talking about DatabasePersistenceManager:
>>>>> the store/destroy methods are 'synchronized' on the instance, not on
>>>>> the 'classdef'.
>>>>> see e.g.
>>>>> 
>>>>> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
>>>>> 
>>>>> the load/exists methods are synchronized on the specific prepared stmt
>>>>> they're
>>>>> using.
>>>>> 
>>>>> since every workspace uses its own persistence manager instance i can't
>>>>> follow your conclusion that all load, exists and store operations would
>>>>> be globally serialized across all workspaces.
>>>> 
>>>> Hm, this is my bad... It does seem that sync methods are on the instance.
>>>> Since the db persistence manager has "synchronized" on load, store and
>>>> exists, though, this would still serialize all of these operations for a
>>>> particular workspace.
>>> 
>>> ?? the load methods are *not* synchronized. they contain a section which
>>> is synchronized on the particular prepared stmt.
>>> 
>>> <quote from my previous reply>
>>> wrt synchronization:
>>> concurrency is controlled outside the persistence manager on a higher level.
>>> eliminating the method synchronization would imo therefore have *no* impact
>>> on concurrency/performance.
>>> </quote>
>>> 
>>> cheers
>>> stefan
>>> 
>>>> 
>>>>>> operations).  Presumably this is because they allocate a JDBC connection
>>>>>> at
>>>>>> startup and use it throughout, and the connection object is not
>>>>>> multithreaded.
>>>>> 
>>>>> what leads you to this assumption?
>>>> 
>>>> Are there other requirements that all of these operations are serialized
>>>> for
>>>> a particular PM instance?  This seems like a pretty serious bottleneck
>>>> (and,
>>>> in fact, is a pretty serious bottleneck when the database is remote from
>>>> the
>>>> repository).
>>>> 
>>>>>> 
>>>>>> This problem isn't as noticeable when you are using embedded Derby and
>>>>>> reading/writing to the file system, but when you are doing a network
>>>>>> operation to a database server, the network latency in combination with
>>>>>> the
>>>>>> serialization of all database operations results in a significant
>>>>>> performance degradation.
>>>>> 
>>>>> again: serialization of 'all' database operations?
>>>> 
>>>> The distinction between all and all for a workspace would really only be
>>>> relevant during versioning, right?
>>>> 
>>>>>> 
>>>>>> The new bundle persistence manager (which isn't yet in SVN) improves
>>>>>> things
>>>>>> dramatically since it inlines properties into the node, so loading or
>>>>>> persisting a node is only one operation (plus the additional connection for
>>>>>> the LOB) instead of one for the node and one for each property.  The
>>>>>> bundle persistence manager also uses prepared statements and keeps a
>>>>>> PM-level cache of nodes (with properties) and also non-existent nodes
>>>>>> (which
>>>>>> permits many exists() calls to return without accessing the database).
>>>>>> 
>>>>>> Changing all db persistence managers to use a datasource and get and
>>>>>> release
>>>>>> connections inside of load, exists and store operations and eliminating
>>>>>> the
>>>>>> method synchronization is a relatively simple change that further
>>>>>> improves
>>>>>> performance for connecting to database servers.
>>>>> 
>>>>> the use of datasources, connection pools and the like have been discussed
>>>>> in extenso on the list. see e.g.
>>>>> 
>>>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
>>>>> http://issues.apache.org/jira/browse/JCR-313
>>>>> 
>>>>> i don't see how getting & releasing connections in every load, exists and
>>>>> store
>>>>> call would improve performance. could you please elaborate?
>>>>> 
>>>>> please note that you wouldn't be able to use prepared statements over
>>>>> multiple
>>>>> load, store etc operations because you'd have to return the connection
>>>>> at the end
>>>>> of every call. the performance might therefore be even worse.
>>>>> 
>>>>> further note that write operations must occur within a single jdbc
>>>>> transaction, i.e.
>>>>> you can't get a new connection for every store/destroy operation.
>>>>> 
>>>>> wrt synchronization:
>>>>> concurrency is controlled outside the persistence manager on a higher
>>>>> level.
>>>>> eliminating the method synchronization would imo therefore have *no*
>>>>> impact
>>>>> on concurrency/performance.
>>>> 
>>>> So you are saying that it is impossible to concurrently load or store data
>>>> in Jackrabbit?
>>>> 
>>>>>> There is a persistence manager with an ASL license called
>>>>>> "DataSourcePersistenceManager" which seems to be the PM of choice for people
>>>>>> using Magnolia (which is backed by Jackrabbit).  It also uses prepared
>>>>>> statements and eliminates the current single-connection issues associated
>>>>>> with all of the stock db PMs.  It doesn't seem to have been submitted
>>>>>> back
>>>>>> to the Jackrabbit project.  If you Google for
>>>>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager"
>>>>>> you
>>>>>> should be able to find it.
>>>>> 
>>>>> thanks for the hint. i am aware of this pm and i had a look at it a couple
>>>>> of
>>>>> months ago. the major issue was that it didn't implement the
>>>>> correct/required
>>>>> semantics. it used a new connection for every write operation which
>>>>> clearly violates the contract that the write operations should occur
>>>>> within
>>>>> a jdbc transaction bracket. further it creates a prepared stmt on every
>>>>> load, store etc. which is hardly efficient...
>>>> 
>>>> Yes, this PM does have this issue.  The bundle PM implements prepared
>>>> statements in the correct way.
>>>> 
>>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do not need
>>>>>> to
>>>>>> use the Oracle-specific PMs because the 10g drivers support the standard
>>>>>> BLOB API (in addition to the Oracle-specific BLOB API required by the
>>>>>> older
>>>>>> 9i drivers).  This is true even if you are connecting to an older
>>>>>> database
>>>>>> server as the limitation was in the driver itself.  Frankly you should
>>>>>> never
>>>>>> use the 9i drivers as they are pretty buggy and the 10g drivers represent
>>>>>> a
>>>>>> complete rewrite.  Make sure you use the new driver package because the
>>>>>> 10g
>>>>>> driver JAR also includes the older 9i drivers for backward-compatibility.
>>>>>> The new driver is in a new package (can't remember the exact name off the
>>>>>> top of my head).
>>>>> 
>>>>> thanks for the information.
>>>>> 
>>>>> cheers
>>>>> stefan
>>>> 
>>>> We are very interested in getting a good understanding of the specifics of
>>>> how PM's work, as initial reads and writes, according to our profiling, are
>>>> spending 80-90% of the time inside the PM.
>>>> 
>>>> Bryan.
>>>> 
>>>> _______________________________________________________________________
>>>> Notice:  This email message, together with any attachments, may contain
>>>> information  of  BEA Systems,  Inc.,  its subsidiaries  and  affiliated
>>>> entities,  that may be confidential,  proprietary,  copyrighted  and/or
>>>> legally privileged, and is intended solely for the use of the individual
>>>> or entity named in this message. If you are not the intended recipient,
>>>> and have received this message in error, please immediately return this
>>>> by email and then delete it.
>>>> 
>> 

