jackrabbit-users mailing list archives

From "Stefan Guggisberg" <stefan.guggisb...@gmail.com>
Subject Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted)
Date Sun, 11 Mar 2007 10:21:28 GMT
On 3/11/07, Michael Neale <michael.neale@gmail.com> wrote:
> Stefan - I think one of the main issues is the low priority of JCR-314 - it
> is much more serious than "minor" - my understanding is that any long
> running action - e.g. an upload of a large piece of content (file) - will
> basically block any other concurrent actions. For more fine-grained CMS uses
> - this is probably not a problem (as content is read-mostly a lot of the
> time) - BUT, for people wanting to store large blobs (something that people
> would look at using JCR for) - this is a showstopper. Many transaction
> monitors in app servers have timeouts of 30 seconds - on the web or other
> tiers (of course this can be adjusted - but it's not exactly a user friendly
> solution!).
>
> (my understanding may be wrong - I sincerely hope it is).

i agree with you, i changed the priority of JCR-314 to 'major'.

cheers
stefan

>
> On 3/10/07, Stefan Guggisberg <stefan.guggisberg@gmail.com> wrote:
> >
> > On 3/10/07, Bryan Davis <brdavis@bea.com> wrote:
> > > Stefan,
> > >
> > > There are a couple of issues that, collectively, we need to address in
> > > order to successfully use Jackrabbit.
> > >
> > > Issue #1: Serialization of all repository updates.  See
> > > https://issues.apache.org/jira/browse/JCR-314, which I think seriously
> > > understates the significance of the issue.  In any environment where
> > > users are routinely writing anything at all to the repository (like
> > > audit or log information), a large file upload (or a small file over a
> > > slow link) will effectively block all other users until it completes.
> > >
> > > Having all other threads hang while a file is being uploaded is simply a
> > > show stopper for us (unfortunately this issue is marked as minor,
> > > reported in 0.9, and not currently slated for a particular release).
> > > Trying to solve this issue outside of Jackrabbit is impossible,
> > > providing only stopgap solutions; plus external mitigation strategies
> > > (like uploading to a local file and then streaming into Jackrabbit as
> > > fast as possible) all seem fairly complex to make robust, seeing as how
> > > some data management would have to be handled outside of the repository
> > > transaction.  Which leaves us with trying to resolve the issue by
> > > patching Jackrabbit.
> > >
> > > I now understand that the Jackrabbit fixes are multifaceted, and that
> > > (at least) they involve changes to both the Persistence Manager and
> > > Shared Item State Manager.  The Persistence Manager changes (which I
> > > will talk about separately), I think, are easy enough.  The SISM
> > > obviously needs to be upgraded to have more granular locking semantics
> > > (possibly by using item-level nested locks, or maybe a partial solution
> > > that depends on database-level locking in the Persistence Manager).
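
As a minimal sketch of the item-level locking idea above (illustrative only:
this is not Jackrabbit's SharedItemStateManager code, and the ItemLockManager
class and String item ids are invented for the example), per-item read/write
locks could look roughly like this:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical sketch: one read/write lock per item instead of a single
    // repository-wide lock, so a long write to one subtree does not block
    // readers and writers of unrelated items.
    public class ItemLockManager {

        private final Map<String, ReadWriteLock> locks =
                new ConcurrentHashMap<String, ReadWriteLock>();

        private ReadWriteLock lockFor(String itemId) {
            ReadWriteLock lock = locks.get(itemId);
            if (lock == null) {
                ReadWriteLock candidate = new ReentrantReadWriteLock();
                ReadWriteLock existing = locks.putIfAbsent(itemId, candidate);
                lock = (existing != null) ? existing : candidate;
            }
            return lock;
        }

        public void acquireRead(String itemId)  { lockFor(itemId).readLock().lock(); }
        public void releaseRead(String itemId)  { lockFor(itemId).readLock().unlock(); }
        public void acquireWrite(String itemId) { lockFor(itemId).writeLock().lock(); }
        public void releaseWrite(String itemId) { lockFor(itemId).writeLock().unlock(); }
    }
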
> > >
> > > There are a number of lock manager implementations floating around that
> > > could potentially be repurposed for use inside the SISM.  I am uncertain
> > > of the requirements for distribution here, although on the surface it
> > > seems like a local locking implementation is all that is required since
> > > it seems like clustering support is handled at a higher level.
> > >
> > > It is also tempting to try and push this functionality into the
> > > database, since it is already doing all of the required locking anyway.
> > > A custom transaction manager that delegated to the repository session
> > > transaction manager (thereby associating JDBC connections with the
> > > repository session), in conjunction with a stock data source
> > > implementation (see below) might do the trick.  Of course this would
> > > only work with database PMs, but perhaps other TMs could still have the
> > > existing SISM locking enabled.  This would be good enough for us since
> > > we only use database PMs, and a better, more universal solution could be
> > > implemented at a later date.
> > >
> > > Has anyone looked into this issue at all or have any advice / thoughts?
> > >
> > > Issue #2: Multiple issues in database persistence managers.  I believe
> > > the database persistence managers have multiple issues (please correct
> > > me if I get any of this wrong).
> >
> > bryan,
> >
> > thanks for sharing your elaborate analysis. however i don't agree with
> > your analysis and proposals regarding the database persistence manager.
> > i tried to explain my point in my previous replies. it seems like you
> > either didn't read them, don't agree or don't care. that's all perfectly
> > fine with me, but i am not gonna repeat myself.
> >
> > since you seem to know exactly what's wrong in the current implementation,
> > feel free to create a jira issue and submit a patch with the proposed
> > changes.
> >
> > btw: i agree that synchronization is bad if you don't understand it and
> > use it incorrectly ;)
> >
> > cheers
> > stefan
> >
> > >
> > > 1. JDBC connection details should not be in the repository.xml.  I
> > > should be free to change the specifics of a particular database
> > > connection without it constituting a repository initialization parameter
> > > change (which is effectively what changing the repository.xml is, since
> > > it gets copied and subsequently accessed from inside the repository
> > > itself).  If a host name or driver class or even connection URL changes,
> > > I should not have to manually edit internal repository configuration
> > > files to effect the change.
> > > 2. Sharing JDBC connections (and objects obtained from them, like
> > > prepared statements) between multiple threads is not a good practice.
> > > Even though many drivers support such activity, it is not specifically
> > > required by the spec, and many drivers do not support it.  Even for ones
> > > that do, there is always a significant list of caveats (like changing
> > > the transaction isolation of a connection impacting all threads, or
> > > rollbacks sometimes being executed against the wrong thread).  Plus, as
> > > far as I can tell, there is also no particularly good reason to attempt
> > > this in this case.  Connection pooling is extremely well understood and
> > > there are a variety of implementations to choose from (including
> > > Apache's own in Jakarta Commons).
> > > 3. Synchronization is bad (generally speaking of course :).  Once the
> > > multithreaded issues of JDBC are removed (via a connection pool), there
> > > are no good reasons that I can see to have any synchronization in the
> > > database persistence managers.  Since any sort of requirement for
> > > synchronized operation would be coming from a higher layer, it should
> > > also be provided at a higher layer.  I have always felt that a good rule
> > > of thumb in server code is to avoid synchronization at all costs,
> > > particularly in core server functionality (like reading and writing to a
> > > repository).  It is extremely difficult to fully understand the global
> > > implications of synchronized code, particularly code that is
> > > synchronized at a low level.  Any serialization of user requests is
> > > extremely serious in a multithreaded server and, in my experience, will
> > > lead to show-stopping performance and scalability issues in nearly all
> > > cases.  In addition, serialization of requests at such a low level
> > > probably means that other synchronized code that is intended to be
> > > properly multithreaded is not well tested, since the request
> > > serialization has eliminated (or greatly reduced) the possibility of the
> > > code being reentered like it normally would be.
> > >
> > > The solution to all of these issues, maybe, is to use the standard JDBC
> > > DataSource interface to encapsulate the details of managing the JDBC
> > > connections.  If all of the current PM and FS implementations that use
> > > JDBC were refactored to have a DataSource member and to get and release
> > > connections inside of each method, then parity with the existing
> > > implementations could be achieved by providing a default
> > > DataSourceLookup strategy implementation that simply encapsulated the
> > > existing connection creation code (ignoring connection release
> > > requests).  This would allow us (and others) to externally extend the
> > > implementation with alternative DataSourceLookup strategy
> > > implementations, say for accessing a datasource from JNDI, or getting it
> > > from a Spring application context.  This solution also neatly
> > > externalizes all of the details of the actual datasource configuration
> > > from the repository.xml.
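
The DataSourceLookup strategy mentioned here is a proposal, not an existing
Jackrabbit interface; a minimal sketch of what it might look like, with a
JNDI-backed implementation (names and the JNDI path are hypothetical), could
be:

    import javax.naming.InitialContext;
    import javax.sql.DataSource;

    // Hypothetical strategy interface: the PM only sees a DataSource and
    // never knows where the connection details live.
    public interface DataSourceLookup {
        DataSource getDataSource() throws Exception;
    }

    // Externalized variant: resolve the DataSource from JNDI, so driver, URL
    // and credentials are configured in the container, not in repository.xml.
    // (A default variant could simply wrap the existing driver/url/user/password
    // creation code to keep parity with current behaviour.)
    class JndiDataSourceLookup implements DataSourceLookup {

        private final String jndiName;   // e.g. "java:comp/env/jdbc/jackrabbit"

        JndiDataSourceLookup(String jndiName) {
            this.jndiName = jndiName;
        }

        public DataSource getDataSource() throws Exception {
            return (DataSource) new InitialContext().lookup(jndiName);
        }
    }
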
> > >
> > > Thanks!
> > > Bryan.
> > >
> > >
> > >
> > > On 3/8/07 2:48 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:
> > >
> > > > On 3/7/07, Bryan Davis <brdavis@bea.com> wrote:
> > > >> Well, serializing on the prepared statement is still fairly
> > serialized since
> > > >> we are really only talking about nodes and properties (two locks
> > instead of
> > > >> one).  If concurrency is controlled at a higher level then why is
> > > >> synchronization in the PM necessary?
> > > >
> > > > a PM's implementation should be thread-safe because it might be used
> > > > in another context or e.g. by a tool.
> > > >
> > > >>
> > > >> The code now seems to assume that the connection object is thread-safe
> > > >> (and the specifics for thread-safeness of connection objects and other
> > > >> objects derived from them are pretty much up to the driver).  This is
> > > >> one of the reasons why connection pooling is used pretty much
> > > >> universally.
> > > >>
> > > >> If the built-in PMs used data sources instead of connections then the
> > > >> connection settings could be more easily externalized (as these are
> > > >> typically configurable by the end user).  Is there any way to
> > > >> externalize the JDBC connection settings from repository.xml right now
> > > >> (in 1.2.2) and configure them at runtime?
> > > >
> > > > i strongly disagree. the pm configuration is *not* supposed to be
> > > > configurable by the end user and certainly not at runtime. do you
> > > > think that e.g. the tablespace settings (physical datafile paths etc)
> > > > of an oracle db should be user configurable? i hope not...
> > > >
> > > >>
> > > >> You didn't really answer my question about Jackrabbit and its ability
> > > >> to fetch and store information through the PM concurrently... What is
> > > >> the synchronization at the higher level and how does it work?
> > > >
> > > > the current synchronization is used to guarantee data consistency
> > > > (such as referential integrity).
> > > >
> > > > have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
> > > > and you'll get the idea.
> > > >
> > > >>
> > > >> Finally, we are seeing a new issue where if a particular user uploads
> > > >> a large document all other users start to get exceptions (doing a
> > > >> normal mix of mostly reads/some writes).  If there is no way to do
> > > >> concurrent writes to the PM I don't see any way around this problem
> > > >> (and it is pretty serious for us).
> > > >
> > > > there's a related improvement issue:
> > > > https://issues.apache.org/jira/browse/JCR-314
> > > >
> > > > please feel free to comment on this issue or file a new issue if you
> > > > think that it doesn't cover your use case.
> > > >
> > > > cheers
> > > > stefan
> > > >
> > > >>
> > > >> Bryan.
> > > >>
> > > >>
> > > >> On 3/6/07 4:12 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:
> > > >>
> > > >>> On 3/5/07, Bryan Davis <brdavis@bea.com> wrote:
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com> wrote:
> > > >>>>
> > > >>>>> hi bryan
> > > >>>>>
> > > >>>>> On 3/2/07, Bryan Davis <brdavis@bea.com> wrote:
> > > >>>>>> What persistence manager are you using?
> > > >>>>>>
> > > >>>>>> Our tests indicate that the stock persistence managers are a
> > > >>>>>> significant bottleneck for both writes and also initial reads to
> > > >>>>>> load the transient store (on the order of .5 seconds per node when
> > > >>>>>> using a remote database like MSSQL or Oracle).
> > > >>>>>
> > > >>>>> what do you mean by "load the transient store"?
> > > >>>>>
> > > >>>>>>
> > > >>>>>> The stock db persistence managers have all methods marked as
> > > >>>>>> "synchronized", which blocks on the classdef (which means that
> > > >>>>>> even different persistence managers for different workspaces will
> > > >>>>>> serialize all load, exists and store
> > > >>>>>
> > > >>>>> assuming you're talking about DatabasePersistenceManager:
> > > >>>>> the store/destroy methods are 'synchronized' on the instance, not
> > > >>>>> on the 'classdef'.
> > > >>>>> see e.g.
> > > >>>>> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
> > > >>>>>
> > > >>>>> the load/exists methods are synchronized on the specific prepared
> > > >>>>> stmt they're using.
> > > >>>>>
> > > >>>>> since every workspace uses its own persistence manager instance i
> > > >>>>> can't follow your conclusion that all load, exists and store
> > > >>>>> operations would be globally serialized across all workspaces.
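
For readers following along, a condensed sketch of the two synchronization
styles being contrasted here; this is an illustration in the spirit of the
discussion, not the actual DatabasePersistenceManager source, and the SQL and
names are invented:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class DbPersistenceManagerSketch {

        private Connection con;               // single shared connection, opened at init (not shown)
        private PreparedStatement nodeSelect; // prepared once and reused

        // store/destroy style: the whole method is synchronized on the PM
        // instance, so writes through one PM are serialized against each other.
        public synchronized void store(String nodeId, byte[] data) throws SQLException {
            // ... executeUpdate on shared statements, committed as one unit ...
        }

        // load/exists style: the method itself is not synchronized; only the
        // section using the shared prepared statement locks on that statement.
        public byte[] load(String nodeId) throws SQLException {
            synchronized (nodeSelect) {
                nodeSelect.setString(1, nodeId);
                ResultSet rs = nodeSelect.executeQuery();
                try {
                    return rs.next() ? rs.getBytes(1) : null;
                } finally {
                    rs.close();
                }
            }
        }
    }
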
> > > >>>>
> > > >>>> Hm, this is my bad... It does seem that sync methods are on the
> > > >>>> instance.  Since the db persistence manager has "synchronized" on
> > > >>>> load, store and exists, though, this would still serialize all of
> > > >>>> these operations for a particular workspace.
> > > >>>
> > > >>> ?? the load methods are *not* synchronized. they contain a section
> > > >>> which is synchronized on the particular prepared stmt.
> > > >>>
> > > >>> <quote from my previous reply>
> > > >>> wrt synchronization:
> > > >>> concurrency is controlled outside the persistence manager on a
> > > >>> higher level.
> > > >>> eliminating the method synchronization would imo therefore have *no*
> > > >>> impact on concurrency/performance.
> > > >>> </quote>
> > > >>>
> > > >>> cheers
> > > >>> stefan
> > > >>>
> > > >>>>
> > > >>>>>> operations).  Presumably this is because they allocate a JDBC
> > > >>>>>> connection at startup and use it throughout, and the connection
> > > >>>>>> object is not multithreaded.
> > > >>>>>
> > > >>>>> what leads you to this assumption?
> > > >>>>
> > > >>>> Are there other requirements that all of these operations are
> > > >>>> serialized for a particular PM instance?  This seems like a pretty
> > > >>>> serious bottleneck (and, in fact, is a pretty serious bottleneck
> > > >>>> when the database is remote from the repository).
> > > >>>>
> > > >>>>>>
> > > >>>>>> This problem isn't as noticeable when you are using embedded
> > > >>>>>> Derby and reading/writing to the file system, but when you are
> > > >>>>>> doing a network operation to a database server, the network
> > > >>>>>> latency in combination with the serialization of all database
> > > >>>>>> operations results in a significant performance degradation.
> > > >>>>>
> > > >>>>> again: serialization of 'all' database operations?
> > > >>>>
> > > >>>> The distinction between all and all for a workspace would really
> > > >>>> only be relevant during versioning, right?
> > > >>>>
> > > >>>>>>
> > > >>>>>> The new bundle persistence manager (which isn't yet in SVN)
> > > >>>>>> improves things dramatically since it inlines properties into the
> > > >>>>>> node, so loading or persisting a node is only one operation (plus
> > > >>>>>> the additional connection for the LOB) instead of one for the
> > > >>>>>> node and one for each property.  The bundle persistence manager
> > > >>>>>> also uses prepared statements and keeps a PM-level cache of nodes
> > > >>>>>> (with properties) and also non-existent nodes (which permits many
> > > >>>>>> exists() calls to return without accessing the database).
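
The cache of existing and non-existent nodes described here can be pictured
with a small sketch; this is not the bundle persistence manager's actual
implementation, and the class, the size bound and the miss marker are invented
for illustration:

    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Bounded LRU-style cache that remembers known misses as well as hits,
    // so repeated exists() checks for absent nodes never touch the database.
    class BundleCacheSketch {

        private static final Object MISS = new Object();
        private final Map<String, Object> cache;

        BundleCacheSketch(final int maxEntries) {
            cache = Collections.synchronizedMap(
                    new LinkedHashMap<String, Object>(16, 0.75f, true) {
                        protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                            return size() > maxEntries;
                        }
                    });
        }

        boolean exists(String nodeId) {
            Object cached = cache.get(nodeId);
            if (cached != null) {
                return cached != MISS;                 // answered from memory
            }
            Object bundle = loadFromDatabase(nodeId);  // stand-in for the JDBC lookup
            cache.put(nodeId, bundle != null ? bundle : MISS);
            return bundle != null;
        }

        Object loadFromDatabase(String nodeId) {
            return null;   // placeholder; the real PM would read the node bundle here
        }
    }
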
> > > >>>>>>
> > > >>>>>> Changing all db persistence managers to use a datasource and get
> > > >>>>>> and release connections inside of load, exists and store
> > > >>>>>> operations and eliminating the method synchronization is a
> > > >>>>>> relatively simple change that further improves performance for
> > > >>>>>> connecting to database servers.
> > > >>>>>
> > > >>>>> the use of datasources, connection pools and the like have been
> > > >>>>> discussed in extenso on the list. see e.g.
> > > >>>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> > > >>>>> http://issues.apache.org/jira/browse/JCR-313
> > > >>>>>
> > > >>>>> i don't see how getting & releasing connections in every load,
> > > >>>>> exists and store call would improve performance. could you please
> > > >>>>> elaborate?
> > > >>>>>
> > > >>>>> please note that you wouldn't be able to use prepared statements
> > > >>>>> over multiple load, store etc operations because you'd have to
> > > >>>>> return the connection at the end of every call. the performance
> > > >>>>> might therefore be even worse.
> > > >>>>>
> > > >>>>> further note that write operations must occur within a single jdbc
> > > >>>>> transaction, i.e. you can't get a new connection for every
> > > >>>>> store/destroy operation.
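
The "jdbc transaction" point above amounts to performing all writes of one
change set on one connection and committing or rolling them back together. A
minimal sketch (table and column names invented) of why a fresh connection per
store/destroy call could not provide that guarantee:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class WriteBracketSketch {

        // All node bundles of one change set go through the same connection
        // and become visible atomically; with a new connection per store()
        // call, a failure halfway through could not be rolled back as a whole.
        static void storeChangeSet(Connection con, byte[][] bundles) throws SQLException {
            boolean previousAutoCommit = con.getAutoCommit();
            con.setAutoCommit(false);
            try {
                PreparedStatement stmt = con.prepareStatement(
                        "insert into NODE_TABLE (NODE_ID, DATA) values (?, ?)");
                try {
                    for (int i = 0; i < bundles.length; i++) {
                        stmt.setString(1, "node-" + i);   // placeholder ids
                        stmt.setBytes(2, bundles[i]);
                        stmt.executeUpdate();
                    }
                } finally {
                    stmt.close();
                }
                con.commit();        // everything or nothing
            } catch (SQLException e) {
                con.rollback();
                throw e;
            } finally {
                con.setAutoCommit(previousAutoCommit);
            }
        }
    }
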
> > > >>>>>
> > > >>>>> wrt synchronization:
> > > >>>>> concurrency is controlled outside the persistence manager on a
> > > >>>>> higher level.
> > > >>>>> eliminating the method synchronization would imo therefore have
> > > >>>>> *no* impact on concurrency/performance.
> > > >>>>
> > > >>>> So you are saying that it is impossible to concurrently load or
> > > >>>> store data in Jackrabbit?
> > > >>>>
> > > >>>>>> There is a persistence manager with an ASL license called
> > > >>>>>> "DataSourcePersistenceManager" which seems to be the PM of choice
> > > >>>>>> for people using Magnolia (which is backed by Jackrabbit).  It
> > > >>>>>> also uses prepared statements and eliminates the current
> > > >>>>>> single-connection issues associated with all of the stock db PMs.
> > > >>>>>> It doesn't seem to have been submitted back to the Jackrabbit
> > > >>>>>> project.  If you Google for
> > > >>>>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager"
> > > >>>>>> you should be able to find it.
> > > >>>>>
> > > >>>>> thanks for the hint. i am aware of this pm and i had a look at it
> > > >>>>> a couple of months ago. the major issue was that it didn't
> > > >>>>> implement the correct/required semantics. it used a new connection
> > > >>>>> for every write operation which clearly violates the contract that
> > > >>>>> the write operations should occur within a jdbc transaction
> > > >>>>> bracket. further it creates a prepared stmt on every load, store
> > > >>>>> etc. which is hardly efficient...
> > > >>>>
> > > >>>> Yes, this PM does have this issue.  The bundle PM implements
> > > >>>> prepared statements in the correct way.
> > > >>>>
> > > >>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do
> > > >>>>>> not need to use the Oracle-specific PMs because the 10g drivers
> > > >>>>>> support the standard BLOB API (in addition to the Oracle-specific
> > > >>>>>> BLOB API required by the older 9i drivers).  This is true even if
> > > >>>>>> you are connecting to an older database server as the limitation
> > > >>>>>> was in the driver itself.  Frankly you should never use the 9i
> > > >>>>>> drivers as they are pretty buggy and the 10g drivers represent a
> > > >>>>>> complete rewrite.  Make sure you use the new driver package
> > > >>>>>> because the 10g driver JAR also includes the older 9i drivers for
> > > >>>>>> backward-compatibility.  The new driver is in a new package
> > > >>>>>> (can't remember the exact name off the top of my head).
> > > >>>>>
> > > >>>>> thanks for the information.
> > > >>>>>
> > > >>>>> cheers
> > > >>>>> stefan
> > > >>>>
> > > >>>> We are very interested in getting a good understanding of the
> > > >>>> specifics of how PMs work, as initial reads and writes, according
> > > >>>> to our profiling, are spending 80-90% of the time inside the PM.
> > > >>>>
> > > >>>> Bryan.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > > >>
> > >
> > >
> >
>
