jackrabbit-users mailing list archives

From "Michael Neale" <michael.ne...@gmail.com>
Subject Re: Database PersistenceManagers (was "Results of a JR Oracle test that we conducted")
Date Sun, 11 Mar 2007 05:50:06 GMT
Stefan - I think one of the main issues is the low priority of JCR-314 - it
is much more serious than "minor". My understanding is that any long-running
action - e.g. an upload of a large piece of content (a file) - will
basically block all other concurrent actions. For more fine-grained CMS uses
this is probably not a problem (as content is read-mostly a lot of the
time) - BUT, for people wanting to store large blobs (something people would
look at using JCR for), this is a showstopper. Many transaction monitors in
app servers have timeouts of 30 seconds on the web or other tiers (of course
this can be adjusted - but it's not exactly a user-friendly solution!).

(my understanding may be wrong - I sincerely hope it is).
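To make the failure mode concrete, here is a minimal, self-contained sketch (plain java.util.concurrent, not Jackrabbit code; all names are invented for illustration) of how a single repository-wide write lock held across a slow upload starves every other writer:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the JCR-314 failure mode: one repository-wide
// write lock held for the duration of a slow upload blocks every other
// writer. All names here are invented; this is not Jackrabbit code.
public class GlobalWriteLockDemo {
    public static final ReentrantReadWriteLock repoLock = new ReentrantReadWriteLock();

    // Another session's small write: gives up after a short timeout,
    // much like a transaction monitor with a 30-second limit would.
    public static boolean smallWrite(long timeoutMillis) throws InterruptedException {
        if (repoLock.writeLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            repoLock.writeLock().unlock();
            return true;
        }
        return false; // timed out waiting behind the upload
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch uploadStarted = new CountDownLatch(1);
        Thread uploader = new Thread(() -> {
            repoLock.writeLock().lock();   // "saving" a large file
            uploadStarted.countDown();
            try { Thread.sleep(300); } catch (InterruptedException ignored) {}
            repoLock.writeLock().unlock();
        });
        uploader.start();
        uploadStarted.await();
        System.out.println("small write during upload: " + smallWrite(50));
        uploader.join();
        System.out.println("small write after upload:  " + smallWrite(50));
    }
}
```

Run as-is, the first small write times out while the upload holds the lock, and the second succeeds once it is released; with finer-grained locking the first write would not have to wait at all.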

On 3/10/07, Stefan Guggisberg <stefan.guggisberg@gmail.com> wrote:
>
> On 3/10/07, Bryan Davis <brdavis@bea.com> wrote:
> > Stefan,
> >
> > There are a couple of issues that, collectively, we need to address in
> > order to successfully use Jackrabbit.
> >
> > Issue #1: Serialization of all repository updates.  See
> > https://issues.apache.org/jira/browse/JCR-314, which I think seriously
> > understates the significance of the issue.  In any environment where
> > users are routinely writing anything at all to the repository (like
> > audit or log information), a large file upload (or a small file over a
> > slow link) will effectively block all other users until it completes.
> >
> > Having all other threads hang while a file is being uploaded is simply a
> > show stopper for us (unfortunately this issue is marked as minor,
> > reported in 0.9, and not currently slated for a particular release).
> > Trying to solve this issue outside of Jackrabbit is impossible, yielding
> > only stopgap solutions; external mitigation strategies (like uploading
> > to a local file and then streaming into Jackrabbit as fast as possible)
> > all seem fairly complex to make robust, since some data management would
> > have to be handled outside of the repository transaction.  Which leaves
> > us with trying to resolve the issue by patching Jackrabbit.
> >
> > I now understand that the Jackrabbit fixes are multifaceted, and that
> > (at least) they involve changes to both the Persistence Manager and the
> > Shared Item State Manager.  The Persistence Manager changes (which I
> > will talk about separately) are, I think, easy enough.  The SISM
> > obviously needs to be upgraded to have more granular locking semantics
> > (possibly by using item-level nested locks, or maybe a partial solution
> > that depends on database-level locking in the Persistence Manager).
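Sketched very roughly, the item-level locking idea might look like this (ItemLockManager and its methods are invented names, not a proposed Jackrabbit API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical item-level lock manager for the SharedItemStateManager.
// The class and method names are invented for illustration; this is not
// Jackrabbit code. Each item id gets its own read/write lock, so a long
// write to one item no longer serializes writes to every other item.
public class ItemLockManager {
    private final ConcurrentMap<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String itemId) {
        return locks.computeIfAbsent(itemId, id -> new ReentrantReadWriteLock());
    }

    public void lockRead(String itemId)    { lockFor(itemId).readLock().lock(); }
    public void unlockRead(String itemId)  { lockFor(itemId).readLock().unlock(); }
    public void lockWrite(String itemId)   { lockFor(itemId).writeLock().lock(); }
    public void unlockWrite(String itemId) { lockFor(itemId).writeLock().unlock(); }

    // True if some thread currently holds the write lock for this item.
    public boolean isWriteLocked(String itemId) {
        return lockFor(itemId).isWriteLocked();
    }
}
```

Writes to two unrelated items then contend only when they touch the same item; deadlock avoidance (ordered acquisition for multi-item saves) and referential-integrity checks are deliberately left out of this sketch.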
> >
> > There are a number of lock manager implementations floating around that
> > could potentially be repurposed for use inside the SISM.  I am uncertain
> > of the requirements for distribution here, although on the surface it
> > seems like a local locking implementation is all that is required, since
> > clustering support appears to be handled at a higher level.
> >
> > It is also tempting to try to push this functionality into the
> > database, since it is already doing all of the required locking anyway.
> > A custom transaction manager that delegated to the repository session
> > transaction manager (thereby associating JDBC connections with the
> > repository session), in conjunction with a stock data source
> > implementation (see below), might do the trick.  Of course this would
> > only work with database PMs, but perhaps other TMs could still have the
> > existing SISM locking enabled.  This would be good enough for us since
> > we only use database PMs, and a better, more universal solution could be
> > implemented at a later date.
> >
> > Has anyone looked into this issue at all or have any advice / thoughts?
> >
> > Issue #2: Multiple issues in database persistence managers.  I believe
> > the database persistence managers have multiple issues (please correct
> > me if I get any of this wrong).
>
> bryan,
>
> thanks for sharing your elaborate analysis. however i don't agree with
> your analysis and proposals regarding the database persistence manager.
> i tried to explain my point in my previous replies. it seems like you
> either didn't read it, don't agree or don't care. that's all perfectly
> fine with me, but i am not gonna repeat myself.
>
> since you seem to know exactly what's wrong in the current implementation,
> feel free to create a jira issue and submit a patch with the proposed
> changes.
>
> btw: i agree that synchronization is bad if you don't understand it and
> use it
> incorrectly ;)
>
> cheers
> stefan
>
> >
> > 1. JDBC connection details should not be in the repository.xml.  I
> > should be free to change the specifics of a particular database
> > connection without it constituting a repository initialization parameter
> > change (which is effectively what changing the repository.xml is, since
> > it gets copied and subsequently accessed from inside the repository
> > itself).  If a host name or driver class or even connection URL changes,
> > I should not have to manually edit internal repository configuration
> > files to effect the change.
> > 2. Sharing JDBC connections (and objects obtained from them, like
> > prepared statements) between multiple threads is not a good practice.
> > Even though many drivers support such activity, it is not specifically
> > required by the spec, and many drivers do not support it.  Even for ones
> > that do, there is always a significant list of caveats (like changing
> > the transaction isolation of a connection impacting all threads, or
> > rollbacks sometimes being executed against the wrong thread).  Plus, as
> > far as I can tell, there is no particularly good reason to attempt this
> > in this case.  Connection pooling is extremely well understood and there
> > are a variety of implementations to choose from (including Apache's own
> > in Jakarta Commons).
> > 3. Synchronization is bad (generally speaking, of course :).  Once the
> > multithreaded issues of JDBC are removed (via a connection pool), there
> > are no good reasons that I can see to have any synchronization in the
> > database persistence managers.  Since any requirement for synchronized
> > operation would be coming from a higher layer, it should also be
> > provided at a higher layer.  I have always felt that a good rule of
> > thumb in server code is to avoid synchronization at all costs,
> > particularly in core server functionality (like reading and writing to
> > a repository).  It is extremely difficult to fully understand the global
> > implications of synchronized code, particularly code that is
> > synchronized at a low level.  Any serialization of user requests is
> > extremely serious in a multithreaded server and, in my experience, will
> > lead to show-stopping performance and scalability issues in nearly all
> > cases.  In addition, serialization of requests at such a low level
> > probably means that other synchronized code that is intended to be
> > properly multithreaded is not well tested, since the request
> > serialization has eliminated (or greatly reduced) the possibility of
> > the code being reentered like it normally would be.
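For illustration only, the borrow/return discipline a pooled-connection PM would follow can be sketched with a toy fixed-size pool (a real deployment would use an existing pool such as the one in Jakarta Commons rather than this):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

// Minimal fixed-size object pool, sketched to show the borrow/return
// discipline a pooled-connection persistence manager would follow.
// This is a toy for illustration, not a production pool.
public class SimplePool<T> {
    private final BlockingQueue<T> idle;

    public SimplePool(int size, Supplier<T> factory) {
        idle = new LinkedBlockingQueue<>();
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // eagerly create `size` resources
        }
    }

    // Blocks until a resource is free; each thread gets exclusive use,
    // so no cross-thread sharing of a single connection is needed.
    public T borrow() throws InterruptedException {
        return idle.take();
    }

    // Returning makes the resource available to other threads again.
    public void release(T resource) {
        idle.add(resource);
    }

    public int available() {
        return idle.size();
    }
}
```

A load()/store() implementation would then borrow() at the top of the method and release() in a finally block, instead of synchronizing every caller on one shared Connection.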
> >
> > The solution to all of these issues, maybe, is to use the standard JDBC
> > DataSource interface to encapsulate the details of managing the JDBC
> > connections.  If all of the current PM and FS implementations that use
> > JDBC were refactored to have a DataSource member and to get and release
> > connections inside of each method, then parity with the existing
> > implementations could be achieved by providing a default
> > DataSourceLookup strategy implementation that simply encapsulated the
> > existing connection creation code (ignoring connection release
> > requests).  This would allow us (and others) to externally extend the
> > implementation with alternative DataSourceLookup strategy
> > implementations, say for accessing a datasource from JNDI, or getting
> > it from a Spring application context.  This solution also neatly
> > externalizes all of the details of the actual datasource configuration
> > from the repository.xml.
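A rough sketch of the lookup strategy described above (all class and interface names here are invented, and javax.sql.DataSource is replaced by a one-method stand-in so the sketch stays self-contained):

```java
// Sketch of the pluggable lookup strategy described above. All names are
// invented for illustration; a real version would return javax.sql.DataSource.
interface DataSourceLookup {
    ConnectionFactory getDataSource();
}

// Stand-in for javax.sql.DataSource so the sketch is self-contained.
interface ConnectionFactory {
    String describe(); // where connections would come from
}

// Default strategy: wraps the existing repository.xml-driven connection
// creation, so current configurations keep working unchanged.
class ConfiguredLookup implements DataSourceLookup {
    private final String url;
    ConfiguredLookup(String url) { this.url = url; }
    public ConnectionFactory getDataSource() {
        return () -> "DriverManager connections to " + url;
    }
}

// Alternative strategy: resolves the datasource by JNDI name, keeping the
// JDBC details entirely outside the repository configuration.
class JndiLookup implements DataSourceLookup {
    private final String jndiName;
    JndiLookup(String jndiName) { this.jndiName = jndiName; }
    public ConnectionFactory getDataSource() {
        // A real implementation would do something like:
        //   (DataSource) new InitialContext().lookup(jndiName);
        return () -> "JNDI datasource " + jndiName;
    }
}
```

A PM would hold only a DataSourceLookup; which strategy is plugged in (config-driven, JNDI, Spring) then becomes a deployment decision rather than a repository.xml edit.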
> >
> > Thanks!
> > Bryan.
> >
> >
> >
> > On 3/8/07 2:48 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com>
> > wrote:
> >
> > > On 3/7/07, Bryan Davis <brdavis@bea.com> wrote:
> > >> Well, serializing on the prepared statement is still fairly serialized
> > >> since we are really only talking about nodes and properties (two locks
> > >> instead of one).  If concurrency is controlled at a higher level then
> > >> why is synchronization in the PM necessary?
> > >
> > > a PM's implementation should be thread-safe because it might be used
> > > in another context or e.g. by a tool.
> > >
> > >>
> > >> The code now seems to assume that the connection object is
> > >> thread-safe (and the specifics of thread-safety of connection objects
> > >> and other objects derived from them are pretty much up to the
> > >> driver).  This is one of the reasons why connection pooling is used
> > >> pretty much universally.
> > >>
> > >> If the built-in PMs used data sources instead of connections then the
> > >> connection settings could be more easily externalized (as these are
> > >> typically configurable by the end user).  Is there any way to
> > >> externalize the JDBC connection settings from repository.xml right
> > >> now (in 1.2.2) and configure them at runtime?
> > >
> > > i strongly disagree. the pm configuration is *not* supposed to be
> > > configurable by the end user and certainly not at runtime. do you
> > > think that e.g. the tablespace settings (physical datafile paths etc.)
> > > of an oracle db should be user configurable? i hope not...
> > >
> > >>
> > >> You didn't really answer my question about Jackrabbit and its ability
> > >> to fetch and store information through the PM concurrently... What is
> > >> the synchronization at the higher level and how does it work?
> > >
> > > the current synchronization is used to guarantee data consistency
> > > (such as referential integrity).
> > >
> > > have a look at o.a.j.core.state.SharedItemStateManager#Update.begin()
> > > and you'll get the idea.
> > >
> > >>
> > >> Finally, we are seeing a new issue where if a particular user uploads
> > >> a large document, all other users start to get exceptions (doing a
> > >> normal mix of mostly reads/some writes).  If there is no way to do
> > >> concurrent writes to the PM, I don't see any way around this problem
> > >> (and it is pretty serious for us).
> > >
> > > there's a related improvement issue:
> > > https://issues.apache.org/jira/browse/JCR-314
> > >
> > > please feel free to comment on this issue or file a new issue if you
> > > think that it doesn't cover your use case.
> > >
> > > cheers
> > > stefan
> > >
> > >>
> > >> Bryan.
> > >>
> > >>
> > >> On 3/6/07 4:12 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com>
> > >> wrote:
> > >>
> > >>> On 3/5/07, Bryan Davis <brdavis@bea.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <stefan.guggisberg@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> hi bryan
> > >>>>>
> > >>>>> On 3/2/07, Bryan Davis <brdavis@bea.com> wrote:
> > >>>>>> What persistence manager are you using?
> > >>>>>>
> > >>>>>> Our tests indicate that the stock persistence managers are a
> > >>>>>> significant bottleneck for both writes and also initial reads to
> > >>>>>> load the transient store (on the order of .5 seconds per node when
> > >>>>>> using a remote database like MSSQL or Oracle).
> > >>>>>
> > >>>>> what do you mean by "load the transient store"?
> > >>>>>
> > >>>>>>
> > >>>>>> The stock db persistence managers have all methods marked as
> > >>>>>> "synchronized", which blocks on the classdef (which means that
> > >>>>>> even different persistence managers for different workspaces will
> > >>>>>> serialize all load, exists and store
> > >>>>>
> > >>>>> assuming you're talking about DatabasePersistenceManager:
> > >>>>> the store/destroy methods are 'synchronized' on the instance, not
> > >>>>> on the 'classdef'. see e.g.
> > >>>>> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.html
> > >>>>>
> > >>>>> the load/exists methods are synchronized on the specific prepared
> > >>>>> stmt they're using.
> > >>>>>
> > >>>>> since every workspace uses its own persistence manager instance i
> > >>>>> can't follow your conclusion that all load, exists and store
> > >>>>> operations would be globally serialized across all workspaces.
> > >>>>
> > >>>> Hm, this is my bad... It does seem that sync methods are on the
> > >>>> instance.  Since the db persistence manager has "synchronized" on
> > >>>> load, store and exists, though, this would still serialize all of
> > >>>> these operations for a particular workspace.
> > >>>
> > >>> ?? the load methods are *not* synchronized. they contain a section
> > >>> which is synchronized on the particular prepared stmt.
> > >>>
> > >>> <quote from my previous reply>
> > >>> wrt synchronization:
> > >>> concurrency is controlled outside the persistence manager on a
> > >>> higher level. eliminating the method synchronization would imo
> > >>> therefore have *no* impact on concurrency/performance.
> > >>> </quote>
> > >>>
> > >>> cheers
> > >>> stefan
> > >>>
> > >>>>
> > >>>>>> operations).  Presumably this is because they allocate a JDBC
> > >>>>>> connection at startup and use it throughout, and the connection
> > >>>>>> object is not multithreaded.
> > >>>>>
> > >>>>> what leads you to this assumption?
> > >>>>
> > >>>> Are there other requirements that all of these operations are
> > >>>> serialized for a particular PM instance?  This seems like a pretty
> > >>>> serious bottleneck (and, in fact, is a pretty serious bottleneck
> > >>>> when the database is remote from the repository).
> > >>>>
> > >>>>>>
> > >>>>>> This problem isn't as noticeable when you are using embedded
> > >>>>>> Derby and reading/writing to the file system, but when you are
> > >>>>>> doing a network operation to a database server, the network
> > >>>>>> latency in combination with the serialization of all database
> > >>>>>> operations results in a significant performance degradation.
> > >>>>>
> > >>>>> again: serialization of 'all' database operations?
> > >>>>
> > >>>> The distinction between all and all-for-a-workspace would really
> > >>>> only be relevant during versioning, right?
> > >>>>
> > >>>>>>
> > >>>>>> The new bundle persistence manager (which isn't yet in SVN)
> > >>>>>> improves things dramatically since it inlines properties into the
> > >>>>>> node, so loading or persisting a node is only one operation (plus
> > >>>>>> the additional connection for the LOB) instead of one for the node
> > >>>>>> and one for each property.  The bundle persistence manager also
> > >>>>>> uses prepared statements and keeps a PM-level cache of nodes (with
> > >>>>>> properties) and also non-existent nodes (which permits many
> > >>>>>> exists() calls to return without accessing the database).
> > >>>>>>
> > >>>>>> Changing all db persistence managers to use a datasource and get
> > >>>>>> and release connections inside of load, exists and store
> > >>>>>> operations, and eliminating the method synchronization, is a
> > >>>>>> relatively simple change that further improves performance for
> > >>>>>> connecting to database servers.
> > >>>>>
> > >>>>> the use of datasources, connection pools and the like have been
> > >>>>> discussed in extenso on the list. see e.g.
> > >>>>> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.html
> > >>>>> http://issues.apache.org/jira/browse/JCR-313
> > >>>>>
> > >>>>> i don't see how getting & releasing connections in every load,
> > >>>>> exists and store call would improve performance. could you please
> > >>>>> elaborate?
> > >>>>>
> > >>>>> please note that you wouldn't be able to use prepared statements
> > >>>>> over multiple load, store etc. operations because you'd have to
> > >>>>> return the connection at the end of every call. the performance
> > >>>>> might therefore be even worse.
> > >>>>>
> > >>>>> further note that write operations must occur within a single jdbc
> > >>>>> transaction, i.e. you can't get a new connection for every
> > >>>>> store/destroy operation.
> > >>>>>
> > >>>>> wrt synchronization:
> > >>>>> concurrency is controlled outside the persistence manager on a
> > >>>>> higher level. eliminating the method synchronization would imo
> > >>>>> therefore have *no* impact on concurrency/performance.
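The transaction-bracket contract Stefan describes can be sketched like this (a java.lang.reflect.Proxy stands in for a real JDBC driver so the sketch runs without a database; the class and method names are invented):

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

// Sketch of the "transaction bracket" contract: every store/destroy in
// one changelog must run on the same connection, with autocommit off and
// a single commit at the end. A dynamic proxy stands in for a real JDBC
// driver so the sketch runs without a database.
public class TransactionBracketDemo {
    // Record of JDBC calls, so we can see the bracket shape.
    public static final List<String> calls = new ArrayList<>();

    public static Connection fakeConnection() {
        return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[] { Connection.class },
                (proxy, method, args) -> {
                    calls.add(method.getName());
                    // Harmless defaults for primitive-returning methods.
                    Class<?> r = method.getReturnType();
                    if (r == boolean.class) return false;
                    if (r == int.class) return 0;
                    return null;
                });
    }

    // All writes in one update go through one connection, one transaction.
    public static void storeChangeLog(Connection con, String... itemIds) throws Exception {
        con.setAutoCommit(false);       // open the bracket
        try {
            for (String id : itemIds) {
                con.prepareStatement("update ITEM set DATA=? where ID='" + id + "'");
            }
            con.commit();               // close the bracket atomically
        } catch (Exception e) {
            con.rollback();             // all-or-nothing on failure
            throw e;
        }
    }
}
```

This is also why a naive connection-per-store pool (as in the DataSourcePersistenceManager mentioned below) breaks the contract: each store would land in its own transaction.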
> > >>>>
> > >>>> So you are saying that it is impossible to concurrently load or
> > >>>> store data in Jackrabbit?
> > >>>>
> > >>>>>> There is a persistence manager with an ASL license called
> > >>>>>> "DataSourcePersistenceManager" which seems to be the PM of choice
> > >>>>>> for people using Magnolia (which is backed by Jackrabbit).  It
> > >>>>>> also uses prepared statements and eliminates the current
> > >>>>>> single-connection issues associated with all of the stock db PMs.
> > >>>>>> It doesn't seem to have been submitted back to the Jackrabbit
> > >>>>>> project.  If you Google for
> > >>>>>> "com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager"
> > >>>>>> you should be able to find it.
> > >>>>>
> > >>>>> thanks for the hint. i am aware of this pm and i had a look at it
> > >>>>> a couple of months ago. the major issue was that it didn't
> > >>>>> implement the correct/required semantics. it used a new connection
> > >>>>> for every write operation, which clearly violates the contract that
> > >>>>> the write operations should occur within a jdbc transaction
> > >>>>> bracket. further it creates a prepared stmt on every load, store
> > >>>>> etc. which is hardly efficient...
> > >>>>
> > >>>> Yes, this PM does have this issue.  The bundle PM implements
> > >>>> prepared statements in the correct way.
> > >>>>
> > >>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do
> > >>>>>> not need to use the Oracle-specific PMs because the 10g drivers
> > >>>>>> support the standard BLOB API (in addition to the Oracle-specific
> > >>>>>> BLOB API required by the older 9i drivers).  This is true even if
> > >>>>>> you are connecting to an older database server, as the limitation
> > >>>>>> was in the driver itself.  Frankly, you should never use the 9i
> > >>>>>> drivers, as they are pretty buggy and the 10g drivers represent a
> > >>>>>> complete rewrite.  Make sure you use the new driver package,
> > >>>>>> because the 10g driver JAR also includes the older 9i drivers for
> > >>>>>> backward compatibility.  The new driver is in a new package (can't
> > >>>>>> remember the exact name off the top of my head).
> > >>>>>
> > >>>>> thanks for the information.
> > >>>>>
> > >>>>> cheers
> > >>>>> stefan
> > >>>>
> > >>>> We are very interested in getting a good understanding of the
> > >>>> specifics of how PMs work, as initial reads and writes, according
> > >>>> to our profiling, are spending 80-90% of the time inside the PM.
> > >>>>
> > >>>> Bryan.
> > >>>>
> > >>>>
> > >>>> _______________________________________________________________________
> > >>>> Notice:  This email message, together with any attachments, may
> > >>>> contain information of BEA Systems, Inc., its subsidiaries and
> > >>>> affiliated entities, that may be confidential, proprietary,
> > >>>> copyrighted and/or legally privileged, and is intended solely for
> > >>>> the use of the individual or entity named in this message. If you
> > >>>> are not the intended recipient, and have received this message in
> > >>>> error, please immediately return this by email and then delete it.
> > >>
> > >>
> > >>
> >
> >
>
