cocoon-dev mailing list archives

From "Rose, Billy" <wr...@loislaw.com>
Subject RE: [RT] Cocoon 2 and ThreadSafe Pipelines (LONG)
Date Mon, 05 Mar 2001 16:01:02 GMT
Why not create a metadata header that travels through the system as part of
the document in question and gets updated at each stage of the pipe to
reflect its next destination?  This would allow the data to maintain its own
state.

William "Billy" R. Rose
Web Specialist
Loislaw.com, Inc.
www.loislaw.com
www.loislawschool.com
1-800-364-2512x4900
wrose@loislaw.com


-----Original Message-----
From: Berin Loritsch [mailto:bloritsch@apache.org]
Sent: Monday, March 05, 2001 12:25 AM
To: cocoon-dev@xml.apache.org
Subject: [RT] Cocoon 2 and ThreadSafe Pipelines (LONG)


I am borrowing from the style of the "Father of Cocoon", Stefano, and
creating a long, (hopefully) well-thought-out idea of practical importance to
Cocoon.  I took Giacomo's comment in the "Unifying Sitemap Components API"
thread as a challenge:

 "I don't see how you can optimize a SAX pipeline to make it thread safe.
But maybe you have seen something I've overlooked. So, feel free to
explain how you'll do it." - Giacomo

So I did some deep thinking, which gave me some initial questions to help
narrow down what the real problem is.  Unfortunately I won't answer them all,
because the solution I came up with makes some of the questions moot.

1) What goes on in the Pipeline?
2) How can a Thread safely maintain state?
3) Can ThreadLocal variables help?
4) If yes, what are the performance issues?
5) Are HashMaps too slow for repetitive access?
6) Can a Serializer ever be ThreadSafe?
7) Can a Transformer send SAX events to multiple ContentHandlers?

Those seven questions helped me formulate a plan that could work.  In fact,
it is modeled somewhat after the TRAX API.  Before I get into the details of
the solution, let me describe what currently goes on in the ResourcePipeline.

A Pipeline consists of one Generator, 0..n Transformers, and one Serializer.
Alternatively, a Pipeline consists of one Reader.  Because Generators and
Readers can have one entry point (the separation of the "setup" method and
the "read" or "generate" methods is artificial), they can safely be written
in a ThreadSafe manner.  So what separates a Generator from subsequent
stages in the pipeline?  The fact that a Transformer needs to know what the
subsequent stage is, and the fact that the Serializer needs to be assured
that it is using the correct output stream.

OK, what will make a Component ThreadSafe?  In order to qualify for thread
safety, a Component must be reentrant, not maintain global state (there is
a difference between stateless and global state), and manage any internal
resources correctly.

Why are Transformers and Serializers inherently _not_ ThreadSafe?  Because
they must maintain the destination of their SAX events until the entire
document is processed.  Efficient web servers will routinely have Cocoon
process two or more pipelines simultaneously, and through all of that this
one global (for the class) variable must remain constant from the time
"startDocument" is called to the time "endDocument" is called.  Since
serializing access to certain components is not desirable, we must come up
with a way to maintain that state for the transformation.
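
The problem can be reproduced in a few lines.  What follows is only a
minimal sketch, not Cocoon code: PassThroughTransformer, CountingSink and
setContentHandler are hypothetical stand-ins for a Transformer that keeps its
destination in a field and for a Serializer.  It replays, in a single thread,
one interleaving that two request threads could produce when they share the
same Transformer instance.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SharedDestinationDemo {

    /** Counts the startDocument events it receives; stands in for a Serializer. */
    static class CountingSink extends DefaultHandler {
        int documents = 0;

        @Override
        public void startDocument() {
            documents++;
        }
    }

    /** Hypothetical pass-through Transformer that keeps its destination in a field. */
    static class PassThroughTransformer extends DefaultHandler {
        // One field per instance: every pipeline using this instance shares it.
        private ContentHandler next;

        void setContentHandler(ContentHandler next) {
            this.next = next;
        }

        @Override
        public void startDocument() throws SAXException {
            next.startDocument();
        }
    }

    /** Returns {documents seen by sink A, documents seen by sink B}. */
    static int[] interleave() {
        PassThroughTransformer shared = new PassThroughTransformer();
        CountingSink sinkA = new CountingSink();
        CountingSink sinkB = new CountingSink();
        try {
            shared.setContentHandler(sinkA); // pipeline A is set up
            shared.setContentHandler(sinkB); // pipeline B is set up before A has run
            shared.startDocument();          // A's document starts, but lands in sinkB
            shared.startDocument();          // B's document starts
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return new int[] { sinkA.documents, sinkB.documents };
    }

    public static void main(String[] args) {
        int[] counts = interleave();
        System.out.println("sinkA=" + counts[0] + " sinkB=" + counts[1]);
    }
}
```

Sink A never sees its document because the second setup call overwrote the
destination before the first document was processed.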

We *could* bastardize the SAX event model and pass on references to the
pipeline state and pipeline variables--but this is not only messy, it is
too heavy handed.  Besides, it is the Sitemap's responsibility to route
SAX events--not the Components themselves.  So we have to throw out this
idea.

Next, we could use ThreadLocal variables to do the same type of thing.  At
first this has some real appeal, until you try to figure out how in the
world you are going to map n Transformers to ThreadLocal variables.  If
every pipeline followed a 1:1:1 model (each position represents a
Sitemap Component--Generator, Transformer, Serializer--respectively), it
would be very easy.  However, you can have a pipeline of 1:0:1 or 1:20:1
if need be (very inefficient, but possible).

After a quick perusal of the Servlet 2.2 Spec, you will find out that it is
possible for a Servlet Engine to use multiple threads.  In practice most of
them do.  The issue comes when you try to guarantee that for the life of the
Thread the Request and Response objects will remain constant.  Since I am
paranoid when it comes to trusting Servlet Engines to interpret the specs
the same way, or even to comply with them, I don't trust this idea.  Why?
Because every execution will be different.  It will be _very_ difficult to
debug when and if a problem arises.

Lastly, there is the Factory idea that the TRAX API uses with Templates.
Any Sitemap Component that must maintain at least some state (such as
destination), would create a low overhead object to handle the SAX events
and maintain that state.

Wait a minute, you say.  We are pooling the Sitemap Components to reduce
the need for Garbage Collection.  The Factory approach will aggravate the
issue.  This is a good point.  The first step is to see if we can create
ThreadSafe components, and whether some Components that are artificially
forced to be Poolable (like Generator and Serializer) can be made ThreadSafe.
The second step is to see if there is any reason why any Component in the
pipeline should not be able to be made ThreadSafe.  The last step is to
manage the Garbage Collection and resource usage.

The Factory method would work like this:

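// Source side (Generator): is handed the consumer it must feed.
interface SourceSitemapComponent extends Component {
    void setConsumer(EntityResolver resolver, Map objectMap,
                     String source, Parameters param,
                     XMLConsumer consumer);
}

// Destination side (Transformer, Serializer): acts as a factory that
// returns a low-overhead object holding the state for one document.
interface DestinationSitemapComponent extends Component {
    XMLConsumer getConsumer(EntityResolver resolver, Map objectMap,
                            String source, Parameters param);
}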

The problem with the Factory method is that there would have to be an
explicit contract (as opposed to a shared contract) for each Sitemap
Component.

Generator is an XMLProducer.
Transformer is an XMLPipeline (producer and consumer).
Serializer is an XMLConsumer.

The factory method would have to return an XMLConsumer for the Serializer.
The XMLConsumer retrieved from the Transformer would have to be recast to
an XMLPipeline.  The Generator will start execution once its ContentHandler
has been set.

The pipeline would have to be set up backwards:

* SerializerConsumer<-Serializer.getConsumer
* TransformerConsumer2<-Transformer2.getConsumer
* TransformerConsumer2.setHandler(SerializerConsumer)
* TransformerConsumer1<-Transformer1.getConsumer
* TransformerConsumer1.setHandler(TransformerConsumer2)
* Generator.setConsumer(TransformerConsumer1) <<PROCESSING STARTS AUTOMATICALLY>>
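
The backwards setup above can be sketched in plain Java on top of the SAX
ContentHandler interface.  This is an illustration under simplifying
assumptions, not the proposed Cocoon API: the getConsumer and setConsumer
signatures drop the resolver, objectMap, source and param arguments, and the
class names (TextSerializer, PrefixTransformer, OneElementGenerator and their
consumers) are invented for the example.  The shared components hold no
per-request state; only the small consumer objects do.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class FactoryPipelineSketch {

    /** Per-request Serializer state: writes element tags to one output buffer. */
    static class SerializerConsumer extends DefaultHandler {
        private final StringBuilder out;
        SerializerConsumer(StringBuilder out) { this.out = out; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(qName).append('>');
        }
    }

    /** Shareable Serializer: a factory with no per-request fields. */
    static class TextSerializer {
        SerializerConsumer getConsumer(StringBuilder out) { return new SerializerConsumer(out); }
    }

    /** Per-request Transformer state: remembers the destination for one document. */
    static class TransformerConsumer extends DefaultHandler {
        private final String prefix;
        private ContentHandler next;
        TransformerConsumer(String prefix) { this.prefix = prefix; }

        void setHandler(ContentHandler next) { this.next = next; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            next.startElement(uri, local, prefix + qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            next.endElement(uri, local, prefix + qName);
        }
    }

    /** Shareable Transformer: only immutable configuration lives here. */
    static class PrefixTransformer {
        private final String prefix;
        PrefixTransformer(String prefix) { this.prefix = prefix; }
        TransformerConsumer getConsumer() { return new TransformerConsumer(prefix); }
    }

    /** Shareable Generator: the destination is an argument, never a field. */
    static class OneElementGenerator {
        void setConsumer(ContentHandler consumer) throws SAXException {
            consumer.startDocument();
            consumer.startElement("", "doc", "doc", new AttributesImpl());
            consumer.endElement("", "doc", "doc");
            consumer.endDocument();
        }
    }

    /** Wires one pipeline back to front, then lets the Generator run. */
    static String demo() {
        StringBuilder out = new StringBuilder();
        try {
            SerializerConsumer serializerConsumer = new TextSerializer().getConsumer(out);
            TransformerConsumer consumer2 = new PrefixTransformer("b-").getConsumer();
            consumer2.setHandler(serializerConsumer);
            TransformerConsumer consumer1 = new PrefixTransformer("a-").getConsumer();
            consumer1.setHandler(consumer2);
            new OneElementGenerator().setConsumer(consumer1); // processing starts here
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <b-a-doc></b-a-doc>
    }
}
```

Because each request gets its own consumer chain, two threads calling demo()
style wiring against the same Serializer, Transformer and Generator instances
would not overwrite each other's destinations.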

Now, to address the issue of resource management.  After all this work and
added complexity, we have a Factory method that will be executed thousands
of times in its lifetime.  This represents thousands of medium-weight objects
created just to be destroyed.  This calls for pooling of the created
objects.  The question is who is responsible for managing the pool.  If
pooling is employed, it should be done invisibly.  That means that as soon as
the "endDocument" event is fired and handled, the object returns itself to
its pool.
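
One way the invisible pooling might look is sketched below.  It is an
assumption-laden illustration rather than a design for Cocoon: ConsumerPool
and PooledConsumer are hypothetical names, the pool is a bare queue with no
size limit, and the code is written against a current JDK.  The point it
shows is that the consumer hands itself back after "endDocument", so the
caller never has to release anything.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SelfReturningConsumerSketch {

    /** Hypothetical pool owned by the factory component. */
    static class ConsumerPool {
        private final ConcurrentLinkedQueue<PooledConsumer> idle = new ConcurrentLinkedQueue<>();
        int created = 0; // demo counter only; not synchronized

        /** Reuses an idle consumer if there is one, otherwise creates a new one. */
        PooledConsumer acquire(ContentHandler next) {
            PooledConsumer consumer = idle.poll();
            if (consumer == null) {
                consumer = new PooledConsumer(this);
                created++;
            }
            consumer.next = next;
            return consumer;
        }

        void release(PooledConsumer consumer) {
            idle.offer(consumer);
        }
    }

    /** Per-document consumer that returns itself to its pool on endDocument. */
    static class PooledConsumer extends DefaultHandler {
        private final ConsumerPool owner;
        private ContentHandler next;

        PooledConsumer(ConsumerPool owner) {
            this.owner = owner;
        }

        @Override
        public void startDocument() throws SAXException {
            next.startDocument();
        }

        @Override
        public void endDocument() throws SAXException {
            try {
                next.endDocument();
            } finally {
                next = null;         // drop the per-document state
                owner.release(this); // back to the pool, invisibly to the caller
            }
        }
    }

    /** Processes two documents in sequence and reports how many consumers were created. */
    static int createdAfterTwoDocuments() {
        ConsumerPool pool = new ConsumerPool();
        try {
            for (int i = 0; i < 2; i++) {
                PooledConsumer consumer = pool.acquire(new DefaultHandler());
                consumer.startDocument();
                consumer.endDocument();
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return pool.created;
    }

    public static void main(String[] args) {
        System.out.println(createdAfterTwoDocuments()); // prints 1
    }
}
```

The second document reuses the object released by the first, so only one
consumer is ever allocated in this run.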

The long and short of it is this: the SitemapComponents themselves *COULD*
be made ThreadSafe--BUT (and this is a big but) the work involved is
incredible.  What advantage do we have?  Our pooled objects are smaller
(moderate advantage), the Generators and Serializers are able to be fully
ThreadSafe (larger advantage), and the consumables are much lighter weight
objects.  I will answer any rude comments or questions tomorrow.

