cocoon-dev mailing list archives

From Stefano Mazzocchi
Subject Re: [PROPOSAL] Cocoon Science Fiction
Date Mon, 10 Feb 2003 12:01:47 GMT
Andreas Hochsteger wrote:
> Hi Cocooners!
> Sorry for this (very) long proposal below, but I think it's definitely worth a 
> read. If not, at least you can give me some feedback about your opinion ;-)


Thanks for taking the time to write this. It is very much appreciated.
See my personal comments inline. NOTE: they are 'personal' comments and
must be treated as such; they never represent the Cocoon development
community, only my personal vision of things.


> I have to say that this proposal is intended for open-minded people only, 
> which aren't afraid to take a look beyond the limits. 

I think I can state I'm not afraid to look beyond limits, especially my
own, especially those I can't see until others point them out to me. At
the same time, I prefer not to turn off my 'critical mode' while I do
so. Please don't misinterpret this as fear of going forward, but as
caution in doing so.


> 3 Introduction
> ==============
> I like the Cocoon pipeline processing concept very much.
> I like it so much, that I think it is a pity to limit it only to XML 
> processing (although I agree, that this is the most interesting 
> application).

These two sentences are antithetical and/or imprecise.

The Cocoon pipeline model is different from the more general
Pipe&Filters design pattern because it deals with structured data,
unlike P&F, which deals with non-structured data.

The Cocoon pipeline is *not* literally limited to XML. It is entirely
possible to have not-well-formed XML content flow into the pipeline
(even if this is avoided as a general pattern).

It is correct to say that cocoon pipelines are limited to SAX events and 
SAX events are a particular kind of structured data.

With these corrections, you are basically stating that restricting
pipelines to a particular type of structured data is limiting.

While I understand your concept, I strongly disagree: SAX provides a 
multidimensional structured data space which is suitable for *any* kind 
of data structure.

True, maybe not as efficiently as other formats, but removing a fixed
contract between pipeline components will require a pluggable and
metadata-driven parsing/serialization stage between each pair of
components.

I don't see any value in this compared to the current approach of SAX
adaptation of external data to the internal model.
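
A minimal sketch of what "SAX adaptation of external data" can look
like, using Python's standard xml.sax module rather than Cocoon's Java
API. The CSV input, the element names (rows, row, cell) and the helper
names csv_to_sax and EventRecorder are assumptions made up for
illustration. The point is only that a non-XML source is turned into SAX
events once, at the edge, and everything downstream keeps the same
contract.

```python
import csv
import io
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


def csv_to_sax(text, handler):
    """Emit SAX events for CSV text: <rows><row><cell>...</cell></row></rows>."""
    no_attrs = AttributesImpl({})
    handler.startDocument()
    handler.startElement("rows", no_attrs)
    for record in csv.reader(io.StringIO(text)):
        handler.startElement("row", no_attrs)
        for value in record:
            handler.startElement("cell", no_attrs)
            handler.characters(value)
            handler.endElement("cell")
        handler.endElement("row")
    handler.endElement("rows")
    handler.endDocument()


class EventRecorder(ContentHandler):
    """Hypothetical downstream component: it sees SAX events, never CSV bytes."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


recorder = EventRecorder()
csv_to_sax("a,b\n", recorder)
```

Any other handler written against the same event interface could replace
EventRecorder without knowing that the data started life as CSV.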

> I'm sure some of you wanted to be able to build applications the same way 
> Unix shell pipes work. Cocoon was a big step in this direction, but it was 
> only applicable for processing XML data. 

*only XML* is misleading. *based on SAX* is the right phrase. I've never
perceived this as a limitation, but as a paradigm shift.

Topologically speaking, the solution space is rotated, but its size is
not reduced.

> There are so many cases where 
> pipeline processing of data (no matter if it is XML, plain text or binary 
> data) is done today but we are lacking a generic and declarative way to unify 
> these processing steps. Cocoon is best suited for this task through its 
> clean and easy to understand yet powerful pipeline concept.

If you want to create pipelines for general data, why use Cocoon? Just
use UNIX pipes, servlet filters, Apache 2.0 modules, or any type of
'byte-oriented' (thus unstructured data) pipe&filters modules.
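
For contrast, a byte-oriented pipe&filters chain in the UNIX style can be
sketched in a few lines of Python. The filter names (strip_cr, upper) and
the pipe helper are hypothetical, chosen only for illustration. Each
filter sees chunks of bytes and knows nothing about their structure.

```python
from functools import reduce


def strip_cr(chunks):
    """Byte filter: drop carriage returns from every chunk."""
    for chunk in chunks:
        yield chunk.replace(b"\r", b"")


def upper(chunks):
    """Byte filter: upper-case ASCII letters in every chunk."""
    for chunk in chunks:
        yield chunk.upper()


def pipe(source, *filters):
    """Chain filters left to right, the way a shell pipe does."""
    return reduce(lambda stream, f: f(stream), filters, source)


result = b"".join(pipe([b"hello\r\n", b"world\r\n"], strip_cr, upper))
```

Nothing here can ask for "the title element" or "the svg: namespace". The
only contract between filters is a stream of bytes.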

If you remove the structure from the data that flows through the
pipeline, Cocoon will no longer be Cocoon. This is not evolution, it is
extinction.

> 4 Pipeline Types
> ================
> I tried to design several pipelines variants but after thinking a while they 
> all were still too limited for the way I wanted them to work.
> So here's another try by giving some hypotheses first:
> 1. A pipeline can produce data
> 2. A pipeline can consume data
> 3. A pipeline can convert data
> 4. A pipeline can filter data
> 5. A pipeline can accept a certain data format as input
> 6. A pipeline can produce a certain data format as output
> 7. Pipeline components follow the same hypotheses (1-6)
> 8. Only pipeline components with compatible data formats can be arranged next 
> to each other

Ah, here you hint that you don't want to remove data structured-ness in 
the pipeline, just want to add *other* data structures besides SAX events.

Ok, this is worth investigating.

> Based on these hypotheses you can construct pipelines, which just consume 
> data, just produce data, both consume and produce data or even neither 
> consume nor produce data (even this can make sense, as you'll see in section 
> "9.5 Action Pipelines").
> I think these hypotheses are simple enough to understand and flexible enough 
> to base this further proposal on. So let's try ...
> To define a pipeline we need to be able to specify the input and output 
> format.
> We can do this by the help of these two attributes:
>  - input-format="..."
>  - output-format="..."
> They additionally specify the default input format for the first processing 
> component and the default output format for the last processing component.
> Example:
> 	<map:pipeline input-format="format1" output-format="format2">
> 		...
> 	</map:pipeline>
> This pipeline consumes the data format "format1" and produces the data format 
> "format2". Which data formats are possible and how they are specified is 
> shown in the next section.
> 5 Data Formats
> ==============
> With "data format" I mean something like XML, plain text, png, mp3, ...
> I'm not yet really sure here, how we should specify data formats, so I'll try 
> to start with some requirements:
> 1. They should be easy to remember and to specify ;-)
> 2. It should be possible to create derived data formats (-> inheritance)
> 3. It should be possible to specify additional information (e.g. MIME type, 
> DTD/Schema for XML, ...)
> 4. Pipelines which accept a certain data format as input can be fed with 
> derived data formats
> 5. We should not reinvent standards, which are already suited for this task 
> (but I fear, there does not yet exist something suitable)

You are asking for a very abstract parsing grammar. Note, however, that
it is pretty easy to point to examples where these grammars would have
to be so complex that maintaining them would be a nightmare.

Think of a BNF-like grammar that is able to explain concepts like XML 
namespacing or HyTime Architectural Forms.

> To make it easier for us to begin with the task of defining data formats, 
> let's assume, we have three basic data formats called "abstract", "binary" 
> and "text". The format "abstract" will be explained later, but "binary" and 
> "text" should be clear to everyone.

Binary and text are unstructured data streams. You are falling back.

> 5.1 Data Format Definition
> --------------------------
> Here's a try to specify a hierarchy of data formats:
> 	<data:formats>
> 		<!-- #### Super data format #### -->
> 		<!--
> 			The following format is the base for all other formats (-> compare to 
> java.lang.Object)
> 			Although it is called 'any' data format this name is not prepended to the 
> derived data formats 			like this is the case for all 
> 		-->
> 		<data:format name="any" 
> impl="">
> 			<data:param-def name="mime-type" default="application/octet-stream"/>
> 			<data:param-def name="spec" default=""/> <!-- URL to the specification of
> this data format -->
> 		</data:format>
> 		<!-- #### Abstract data formats #### -->
> 		<data:format name="abstract" 
> impl=""/>
> 		<data:format name="image" extends="/abstract" 
> impl="">
> 			<data:param-def name="depth" default=""/>
> 			<data:param-def name="width" default=""/>
> 			<data:param-def name="height" default=""/>
> 		</data:format>
> 		<data:format name="music" extends="/abstract" 
> impl="">
> 			<data:param-def name="channels" default=""/>
> 		</data:format>
> 		<data:format name="sound" extends="/abstract" 
> impl="">
> 			<data:param-def name="samplesize" default=""/>
> 			<data:param-def name="samplerate" default=""/>
> 			<data:param-def name="channels" default=""/>
> 		</data:format>
> 		<!--
> 			Multiple inheritance is used for video, which extends image and sound.
> 			Is there a better way to specify multiple base formats? 		 -->
> 		<data:format name="video" extends="/abstract/image /abstract/sound" 
> impl="">
> 			<data:param-def name="framerate" default=""/>
> 		</data:format>
> 		<data:format name="vector" extends="/abstract" 
> impl="">
> 			<data:param-def name="unit" default=""/>
> 			<data:param-def name="width" default=""/>
> 			<data:param-def name="height" default=""/>
> 		</data:format>
> 		<data:format name="3d" extends="/abstract/vector" 
> impl="">
> 			<data:param-def name="depth" default=""/>
> 		</data:format>
> 		<!-- #### Binary based data formats #### -->
> 		<data:format name="binary" 
> impl="">
> 			<data:param-def name="endian" default="little"/>
> 		</data:format>
> 		<!-- MS OLE based data formats -->
> 		<data:format name="ole" extends="/binary" 
> impl=""/>
> 		<data:format name="msword" extends="/binary/ole" 
> impl=""/>
> 		<data:format name="msexcel" extends="/binary/ole" 
> impl=""/>
> 		<!-- Linux ELF based data formats -->
> 		<data:format name="elf" extends="/binary" 
> impl="">
> 			<data:param-def name="architecture" default="x86"/>
> 		</data:format>
> 		<data:format name="executable" extends="/binary/elf" 
> impl=""/>
> 		<data:format name="shared" extends="/binary/elf" 
> impl=""/>
> 		<!-- #### Text based data formats #### -->
> 		<data:format name="text" 
> impl="">
> 			<data:param-def name="encoding" default="UTF-8"/>
> 			<data:parameter name="mime-type" value="text/plain"/>
> 		</data:format>
> 		<data:format name="xml" extends="/text" 
> impl="">
> 			<!-- this handler deals with SAX events inside the pipeline -->
> 			<data:param-def name="schema-type" default="xsd"/> <!-- other possible 
> values: dtd, rng, schematron, ... -->
> 			<data:param-def name="schema" default=""/>
> 			<data:parameter name="mime-type" value="text/xml"/>
> 		</data:format>
> 		<data:format name="xhtml" extends="/text/xml" 
> impl="">
> 			<data:parameter name="mime-type" value="text/html"/>
> 			<data:parameter name="schema" 
> value=""/>
> 		</data:format>
> 	</data:formats>
> It's just a first sketch, but I think you got the idea.
> Above you can see the super data format 'any', some abstract, text and binary 
> data formats, which show you how to specify inherited data formats. If no 
> extends="..." attribute is given, it is automatically derived from the data 
> format 'any'.
> References to data formats are done by using a path which specifies the 
> respective data format. This path is built by appending the specified data 
> format name to the path of the parent data format, separated by a slash. The 
> super data format is an exception to this rule and is just called 'any'. It 
> is not part of the path for derived data formats to make them shorter. It is 
> possible to use relative data format paths too. E.g. a pipeline consumes 
> /text/xml, a converter generates XHTML from it and thus can use 
> output-format="xhtml" instead of output-format="/text/xml/xhtml". The name 
> 'any' is reserved only for the super data format and it is not allowed to 
> name derived data formats after it.
> 'none' is another reserved name which is used, if a pipeline does not consume 
> data (input-format="none") or produce data (output-format="none"). It is the 
> default for all pipelines, if it is not overwritten by pipelines or their 
> components.
> The examples from above can be used by using the following strings for 
> specifying data formats:
>  - any
>  - /abstract/image
>  - /abstract/music
>  - /abstract/sound
>  - /abstract/video
>  - /abstract/vector
>  - /abstract/vector/3d
>  - /binary
>  - /binary/ole
>  - /binary/ole/msword
>  - /binary/ole/msexcel
>  - /binary/elf
>  - /binary/elf/executable
>  - /binary/elf/shared
>  - /text
>  - /text/xml
>  - /text/xml/xhtml
> See section "16.1 Data Formats" for more examples.
> One enhancement of this scheme might be useful: Specification of version 
> numbers or format variants.
> One way might be to append the version number to the end separated by a slash, 
> but I think this will mix different concerns. My suggestion would be to 
> specify them by appending the version information in brackets as the 
> following shows:
>  - /text/xml/xhtml[1.0]
>  - /text/xml/xhtml[1.1]
> Instead of:
>  - /text/xml/xhtml/1.0
>  - /text/xml/xhtml/1.1
> 5.2 Inheritance
> ---------------
> A pipeline which consumes a certain data format can be fed with derived data 
> formats too.
> Take the following pipeline as example:
> 	<map:pipeline input-format="/text/xml">
> 		...
> 	</map:pipeline>
> This pipeline would consume the data format "/text/xml/xhtml" without 
> problems, but leads to an exception if you feed it with the data format 
> "/text".
> 5.3 A word about MIME Types
> ---------------------------
> If you ask me, why don't I use the standardized MIME types (see [2]) to 
> specify data formats, I can give you the following reasons:
> MIME types fulfill the requirements from above just partly. They just support 
> two levels of classification and they are purpose-oriented. The data formats 
> I suggest are therefore content-oriented (/text/xml/svg vs. image/svg-xml). 
> So both serve different purposes.
> I know the importance of supporting the MIME type standard, and so the 
> parameter 'mime-type' is part of the super data format 'any' and thus is 
> available for every other data format too. By specifying a certain data 
> format, you always have a MIME type associated, in the worst case the MIME 
> type from the super data format 'any' (application/octet-stream) is used.

From what I see so far, you are describing nothing different (from an
architectural point of view) from what we already have.
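
For reference, the inheritance rule from the quoted section 5.2 (a
pipeline accepting /text/xml can be fed /text/xml/xhtml but not /text)
can be sketched as a prefix check on format paths. This is only an
illustration under assumptions: accepts is a hypothetical helper, not
part of Cocoon or of the proposal, and it ignores relative paths,
version suffixes and the multiple inheritance used for video.

```python
def accepts(declared, offered):
    """Return True if `offered` is `declared` itself or derived from it."""
    if declared == "any":
        # The super data format matches every other format.
        return True
    declared_parts = declared.strip("/").split("/")
    offered_parts = offered.strip("/").split("/")
    # A derived format extends the path of its parent format.
    return offered_parts[:len(declared_parts)] == declared_parts
```

With this helper, accepts("/text/xml", "/text/xml/xhtml") holds while
accepts("/text/xml", "/text") does not.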

> 5.4 Data Handlers
> -----------------
> I'm not very sure, what the data handlers actually do, but I can think of 
> either defining an interface, which must be implemented by the pipeline 
> components which operate with a certain data format (do we need two handlers 
> here: input-handler and output-handler?) or they are concrete components 
> which can be used by the pipeline components to consume or produce this data 
> format. I think some discussion on this topic might not be bad.

Here you hit the nerve.

If you plan on having a different data-handling interface for each
data-type (or data-type family), the permutation of components will kill

> 5.5 Data Format Determination
> -----------------------------
> In many cases, I've written the input- and output-format along with the 
> pipeline components, but it is also possible to specify them in the 
> <map:components/> section or implicitly by implementing a certain component 
> interface and therefore omitting it in the pipeline.
> Here's a suggested order of data format determination:
> 1. Input-/output-Format specified directly with a pipeline component
> 	<map:produce type="uri" ref="docs/file.xml" output-format="/text/xml"/>
> 2. Input-/output-Format specified by the component declaration
> 	<map:filters>
> 		<map:filter name="prettyxml" input-format="/text/xml" 
> output-format="/text/xml" ... />
> 	</map:filters>
> 3. Output-/input-Format specified by the previous or following pipeline 
> component
> 	<map:produce type="uri" ref="docs/file.xhtml" 
> output-format="/text/xml/xhtml"/>
> 	<!-- input- and output-format="/text/xml/xhtml" from previous pipeline 
> component -->
> 	<map:filter type="prettyxml"/>
> 4. Input-/output-Format specified directly with a pipeline
> 	<map:pipeline input-format="/text/xml" output-format="/text/xml">
> 		<map:filter type="prettyxml"/>
> 		...
> 	</map:pipeline>
> 5. If nothing from above matches then assume "none".

eheh, I wish it was that easy ;-)

Suppose you have a component that operates on the svg: namespace of a 
SAX stream only, what is the input type?

If data types are monodimensional, the above is feasible, but Cocoon
pipelines are *already* multi-dimensional and the above can't possibly
work (this has been discussed extensively before for pipeline validation).
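
To make the svg: question concrete, here is a small sketch using
Python's standard xml.sax module (not Cocoon's Java API) of a component
that reacts only to elements in the SVG namespace and ignores everything
else in the same stream. The class name SvgElementCollector and the
sample document are assumptions made up for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

SVG_NS = "http://www.w3.org/2000/svg"


class SvgElementCollector(ContentHandler):
    """Hypothetical component that only cares about the svg: namespace."""

    def __init__(self):
        super().__init__()
        self.svg_elements = []
        self.other_elements = []

    def startElementNS(self, name, qname, attrs):
        uri, localname = name  # uri is None for un-namespaced elements
        if uri == SVG_NS:
            self.svg_elements.append(localname)
        else:
            self.other_elements.append(localname)


doc = (
    '<page xmlns:svg="http://www.w3.org/2000/svg">'
    "<title>demo</title>"
    '<svg:svg><svg:rect width="1" height="1"/></svg:svg>'
    "</page>"
)

parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)  # deliver (uri, localname) names
handler = SvgElementCollector()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(doc.encode("utf-8")))
```

The input here is neither "svg" nor "not svg": it is one SAX stream
carrying several namespaces at once, which a single format path such as
/text/xml/svg cannot describe.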

> 6 Pipeline Components
> =====================


Assuming you have several structured pipelines:

  - SAX -> all xml/sgml content
  - output/input streams -> unstructured text/binary
  - OLE -> all OLE-based files (word, excel, blah blah)
  - MPEG -> all MPEG-based framed multimedia (MPEG1/2, mp3)

Why would you want to mix them into the same system?

I mean, if you want to apply structured-pipeline architectures to, say, 
audio editing, you are welcome to do so, but why in hell should Cocoon 
have to deal with this?

You are very close to winning the prize for the FS-award of the year :)

It *would* make sense to add these complexities only if processing 
performed in different realms could be interoperated. But I can't see how.

What does it mean to perform an XSLT transformation on a video stream?

What does it mean to perform audio mixing on an email?

It would not make any sense to add functionality inside Cocoon that does
not belong in the realm of its problem space. It would only dilute the
effort with additional complexity, purely for the sake of flexibility.

> 7 Protocol Independence
> =======================
> Currently Cocoon is tightly bound to certain protocols by running an instance 
> of it in a certain environment (servlet, CLI) and it's not (easily) possible to 
> handle different invocation protocols from the same instance. To abstract the 
> transport protocols (through the use of certain consumers or producers) we 
> already have a good working base. What is missing is binding a protocol to a 
> certain port, but we should not duplicate work here, which is better left to 
> other software like Apache or Tomcat. We just need to find a way (which I'm 
> sure, that already exists somewhere) to serve different ports with different 
> protocols. I think the Servlet specification is general enough to not only 
> support HTTP/HTTPS and can help us here.

The servlet API is bound to the request/response paradigm and implicitly
assumes that the response goes to the same address as the request. This
is not even close to being general enough for protocol abstraction.

> Given the case, that we have solved the port binding issue, we need some 
> abstraction of the transport protocol. What I mean here is that I'd like to 
> use pipelines independent from the way the request has been sent to Cocoon 
> and how it has to be sent back to the client.
> To solve this we need something like a protocol handler, which maps requests 
> from certain protocols to certain pipelines. The mapping itself is a very 
> abstract thing and heavily depends on the used protocol.

This will make cocoon overlap with protocol-handling concerns.

> Let's assume, we even solved the protocol handler issue, I'd like to sketch 
> some possible use cases below, before we continue.
> 7.1 Web Services
> ----------------
> As many of you know there are existing two popular styles to use Web Services: 
> SOAP and REST.
> Both have their own advantages and disadvantages but I'd like to concentrate 
> on SOAP and on its transport protocol independence, because REST-style Web 
> Services are already possible to do with Cocoon.
> SOAP allows us to use any transport protocol to deliver SOAP messages. Mostly 
> HTTP(S) is used therefore, but there are many cases, where you have to use 
> other protocols (like SMTP, FTP, ...).
> Whatever protocol you choose to invoke your Web Services the result should be 
> always the same and the response should be delivered back through (mostly) 
> the same protocol. Here is one of the greatest advantages of the protocol 
> independence.

No, this is not protocol independence. This is transport independence;
you are still dependent on SOAP as a protocol.

> What you want to do now is to implement the Web Service as a bunch of 
> pipelines and let the protocol handler be responsible for invoking the same 
> pipeline no matter which protocol has been used.
> 7.2 Mail Server
> ---------------
> Nothing prevents you from implementing a mail server, which has the possibility to 
> integrate various data sources and to expose its functionality via the 
> traditional protocols (SMTP, POP, IMAP) but also via HTTP, WAP, as Web 
> Service, and whatever you want.
> 7.3 Mailing List Manager
> ------------------------
> Mailing list managers typically provide several functions (subscribe, 
> unsubscribe, deliver mail, suspend, archive, search, ...) and manage a list 
> of subscribed users. Once again, you can write such a service once and expose 
> its functionality through traditional protocols (HTTP, SMTP, ...) but also 
> as Web Service.
> 7.4 What else?
> --------------
> Perhaps you realize that this way you are free to implement every application 
> you want by the use of the easy declarative pipeline processing concept. How 
> to connect your application to the world outside is a separate issue which 
> you can decide later and specify independently of the application.
> 8 Protocol Handler
> ==================

I don't think Cocoon should implement protocol handlers. Cocoon is a
data producer; it should not deal with transport.

We already have enough problems trying to come up with an Environment
that could work with both email and web (which have orthogonal
client/server paradigms). I don't want to further increase the
complexity down this road.


> 11 Converting old sitemaps to new sitemaps
> ==========================================
> Some of you might be interested, if this new concept is flexible enough to 
> provide at least the same functionality as Cocoon does today.

Yes, I agree that the architecture you describe can be seen as an
'extension' of what Cocoon has today, and therefore it is possible to
rewrite current sitemaps in the model you propose.

Yet I fail to see the advantage of doing so, since you don't gain any
functionality in the problem space where Cocoon lives.

> 12 Use Cases
> ============

You provide fancy use cases, but they show me the power of the structured
pipe&filter design pattern; they don't tell me why we should do this in

"Because it's cool" or "because it's doable" are not very good arguments
around here.

> 13 Conclusion
> =============
> You might ask, why should we change so much from Cocoon?


> First I think the new components are much more flexible and at least as easy 
> to understand as the old ones: If you want to produce a data stream you use a 
> producer, if you want to consume it you use a consumer, if you want to 
> convert it you use a converter and if you want to filter it you use a filter. 

That is your personal view and can't stand as an objective argument.

> To control the data flow you can use the <map:branch/> component.
> A possible migration path could be to support both sitemap versions, since the 
> pipeline components either have different names or provide the same 
> functionality. So a new sitemap implementation could be backward compatible 
> to older sitemap versions. This could make the transition for the user as 
> easy as possible.
> Additionally it might be possible to provide a migration script (e.g. via XSL) 
> which reads an old sitemap and converts it to the new format. Since 
> everything from the old sitemap can be expressed in the new sitemap and can 
> be formally translated (see section "11 Converting old sitemaps to new 
> sitemaps") this should not be a big issue.

You don't say *why* we should do this. What do we gain? Why should we do
audio/video processing on the server side? Why should we introduce
components that work on just one pipeline model and can't be shared with

Oh, you definitely win my vote for the FS of the year award :)

Stefano Mazzocchi
    Pluralitas non est ponenda sine necessitate [William of Ockham]
